The Story of Squeak

Performance and Optimization

Thanks to today's fast processors, Squeak's performance was satisfactory from the moment the translator produced its first C translation of the virtual machine. Since this debut, Squeak's performance has improved steadily: the current version, 1.18, executes about four million byte codes or 173 thousand message sends per second on a 110 MHz PowerPC Mac 8100. Table 5 shows the improvement in Squeak's performance over its first year. Two simple benchmarks from the release were used to track the approximate byte code execution rate ("10 benchmark") and the cost of full method activation and return ("26 benchFib"). Note that the latter benchmark measures the worst case; not all message sends require a full activation.

Date	byte codes/sec	sends/sec
Apr. 14	458K	22,928
May 20	1,111K	60,287
May 23	1,522K	69,319
July 9	2,802K	134,717
Aug. 1	2,726K	130,945
Sept. 23	3,528K	141,155
Nov. 12	3,156K	133,164
Dec. 12	3,410K	169,617
Jan. 21	4,108K	173,360

Table 5: Squeak performance over time
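
As a rough illustration of what a send benchmark such as "26 benchFib" measures, the following Python sketch assumes the benchmark has the familiar doubly recursive shape in which every activation adds 1 to the result, so the value returned equals the number of calls made. The names bench_fib and sends_per_second and the timing harness are illustrative and are not taken from the Squeak release.

```python
import time

def bench_fib(n):
    """Doubly recursive benchmark: the result equals the number of calls made."""
    if n < 2:
        return 1
    return bench_fib(n - 1) + bench_fib(n - 2) + 1

def sends_per_second(n):
    """Time bench_fib(n) and divide the call count by the elapsed time."""
    start = time.perf_counter()
    calls = bench_fib(n)
    elapsed = time.perf_counter() - start
    return calls / elapsed

if __name__ == "__main__":
    print(f"{sends_per_second(20):,.0f} calls/sec")
```

Because every call performs a full activation and return, a figure obtained this way is a worst-case send rate, in the same sense as the sends/sec column of Table 5.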

The rapid early leaps in performance were due partly to removal of scaffolding—such as assertion checks and range checks on memory references—and partly to improving the runtime model of the translator. For example, object references were originally represented as offsets relative to the base of the object memory rather than as true direct pointers. After May, however, the easy changes had all been made and improvements came in smaller increments, sometimes only a few percent at a time. The most significant of these optimizations include:

recycling method contexts (this cut the allocation rate by a factor of 10)
managing the frequency of checks for user and timer interrupts
keeping the instruction and stack pointers (IP and SP) in registers
making the IP and SP be direct pointers, rather than offsets into their base object
patching the dispatch loop to eliminate an unneeded compiler-generated range check
eliminating store-checks when storing into the active and home contexts
comparing small integers as oops rather than converting them into integers first
peeking for and doing a jump-if-false byte code that follows a compare
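
The last two items in the list can be sketched with a toy stack interpreter. This is a minimal Python illustration under stated assumptions, not Squeak's interpreter: the opcode set, the TRUE_OOP and FALSE_OOP values, and the tagging scheme (small integers encoded with the low bit set) are all hypothetical choices made for the example. The compare opcode orders the tagged values directly, and when the next byte code is a jump-if-false it performs the branch itself instead of pushing a Boolean and dispatching again.

```python
# Hypothetical opcodes for a toy stack machine (not Squeak's byte code set).
PUSH, LESS, JUMP_IF_FALSE, JUMP, HALT = range(5)

# Hypothetical non-integer oops standing in for the true and false objects.
TRUE_OOP, FALSE_OOP = 2, 4

def tag(value):
    """Encode a small integer as an oop (assumed scheme: low bit set)."""
    return (value << 1) | 1

def untag(oop):
    """Recover the integer value from a tagged oop."""
    return oop >> 1

def run(code, peek=True):
    """Execute code; return (untagged top of stack, number of dispatches)."""
    stack, ip, dispatches = [], 0, 0
    while True:
        op = code[ip]
        ip += 1
        dispatches += 1
        if op == PUSH:
            stack.append(tag(code[ip]))
            ip += 1
        elif op == LESS:
            b, a = stack.pop(), stack.pop()
            # Tagging preserves order, so the oops are compared without untagging.
            result = a < b
            if peek and code[ip] == JUMP_IF_FALSE:
                # Consume the following jump here: no Boolean is pushed and
                # one trip through the dispatch loop is saved.
                target = code[ip + 1]
                ip = ip + 2 if result else target
            else:
                stack.append(TRUE_OOP if result else FALSE_OOP)
        elif op == JUMP_IF_FALSE:
            target = code[ip]
            ip += 1
            if stack.pop() == FALSE_OOP:
                ip = target
        elif op == JUMP:
            ip = code[ip]
        elif op == HALT:
            return untag(stack[-1]), dispatches

def conditional(a, b):
    """Byte codes for: a < b ifTrue: [10] ifFalse: [20]."""
    return [PUSH, a, PUSH, b, LESS, JUMP_IF_FALSE, 11,
            PUSH, 10, JUMP, 13, PUSH, 20, HALT]

if __name__ == "__main__":
    print(run(conditional(3, 5), peek=True))
    print(run(conditional(3, 5), peek=False))
```

Both settings of peek compute the same answer; the peeking version simply reaches it with one fewer dispatch per compare-and-branch pair.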

Table 6 compares Squeak's current performance over a small suite of benchmarks with that of several commercial Smalltalk implementations that cover a cross-section of implementation technologies, including a byte code interpreter similar to the original Smalltalk-80 virtual machine (Apple Smalltalk), an aggressively optimized interpreter (ST/V Mac 1.1), and two implementations using dynamic translation to native code (ParcPlace Smalltalk 2.3 and 2.5). To draw meaningful comparisons between Squeak and these 68K-based virtual machines, all timings except those in the last column were taken on a Duo 230 (33 MHz 68030). Since Squeak runs significantly better on modern processors with instruction caches and a generous supply of registers, the final column of the table, SqueakPPC, shows Squeak's performance relative to C on a PowerPC-based Macintosh.

Benchmark	AppleST	ST/V	PP2.3	PP2.5	Squeak	SqueakPPC
IntegerSum	185.00	32.00	7.58	6.92	62.34	72.56
VectorSum	99.00	30.00	10.30	11.50	61.70	41.01
PrimeSieve	53.00	40.00	16.07	12.10	70.53	51.57
BubbleSort	88.23	35.29	21.35	13.98	80.29	63.12
TreeSort	43.90	5.00	20.29	1.98	16.33	7.31
MatrixMult	40.79	6.00	22.80	2.94	18.00	36.74
Recurse	28.26	9.47	3.73	2.08	50.26	35.19

Table 6: Virtual machine performance relative to optimized, platform-native C for various benchmarks. Smaller numbers are better. A result of 1.0 would indicate that a benchmark ran exactly as fast as optimized C.
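
To make the scale of Table 6 concrete, the Python sketch below takes a few entries from the table and derives how many times slower one virtual machine is than another on the same benchmark. Since each entry is a ratio to the same C baseline, dividing two entries in a row cancels the baseline; this only holds for columns timed on the same machine, so SqueakPPC is left out. The dictionary layout and the function name slowdown are illustrative.

```python
# Entries copied from Table 6 (time relative to optimized C; smaller is better).
TABLE6 = {
    "IntegerSum": {"PP2.5": 6.92, "Squeak": 62.34},
    "TreeSort": {"PP2.5": 1.98, "Squeak": 16.33},
}

def slowdown(benchmark, vm, baseline_vm):
    """Return how many times slower vm is than baseline_vm on one benchmark."""
    row = TABLE6[benchmark]
    return row[vm] / row[baseline_vm]

if __name__ == "__main__":
    for name in TABLE6:
        print(f"{name}: Squeak is {slowdown(name, 'Squeak', 'PP2.5'):.1f}x slower than PP2.5")
```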

So far in the design of Squeak, we have emphasized simplicity, portability, and small memory footprint over high performance. Much better performance is possible. The PP2.3 and PP2.5 columns of Table 6 are examples of Deutsch-Schiffman-style dynamic translation (or "JIT") virtual machines [Deut84]. Dynamic translation avoids the overhead of byte code dispatch by translating methods into native instructions kept in a size-bounded cache. The Self project [ChUn91] [Hölz94] broke new ground in high performance by investing more compilation time in heavily used methods, using inlining to eliminate expensive calls and enable further optimizations. This work, which was later extended to Smalltalk and Java [Anim96], shows that one can obtain performance approaching half the speed of optimized C without compromising the semantics of a clean language. Unfortunately, both of these approaches have resulted in virtual machine implementations that are, by Squeak standards, unapproachable and difficult to port.
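
The idea of keeping translated methods in a size-bounded cache can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of any of the virtual machines above: "native code" is stood in for by a Python closure, the two-instruction byte code format is invented for the example, and least-recently-used eviction is an assumed policy (the text says only that the cache is bounded in size).

```python
from collections import OrderedDict

def translate(bytecodes):
    """Turn a list of (op, *args) tuples into a callable; stands in for native code."""
    steps = []
    for op, *args in bytecodes:
        if op == "push":
            steps.append(lambda stack, n=args[0]: stack.append(n))
        elif op == "add":
            steps.append(lambda stack: stack.append(stack.pop() + stack.pop()))
        else:
            raise ValueError(f"unknown byte code: {op}")

    def native():
        stack = []
        for step in steps:
            step(stack)
        return stack.pop()

    return native

class TranslationCache:
    """Size-bounded cache of translated methods (LRU eviction is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.translations = 0  # counts how often a method had to be translated

    def lookup(self, method_id, bytecodes):
        code = self.entries.get(method_id)
        if code is None:
            # Miss: translate on demand, then evict the oldest entry if full.
            code = translate(bytecodes)
            self.translations += 1
            self.entries[method_id] = code
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)
        else:
            # Hit: mark the entry as most recently used.
            self.entries.move_to_end(method_id)
        return code

if __name__ == "__main__":
    cache = TranslationCache(capacity=2)
    three_plus_four = [("push", 3), ("push", 4), ("add",)]
    print(cache.lookup("sum", three_plus_four)())
    print(cache.lookup("sum", three_plus_four)(), cache.translations)
```

A method evicted from the cache is not lost; it is simply translated again from its byte codes the next time it is called, which is what lets the cache stay small.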

We believe that Squeak can enjoy the same performance as commercial Smalltalk implementations without compromising malleability and portability. In our experience the byte code basis of the Smalltalk-80 standard [Inga78] is hard to beat for compactness and simplicity, and for the programming tools that have grown around it. Therefore dynamic translation is a natural avenue to high performance. The Squeak philosophy implies that both the dynamic translator and its target code sequences should be written and debugged in Smalltalk, then automatically translated into C to build the production virtual machine. By representing translated methods as ordinary Smalltalk objects, experiments with Self-style inlining and other optimizations could be done at the Smalltalk level. This approach is currently being explored as a way to improve Squeak's performance without adversely affecting its portability.