The Story of Squeak


Performance and Optimization

Thanks to today's fast processors, Squeak's performance was satisfactory from the moment the translator produced its first C translation of the virtual machine. Since this debut, Squeak's performance has improved steadily, and the current version, 1.18, executes about four million byte codes or 173 thousand message sends per second on a 110 MHz Power PC Mac 8100. Table 5 shows the improvement in Squeak's performance over its first year. Two simple benchmarks from the release were used to track the approximate byte code execution rate ("10 benchmark") and the cost of full method activation and return ("26 benchFib"). Note that the latter benchmark measures the worst case; not all message sends require a full activation.

Date byte codes/sec sends/sec
Apr. 14 458K 22,928
May 20 1,111K 60,287
May 23 1,522K 69,319
July 9 2,802K 134,717
Aug. 1 2,726K 130,945
Sept. 23 3,528K 141,155
Nov. 12 3,156K 133,164
Dec. 12 3,410K 169,617
Jan. 21 4,108K 173,360

Table 5: Squeak performance over time

The rapid early leaps in performance were due partly to removal of scaffolding—such as assertion checks and range checks on memory references—and partly to improving the runtime model of the translator. For example, object references were originally represented as offsets relative to the base of the object memory rather than as true direct pointers. After May, however, the easy changes had all been made and improvements came in smaller increments, sometimes only a few percent at a time. The most significant of these optimizations include:

Table 6 compares Squeak's current performance over a small suite of benchmarks with that of several commercial Smalltalk implementations that cover a cross-section of implementation technologies, including a bytecode interpreter similar to the original Smalltalk-80 virtual machine (Apple Smalltalk), an aggressively optimized interpreter (ST/V Mac 1.1), and two implementations using dynamic translation to native code (ParcPlace Smalltalk 2.3 and 2.5). In order to draw meaningful comparisons between Squeak and these 68K-based virtual machines, all timings except those in the last column were taken on a Duo 230 (33Mhz 68030). Since Squeak runs significantly better on modern processors with instruction caches and a generous supply of registers, the final column of the table, SqueakPPC, shows Squeak's performance relative to C on a Power PC-based Macintosh.

  AppleST ST/V PP2.3 PP2.5 Squeak SqueakPPC
IntegerSum 185.00 32.00 7.58 6.92 62.34 72.56
VectorSum 99.00 30.00 10.30 11.50 61.70 41.01
PrimeSieve 53.00 40.00 16.07 12.10 70.53 51.57
BubbleSort 88.23 35.29 21.35 13.98 80.29 63.12
TreeSort 43.90 5.00 20.29 1.98 16.33 7.31
MatrixMult 40.79 6.00 22.80 2.94 18.00 36.74
Recurse 28.26 9.47 3.73 2.08 50.26 35.19

Table 6: Virtual machine performance relative to optimized, platform-native C for various benchmarks. Smaller numbers are better. A result of 1.0 would indicate that a benchmark ran exactly as fast as optimized C.

So far in the design of Squeak, we have emphasized simplicity, portability, and small memory footprint over high performance. Much better performance is possible. The PP2.3 and PP2.5 columns of Table 6 are examples of Deutsch-Schiffman-style dynamic translation (or "JIT") virtual machines [Deut84]. Dynamic translation avoids the overhead of byte code dispatch by translating methods into native instructions kept in a size-bounded cache. The Self project [ChUn91] [Hölz94] broke new ground in high performance by investing more compilation time in heavily used methods, using inlining to eliminate expensive calls and enable further optimizations. This work, which was later extended to Smalltalk and Java [Anim96], shows that one can obtain performance approaching half the speed of optimized C without compromising the semantics of a clean language. Unfortunately both of these approaches have resulted in virtual machine implementations that are, by Squeak standards, unapproachable and difficult to port.

We believe that Squeak can enjoy the same performance as commercial Smalltalk implementations without compromising malleability and portability. In our experience the byte code basis of the Smalltalk-80 standard [Inga78] is hard to beat for compactness and simplicity, and for the programming tools that have grown around it. Therefore dynamic translation is a natural avenue to high performance. The Squeak philosophy implies that both the dynamic translator and its target code sequences should be written and debugged in Smalltalk, then automatically translated into C to build the production virtual machine. By representing translated methods as ordinary Smalltalk objects, experiments with Self-style inlining and other optimizations could be done at the Smalltalk level. This approach is currently being explored as a way to improve Squeak's performance without adversely affecting its portability.