SqueakProcessor

Of the many attempts to build Smalltalk Computers, the early ones depended heavily on microcoded architectures. For many years the Dorado was considered the ideal personal computer for Smalltalk, though the fact that it cost over $100 thousand to build and was so hot and loud that it had to be kept in a separate room meant that not many people got to actually use one.

The switch from bipolar to MOS technology for processors made them slower than main memory. In this environment dense instruction encoding no longer was important and simple hardwired architectures in the form of RISC won the day. Clock speeds for the new CMOS processors increased rapidly and soon zoomed past what even the expensive ECL machines had achieved and they just kept going up until recently. Memory density also increased exponentially but its speed is not that much faster than in the old days. So it seems that the time has come to once again consider microcode.

The fourth part of the "blue book" ("Smalltalk-80: The Language and Its Implementation") is a complete description of the Smalltalk virtual machine in the form of a Smalltalk program. Though actually executable, it was written in non object oriented style meant to be easy to translate into Pascal or C (which several people have done over the years) and to be understandable.

Several chapters of the "green book" ("Smalltalk-80: Bits of History, Words of Advice") describe the difference between the specification from the blue book and practical implementations of Smalltalk virtual machines. It is particularly interesting to read Chapter 7 (pages 113 to 126), which describes the Dorado implementation, while looking at the machine description and microcode listing. Some key parts were done in microcode while the rest was implemented in Alto machine language (a variation of the Data General Nova machine language) which was also implemented in microcode. Alternative schemes were discussed.

Dual Image

The "Back to the Future" paper describes the history and implementation of Squeak. There is a mostly complete description of the whole system in Smalltalk itself. What is missing can easily be seen by searching for the senders of #cCode: and friends. Things like the actual code for the sin primitive were not written and either punt to the underlying Smalltalk implementation or depend on a linked C library to provide the actual code.

The system described in the paper has a normal Smalltalk virtual machine and image and inside that you run the Squeak VM code written in Smalltalk which uses one large Array object inside of which is the Squeak image. So there are two VMs and two images. This is used for development and debugging - for normal use the Squeak VM code is translated into C, compiled into a native code standalone application.

The underlying Smalltalk VM and image in the development mode had to be a non Squeak system in the very first bootstrapping pass (obviously) but now they can be Squeak itself. The underlying image can be extremely reduced if it is going to only be used for this application and nothing more. The application can be divided into three parts: the bytecode interpreter, the memory system and the primitives. An interesting variation would be to short circuit the bytecode interpreter so that the underlying VM's interpreter is used instead while keeping the other two parts as they are. This would significantly speed up the simulated Squeak. If this were done on a machine whose native instruction set happens to be Squeak bytecodes, then the simulated Squeak would be fast enough that it could be used as the actual Squeak seen by the user. The underlying Squeak would be hidden away as part of the machine's firmware - the very reduced system image could even be stored mostly in ROM. In theory this system image would need its own VM to be able to run, but in practice a clever scheme could allow it to share the VM inside itself with the user image. Something like

External Image

Microcode

A three stage pipeline overlaps the fetching of a bytecode with the fetching of the microinstruction for the previous one (in parallel with selecting the address of the following microinstruction) and the execution of the one before that. Microinstructions that get additional data from the bytecode stream (either additional bytes or literals) may cause bubbles in the pipeline.

The address of the first microinstruction corresponding to each bytecode is calculated by simply shifting the bytecode left by one bit. This allows the most common cases of instructions that execute in just one or two cycles to avoid a jump. This scheme also avoids an extra stage with a mapping table at the cost of wasting microcode memory.

The data field in the microinstruction is normally used for forming the address of the following microinstruction, but for some kinds of branches this is not needed and can be used as a larger constant (compared to the offset field) for the memory address or for other things (care must be taken not to enable more than one use in single microinstruction). This is an example of a bytecode that takes four cycles to execute:

address	operation	branch	data
00 bytecode 0	controls	unext	constant1
00 bytecode 1	controls	jumph	label1
label1	controls	unext	constant2
label1+1	controls	next	constant3
00 bytecode2 0	...	...	...

The format of the microinstruction is:

field	description
write	0 = memory to stack, 1 = stack to memory
pop	0 = push from memory (or store to memory), 1 = combine top elements (or pop to memory)
object	00 = context, 01 = method, 10 = receiver, 11 = pointer on stack
immediate	0 = offset is encoded, 1 = offset is a 3 bit value
offset	when encoded, 000 = on stack, 001 = data field, 010 = fetch byte, 011 = fetch byte and mask by data field, 100 = nil, 101 = true, 110 = tag for SmallInteger, 111 = no memory access
alu	four bit logic rule, but the constant rules are replaced: 0000 = add, 1111 = sub
shift	0 = no shift, 1 = shift/rotate as indicated by the data field
branch	000 = next (execute following bytecode), 001 = unext (increment uPC), 010 = jumph (data replaces upper 8 bits of uPC), 011 = jumpl (data replaces lowest 8 bits of uPC - jump within page), 100 = dispatch2 (data replaces upper bits of uPC and ir[7:6] replaces the lowest bits), 101 = dispatch3, 110 = jumpz (data replaces lowest 8 bits of uPC if this microinstruction generated a zero result), 111 = jumpnz
data	8 bits normally used for jumps but can also be an offset

Most microinstructions involve the stack and a memory address indicated by object/offset. If the write and pop fields indicate a combine top elements operation then there is no memory access and the 6 bits normally used for object/offset are ignored - it might be interesting to use them to activate extra hardware instead. Memory access can also be disabled with an encoded offset of 111, but that is different in that everything happens exactly as if the access were going to happen but the actual changing of memory is not executed. This allows things to be tested with the conditional jumps.

Everything operates normally when object is the method as a source (used for fetching literals), but when used as a destination the write does not happen and the PC is set to the offset. The only way to write to a method object is to have a pointer to it on the stack or even for it to be the receiver.