Hardware Recompilation

The architecture can be called "Snow White and the Seven Dwarves" because it involves two very different types of processor: normally just one of the first type and a larger number of the second.

Here is what the two types have in common and how they differ:

All processors have local caches and can exchange data with other processors via a high-speed network.
Snow White (SW) processors have a direct connection to external memories, while Dwarf (DW) processors do not and must access such memories indirectly through a SW.
SWs can directly execute Smalltalk bytecodes and produce not only the desired results but also an optimized version of the code they executed.
DWs can only execute fully optimized code. If they ever encounter anything they can't handle (which shouldn't be present in fully optimized code), they abort execution, which must then be continued by a SW.
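
The memory arrangement described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual hardware design: the class names, the dictionary standing in for external memory, and the read method are all assumptions made for the example. It shows a DW filling its local cache by asking a SW, which is the only processor that touches external memory directly.

```python
class SnowWhite:
    """Hypothetical model of a SW: local cache plus direct external memory access."""

    def __init__(self, external_memory):
        self.external_memory = external_memory  # dict standing in for external RAM
        self.cache = {}

    def read(self, address):
        # On a cache miss, go straight to external memory.
        if address not in self.cache:
            self.cache[address] = self.external_memory[address]
        return self.cache[address]


class Dwarf:
    """Hypothetical model of a DW: local cache, no external memory connection."""

    def __init__(self, snow_white):
        self.snow_white = snow_white  # reached over the processor network
        self.cache = {}

    def read(self, address):
        # On a cache miss, the request is forwarded to a SW.
        if address not in self.cache:
            self.cache[address] = self.snow_white.read(address)
        return self.cache[address]


memory = {0x10: 42}
sw = SnowWhite(memory)
dw = Dwarf(sw)
value = dw.read(0x10)
```

After the read, both the DW and the SW hold the value in their caches, since the request passed through the SW on its way to memory.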

Bytecode-to-bytecode translation in hardware

The SW always creates a copy of every piece of code it executes. The copy lives in a separate instruction space that "shadows" the original addresses, so the cache fetches the copy (instead of the original) on the next execution.
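
As a rough software analogy for the shadowing idea, the following Python sketch keeps two instruction spaces indexed by the same addresses and lets a fetch prefer the shadow copy when one exists. The InstructionStore class, its method names, and the toy instruction strings are assumptions for illustration only; the real mechanism lives in the cache hardware.

```python
class InstructionStore:
    """Hypothetical model of an original instruction space plus its shadow."""

    def __init__(self, original):
        # address -> list of instructions, as originally loaded
        self.original = {addr: list(code) for addr, code in original.items()}
        # same addresses, but holding the copies made during execution
        self.shadow = {}

    def fetch(self, address):
        # The shadow copy, if any, hides the original at the same address.
        if address in self.shadow:
            return self.shadow[address]
        return self.original[address]

    def store_copy(self, address, code):
        # Called while the code runs; the original is left untouched.
        self.shadow[address] = list(code)


store = InstructionStore({100: ["push a", "push b", "send +"]})
first_fetch = store.fetch(100)                      # no copy yet: original
store.store_copy(100, ["push b", "push a", "send +"])
second_fetch = store.fetch(100)                     # now the shadow copy
```

Because the shadow space uses the same addresses, nothing that refers to the code has to be updated when a new copy replaces an older one.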

The optimization unit creates the copies, but introduces small changes while doing so. Two instructions might be exchanged, for example, or an instruction that is effectively a NOP but carries information might be inserted. Unlike a software optimizer, this hardware can't look dozens or even hundreds of instructions ahead, so all of its changes are very local. But, as in bubble sort, many passes with local effects can add up to a global effect. The next time a changed copy is executed, a copy of the copy (with even more changes) is made. If a given piece of code is never executed again, then it wasn't very important, and the fact that we won't get a chance to optimize it further isn't a problem. The system automatically concentrates its effort on the "hot spots" of the code.
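
The way repeated local passes accumulate can be shown with a toy example in Python. Everything here is an assumption chosen for illustration: the instructions are made-up strings rather than Smalltalk bytecodes, and the single rewrite rule (drop a "push" immediately followed by a "pop") is hypothetical, not one the hardware is known to apply. Each pass only looks at two adjacent instructions of the code it is copying, yet nested redundancy disappears after a few executions.

```python
def copy_with_local_changes(code):
    """Make one copy of `code`, dropping any adjacent 'push X' / 'pop' pair.

    The pass only ever inspects a two-instruction window of its input.
    """
    out = []
    i = 0
    while i < len(code):
        is_dead_pair = (
            i + 1 < len(code)
            and code[i].startswith("push ")
            and code[i + 1] == "pop"
        )
        if is_dead_pair:
            i += 2          # skip both instructions in the copy
        else:
            out.append(code[i])
            i += 1
    return out


def run_until_stable(code):
    """Repeat the copying pass, once per 'execution', until nothing changes."""
    passes = 0
    while True:
        new_code = copy_with_local_changes(code)
        passes += 1
        if new_code == code:
            return code, passes
        code = new_code


start = ["push a", "push b", "pop", "pop", "push c"]
after_one_pass = copy_with_local_changes(start)
final_code, passes = run_until_stable(start)
```

The first pass removes only the inner pair, because the outer "push a" and "pop" are not yet adjacent. The second pass removes the outer pair, and the third finds nothing left to change.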

When the optimizer can't find anything to change while creating a copy of a code fragment, the fragment is tagged as "fully optimized" (in contrast to just optimized) and from then on it is executed by the DWs. Since the optimizer only works on the code branches seen so far, this tag is really just a hint, and unoptimized code might still be lurking in there. DWs can't handle that, but it isn't a problem: the DW simply gives up, and a SW picks up the code and continues from where the other processor stopped.
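
The hand-over from a DW to a SW can be sketched as follows. This is a minimal model under stated assumptions, not the real protocol: the toy instruction set, the DWARF_OPS set, the DwarfAbort exception, and the "send_double" operation (standing in for something only a SW can execute) are all invented for the example. The point it illustrates is that the "fully optimized" tag is only a hint, and that execution resumes on the SW at the exact position where the DW stopped, with the same state.

```python
DWARF_OPS = {"push", "add"}  # assumed subset a DW can execute


class DwarfAbort(Exception):
    """Raised by the DW model; carries the position where it gave up."""

    def __init__(self, pc):
        super().__init__(f"DW aborted at instruction {pc}")
        self.pc = pc


def step(stack, instruction):
    op, *args = instruction.split()
    if op == "push":
        stack.append(int(args[0]))
    elif op == "add":
        right, left = stack.pop(), stack.pop()
        stack.append(left + right)
    elif op == "send_double":  # stand-in for code left unoptimized
        stack.append(stack.pop() * 2)
    else:
        raise ValueError(f"unknown instruction: {instruction}")


def dwarf_run(code, stack, pc=0):
    while pc < len(code):
        if code[pc].split()[0] not in DWARF_OPS:
            raise DwarfAbort(pc)  # give up without touching the state
        step(stack, code[pc])
        pc += 1


def snow_white_run(code, stack, pc=0):
    while pc < len(code):
        step(stack, code[pc])
        pc += 1


def execute(code, tagged_fully_optimized):
    """Return the final stack and the processors that ran the fragment."""
    stack = []
    if not tagged_fully_optimized:
        snow_white_run(code, stack)
        return stack, ["SW"]
    try:
        dwarf_run(code, stack)
        return stack, ["DW"]
    except DwarfAbort as abort:
        # The tag was only a hint: a SW continues from where the DW stopped.
        snow_white_run(code, stack, abort.pc)
        return stack, ["DW", "SW"]


clean = execute(["push 2", "push 3", "add"], True)
mixed = execute(["push 2", "send_double", "push 1", "add"], True)
```

In the second call the DW executes the first instruction, aborts at "send_double", and the SW finishes the fragment using the stack the DW left behind.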