View this PageEdit this PageUploads to this PageHistory of this PageHomeRecent ChangesSearchHelp Guide


Several alternative designs were considered for the I/O Processor, including ones vaguely inspired by the ARM and then by the Data General Nova before the current PDP-11/68000 like instruction set was defined. The focus was on having the smallest possible program size, specially when manipulating subsets of bits of i/o ports. Below are other designs which would have simpler implementations at the cost of requiring more instructions to accomplish the same tasks.

Inspired by the Data General Nova

It was surprising to me as I studied less well known microprocessors from the 1970s and early 1980s how popular the Nova instruction set was as a source of inspiration. It was probably second only to the PDP-11 in influence. An early version of the I/O Processor borrowed many elements from the Nova but the design described here is far closer to the original. The fields have been moved around in an attempt to make the system as a whole even more uniform.

15 14 13 12 11_10 9_8 7_6 5_4 3 2 1 0

Bits 14 through 12 determine the function: ADD=1 means that the destination is added to the source, nINV=0 means that the source is bitwise inverted and INC=1 means that the source is incremented by 1. The combination with all three bits set is redundant since the same effect can be achieved using the CARRY field, so it is redefined to mean the bitwise AND between the destination and source. The other combinations are, from 0 to 6 respectively: not (COM), negate (NEG), move (MOV), increment (INC), add inverted (ADC), subtract (SUB) and add (ADD).

The SHFT field encodes: pass through, rotate left (L), rotate right (R) and swap bytes (S). The CARRY field encodes: old carry, clear carry (Z), set carry (O) and invert carry (C). When nSAVE=0 (indicated by a lack of "#" after the instruction name) the result is actually written into destination register. The Test Result bit will cause the next instruction to be skipped if zero is (or would have been) written into the destination. The Test Carry bit is similar but checks if the resulting carry of the operation was zero. Finally, the False bit inverts the sense of the test so that the next instruction is not skipped when the tested condition is true. The non zero combinations of bits 2 through 0 are: SKP, SZC, SNC, SZR, SNR, SEZ and SBN.

15_14 13 12 11_10 9_8 7_6_5_4_3_2_1_0

Here the memory and I/O instructions have been made the same. The IND bit indicates that an indirect access will be made (shown as a "@" at the assembler level). The W bit indicates that REG will be written to memory, otherwise REG is the destination of a read from memory.

Like on the Nova, IDX can indicate the first two registers as the base for the address or it can encode that the base should be zero or the PC. The OFFSET is always a signed 8 bit number.

15_14_13 12 10_11 9_8 7_6_5_4_3_2_1_0

This is the jump and link instruction. The previous value of PC is saved in the indicated register (unless that is AC0, in which case the value is discarded and the instruction is a simple jump) while the effective address is stored in PC.

15_14_13 12 10_11 9_8 7_6_5_4_3_2_1_0

This is the increment (or decrement) and skip if zero instruction. The INC field (as a two bit signed constant) is added to the indicated memory position and if the result is zero then the following instruction is skipped. The possible values of INC are 0, 1, -2 and -1. Adding a 0 doesn't change the value in memory but can be used to test it. Note that this instruction adds complexity to the design (a register, an execution state and an operand format that are not used by any other instruction) and can be replaced by a sequence of five of the other instructions, but it is faster, avoids trashing two visible registers and strongly reflects the spirit of the original Nova so that the costs are almost certainly worth it.

Any additional funcionality (stack operations, for example) can be implemented in the form of special I/O locations.

Inspired by Jan Gray's gr0040

Actually, the main inspiration from the gr0040 was the idea that you could have a three address ADD instruction in an otherwise two address machine. There are 15 registers with register 0 hardcoded to the value zero.

15 14_13_12_11_10_9_8_7_6_5_4_3_2_1_0

When bit 15 is clear the instruction is CALL to the address indicated by the remaining 15 bits and the PC (register 15) is saved in register 14. All other instructions have bit 15 set.

Bit 14 is set for the memory instructions and clear for register instructions.

15_14 13 12 11_10_9_8 7_6_5_4 3_2_1_0

The W bit is set when REG will be written to memory and clear for reads. IMM=0 means that OFFSET is a regiser, otherwise it is an immediate value as described below. IDX is the register with the base address. For immediate values the OFFSET is encoded:

3 2 1_0

Bits 1 and 0 can encode the values 1 through 4. If the value zero is needed then the non immediate format with register zero can be used. INC=1 means that VALUE should be added to IDX, otherwise it is subtracted. SAVE=1 means that after the addition or subtraction the IDX field receives the new value, otherwise it remains as it was.

Note that if a read immediate instruction has IDX=15 and OFFSET=12 then the 16 bits following the instruction will be loaded into the destination register and the PC will be incremented by one, skipping the value that was just loaded.

15_14_13_12 11_10_9_8 7_6_5_4 3_2_1_0
1 0 0 0 SRC2 DEST SRC1

This is the three address ADD instruction. This allows additions (which are the most common non memory instruction) to avoid overwritting one of the operands. For two address math instructions we have:

15_14_13_12 11 10 9 8 7_6_5_4 3_2_1_0

The ADD, nINV and INC bits work exactly as in the previous design. When IMM=1 the SRC is a four bit signed operand instead of indicating a register.

15_14_13_12 11_10_9_8 7_6_5_4 3_2_1_0

Here RULE defines the 16 possible logical results of combining DEST and SRC. Bits 11 and 10 indicate the result when a bit in DEST is set while bits 9 and 8 for when it is clear. In the same way bits 11 and 9 indicate the result when a bit in SRC is set while bits 10 and 8 when it is clear.

15_14_13_12 11_10_9_8 7_6_5_4_3_2_1_0

The BRANCH instructions include an eight bit signed offset to be added to the PC when COND is true. The 16 possible values of COND are the same as on the PDP-11 or the ARM. This implies four condition bits (carry, zero, overflow, negative) which could be saved to or restored from memory when REG=0.

32 bit extension

This design could be changed to use 32 bit registers, but then some of its limitations would become even more obvious. There are no shifts or rotations and memory accesses only deal with words so handling bytes would be very awkward. One non standard solution would be a variation of the tagged pointers idea mentioned as a possible extension for the I/O Processor. But in this case addresses would always point to individual bits (so byte addresses would end with 000 and word addresses with 00000) and the top five bits would indicate the size of what was being addressed (with zero redefined to mean 32 bits). The memory access instructions then would have to do shifting and masking as part of their job and the offsets could be properly scaled. This would give the processor a little of the flavor of the old TMS34010 graphics processors.

32 bit processor inspired by Jan Gray's gr0040

This one was far closer to the original (though the introduction of cascades changed that somewhat) in terms of the various instruction formats and the use of a prefix to get larger immediate values. But here there is a second prefix to allow 32 bit constants and not just 16 bit ones. The PC is separate from the other 16 registers. Register 0 reads as zero and ignores writes except for memory instructions, where it acts as the status register as in the above design. Memory operations:

15 14 13 12 11_10_9_8 7_6_5_4 3_2_1_0
1 Byte Save Store IDX REG OFFSET

Byte indicates that only 8 bits should be loaded or stored instead of 32 bits. In all cases the addresses always point to bytes and so should end in 00 when pointing to words. The address is obtained by adding OFFSET (a four bit signed values, properly scaled when dealing with words) to the register specified by the IDX field. Save indicates that the resulting address should be written back to the register specified by IDX. When saving and using a positive offset (bit 13=1 and bit 3=0) the original content of IDX is actually used to address memory (post increment mode) while in all other cases the address is the result of the addition. The store bit indicates the direction of the data transfer.

The remaining instructions (or instruction groups):

15_14_13_12 Name Description
0 0 0 0 prefix12 extends the immediate value of the next instruction
0 0 0 1 prefix28 extends the immediate value of the next instruction
0 0 1 0 branch depending on a condition adds the 8 bit signed offset to the PC
0 0 1 1 cascade combines the two sources and sends the result to the following instruction
0 1 0 0 logic combines the destination and source via a logic operation
0 1 0 1 math combines the destination and source via a math operation
0 1 1 0 shift shifts or rotates the destination register as indicated by the source
0 1 1 1 jumpAndLink similar format as memory instructions, saves the old PC in REG and loads the address into PC

While most instructions are 16 bits long, those that start with prefix12 are 32 bits long and those that start with prefix28 are 48 bits in size.

For the logic instructions bits 11 through 8 directly specify the results for all four combinations of destination and source bits (as described for the previous design).

For math instructions bits 11 through 9 encode ADD, nINV and INC respectively as in the previous designs while bit 8 indicates IMM - that the source is not a register number but a 4 bit unsigned value instead. Given that the 111 combination isn't very interesting (but it isn't needed for AND like in the Nova given the logical instructions) it is redefined to be exactly like 101 (SUB) but without writting the result, so it is a comparison instruction that only affects the status bits.

For shifts bits 11 through 9 encode LARGE, SIGNED and LEFT respectively while bit 8 indicates IMM. LARGE means that 16 should be added to the source value, which makes sense for IMM but not for registers so LARGE=1 and IMM=0 indicates a multiplication instead of shift. SIGNED means that bit 31 gets replicated on a shift right instead of zeros being inserted. A signed shift left doesn't make sense so it is interpreted to mean a rotate left instead.

Cascade instructions always take two register sources and their result is sent directly to the next instruction. Normal instructions are of the type rD := rD OP rS but when they follow a cascade the destination is not used as a source but is replaced by the result of the cascade. So the total effect of the two instructions is r4 := (r1 OP1 r2) OP2 r3 where r1 and r2 come from the cascade and r3 and r4 are the registers in the following instruction.

For cascades bits 11 through 8 are 0, nINV, INC and 0 respectively for math, 0, SIGNED, LEFT and 1 for shifts and 1, INV, AND and MOD for logic instructions. The bits have exactly the same meanings for the math (with an implied ADD=1 and IMM=0) and shift (with an implied LARGE=0 and IMM=0) cascades as for the normal variations. For logic INV means that the result is inverted, AND means that an AND operation is used instead of an OR and MOD modifies an AND by inverting the second operand and modifies an OR by making it into an exclusive OR. None of the logic combinations where one or both of the sources aren't used are possible with this scheme but these wouldn't make any sense with a cascade anyway. The AND combination with the first source inverted isn't possible either, but just exchanging the two sources gets the same effect.

Branches normally have a four bit condition field and a signed 8 bit offset just as described in the previous design. An optional feature is to let the "never" condition indicate a skip instruction instead of a branch. In that case bits 7 to 4 indicate the actual condition (with exactly the same encoding as bits 11 to 8 in normal branch instructions) and bits 3 to 0 represent the value to be ORed conditionally into the four skip bits of the status register. These bits are shifted up on each instruction (except for prefix and cascade) and if the bit that was shifted out was 1 then the instruction is treated as a NOP. On branches and jumps the skip bits are cleared.


This instruction encoding uses all possible combinations of 16 bits, so there are practically no "holes" left for coprocessor instructions and other possible expansions. But when we consider the prefix instructions, then there are available encodings since these make no sense except when the following instruction has an immediate operand. So we could simply define that a prefix followed by any non immediate instruction is a coprocessor instruction. We can imagine the coprocessor receiving an operand from the main processor and possibly returning a result, so it would logically sit between the output of the ALU and the register bank input. If no operand is needed then the coprocessor simply ignores it and when no result is generated then the coprocessor copies its input to its output becoming effectively transparent to the main processor. The decoding of the prefix bits is up to the coprocessor and so can specify a number of coprocessor registers that should take part of the operation.

24 bit variation on the I/O Processor

The instruction formats are essentially the same, but with 16 registers (register zero can be the PC) and a new "extra" field:

23_22_21_20 19_18 17_16 15_14_13_12 11_10_9_8 7_6 5_4 3_2_1_0
opcode cond dest mode destination extra extra mode source mode source
X X 1 1 cond dest mode destination 12 bit immediate value

Bit 23 selects between math and logic instructions and for the former bit 22 selects between add and subtract. For logic instructions we have:

. source = 0 source = 1
dest = 0 0 not i22
dest = 1 not i21 i21 or i20

Replacing the "subtract and set overflow" and the "set destination to destination" instructions with "shift" and "rotate", we have:

. X_X_0_0 X_X_0_1 X_X_1_0 X_X_1_1

The cond field:

0 0 always
0 1 never
1 0 if flag is true
1 1 if flag is false

The flag is normally set when the previous result was zero, but it can be set when there was a carry/borrow in the previous result (ADC and SBB instructions) or when there was an overflow in the previous result (ADV instruction). When the condition is false then the destination is not changed, but the calculation is actually done and the flag is affected. So "never" can be used to test things.

The destination and source modes are:

0 0 register
0 1 indirect
1 0 post increment
1 1 pre decrement

The mode for the extra field is:

0 0 add to source address
0 1 add to destination address
1 0 extra register is second source
1 1 extra register is destination

The first two modes of the extra field don't make sense when combined with the first mode of the source or destination. When an offset is not desired for neither the source nor the destination the extra field can be set to zero and the first two modes will have no effect.

Nibbler - a variable sized Forth processor based on dietST

The basic instruction format is four bits:

3 2 1_0
op mode value

The "op" field selects between "push literal" (0) and "call" (1). The "mode" indicates "in the instruction" (0) or "following the instruction" (1). The last two bits is the operand in mode 0, or the size of the operand in mode 1. A size of 00 is not valid and indicates that the following nibble has the actual size (where 0 is defined to mean 16). This allows operands of up to 64 bits.

The registers are PC (instruction pointer), RP (return stack pointer) and DP (data stack pointer). The stacks are in memory and grow upwards. Since variable sized data can be pushed on the stack, the nibble pointed to by RP or DP has the size of the item just under it. The registers are wide enough to address all memory but their actual size isn't visible to the programmer.

Calls with mode 0 (instructions 1000 to 1011) or with operands 4 bits wide (instructions 1101 0000 to 1101 1111) invoke the built in primitives, so we have 1 (push) + 1 (call) + 4 + 16 = 22 instructions in all. By carefully selecting the four most frequent instructions as the mode 0 ones the average size shouldn't be that much over the 5 bits of MISCs. The load and store instructions should have an extra argument saying the size of the operand in main memory. Another version of load and store could leave the size tag as used on the stacks.

16 bit design inspired on Hack

The educational processor described in the talk shown in this video is very interesting, but very primitive. That isn't much of a problem since it is only meant as the target for a compiler translating from Java-like bytecodes, but the variation described below is a usable machine language on its own. Here is the instruction format:

15 14 13 12 11 10 9 8 7 6_5_4 3 2_1_0
add zd zs io id is tz tn it d m s

There are eight registers, and d selects one as the destination and one of the sources while s selects the other source. Register 7 is also the program counter.

The ALU's output, destination input and source input are inverted as indicated by the io, id and is bits respectively.

When add is selected then zd and zs zero the destination input or the source input before the inversions (if selected). But these bits are not as useful for logical operations and since modern FPGAs have nice multiplication circuits they are redefined as:

add zd zs operation with multipliers alternative without multipliers
1 x x addition addition
0 0 0 multiply high unsigned shift left unsigned
0 0 1 multiply high signed shift left signed
0 1 0 multiply low unsigned swap bytes
0 1 1 and and

Note that though the table above says "addition", it is possible to get many interesting operations by zeroing or inverting the inputs and outputs. These are not always obvious - it isn't easy to see how subtraction can be done, for example. But since inverting an input results in -IN-1 using 2's complement notation, inverting both the destination input and also the output gives us -((-D-1)+S)-1 which is +D+1-S-1 resulting in D-S. See the table below.

Multiplying by two to the power of N is the same as shifting left by N bits if the lowest 16 bits are output or shifting right by 16-N bits if the highest 16 bits of the result are used. Both signed and unsigned versions make sense in the latter case.

The word following the current instruction can either be the next instruction or the address of the next instruction. The tz and tn test if the result was zero or negative, respectively. The it bit inverts the sense of the test, causing a jump if the test was false. So the following combinations are possible:

tz tn it = 0 it = 1
0 0 next instruction jump
0 1 jump if less jump if greater or equal
1 0 jump if equal jump if different
1 1 jump if less or equal jump if greater

When m is not selected then any of the eight registers can be used as source and destination but when the source is register 6 its original value is replaced by the word following the instruction. When both this immediate mode and a jump are used the instruction is 48 bits long. If m is selected, only the first four registers are available as the source (and address) since the high bit of its field selects the direction of the memory transfer:

s2 operation
0 mem(s) := d op s
1 d := d op mem(s)

Instructions can take from two to five clock cycles to execute as shown (note that we would normally have mem(r3) in states immediate and jump but the hardware changes that to mem(r7) instead, and the setting of ir in fetch can't be done in a normal instruction):

state execute equivalent instruction operation next state
fetch 1011 1100 0111 0111 r7 := r7 + 1, ir := mem(r7) if s=r6 then immediate else exec
immediate 1100 0000 0110 1111 r6 := mem(r7) skip imm
skip imm 1011 1100 0111 0111 r7 := r7 + 1 exec
exec ir . if cond then jump else fetch
jump 1100 0000 0111 1111 r7 := mem(r7) fetch

Here are the 40 math and logic operations that can be selected using the six highest bits of the instruction (the 17 repeated or normally uninteresting ones are indicated in italics):

. 011xxxxx 100xxxxx 101xxxxx 110xxxxx 111xxxxx
xxx000xx d and s d + s d s 0
xxx001xx d and not s d - s - 1 d - 1 - s - 1 - 1
xxx010xx not d and s s - d - 1 - d - 1 s - 1 - 1
xxx011xx d nor s - d - s - 2 - d - 2 - s - 2 - 2
xxx100xx d nand s - d - s - 1 - d - 1 - s - 1 - 1
xxx101xx not d or s s - d - d s 0
xxx110xx d or not s d - s d - s 0
xxx111xx d or s d + s + 1 d + 1 s + 1 1

Superscalar based on WM

This is a good design for running sequential C code with lots of floating point operations. The basic idea is a set of relatively independent units that communicate using FIFOs. Sequential semantics is preserved by having each unit work strictly in order, but there is no need for the units to deal with instructions in order between themselves. Instead, the sequential semantics is enforced by the FIFOs.

Most of the papers don't seem to be available, and the evaluation one that is doesn't give too many details about the instruction set. It probably wasn't too different from what is proposed here:

31 to 27 26 to 22 21 to 17 16 to 12 11 to 7 6 to 2 1 0
Rdest Rsrc offset 0 0
lower upper incr start count op 0 1
Fdest F1 F2 F3 op1 op2 1 0
Rdest R1 R2 R3 op1 op2 1 1

The first instruction type is the jumpAndLink, which has the semantics of "Rdest := oldPC; PC := [Rsrc]+offset", where offset is a 22 bit value which always has the bottom two bits as zero. R0 is always zero, so using it as Rdest makes the instruction a simple jump while as Rsrc means jump to an absolute location in the first 4MB of memory. Using R1 as Rdest means jumpIfFalse while R2 makes it jumpIfTrue. These variations consume a bit from the condition FIFO. When R1 is used as Rsrc we have a relative jump.

The second group of instructions is the address calculation. The op code means:

bit when 0 when 1
6 single stream
5 byte word
4 integer float
3 read write

The stream operations need five register values. An area in memory delimited by "lower" and "upper" is used, with the initial pointer at "start" and incremented by "incr" for each value read or written for a total of "count" times. The pointer wraps around from "upper" to "lower". This is typical for DSPs and can be used to scan arrays in memories in various interesting ways. The Integer unit has two input FIFOs for data and two output FIFOs, and so does the Float unit for a total of eight addressable FIFOs.

The single value memory operations could be implemented with the previous type by just setting "count" to 1, but it is nice to have a slightly different format similar to the one for jumpAndLink.

The last two groups of instructions are identical in everything except one is executed in the Float unit and the other in the Integer unit. The semantics are "Rdest := (R1 op1 R2) op2 R3", which is like two cascaded instructions in RISC42. Though "op1" and "op2" use the same encoding, they are executed by different ALUs in two successive pipeline stages. Note that R0 is always zero, while R1 is a value from input FIFO1 (or to output FIFO1) and R2 means FIFO2. Comparison OPs copy one of their inputs to the output while placing one bit into the condition FIFO.


Links for a Computer Architecture course