
I/O Processor

Though particularly interesting for ASIC implementations, where the hardware is unchangeable and all flexibility must come from the software, the I/O Processor is also a good option for FPGA designs, where it replaces random logic with shared use of block RAMs.


16 tasks run in parallel, with the highest priority task that is not blocked having full use of the processor in the next clock cycle. Task 15 has the highest priority of all and 0 the lowest.
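
The selection rule can be sketched in a few lines of Python. This is only an illustration of the priority scheme, not the hardware itself: the function name `next_task` and the representation of the ready tasks as a 16 bit mask (one bit per task) are assumptions made for the example.

```python
from typing import Optional


def next_task(ready: int) -> Optional[int]:
    """Return the highest numbered ready task (0-15), or None if all are blocked.

    `ready` is assumed to be a 16 bit mask where bit n is 1 when task n
    is ready to run.
    """
    ready &= 0xFFFF
    if ready == 0:
        return None
    # The position of the highest set bit is the highest priority task.
    return ready.bit_length() - 1
```

For example, with tasks 15 and 0 both ready, `next_task(0x8001)` picks task 15, and task 0 only gets the processor once every higher task is blocked.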

The RAM stores the 16 program counters, 16 wake up values, the actual machine code for all the threads and all the state for these threads. A second port for this RAM (read-only) sends data to the video circuits.

On the Oliver truck terminal, the following tasks are used (when the USB port is not available, though it is likely that the USB will get its own I/O processor):

| Task | Name | Description |
|---|---|---|
| 15 | wait | makes the single comparison circuit do the work of 16. The other tasks make requests to be wakened at a specific clock cycle (as indicated by a 16 bit counter running at 54MHz, which cycles every 1.21 ms). The comparison is set to the value of the soonest requested cycle and then the task goes to sleep. When it wakes up it also wakes up the indicated task (though since that has a lower priority it will have to wait for task 15 to finish its job first), and it finds the next soonest requested cycle and sets the comparison register to that. |
| 14 | addWait | receives a request from other tasks to be woken at a given clock cycle and then changes the tables used by task 15 so this will happen. Care is taken so that being interrupted by task 15 doesn't cause a conflict. |
| 13 | videoBuf | reads four words from memory into the video buffer |
| 12 | videoV | generates the vertical timing for the video. It can also change the program counter for task 11 (which is suspended at this point) to select between different kinds of horizontal lines |
| 11 | videoH | generates the horizontal timing for the video. It wakes up task 12 once per line |
| 10 | videoChroma | accepts requests from the main processor to change the setting for the chroma frequency |
| 8 | barcode | detects pulse widths in the bar code reader input |
| 7 | lcd | sends data to the liquid crystal display |
| 6 | sound | reads a byte of memory and sends it to the audio DAC |
| 5 | com2RX | receives data for serial port 2 |
| 4 | com1RX | receives data for serial port 1 |
| 3 | com2TX | transmits data for serial port 2 |
| 2 | com1TX | transmits data for serial port 1 |
| 1 | keyboard | scans the keys to check for any change |
| 0 | rtc | keeps track of the real time |
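
The job of task 15 (picking the soonest pending request on a counter that wraps around) might look like the following sketch. The table of requests is modelled as a Python dictionary and the names `soonest_request` and `counter_period_ms` are hypothetical; the real tables live in the block RAM and their layout is not described here. The sketch also assumes that every pending request lies less than one full counter period in the future.

```python
COUNTER_BITS = 16
COUNTER_MASK = (1 << COUNTER_BITS) - 1
CLOCK_HZ = 54_000_000


def counter_period_ms():
    """Time for the 16 bit counter to wrap around, in milliseconds."""
    return (1 << COUNTER_BITS) / CLOCK_HZ * 1000


def soonest_request(now, requests):
    """Return (task, cycle) for the request that will match the counter first.

    `requests` maps a task number to the counter value at which it wants
    to be woken. Distances are taken modulo 2**16 so that a request just
    after the wrap-around is still seen as "soon".
    """
    if not requests:
        return None
    return min(requests.items(), key=lambda item: (item[1] - now) & COUNTER_MASK)
```

With the counter at 0xFFF0, a request for 0xFFF8 is 8 cycles away and a request for 0x0005 is 21 cycles away, so the first one is chosen even though its raw value is larger.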


The processor has a register based architecture (2 address). Each task has its own 8 register bank, with register 7 being the program counter (bit 15 of register 7 is the flag which normally indicates whether the previous result was zero, but can have other meanings for some instructions). So 128 words of memory (3F80 to 3FFF) are registers, though the registers for any unused task can be used for data or instructions.
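
As a rough illustration of how the register banks map onto memory, the sketch below computes the address of a register and splits register 7 into flag and program counter. The order of the banks (task 0 at 3F80, task 15 ending at 3FFF) and the use of the low 15 bits as the program counter are assumptions made for the example; the text only gives the address range and the position of the flag.

```python
REG_BASE = 0x3F80       # first register word, from the text
REGS_PER_TASK = 8
PC_REG = 7
FLAG_BIT = 15


def reg_address(task, reg):
    """Memory address of register `reg` of `task` (bank order is assumed)."""
    assert 0 <= task < 16 and 0 <= reg < REGS_PER_TASK
    return REG_BASE + task * REGS_PER_TASK + reg


def split_pc(r7):
    """Split register 7 into (flag, pc); pc as the low 15 bits is an assumption."""
    return (r7 >> FLAG_BIT) & 1, r7 & 0x7FFF
```

Under these assumptions the 16 banks of 8 words cover exactly the 128 words from 3F80 to 3FFF.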

The two variations of the instruction format are:

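| format | bits 15-12 | bits 11-10 | bits 9-8 | bits 7-5 | bits 4-3 | bits 2-0 |
|---|---|---|---|---|---|---|
| register | operation | save | destination mode | destination register | source mode | source register |
| immediate | xx11 | save | destination mode | destination register | immediate unsigned 5 bit value (bits 4-0) | |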
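
A decoder for the two formats can be sketched as below. The field positions come from the layout above; the function name `decode` and the dictionary keys are made up for the example.

```python
def decode(word):
    """Split a 16 bit instruction word into its fields.

    Opcodes of the form xx11 use the immediate format, where the low
    5 bits are an unsigned immediate instead of source mode/register.
    """
    op = (word >> 12) & 0xF
    fields = {
        "op": op,
        "save": (word >> 10) & 0x3,
        "dmode": (word >> 8) & 0x3,
        "dreg": (word >> 5) & 0x7,
    }
    if op & 0x3 == 0x3:
        fields["imm"] = word & 0x1F
    else:
        fields["smode"] = (word >> 3) & 0x3
        fields["sreg"] = word & 0x7
    return fields
```

For instance, `decode(0x202A)` describes an ADD (0010) with register 1 as destination and register 2 used as an index (mode 01) for the source, while `decode(0x3075)` is an ADI (0011) adding the immediate 21 to register 3.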

Where operation codes are:

0000 MOV move source to destination
0001 ADC add and set flag to value of carry
0010 ADD add source to destination
0011 ADI add immediate to destination
0100 NEG move complement of source to destination
0101 SBB subtract and set flag to value of borrow
0110 SUB subtract source from destination
0111 SBI subtract immediate from destination
1000 MULL multiply source with destination and save low word
1001 XOR exclusive or source with destination
1010 OR inclusive or source with destination
1011 ORI or immediate with destination
1100 MULH multiply source with destination and save high word
1101 BIC bit clear - and inverted source with destination
1110 AND and source with destination
1111 ANI and immediate with destination

Note that the logical operations are structured like this (where i14 and i13 are bits 14 and 13 of the instruction):

| | source = 0 | source = 1 |
|---|---|---|
| destination = 0 | 0 | not i14 |
| destination = 1 | not i14 or not i13 | i13 |
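
The table can be checked against the four logical opcodes with a short script. Reading i14 and i13 as bits 14 and 13 of the instruction word is an interpretation of the table, and `logic_bit` is a hypothetical name for the one bit function it describes.

```python
def logic_bit(i14, i13, dest, src):
    """One bit of the logic unit result, following the table above."""
    if dest == 0 and src == 0:
        return 0
    if dest == 0 and src == 1:
        return 1 - i14
    if dest == 1 and src == 0:
        return (1 - i14) | (1 - i13)
    return i13


# Reference behaviour of each opcode, keyed by (i14, i13).
REFERENCE = {
    (0, 0): lambda d, s: d ^ s,        # 1001 XOR
    (0, 1): lambda d, s: d | s,        # 1010 OR (and 1011 ORI)
    (1, 0): lambda d, s: d & (1 - s),  # 1101 BIC
    (1, 1): lambda d, s: d & s,        # 1110 AND (and 1111 ANI)
}


def table_matches_opcodes():
    """True if the table reproduces XOR, OR, BIC and AND for every input."""
    return all(
        logic_bit(i14, i13, d, s) == ref(d, s)
        for (i14, i13), ref in REFERENCE.items()
        for d in (0, 1)
        for s in (0, 1)
    )
```

With this reading, two instruction bits are enough to select between all four operations.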

If multipliers are not easily available, the MULL and MULH instructions can be replaced with LSHR (logical shift right by 1) and ASHR (arithmetic shift right by 1) respectively.

The save field is interpreted as:

| bits | syntax | meaning |
|---|---|---|
| 00 | | always save result to destination |
| 01 | ns | never save result to destination |
| 10 | ifz | save to destination if flag is zero |
| 11 | ifnz | save to destination if flag is not zero |
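
Since there are no branch instructions, the save field is what makes execution conditional. A minimal sketch of the decision, with `should_save` as a hypothetical name:

```python
def should_save(save, flag):
    """Decide whether the result is written back, given the 2 bit save field."""
    if save == 0b00:
        return True          # always
    if save == 0b01:
        return False         # ns: never save (result is discarded)
    if save == 0b10:
        return flag == 0     # ifz
    return flag != 0         # ifnz
```

Combined with register 7 as the destination, an `ifz` or `ifnz` move behaves like a conditional jump.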

The mode field (destination and source) is interpreted as:

| bits | syntax | name | description |
|---|---|---|---|
| 00 | Rx | register | the register itself is used as a source or destination |
| 01 | *Rx | index | the register is used as the address of a source or destination in memory |
| 10 | *Rx++ | post increment | like index, but the register is incremented after being used as the address |
| 11 | *--Rx | pre decrement | like index, but the register is decremented before being used as the address |
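
The four modes can be modelled as a function that returns the memory address of the operand and updates the register as a side effect. This is a sketch only: `operand_address` and the list used as a register bank are assumptions, and the 16 bit wrap-around of the register is assumed as well.

```python
def operand_address(mode, regs, r):
    """Return the memory address of the operand, or None for register mode.

    `regs` is a mutable list of 8 register values; it is updated for the
    post increment and pre decrement modes.
    """
    if mode == 0b00:                      # Rx: the operand is the register
        return None
    if mode == 0b01:                      # *Rx
        return regs[r]
    if mode == 0b10:                      # *Rx++
        addr = regs[r]
        regs[r] = (addr + 1) & 0xFFFF
        return addr
    regs[r] = (regs[r] - 1) & 0xFFFF      # *--Rx
    return regs[r]
```

Post increment and pre decrement are symmetrical, so a `*Rx++` followed by a `*--Rx` on the same register touches the same address twice, which is what a stack needs.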

I/O Ports

A number of special memory positions are set aside as input and output ports. These are at the bottom of the memory map, where their addresses can easily be loaded into a register with immediate instructions. Here is the example from Oliver:

| address (hex) | width (bits) | description (read) | description (write) |
|---|---|---|---|
| 0 | 16 | task ready - each bit corresponds to a task and if it is 0 then the task is suspended and if it is 1 then the task is ready to run | clears the corresponding bits of the task ready |
| 1 | 16 | task ready | sets the corresponding bits of the task ready |
| 2 | 16 | next word from ring network | sends next word to ring network |
| 3 | 1/16 | last word indication from ring network | sends last word to ring network |
| 4 | 16 | counter - is incremented at 54MHz | trigger - when this matches the counter, task 15 is set to run |
| 8 | 16 | sleep | set sleep value and put current task to sleep |
| 9 | 4 | previous task (set by writing to sleep) | |
| 10 | 1 | serial tx1 | sets tx1 |
| 11 | 1 | serial rx1 | sets change in rx1 to wake up task 4 |
| 12 | 1 | serial tx2 | sets tx2 |
| 13 | 1 | serial rx2 | sets change in rx2 to wake up task 5 |
| 14 | 1 | bar code reader | sets change in bar code reader to wake up task 8 |
| 15 | 10 | keyboard columns | sets keyboard column value for one clock cycle (used for precharge) |
| 16 | 8 | | sets the sound DAC |
| 18 | 8 | LCD data | sets LCD data |
| 19 | 7 | | sets LCD control (e, w, rs, backlight, cs1, cs2, cs3) |
| 1A | 16 | | sets chromaLow - bottom 16 bits of the 32 bit counter that generates a 3.58MHz sine wave |
| 1B | 16 | | sets chromaHigh - top 16 bits of the 32 bit counter that generates a 3.58MHz sine wave |
| 1C | 5 | | sets the video control (vsync, hsync, blank, burst1, burst0) |

Internal ports (0 to F) for which the read value is not defined will actually return the value of the task ready register, while external ports (10 to 1F) will return the value of the keyboard columns.
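
Having separate "clear" and "set" ports for the task ready register means a task can change individual bits with a single write, with no read-modify-write sequence that a higher priority task could interrupt. A small model of ports 0 and 1, with the class name `TaskReady` made up for the example:

```python
class TaskReady:
    """Toy model of the task ready register behind ports 0 and 1."""

    def __init__(self):
        self.bits = 0

    def write(self, port, value):
        value &= 0xFFFF
        if port == 0:
            self.bits &= ~value & 0xFFFF   # port 0: clear the given bits
        elif port == 1:
            self.bits |= value             # port 1: set the given bits

    def read(self, port):
        # Both ports read back the same register.
        return self.bits
```

Writing `1 << n` to port 1 wakes up task n, and writing the same value to port 0 suspends it.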


The worst case takes seven cycles, when switching from another task to an instruction with pre-decrement or post-increment in both source and destination:
  • save PC/flag (old task)
  • load PC/flag (new task)
  • decode instruction and address registers
  • update source register
  • update destination register
  • fetch source/destination from memory
  • execute and (possibly) save result
but only three cycles when the current task is the same as the previous one and both source and destination are registers.

At 54MHz the three cycle best case means the total processing power is up to around 18 MIPS, which is a respectable speed for the simple tasks that it must handle even when divided among 16 coroutines.
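
The arithmetic behind these figures is shown below. The best case of three cycles is from the text; counting each of the seven steps listed for the worst case as one clock cycle is an assumption.

```python
CLOCK_HZ = 54_000_000


def mips(cycles_per_instruction):
    """Millions of instructions per second at a fixed cycle count."""
    return CLOCK_HZ / cycles_per_instruction / 1e6


best_case = mips(3)    # same task, register to register
worst_case = mips(7)   # task switch plus auto increment/decrement on both operands
```

So the throughput lies somewhere between roughly 7.7 and 18 MIPS depending on how often tasks switch and which addressing modes are used.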


  • The reason why tasks must specify when they want to wake up in terms of absolute time instead of a delay value is that they might take an arbitrary time to actually start executing once the desired time arrives, and this execution might proceed at an uneven pace as higher priority tasks interfere. With absolute time values the next request can simply add a number to the previously requested value and so eliminate a huge source of jitter.
  • When any instruction uses register 7 as the destination, the old (but already incremented) value of the PC is saved in register 6. So there are no jumps but only jumps to subroutine. With care, register 6 can still be used to hold temporary values between control flow changes.
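
The difference between absolute and relative wake up times can be seen in a small simulation. Everything here is hypothetical (the function `schedule`, the period of 100 cycles and the latencies), and the 16 bit wrap-around of the counter is left out to keep the sketch short.

```python
def schedule(period, latencies, absolute):
    """Return the cycles at which a periodic task actually starts running.

    `latencies` is the delay between each requested wake up time and the
    moment the task really gets the processor. With `absolute` the next
    request is based on the previous request; otherwise it is based on
    the actual start time, as a plain delay value would be.
    """
    requested = 0
    starts = []
    for latency in latencies:
        start = requested + latency
        starts.append(start)
        base = requested if absolute else start
        requested = base + period
    return starts
```

With a period of 100 and latencies of 3, 0, 5 and 2 cycles, the absolute version starts at 3, 100, 205 and 302, so each start is off only by its own latency. The delay based version starts at 3, 103, 208 and 310, with the error accumulating from one period to the next.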


The goal of the above design was to use up as few FPGA resources as possible. So it has little logic (using a single block RAM for both main memory and all registers, including the PC) and a relatively dense instruction set. Its 16 bit size is well suited to working with only internal FPGA memory, so improvements such as extending the instruction word or adding caches and external memory probably wouldn't make much sense. Some more modest extensions, however, could make it suitable for new applications without eliminating the features that make it an interesting architecture in the first place.

Prefix Instruction

Some obvious limitations of the instruction set are the number of registers, the number of modes and the use of the destination as one of the sources. These limitations aren't significant most of the time so one solution would be to have a prefix instruction that modified the following instruction:

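| bits 15-12 | bit 11 | bit 10 | bits 9-8 | bits 7-5 | bit 4 | bit 3 | bit 2 | bit 1 | bit 0 |
|---|---|---|---|---|---|---|---|---|---|
| 1 1 1 1 | 1 | MEb | Mb | Rb | REb | MEd | REd | MEs | REs |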

This uses the same opcode (1111) as the ANI instruction, except that there it never makes sense for bits 10 and 11 not to be zero (always save), so here bit 11 is one instead. Mb and Rb are exactly like the destination mode and destination register in regular instructions, but here they refer to a second operand which is distinct from the destination, turning the following instruction from a two address type to a three address one. REb, REd and REs extend the register field of the second source, destination and source, respectively, so that 16 registers can be addressed instead of just 8. In the same way, MEb, MEd and MEs extend the mode field so that there are 8 address modes instead of the original 4. When these bits are zero you get the original modes and when they are one:

| ME | Mode | syntax | name | description |
|---|---|---|---|---|
| 1 | 0 0 | N[R] | array | a word following the instruction has the base address of an array indexed by the register |
| 1 | 0 1 | R->N | offset | a word following the instruction has a 16 bit offset in the structure pointed to by the register |
| 1 | 1 0 | *++R | pre increment | like index, but the register is incremented before being used as an address |
| 1 | 1 1 | *R-- | post decrement | like index, but the register is decremented after being used as an address |

In the case of a 16 bit implementation the first two extra modes are exactly the same. On a 32 or 64 bit implementation they would be different since the first always has a base address the size of the registers while the second always uses a 16 bit offset. Note that the extended instruction is effectively 32 bits long.

For the immediate type of instructions MEs and REs would extend the 5 bit immediate source to 7 bits.
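
Decoding the proposed prefix word could look like the sketch below. The field positions follow the layout given above and the ME bit is the top bit of the 3 bit mode, as in the table of extended modes. Treating each RE bit as the top bit of a 4 bit register number is an assumption, as are the names `decode_prefix` and `extend`.

```python
def decode_prefix(word):
    """Return the prefix fields, or None if `word` is not a prefix instruction."""
    if (word >> 11) & 0x1F != 0b11111:
        return None
    return {
        "MEb": (word >> 10) & 1,
        "Mb": (word >> 8) & 0x3,
        "Rb": (word >> 5) & 0x7,
        "REb": (word >> 4) & 1,
        "MEd": (word >> 3) & 1,
        "REd": (word >> 2) & 1,
        "MEs": (word >> 1) & 1,
        "REs": word & 1,
    }


def extend(me, mode, re, reg):
    """Combine extension bits with the base fields: (3 bit mode, 4 bit register)."""
    return (me << 2) | mode, (re << 3) | reg
```

For example, `decode_prefix(0xFD55)` gives a second operand with MEb=1, Mb=01 and Rb=2 with REb=1, which `extend(1, 0b01, 1, 2)` turns into mode 101 (offset) on register 10.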

Tagged Pointers

The basic architecture only deals with 16 bit words and must use the masking and rotation instructions to handle bytes. A 32 bit version would have the same limitation and would actually be a little more awkward since now 16 bit operands wouldn't be natural for it and yet that would still be the size of the instructions. An interesting extension would be to have the highest bits in a pointer indicate the size of the object it points to:

0xxxxxxxx.... address of an 8 bit object
10xxxxxxx.... address of a 16 bit object
110xxxxxx.... address of a 32 bit object
1110xxxxx.... address of a 64 bit object
1111xxxxx.... address of an internal register or i/o port
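
Finding the object size from the tag amounts to counting the leading one bits of the pointer. A sketch for 16 bit pointers, where the function name `pointer_kind` and the use of None for the register and I/O port space are choices made for the example:

```python
def pointer_kind(ptr, width=16):
    """Return the object size in bits for a tagged pointer.

    Returns None for the 1111 tag, which addresses internal registers
    and i/o ports rather than memory objects.
    """
    top = (ptr >> (width - 4)) & 0xF
    if top & 0b1000 == 0:
        return 8
    if top & 0b0100 == 0:
        return 16
    if top & 0b0010 == 0:
        return 32
    if top & 0b0001 == 0:
        return 64
    return None
```

Each extra tag bit halves the number of addressable objects while doubling their size, which is why all kinds of pointers cover the same amount of memory.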

The different kinds of pointers address the same amount of memory and can be converted from one kind to any other. While the C language was created with this kind of thing in mind, most currently available open source C compilers assume that converting between pointer types doesn't generate any code (a quick look at gcc didn't make it very clear if this is the case for it as well) so that:
 ptr = (long *) 64; cp = (char *) ptr; ++cp; --ptr;
will be compiled to something like:
 mov R2,#64
 mov R3,R2
 add R3,R3,#1
 sub R2,R2,#4
So we see that the type casts generate no code but the compiler still has to keep track of the types so it can know how much to add or subtract in the last two instructions. With the tagged pointers we would have code like:
 mvi R2,#64 ; char pointer
 adi R2,#3
 rot R2,*R7++
 #-2 ; converted to long pointer
 mov R3,R2
 lsh R3,*R7++
 #2 ; converted to char pointer
 adi R3,#1
 sbi R2,#1
The code generated by the type casts looks a bit complicated but this isn't a very common operation. Note that the last two instructions don't make the compiler deal with different increment/decrement sizes. In this example it doesn't make a difference, but if the automatic increment/decrement addressing modes were being used then it would make a significant difference.

Since the instruction pointer always deals with 16 bit objects, we can ignore the top two bits of R7 and act as if they were always 1 and 0. This allows us to actually store the flag in the top bit of R7 as before and we also get an extra bit of status which could, for example, indicate a user/supervisor execution mode (probably more useful in a version with a single thread and with interrupts).

Link to this Page

  • Other last edited on 1 April 2011 at 8:48:28 pm by