Though particularly interesting for ASIC implementations, where the hardware is unchangeable and all flexibility must come from the software, the I/O Processor is also a good option for FPGA designs, where it replaces random logic with shared use of block RAMs.
16 tasks run in parallel, with the highest priority task that is not blocked having full use of the processor in the next clock cycle. Task 15 has the highest priority of all and 0 the lowest.
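The priority rule above can be sketched in a few lines. This is only an illustrative model, not the hardware: `next_task` is a hypothetical name, and it assumes that bit N of a 16 bit ready mask corresponds to task N.

```python
def next_task(ready):
    """Return the highest-numbered ready task, or None if every task is blocked.

    Assumption: bit N of `ready` is 1 when task N is ready to run.
    """
    ready &= 0xFFFF          # only 16 tasks exist
    if ready == 0:
        return None          # nothing can run in the next clock cycle
    return ready.bit_length() - 1   # index of the highest set bit
```

With tasks 0 and 15 both ready, task 15 is selected; task 0 only runs once every higher task is blocked.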
The RAM stores the 16 program counters, 16 wake up values, the actual machine code for all the threads and all the state for these threads. A second port for this RAM (read-only) sends data to the video circuits.
On the Oliver truck terminal, the following tasks are used (when the USB port is not available, though it is likely that the USB will get its own I/O processor):
| Task || Name || Description|
| 15 || wait || makes the single comparison circuit do the work of 16. The other tasks make requests to be wakened at a specific clock cycle (as indicated by a 16 bit counter running at 54MHz, which cycles every 1.21 ms). The comparison is set to the value of the soonest requested cycle and then the task goes to sleep. When it wakes up it also wakes up the indicated task (though since that has a lower priority it will have to wait for task 15 to finish its job first), and it finds the next soonest requested cycle and sets the comparison register to that.|
| 14 || addWait || receives a request from other tasks to be woken at a given clock cycle and then changes the tables used by task 15 so this will happen. Care is taken so that being interrupted by task 15 doesn't cause a conflict|
| 13 || videoBuf || reads four words from memory into the video buffer|
| 12 || videoV || generates the vertical timing for the video. It can also change the program counter for task 11 (which is suspended at this point) to select between different kinds of horizontal lines|
| 11 || videoH || generates the horizontal timing for the video. It wakes up task 12 once per line|
| 10 || videoChroma || accepts requests from the main processor to change the setting for the chroma frequency|
| 9 |
| 8 || barcode || detects pulse widths in the bar code reader input|
| 7 || lcd || sends data to the liquid crystal display|
| 6 || sound || reads a byte of memory and sends it to the audio DAC|
| 5 || com2RX || receive data for serial port 2|
| 4 || com1RX || receive data for serial port 1|
| 3 || com2TX || transmit data for serial port 2|
| 2 || com1TX || transmit data for serial port 1|
| 1 || keyboard || scans the keys to check for any change|
| 0 || rtc || keeps track of the real time|
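The search that task 15 performs for the soonest requested cycle can be sketched as below. This is a guess at the logic, not the actual machine code: the function names and the dictionary of requests are hypothetical, and the sketch assumes that "soonest" is measured as the distance ahead of the current counter value modulo 65536, so that requests past the counter wrap-around are still ordered correctly.

```python
COUNTER_MASK = 0xFFFF  # 16 bit counter, wraps every 65536 clocks


def ticks_until(now, wake):
    """Clock cycles from `now` until `wake`, allowing for counter wrap-around."""
    return (wake - now) & COUNTER_MASK


def soonest_request(now, requests):
    """Pick the request closest ahead of `now`.

    `requests` maps a task number to its requested wake-up cycle
    (a hypothetical stand-in for the tables shared by tasks 14 and 15).
    Returns (task, wake) or None when there are no requests.
    """
    if not requests:
        return None
    task = min(requests, key=lambda t: ticks_until(now, requests[t]))
    return task, requests[task]
```

A limitation of this simple ordering is that a request whose time has already passed looks like one almost a full counter period in the future, so the real tables must be handled before that can happen.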
The processor has a register based architecture (2 address). Each task has its own 8 register bank, with register 7 being the program counter (bit 15 of register 7 is the flag which normally indicates whether the previous result was zero, but can have other meanings for some instructions). So 128 words of memory (3F80 to 3FFF) are registers, though the registers for any unused task can be used for data or instructions.
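As an illustration of how the registers could map onto memory, here is a small sketch. The text only states that the 128 register words occupy 3F80 to 3FFF; the ordering used here (task 0 first, eight consecutive words per task) and the helper names are assumptions.

```python
REG_BASE = 0x3F80        # first register word, from the text
REGS_PER_TASK = 8


def reg_address(task, reg):
    """Memory address of register `reg` of task `task` (assumed layout)."""
    assert 0 <= task < 16 and 0 <= reg < REGS_PER_TASK
    return REG_BASE + task * REGS_PER_TASK + reg


def split_r7(value):
    """Split register 7 into (program counter, flag); the flag is bit 15."""
    return value & 0x7FFF, (value >> 15) & 1
```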
The two variations of the instruction format are:
| 15_14_13_12 || 11_10 || 9_8 || 7_6_5 || 4_3 || 2_1_0|
| operation || save || destination mode || destination register || source mode || source register|
| xx11 || save || destination mode || destination register || immediate unsigned 5 bit value|
Where operation codes are:
| 0000 || MOV || move source to destination|
| 0001 || ADC || add and set flag to value of carry|
| 0010 || ADD || add source to destination|
| 0011 || ADI || add immediate to destination|
| 0100 || NEG || move complement of source to destination|
| 0101 || SBB || subtract and set flag to value of borrow|
| 0110 || SUB || subtract source from destination|
| 0111 || SBI || subtract immediate from destination|
| 1000 || MULL || multiply source with destination and save low word|
| 1001 || XOR || exclusive or source with destination|
| 1010 || OR || inclusive or source with destination|
| 1011 || ORI || or immediate with destination|
| 1100 || MULH || multiply source with destination and save high word|
| 1101 || BIC || bit clear - and inverted source with destination|
| 1110 || AND || and source with destination|
| 1111 || ANI || and immediate with destination|
Note that the logical operations are structured like this:
| . || source = 0 || source = 1|
| destination = 0 || 0 || not i14|
| destination = 1 || not i14 or not i13 || i13|
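The table can be checked against the four register-to-register logical opcodes. The sketch below evaluates the table one bit at a time and compares it with the usual definitions of XOR, OR, BIC and AND; the function names are hypothetical, and i14 and i13 are taken to be bits 14 and 13 of the instruction word.

```python
def logic_bit(i14, i13, dest, src):
    """One result bit, computed exactly as the table above describes."""
    if dest == 0 and src == 0:
        return 0
    if dest == 0 and src == 1:
        return 1 - i14
    if dest == 1 and src == 0:
        return (1 - i14) | (1 - i13)
    return i13


# Conventional definitions, keyed by the 4 bit operation code (bits 15..12).
REFERENCE = {
    0b1001: lambda d, s: d ^ s,         # XOR
    0b1010: lambda d, s: d | s,         # OR
    0b1101: lambda d, s: d & (1 - s),   # BIC
    0b1110: lambda d, s: d & s,         # AND
}


def table_matches(opcode):
    """True if the table reproduces the reference operation for every input."""
    i14 = (opcode >> 2) & 1   # bit 14 of the word is bit 2 of the opcode
    i13 = (opcode >> 1) & 1   # bit 13 of the word is bit 1 of the opcode
    return all(
        logic_bit(i14, i13, d, s) == REFERENCE[opcode](d, s)
        for d in (0, 1)
        for s in (0, 1)
    )
```

ORI (1011) and ANI (1111) share bits 14 and 13 with OR and AND, so the same table covers them.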
If multipliers are not easily available, the MULL and MULH instructions can be replaced with LSHR (logical shift right by 1) and ASHR (arithmetic shift right by 1) respectively.
The save field is interpreted as:
| 00 || . || always save result to destination|
| 01 || ns || never save result to destination|
| 10 || ifz || save to destination if flag is zero|
| 11 || ifnz || save to destination if flag is not zero|
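The save field amounts to a small predicate on the flag. A minimal sketch, with a hypothetical function name:

```python
def should_save(save, flag):
    """Decide whether the result is written back, following the table above.

    `save` is the 2 bit save field, `flag` the current flag bit.
    """
    if save == 0b00:
        return True          # always save
    if save == 0b01:
        return False         # ns: never save (result only affects the flag)
    if save == 0b10:
        return flag == 0     # ifz: save if flag is zero
    return flag != 0         # ifnz: save if flag is not zero
```

Since an instruction whose destination is register 7 changes the program counter, a conditional save to register 7 presumably behaves as a conditional jump.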
The mode field (destination and source) is interpreted as:
| 00 || Rx || register || the register itself is used as a source or destination|
| 01 || *Rx || index || the register is used as the address of a source or destination in memory|
| 10 || *Rx++ || post increment || like index, but the register is incremented after being used as the address|
| 11 || *--Rx || pre decrement || like index, but the register is decremented before being used as the address|
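Putting the field tables together, an instruction word could be split up as in the following sketch. The `Decoded` type and `decode` function are hypothetical names; the bit positions come from the format table, and an instruction is treated as the immediate variation when the two low bits of its operation code are 11.

```python
from typing import NamedTuple, Optional


class Decoded(NamedTuple):
    opcode: int
    save: int
    dst_mode: int
    dst_reg: int
    src_mode: Optional[int]    # None for the immediate variation
    src_reg: Optional[int]     # None for the immediate variation
    immediate: Optional[int]   # None for the register variation


def decode(word):
    """Split a 16 bit instruction word into its fields."""
    opcode = (word >> 12) & 0xF
    save = (word >> 10) & 0x3
    dst_mode = (word >> 8) & 0x3
    dst_reg = (word >> 5) & 0x7
    if opcode & 0x3 == 0x3:
        # xx11: bits 4..0 are an unsigned 5 bit immediate
        return Decoded(opcode, save, dst_mode, dst_reg, None, None, word & 0x1F)
    src_mode = (word >> 3) & 0x3
    src_reg = word & 0x7
    return Decoded(opcode, save, dst_mode, dst_reg, src_mode, src_reg, None)
```

For example, the word 2032 (hex) decodes as ADD with destination R1 and source *R2++, and 38E5 as ADI of 5 to R7 with the ifz save condition.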
A number of special memory positions are set aside as input and output ports. These are at the bottom of the memory map where their addresses can easily be loaded into a register with immediate instructions. Here is the example from Oliver:
Internal ports (0 to F) for which the read value is not defined will actually return the value of the task ready register while external ports (10 to 1F) will return the value of the keyboard columns.
| address || width || description (read) || description (write)|
| 0 || 16 || task ready - each bit corresponds to a task and if it is 0 then the task is suspended and if it is 1 then the task is ready to run || clears the corresponding bits of the task ready|
| 1 || 16 || task ready || sets the corresponding bits of the task ready|
| 2 || 16 || next word from ring network || sends next word to ring network|
| 3 || 1/16 || last word indication from ring network || sends last word to ring network |
| 4 || 16 || counter - is incremented at 54MHz || trigger - when this matches the counter, task 15 is set to run|
| 8 || 16 || sleep || set sleep value and put current task to sleep|
| 9 || 4 || previous task (set by writing to sleep) || .|
| 10 || 1 || serial tx1 || sets tx1|
| 11 || 1 || serial rx1 || sets change in rx1 to wake up task 4|
| 12 || 1 || serial tx2 || sets tx2|
| 13 || 1 || serial rx2 || sets change in rx2 to wake up task 5|
| 14 || 1 || bar code reader || sets change in bar code reader to wake up task 8|
| 15 || 10 || keyboard columns || sets keyboard column value for one clock cycle (used for precharge)|
| 16 || 8 || . || sets the sound DAC|
| 18 || 8 || LCD data || sets LCD data|
| 19 || 7 || . || sets LCD control (e, w, rs, backlight, cs1, cs2, cs3)|
| 1A || 16 || . || sets chromaLow - bottom 16 bits of the 32 bit counter that generates a 3.58MHz sine wave|
| 1B || 16 || . || sets chromaHigh - top 16 bits of the 32 bit counter that generates a 3.58MHz sine wave|
| 1C || 5 || . || sets the video control (vsync, hsync, blank, burst1, burst0)|
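The behaviour of ports 0 and 1 can be modelled as below. This is only a software illustration of the table rows for the task ready register; the class and method names are hypothetical.

```python
class TaskReady:
    """Model of the task ready register as seen through ports 0 and 1."""

    def __init__(self, value=0):
        self.value = value & 0xFFFF

    def read(self, port):
        # Both port 0 and port 1 read back the task ready bits.
        if port not in (0, 1):
            raise ValueError("only ports 0 and 1 are modelled")
        return self.value

    def write(self, port, data):
        data &= 0xFFFF
        if port == 0:
            self.value &= ~data & 0xFFFF   # port 0: clear the given bits
        elif port == 1:
            self.value |= data             # port 1: set the given bits
        else:
            raise ValueError("only ports 0 and 1 are modelled")
```

Having separate set and clear ports means a task can change individual ready bits with a single write, without a read-modify-write sequence that a higher priority task could interrupt.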
An instruction goes through up to seven steps, listed below. The worst case is when switching from another task to an instruction with pre-decrement or post-increment in both source and destination, which needs all of them, but only three cycles are needed when the current task is the same as the previous one and both source and destination are registers:
- save PC/flag (old task)
- load PC/flag (new task)
- decode instruction and address registers
- update source register
- update destination register
- fetch source/destination from memory
- execute and (possibly) save result
At 54MHz the three-cycle best case gives a peak processing power of around 18 MIPS, which is a respectable speed for the simple tasks that it must handle even when divided among 16 coroutines.
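The arithmetic behind that figure is simple enough to write down. The sketch below assumes that each of the seven listed steps takes one clock cycle, which the text implies but does not state outright.

```python
CLOCK_HZ = 54_000_000


def mips(cycles_per_instruction):
    """Millions of instructions per second at a fixed cycle count."""
    return CLOCK_HZ / cycles_per_instruction / 1e6


best_case = mips(3)    # same task, register to register
worst_case = mips(7)   # task switch plus auto increment/decrement on both operands
```

So the 18 MIPS figure is the best case; under the one-cycle-per-step assumption, a stream of worst case instructions would run at a little under 8 MIPS.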
- The reason why tasks must specify when they want to wake up in terms of absolute time instead of a delay value is that they might take an arbitrary time to actually start executing once the desired time arrives and this execution might proceed at an uneven pace as higher priority tasks interfere. With absolute time values the next request can simply add a number to the previous requested value and so eliminate a huge source of jitter.
- When any instruction uses register 7 as the destination, the old (but already incremented) value of the PC is saved in register 6. So there are no jumps but only jumps to subroutine. With care, register 6 can still be used to hold temporary values between control flow changes.
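The point about absolute wake-up times in the first note can be illustrated with a small simulation. The function names and the latency figures are invented for the example; the only thing taken from the text is that the counter is 16 bits wide and that each new request is the previous request plus a fixed period.

```python
MASK = 0xFFFF  # 16 bit counter


def absolute_schedule(first, period, count):
    """Wake-up requests when each one is the previous request plus the period."""
    wake, out = first, []
    for _ in range(count):
        out.append(wake)
        wake = (wake + period) & MASK
    return out


def relative_schedule(first, period, latencies):
    """Wake-up requests when each one is 'time the task actually ran' plus the period.

    `latencies` holds the (made up) delay between each wake-up time and the
    moment the task really gets the processor.
    """
    wake, out = first, []
    for latency in latencies:
        out.append(wake)
        wake = (wake + latency + period) & MASK
    return out
```

With a period of 1000 and latencies of 3, 7, 2 and 5 cycles, the absolute scheme requests 100, 1100, 2100, 3100 while the delay-based scheme drifts to 100, 1103, 2110, 3112.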
The goal of the above design was to use up as few FPGA resources as possible. So it has little logic (using a single block RAM for both main memory and all registers, including the PC) and a relatively dense instruction set. Its 16 bit size is well suited to working with only internal FPGA memory, so extending the instruction word, adding caches or external memory, and other such improvements probably wouldn't make much sense. Some more modest extensions, however, could make it suitable for new applications without eliminating the features that make it an interesting architecture in the first place.
Some obvious limitations of the instruction set are the number of registers, the number of modes and the use of the destination as one of the sources. These limitations aren't significant most of the time so one solution would be to have a prefix instruction that modified the following instruction:
| 15_14_13_12 || 11 || 10 || 9_8 || 7_6_5 || 4_3_2_1_0|
| 1 1 1 1 || 1 || MEb || Mb || Rb || REb MEd REd MEs REs|
This uses the same encoding as the ANI instruction (operation 1111), except there it never makes sense for bits 10 and 11 not to be zero (always save), so here bit 11 is one instead. Mb and Rb are exactly like the destination mode and destination register in regular instructions, but here they refer to a second operand which is distinct from the destination, turning the following instruction from a two address type to a three address one. REb, REd and REs extend the register field of the second source, destination and source, respectively, so that 16 registers can be addressed instead of just 8. In the same way, MEb, MEd and MEs extend the mode field so that there are 8 address modes instead of the original 4. When these bits are zero you get the original modes and when they are one:
| ME || Mode || syntax || name || description|
| 1 || 0 0 || N[R] || array || a word following the instruction has the base address of an array indexed by the register|
| 1 || 0 1 || R->N || offset || a word following the instruction has a 16 bit offset in the structure pointed to by the register|
| 1 || 1 0 || *++R || pre increment || like index, but the register is incremented before being used as an address|
| 1 || 1 1 || *R-- || post decrement || like index, but the register is decremented after being used as an address|
In the case of a 16 bit implementation the first two extra modes are exactly the same. On a 32 or 64 bit implementation they would be different since the first always has a base address the size of the registers while the second always uses a 16 bit offset. Note that the extended instruction is effectively 32 bits long.
For the immediate type of instructions MEs and REs would extend the 5 bit immediate source to 7 bits.
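A decoder for the proposed prefix word might look like the sketch below. The bit positions follow the format table above, reading the last column as REb, MEd, REd, MEs, REs from bit 4 down to bit 0. Treating each extension bit as the new most significant bit of the field it extends is an assumption, as are the function and key names.

```python
def decode_prefix(word):
    """Decode a prefix word, or return None if `word` is not a prefix.

    A prefix has operation 1111 and bit 11 set, so its top five bits are all ones.
    """
    if (word >> 11) != 0b11111:
        return None
    meb = (word >> 10) & 1
    mb = (word >> 8) & 0x3
    rb = (word >> 5) & 0x7
    reb = (word >> 4) & 1
    return {
        "b_mode": (meb << 2) | mb,       # 3 bit mode of the second source
        "b_reg": (reb << 3) | rb,        # 4 bit register of the second source
        "d_mode_ext": (word >> 3) & 1,   # MEd
        "d_reg_ext": (word >> 2) & 1,    # REd
        "s_mode_ext": (word >> 1) & 1,   # MEs
        "s_reg_ext": word & 1,           # REs
    }
```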
The basic architecture only deals with 16 bit words and must use the masking and rotation instructions to handle bytes. A 32 bit version would have the same limitation and would actually be a little more awkward since now 16 bit operands wouldn't be natural for it and yet that would still be the size of the instructions. An interesting extension would be to have the highest bits in a pointer indicate the size of the object it points to:
| 0xxxxxxxx.... || address of an 8 bit object|
| 10xxxxxxx.... || address of a 16 bit object|
| 110xxxxxx.... || address of a 32 bit object|
| 1110xxxxx.... || address of a 64 bit object|
| 1111xxxxx.... || address of an internal register or i/o port|
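Reading the tag amounts to counting the leading one bits of the pointer, up to four. A minimal sketch, with a hypothetical function name and with the pointer width passed as a parameter since the text considers 16, 32 and 64 bit implementations:

```python
def pointed_size(ptr, width=16):
    """Size in bits of the object a tagged pointer refers to.

    Returns None for the 1111 tag (internal register or i/o port).
    """
    ones = 0
    for bit in range(width - 1, width - 5, -1):   # examine the top four bits
        if (ptr >> bit) & 1:
            ones += 1
        else:
            break
    return [8, 16, 32, 64, None][ones]
```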
The different kinds of pointers address the same amount of memory and can be converted from one kind to any other. While the C language was created with this kind of thing in mind, most currently available open source C compilers assume that converting between pointer types doesn't generate any code (a quick look at gcc didn't make it very clear if this is the case for it as well) so that:
ptr = (long *) 64; cp = (char *) ptr; ++cp; --ptr;
will be compiled to something like:
So we see that the type casts generate no code but the compiler still has to keep track of the types so it can know how much to add or subtract in the last two instructions. With the tagged pointers we would have code like:
mvi R2,#64 ; char pointer
#-2 ; converted to long pointer
#2 ; converted to char pointer
The code generated by the type casts looks a bit complicated but this isn't a very common operation. Note that the last two instructions don't make the compiler deal with different increment/decrement sizes. In this example it doesn't make a difference, but if the automatic increment/decrement addressing modes were being used then it would make a significant difference.
Since the instruction pointer always deals with 16 bit objects, we can ignore the top two bits of R7 and act as if they were always 1 and 0. This allows us to actually store the flag in the top bit of R7 as before and we also get an extra bit of status which could, for example, indicate a user/supervisor execution mode (probably more useful in a version with a single thread and with interrupts).