I/O Processor

Though particularly interesting for ASIC implementations, where the hardware is unchangeable and all flexibility must come from the software, the I/O Processor is also a good option for FPGA designs, where it replaces random logic with shared use of block RAMs.

Tasks

16 tasks run in parallel, with the highest priority task that is not blocked having full use of the processor in the next clock cycle. Task 15 has the highest priority of all and 0 the lowest.
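The selection rule can be sketched in a few lines of Python. This is only a software model, assuming the ready state is held as a 16 bit word with one bit per task (1 meaning ready to run), as the task ready port described later suggests; the function name `next_task` is made up for this example.

```python
from typing import Optional

def next_task(ready: int) -> Optional[int]:
    """Return the highest numbered ready task, or None if every task is suspended."""
    ready &= 0xFFFF              # only 16 tasks exist
    if ready == 0:
        return None
    # The position of the highest set bit is the winning task number.
    return ready.bit_length() - 1
```

For example, with tasks 0 and 15 both ready, task 15 gets the next cycle and task 0 only runs once task 15 is suspended.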

The RAM stores the 16 program counters, 16 wake up values, the actual machine code for all the threads and all the state for these threads. A second port for this RAM (read-only) sends data to the video circuits.

On the Oliver truck terminal, the following tasks are used (when the USB port is not available, though it is likely that the USB will get its own I/O processor):

Task	Name	Description
15	wait	makes the single comparison circuit do the work of 16. The other tasks make requests to be wakened at a specific clock cycle (as indicated by a 16 bit counter running at 54MHz, which cycles every 1.21 ms). The comparison is set to the value of the soonest requested cycle and then the task goes to sleep. When it wakes up it also wakes up the indicated task (though since that has a lower priority it will have to wait for task 15 to finish its job first), and it finds the next soonest requested cycle and sets the comparison register to that.
14	addWait	receives a request from other tasks to be woken at a given clock cycle and then changes the tables used by task 15 so this will happen. Care is taken so that being interrupted by task 15 doesn't cause a conflict
13	videoBuf	reads four words from memory into the video buffer
12	videoV	generates the vertical timing for the video. It can also change the program counter for task 11 (which is suspended at this point) to select between different kinds of horizontal lines
11	videoH	generates the horizontal timing for the video. It wakes up task 12 once per line
10	videoChroma	accepts requests from the main processor to change the setting for the chroma frequency
9
8	barcode	detects pulse widths in the bar code reader input
7	lcd	sends data to the liquid crystal display
6	sound	reads a byte of memory and sends it to the audio DAC
5	com2RX	receive data for serial port 2
4	com1RX	receive data for serial port 1
3	com2TX	transmit data for serial port 2
2	com1TX	transmit data for serial port 1
1	keyboard	scans the keys to check for any change
0	rtc	keeps track of the real time
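The job of task 15 in the table above, picking the soonest requested cycle on a counter that wraps around, can be sketched as follows. This is a model under assumptions: the real tables shared by tasks 14 and 15 are not described here, so a plain dictionary from task number to requested cycle stands in for them, and the names `wrap_period_ms` and `soonest` are hypothetical.

```python
from typing import Dict, Tuple

COUNTER_BITS = 16
CLOCK_HZ = 54_000_000
MASK = (1 << COUNTER_BITS) - 1

def wrap_period_ms() -> float:
    """Time for the 16 bit counter to cycle once, in milliseconds."""
    return (1 << COUNTER_BITS) / CLOCK_HZ * 1000.0

def soonest(now: int, requests: Dict[int, int]) -> Tuple[int, int]:
    """Return (task, cycle) for the request that comes up first after `now`.

    Distances are taken modulo 2**16 so that a request just past the
    wrap around point still counts as near in the future.
    """
    task = min(requests, key=lambda t: (requests[t] - now) & MASK)
    return task, requests[task]
```

The modular subtraction is the important detail: comparing the raw values would make a request for cycle 0x0010 look earlier than one for 0xFFF8 even when the counter is currently at 0xFFF0.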

Instructions

The processor has a register based architecture (2 address). Each task has its own 8 register bank, with register 7 being the program counter (bit 15 of register 7 is the flag which normally indicates whether the previous result was zero, but can have other meanings for some instructions). So 128 words of memory (3F80 to 3FFF) are registers, though the registers for any unused task can be used for data or instructions.
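As a rough illustration of this mapping, the sketch below computes where a register would sit in memory and splits register 7 into its flag and program counter parts. The text only gives the range 3F80 to 3FFF; the ordering (task 0 first, eight consecutive words per task) and the helper names are assumptions made for the example.

```python
REG_BASE = 0x3F80        # first register word, from the text
REGS_PER_TASK = 8

def reg_address(task: int, reg: int) -> int:
    # Assumed layout: task 0 at the bottom, 8 consecutive words per task.
    return REG_BASE + task * REGS_PER_TASK + reg

def split_r7(value: int):
    # Bit 15 is the flag; the remaining bits are treated as the program counter.
    flag = (value >> 15) & 1
    pc = value & 0x7FFF
    return flag, pc
```

Under this assumed layout, register 7 of task 15 lands on the last word, 3FFF.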

The two variations of the instruction format are:

15_14_13_12	11_10	9_8	7_6_5	4_3	2_1_0
operation	save	destination mode	destination register	source mode	source register
xx11	save	destination mode	destination register	immediate unsigned 5 bit value
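A decoder for the two formats might look like the sketch below. It follows the bit positions in the table directly; the dictionary keys and the function name `decode` are just labels chosen for this example.

```python
def decode(word: int) -> dict:
    """Split a 16 bit instruction word into its fields."""
    op = (word >> 12) & 0xF
    fields = {
        "op": op,
        "save": (word >> 10) & 0x3,
        "dmode": (word >> 8) & 0x3,
        "dreg": (word >> 5) & 0x7,
    }
    if op & 0b11 == 0b11:
        # xx11 operations carry a 5 bit unsigned immediate instead of a source.
        fields["imm"] = word & 0x1F
    else:
        fields["smode"] = (word >> 3) & 0x3
        fields["sreg"] = word & 0x7
    return fields
```

With this layout, 0x2053 decodes as operation 0010 with destination register 2 in register mode and source register 3 in post increment mode, while 0x3025 decodes as operation 0011 with destination register 1 and immediate 5.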

Where operation codes are:

0000	MOV	move source to destination
0001	ADC	add and set flag to value of carry
0010	ADD	add source to destination
0011	ADI	add immediate to destination
0100	NEG	move complement of source to destination
0101	SBB	subtract and set flag to value of borrow
0110	SUB	subtract source from destination
0111	SBI	subtract immediate from destination
1000	MULL	multiply source with destination and save low word
1001	XOR	exclusive or source with destination
1010	OR	inclusive or source with destination
1011	ORI	or immediate with destination
1100	MULH	multiply source with destination and save high word
1101	BIC	bit clear - and inverted source with destination
1110	AND	and source with destination
1111	ANI	and immediate with destination

Note that the logical operations are structured like this:

.	source = 0	source = 1
destination = 0	0	not i14
destination = 1	not i14 or not i13	i13
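The table can be checked against the four logical operation codes with a short script. Here i14 and i13 are taken to be bits 14 and 13 of the instruction word, and the function evaluates a single bit position; the names are made up for the example.

```python
def logic_bit(i14: int, i13: int, dest: int, src: int) -> int:
    """Result bit of a logical operation, following the table above."""
    if dest == 0 and src == 0:
        return 0
    if dest == 0 and src == 1:
        return 1 - i14
    if dest == 1 and src == 0:
        return (1 - i14) | (1 - i13)
    return i13

# (i14, i13) taken from the operation codes 1001, 1010, 1101 and 1110.
OPS = {
    "XOR": ((0, 0), lambda d, s: d ^ s),
    "OR":  ((0, 1), lambda d, s: d | s),
    "BIC": ((1, 0), lambda d, s: d & (1 - s)),
    "AND": ((1, 1), lambda d, s: d & s),
}

def table_matches(name: str) -> bool:
    (i14, i13), reference = OPS[name]
    return all(logic_bit(i14, i13, d, s) == reference(d, s)
               for d in (0, 1) for s in (0, 1))
```

All four operations agree with the table, and ORI and ANI share bits 14 and 13 with OR and AND respectively.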

If multipliers are not easily available, the MULL and MULH instructions can be replaced with LSHR (logical shift right by 1) and ASHR (arithmetic shift right by 1) respectively.

The save field is interpreted as:

00	.	always save result to destination
01	ns	never save result to destination
10	ifz	save to destination if flag is zero
11	ifnz	save to destination if flag is not zero
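A minimal sketch of the save decision is shown below. It reads "flag is zero" literally, as the flag bit (bit 15 of register 7) holding the value 0; that reading is an assumption, as is the function name.

```python
def should_save(save: int, flag: int) -> bool:
    """Decide whether the result is written back to the destination.

    save: the 2 bit save field; flag: current flag bit, 0 or 1.
    """
    if save == 0b00:      # always
        return True
    if save == 0b01:      # ns: never save
        return False
    if save == 0b10:      # ifz: save when the flag bit is 0
        return flag == 0
    return flag != 0      # ifnz: save when the flag bit is 1
```

Since register 7 is the program counter, a conditional save with register 7 as the destination would presumably act as a conditional jump.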

The mode field (destination and source) is interpreted as:

00	Rx	register	the register itself is used as a source or destination
01	*Rx	index	the register is used as the address of a source or destination in memory
10	*Rx++	post increment	like index, but the register is incremented after being used as the address
11	*--Rx	pre decrement	like index, but the register is decremented before being used as the address
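The four modes can be modelled in software as below. This is a sketch under assumptions: registers are held in a Python list, addresses count 16 bit words so the step is 1, and values wrap at 16 bits. The function name is hypothetical.

```python
def operand_address(mode: int, reg: int, regs: list):
    """Return the memory address used by an operand, updating the register.

    Returns None for register mode, where no memory access happens.
    """
    if mode == 0b00:                      # Rx
        return None
    if mode == 0b11:                      # *--Rx: decrement first
        regs[reg] = (regs[reg] - 1) & 0xFFFF
    addr = regs[reg]
    if mode == 0b10:                      # *Rx++: increment afterwards
        regs[reg] = (regs[reg] + 1) & 0xFFFF
    return addr
```

Used with register 7, post increment reads the word that follows the instruction and skips over it, which is how the `*R7++` operands in the tagged pointer listing further down fetch their constants.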

I/O Ports

A number of special memory positions are set aside as input and output ports. These are at the bottom of the memory map, where their addresses can easily be loaded into a register with immediate instructions. Here is the example from Oliver:

address	width	description (read)	description (write)
0	16	task ready - each bit corresponds to a task and if it is 0 then the task is suspended and if it is 1 then the task is ready to run	clears the corresponding bits of the task ready
1	16	task ready	sets the corresponding bits of the task ready
2	16	next word from ring network	sends next word to ring network
3	1/16	last word indication from ring network	sends last word to ring network
4	16	counter - is incremented at 54MHz	trigger - when this matches the counter, task 15 is set to run
8	16	sleep	set sleep value and put current task to sleep
9	4	previous task (set by writing to sleep)	.
10	1	serial tx1	sets tx1
11	1	serial rx1	sets change in rx1 to wake up task 4
12	1	serial tx2	sets tx2
13	1	serial rx2	sets change in rx2 to wake up task 5
14	1	bar code reader	sets change in bar code reader to wake up task 8
15	10	keyboard columns	sets keyboard column value for one clock cycle (used for precharge)
16	8	.	sets the sound DAC
18	8	LCD data	sets LCD data
19	7	.	sets LCD control (e, w, rs, backlight, cs1, cs2, cs3)
1A	16	.	sets chromaLow - bottom 16 bits of the 32 bit counter that generates a 3.58MHz sine wave
1B	16	.	sets chromaHigh - top 16 bits of the 32 bit counter that generates a 3.58MHz sine wave
1C	5	.	sets the video control (vsync, hsync, blank, burst1, burst0)

Internal ports (0 to F) for which the read value is not defined will actually return the value of the task ready register while external ports (10 to 1F) will return the value of the keyboard columns.
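The behaviour of the two task ready ports and of reads from undefined ports can be modelled with a toy class. Only ports 0 and 1 and the fallback rule are covered; the class and function names are invented for this sketch and port addresses are taken as hexadecimal.

```python
class TaskReadyPorts:
    """Toy model of ports 0 and 1 from the table above."""

    def __init__(self):
        self.ready = 0

    def write(self, port: int, value: int) -> None:
        if port == 0:                       # clear the given bits
            self.ready &= ~value & 0xFFFF
        elif port == 1:                     # set the given bits
            self.ready |= value & 0xFFFF

def fallback_read(port: int, task_ready: int, keyboard_columns: int) -> int:
    """Value seen when reading a port whose read value is not defined."""
    if 0x00 <= port <= 0x0F:                # internal ports
        return task_ready
    if 0x10 <= port <= 0x1F:                # external ports
        return keyboard_columns
    raise ValueError("not an I/O port")
```

Having separate set and clear ports means a task can change one bit of the task ready word with a single write, without a read, modify, write sequence that a higher priority task could interrupt.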

Execution

The worst case, seven cycles, is when switching from another task to an instruction with pre-decrement or post-increment in both source and destination:

save PC/flag (old task)
load PC/flag (new task)
decode instruction and address registers
update source register
update destination register
fetch source/destination from memory
execute and (possibly) save result

but only three cycles when the current task is the same as the previous one and both source and destination are registers.

At 54MHz this means the peak processing power is around 18 MIPS (54MHz divided by the three cycle best case), which is a respectable speed for the simple tasks that it must handle even when divided among 16 coroutines.
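The arithmetic behind that figure, and the corresponding lower bound, can be written out as below. The worst case count assumes that each of the seven steps listed above takes exactly one clock cycle, which the text implies but does not state outright.

```python
CLOCK_MHZ = 54
BEST_CASE_CYCLES = 3     # same task, register to register
WORST_CASE_CYCLES = 7    # assumed: one cycle per listed step

def mips(cycles_per_instruction: int) -> float:
    """Millions of instructions per second at the given cycle count."""
    return CLOCK_MHZ / cycles_per_instruction

best = mips(BEST_CASE_CYCLES)
worst = mips(WORST_CASE_CYCLES)
```

Under that assumption the sustained rate falls somewhere between about 7.7 and 18 MIPS, depending on how often tasks switch and which addressing modes are used.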

Notes

The reason why tasks must specify when they want to wake up in terms of absolute time instead of a delay value is that they might take an arbitrary time to actually start executing once the desired time arrives, and this execution might proceed at an uneven pace as higher priority tasks interfere. With absolute time values the next request can simply add a number to the previously requested value and so eliminate a huge source of jitter.
When any instruction uses register 7 as the destination, the old (but already incremented) value of the PC is saved in register 6. So there are no jumps but only jumps to subroutine. With care, register 6 can still be used to hold temporary values between control flow changes.
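The point about absolute wake up times can be illustrated with a small simulation. It is only a sketch: the latencies are made up numbers standing for the delay between the requested cycle and the moment the task actually runs, and the 16 bit wrap around of the counter is ignored for clarity.

```python
def wake_times(period: int, latencies: list, absolute: bool) -> list:
    """Return the cycles at which a periodic task actually starts running."""
    starts = []
    requested = 0
    now = 0
    for latency in latencies:
        if absolute:
            # Next request is derived from the previous request.
            requested = requested + period
        else:
            # Next request is a delay measured from when the task really ran.
            requested = now + period
        now = requested + latency
        starts.append(now)
    return starts
```

With a period of 100 and latencies of 3, 5 and 2 cycles, the absolute scheme starts at 103, 205 and 302, so the error never exceeds the latency of the current wake up. The delay based scheme starts at 103, 208 and 310, with every latency added permanently to the schedule.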

Extensions

The goal of the above design was to use up as few FPGA resources as possible. So it has little logic (using a single block RAM for both main memory and all registers, including the PC) and a relatively dense instruction set. Its 16 bit size is well suited to working with only internal FPGA memory, and extending the instruction word, adding caches and external memory, and other such improvements probably wouldn't make much sense. Some more modest extensions, however, could make it suitable for new applications without eliminating the features that make it an interesting architecture in the first place.

Prefix Instruction

Some obvious limitations of the instruction set are the number of registers, the number of modes and the use of the destination as one of the sources. These limitations aren't significant most of the time so one solution would be to have a prefix instruction that modified the following instruction:

15_14_13_12	11	10	9_8	7_6_5	4_3_2_1_0
1 1 1 1	1	MEb	Mb	Rb	REb MEd REd MEs REs

This uses the same encoding as the ANI instruction (opcode 1111) except there it never makes sense for bits 10 and 11 not to be zero (always save) so here bit 11 is one instead. Mb and Rb are exactly like the destination mode and destination register in regular instructions but here they refer to a second operand which is distinct from the destination, turning the following instruction from a two address type to a three address one. REb, REd and REs extend the register field of the second source, destination and source, respectively, so that 16 registers can be addressed instead of just 8. In the same way, MEb, MEd and MEs extend the mode field so that there are 8 address modes instead of the original 4. When these bits are zero you get the original modes and when they are one:

ME	Mode	syntax	name	description
1	0 0	N[R]	array	a word following the instruction has the base address of an array indexed by the register
1	0 1	R->N	offset	a word following the instruction has a 16 bit offset in the structure pointed to by the register
1	1 0	*++R	pre increment	like index, but the register is incremented before being used as an address
1	1 1	*R--	post decrement	like index, but the register is decremented after being used as an address

In the case of a 16 bit implementation the first two extra modes are exactly the same. On a 32 or 64 bit implementation they would be different since the first always has a base address the size of the registers while the second always uses a 16 bit offset. Note that the extended instruction is effectively 32 bits long.

For the immediate type of instructions MEs and REs would extend the 5 bit immediate source to 7 bits.
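A decoder for the proposed prefix word could look like the sketch below. It assumes the five low bits appear in the order listed (REb in bit 4 down to REs in bit 0) and that each extension bit becomes the high bit of the field it extends; neither detail is spelled out above, and the function names are hypothetical.

```python
def decode_prefix(word: int):
    """Return the prefix fields, or None if the word is not a prefix."""
    if (word >> 11) & 0x1F != 0b11111:      # bits 15..12 = 1111 and bit 11 = 1
        return None
    return {
        "MEb": (word >> 10) & 1,
        "Mb":  (word >> 8) & 0x3,
        "Rb":  (word >> 5) & 0x7,
        "REb": (word >> 4) & 1,
        "MEd": (word >> 3) & 1,
        "REd": (word >> 2) & 1,
        "MEs": (word >> 1) & 1,
        "REs": word & 1,
    }

def extended_register(extension_bit: int, reg: int) -> int:
    # Assumed: the extension bit is the most significant bit of a 4 bit number.
    return (extension_bit << 3) | reg
```

For the word 0xF954 this gives a second operand in index mode using register 10, with the destination of the following instruction also moved into the upper bank of registers.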

Tagged Pointers

The basic architecture only deals with 16 bit words and must use the masking and rotation instructions to handle bytes. A 32 bit version would have the same limitation and would actually be a little more awkward since now 16 bit operands wouldn't be natural for it and yet that would still be the size of the instructions. An interesting extension would be to have the highest bits in a pointer indicate the size of the object it points to:

0xxxxxxxx....	address of an 8 bit object
10xxxxxxx....	address of a 16 bit object
110xxxxxx....	address of a 32 bit object
1110xxxxx....	address of a 64 bit object
1111xxxxx....	address of an internal register or i/o port
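The tag scheme in the table above can be decoded by looking at the top four bits of the pointer. The sketch below assumes a 32 bit pointer by default; the function names are made up. It also computes how many bytes each kind of pointer can reach, since a longer tag leaves fewer address bits but each address covers a larger object.

```python
def pointer_kind(ptr: int, width: int = 32):
    """Return the object size in bits, or a string for the register/port space."""
    top = (ptr >> (width - 4)) & 0xF
    if top & 0b1000 == 0:
        return 8
    if top & 0b0100 == 0:
        return 16
    if top & 0b0010 == 0:
        return 32
    if top & 0b0001 == 0:
        return 64
    return "register or I/O port"

def addressable_bytes(size_bits: int, width: int = 32) -> int:
    """Bytes reachable by a pointer to objects of the given size."""
    size_bytes = size_bits // 8
    tag_bits = size_bytes.bit_length()      # 1, 2, 3 or 4 tag bits
    return (1 << (width - tag_bits)) * size_bytes
```

For every object size the result is the same, half of the full address range, because each extra tag bit is balanced by a doubling of the object size.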

The different kinds of pointers address the same amount of memory and can be converted from one kind to any other. While the C language was created with this kind of thing in mind, most currently available open source C compilers assume that converting between pointer types doesn't generate any code (a quick look at gcc didn't make it very clear if this is the case for it as well) so that:

 ptr = (long *) 64; cp = (char *) ptr; ++cp; --ptr;

will be compiled to something like:

 mov R2,#64
 mov R3,R2
 add R3,R3,#1
 sub R2,R2,#4

So we see that the type casts generate no code but the compiler still has to keep track of the types so it can know how much to add or subtract in the last two instructions. With the tagged pointers we would have code like:

 mvi R2,#64 ; char pointer
 adi R2,#3
 rot R2,*R7++
 #-2 ; converted to long pointer
 mov R3,R2
 lsh R3,*R7++
 #2 ; converted to char pointer
 adi R3,#1
 sbi R2,#1

The code generated by the type casts looks a bit complicated but this isn't a very common operation. Note that the last two instructions don't make the compiler deal with different increment/decrement sizes. In this example it doesn't make a difference, but if the automatic increment/decrement addressing modes were being used then it would make a significant difference.

Since the instruction pointer always deals with 16 bit objects, we can ignore the top two bits of R7 and act as if they were always 1 and 0. This allows us to actually store the flag in the top bit of R7 as before and we also get an extra bit of status which could, for example, indicate a user/supervisor execution mode (probably more useful in a version with a single thread and with interrupts).

Other last edited on 1 April 2011 at 8:48:28 pm by 192.168.2.3