FPGARelated.com
Forums

Forth-CPU design

Started by Frank Buss September 2, 2006
After thinking about how a "normal" CPU could look like, I've tried to
design a Forth-like CPU:

http://www.frank-buss.de/vhdl/forth-cpu.html

Maybe someone from comp.lang.forth could take a look at it, if something is
missing for compiling normal high-level Forth programs to this instruction
set, without too many complications.

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Frank Buss wrote:
> After thinking about how a "normal" CPU could look like, I've tried to > design a Forth-like CPU: > > http://www.frank-buss.de/vhdl/forth-cpu.html > > Maybe someone from comp.lang.forth could take a look at it, if something is > missing for compiling normal high-level Forth programs to this instruction > set, without too many complications.
Looks good - some brief comments : I think the opcode is 8 bits ? That's good for off-chip memories. You could optionally allow a 9-bit opcode, for where the memory is FPGA BRAM ? That could also allow an option for 32 bit registers ? I think there is also opcode room to allow one byte SKIP form on the BRA... opcodes ? This could be assembler-automatic - where this is usefull, is in operating from serial flash memory. -jg
Frank Buss wrote:
> After thinking about how a "normal" CPU could look like, I've tried to > design a Forth-like CPU: > > http://www.frank-buss.de/vhdl/forth-cpu.html > > Maybe someone from comp.lang.forth could take a look at it, if something is > missing for compiling normal high-level Forth programs to this instruction > set, without too many complications.
If you are going to compile Forth then CALL is probably the first instruction you want to optimize and I didn't even see it or an equivalent in your list. Other than that this is probably a very reasonable way to encode Forth. Given that you have spare bits in several cases my guess would be that it isn't an optimal encoding (Chuck Moore's 5 bit instruction set and the variations people have made are probably closer to optimal). Even with your current instruction set it is probably possible to write your network address swap code to be a little smaller: push byte 5 :loop dup ; i i pop A ; i @A byte ; (i) i over ; i (i) i push byte 6 ; 6 i (i) i add ; 6+i (i) i pop A ; (i) i @A byte ; (6+i) (i) i over ; (i) (6+i) i !A byte ; (6+i) i over ; i (6+i) i pop A ; (6+i) i !A byte ; i dec ; i-1 bcc byte :loop drop When you have an address register like in the MISC processors it is a good idea to avoid reloading it when possible. In this case having both 6+i accesses happen next to each other does that. And in general it isn't very good to allow the stack to get very deep with temporary values when programming stack machines. By the way - this code fragment that you have been testing various architectures with has a problem in that in general the first word to be swapped would be some random one and not happen to be address zero. Rewritting it so that MAC addresses to be swapped are at 200 and 206, for example, would have no effect in some cases but increase the length of the code for other processors. -- Jecel
Jim Granville wrote:
> > Looks good - some brief comments : > I think the opcode is 8 bits ? > That's good for off-chip memories. > You could optionally allow a 9-bit opcode, for where the memory is FPGA > BRAM ? That could also allow an option for 32 bit registers ? >
Since you have flag registers, you need a way to push and pop them if you have interrupts. Usually Forth chips use 0=, 0< and IF or IF and -IF instead of flag registers, so there is nothing to save. This means that signed comparisons like < take longer but that doesn't seem to be a problem. CALL is used in Forth much more often than in other languages so you should try to encode it compactly. 8-bit opcodes are kind of nice because serial flash is pretty fast. In a Spartan3, you could map the lower 2K bytes of program space to a BRAM block and the rest to external flash. You keep the fast (mostly kernel) words in the BRAM. --brad
Jecel wrote:

> If you are going to compile Forth then CALL is probably the first > instruction you want to optimize and I didn't even see it or an > equivalent in your list.
You are right. My idea was to use something like "pushr byte 3 +PC", "load PC word :my_function" for the call and "popr PC" for the return. But as Jim noticed, there is some room in the branch commands and "jump" is redundant, because it can be implemented with "loadpc word :label" (and even relative to PC with "loadpc word :label +PC", or for skipping the next instruction, the one byte instruction "loadpc # 1 +PC" is possible), so it could be added to the branch commands. But to make the pop/load commands more complete, I've added it as a special destination case and reorganized the branch commands, which now can branch on 8 conditions. I've updated the documentation at http://www.frank-buss.de/vhdl/forth-cpu.html
> Other than that this is probably a very > reasonable way to encode Forth. Given that you have spare bits in > several cases my guess would be that it isn't an optimal encoding > (Chuck Moore's 5 bit instruction set and the variations people have > made are probably closer to optimal).
It could be encoded more compact, but my idea was to avoid large demultiplexers for decoding every single command. Maybe I should start implementing the design to see if this works :-)
> When you have an address register like in the MISC processors it is a > good idea to avoid reloading it when possible. In this case having both > 6+i accesses happen next to each other does that. And in general it > isn't very good to allow the stack to get very deep with temporary > values when programming stack machines.
Thanks. Looks like most of the time an arithmetic operation is executed before "pop A", so I've used one bit for all instructions, which specifies that a "pop A" is executed after the command to compress the code even more: push byte 5 ; i=5 :loop dup popa ; i @A byte ; (i) i over ; i (i) i push byte 6 ; 6 i (i) i add popa ; (i) i @A byte ; (6+i) (i) i over ; (i) (6+i) i !A byte ; (6+i) i over popa ; (6+i) i !A byte ; i dec ; i-1 bcc byte :loop drop ; empty stack Now it is 16 bytes long, just 1 byte more than the shortest code so far, the 6502 code (it needed 15 bytes, not 13 as I wrote).
> By the way - this code fragment that you have been testing various > architectures with has a problem in that in general the first word to > be swapped would be some random one and not happen to be address zero.
This was by intention to see if there are some tricks for zero based addressing. -- Frank Buss, fb@frank-buss.de http://www.frank-buss.de, http://www.it4-systems.de
Brad Eckert wrote:

> Since you have flag registers, you need a way to push and pop them if > you have interrupts.
You can use "pop S" and "push S" or even "popr S" and "pushr S" with my instruction set.
> CALL is used in Forth much more often than in other languages so you > should try to encode it compactly.
Yes, I've added it as a possible destination to the "pop" command.
> 8-bit opcodes are kind of nice because serial flash is pretty fast. In > a Spartan3, you could map the lower 2K bytes of program space to a BRAM > block and the rest to external flash. You keep the fast (mostly kernel) > words in the BRAM.
Thanks, mapping the flash into the address space is a good idea and maybe 2 BRAMs: one for the TCP/IP stack and one for fast programs and other data. -- Frank Buss, fb@frank-buss.de http://www.frank-buss.de, http://www.it4-systems.de
Frank Buss wrote:

> and reorganized the > branch commands, which now can branch on 8 conditions.
There was one bit too much, so there are 4 conditions only. -- Frank Buss, fb@frank-buss.de http://www.frank-buss.de, http://www.it4-systems.de
Frank Buss wrote:
> Jecel wrote: >>Other than that this is probably a very >>reasonable way to encode Forth. Given that you have spare bits in >>several cases my guess would be that it isn't an optimal encoding >>(Chuck Moore's 5 bit instruction set and the variations people have >>made are probably closer to optimal). > > > It could be encoded more compact, but my idea was to avoid large > demultiplexers for decoding every single command. Maybe I should start > implementing the design to see if this works :-)
Yes, but unless you have large amounts of 5 bit memory lying about :) there is more point in focusing on packing the 8 bit opcodes, to try and reduce the use of 16 bit ones. If you do plan to use serial flash, the keep in mind the (IIRC) 5 byte address reload time, so short (1 byte) skips up to 5 bytes ahead would be efficent. In a FPGA, another useful thing to cache could be constants ? - so you do not have to reload the Serial address pointer, just to access (smaller) constants. -jg
Frank Buss wrote:
> Thanks. Looks like most of the time an arithmetic operation is executed > before "pop A", so I've used one bit for all instructions, which specifies > that a "pop A" is executed after the command to compress the code even > more:
That sounds like a very good idea.
> [example snipped] > Now it is 16 bytes long, just 1 byte more than the shortest code so far, > the 6502 code (it needed 15 bytes, not 13 as I wrote).
I haven't actually written it down, but my impression is that in a 16 bit RISC processor like Jan Gray's gr0040 you could do it in 7 instructions (14 bytes). -- Jecel
Jecel wrote:
> Frank Buss wrote: > > Thanks. Looks like most of the time an arithmetic operation is executed > > before "pop A", so I've used one bit for all instructions, which specifies > > that a "pop A" is executed after the command to compress the code even > > more: > > That sounds like a very good idea. > > > [example snipped] > > Now it is 16 bytes long, just 1 byte more than the shortest code so far, > > the 6502 code (it needed 15 bytes, not 13 as I wrote). > > I haven't actually written it down, but my impression is that in a 16 > bit RISC processor like Jan Gray's gr0040 you could do it in 7 > instructions (14 bytes).
I am curious, what exactly is this code segment you are using as a benchmark? I saw the code earlier in this thread, but I don't understand the notation enough to know what it is doing. Is there some other code or a description so that I can see how other Forth CPUs might compare?