I have been following the development of the ZPU, a zero operand processor for FPGAs. The primary intent is to design a CPU that can span a range of sizes from very space efficient to high speed while being efficient at running C code. The original author has an open source compiler producing code for it which seems to be the part he is good at.

However, it has been running rather slow in the benchmark they have been running, Dhrystone. I think I figured out why. The ISA is zero operand, but the stack is maintained in memory. There is no stack register architecture. So every stack operation consists of reading the operands, performing the operation and writing back the result. I can see why it is giving slow performance, even when pipelined.

I believe the real issue is that the focus is on building a complex machine and trying "techniques" to make it simple and fast. The more I look at things like this, the more I am convinced that the Moore philosophy is right. You can achieve performance by adding more and more complexity, or you can simplify to the point of inherent speed.

But then my thinking is biased. I am a hardware guy and my programming has always been the sort of stuff that can fit in a .com file, you know, the ones with the 64 kbyte limit. I still think Bill Gates was right when he said that no one would ever need more than 640 kbytes ;^)

I just know that my current multi-GHz machine is not really any faster than my old 12 MHz 286 in many respects... It certainly does not boot any faster and is *much* slower to turn off.

Rick
Zero operand CPUs
This discussion explores the design and performance of zero-operand (stack-based) CPUs on FPGAs, specifically focusing on the ZPU and the philosophy of Chuck Moore. Participants analyze why certain implementations, like the ZPU, may suffer from performance bottlenecks when the stack is maintained primarily in memory rather than using high-speed on-chip registers.
The replies suggest that a stack-based ISA need not be slow: an on-chip stack cache with circular addressing and spill/fill, or a fixed-size stack enforced by the compiler, can keep most stack operations out of main memory, at some cost in hardware complexity, extra memory references, or context-switch time.
- The ZPU's performance bottleneck often stems from maintaining the stack in memory rather than using a register-based stack cache.
- Hardware techniques like circular addressing and speculative 'spill and fill' operations can mitigate memory latency in stack machines.
- The availability of a GCC toolchain is a significant adoption factor, though some argue a simple macro assembler or Forth-like environment is sufficient for small cores.
- Compilers can be designed to limit stack depth to a fixed size, reducing the hardware complexity required to handle overflows.
- There is a persistent trade-off between the complexity of deep stack caches and the speed of context switching in multi-threaded environments.
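To put rough numbers on the first point above, here is a minimal sketch that counts data-memory accesses for a small postfix program. It is a toy model, not a description of the actual ZPU: the function name, the `cached_cells` parameter and the eager spill/refill rule are all assumptions made for illustration, and instruction fetches and variable loads are ignored.

```python
def stack_memory_accesses(program, cached_cells=0):
    """Count stack traffic to data memory for a postfix program.

    Hypothetical model: the top `cached_cells` stack cells live in
    registers. A push spills one cell to memory when the registers are
    full; a pop refills one cell from memory when deeper cells exist.
    With cached_cells=0 every push is a write and every pop is a read.
    """
    depth = 0
    accesses = 0

    def push():
        nonlocal depth, accesses
        if depth >= cached_cells:   # registers full: oldest cell goes to memory
            accesses += 1
        depth += 1

    def pop():
        nonlocal depth, accesses
        if depth == 0:
            raise ValueError("stack underflow")
        if depth > cached_cells:    # a deeper cell must be read from memory
            accesses += 1
        depth -= 1

    for token in program.split():
        if token in {"+", "-", "*"}:
            pop()
            pop()
            push()                  # binary op: two operands in, one result out
        else:
            push()                  # anything else is treated as an operand
    return accesses


program = "a b + c d + *"           # (a + b) * (c + d)
print(stack_memory_accesses(program, cached_cells=0))   # 13
print(stack_memory_accesses(program, cached_cells=2))   # 2
print(stack_memory_accesses(program, cached_cells=4))   # 0
```

Under these assumptions even a two-cell register window removes most of the stack traffic for this expression, which is the effect the replies below argue for.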
On Mon, 16 Mar 2009 19:10:56 -0700 (PDT), rickman <gnuarm@gmail.com> wrote:

>I have been following the development of the ZPU, a zero operand
>processor for FPGAs. The primary intent is to design a CPU that can
>span a range of sizes from very space efficient to high speed while
>being efficient at running C code. The original author has an open
>source compiler producing code for it which seems to be the part he is
>good at.
>
>However, it has been running rather slow in the benchmark they have
>been running, Dhrystone. I think I figured out why. The ISA is zero
>operand, but the stack is maintained in memory. There is no stack
>register architecture. So every stack operation consists of reading
>the operands, performing the operation and writing back the result. I
>can see why it is giving slow performance, even when pipelined.

Even a stack based ISA can be implemented to be fast. You can certainly make an out of order, multi-issue stack machine as long as you make the memory image consistent with an in-order cpu. In fact even this may not be a requirement if you're not interested in multi-core implementations.

You can merge multiple instructions (micro and macro operator fusion) and retire instructions out of order etc by keeping an internal virtual register file which implements register renaming for various stack locations. Whether it's easier or more difficult to extract instruction level parallelism from a stack based ISA is not very clear to me but it certainly is possible.

--
Muzaffer Kal
DSPIA INC.
ASIC/FPGA Design Services
http://www.dspia.com
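As a software analogy for the renaming idea in the post above, the sketch below maps each stack slot of a postfix program onto a fresh virtual register, so the data dependencies become explicit. It is only an illustration under stated assumptions: `rename_stack_code`, the tuple layout and the `load` pseudo-operation are invented here, and real renaming hardware would do this per instruction in the pipeline rather than over a whole program.

```python
from itertools import count

BINARY_OPS = {"+", "-", "*"}


def rename_stack_code(program):
    """Translate postfix stack code into (dst, op, src1, src2) tuples.

    Hypothetical sketch: every value pushed on the stack gets its own
    virtual register name, so two instructions conflict only when one
    really consumes the other's result.
    """
    fresh = (f"v{i}" for i in count())
    stack = []          # virtual register names, bottom of stack first
    renamed = []
    for token in program.split():
        dst = next(fresh)
        if token in BINARY_OPS:
            rhs = stack.pop()
            lhs = stack.pop()
            renamed.append((dst, token, lhs, rhs))
        else:
            renamed.append((dst, "load", token, None))
        stack.append(dst)
    return renamed


for instruction in rename_stack_code("a b + c d + *"):
    print(instruction)
```

In the renamed form of `(a + b) * (c + d)`, the two additions write `v2` and `v5` and share no source registers, so in principle they could issue in either order or together; only the final multiply has to wait for both.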
On Mon, 16 Mar 2009 19:10:56 -0700 (PDT), rickman wrote:

>[ZPU's] stack is maintained in memory. There is no stack
>register architecture. So every stack operation consists of reading
>the operands, performing the operation and writing back the result. I
>can see why it is giving slow performance, even when pipelined.

Strange. I'm sure you too can see what to do about this - same as you had to do on the AMD29K which used some of its huge register set to cache the top of the stack. Implement the stack in on-chip RAM, using circular addressing. When the on-chip stack threatens to overflow, "spill" some of the oldest part of it to main memory, using a fast block write operation. CPU operations then continue to use the on-chip stack but the circular addressing no longer overflows. Similarly, when the stack threatens to underflow, "fill" from main memory. Way back then, you could get quite good performance if you were careful to align the spill/fill operations with a DRAM page.

The 29K used software trap routines to do the spill/fill, but I'm sure you could do it at least partly in hardware without too much trouble. Spill/fill can then be done speculatively, in the background, when there is spare bandwidth on the memory interface.

Unfortunately the stack cache trashes multi-threading performance, because there is so much context to swap. I guess the correct compromise these days would be very different, with the stack cache probably about 16 words. With only a small stack cache you can keep several process's stacks in the on-chip memory (that's harder to plan, of course, but may still be helpful particularly in a small system).

>I believe the real issue is that the focus is on building a complex
>machine and trying "techniques" to make it simple and fast. The more
>I look at things like this, the more I am convinced that the Moore
>philosophy is right. You can achieve performance by adding more and
>more complexity, or you can simplify to the point of inherent speed.

Always provided you have sufficiently smart compilers to convert complicated real-world code into a suitable stream of your simple instructions. But in general I think I agree. Compilers _are_ pretty smart these days.

--
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services
Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com
The contents of this message may contain personal views which are not the views of Doulos Ltd., unless specifically stated.
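The spill/fill scheme described in the post above can be sketched in software roughly as follows. This is a behavioural model only, not the 29K mechanism or any real ZPU implementation: the class name, the `size` and `block` parameters and the counters are assumptions, spills happen only when the ring is completely full, and fills only when it is completely empty (no speculative background transfers).

```python
class StackCache:
    """Toy model of an on-chip circular stack buffer backed by main memory."""

    def __init__(self, size=16, block=4):
        if not 0 < block <= size:
            raise ValueError("block must be between 1 and size")
        self.size = size            # on-chip cells (hypothetical parameter)
        self.block = block          # cells moved per spill or fill
        self.ring = [None] * size   # on-chip RAM, addressed modulo size
        self.top = 0                # ring index of the next free cell
        self.count = 0              # cells currently held on-chip
        self.memory = []            # spilled cells in main memory, oldest first
        self.spills = 0
        self.fills = 0

    def push(self, value):
        if self.count == self.size:
            self._spill()
        self.ring[self.top] = value
        self.top = (self.top + 1) % self.size
        self.count += 1

    def pop(self):
        if self.count == 0:
            self._fill()
        if self.count == 0:
            raise IndexError("pop from empty stack")
        self.top = (self.top - 1) % self.size
        self.count -= 1
        return self.ring[self.top]

    def _spill(self):
        # Block-write the oldest on-chip cells to main memory.
        oldest = (self.top - self.count) % self.size
        moved = min(self.block, self.count)
        for i in range(moved):
            self.memory.append(self.ring[(oldest + i) % self.size])
        self.count -= moved
        self.spills += 1

    def _fill(self):
        # Block-read the most recently spilled cells back on-chip.
        moved = min(self.block, len(self.memory))
        if moved == 0:
            return
        cells = self.memory[-moved:]
        del self.memory[-moved:]
        for cell in cells:
            self.ring[self.top] = cell
            self.top = (self.top + 1) % self.size
        self.count = moved
        self.fills += 1


cache = StackCache(size=4, block=2)
for value in range(10):
    cache.push(value)
print([cache.pop() for _ in range(10)], cache.spills, cache.fills)
```

A deeper ring means fewer spills and fills per thread but more state to move on a context switch, which is the compromise the post goes on to describe.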
>Strange. I'm sure you too can see what to do about this -
>same as you had to do on the AMD29K which used some of its
>huge register set to cache the top of the stack. Implement
>the stack in on-chip RAM, using circular addressing. When
>the on-chip stack threatens to overflow, "spill" some of
>the oldest part of it to main memory, using a fast block
>write operation. CPU operations then continue to use the
>on-chip stack but the circular addressing no longer overflows.
>Similarly, when the stack threatens to underflow, "fill"
>from main memory. Way back then, you could get quite good
>performance if you were careful to align the spill/fill
>operations with a DRAM page. The 29K used software trap
>routines to do the spill/fill, but I'm sure you could
>do it at least partly in hardware without too much trouble.
>Spill/fill can then be done speculatively, in the background,
>when there is spare bandwidth on the memory interface.

You can simplify the hardware a lot by pushing the stack overflow problem back to the compiler. That is, the stack has a fixed size. The compiler can't generate code that overflows that limit.

>Unfortunately the stack cache trashes multi-threading
>performance, because there is so much context to swap.
>I guess the correct compromise these days would be very
>different, with the stack cache probably about 16 words.
>With only a small stack cache you can keep several process's
>stacks in the on-chip memory (that's harder to plan,
>of course, but may still be helpful particularly in
>a small system).

I don't understand that comment. Multi-threading requires separate stacks. That's just more RAM in the CPU, perhaps the virtual CPU number is part of the RAM address if that's what you mean by multi-threading.

--
These are my opinions, not necessarily my employer's. I hate spam.
On Tue, 17 Mar 2009 03:46:55 -0500, Hal Murray wrote:

>You can simplify the hardware a lot by pushing
>the stack overflow problem back to the compiler.
>That is, the stack has a fixed size. The compiler
>can't generate code that overflows that limit.

OK. Getting the compiler to limit the stack size is clearly possible (Transputer, anyone? it had a 3-register stack). But this is certain to cause more memory references to escape to memory. It's a compromise, like everything else.

>>Unfortunately the stack cache trashes multi-threading
>>performance, because there is so much context to swap.
[...]
>I don't understand that comment. Multi-threading requires
>separate stacks. That's just more RAM in the CPU

Yes, but if a large swath of CPU register space is used to cache the top-of-stack, then that cache must be saved and restored on a context switch. You can only provide a finite number of stack spaces in the CPU's on-chip RAM, so at some point a context switch is sure to entail a large penalty as some other thread's stack cache must be evicted to main memory.

Shallower on-chip stack cache means slower single-thread performance, but faster context switch and the opportunity to keep more threads' stacks on-chip. Compromises again.

--
Jonathan Bromley
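The compiler-side check discussed in the two posts above can be illustrated with a small sketch that computes the peak stack depth of a postfix sequence and compares it with a fixed hardware depth. The function names and the default depth of 3 (chosen to echo the 3-register stack mentioned above) are assumptions for illustration; a real compiler would work on its own intermediate representation, not on strings.

```python
BINARY_OPS = {"+", "-", "*"}


def max_stack_depth(program):
    """Return the peak evaluation-stack depth of a postfix program."""
    depth = peak = 0
    for token in program.split():
        if token in BINARY_OPS:
            if depth < 2:
                raise ValueError("stack underflow")
            depth -= 1              # two operands popped, one result pushed
        else:
            depth += 1              # operand pushed
        peak = max(peak, depth)
    return peak


def fits_hardware_stack(program, hw_depth=3):
    """True if the code never needs more than hw_depth stack cells."""
    return max_stack_depth(program) <= hw_depth


right_nested = "a b c d + + +"      # a + (b + (c + d))
left_nested = "a b + c + d +"       # ((a + b) + c) + d

print(max_stack_depth(right_nested), fits_hardware_stack(right_nested))
print(max_stack_depth(left_nested), fits_hardware_stack(left_nested))
```

The right-nested form peaks at four cells and would not fit a three-cell stack, so the compiler has to either reorder the evaluation (safe here for integer addition, which gives the left-nested form with a peak of two) or store a temporary in memory, which is the extra memory traffic the reply warns about.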
In comp.arch.fpga Hal Murray <hal-usenet@ip-64-139-1-69.sjc.megapath.net> wrote:

> You can simplify the hardware a lot by pushing
> the stack overflow problem back to the compiler.
> That is, the stack has a fixed size. The compiler
> can't generate code that overflows that limit.

As is usually done for x87 floating point.

The story is that the 8087 was designed such that software could detect the stack over/underflow and swap to/from memory. No-one tried writing the software until the hardware was done, and then it was found that it wasn't possible.

Presumably it could have been fixed in later processors, but as far as I know, it wasn't changed.

-- glen
Hi

Three cheers for Chuck Moore, and his hidden friend Keep Less ;-)

Another zero operand CPU http://nibz.googlecode.com

cheers jacko

Now available in free licence of one core per ASIC/FPGA/CPLD, with two conditions.
1. A K Ring Technologies Logo must be printed atop the chip or close by on the PCB at any resolution.
2. Any documentation produced must acknowledge copyright and provide the URL.

This licence is for those folks who do not like the BSD derived work restrictions.
On Mar 17, 3:24 pm, Jacko <jackokr...@gmail.com> wrote:

> Hi
>
> Three cheers for Chuck Moore, and his hidden friend Keep Less ;-)
>
> Another zero operand CPU http://nibz.googlecode.com
>
> cheers jacko
>
> Now available in free licence of one core per ASIC/FPGA/CPLD, with two
> conditions.
> 1. A K Ring Technologies Logo must be printed atop the chip or close
> by on the PCB at any resolution.
> 2. Any documentation produced must acknowledge copyright and provide
> the URL.
>
> This licence is for those folks who do not like the BSD derived work
> restrictions.

the difference between zpu and nibz is that ZPU is supported by GCC toolchain, while there are no tools to generate any meaningful code for nibz

correct me if i am wrong

Antti
On Mar 17, 9:58 am, "Antti.Luk...@googlemail.com" <Antti.Luk...@googlemail.com> wrote:

> On Mar 17, 3:24 pm, Jacko <jackokr...@gmail.com> wrote:
>
> > Hi
>
> > Three cheers for Chuck Moore, and his hidden friend Keep Less ;-)
>
> > Another zero operand CPU http://nibz.googlecode.com
>
> > cheers jacko
>
> > Now available in free licence of one core per ASIC/FPGA/CPLD, with two
> > conditions.
> > 1. A K Ring Technologies Logo must be printed atop the chip or close
> > by on the PCB at any resolution.
> > 2. Any documentation produced must acknowledge copyright and provide
> > the URL.
>
> > This licence is for those folks who do not like the BSD derived work
> > restrictions.
>
> the difference between zpu and nibz is that ZPU is supported
> by GCC toolchain, while there are no tools to generate any
> meaningful code for nibz
>
> correct me if i am wrong
>
> Antti

Is that really the primary criterion? I think you are right. But I have a similar CPU design that I expect to use on a project shortly and it will be programmed in assembly, but it will look a lot like Forth. I consider that to be close enough to a high level language.

BTW, ZPU may have a GCC compiler, but without a debugger, is that really useful? There aren't many projects done in C that are debugged without an emulator.

Rick
rickman wrote:

> On Mar 17, 9:58 am, "Antti.Luk...@googlemail.com"
> <Antti.Luk...@googlemail.com> wrote:
>> On Mar 17, 3:24 pm, Jacko <jackokr...@gmail.com> wrote:
>>
>>> Hi
>>> Three cheers for Chuck Moore, and his hidden friend Keep Less ;-)
>>> Another zero operand CPU http://nibz.googlecode.com
>>> cheers jacko
>>> Now available in free licence of one core per ASIC/FPGA/CPLD, with two
>>> conditions.
>>> 1. A K Ring Technologies Logo must be printed atop the chip or close
>>> by on the PCB at any resolution.
>>> 2. Any documentation produced must acknowledge copyright and provide
>>> the URL.
>>> This licence is for those folks who do not like the BSD derived work
>>> restrictions.
>> the difference between zpu and nibz is that ZPU is supported
>> by GCC toolchain, while there are no tools to generate any
>> meaningful code for nibz
>>
>> correct me if i am wrong
>>
>> Antti
>
> Is that really the primary criterion? I think you are right. But I
> have a similar CPU design that I expect to use on a project shortly
> and it will be programmed in assembly, but it will look a lot like
> Forth. I consider that to be close enough to a high level language.
>
> BTW, ZPU may have a GCC compiler, but without a debugger, is that
> really useful? There aren't many projects done in C that are debugged
> without an emulator.
>
> Rick

It's possible to do a lot of development without a debugger. I often do embedded development without one (though I prefer to have one available if possible). Until you've done debugging with only a single LED for signalling, you haven't really done embedded development. Bonus points if the microcontroller you're using only comes in OTP version.

Even big projects can be done without a debugger:

<http://linuxmafia.com/faq/Kernel/linus-im-a-bastard-speech.html>

So a compiler without a debugger is somewhat limited but still useful, but a debugger without a compiler is rather less useful!





