
Soft core processors: RISC versus stack/accumulator for equal FPGA resources

Started by Jim Brakefield, September 26, 2015
It would appear that RISC and stack/accumulator architectures have very similar resource needs when both are of the "load/store" classification.
Here, the same multi-port LUT RAM serves as either the RISC register file or the dual stacks, with the DSP blocks for multiply and block RAM for main memory.  "Load/store" refers to using distinct instructions to move data between LUT RAM and block RAM.
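As an illustration, here is a minimal Verilog sketch (module name, port names and widths are my own assumptions, not from any particular design): the same distributed LUT RAM can back either organization, with only the address generation differing.

module regfile_or_stack #(parameter W = 32) (
    input  wire         clk,
    input  wire         stack_mode, // 0 = RISC register file, 1 = stack
    input  wire [4:0]   rs_addr,    // source register number (RISC mode)
    input  wire [4:0]   rd_addr,    // destination register number (RISC mode)
    input  wire [4:0]   sp,         // stack pointer (stack mode)
    input  wire         we,
    input  wire [W-1:0] wdata,
    output wire [W-1:0] rdata
);
    reg [W-1:0] mem [0:31];         // 32 entries of distributed (LUT) RAM

    // Same storage, different addressing: register numbers come from the
    // instruction word, stack addresses from a small pointer register.
    wire [4:0] raddr = stack_mode ? sp        : rs_addr;
    wire [4:0] waddr = stack_mode ? sp + 5'd1 : rd_addr;

    always @(posedge clk)
        if (we) mem[waddr] <= wdata;   // synchronous write

    assign rdata = mem[raddr];         // asynchronous read infers LUT RAM
endmodule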

Has someone studied this situation?
Would stack/accumulator program code be denser?
Would multiple instruction issue be simpler with RISC?

Jim Brakefield
On 9/26/2015 2:07 PM, jim.brakefield@ieee.org wrote:
> Has someone studied this situation?
> Would stack/accumulator program code be denser?
> Would multiple instruction issue be simpler with RISC?
I've done a little investigation, and the instruction set for a stack processor was not much denser than that of the RISC CPU I compared it to. I don't recall which one it was.

A lot depends on the code you use for comparison; I was using loops that move data. Many stack processors have some level of inefficiency because of the stack juggling required in some code. Proponents usually say the code can be written to reduce operand juggling, which I have found to be mostly true. If you code to reduce the parameter juggling, stack processors can be somewhat more efficient in terms of code space usage.

I have looked at a couple of alternatives. One is to use VLIW to allow as much parallelism as possible among the execution units within the processor, namely the data unit, address unit and instruction unit. This has some inherent inefficiency in that a fixed-size instruction field controls the instruction unit even though most IU instructions are just "next". But it allows both the address unit and the data unit to do work at the same time, for example moving data to/from memory while counting a loop iteration.

Another potential stack optimization I have looked at is combining register and stack concepts: very short offsets from the top of stack select a given operand, along with variable-size stack adjustments. I didn't pursue this very far, but I think it has the potential to virtually eliminate operand juggling, making a stack processor much faster. I'm not sure of the effect on code size because of the larger instruction size.

--

Rick
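A minimal Verilog sketch of the three-way VLIW split described above (the 18-bit word size, field widths and opcode value are illustrative assumptions): one fetched word carries independent fields for the data, address and instruction units, and the IU field is spent even when it only says "next".

module vliw_decode (
    input  wire [17:0] iword,      // VLIW word: 8 + 6 + 4 bits, assumed
    output wire [7:0]  du_op,      // data unit operation
    output wire [5:0]  au_op,      // address unit operation
    output wire [3:0]  iu_op,      // instruction unit op; usually just NEXT
    output wire        iu_is_next  // flags the common "no branch" case
);
    localparam IU_NEXT = 4'h0;     // sequential fetch, the dominant case

    assign du_op = iword[17:10];
    assign au_op = iword[9:4];
    assign iu_op = iword[3:0];
    // These four bits are carried by every instruction even when they
    // encode nothing but "next" -- the fixed-field inefficiency noted above.
    assign iu_is_next = (iu_op == IU_NEXT);
endmodule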
On Sunday, September 27, 2015 at 3:37:24 AM UTC+9:30, jim.bra...@ieee.org wrote:
> Has someone studied this situation?
> Would stack/accumulator program code be denser?
> Would multiple instruction issue be simpler with RISC?

I worked with the 1980's Lilith computer and its Modula-2 compiler which used a stack-based architecture. Christian Jacobi includes a detailed analysis of the code generated in his dissertation titled "Code Generation and the Lilith Architecture". You can download a copy from my website:

http://www.cfbsoftware.com/modula2/

I am currently working on the 2015 RISC equivalent - the FPGA RISC5 Oberon compiler used in Project Oberon:

http://www.projectoberon.com

The code generation is described in detail in the included documentation.

I have both systems in operation and have some very similar test programs for both. I'll experiment to see if the results give any surprises. Any comparison would have to take into account the fact that the Lilith was a 16-bit architecture whereas RISC5 is 32-bit so it might be tricky.

Regards,
Chris Burrows
CFB Software
http://www.astrobe.com
On Saturday, September 26, 2015 at 3:02:27 PM UTC-5, rickman wrote:
> I have looked at a couple of alternatives. One is to use VLIW to allow
> as much parallelism as possible among the execution units within the
> processor, namely the data unit, address unit and instruction unit.
Have considered multiple stacks as a form of VLIW: each stack having its own part of the VLIW instruction, or, if it has nothing to do, providing future immediates for any of the other stacks' instructions.
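A Verilog sketch of that multi-stack idea (the 16-bit encoding and the IMM tag are hypothetical): each stack owns one field of the instruction word, and a field tagged "nothing to do" donates its remaining bits as immediate data for the other stack's instruction.

module multistack_vliw_decode (
    input  wire [15:0] iword,        // two 8-bit per-stack fields, assumed
    output wire [7:0]  s0_field,     // field for stack 0
    output wire [7:0]  s1_field,     // field for stack 1
    output wire        s0_is_imm,    // stack 0 idle, donating immediates
    output wire        s1_is_imm,    // stack 1 idle, donating immediates
    output wire [5:0]  imm_bits      // the donated immediate bits
);
    localparam [1:0] OP_IMM = 2'b11; // assumed "provide immediates" tag

    assign s0_field  = iword[15:8];
    assign s1_field  = iword[7:0];
    assign s0_is_imm = (s0_field[7:6] == OP_IMM);
    assign s1_is_imm = (s1_field[7:6] == OP_IMM);
    assign imm_bits  = s0_is_imm ? s0_field[5:0] : s1_field[5:0];
endmodule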
> Another potential stack optimization I have looked at is combining
> register and stack concepts: very short offsets from the top of stack
> select a given operand, along with variable-size stack adjustments.
> I didn't pursue this very far, but I think it has the potential to
> virtually eliminate operand juggling, making a stack processor much
> faster.
Also, this is a way to improve the processing rate, as there are fewer instructions than with "pure" stack code (each instruction has a stack/accumulator operation and a small offset for the other operand).  While one is at it, one can add various instruction bits for "return", stack/accumulator mode, replace operation, stack pointer selector, ...

Personally, I don't have hard numbers for any of this (there are open-source stack machines with small offsets and various instruction bits; what is needed is compilers so that comparisons can be done).  And I don't want to duplicate any work (AKA research) that has already been done.

Jim Brakefield
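A sketch of one possible 16-bit encoding of the "stack op + short offset" idea (all field positions and widths are my own assumptions): one operand is the top of stack, the other sits at a small depth below it, and the spare bits carry the extras listed above.

module stack_offset_decode (
    input  wire [15:0] iword,
    output wire [5:0]  opcode,     // stack/accumulator operation
    output wire [2:0]  offset,     // second operand at depth TOS - offset
    output wire [1:0]  sp_adjust,  // variable stack adjustment (pop 0..3)
    output wire        do_return,  // fold a subroutine return into the op
    output wire        replace,    // overwrite TOS instead of pushing
    output wire [1:0]  stack_sel,  // stack pointer selector
    output wire        mode        // stack vs accumulator mode
);
    assign opcode    = iword[15:10];
    assign offset    = iword[9:7];
    assign sp_adjust = iword[6:5];
    assign do_return = iword[4];
    assign replace   = iword[3];
    assign stack_sel = iword[2:1];
    assign mode      = iword[0];
endmodule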
On Saturday, September 26, 2015 at 8:19:29 PM UTC-5, cfbso...@gmail.com wrote:
> Any comparison would have to take into account the fact that the Lilith was a 16-bit architecture whereas RISC5 is 32-bit so it might be tricky.

And in the 1980s, main-memory access time was a smaller multiple of the clock period than with today's DRAMs.  However, the main memory on the RISC5 FPGA board is asynchronous static RAM with a fast access time, so perhaps comparable to the main memory of the Lilith?

Jim Brakefield
On 9/27/2015 8:30 PM, jim.brakefield@ieee.org wrote:
> On Saturday, September 26, 2015 at 3:02:27 PM UTC-5, rickman wrote:
>> I have looked at a couple of alternatives. One is to use VLIW to allow
>> as much parallelism as possible among the execution units within the
>> processor, namely the data unit, address unit and instruction unit.
> Have considered multiple stacks as a form of VLIW: each stack having
> its own part of the VLIW instruction, or, if it has nothing to do,
> providing future immediates for any of the other stacks' instructions.
I assume you mean two data stacks? I was trying hard not to expand the hardware significantly. The common stack machine typically has two stacks, one for data and one for return addresses; in Forth the return stack is also used for loop counting.

My derivation uses the return stack for addresses such as memory accesses as well as jumps/calls, so I call it the address stack. This lets you do minimal arithmetic there (loop counting and incrementing addresses) and reduces stack ops on the data stack, such as the two drops required for a memory write.
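A rough Verilog sketch of that store path (port names and widths are my assumptions): the address comes from the top of the address stack and the data from the top of the data stack, so a store consumes one item from each stack rather than two drops from the data stack.

module store_unit (
    input  wire        clk,
    input  wire        do_store,    // decoded store instruction
    input  wire [15:0] astack_top,  // top of address stack = target address
    input  wire [15:0] dstack_top,  // top of data stack = value to store
    output wire        astack_pop,  // consume the address operand
    output wire        dstack_pop,  // consume the data operand
    output reg         mem_we,
    output reg  [15:0] mem_addr,
    output reg  [15:0] mem_wdata
);
    assign dstack_pop = do_store;
    assign astack_pop = do_store;   // could auto-increment instead, for moves

    always @(posedge clk) begin     // register the request toward block RAM
        mem_we    <= do_store;
        mem_addr  <= astack_top;
        mem_wdata <= dstack_top;
    end
endmodule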
>> Another potential stack optimization I have looked at is combining
>> register and stack concepts: very short offsets from the top of stack
>> select a given operand, along with variable-size stack adjustments.
>> I didn't pursue this very far, but I think it has the potential to
>> virtually eliminate operand juggling, making a stack processor much
>> faster.
> Also, this is a way to improve the processing rate, as there are fewer
> instructions than with "pure" stack code (each instruction has a
> stack/accumulator operation and a small offset for the other operand).
> While one is at it, one can add various instruction bits for "return",
> stack/accumulator mode, replace operation, stack pointer selector, ...
Yes, returns are common so it can be useful to provide a minimal instruction overhead for that. The other things can require extra hardware.
> Personally, I don't have hard numbers for any of this (there are
> open-source stack machines with small offsets and various instruction
> bits; what is needed is compilers so that comparisons can be done).
> And I don't want to duplicate any work (AKA research) that has
> already been done.
>
> Jim Brakefield
--

Rick
On Sunday, September 27, 2015 at 10:20:39 PM UTC-5, rickman wrote:
Reply:
> I assume you mean two data stacks?
Yes, in particular integer arithmetic on one and floating-point on the other.
> My derivation uses the return stack for addresses such as memory accesses as well as jumps/calls, so I call it the address stack.

OK
> I was trying hard not to expand the hardware significantly.
> The other things can require extra hardware.
With FPGA 6-LUTs one can have several read ports (4-LUT RAM can do it also, it's just not as efficient).  At one operation per clock, mapping both data and address stacks to the same LUT RAM gives two ports for operand reads, one port for the result write, and one port for the "return" address read.  Just about any stack or accumulator operation that fits these constraints is possible with appropriate instruction decode and ALU.  The SWAP operation requires two writes, so one would need to make TOS a separate register to do it in one clock (other implementations are possible using two multiport LUT RAMs).

Jim
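A Verilog sketch of that port budget (array size and port names are assumptions): one LUT RAM array holds both stacks, with one write port and three asynchronous read ports. Synthesis tools implement the extra read ports by replicating the distributed RAM.

module dual_stack_lutram #(parameter W = 32) (
    input  wire         clk,
    input  wire         we,
    input  wire [5:0]   waddr,      // result write (either stack)
    input  wire [W-1:0] wdata,
    input  wire [5:0]   raddr_a,    // operand read A
    input  wire [5:0]   raddr_b,    // operand read B
    input  wire [5:0]   raddr_ret,  // "return" address read
    output wire [W-1:0] rdata_a,
    output wire [W-1:0] rdata_b,
    output wire [W-1:0] rdata_ret
);
    // 64 entries: e.g. data stack in 0..31, address stack in 32..63
    reg [W-1:0] mem [0:63];

    always @(posedge clk)
        if (we) mem[waddr] <= wdata;

    assign rdata_a   = mem[raddr_a];   // asynchronous reads infer LUT RAM;
    assign rdata_b   = mem[raddr_b];   // one replica per read port
    assign rdata_ret = mem[raddr_ret];
endmodule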
On 9/28/2015 12:31 AM, jim.brakefield@ieee.org wrote:
>> I assume you mean two data stacks?
> Yes, in particular integer arithmetic on one and floating-point on
> the other.
Yes, if you need floating point a separate stack is often used.
> With FPGA 6-LUTs one can have several read ports (4-LUT RAM can do it
> also, it's just not as efficient).  At one operation per clock, mapping
> both data and address stacks to the same LUT RAM gives two ports for
> operand reads, one port for the result write, and one port for the
> "return" address read.  Just about any stack or accumulator operation
> that fits these constraints is possible with appropriate instruction
> decode and ALU.  The SWAP operation requires two writes, so one would
> need to make TOS a separate register to do it in one clock (other
> implementations are possible using two multiport LUT RAMs).
I used a TOS register for each stack, with a write port and a read port for each stack in one block RAM; the write and read ports share the address. A read happens on each cycle automatically, and in all the parts I have used this can be set so the data written in a cycle shows up on the read port, making it the next-on-stack at all times.

Managing the stack pointers can get a bit complex if an effort to keep it simple is not made. As it was, the stack pointer was in the critical timing path, which ended in the flag registers: the stack pointers set error flags in the CPU status register for overflow and underflow. I thought this would be useful for debugging, but there are likely ways to minimize the timing overhead.

--

Rick
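A much-simplified Verilog sketch of that arrangement (widths, the single push/pop interface and the flag details are my assumptions; the continuously visible next-on-stack read is omitted):

module hw_stack #(
    parameter W  = 16,
    parameter AW = 8                   // log2 of stack depth
) (
    input  wire         clk,
    input  wire         rst,
    input  wire         push,
    input  wire         pop,           // push and pop assumed exclusive
    input  wire [W-1:0] push_data,
    output reg  [W-1:0] tos,           // top of stack kept in a register
    output reg          ovf,           // sticky overflow flag
    output reg          unf            // sticky underflow flag
);
    reg [AW-1:0] sp;                   // points at the next free slot
    reg [W-1:0]  ram [0:(1<<AW)-1];    // stack body in block RAM

    always @(posedge clk) begin
        if (rst) begin
            sp <= 0; ovf <= 1'b0; unf <= 1'b0;
        end else if (push) begin
            ram[sp] <= tos;            // spill the old TOS into RAM
            tos     <= push_data;
            sp      <= sp + 1'b1;
            if (&sp) ovf <= 1'b1;      // pushed into a full stack
        end else if (pop) begin
            tos <= ram[sp - 1'b1];     // synchronous read refills TOS
            sp  <= sp - 1'b1;
            // sp arithmetic feeding the flags is the critical path noted above
            if (sp == 0) unf <= 1'b1;  // popped an empty stack
        end
    end
endmodule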
On Monday, September 28, 2015 at 10:49:47 AM UTC+9:30, jim.bra...@ieee.org wrote:
> And in the 1980s, main-memory access time was a smaller multiple of the clock period than with today's DRAMs. However, the main memory on the RISC5 FPGA board is asynchronous static RAM with a fast access time, so perhaps comparable to the main memory of the Lilith?
Rather than trying to paraphrase the information and risk getting it wrong, I refer you to the detailed description of the Lilith memory organisation in the 'Lilith Computer Hardware Manual'. You can download a copy of this and several other related documents from BitSavers:

http://www.bitsavers.org/pdf/eth/lilith/

Regards,
Chris