FPGARelated.com
Forums

MISC - Stack Based vs. Register Based

Started by rickman March 29, 2013
I have been working with stack based MISC designs in FPGAs for some 
years.  All along I have been comparing my work to the work of others. 
These others were the conventional RISC type processors supplied by the 
FPGA vendors as well as the many processor designs done by individuals 
or groups as open source.

So far my CPUs have always ranked reasonably well in terms of speed, but 
more importantly to me, very well in terms of size and code density.  My 
efforts have shown it hard to improve on code density by a significant 
degree while simultaneously minimizing the resources used by the design. 
  Careful selection of the instruction set can both improve code density 
and minimize logic used if measured together, but there is always a 
tradeoff.  One can always be improved at the expense of the other.

The last couple of days I was looking at some code I plan to use and 
realized that it could be a lot more efficient if I could find a way to 
use more parallelism inside the CPU and use fewer instructions.  So I 
started looking at defining separate opcodes for the two primary 
function units in the design, the data stack and the return stack.  Each 
has its own ALU.  The data stack has a full complement of capabilities 
while the return stack can only add, subtract and compare.  The return 
stack is actually intended to be an "address" processing unit.

While trying to figure out how to maximize the parallel capabilities of 
these units, I realized that many operations were just stack 
manipulations.  Then I read the thread about the relative "cost" of 
stack ops vs memory accesses and I realized these were what I needed to 
optimize.  I needed to find a way to not use an instruction and a clock 
cycle for moving data around on the stack.

In the thread on stack ops it was pointed out repeatedly that very often 
the stack operands would be optimized to register operands, meaning they 
wouldn't need to do the stack ops at all really.  So I took a look at a 
register based MISC design.  Guess what, I don't see the disadvantage! 
I have pushed this around for a couple of days and although I haven't 
done a detailed design, I think I have looked at it enough to realize 
that I can design a register oriented MISC CPU that will run as fast, if 
not faster than my stack based design and it will use fewer 
instructions.  I still need to add some features like support for a 
stack in memory, in other words, pre-increment/post-decrement (or the 
other way around...), but I don't see where this is a bad design.  It 
may end up using *less* logic as well.  My stack design provides access 
to the stack pointers which require logic for both the pointers and 
muxing them into the data stack for reading.

I guess looking at other peoples designs (such as Chuck's) has changed 
my perspective over the years so that I am willing and able to do 
optimizations in ways I would not have wanted to do in the past.  But I 
am a bit surprised that there has been so much emphasis on stack 
oriented MISC machines which it may well be that register based MISC 
designs are also very efficient, at least if you aren't building them to 
service a C compiler or trying to match some ideal RISC model.

-- 

Rick
In comp.arch.fpga rickman <gnuarm@gmail.com> wrote:
> I have been working with stack based MISC designs in FPGAs for some > years. All along I have been comparing my work to the work of others. > These others were the conventional RISC type processors supplied by the > FPGA vendors as well as the many processor designs done by individuals > or groups as open source.
You might also look at some VLIW designs. Not that the designs themselves will be useful, but maybe some of the ideas that were used would help. Well, much of the idea of RISC is that code density isn't very important, and that many of the more complicated instructions made assembly language programming easier, but compilers didn't use them.
> So far my CPUs have always ranked reasonably well in terms of speed, but > more importantly to me, very well in terms of size and code density.
Do you mean the size of CPU (lines of verilog) or size of the programs that run on it?
> My > efforts have shown it hard to improve on code density by a significant > degree while simultaneously minimizing the resources used by the design. > Careful selection of the instruction set can both improve code density > and minimize logic used if measured together, but there is always a > tradeoff. One can always be improved at the expense of the other.
Seems to me that much of the design of VAX was to improve code density when main memory was still a very significant part of the cost of a machine. The large number of addressing modes allowed for efficient use of bits. It also greatly complicated making an efficient pipelined processor. With little overlap and a microprogrammed CPU, it is easy to go sequentially through the instruction bytes and process them in order. Most of the complication is in the microcode itself. But every level of logic adds delay. Using a wide bus to fast memory is more efficient that a complicated decoder. But sometimes RISC went too far. In early RISC, there was the idea of one cycle per instruction. They couldn't do that for multiply, so they added multiply-step, an instruction that you execute many times for each multiply operation. (And maybe no divide at all.) For VLIW, a very wide instruction word allows for specifying many different operations at the same time. It relies on complicated compilers to optimally pack the multiple operations into the instruction stream.
> The last couple of days I was looking at some code I plan to use and > realized that it could be a lot more efficient if I could find a way to > use more parallelism inside the CPU and use fewer instructions. So I > started looking at defining separate opcodes for the two primary > function units in the design, the data stack and the return stack. Each > has its own ALU. The data stack has a full complement of capabilities > while the return stack can only add, subtract and compare. The return > stack is actually intended to be an "address" processing unit.
This kind of thought is what lead to the branch delay slot on many RISC processors. There is work to be done for branches, especially for branch to or return from subroutine. Letting the processor know early allows for more overlap. So, on many early (maybe not now) RISC machines the instruction after a branch instruction is executed before the branch is taken. (It might be a NOOP, though.) Again, reduced code density (if it is NOOP) in exchange for a faster instruction cycle.)
> While trying to figure out how to maximize the parallel capabilities of > these units, I realized that many operations were just stack > manipulations. Then I read the thread about the relative "cost" of > stack ops vs memory accesses and I realized these were what I needed to > optimize. I needed to find a way to not use an instruction and a clock > cycle for moving data around on the stack.
Well, consider the x87 stack instructions. Some operations exist in forms that do and don't pop something off the stack. A small increase in the number of opcodes used allows for avoiding many POP instructions.
> In the thread on stack ops it was pointed out repeatedly that very often > the stack operands would be optimized to register operands, meaning they > wouldn't need to do the stack ops at all really. So I took a look at a > register based MISC design. Guess what, I don't see the disadvantage!
Well, actually the 8087 was designed to be either a register or stack machine, as many instructions can index into the stack. If you keep track of the current stack depth, you can find data in the stack like in registers. Well, it was supposed to be able to do that.
> I have pushed this around for a couple of days and although I haven't > done a detailed design, I think I have looked at it enough to realize > that I can design a register oriented MISC CPU that will run as fast, if > not faster than my stack based design and it will use fewer > instructions. I still need to add some features like support for a > stack in memory, in other words, pre-increment/post-decrement (or the > other way around...), but I don't see where this is a bad design. It > may end up using *less* logic as well. My stack design provides access > to the stack pointers which require logic for both the pointers and > muxing them into the data stack for reading.
Well, the stack design has the advantage that you can use instruction bits either for a memory address or for the operation, allowing for much smaller instructions. But that only works as long as everything is in the right place on the stack.
> I guess looking at other peoples designs (such as Chuck's) has changed > my perspective over the years so that I am willing and able to do > optimizations in ways I would not have wanted to do in the past. But I > am a bit surprised that there has been so much emphasis on stack > oriented MISC machines which it may well be that register based MISC > designs are also very efficient, at least if you aren't building them to > service a C compiler or trying to match some ideal RISC model.
Seems to me that there hasn't been much done in stack machines since the B5500, the first computer I ever used when I was about nine. -- glen
On 3/29/2013 5:43 PM, glen herrmannsfeldt wrote:
> In comp.arch.fpga rickman<gnuarm@gmail.com> wrote: >> I have been working with stack based MISC designs in FPGAs for some >> years. All along I have been comparing my work to the work of others. >> These others were the conventional RISC type processors supplied by the >> FPGA vendors as well as the many processor designs done by individuals >> or groups as open source. > > You might also look at some VLIW designs. Not that the designs > themselves will be useful, but maybe some of the ideas that were used > would help. > > Well, much of the idea of RISC is that code density isn't very > important, and that many of the more complicated instructions made > assembly language programming easier, but compilers didn't use them.
I am somewhat familiar with VLIW. I am also very familiar with microcode which is the extreme VLIW. I have coded microcode for an I/O processor on an attached array processor. That's like saying I coded a DMA controller in a DSP chip, but before DSP chips were around.
>> So far my CPUs have always ranked reasonably well in terms of speed, but >> more importantly to me, very well in terms of size and code density. > > Do you mean the size of CPU (lines of verilog) or size of the programs > that run on it?
I don't count lines of HDL code. I can't see that as a useful metric for much. I am referring to code density of the compiled code for the target CPU. All of my work is in FPGAs and memory is limited inside the chip... well, at least in the small ones... lol Some FPGAs now have literally Mbits of memory on chip. Still, nearly all of my work is miniature and low power, sometimes *very* low power. So there may only be 6 block memories in the FPGA for example.
>> My >> efforts have shown it hard to improve on code density by a significant >> degree while simultaneously minimizing the resources used by the design. >> Careful selection of the instruction set can both improve code density >> and minimize logic used if measured together, but there is always a >> tradeoff. One can always be improved at the expense of the other. > > Seems to me that much of the design of VAX was to improve code > density when main memory was still a very significant part of the > cost of a machine. The large number of addressing modes allowed for > efficient use of bits. It also greatly complicated making an efficient > pipelined processor. With little overlap and a microprogrammed CPU, > it is easy to go sequentially through the instruction bytes and process > them in order. Most of the complication is in the microcode itself.
One of the big differences is that they were not much limited in the size of the machine itself. They could throw lots of gates at the problem and not worry too much other than how it impacted speed. I am again more limited in the small FPGAs I use and I expect I will achieve higher clock speeds by keeping the machine simple as well. Also, as an architectural decision, there is no microcode and all instructions are one clock cycle. This helps with simplification and also makes interrupt processing very simple.
> But every level of logic adds delay. Using a wide bus to fast memory > is more efficient that a complicated decoder. But sometimes RISC went > too far. In early RISC, there was the idea of one cycle per instruction. > They couldn't do that for multiply, so they added multiply-step, an > instruction that you execute many times for each multiply operation. > (And maybe no divide at all.)
I"m not sure what your point is. What part of this is "too far"? This is exactly the type of design I am doing, but to a greater extent.
> For VLIW, a very wide instruction word allows for specifying many > different operations at the same time. It relies on complicated > compilers to optimally pack the multiple operations into the > instruction stream.
Yes, in theory that is what VLIW is. This is just one step removed from microcode where the only limitation to how parallel operations can be is the data/address paths themselves. The primary application of VLIW I have seen is in the TI 6000 series DSP chips. But in reality this is not what I consider VLIW. This design uses eight CPUs which are mostly similar, but not quite identical with two sets of four CPU units sharing a register file, IIRC. In reality each of the eight CPUs gets its own 32 bit instruction stream. They all operate in lock step, but you can't do eight FIR filters. I think of the four in a set, two are set up to do full math, etc and two are able to generate addresses. So this is really two CPUs with dual MACs and two address generators as it ends up being used most of the time. But then they make it run at clocks of over 1 GHz so it is a damn fast DSP and handles most of the cell phone calls as part of the base station. Regardless of whether it is a "good" example of VLIW, it was a marketing success.
>> The last couple of days I was looking at some code I plan to use and >> realized that it could be a lot more efficient if I could find a way to >> use more parallelism inside the CPU and use fewer instructions. So I >> started looking at defining separate opcodes for the two primary >> function units in the design, the data stack and the return stack. Each >> has its own ALU. The data stack has a full complement of capabilities >> while the return stack can only add, subtract and compare. The return >> stack is actually intended to be an "address" processing unit. > > This kind of thought is what lead to the branch delay slot on many > RISC processors. There is work to be done for branches, especially > for branch to or return from subroutine. Letting the processor know > early allows for more overlap. So, on many early (maybe not now) RISC > machines the instruction after a branch instruction is executed before > the branch is taken. (It might be a NOOP, though.) Again, reduced code > density (if it is NOOP) in exchange for a faster instruction cycle.)
That is a result of the heavy pipelining that is being done. I like to say my design is not pipelined, but someone here finally convinced me that my design *is* pipelined with the execution in parallel with the next instruction fetch, but there is never a need to stall or flush because there are no conflicts. I want a design to be fast, but not at the expense of complexity. That is one way I think like Chuck Moore. Keep it simple and that gives you speed.
>> While trying to figure out how to maximize the parallel capabilities of >> these units, I realized that many operations were just stack >> manipulations. Then I read the thread about the relative "cost" of >> stack ops vs memory accesses and I realized these were what I needed to >> optimize. I needed to find a way to not use an instruction and a clock >> cycle for moving data around on the stack. > > Well, consider the x87 stack instructions. Some operations exist in > forms that do and don't pop something off the stack. A small increase > in the number of opcodes used allows for avoiding many POP instructions.
In Forth speak a POP would be a DROP. That is not often used in Forth really, or in my apps. I just wrote some code for my stack CPU and I think there were maybe two DROPs in just over 100 instructions. I am talking about the DUPs, SWAPs, OVERs and such. The end up being needed enough that it makes the register design look good... at least at first blush. I am looking at how to organize a register based instruction set without expanding the size of the instructions. I'm realizing that is one issue with registers, you have to specify them. But I don't need to make the machine totally general like the goal for RISC. That is so writing compilers is easier. I don't need to consider that.
>> In the thread on stack ops it was pointed out repeatedly that very often >> the stack operands would be optimized to register operands, meaning they >> wouldn't need to do the stack ops at all really. So I took a look at a >> register based MISC design. Guess what, I don't see the disadvantage! > > Well, actually the 8087 was designed to be either a register or stack > machine, as many instructions can index into the stack. If you keep > track of the current stack depth, you can find data in the stack like > in registers. Well, it was supposed to be able to do that. > >> I have pushed this around for a couple of days and although I haven't >> done a detailed design, I think I have looked at it enough to realize >> that I can design a register oriented MISC CPU that will run as fast, if >> not faster than my stack based design and it will use fewer >> instructions. I still need to add some features like support for a >> stack in memory, in other words, pre-increment/post-decrement (or the >> other way around...), but I don't see where this is a bad design. It >> may end up using *less* logic as well. My stack design provides access >> to the stack pointers which require logic for both the pointers and >> muxing them into the data stack for reading. > > Well, the stack design has the advantage that you can use instruction > bits either for a memory address or for the operation, allowing for much > smaller instructions. But that only works as long as everything is in > the right place on the stack.
Yes, it is a tradeoff between instruction size and the number of ops needed to get a job done. I'm looking at trimming the instruction size down to give a workable subset for register operations.
>> I guess looking at other peoples designs (such as Chuck's) has changed >> my perspective over the years so that I am willing and able to do >> optimizations in ways I would not have wanted to do in the past. But I >> am a bit surprised that there has been so much emphasis on stack >> oriented MISC machines which it may well be that register based MISC >> designs are also very efficient, at least if you aren't building them to >> service a C compiler or trying to match some ideal RISC model. > > Seems to me that there hasn't been much done in stack machines since > the B5500, the first computer I ever used when I was about nine.
I wouldn't say that. There may not be many commercial designs out there, but there have been a few stack based CPU chips (mostly from the work of Chuck Moore) and there are a number of stack CPUs internal to chips. Just ask Bernd Paysan. He has done several I believe. Mine has only been run in FPGAs. -- Rick
On 03/29/2013 10:00 PM, rickman wrote:
> I have been working with stack based MISC designs in FPGAs for some > years. All along I have been comparing my work to the work of others. > These others were the conventional RISC type processors supplied by the > FPGA vendors as well as the many processor designs done by individuals > or groups as open source. > > So far my CPUs have always ranked reasonably well in terms of speed, but > more importantly to me, very well in terms of size and code density. My > efforts have shown it hard to improve on code density by a significant > degree while simultaneously minimizing the resources used by the design. > Careful selection of the instruction set can both improve code density > and minimize logic used if measured together, but there is always a > tradeoff. One can always be improved at the expense of the other. >
I once made a CPU design for an FPGA that had multiple stacks. There was a general purpose stack "A", two index stacks "X" and "Y", and a return stack "R". ALU operations worked between A and any other stack, so they only required 2 bits in the opcode. There was also a move instruction that could move data from a source to a destination stack. Having access to multiple stacks means you spend less time shuffling data on the stack. There's no more need for swap, over, rot and similar stack manipulation instructions. The only primitive operations you need are push and pop. For instance, I had a load instruction that could load from memory using the address in the X stack, and push the result on the A stack. The cool part is that the X stack itself isn't changed by this operation, so the same address can be used multiple time. So, you could do a LOAD (X) ; load from (X) and push on A 1 ; push literal on A ADD ; add top two elements of A STORE (X) ; pop A, and store in (X) to increment a location in memory. And if you wanted to increment X to access the next memory location, you'd do: 1 ; push literal on A ADD X ; pop X, pop A, add, and push result on A. MOVE A, X ; pop A, and push on X It was an 8 bit architecture with 9 bit instructions (to match the FPGA block RAM + parity bit). Having 9 bit instructions allows an 8 bit literal push to be encoded in 1 instruction. Feel free to e-mail if you want more details.
"rickman" <gnuarm@gmail.com> wrote in message
news:kj4vae$msi$1@dont-email.me...

> I have been working with stack based MISC designs in FPGAs for > some years. All along I have been comparing my work to the work > of others. These others were the conventional RISC type > processors supplied by the FPGA vendors as well as the many > processor designs done by individuals or groups as open source. > > So far my CPUs have always ranked reasonably well in terms of > speed, but more importantly to me, very well in terms of size > and code density. My efforts have shown it hard to improve on > code density by a significant degree while simultaneously > minimizing the resources used by the design. Careful selection > of the instruction set can both improve code density and > minimize logic used if measured together, but there is always a > tradeoff. One can always be improved at the expense of the > other. > > The last couple of days I was looking at some code I plan to use > and realized that it could be a lot more efficient if I could > find a way to use more parallelism inside the CPU and use fewer > instructions. So I started looking at defining separate opcodes > for the two primary function units in the design, the data stack > and the return stack. Each has its own ALU. The data stack has > a full complement of capabilities while the return stack can > only add, subtract and compare. The return stack is actually > intended to be an "address" processing unit. > > While trying to figure out how to maximize the parallel > capabilities of these units, I realized that many operations > were just stack manipulations. Then I read the thread about the > relative "cost" of stack ops vs memory accesses and I realized > these were what I needed to optimize. I needed to find a way to > not use an instruction and a clock cycle for moving data around > on the stack. > > In the thread on stack ops it was pointed out repeatedly that > very often the stack operands would be optimized to register > operands, meaning they wouldn't need to do the stack ops at all > really. So I took a look at a register based MISC design. > Guess what, I don't see the disadvantage! I have pushed this > around for a couple of days and although I haven't done a > detailed design, I think I have looked at it enough to realize > that I can design a register oriented MISC CPU that will run as > fast, if not faster than my stack based design and it will use > fewer instructions. I still need to add some features like > support for a stack in memory, in other words, > pre-increment/post-decrement (or the other way around...), but I > don't see where this is a bad design. It may end up using > *less* logic as well. My stack design provides access to the > stack pointers which require logic for both the pointers and > muxing them into the data stack for reading. > > I guess looking at other peoples designs (such as Chuck's) has > changed my perspective over the years so that I am willing and > able to do optimizations in ways I would not have wanted to do > in the past. But I am a bit surprised that there has been so > much emphasis on stack oriented MISC machines which it may well > be that register based MISC designs are also very efficient, > at least if you aren't building them to service a C compiler or > trying to match some ideal RISC model. >
Are those your actual results or did you just reiterate what is on Wikipedia? Yes, that's a serious question. Read the MISC page: http://en.wikipedia.org/wiki/Minimal_instruction_set_computer See ... ?! Code density is a CISC concept. I don't see how it applies to your MISC project. Increasing code density for a MISC processor means implementing more powerful instructions, i.e., those that do more work, while minimizing bytes in the instruction opcode encoding. Even if you implement CISC-like instructions, you can't forgo the MISC instructions you already have in order to add the CISC-like instructions. So, to do that, you'll need to increase the size of the instruction set, as well as implement a more complicated instruction decoder. I.e., that means the processor will no longer be MISC, but MISC+minimal CISC hybrid, or pure CISC... No offense, but you seem to be "reinventing the wheel" in terms of microprocessor design. You're coming to the same conclusions that were found in the 1980's, e.g., concluding a register based machine can perform better than a stack based machine, except you've applied it to MISC in an FPGA package... How is that a new conclusion? Also, please read up on CISC and RISC: http://en.wikipedia.org/wiki/Reduced_instruction_set_computing http://en.wikipedia.org/wiki/Complex_instruction_set_computing You posted to two groups. Which group was the " ... thread about the relative 'cost' of stack ops vs memory accesses ..." posted on? comp.lang.forth? comp.arch.fpga? I can look it up, but I'd rather not. Also, you cross-posted to comp.arch.fpga. While they'll likely be familiar with FPGAs, most there are not going to be familiar with the features of stack-based processors or Forth processors that you discuss indirectly within your post. They might not be familiar with ancient CISC concepts such as "code density" either, or understand why it was important at one point in time. E.g., I suspect this Forth related stuff from above won't be widely understood on c.a.f. without clarification: "peoples[sic] designs (such as Chuck's)" - various Forth processors by Charles Moore "the data stack and the return stack." - interpreted Forth machine model Rod Pemberton
On 3/30/2013 3:20 AM, Arlet Ottens wrote:
> On 03/29/2013 10:00 PM, rickman wrote: >> I have been working with stack based MISC designs in FPGAs for some >> years. All along I have been comparing my work to the work of others. >> These others were the conventional RISC type processors supplied by the >> FPGA vendors as well as the many processor designs done by individuals >> or groups as open source. >> >> So far my CPUs have always ranked reasonably well in terms of speed, but >> more importantly to me, very well in terms of size and code density. My >> efforts have shown it hard to improve on code density by a significant >> degree while simultaneously minimizing the resources used by the design. >> Careful selection of the instruction set can both improve code density >> and minimize logic used if measured together, but there is always a >> tradeoff. One can always be improved at the expense of the other. >> > > I once made a CPU design for an FPGA that had multiple stacks. There was > a general purpose stack "A", two index stacks "X" and "Y", and a return > stack "R". ALU operations worked between A and any other stack, so they > only required 2 bits in the opcode. There was also a move instruction > that could move data from a source to a destination stack.
Let's see, this would be a hybrid really, between register based and stack based as it has multiple stacks which must be selected in the instruction set like registers, just not many.
> Having access to multiple stacks means you spend less time shuffling > data on the stack. There's no more need for swap, over, rot and similar > stack manipulation instructions. The only primitive operations you need > are push and pop.
There is the extra hardware required to implement multiple stacks. Why have quite so many? I use the return stack for addresses. That works pretty well. Maybe A, X and R?
> For instance, I had a load instruction that could load from memory using > the address in the X stack, and push the result on the A stack. The cool > part is that the X stack itself isn't changed by this operation, so the > same address can be used multiple time. So, you could do a > > LOAD (X) ; load from (X) and push on A > 1 ; push literal on A > ADD ; add top two elements of A > STORE (X) ; pop A, and store in (X) > > to increment a location in memory.
But you aren't quite done yet, at least if you care about stack overflows. Chuck's designs don't care. If he has data on the bottom of the stack he can just leave it. Otherwise you need to drop the address on the X stack. I looked at this when designing my MISC processor. I ended up with two fetch and two stores. One just does the fetch or store and pops the address. The other does a post increment and retains the address to be used in a loop. This one would have to be dropped at the end.
> And if you wanted to increment X to access the next memory location, > you'd do: > > 1 ; push literal on A > ADD X ; pop X, pop A, add, and push result on A. > MOVE A, X ; pop A, and push on X
Add an autoincrement to the index stacks and these three instructions go away.
> It was an 8 bit architecture with 9 bit instructions (to match the FPGA > block RAM + parity bit). Having 9 bit instructions allows an 8 bit > literal push to be encoded in 1 instruction.
I was all about 9 bit instructions and 18 bit address/data bus sizes until I started working with the iCE40 parts which only have 8 bit memory. I think the parity bit is a comms industry thing. That one bit makes a *big* difference in the instruction set capabilities.
> Feel free to e-mail if you want more details.
Thanks. -- Rick
On 03/30/2013 10:54 PM, Rod Pemberton wrote:

>> >> I guess looking at other peoples designs (such as Chuck's) has >> changed my perspective over the years so that I am willing and >> able to do optimizations in ways I would not have wanted to do >> in the past. But I am a bit surprised that there has been so >> much emphasis on stack oriented MISC machines which it may well >> be that register based MISC designs are also very efficient, >> at least if you aren't building them to service a C compiler or >> trying to match some ideal RISC model. >> > > Are those your actual results or did you just reiterate what is on > Wikipedia? Yes, that's a serious question. Read the MISC page: > http://en.wikipedia.org/wiki/Minimal_instruction_set_computer > > See ... ?!
It sounds to me rickman is questioning the (unsupported) claims on wikipedia that stack based machines have an advantage in size and/or simplicity, not reiterating them.
> Code density is a CISC concept. I don't see how it applies to > your MISC project. Increasing code density for a MISC processor > means implementing more powerful instructions, i.e., those that do > more work, while minimizing bytes in the instruction opcode > encoding. Even if you implement CISC-like instructions, you can't > forgo the MISC instructions you already have in order to add the > CISC-like instructions. So, to do that, you'll need to increase > the size of the instruction set, as well as implement a more > complicated instruction decoder. I.e., that means the processor > will no longer be MISC, but MISC+minimal CISC hybrid, or pure > CISC...
It is perfectly possible to trade one type of MISC processor for another one. The choice between stack and register based is an obvious one. If you switch from stack to register based, there's no need to keep stack manipulation instructions around.
> Also, you cross-posted to comp.arch.fpga. While they'll likely be > familiar with FPGAs, most there are not going to be familiar with > the features of stack-based processors or Forth processors that > you discuss indirectly within your post. They might not be > familiar with ancient CISC concepts such as "code density" either, > or understand why it was important at one point in time. E.g., I > suspect this Forth related stuff from above won't be widely > understood on c.a.f. without clarification:
The design of simple and compact processors is of great interest to many FPGA engineers. Plenty of FPGA designs need some sort of control processor, and for cost reduction it's important to use minimal resources. Like rickman said, this involves a careful balance between implementation complexity, speed, and code density, while also considering how much work it is to write/maintain the software that's running on the processor. Code density is still critically important. Fast memory is small, both on FPGA as well as general purpose processors.
On 03/31/2013 12:00 AM, rickman wrote:

>> I once made a CPU design for an FPGA that had multiple stacks. There was >> a general purpose stack "A", two index stacks "X" and "Y", and a return >> stack "R". ALU operations worked between A and any other stack, so they >> only required 2 bits in the opcode. There was also a move instruction >> that could move data from a source to a destination stack. > > Let's see, this would be a hybrid really, between register based and > stack based as it has multiple stacks which must be selected in the > instruction set like registers, just not many.
Exactly. And as a hybrid, it offers some advantages from both kinds of designs.
> > >> Having access to multiple stacks means you spend less time shuffling >> data on the stack. There's no more need for swap, over, rot and similar >> stack manipulation instructions. The only primitive operations you need >> are push and pop. > > There is the extra hardware required to implement multiple stacks. Why > have quite so many? I use the return stack for addresses. That works > pretty well. Maybe A, X and R?
I had all stacks implemented in the same block RAM, just using different sections of it. But you are right, in my implementation I had reserved room for the Y stack, but never really implemented it. Just using the X was sufficient for the application I needed the CPU For. Of course, for other applications, having an extra register/stack that you can use as an memory pointer could be useful, so I left the Y register in the instruction encoding. For 3 registers you need 2 bits anyway, so it makes sense to allow for 4.
> But you aren't quite done yet, at least if you care about stack > overflows. Chuck's designs don't care. If he has data on the bottom of > the stack he can just leave it. Otherwise you need to drop the address > on the X stack.
Correct, if you no longer need the address, you need to drop it from the stack. On the other hand, if you let it drop automatically, and you need it twice, you would have to dup it. Intuitively, I would say that in inner loops it would be more common that you'd want to reuse an address (possibly with offset or autoinc/dec).
> I was all about 9 bit instructions and 18 bit address/data bus sizes > until I started working with the iCE40 parts which only have 8 bit > memory. I think the parity bit is a comms industry thing. That one bit > makes a *big* difference in the instruction set capabilities.
Agreed. As soon you use the parity bits, you're tied to a certain architecture. On the other hand, if you're already chose a certain architecture, and the parity bits are available for free, it can be very advantageous to use them.
In comp.arch.fpga rickman <gnuarm@gmail.com> wrote:

(snip)
>> Well, much of the idea of RISC is that code density isn't very >> important, and that many of the more complicated instructions made >> assembly language programming easier, but compilers didn't use them.
> I am somewhat familiar with VLIW. I am also very familiar with > microcode which is the extreme VLIW. I have coded microcode for an I/O > processor on an attached array processor. That's like saying I coded a > DMA controller in a DSP chip, but before DSP chips were around.
Well, yes, VLIW is just about like compiling from source directly to microcode. The currently in production VLIW processor is Itanium. I believe 128 bit wide instructions, which specify many different operations that can happen at the same time. 128 general and 128 floating point registers, 128 bit wide data bus, but it uses a lot of power. Something like 100W, with Icc of 100A and Vcc about 1V. I have an actual RX2600 dual Itanium box, but don't run it very often, mostly because of the power used. (snip)
>> But every level of logic adds delay. Using a wide bus to fast memory >> is more efficient that a complicated decoder. But sometimes RISC went >> too far. In early RISC, there was the idea of one cycle per instruction. >> They couldn't do that for multiply, so they added multiply-step, an >> instruction that you execute many times for each multiply operation. >> (And maybe no divide at all.)
> I"m not sure what your point is. What part of this is "too far"? This > is exactly the type of design I am doing, but to a greater extent.
The early SPARC didn't have a multiply instruction. (Since it couldn't be done in one cycle.) Instead, they did multiply, at least the Sun machines did, through software emulation, with a software trap. Early when I started using a Sun4/110 I generated a whole set of fonts for TeX from Metafont source, which does a lot of multiply. Unix (SunOS) keeps track of user time (what your program does) and system time (what the OS does while executing your program). Multiply counted as system time, and could be a large fraction of the total time.
>> For VLIW, a very wide instruction word allows for specifying many >> different operations at the same time. It relies on complicated >> compilers to optimally pack the multiple operations into the >> instruction stream.
> Yes, in theory that is what VLIW is. This is just one step removed from > microcode where the only limitation to how parallel operations can be is > the data/address paths themselves. The primary application of VLIW I > have seen is in the TI 6000 series DSP chips. But in reality this is > not what I consider VLIW. This design uses eight CPUs which are mostly > similar, but not quite identical with two sets of four CPU units sharing > a register file, IIRC. In reality each of the eight CPUs gets its own > 32 bit instruction stream. They all operate in lock step, but you can't > do eight FIR filters. I think of the four in a set, two are set up to > do full math, etc and two are able to generate addresses. So this is > really two CPUs with dual MACs and two address generators as it ends up > being used most of the time. But then they make it run at clocks of > over 1 GHz so it is a damn fast DSP and handles most of the cell phone > calls as part of the base station.
For Itanium, the different units do different things. There are instruction formats that divide up the bits in different ways to make optimal use of the bits. I used to have the manual nearby, but I don't see it right now. (snip)
> That is a result of the heavy pipelining that is being done. I like to > say my design is not pipelined, but someone here finally convinced me > that my design *is* pipelined with the execution in parallel with the > next instruction fetch, but there is never a need to stall or flush > because there are no conflicts.
Yes, that counts, but it gets much more interesting with pipelines like the Cray-1 that, after some cycles of latency, generate one result every clock cycle.
> I want a design to be fast, but not at the expense of complexity. That > is one way I think like Chuck Moore. Keep it simple and that gives you > speed.
(snip)
> In Forth speak a POP would be a DROP. That is not often used in Forth > really, or in my apps. I just wrote some code for my stack CPU and I > think there were maybe two DROPs in just over 100 instructions. I am > talking about the DUPs, SWAPs, OVERs and such. The end up being needed > enough that it makes the register design look good... at least at first > blush. I am looking at how to organize a register based instruction set > without expanding the size of the instructions. I'm realizing that is > one issue with registers, you have to specify them. But I don't need to > make the machine totally general like the goal for RISC. That is so > writing compilers is easier. I don't need to consider that.
For x87, they avoid the DUP, SWAP, and such by instructions being able to specify any register in the stack. You can, for example, add any stack register (there are eight) to the top of stack. I haven't thought about it for a while, but I believe either pushing the result, or replacing the previous top of the stack. (snip)
>> Well, the stack design has the advantage that you can use instruction >> bits either for a memory address or for the operation, allowing for much >> smaller instructions. But that only works as long as everything is in >> the right place on the stack.
> Yes, it is a tradeoff between instruction size and the number of ops > needed to get a job done. I'm looking at trimming the instruction size > down to give a workable subset for register operations.
Seems to me that one possibility is to have a really functional stack operation instruction, such that, with the give number of bits, it allows for the most opertions. Some combination of DUP, SWAP, and POP all at once. Though that isn't easy for the stack itself. -- glen
glen herrmannsfeldt wrote:


> Seems to me that much of the design of VAX was to improve code > density when main memory was still a very significant part of the > cost of a machine.
The original VAX series (780, then 730 and 750) and the uVAX I and uVAX II had no cache, so reducing rate of memory fetches was also important. The first cached machine I used was the uVAX III (KA650) and a good part of the performance increase was due to the cache. Jon