Superscalar Out-of-Order Processor on an FPGA

Started by Luke May 9, 2006
Luke wrote:
> I actually did build a CPU for pure MHz speed. It was a super-fast
> dual-issue CPU, but in order to get the high clock rates, I had to make
> some serious trade-offs.
>
> Number one: the execution stage is two stages, and there is at
> least one delay slot after every instruction before the result can be
> used. This version runs at 150MHz. I have another version without
> much bypass hardware that makes it up to 180MHz. But with three delay
> slots and only 8 registers per issue slot, scheduling becomes a major
> issue.
>
> Number two: 16-bit architecture. Addition actually takes a long time
> using ripple-carry in an FPGA, and there's really no way around it.
> 16-bit is pretty easy to add, so that's what it gets. It's also 16-bit
> to cut down on utilization.
>
> Number three: some arithmetic instructions are split in two. For
> example, shift instructions and 32-bit addition are split into two
> instructions. I simply could not afford the logic and delay of doing
> these with one instruction.
>
> Number four: 16-bit addressing. Same deal with addition: it takes too
> long, and I don't want to extend the delay slots any further, so I have
> 16-bit addressing only. Also, instruction sizes were 16 bits to cut
> down on logic and keep things "simple".
>
> So besides being a total pain in the butt to schedule and program, it
> really is rocket-fast. At its very worst it is 2 times faster than a
> 32-bit pipelined processor I designed, and at its best it is 10 times
> faster. With decent scheduling and a superb compiler or hand coding,
> it should be able to sustain 5-8 times faster.
>
> The other advantage is that I could put 12 of them on a Spartan-3 1000.
> Theoretically, I could get the performance of a really crappy modern
> computer with these things.
>
> And now I come back to reality. It's such a specialized system, and
> the memory architecture, ISA, and the whole system all around is a mess.
> Yes, it's super fast, but so what?
Well, to some users, that is important.
> I would be so much better off just
> designing custom pipelined logic to do something rather than this gimp
> of a CPU.
>
> So that's why I'm designing a "modern" processor. It's a general-purpose
> CPU that could run real software such as Linux. It's that much
> more useful ;)
Sounds more like a microprocessor, whereas the first one is more like a microcontroller. There is room for both, so don't throw the first one away! With a small, nimble core, you have the option to deploy more than one, and in an FPGA that's where soft CPUs can run rings around other solutions. How much RAM and how many registers could the 16-bit one access?

-jg
Luke wrote:
> I've got a little hobby of mine in developing processors for FPGAs.
> I've designed several pipelined processors and a multicycle processor.
I wonder how you manage to find enough time to do all of this! I've only found time during my undergraduate years to build two processors, both with an excuse:

- A multicycle (>= 12!) processor for a 2nd-year digital logic project (16-bit, 13MHz on an Altera EPF10K70)
- A pipelined processor (no I/O though, so useless) built for clock speed as a summer research project at my university (32-bit, 250MHz on an Altera Stratix EP1S40)

I haven't ever found enough time otherwise!
I usually just go straight to HDL.  I like to prototype different
pieces of hardware in Verilog while I'm still designing the overall
architecture so that I have a better idea of what exactly I'm trading
off in terms of clock speed and utilization when I choose one design
over another.  It really does end up being an iterative process.  The
final design will be done in VHDL.

I'm shooting for 50MHz, but I have a feeling the bypass logic simply
won't be able to go that fast.  The bypass logic ends up being the
critical path for all of the processors of this type that I've looked
at, so I'll be paying special attention to it.  Once I get the first
stage implemented (basically just a dataflow processor), I'll be able
to experiment with adding an extra pipeline stage in the bypass logic
(kinda defeats the purpose, ohh well).  That'll probably lower the
overall IPC, but it may or may not increase the clock frequency enough
to be an advantage.  So hopefully I'll have the critical path laid out
at an early enough stage that I will be able to do something about it
without too much grief.

Then again, there will still be plenty of opportunities for other
critical paths to pop up that I hadn't thought of.  I think 25MHz is
probably a reasonable estimate.

JJ wrote:
> Luke wrote:
>
>> I must not have been very clear. The 180MHz version had no bypassing
>> logic whatsoever. It had three delay slots. The 150MHz version did
>> have bypassing logic; it had one delay slot.
>>
>> I read up on carry lookahead for the Spartan-3, and you're correct, it
>> wouldn't help for 16 bits. In fact, it's slower than just using the
>> dedicated carry logic.
>
> I also used a 32b CSA design in the design prior to the one described
> above. It worked but was pretty expensive. IIRC it gave me 32b adds in
> the same cycle time as a 16b ripple add, but it used up 7 of the 8b
> ripple-add blocks and needed 2 extra pipe stages to put the CSA select
> results back together, combined with the CC status logic. The 4-way MTA
> just about took it without too much trouble, but it also needed hazard
> and forwarding logic. A funny thing I saw was that doubling the adders
> added more capacitive load to the pipeline FFs, and those had to be
> duplicated as well, so the final cost of that 32b datapath was probably
> 2 or 3x bigger than a plain ripple datapath and much harder to
> floorplan in P&R.
>
> What really killed that design was an interlock mechanism designed to
> prevent the I-Fetch and I-Exec blocks from ever running code from the
> same thread at the same time. That one little path turned out to be 3
> or 4x longer than the 32b add time, and no amount of redesign could
> make it go away; all that trouble for nought. The lesson learned was
> that a complex architecture with any interlocks usually gets hammered
> on these paths that don't show up till the major blocks are done. The
> final state of that design was around 65MHz when I thought I would hit
> 300MHz on the datapath, and the total logic was about 3x the current
> design. Not much was wasted though; much of the conceptual design got
> rescued in simpler form.
> In an ASIC this is much less of a problem, since transistor logic is
> relatively much faster per clock freq than FPGAs; it would have been
> more like 20 gates, and of course the major CPU designers can throw
> bodies at such problems.
>
> I wonder how you will get the performance you want without finding an
> Achilles heel till the later part is done. You have to finish the
> overall logic design before committing to designing specific blocks,
> and it ends up taking multiple iterations. When I said 25MHz in the
> 1st post I meant that to reflect these sorts of critical paths that
> can't be foreseen till you're done, rather than the datapaths. That's
> why I went to the extreme MTA solution to abolish or severely limit
> almost all variables; make it look like a DSP engine and you can't fail.
>
> Curiously, how do you prototype the architecture: in cycle C, or go
> straight to HDL simulation?
>
> Anyway have fun
>
> John Jakson
> transputer guy
>
> the details are at wotug.org if interested
I personally use paper and pen. I draw the datapath functionality and think about how it will map into Xilinx FPGA structures while drawing and designing. I like to have large papers so I can draw timing diagrams on the same page. Only when I have a design which I believe would be reasonable do I start to code.

When I think more about it, I realize that my most used design tool is still paper and pen.

Göran
Göran Bilski wrote:

> When I think more about it then I realize that my most used design tool
> is still the paper and pen.
Dinosaur ;-)

Regards
Falk
The 16-bit cpu could access up to 256K of RAM and had 16 16-bit
registers.  I actually am able to fit 12 of these on one spartan 3 1000
chip, limited only by available block ram.  It was nifty, but if I
really needed that much specialized processing power it would make more
sense to build some custom logic in the FPGA to do it.

Göran Bilski wrote:
> JJ wrote:
>> Curiously, how do you prototype the architecture: in cycle C, or go
>> straight to HDL simulation?
>>
>> Anyway have fun
>>
>> John Jakson
>> transputer guy
>>
>> the details are at wotug.org if interested
>
> I personally use paper and pen.
> I draw the datapath functionality and think about how they will map into
> Xilinx FPGA structures while drawing and designing.
> I like to have large papers so I can draw timing diagrams on the same page.
> Only when I have some design which I believe would be reasonable I start
> to code.
>
> When I think more about it then I realize that my most used design tool
> is still the paper and pen.
>
> Göran
Aha, I kinda do the opposite: I draw lots of tiny fragments on small notepad sheets and code up the model in cycle C, in a Verilog subset or style. If I use larger sheets, I am afraid I will have to redraw the whole thing too many times to keep it looking perfect. For a while I even stooped to ASCII graphics over a "...." background; once things are settling down, at least those can be edited in small blocks, though some edits are torture (better for regular paths). I rediscovered Canvas for Mac & Windows, an excellent older 2D drawing tool; that's all it can do, but it's far better than the OpenOffice drawing tool.

Cycle C tells me right away that the architecture does what it's supposed to, possibly with millions of cycles of testing and analysis of different approaches, but it doesn't give me any warnings about critical paths until it's too late. The notepad drawings, though, lay out the C code fragments as TTL/LUT-level schematics, so I usually have an idea of all the path lengths. The cycle C and the Verilog are laid out in the same format, and sometimes I can't tell which is which, although one has to be careful with assigns: in C they are sequential, in Verilog they are not.

One nice aspect of the C approach is that parts of the CPU expressed in cycle C also have a faster behavioural version that can be used by other applications such as the compiler or OS, as if they were running on a PC with that hardware feature available; for example the MMU, which in this case does memory allocation and user-level memory management. That means software can be written for the target and executed on a PC model of it even though the hardware design is incomplete.

I do recall my days at Inmos, where the Transputer was all over very large A1? sheets of paper, and on the walls and floors, at gate level, much of it before any gate-level simulation and decades before synthesis. I wouldn't want to go back to that again.

John Jakson
transputer guy
John,

I'm unfamiliar with cycle C, but your post has me interested.

Is this a specific simulator?  Or coding style?  Are tools freely /
cheaply available?

Thanks!
Stephen

Stephen Craven wrote:
> John,
>
> I'm unfamiliar with cycle C, but your post has me interested.
>
> Is this a specific simulator? Or coding style? Are tools freely /
> cheaply available?
>
> Thanks!
> Stephen
Google for various terms, e.g. <cycle C>; it's mostly ad hoc. There are many ways to do cycle C. It goes back to really being a poor man's simulator, although it has some huge advantages and disadvantages over real HDLs. In the far past it was Pascal, BCPL, even APL. I use Verilog because it is fairly close to C syntax, but the semantics are obviously very different. There are ways to merge the two languages into one system; in larger teams, people use the PLI interface, but that has many of its own drawbacks. Today there is also SystemVerilog; not sure where it's at.

There are, or were, several cycle C vendors all pushing the same idea. Most starved to death: the ASIC world doesn't really like to pay for C HDLs that are puny compared to real HDLs, and many EEs are perfectly able to devise their own cycle C if they need to, and have done so in many larger companies. It boils down to a duplication of coding effort. Sometimes it has to be done so that the software guys can have access to a functional model in their own language, or because the executable reference spec came from Matlab, Fortran (f2c), etc. and can be executed alongside a C model of the hardware.

Cycle C simulations can run models many orders of magnitude faster than HDL simulators, and to all intents C compilers and C CPU cycles are free, while most HDL simulators have restrictions and are much slower, though much more detailed and much better as HDLs. Synopsys bought out one Verilog simulator company, Chronologic, that had Verilog-to-C output that really sped things up; it was the fastest simulator out there, but it was a $50K item. That prompted me at one time to write HDL in a funky C HDL and put it through the C preprocessor. It let me use a nested-instances, HDL-like syntax, but it was really pretty awful. Other people have done it better many times.
I later went the other route and just wrote a Verilog-to-C compiler that accepted a restricted, single-cycle RTL subset of Verilog and produced fairly good C code, about half as fast as hand-written C; but since the source is Verilog, anyone could use it, with no mucking with low-level C. I should have done maintenance on it but have neglected it. For the Transputer design, the PEs are really only a few pages of equations, so I fell back on the HDL C with less ambition: no nesting, plus a few macros that put "assign", "always", "begin" etc. into the code, which translate into mostly null or { } tokens.

The real goal is to go back to the old V2C compiler and combine the front end with the C compiler that is being developed, which merges subsets of event-based Verilog with C++. The merged language will take a C++-like class declaration and add signal ports as well as assigns and always blocks to it. So Verilog module declarations become

    process pname (port declarations) {
        // std C statements and declarations
        // limited Verilog signal declarations
        // assign ..
        // always ...
        // some other Verilog stuff
    }

That will be some time off. I have no idea how much of the C++ OO stuff makes any sense beyond improved C. This is to replace the occam language the Transputer originally ran. One can figure the simulator event wheel is now inside the Transputer MMU core, so the compiler has to map Verilog constructs onto specialized instructions. Of course the behavioural code for the MMU and scheduler running on a PC could as well be called a simulator engine.

John Jakson
transputer guy
Luke wrote:

> The 16-bit cpu could access up to 256K of RAM and had 16 16-bit
> registers.
Could you page/tile the registers into a block RAM?
> I actually am able to fit 12 of these on one spartan 3 1000
> chip, limited only by available block ram. It was nifty, but if I
> really needed that much specialized processing power it would make more
> sense to build some custom logic in the FPGA to do it.
It is a good idea to have a number of options for any problem. Custom logic is great for non-data-intensive HW fabric, but often you have things at the device-driver level, where a small core makes better use of the FPGA than rolling the whole thing in logic. A faster small core means one can do more in the core before needing to move to logic. What's needed is really "the smallest core an HLL can work on". Some good work is being done around Pico/PacoBlaze.

-jg