Superscalar Out-of-Order Processor on an FPGA

Started by Luke May 9, 2006
I've got a little hobby of mine in developing processors for FPGAs.
I've designed several pipelined processors and a multicycle processor.
Now I want to design a more modern processor.  To me, this means
superscalar, out of order execution, branch prediction, and data and
instruction cache.

I know that this type of processor isn't the best fit for the
technology.  Has anyone come across such a design for an FPGA?  I've
perused the net looking for them and have come up short.

I think a 4-Issue processor should be feasible on a Spartan 3 1000.
I've done an analysis on the major bits of hardware, and it seems quite
possible.

I've posted some links to papers I found on OOO scheduling algorithms
on my blog: http://bitstuff.blogspot.com.  The most space efficient
algorithm I found uses dependency chains and schedules instructions
into multiple FIFOs.
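For anyone curious, here's a rough Python model of that FIFO scheme (my own sketch of the idea, not code from the papers): each instruction is steered into the FIFO whose tail produced one of its operands, and only FIFO heads are candidates for issue, so the select logic scales with the FIFO count rather than the window size. All names and the single-writer-per-register assumption are illustrative.

```python
from collections import deque

NUM_FIFOS = 3

def schedule(instrs):
    """Steer instructions into dependency-chain FIFOs, then issue
    only from FIFO heads (up to one per FIFO per cycle).

    instrs: list of (dst, srcs) in program order.  Assumes each
    register is written at most once (sketch only).
    """
    fifos = [deque() for _ in range(NUM_FIFOS)]
    tail_dst = [None] * NUM_FIFOS      # register produced by each FIFO's tail

    for dst, srcs in instrs:
        # Steer: chain onto a FIFO whose tail feeds one of our sources,
        # otherwise take the shortest FIFO.
        for i in range(NUM_FIFOS):
            if tail_dst[i] is not None and tail_dst[i] in srcs:
                target = i
                break
        else:
            target = min(range(NUM_FIFOS), key=lambda i: len(fifos[i]))
        fifos[target].append((dst, srcs))
        tail_dst[target] = dst

    # Issue: each cycle, a FIFO head issues if its sources are ready.
    ready = {s for _, srcs in instrs for s in srcs} - {d for d, _ in instrs}
    order = []
    while any(fifos):
        issued = [f.popleft() for f in fifos
                  if f and all(s in ready for s in f[0][1])]
        if not issued:
            break                      # stalled waiting on external results
        for dst, _ in issued:
            ready.add(dst)
            order.append(dst)
    return order
```

A chain like r1 -> r2 -> r4 lands in one FIFO and issues in order, while the independent r3 issues from another FIFO in parallel.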

Any ideas or thoughts?

The register file will likely be a bottleneck.  You will likely have to
time-multiplex the ports to get the required 8 read ports and 4 write
ports to support 4 issues/cycle.  Or I suppose you could make an RF
out of flops with as many ports as you need, but this would be huge.
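To make the time-multiplexing concrete, here's a behavioral sketch (mine, not from any actual design): it assumes the RF RAM can be cycled 4x per CPU clock, so two physical read ports swept four times present eight logical read ports, and one physical write port swept four times gives four logical writes per system clock.

```python
class MuxedRegFile:
    """Behavioral model of a time-multiplexed register file.

    Assumption: the underlying RAM runs at mux_factor times the
    system clock, so each physical port serves mux_factor logical
    ports per CPU cycle.
    """
    def __init__(self, nregs=32, phys_read_ports=2, mux_factor=4):
        self.regs = [0] * nregs
        self.phys = phys_read_ports
        self.mux = mux_factor

    def read_cycle(self, addrs):
        """One system clock: serve phys * mux logical reads by
        sweeping the physical read ports mux times."""
        assert len(addrs) == self.phys * self.mux
        out = []
        for sub in range(self.mux):                # fast sub-cycles
            chunk = addrs[sub * self.phys:(sub + 1) * self.phys]
            out.extend(self.regs[a] for a in chunk)
        return out

    def write_cycle(self, writes):
        """writes: list of (addr, data); one physical write port
        swept mux times handles mux writes per system clock."""
        for addr, data in writes[: self.mux]:
            self.regs[addr] = data
```

The flop-based alternative avoids the fast sub-cycles entirely but pays for every port in LUTs and muxes, which is why it blows up in area.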

Did you have anything specific in mind for the register file?

Stephen

Luke wrote:
> I've got a little hobby of mine in developing processors for FPGAs.
> I've designed several pipelined processors and a multicycle processor.
> Now I want to design a more modern processor.  To me, this means
> superscalar, out of order execution, branch prediction, and data and
> instruction cache.
>
> I know that this type of processor isn't the best fit for the
> technology.  Has anyone come across such a design for an FPGA?  I've
> perused the net looking for them and have come up short.
>
> I think a 4-Issue processor should be feasible on a Spartan 3 1000.
> I've done an analysis on the major bits of hardware, and it seems quite
> possible.
>
> I've posted some links to papers I found on OOO scheduling algorithms
> on my blog: http://bitstuff.blogspot.com.  The most space efficient
> algorithm I found uses dependency chains and schedules instructions
> into multiple FIFOs.
I don't consider OoO and the rest of it real progress anymore, at least once you get to multiple issue that isn't sustained, and especially not in an FPGA.  It mostly buys theoretical performance, and at GHz clock rates you mostly end up hitting the Memory Wall.  Since you won't get that clock rate, all the assumptions are different too.  On the other hand, if it's a prototype for an ASIC or full-custom design you never intend (or could never afford) to fab, then performance doesn't really matter so much.

If you really want some performance out of an FPGA, I'd stick to multithreading and latency hiding, like the Niagara; you can make such cores quite a bit smaller and fit a few on the chip too.  Then what to do with all those threads?

I suspect that if you fully implemented all the usual stuff that goes into an x86 OoO (branch prediction, register renaming, multilevel caches) but with a much simpler instruction set, say MIPS or DLX, the clock rate would be no better than 25MHz.  By going the opposite route and keeping the architecture as simple as possible, one can hit the max BlockRam cycle frequency in a multicycle design and probably get the same performance with almost no hardware.

Take a look at the Leon; it seems to be pretty well regarded.  Also, Sun did release the gate-level Verilog for their Niagara; you might be able to make some sense of it, but those cores aren't OoO.

The more interesting part is the MMU: do you go the same route and implement the usual caches with regular DRAM, or look at a more interesting DRAM like RLDRAM, which has about 10-20x more real random throughput and changes the whole memory/cache scene altogether?

I assume you have your copy of Computer Architecture: A Quantitative Approach by H & P around to guide you; it's pretty good in most respects.
> Any ideas or thoughts?
I am actually kind of surprised that no one has taken up the MMIX instruction set by D. Knuth.  It is well documented, should be completely free of legal issues, the design took in advice from many noted architects, and it isn't too far from the MIPS machine.  It would probably bring some notoriety to any implementors, especially if MMIX is actually taught in schools.

John Jakson
transputer guy
I'm pretty sure I'll be time-multiplexing the register file.  I at
least have to multiplex the write port.  Depending on timing and space
constraints, I may multiplex the read ports, clone the register file,
or some of both.

This shouldn't be _too_ much of a problem.  The Distributed RAM in the
Xilinx FPGAs is pretty fast, and the register file shouldn't be on a
critical path.  The register file write data will come straight from
the Re-Order Buffer with little or no logic in-between.  The register
file read data however will likely have more logic between it and a
flop.  If multiplexing the read ports puts it on the critical path,
then I can always clone the register file for each read port.  It's
most likely that the bypass logic will be on the critical path, so this
shouldn't be an issue.  (On second thought: I may need to register the
output of the register file, as the output will essentially be clocked
at 4x the system clock...)

I think that's the only data structure that I will need to
time-multiplex.  I've been analyzing the various data structures for
instruction scheduling and it looks like I will be able to neatly
partition their write ports into segments based on instruction issue
slots.  This will take up marginally more logic than regular
Distributed RAM (just a couple LUTs if it's placed well and uses built
in MUXes), and will be about the same speed.

On the other hand, the write-back bus will need to properly route
results to the correct partitions, potentially limiting throughput.
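Here's an illustration of that write-port partitioning (my sketch of the concept, not Luke's actual structure): if each issue slot owns one bank of the scheduling structure, every bank needs only a single write port, which maps directly onto single-port Distributed RAM.

```python
class SlotPartitionedTable:
    """Scheduling structure banked by issue slot.

    Assumption (illustrative): issue slot i only ever writes entries
    it allocated, so each bank needs just one write port; reads fan
    out across all banks.
    """
    def __init__(self, slots=4, entries_per_slot=16):
        self.banks = [[None] * entries_per_slot for _ in range(slots)]

    def write(self, slot, index, value):
        # Each issue slot writes only its own bank -> 1 write port per bank.
        self.banks[slot][index] = value

    def read(self, slot, index):
        return self.banks[slot][index]
```

The write-back routing problem mentioned above is exactly the part this model hides: results coming off a shared bus still have to be demultiplexed to the right bank.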

I actually did build a CPU for pure MHz speed.  It was a super fast
dual-issue CPU, but in order to get the high clock rates, I had to make
some serious trade-offs.

Number one tradeoff: the execution stage is split across two pipeline
stages, so there is at least one delay slot after every instruction
before the result can be used.  This version runs at 150MHz.  I have
another version with not much bypass hardware that makes it up to
180MHz.  But with three delay slots and only 8 registers per issue
slot, scheduling becomes a major issue.

Number two: 16-bit architecture.  Addition actually takes a long time
using ripple-carry in an FPGA, and there's really no way around it.
A 16-bit add is still fast, so that's what it gets.  It's also 16-bit
to cut down on utilization.

Number three: some arithmetic instructions are split in two.  For
example, shifts and 32-bit additions are each split into two
instructions.  I simply could not afford the logic and delay of doing
these in one instruction.
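The 32-bit split works out to the classic add-with-carry idiom.  A sketch of what the two instructions compute (instruction names are my invention):

```python
# Splitting a 32-bit add into two 16-bit instructions: an "add low"
# that produces a carry flag, and an "add high" that consumes it.
MASK16 = 0xFFFF

def add_lo(a_lo, b_lo):
    """First instruction: 16-bit add of the low halves.
    Returns (16-bit sum, carry-out flag)."""
    s = a_lo + b_lo
    return s & MASK16, s >> 16

def add_hi(a_hi, b_hi, carry):
    """Second instruction: 16-bit add of the high halves plus carry."""
    return (a_hi + b_hi + carry) & MASK16

def add32(a, b):
    """What the pair computes together: a full 32-bit add."""
    lo, c = add_lo(a & MASK16, b & MASK16)
    hi = add_hi(a >> 16, b >> 16, c)
    return (hi << 16) | lo
```

Splitting it this way keeps each instruction's carry chain at 16 bits, which is what holds the cycle time down.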

Number four: 16-bit addressing.  Same deal with addition: it takes too
long, and I don't want to extend the delay slots any further, so I have
16-bit addressing only.  Also, instructions were 16 bits wide to cut
down on logic and keep things "simple".

So besides being a total pain in the butt to schedule and program, it
really is rocket-fast.  At its very worst it is 2 times faster than a
32-bit pipelined processor I designed, and at its best it is 10 times
faster.  With decent scheduling and a superb compiler or hand coding,
it should be able to sustain 5-8 times the performance.

The other advantage is that I could put 12 of them on a Spartan 3
1000.  Theoretically, I could get the performance of a really crappy
modern computer with these things.

And now I come back to reality.  It's such a specialized system, and
the memory architecture, the ISA, and the system as a whole are a
mess.  Yes, it's super fast, but so what?  I would be much better off
designing custom pipelined logic for a given task rather than using
this gimp of a CPU.

So that's why I'm designing a "modern" processor.  It's a general
purpose CPU that could run real software such as Linux.  It's that much
more useful ;)

Also, anyone can create a fast, simple processor.  What's important is
proper balance.  I do agree that OOO and all that is not suitable for
an FPGA.  But it sure is fun!

I did some of the same things as your 16b design but with some
differences.

I used 2 clocks per microcycle at 300MHz or so on V2Pro and well under
200MHz on SP3, so that the BlockRam or a 16-bit add were the critical
paths, or 3 levels of LUT logic in the control.

The 32b ISA therefore gets 4 ports out of a BlockRam with a large
register set, up to 64 regs, but encodes the ops in variable-length 16b
opcodes, using prefixes to stretch address widths 3b at a time for the
3 register fields or to set up a literal in 8b chunks.  I was always
partial to the old 9900 and 68000.

The pipeline is quite long since it runs 4-way threads and therefore
has no hazard or forwarding logic.  The MTA logic allows the
instruction decode to occur over many pipeline stages, since only
every 4th pair belongs to a given thread.  The BlockRam also holds a
per-thread instruction queue reminiscent of the 8086, and the IFetch
unit has another tiny queue to reassemble 16b words into the final
1..4 x 16b opcode.
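The reason the 4-way threading eliminates hazard and forwarding logic can be shown in a few lines (my reading of the scheme, with an assumed pipeline depth for illustration): with threads issued round-robin, one thread's consecutive instructions are always at least NUM_THREADS cycles apart, so as long as that spacing covers the write-back latency, a RAW hazard can never occur.

```python
PIPE_DEPTH = 4      # assumed write-back latency, for illustration
NUM_THREADS = 4

def fetch_slots(n_cycles):
    """Which thread occupies the fetch slot each cycle (round-robin)."""
    return [c % NUM_THREADS for c in range(n_cycles)]

def raw_hazard_possible():
    """A RAW hazard needs a thread's next instruction to enter the
    pipe before its previous result writes back; with
    NUM_THREADS >= PIPE_DEPTH that spacing is always sufficient,
    so no forwarding or interlock logic is needed."""
    return NUM_THREADS < PIPE_DEPTH
```

The trade, of course, is that single-thread latency is fixed: one thread only ever gets every 4th slot, hazards or not.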

Since the PEs only run RRR codes in 2 clocks and average Bcc or Ld/St
in 4 or sometimes 6 clocks, the branch latency covers the cc decode,
so no prediction is needed.  Bcc inside the queue are also 2 clocks.
The Ld/St codes go out to an MMU that hashes the 32b address with a
32b object id into RLDRAM on 8-word line blocks.  The idea there is
that the RLDRAM interleaved throughput can be shared over 10 or so of
these PEs, which leaves me with 40-odd threads.  The RLDRAM 20ns
latency is easily covered by the MTA pipeline, so Ld/St see only a
slight Memory Wall effect over effectively the whole DRAM address
space, and no SRAM Dcache is needed.

So instead of a Memory Wall, I have a Thread Wall: use 'em or lose 'em,
~40 * ~25 MIPS.  The MMU also implements the remainder of the
Transputer process scheduler plus DMA, message passing, and links, so
that's where most of the real fun is, and much remains to be done.

Of course it only really makes sense if you want to run lots of
processes, as any Transputer user would.  The basic version was done
25 years ago, so I guess it's not a modern processor yet.

Post your results when you have some; you might be the first to tackle
OoO SS on an FPGA.  I haven't heard of anything else.

John Jakson
transputer guy

That's pretty slick!  I like it.

I've got a blog where I write random ideas and notes about my projects:
http://bitstuff.blogspot.com

You'll either find it incredibly dull, or it'll pique your interest.
Most people I know couldn't care less about this stuff, so I appreciate
the dialogue and comments.

"Luke" <lvalenty@gmail.com> writes:
> Number one tradeoff, the execution stage is two stages and there is at
> least one delay slot after every instruction before the result can be
> used.  This version runs at 150MHz.  I have another version with not
> much bypass hardware that makes it up to 180MHz.
I must be confused. Since your two-stage 150 MHz processor needed a delay slot for all access to results, it must not have used any bypassing. So are you telling us that adding some ("not much") bypass hardware sped it up by 20%? That seems counterintuitive.
> Number two: 16-bit architecture.  Addition actually takes a long time
> using ripple-carry in an FPGA, and there's really no way around it.
Actually, I've implemented wide carry-select adders using carry
lookahead in the Spartan 3.  So there is a "way around it", but it
probably won't help for a 16-bit adder.

Eric
I must not have been very clear.  The 180MHz version had no bypassing
logic whatsoever; it had three delay slots.  The 150MHz version did
have bypassing logic; it had one delay slot.

I read up on carry lookahead for the Spartan 3, and you're correct, it
wouldn't help for 16 bits.  In fact, it's slower than just using the
dedicated carry logic.
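For readers unfamiliar with the technique Eric mentions, here's the carry-select idea in model form (a behavioral sketch, not an HDL implementation): compute the upper half of the sum twice, once per possible carry-in, in parallel with the lower half, then mux on the real carry. It trades roughly double the adder area for a carry chain half as long, which is why it only pays off at wider widths.

```python
def carry_select_add(a, b, width=32, half=16):
    """32-bit add built from three half-width adders plus a mux."""
    mask = (1 << half) - 1

    lo = (a & mask) + (b & mask)                 # low half, real carry-in 0
    carry = lo >> half                           # carry out of the low half

    hi0 = ((a >> half) + (b >> half)) & mask     # upper half, assuming carry-in 0
    hi1 = ((a >> half) + (b >> half) + 1) & mask # upper half, assuming carry-in 1

    hi = hi1 if carry else hi0                   # select on the actual carry
    return ((hi << half) | (lo & mask)) & ((1 << width) - 1)
```

In an FPGA the mux itself costs LUT delay, which is why, as Luke found, the dedicated carry chain wins at 16 bits.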

Luke wrote:
> I must not have been very clear.  The 180MHz version had no bypassing
> logic whatsoever.  It had three delay slots.  The 150MHz version did
> have bypassing logic, it had one delay slot.
>
> I read up on carry lookahead for the Spartan 3, and you're correct, it
> wouldn't help for 16 bits.  In fact, it's slower than just using the
> dedicated carry logic.
I also used a 32b CSA design in the design prior to the one described above.  It worked but was pretty expensive: IIRC it gave me 32b adds in the same cycle time as a 16b ripple add, but it used up 7 of the 8b ripple-add blocks and needed 2 extra pipe stages to put the CSA select results back together and combine that with the CC status logic.  The 4-way MTA just about took it without too much trouble, but it also needed hazard and forwarding logic.

A funny thing I saw was that doubling the adders added more capacitive load to the pipeline FFs, so those had to be duplicated as well; the final cost of that 32b datapath was probably 2 or 3x bigger than a plain ripple datapath, and much harder to floorplan in P&R.

What really killed that design was an interlock mechanism meant to prevent the IFetch and IExec blocks from ever running code from the same thread at the same time.  That one little path turned out to be 3 or 4x longer than the 32b add time, and no amount of redesign could make it go away; all that trouble for nought.  The lesson learned was that a complex architecture with any interlocks usually gets hammered on these paths that don't show up until the major blocks are done.  The final state of that design was around 65MHz when I thought I would hit 300MHz on the datapath, and the total logic was about 3x the current design.  Not much was wasted though; much of the conceptual design got rescued in simpler form.

In an ASIC this is much less of a problem, since transistor logic is relatively much faster per clock frequency than FPGAs; it would have been more like 20 gates, and of course the major CPU designers can throw bodies at such problems.

I wonder how you will get the performance you want without finding an Achilles' heel once the later parts are done.  You have to finish the overall logic design before committing to designing specific blocks, and it ends up taking multiple iterations.
When I said 25MHz in the first post, I meant that to reflect these sorts of critical paths that can't be foreseen until you're done, rather than the datapaths.  That's why I went to the extreme MTA solution: abolish or severely limit almost all the variables, make it look like a DSP engine, and you can't fail.

Curiously, how do you prototype the architecture: in cycle-accurate C, or do you go straight to HDL simulation?

Anyway, have fun.

John Jakson
transputer guy

The details are at wotug.org if interested.