FPGARelated.com
Forums

Implementing five stage pipeline

Started by vssumesh October 15, 2005
Hi all,
Thinking of a 5 stage pipeline risc.
1. fetch
2. decode
3. execute
4. buffer
5. write back
  The result of the execution stage is buffered at the +ve edge of the
buffer cycle. This works if we enable data forwarding: the next
instruction gets the updated values from the buffer register at its
execution stage, and the buffered data is placed in memory only at the
write back stage.
  My doubt is this: if that is true, where will we buffer the output of
the execution stage of the second instruction at the +ve edge of its
buffer cycle, as the buffer is still holding the result of the previous
instruction?
  I am a beginner at this type of thing. This is similar to the ARM9
pipeline. What is their way of tackling this situation?

Somebody please advise me on this issue, as I am still wondering about
what to do.
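
For readers following the question: in a conventional in-order pipeline, each stage boundary has its own register and every one of them is reloaded on every clock edge. The sketch below is a minimal Python model under that assumption, not a description of the ARM9 itself. The stage names come from the list in the original post; the function name and the instruction labels i1..i3 are made up for illustration. It shows the buffer stage handing the first result to write back on the same edge at which it captures the second result.

```python
STAGES = ["fetch", "decode", "execute", "buffer", "write back"]

def simulate(instructions, cycles):
    """Return, per clock cycle, which instruction occupies each stage."""
    pipeline = [None] * len(STAGES)   # one slot (pipeline register) per stage
    pending = list(instructions)
    history = []
    for _ in range(cycles):
        # On each clock edge every stage hands its contents to the next one,
        # so the buffer stage is emptied on the same edge that refills it.
        incoming = pending.pop(0) if pending else None
        pipeline = [incoming] + pipeline[:-1]
        history.append(dict(zip(STAGES, pipeline)))
    return history

history = simulate(["i1", "i2", "i3"], cycles=7)
for t, row in enumerate(history):
    print(t, row)
```

In the cycle where i2 sits in the buffer stage, i1 has already moved on to write back, so the two results never compete for the same register in this model.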

vssumesh wrote:
> Hi all,
> Thinking of a 5 stage pipeline risc.
> 1. fetch
> 2. decode
> 3. execute
> 4. buffer
> 5. write back
>   The result of execution stage is buffered at the +ve edge of the
> buffer cycle. And this works if we enable the data forwarding method.
> And the next instruction will get the updated values from the buffer
> register at its execution stage. And the buffered data will be placed
> to memory only at the write back stage.
>   My doubt is if this is true where will we buffer the output of the
> execution stage of the second instruction at the +ve edge of its buffer
> cycle as the buffer is still holding the result of the previous
> instruction.
>   I am a beginner to these type of things. This is similar to the ARM9
> pipeline. What's their way of tackling this situation.
Really, it's a bit strange that a complete beginner would jump into the
deep end when in the past we went through some baby steps first. Are you
doing this in Verilog/VHDL or C or some other academic CPU design tool?
You do have the Hennessy-Patterson book, right (from MKP)?

It's not actually in this book, but one radical suggestion I have is to
throw this horrible design model away and do a multi threaded
architecture. You replace one set of cycle sucking performance limits
with a much simpler thread engine that ultimately boils down to a 2 or
so bit counter in state control. Such a design can run around 2x as fast
as the single threaded design at circuit levels, and that 2x can be
traded back for many simplifications to get back the same speed but with
much less hardware.

Definitely you don't need register forwarding or hazard detection logic
in MTA designs, but you do end up with a couple of threads for the end
user to deal with. Many more issues there.

John gmail or transputer2 at yahoo
But I just can't do that, because I am building this to imitate the ARM
pipeline.

I'm not familiar with ARM cores, so I can't help you emulate their
architecture.  My advice would be to read datasheets for ARM cores, if
they're available, but they're probably not.  It seems to me that if
they gave away their architecture (so the world knew it) when they
already "give away" their instruction set, they would have no product
anymore.  I suppose their implementation is probably pretty good, but
architecture is still a major part of the design.

My question for John though is what do you mean by a threaded
architecture?  I don't see how adding a second core will make the first
one run twice as fast.  It seems to me each "thread" needs to be
independent enough to have its own pipeline.  If each has its own
pipeline, register forwarding and hazard logic are going to be needed.
If you didn't have those I suppose you could just bubble the pipeline,
but that seems pretty wasteful.

The big problem I see with multi-core FPGA based processors is that
it's very easy to be memory bound in an FPGA.  Fetch from an SDRAM is
only so fast.  I know you can put several of them in parallel to
improve performance and I suppose that would do it, but the limits are
definitely close without some good caching schemes.  Unfortunately, it
seems that associative caches are very expensive to implement in an
FPGA.

-Arlen

The real problem here is that the H & P comp arch bible, and most books
that repeat the same material, don't teach anyone today how to do
anything that doesn't look like a DLX, so it's tough if you have to
figure it out yourself. This is especially true for MTA design, a much
overlooked technique.

Now MTA (multi threaded architecture) isn't even new; it goes back to
the 50s of the previous century (the one that starts with 1). The idea
is really simple and very familiar to DSP people, who do a lot of
transposition between parallel & serial DSP, bit wise v word wise.

I will describe a simple design that works well for me at 300MHz in a
V2Pro and uses fewer than 500 FF/LUT sites (not the 1000 typically
needed for 32b work), though some opcode decodes are not yet complete.

I use a single Blockram to hold 4 sets of state, 128 by 32 bit words
each. Each thread's set is further split, half for the register file
and half for the ICache. The RFile therefore gives 64 regs, and the
ICache or queue is 128 16 bit opcodes (or 64 32b opcodes or 256 8b
opcodes). Please don't ever do 8b opcodes!
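
To make the partitioning concrete, here is a minimal Python sketch of one possible address map for that Blockram. The sizes (4 threads, 128 words each, half registers and half ICache, 16 bit opcodes packed two per word) come from the description above. The ordering is purely an assumption for illustration: thread number in the upper address bits, registers in the lower half of each thread's block. The function names are hypothetical.

```python
THREADS = 4
WORDS_PER_THREAD = 128          # 32-bit words of state per thread
HALF = WORDS_PER_THREAD // 2    # 64 words each for registers and ICache

def reg_addr(thread, reg):
    """Word address of register `reg` (0..63) for `thread` (0..3)."""
    assert 0 <= thread < THREADS and 0 <= reg < HALF
    return thread * WORDS_PER_THREAD + reg

def icache_addr(thread, slot16):
    """Word address and halfword select for 16-bit opcode slot 0..127."""
    assert 0 <= thread < THREADS and 0 <= slot16 < 2 * HALF
    word = thread * WORDS_PER_THREAD + HALF + slot16 // 2
    return word, slot16 % 2

TOTAL_WORDS = THREADS * WORDS_PER_THREAD   # 512 words of 32 bits
```

With this layout all four threads fit in 512 words of 32 bits, which is the whole of one Blockram in its 32 bit wide configuration.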

The primary controller is a 3 bit counter counting through 8 states. b0
selects the odd or even phase of each instruction slot; b1 and b2
distinguish which thread is in effect.
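
As a rough illustration of that controller, the Python sketch below splits a 3 bit counter value into a thread number and a phase bit. The bit assignment (b0 as phase, b1 and b2 as thread) follows the description above; the function name, and the choice that phase 0 means "even", are assumptions.

```python
def decode_state(counter):
    """Split a 3-bit state counter into (thread, phase)."""
    counter &= 0b111             # 3-bit counter wraps after 8 states
    phase = counter & 1          # b0: even/odd half of the instruction slot
    thread = counter >> 1        # b2..b1: which of the 4 threads is active
    return thread, phase

sequence = [decode_state(t) for t in range(8)]
print(sequence)
```

Stepping the counter through its 8 states visits each of the 4 threads once, with two phases per visit, and then wraps around.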

The odd,even bit lets me do 32bit math over 2 clocks, 16b at a time. It
also lets me get 2 operand reads and a later write back paired with an
early opcode fetch in 2 cycles, so it's effectively 4 way ported. These
reads, writes and Ifetch are for 3 different threads though. All
Blockram accesses are 32b wide. The datapath takes 32b every other cycle
for the x,y inputs and 5 clocks later returns the 32b z result to the
Blockram on the opposite phase. On that same phase, another opcode pair
is fetched.
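
The "32bit math over 2 clocks, 16b at a time" idea can be sketched in Python as a 16 bit adder used twice, with the carry saved between phases. This is only a behavioural model under that assumption; the function name is made up, and the real datapath's handling of carries and flags is not described in the post.

```python
MASK16 = 0xFFFF

def add32_two_phase(x, y):
    """Add two 32-bit values using a 16-bit adder over two phases."""
    # Phase 0 (even clock): low halves, carry-in is 0.
    low = (x & MASK16) + (y & MASK16)
    carry = low >> 16
    # Phase 1 (odd clock): high halves plus the carry saved from phase 0.
    high = ((x >> 16) & MASK16) + ((y >> 16) & MASK16) + carry
    # Result wraps modulo 2**32, as a 32-bit register would.
    return ((high & MASK16) << 16) | (low & MASK16)
```

Because each phase only needs a 16 bit carry chain, the critical path per clock is roughly half that of a flat 32 bit adder, which is where the higher clock rate comes from.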

The big bang is that the design clocks at the limit of the Blockram, or
16b add or 3 LUTs of logic which is about 2x faster than the usual 32b
flat single thread pipelines. The usual instruction decision logic that
is often crammed into 1 pipeline, now straddles 8 pipelines so very
little logic needed between pipes.

Now thread i+0 reads its data operands in clock t0 and writes the result
back at t5. The next opcode for that thread reads its operands at t8,
the one after that at t16, & so on.

Thread i+0 uses t0,t5,  t8,t13 etc. for reg reads & writes
Thread i+1 uses t1,t6,  t9,t14 etc.
Thread i+2 uses t2,t7, t10,t15 etc.
Thread i+3 uses t3,t8, t11,t16 etc.

So all threads stay out of each others way, no interlocks, no
forwarding, no hazards, no branch prediction, but 4 thread states.
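
The timing listed above can be checked with a few lines of Python. The sketch assumes exactly what the listing gives: thread k reads operands at clock k + 8n and writes back 5 clocks later. The constant and function names are made up. It checks the property that removes the need for forwarding, namely that each thread's write back lands before that thread's next operand read.

```python
THREADS = 4
PERIOD = 8         # clocks between successive opcodes of one thread
WRITE_DELAY = 5    # clocks from operand read to result write-back

def schedule(thread, n_opcodes):
    """Return [(read_clock, write_clock), ...] for one thread."""
    return [(thread + PERIOD * n, thread + PERIOD * n + WRITE_DELAY)
            for n in range(n_opcodes)]

def hazard_free(thread, n_opcodes=4):
    """True if every write-back lands before the thread's next operand read."""
    slots = schedule(thread, n_opcodes)
    return all(w < next_r for (_, w), (next_r, _) in zip(slots, slots[1:]))

# Clock slot (modulo the 8-clock period) used by each thread for reads/writes.
read_slots = {t: schedule(t, 1)[0][0] % PERIOD for t in range(THREADS)}
write_slots = {t: schedule(t, 1)[0][1] % PERIOD for t in range(THREADS)}
```

Note that, as listed, thread i+3's write at t8 shares a clock with thread i+0's read at t8. The sketch only checks the per-thread ordering and that no two reads (or two writes) share a slot; it does not model how the Blockram ports are shared.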

I missed out a lot of detail; hey, you have to figure this out on your
own nickel if you want this sort of design. The cond codes, PC and other
cpu state regs will exist 4 times; these can use Srl16s, a DP ram or a
barrel wheel of 4 states moving on mostly 1 phase.

The ARM is a problem, period: you tend to get chased or sued if you get
anything done, esp if there is any intent to give it away or resell it.
I don't think it is that great anyway, and copying any cpu designed for
VLSI into an FPGA leaves a bad taste. Instead use your own opcode set
and look at Jan Gray's site for Lcc hints on porting a compiler etc.

As for associative caches, doing things the regular way with 1 or 2 way
set assoc is very expensive; instead I use hashing, and that makes
things look very associative. I also expect to use RLDRAM but that's
another story. One nice thing about the 4 way and esp 2 phase design is
that every opcode takes 8 x 1, or sometimes 8 x 2 or 3, actual cycles.
RLDRAM can clock at 300MHz and has a latency of 8 cycles per threaded
bank. So my DRAM is faster than my min opcode sequence for load,store,
and I don't need a DCache. The ICache is there to help the much more
predictable I flow, but it isn't really associative since it's just a
queue of opcodes near the PC value. All 4 threads over many processor
copies see their own private DRAM shared in 1 device.

As for multi core, this design is intended to be replicated a few times
and combined with 1 MMU dispensing RLDRAM bandwidth amongst 4N threads.
Since there is no memory wall, each thread compares with a scalar x86
at 2GHz/8/4, so 8 PEs come out about the same. Deal with 4N threads and
no cache misses, or deal with a broken serial model that dare not miss
any cache.

The SDRAM is not actually too slow; it is only 2-3x slower than RLDRAM
as latency goes. The problem is that it has no concurrency, so only 1
bank is in flight v 8, and RLDRAM gets 20x more work done. Threaded DRAM
goes with threaded processor.
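
The 20x figure is consistent with simply multiplying the two ratios given above. A small Python sketch of that arithmetic, assuming throughput scales as requests in flight divided by latency (the function name is made up, and real DRAM controllers have other limits this ignores):

```python
def relative_throughput(latency_ratio, banks_in_flight):
    """Requests completed by the banked DRAM per request of the unbanked one."""
    # With one request in flight, throughput is 1 / latency.
    # With `banks_in_flight` overlapped requests it is banks / latency.
    return latency_ratio * banks_in_flight

low = relative_throughput(2, 8)     # SDRAM 2x slower, 8 RLDRAM banks
high = relative_throughput(3, 8)    # SDRAM 3x slower, 8 RLDRAM banks
print(low, high)
```

The quoted 20x sits in the middle of the range this gives for a 2-3x latency gap and 8 banks in flight.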

Think I said enough for now.

John

I understand the basic idea.  I can see how it solves a lot of problems
because the time between cycles for an individual thread is long enough
that you don't have to deal with forwarding or hazards or branch
prediction or anything like that.  Each thread is something of a
multicycle architecture.

Unfortunately it seems that a multi-threaded architecture definitely
needs a new programming paradigm.  I don't think your standard C
program would map well onto that.  (If you were running 4 C programs,
however, I could see it working quite well).  But I suppose that is a
different sort of problem to face.

Thanks for the info.  I may very well look into an architecture like
this at some point.

-Arlen

gallen wrote:
> I understand the basic idea.  I can see how it solves a lot of problems
> because the time between cycles for an individual thread is long enough
> that you don't have to deal with forwarding or hazards or branch
> prediction or anything like that.  Each thread is something of a
> multicycle architecture.
Exactly, we do this all the time in DSP to break dependencies.
> Unfortunately it seems that a multi-threaded architecture definitely
> needs a new programming paradigm.  I don't think your standard C
> program would map well onto that.  (If you were running 4 C programs,
> however, I could see it working quite well).  But I suppose that is a
> different sort of problem to face.
Typically if a processor is already running some sort of OS with time
sharing of processes, then having to deal with 4 HW threads is not a big
deal except that the threads run at 1/t of clock. But if many of these
PEs are available in each MMU cluster then that is 4N threads.

It gets much more interesting when the MMU introduces its own OS memory
management issues and the language of choice looks like a hybrid of
C/C++ with occam and Verilog.

C gives us structs with data members, usually manipulated by any old
functions, with no special logical structure at all.

C/C++ gives us classes to add member functions to member data for
object oriented programming, but no concurrency or liveness.

V++ (in development) gives us a process which looks just like a class
with an added port list and body code that can instance other process
objects ala Verilog.

// monospace
process pname1 (
    in .., out ...,      // just like Verilog port list, event driven
    ints etc ) {         // data ports not event driven
    data members;        // just like C vars in struct
    function members;    // just like C++ class methods
    wires ...;           // just like Verilog
    process body code;   // just like Verilog module body
    l1: pname2(.. );     // just like Verilog instance of another process/module
    l2: pname3(.. );     // labels are used to name instances in the hierarchy
    assign ...;          // just like Verilog continuous assigns
    always { ;;; }       // just like Verilog event driven parallel logic
}                        // usually endmodule

Now a process hierarchy combines C++ class OO structure with an event
driven HDL like structure, with some help from the processor to support
many threads or processes etc.

1) Data, 2) Objects, 3) Processes.

> Thanks for the info.  I may very well look into an architecture like
> this at some point.
>
> -Arlen

regards

John

transputer guy

John,

Thank you very much for the insightful description.  Is there any
chance that you could post some HDL to OpenCores?  I am certain that
others are as interested as myself in playing with a simple MTA.

Stephen

Stephen,

That will depend on future events.

I would like to complete this compiler and finish the remaining opcodes
and the MMU first; it's a unified compiler + processor project, as was
the original Transputer. One project makes no sense without the other.

I would prefer to make something commercial out of it with some free
use for .edu. I am not too worried about time to market since there is
lots of work to do and nobody else seems to be interested in doing
this. Most seem happy to reinvent the same dead end ST designs and ST
languages over and over.

If I do put it out in the open, it could be on opencores or whatever,
with a BSD/MIT license, but it would be better for me to exploit this if
I can too.

Updates here, c.a, c.s.t etc osnews, one day a web home too.

John

transputer guy