
Spartan 3 - available in small quantities?

Started by Thomas Heller February 20, 2004
Joshua replied:

> > johnjakson_usa_com
>
> John,
>
> I'd check your report files closely if I were you. If you are seeing
> 311MHz on a Spartan 3 something is very wrong. I suspect that your
> synthesizer discarded most of your design. My experience with Spartan
> XC3S400-4s is that they are much slower than Virtex2Ps (-5 is the V2P that
> I'm comparing it to). I'm able to get the Spartan 3s to meet 140MHz timing
> but that is with very few logic levels between pipeline stages. I'm sure
> that with lots of floorplanning it would be possible to push it higher
> than that but certainly not to 300MHz, especially not on something as
> complex as a CPU.
Hi Joshua, Rick,

Hopefully 4th time lucky, my girls are helping me way too much. With Google I don't know what happened for several hours; I am sure a couple of half posts are in front of this one. Apologies. Long reply warning.

I know what you are saying. My 1st paper cpu arch, when presented to XST, gives me little clue where to start. I always used to work on ASICs in teams where I wrote the Verilog & C models and someone else (far less speed/area motivated) banged the FPGA tool. With Virtex 800 experience only at <30MHz I never had that great an expectation to start with; I always had way too much logic in each pipeline, but we only needed 30MHz. There was no time to explore the speed/area tradeoff and reduce logic, as it was ASIC prototyping. Ray Andraka's work on super-pipelining everything DSP left me wondering if a cpu could also go as fast. Usually not, because there are way too many random blocks of logic covering many adjacent pipelines. This is why MicroBlaze is stuck in the 120MHz zone; I could probably guess (reverse engineer) the code used for the datapath if I really studied the ISA. The Alpha chip and of course now the x86s are also deeply superpipelined, but more complex than can fit in any FPGA (or maybe not). Now I am free to explore the boundaries and see what can be done on a clean sheet at max frequency. I am also following very late after Philip's and Jan's work on FPGA cpus from the 4000 days, but even Jan got 30MHz on 4000s a long time ago. Since I am coming from a cpu & DSP background, I wanted Alpha speed but on a better architecture for parallel programming, i.e. a modern Transputer.

I built a number of test projects that only included 1 instance of a real pipelined blockram, or adders of varying widths, and so on. I also play through the device type list and try sp2s through to v2pro with varying speed grades and even different packages, since the reports only take 20s for such simple models. The last speed file posted by Austin made a huge difference, bringing sp3 close enough to v2pro that the difference is marginal; only -8 pulled ahead another 5%. The sp2s remain at the lower end of 100-200MHz, which is what I expect for these simple pipes.

I always study the report and generate the layout. Everything looks kosher but the layout always looks haphazard, so I learned to use the floorplanner and write C code to make the .ucf file for FF placement. On occasion a stupid typo would whip the speed up to 700MHz or something, and voila, most of the top level would be missing, but then the report usually says as much in bright red or yellow. I only allow a few yellow marks for known issues beyond my control, like the unused parity bits of a blockram instance. Any more than that requires immediate fixing.

Now that I have my expectations set right, I know that a Blockram can cycle at around 320MHz on various sp3 -5 devices. In fact the ds099.pdf datasheet, IIRC, says as much. A 32b plain adder is 250MHz; that needs pipelining work to get to 300MHz plus. I ended up with a 3-stage 32b add in 12-, 10-, and 10-bit (MSB) slices. I really wanted to do a faster 2-stage carry-select design but XST always seems to hack it into something less. Trivial things like generating CVNZ flags become trouble at that speed; I end up piping that as well, since you can only do 3 LUT layers of logic, or a 12b registered add, or a 12b logic fn(), or a BRam cycle, and ZERO combinations of these. This is only possible because the cpu design is 4-way hyperthreaded with 1 nice hazard path, so that all the datapath pipes are as decoupled as they would be in any DSP engine.
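[Editorial aside: as a rough illustration only, a carry-pipelined 3-stage 32-bit add along the 12/10/10 lines described above might look roughly like the sketch below. The slicing comes from the post; the module and signal names and the exact register arrangement are illustrative assumptions, not the actual design.]

// Sketch of a 3-stage carry-pipelined 32-bit adder, sliced
// 12 bits (LSBs) + 10 bits + 10 bits (MSBs).  One result per clock,
// 3 cycles of latency.
module add32_p3 (
    input             clk,
    input      [31:0] a, b,
    output reg [31:0] sum
);
    // stage 1: add the low 12 bits, carry the upper operand bits forward
    reg [12:0] s0_lo;                 // 12-bit partial sum + carry out
    reg [19:0] s0_a_hi, s0_b_hi;
    always @(posedge clk) begin
        s0_lo   <= a[11:0] + b[11:0];
        s0_a_hi <= a[31:12];
        s0_b_hi <= b[31:12];
    end

    // stage 2: add the middle 10 bits plus the stage-1 carry
    reg [11:0] s1_lo;
    reg [10:0] s1_mid;                // 10-bit partial sum + carry out
    reg [9:0]  s1_a_hi, s1_b_hi;
    always @(posedge clk) begin
        s1_lo   <= s0_lo[11:0];
        s1_mid  <= s0_a_hi[9:0] + s0_b_hi[9:0] + s0_lo[12];
        s1_a_hi <= s0_a_hi[19:10];
        s1_b_hi <= s0_b_hi[19:10];
    end

    // stage 3: add the top 10 bits plus the stage-2 carry, assemble the result
    always @(posedge clk)
        sum <= {s1_a_hi + s1_b_hi + s1_mid[10], s1_mid[9:0], s1_lo};
endmodule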
Only the instruction decode has some local coupling, but again it has no wide adds or big rams, so it's looking doable, and it is also N-way threaded. I have more work to do, but I never add more logic in series with my critical blocks. If I get to 4 LUT/mux levels I immediately drop out of warp speed back to 250MHz or even way less, and that makes the other stuff that is fully pipelined redundant. Any time my speed drops below 311MHz I know I just added a 4th LUT level; I track it down and redo it until it's 3 or fewer. This usually requires working on that module in isolation, keeping its speed as much as possible over my target. Further, I cannot allow any module to have unregistered IOs, however painful that is, without tracking that at a global level. The 3 levels of LUT logic are almost always in one place inside a module between 2 pipes. The Verilog code is a mix of structural & RTL style: assigns for the wiring and always @ for the FFing.

This is really the same deal as with the fastest VLSI cpus, which are limited to 10 levels of low-fanout gate-level logic. Seymour Cray was doing this in ECL 40 years ago. A LUT counts as 3 levels of gate logic, so that's close enough to 10 gates.

I will report on the work as it gets closer to live results. I know I can download to an sp2e dev board for about 200MHz or way less, but by the time the cpu C & Verilog models can run code and I have the lcc compiler done, gee, I might have an sp3 -5 dev board to play with. The intended market is licensing to high-end users for embedded & parallel computing.

I am even tempted to max the datapath to 64b as it only adds 3-4 pipe stages and not much to the control. The LUT count is still below 500 and is mostly going to control; a 64b Alpha path would balance it more toward computing, but that's another story. My only concern is how much power 1 cpu of <800 LUTs or FFs will dump. I use 2 BRams per cpu instance, so I am just about to lose the ability to have 2 in an sp3 50. The bigger sp3s though are more on the LUT side.

Regards all

johnjakson_usa_com
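[Editorial aside: the coding rule described above (assigns for the wiring, always @ for the flops, every module IO registered, only a few LUT levels between flop ranks) translates roughly into a stage template like the sketch below. The function computed is invented purely for illustration and is not from the design.]

// Illustrative only: one pipe stage in the assign-for-wiring,
// always-@-for-FFs style, with registered inputs and outputs.
module stage_example (
    input             clk,
    input      [31:0] din_a, din_b,
    output reg [31:0] dout
);
    // input flops: module IOs are always registered
    reg [31:0] a_q, b_q;
    always @(posedge clk) begin
        a_q <= din_a;
        b_q <= din_b;
    end

    // combinational wiring between flop ranks, kept to a few LUT levels
    wire [31:0] masked = a_q & b_q;              // invented example function
    wire [31:0] merged = masked | {b_q[30:0], 1'b0};

    // output flop
    always @(posedge clk)
        dout <= merged;
endmodule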
john jakson wrote:
<interesting stuff snipped>
> If I get to 4 LUT/mux levels I immediately drop out of warp
> speed back to 250MHz or even way less and that makes the other stuff
> that is fully pipelined redundant. Any time my speed drops below
> 311MHz, I know I just added a 4th LUT level, track it down and redo it
> till its 3 or less. This usually requires working on that module in
> isolation, keeping its speed as much as possible over my target.
> Further I can not allow any module to have unregistered IOs however
> painful that is with out tracking that at a global level. The 3 levels
> of LUT logic is almost always in one place inside a module between 2
> pipes. The Verilog code is a mix of structural & RTL style, assigns
> for wiring and always @ for the FFing.
>
> This is really the same deal with the fastest VLSI cpus that are
> limited to 10 levels of low fanout gate level logic. Seymour was doing
> this in ECL 40yrs ago. A LUT counts as 3 levels of gate logic so close
> enough 10gates.
>
> I will report on the work as it gets closer to live results.
Sounds to me like something you could negotiate a job at Xilinx doing :)

Their marketing dept would just LOVE to boast about 300+ MHz CPU cores, even if that is 'very peaky'. (After all, so are the alternatives.)

Key question is what code size is this working from?

-jg
> I am even tempted to max the datapath to 64b as it only
> adds 3-4 pipestages and not much to the control.
Sure, but the more pipeline stages you add, the longer the latency is for each instruction. How many cycles of latency will there be for a single add instruction? Do you intend to make sure that the number of threads is equal to this latency, so that the latency as perceived by the thread executing the instruction is 0?

What's your cache / memory architecture? Handling lots of threads could be tricky.

Cheers,
JonB
> > This is really the same deal with the fastest VLSI cpus that are
> > limited to 10 levels of low fanout gate level logic. Seymour was doing
> > this in ECL 40yrs ago. A LUT counts as 3 levels of gate logic so close
> > enough 10gates.
> >
> > I will report on the work as it gets closer to live results.
>
> Sounds to me like something you could negotiate
> a job at Xilinx doing :)
>
> Their marketing dept would just LOVE to boast about 300+ MHz
> CPU cores, even if that is 'very peaky'. (after all, so are the
> alternatives)
>
> Key question is what code size is this working from ?
>
> -jg
I am sure anyone would love to get a cpu at 300MHz in an FPGA, but the arch will be on my terms. The code base is remarkably small vs previous projects I have worked on; the Verilog is <4K, IIRC, so far. It will get bigger for control logic. The 1st pass will defer some opcode complexity to xops, as the TI 9900 once called them, i.e. low-overhead, low-address subroutines. That will reduce performance of the OS message-passing/scheduling-specific code by 4x or so, but it's easier to write asm than design HW. Later, FPGA space permitting, most of that will get hardened.

Note there is almost no HW needed for hazard detection, no branch prediction, no pipeline flushing. Just like a DSP really. Other than that the cpu looks more like 4 78MHz classic load/store RISC cpus timesharing the HW (and the cache, unfortunately). Actually the performance on paper should compare well to an x86 at 1.5x the clock, i.e. a 500MHz x86, but the cache size is a real limit here. I still have to design the cache & TLB HW; associative HW really costs. Note that all cc branches take 0 cycles as they group onto non-branch opcodes, so it may well run smaller loops at an effective 400MHz if every 4th op is a branch. It's also a joy to count cycles based on the bandwidth the opcode actually uses: a ccbra really uses 1/8 or 1/4 of a cycle, but from slack time. An add a<-b+c would actually use 9/8, since the opcode fetch is another 1/8 from slack time. But a cmp a,b would be 5/8, since the unused write back gives back 4/8 cycles. Of course the ops really run in integer cycles, but there are queues to be filled and that uses slack cache memory ports.

The actual non-Transputer ISA is actually quite soft; I can mess with opcode encodings at will since, as we all know, cpus only do movs these days (yeah, right). The arch should port to any FPGA that supports true dual-port 2WW/2RR/2RW BlockRams, not really using any other special features; SRL16 is nice if available. This also means it can be ASICed, where I would expect it to run at least 3x faster as long as the libs include a prebuilt DP Ram, as that is always the 1st limit. Other adder width limits can be worked around. Some time I will get around to trying the free Quartus, but I wish they would drop the IP node nonsense.

I am really pushing to get the Transputer arch back in front, since that allows many cpus to work in harmony with the message-passing scheme. It worked well before, but Inmos folded up for bad engineering & business reasons, not because the basic premise was unsound. At one time, before the 486 came along, it was the dominant 32b arch, especially in Europe, and very popular with HW, embedded & extreme computing types. Occam was a killer though; most SW types didn't get it, although in hindsight I see it now as a Forthy/Lispy HDL language. I address that issue by suggesting it be programmed in V++, a language which just combines std C with the Verilog event-driven language. It also includes the Occam primitives Par, Alt, Seq, and the !? operators to round it out, but using C syntax. Handel-C does the same thing and is a SW & HW language too. I am about half way done on that using lcc as the base technology; have to get back to tree generation and code emit. Std lcc can't just be hacked the way Jan did on XSOC because of the need to include the Par support. The runtime is really a tiny OS with a scheduler, basic memory management in SW, etc. The compiler is actually 90% of the project effort and the HW is almost the easy part, certainly the fun part. I would like to transfer the compiler workload pronto, but few compiler writers know about Verilog internals etc.
The big kicker here, as I keep saying, is that end-user code can be written in any of the 3 styles: maybe start in C and rewrite parts in Occam-style message passing to get more parallelism. For real speed-ups, rewrite in HDL style and voila, the SW can be synthed with any free FPGA synth tool into something like a coprocessor. Fits in very well with the good article the Altera guy linked here a few days ago on another thread.

Better get back to work.

johnjakson_usa_com
508-4800777
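[Editorial aside: the portability requirement above is a true dual-port, fully synchronous block RAM. A rough sketch of the kind of plain-Verilog template that stands in for it is below; depth, width and names are placeholder assumptions, and depending on the synthesis tool such a template may need to be swapped for a directly instantiated primitive (e.g. a RAMB16 on Spartan-3) to get true dual-port behaviour.]

// Sketch of a true dual-port synchronous RAM: each port can read or
// write every cycle, with registered read data on both ports.
module dp_ram #(
    parameter AW = 9,            // example: 512 x 32
    parameter DW = 32
)(
    input               clk,
    // port A
    input               a_we,
    input  [AW-1:0]     a_addr,
    input  [DW-1:0]     a_din,
    output reg [DW-1:0] a_dout,
    // port B
    input               b_we,
    input  [AW-1:0]     b_addr,
    input  [DW-1:0]     b_din,
    output reg [DW-1:0] b_dout
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk) begin
        if (a_we) mem[a_addr] <= a_din;
        a_dout <= mem[a_addr];
    end

    always @(posedge clk) begin
        if (b_we) mem[b_addr] <= b_din;
        b_dout <= mem[b_addr];
    end
endmodule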
jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...
> > I am even tempted to max the datapath to 64b as it only
> > adds 3-4 pipestages and not much to the control.
>
> Sure, but the more pipeline stages you add, the longer the latency is
> for each instruction. How many cycles latency will there be for a
> single add instruction? Do you intend to make sure that the number of
> threads is equal to this latency, so that the latency as perceived the
> thread executing the instruction is 0?
>
> What's your cache / memory architecture? Handling lots of threads
> could be tricky.
>
> Cheers,
> JonB
I just posted a very long reply but the server just xxxxed it, so I will write it again later offline.

Quick answer: yes, the HT count must match, 4 or 8 etc. The cache architecture is currently 1-way set associative, but more Blockrams would allow more ways. Question of whether the FPGA should hold lots of lite cpus or 1 monster cpu, or maybe combinations of both!

Regards

johnjakson_usa_com
508 4800777 EST after 8pm
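[Editorial aside: a minimal sketch, not the actual design, of why the thread count has to match the pipeline depth. With a free-running 2-bit thread counter, each of 4 hardware threads issues only every 4th cycle, so a 4-cycle result latency is never visible to the issuing thread. The real design interleaves much more than the fetch shown here.]

// Round-robin fetch for 4 hardware threads, one per-thread PC.
module ht_fetch_sketch (
    input         clk, rst,
    output [1:0]  tid,          // thread issuing this cycle
    output [31:0] fetch_pc      // that thread's program counter
);
    reg [1:0]  thread;          // round-robin thread counter
    reg [31:0] pc [0:3];        // one PC per hardware thread
    integer i;

    always @(posedge clk) begin
        if (rst) begin
            thread <= 2'd0;
            for (i = 0; i < 4; i = i + 1) pc[i] <= 32'd0;
        end else begin
            thread     <= thread + 2'd1;       // next thread every cycle
            pc[thread] <= pc[thread] + 32'd4;  // this thread's next fetch
        end
    end

    assign tid      = thread;
    assign fetch_pc = pc[thread];
endmodule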
John,

In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for it. I did a small microcontroller for an XC4036E design several years back that ran at 66 MHz. It was a pretty simple machine that was sort of a cross between a PIC and an RCA 1802, in that it used a 16-deep register file like the 1802, and it was a Harvard architecture like the PIC. Like the 1802, the operands for the ALU were fetched from the register file and results returned to the register file. The beauty of it was that for control applications, you often did not even need any memory beyond the register file. The processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler person, so the big difficulty I had with it was the compiler.

I suspect that the difficulty for just about any home-grown processor is going to be the tools to compile the code for it, although folks who are more savvy than I on the software side might argue that the high-speed hardware design is the hard part.


john jakson wrote:

> jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...
> > > I am even tempted to max the datapath to 64b as it only
> > > adds 3-4 pipestages and not much to the control.
> >
> > Sure, but the more pipeline stages you add, the longer the latency is
> > for each instruction. How many cycles latency will there be for a
> > single add instruction? Do you intend to make sure that the number of
> > threads is equal to this latency, so that the latency as perceived the
> > thread executing the instruction is 0?
> >
> > What's your cache / memory architecture? Handling lots of threads
> > could be tricky.
> >
> > Cheers,
> > JonB
>
> I just posted a very long reply but the server just xxxxed it so I
> will write it again later offline.
>
> Quick answer yes, HT must match 4 or 8 etc. Cache architecture is
> currently 1 way set associative, but more Blockrams would allow more
> ways. Question of whether the FPGA should hold lots of lite cpus or 1
> monster cpu or maybe combinations of both!
>
> Regards
>
> johnjakson_usa_com
>
> 508 4800777 EST after 8pm
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930  Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
Ray Andraka wrote:
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
> it. I did a small microcontroller for a XC4036E design several years back that ran at 66 Mhz. It was a
> pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
> register file like the 1802, and it was a harvard architecture like the PIC. Like the 1802, the operands
> for the ALU were fetched from the register file and results returned to the register file. The beauty of it
> was that for control applications, you often did not even need any memory beyond the register file. The
> processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler
> person, so the big difficulty I had with it was the compiler.
>
> I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
> code for it, although folks who are more saavy than I on the software side might argue that the high speed
> hardware design is the hard part.
This is right, and John admits this in another reply. You should also add DEBUG support, as that becomes more important as the CPU targets bigger applications. Once you have a compiler, users will want to do more and more, and then debug becomes very important.

It depends a lot on the target use. Something that runs from a Block RAM inside the FPGA can be very small/very fast, but is probably best coded in some form of assembler. The best example of 'Advanced Assembler Art' is Randy Hyde's HLA (High Level Assembler), but that currently targets only x86 - though I'm sure that's not hard to fix :) HLA allows IF..THEN..ELSIF etc., and handles the labels needed, as well as giving local scope (so it is a big step up from vanilla ASM).

-jg
Ray Andraka <ray@andraka.com> wrote in message news:<403A43E8.6338D0C1@andraka.com>...
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
> it. I did a small microcontroller for a XC4036E design several years back that ran at 66 Mhz. It was a
> pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
> register file like the 1802, and it was a harvard architecture like the PIC. Like the 1802, the operands
> for the ALU were fetched from the register file and results returned to the register file. The beauty of it
> was that for control applications, you often did not even need any memory beyond the register file. The
> processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler
> person, so the big difficulty I had with it was the compiler.
>
> I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
> code for it, although folks who are more saavy than I on the software side might argue that the high speed
> hardware design is the hard part.
Hi Ray,

Half agreed. As Jan has shown, any std RISC cpu project can grab lcc to do the task quite quickly by messing with the emit tables. If this were just another std RISC project I'd probably do the same, but then it wouldn't be anywhere near 300MHz either, more like MicroBlaze. Only hyperthreading allows max speed, but if the processes don't communicate with each other then lcc could still be used as is, ignoring the HT stuff.

Some of my background is in compilers and other tools, but I never worked for anybody doing that. The lcc compiler (Hanson & Fraser) is possibly the best-documented C-compiler-writing textbook around and is highly recommended, as it explains thoroughly just how horrible C really is, where most C books gloss over its complexity. The complexity for me comes because I am combining essentially 3 languages together and putting in a mini OS runtime. The Transputer did it before, but chose an unfriendly syntax and supported C only as an afterthought. I will probably get through it OK; I would love to pass that part on, but then that person would be knee deep in it instead. The HW part is more fun though.

The 1802 takes me back. Not bad in a twisted sort of way; it certainly used very little logic. I had it under a scope at Inmos.

Regards

johnjakson_usa_com
> I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
> code for it, although folks who are more saavy than I on the software side might argue that the high speed
> hardware design is the hard part.
How much code are you writing? Would you be willing/happy to do it in assembler?

Assemblers can be pretty simple, especially if the target is raw binary loaded at 0 rather than something needing linkers and libraries. It also helps if the target is RISC and doesn't have messy addressing modes.

How much would a reasonably clean sample assembler help? There should be a good example from the academic world. Just type in the new opcode table.

--
The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.
The low complexity is why I chose the architecture I did. Unfortunately, I did that design in schematics, before I started using VHDL, so resurrecting it at this point involves more time than I can devote to it.

john jakson wrote:

> The HW part is more fun though. The 1802 takes me back, not bad in a
> twisted sort of way, it certainly used very little logic, I had it
> under a scope at Inmos.
>
> Regards
>
> johnjakson_usa_com
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930  Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759