
Spartan 3 - available in small quantities?

Started by Thomas Heller February 20, 2004
Joshua replied:

> > johnjakson_usa_com
>
> John,
>
> I'd check your report files closely if I were you. If you are seeing
> 311MHz on a Spartan 3 something is very wrong. I suspect that your
> synthesizer discarded most of your design. My experience with Spartan
> XC3S400-4s is that they are much slower than Virtex2Ps (-5 is the V2P that
> I'm comparing it to). I'm able to get the Spartan 3s to meet 140MHz timing
> but that is with very few logic levels between pipeline stages. I'm sure
> that with lots of floorplanning it would be possible to push it higher
> than that but certainly not to 300MHz, especially not on something as
> complex as a CPU.
Hi Joshua, Rick,

Hopefully 4th time lucky, my girls are helping me way too much. With Google I don't know what happened for several hours; I am sure a couple of half posts are in front of this one. Apologies. Long reply warning.

I know what you are saying. My 1st paper cpu arch, when presented to XST, gives me little clue where to start. I always used to work on ASICs in teams where I wrote the Verilog & C models and someone else (far less speed/area motivated) banged the FPGA tool. With Virtex 800 experience only at <30MHz I never had that great an expectation to start with; I always had way too much logic in each pipeline, but we only needed 30MHz. There was no time to explore the speed/area tradeoff and reduce logic, as it was ASIC prototyping. Ray Andraka's work on super-pipelining everything DSP left me wondering if a cpu could also go as fast. Usually not, because there are way too many random blocks of logic covering many adjacent pipelines. This is why MicroBlaze is stuck in the 120MHz zone; I could probably guess (reverse engineer) the code used for the datapath if I really studied the ISA. The Alpha chip and of course now the x86s are also deeply superpipelined, but more complex than can fit in any FPGA (or maybe not). Now I am free to explore the boundaries and see what can be done on a clean sheet at max frequency. I am also following very late after Philip's and Jan's work on FPGA cpus from the 4000 days, but even Jan got 30MHz on 4000s a long time ago. Since I am coming from a cpu & DSP background, I wanted Alpha speed but on a better architecture for parallel programming, i.e. a modern Transputer.

I built a number of test projects that only included 1 instance of a real pipelined blockram, or adders of varying widths, and so on. I also play through the device type list and try sp2s through to v2pro with varying speed grades and even different packages, since the reports only take 20s for such simple models. The last speed file posted by Austin made a huge difference, bringing sp3 close enough to v2pro that the difference is marginal; only -8 pulled ahead another 5%. The sp2s remain at the lower end of 100-200MHz, which is what I expect for these simple pipes.

I always study the report and generate the layout. Everything looks kosher but the layout always looks haphazard, so I learned to use the floorplanner and write C code to make the .ucf file for FF placement. On occasion a stupid typo would whip the speed up to 700MHz or something, and voila, most of the top level would be missing, but then the report usually says as much in bright red or yellow. I only allow a few yellow marks for known issues beyond my control, like the unused parity bits of a blockram instance. Any more than that requires immediate fixing.

Now that I have my expectations set right, I know that a Blockram can cycle at around 320MHz on various sp3 -5 devices. In fact the ds099.pdf datasheet, IIRC, says as much. A 32b plain adder is 250MHz; that needs pipelining work to get to 300MHz plus. I ended up with a 3-stage 32b add in 12-, 10-, and 10-bit (MSB) slices. I really wanted to do a faster 2-stage carry-select design but XST always seems to hack it into something less. Trivial things like generating CVNZ flags become trouble at that speed; I end up piping that as well, since you can only do 3 LUT layers of logic, or a 12b registered add, or a 12b logic fn(), or a BRam cycle, and ZERO combinations of these. This is only possible because the cpu design is 4-way hyperthreaded with 1 nice hazard path, so that all the datapath pipes are as decoupled as they would be in any DSP engine.
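[Editorial aside: as a rough illustration only, a carry-pipelined 3-stage 32-bit add along the 12/10/10 lines described above might look roughly like the sketch below. The slicing comes from the post; the module and signal names and the exact register arrangement are illustrative assumptions, not the actual design.]

// Sketch of a 3-stage carry-pipelined 32-bit adder, sliced
// 12 bits (LSBs) + 10 bits + 10 bits (MSBs).  One result per clock,
// 3 cycles of latency.
module add32_p3 (
    input             clk,
    input      [31:0] a, b,
    output reg [31:0] sum
);
    // stage 1: add the low 12 bits, carry the upper operand bits forward
    reg [12:0] s0_lo;                 // 12-bit partial sum + carry out
    reg [19:0] s0_a_hi, s0_b_hi;
    always @(posedge clk) begin
        s0_lo   <= a[11:0] + b[11:0];
        s0_a_hi <= a[31:12];
        s0_b_hi <= b[31:12];
    end

    // stage 2: add the middle 10 bits plus the stage-1 carry
    reg [11:0] s1_lo;
    reg [10:0] s1_mid;                // 10-bit partial sum + carry out
    reg [9:0]  s1_a_hi, s1_b_hi;
    always @(posedge clk) begin
        s1_lo   <= s0_lo[11:0];
        s1_mid  <= s0_a_hi[9:0] + s0_b_hi[9:0] + s0_lo[12];
        s1_a_hi <= s0_a_hi[19:10];
        s1_b_hi <= s0_b_hi[19:10];
    end

    // stage 3: add the top 10 bits plus the stage-2 carry, assemble the result
    always @(posedge clk)
        sum <= {s1_a_hi + s1_b_hi + s1_mid[10], s1_mid[9:0], s1_lo};
endmodule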
Only the instruction decode has some local coupling, but again it has no wide adds or big rams, so it's looking doable, and it is also N-way threaded. I have more work to do, but I never add more logic in series with my critical blocks. If I get to 4 LUT/mux levels I immediately drop out of warp speed back to 250MHz or even way less, and that makes the other stuff that is fully pipelined redundant. Any time my speed drops below 311MHz I know I just added a 4th LUT level; I track it down and redo it until it's 3 or fewer. This usually requires working on that module in isolation, keeping its speed as much as possible over my target. Further, I cannot allow any module to have unregistered IOs, however painful that is, without tracking that at a global level. The 3 levels of LUT logic are almost always in one place inside a module between 2 pipes. The Verilog code is a mix of structural & RTL style: assigns for the wiring and always @ for the FFing.

This is really the same deal as with the fastest VLSI cpus, which are limited to 10 levels of low-fanout gate-level logic. Seymour Cray was doing this in ECL 40 years ago. A LUT counts as 3 levels of gate logic, so that's close enough to 10 gates.

I will report on the work as it gets closer to live results. I know I can download to an sp2e dev board for about 200MHz or way less, but by the time the cpu C & Verilog models can run code and I have the lcc compiler done, gee, I might have an sp3 -5 dev board to play with. The intended market is licensing to high-end users for embedded & parallel computing.

I am even tempted to max the datapath to 64b as it only adds 3-4 pipe stages and not much to the control. The LUT count is still below 500 and is mostly going to control; a 64b Alpha path would balance it more toward computing, but that's another story. My only concern is how much power 1 cpu of <800 LUTs or FFs will dump. I use 2 BRams per cpu instance, so I am just about to lose the ability to have 2 in an sp3 50. The bigger sp3s though are more on the LUT side.

Regards all

johnjakson_usa_com
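[Editorial aside: the coding rule described above (assigns for the wiring, always @ for the flops, every module IO registered, only a few LUT levels between flop ranks) translates roughly into a stage template like the sketch below. The function computed is invented purely for illustration and is not from the design.]

// Illustrative only: one pipe stage in the assign-for-wiring,
// always-@-for-FFs style, with registered inputs and outputs.
module stage_example (
    input             clk,
    input      [31:0] din_a, din_b,
    output reg [31:0] dout
);
    // input flops: module IOs are always registered
    reg [31:0] a_q, b_q;
    always @(posedge clk) begin
        a_q <= din_a;
        b_q <= din_b;
    end

    // combinational wiring between flop ranks, kept to a few LUT levels
    wire [31:0] masked = a_q & b_q;              // invented example function
    wire [31:0] merged = masked | {b_q[30:0], 1'b0};

    // output flop
    always @(posedge clk)
        dout <= merged;
endmodule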
john jakson wrote:
<interesting stuff snipped>
> If I get to 4 LUT/mux levels I immediately drop out of warp
> speed back to 250MHz or even way less and that makes the other stuff
> that is fully pipelined redundant. Any time my speed drops below
> 311MHz, I know I just added a 4th LUT level, track it down and redo it
> till its 3 or less. This usually requires working on that module in
> isolation, keeping its speed as much as possible over my target.
> Further I can not allow any module to have unregistered IOs however
> painful that is with out tracking that at a global level. The 3 levels
> of LUT logic is almost always in one place inside a module between 2
> pipes. The Verilog code is a mix of structural & RTL style, assigns
> for wiring and always @ for the FFing.
>
> This is really the same deal with the fastest VLSI cpus that are
> limited to 10 levels of low fanout gate level logic. Seymour was doing
> this in ECL 40yrs ago. A LUT counts as 3 levels of gate logic so close
> enough 10gates.
>
> I will report on the work as it gets closer to live results.
Sounds to me like something you could negotiate a job at Xilinx doing :)

Their marketing dept would just LOVE to boast about 300+ MHz CPU cores, even if that is 'very peaky'. (After all, so are the alternatives.)

Key question is what code size is this working from?

-jg
> I am even tempted to max the datapath to 64b as it only
> adds 3-4 pipestages and not much to the control.
Sure, but the more pipeline stages you add, the longer the latency is for each instruction. How many cycles of latency will there be for a single add instruction? Do you intend to make sure that the number of threads is equal to this latency, so that the latency as perceived by the thread executing the instruction is 0?

What's your cache / memory architecture? Handling lots of threads could be tricky.

Cheers,
JonB
> > This is really the same deal with the fastest VLSI cpus that are
> > limited to 10 levels of low fanout gate level logic. Seymour was doing
> > this in ECL 40yrs ago. A LUT counts as 3 levels of gate logic so close
> > enough 10gates.
> >
> > I will report on the work as it gets closer to live results.
>
> Sounds to me like something you could negotiate
> a job at Xilinx doing :)
>
> Their marketing dept would just LOVE to boast about 300+ MHz
> CPU cores, even if that is 'very peaky'. (after all, so are the
> alternatives)
>
> Key question is what code size is this working from ?
>
> -jg
I am sure anyone would love to get a cpu at 300MHz in an FPGA, but the arch will be on my terms. The code base is remarkably small vs previous projects I have worked on; the Verilog is <4K, IIRC, so far. It will get bigger for control logic. The 1st pass will defer some opcode complexity to xops, as the TI 9900 once called them, i.e. low-overhead, low-address subroutines. That will reduce performance of the OS message-passing/scheduling-specific code by 4x or so, but it's easier to write asm than design HW. Later, FPGA space permitting, most of that will get hardened.

Note there is almost no HW needed for hazard detection, no branch prediction, no pipeline flushing. Just like a DSP really. Other than that the cpu looks more like 4 78MHz classic load/store RISC cpus timesharing the HW (and the cache, unfortunately). Actually the performance on paper should compare well to an x86 at 1.5x the clock, i.e. a 500MHz x86, but the cache size is a real limit here. I still have to design the cache & TLB HW; associative HW really costs. Note that all cc branches take 0 cycles as they group onto non-branch opcodes, so it may well run smaller loops at an effective 400MHz if every 4th op is a branch. It's also a joy to count cycles based on the bandwidth the opcode actually uses: a ccbra really uses 1/8 or 1/4 of a cycle, but from slack time. An add a<-b+c would actually use 9/8, since the opcode fetch is another 1/8 from slack time. But a cmp a,b would be 5/8, since the unused write back gives back 4/8 cycles. Of course the ops really run in integer cycles, but there are queues to be filled and that uses slack cache memory ports.

The actual non-Transputer ISA is actually quite soft; I can mess with opcode encodings at will since, as we all know, cpus only do movs these days (yeah, right). The arch should port to any FPGA that supports true dual-port 2WW/2RR/2RW BlockRams, not really using any other special features; SRL16 is nice if available. This also means it can be ASICed, where I would expect it to run at least 3x faster as long as the libs include a prebuilt DP Ram, as that is always the 1st limit. Other adder width limits can be worked around. Some time I will get around to trying the free Quartus, but I wish they would drop the IP node nonsense.

I am really pushing to get the Transputer arch back in front, since that allows many cpus to work in harmony with the message-passing scheme. It worked well before, but Inmos folded up for bad engineering & business reasons, not because the basic premise was unsound. At one time, before the 486 came along, it was the dominant 32b arch, especially in Europe, and very popular with HW, embedded & extreme computing types. Occam was a killer though; most SW types didn't get it, although in hindsight I see it now as a Forthy/Lispy HDL language. I address that issue by suggesting it be programmed in V++, a language which just combines std C with the Verilog event-driven language. It also includes the Occam primitives Par, Alt, Seq, and the !? operators to round it out, but using C syntax. Handel-C does the same thing and is a SW & HW language too. I am about half way done on that using lcc as the base technology; have to get back to tree generation and code emit. Std lcc can't just be hacked the way Jan did on XSOC because of the need to include the Par support. The runtime is really a tiny OS with a scheduler, basic memory management in SW, etc. The compiler is actually 90% of the project effort and the HW is almost the easy part, certainly the fun part. I would like to transfer the compiler workload pronto, but few compiler writers know about Verilog internals etc.
The big kicker here, as I keep saying, is that end-user code can be written in any of the 3 styles: maybe start in C and rewrite parts in Occam-style message passing to get more parallelism. For real speed-ups, rewrite in HDL style and voila, the SW can be synthed with any free FPGA synth tool into something like a coprocessor. Fits in very well with the good article the Altera guy linked here a few days ago on another thread.

Better get back to work.

johnjakson_usa_com
508-4800777
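[Editorial aside: the portability requirement above is a true dual-port, fully synchronous block RAM. A rough sketch of the kind of plain-Verilog template that stands in for it is below; depth, width and names are placeholder assumptions, and depending on the synthesis tool such a template may need to be swapped for a directly instantiated primitive (e.g. a RAMB16 on Spartan-3) to get true dual-port behaviour.]

// Sketch of a true dual-port synchronous RAM: each port can read or
// write every cycle, with registered read data on both ports.
module dp_ram #(
    parameter AW = 9,            // example: 512 x 32
    parameter DW = 32
)(
    input               clk,
    // port A
    input               a_we,
    input  [AW-1:0]     a_addr,
    input  [DW-1:0]     a_din,
    output reg [DW-1:0] a_dout,
    // port B
    input               b_we,
    input  [AW-1:0]     b_addr,
    input  [DW-1:0]     b_din,
    output reg [DW-1:0] b_dout
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk) begin
        if (a_we) mem[a_addr] <= a_din;
        a_dout <= mem[a_addr];
    end

    always @(posedge clk) begin
        if (b_we) mem[b_addr] <= b_din;
        b_dout <= mem[b_addr];
    end
endmodule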
jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...
> > I am even tempted to max the datapath to 64b as it only
> > adds 3-4 pipestages and not much to the control.
>
> Sure, but the more pipeline stages you add, the longer the latency is
> for each instruction. How many cycles latency will there be for a
> single add instruction? Do you intend to make sure that the number of
> threads is equal to this latency, so that the latency as perceived the
> thread executing the instruction is 0?
>
> What's your cache / memory architecture? Handling lots of threads
> could be tricky.
>
> Cheers,
> JonB
I just posted a very long reply but the server just xxxxed it, so I will write it again later offline.

Quick answer: yes, the HT count must match, 4 or 8 etc. The cache architecture is currently 1-way set associative, but more Blockrams would allow more ways. Question of whether the FPGA should hold lots of lite cpus or 1 monster cpu, or maybe combinations of both!

Regards

johnjakson_usa_com
508 4800777 EST after 8pm
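[Editorial aside: a minimal sketch, not the actual design, of why the thread count has to match the pipeline depth. With a free-running 2-bit thread counter, each of 4 hardware threads issues only every 4th cycle, so a 4-cycle result latency is never visible to the issuing thread. The real design interleaves much more than the fetch shown here.]

// Round-robin fetch for 4 hardware threads, one per-thread PC.
module ht_fetch_sketch (
    input         clk, rst,
    output [1:0]  tid,          // thread issuing this cycle
    output [31:0] fetch_pc      // that thread's program counter
);
    reg [1:0]  thread;          // round-robin thread counter
    reg [31:0] pc [0:3];        // one PC per hardware thread
    integer i;

    always @(posedge clk) begin
        if (rst) begin
            thread <= 2'd0;
            for (i = 0; i < 4; i = i + 1) pc[i] <= 32'd0;
        end else begin
            thread     <= thread + 2'd1;       // next thread every cycle
            pc[thread] <= pc[thread] + 32'd4;  // this thread's next fetch
        end
    end

    assign tid      = thread;
    assign fetch_pc = pc[thread];
endmodule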
John,

In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for it. I did a small microcontroller for an XC4036E design several years back that ran at 66 MHz. It was a pretty simple machine that was sort of a cross between a PIC and an RCA 1802, in that it used a 16-deep register file like the 1802, and it was a Harvard architecture like the PIC. Like the 1802, the operands for the ALU were fetched from the register file and results returned to the register file. The beauty of it was that for control applications, you often did not even need any memory beyond the register file. The processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler person, so the big difficulty I had with it was the compiler.

I suspect that the difficulty for just about any home-grown processor is going to be the tools to compile the code for it, although folks who are more savvy than I on the software side might argue that the high-speed hardware design is the hard part.


john jakson wrote:

> jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...
> > > I am even tempted to max the datapath to 64b as it only
> > > adds 3-4 pipestages and not much to the control.
> >
> > Sure, but the more pipeline stages you add, the longer the latency is
> > for each instruction. How many cycles latency will there be for a
> > single add instruction? Do you intend to make sure that the number of
> > threads is equal to this latency, so that the latency as perceived the
> > thread executing the instruction is 0?
> >
> > What's your cache / memory architecture? Handling lots of threads
> > could be tricky.
> >
> > Cheers,
> > JonB
>
> I just posted a very long reply but the server just xxxxed it so I
> will write it again later offline.
>
> Quick answer yes, HT must match 4 or 8 etc. Cache architecture is
> currently 1 way set associative, but more Blockrams would allow more
> ways. Question of whether the FPGA should hold lots of lite cpus or 1
> monster cpu or maybe combinations of both!
>
> Regards
>
> johnjakson_usa_com
>
> 508 4800777 EST after 8pm
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930  Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
Ray Andraka wrote:
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
> it. I did a small microcontroller for a XC4036E design several years back that ran at 66 Mhz. It was a
> pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
> register file like the 1802, and it was a harvard architecture like the PIC. Like the 1802, the operands
> for the ALU were fetched from the register file and results returned to the register file. The beauty of it
> was that for control applications, you often did not even need any memory beyond the register file. The
> processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler
> person, so the big difficulty I had with it was the compiler.
>
> I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
> code for it, although folks who are more saavy than I on the software side might argue that the high speed
> hardware design is the hard part.
This is right, and John admits this in another reply. You should also add DEBUG support, as that becomes more important as the CPU targets bigger applications. Once you have a compiler, users will want to do more and more, and then debug becomes very important.

It depends a lot on the target use. Something that runs from a Block RAM inside the FPGA can be very small/very fast, but is probably best coded in some form of assembler. The best example of 'Advanced Assembler Art' is Randy Hyde's HLA (High Level Assembler), but that currently targets only x86 - though I'm sure that's not hard to fix :) HLA allows IF..THEN..ELSIF etc., and handles the labels needed, as well as giving local scope (so it is a big step up from vanilla ASM).

-jg
Ray Andraka <ray@andraka.com> wrote in message news:<403A43E8.6338D0C1@andraka.com>...
> John,
>
> In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
> it. I did a small microcontroller for a XC4036E design several years back that ran at 66 Mhz. It was a
> pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
> register file like the 1802, and it was a harvard architecture like the PIC. Like the 1802, the operands
> for the ALU were fetched from the register file and results returned to the register file. The beauty of it
> was that for control applications, you often did not even need any memory beyond the register file. The
> processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler
> person, so the big difficulty I had with it was the compiler.
>
> I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
> code for it, although folks who are more saavy than I on the software side might argue that the high speed
> hardware design is the hard part.
Hi Ray,

Half agreed. As Jan has shown, any std RISC cpu project can grab lcc to do the task quite quickly by messing with the emit tables. If this were just another std RISC project I'd probably do the same, but then it wouldn't be anywhere near 300MHz either, more like MicroBlaze. Only hyperthreading allows max speed, but if the processes don't communicate with each other then lcc could still be used as is, ignoring the HT stuff.

Some of my background is in compilers and other tools, but I never worked for anybody doing that. The lcc compiler (Hanson & Fraser) is possibly the best-documented C-compiler-writing textbook around and is highly recommended, as it explains thoroughly just how horrible C really is, where most C books gloss over its complexity. The complexity for me comes because I am combining essentially 3 languages together and putting in a mini OS runtime. The Transputer did it before, but chose an unfriendly syntax and supported C only as an afterthought. I will probably get through it OK; I would love to pass that part on, but then that person would be knee deep in it instead. The HW part is more fun though.

The 1802 takes me back. Not bad in a twisted sort of way; it certainly used very little logic. I had it under a scope at Inmos.

Regards

johnjakson_usa_com
> I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
> code for it, although folks who are more saavy than I on the software side might argue that the high speed
> hardware design is the hard part.
How much code are you writing? Would you be willing/happy to do it in assembler?

Assemblers can be pretty simple, especially if the target is raw binary loaded at 0 rather than something needing linkers and libraries. It also helps if the target is RISC and doesn't have messy addressing modes.

How much would a reasonably clean sample assembler help? There should be a good example from the academic world. Just type in the new opcode table.

--
The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.
The low complexity is why I chose the architecture I did. Unfortunately, I did that design in schematics, before I started using VHDL, so resurrecting it at this point involves more time than I can devote to it.

john jakson wrote:

> The HW part is more fun though. The 1802 takes me back, not bad in a
> twisted sort of way, it certainly used very little logic, I had it
> under a scope at Inmos.
>
> Regards
>
> johnjakson_usa_com
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930  Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759