I'm upgrading a design, and I'm in the early phases of choosing a vendor. I'm trying to compare parts based on experience I've had in the past, so I'm focusing on block RAM clock to out delay as a critical performance number:

Altera M4K vs. Xilinx Block RAM clock to out delay, non-registered outputs:

Stratix-II -3   2.46 ns
Stratix-II -4   2.828 ns
Stratix-II -5   3.393 ns

Xilinx-V4 -11   1.83 ns
Xilinx-V4 -10   2.10 ns

Xilinx-V2 -4    2.65 ns (current part)

V4 appears to be 1.62 times faster than Stratix-II for the slowest speed grade parts (which I'm probably most interested in, though I should really compare equal-priced parts), and the slowest Stratix-II is slower even than the original V2 design. Am I missing something? Several posts here suggest that Stratix-II interconnect is faster. Is there any datasheet evidence to back this up? Let's say the RAM output is at least feeding a 2:1 MUX before being registered, and probably has to travel ~1/3 the width of the chip.

Also, help me fill in my chart:

LUT delay:

Xilinx-V2 -4    439 ps
Xilinx-V4 -10   200 ps
Xilinx-V4 -11   170 ps
Stratix-II      ? (can't find any data)

Carry delay:

Xilinx-V2 -4    106 ps
Xilinx-V4 -10   90 ps
Xilinx-V4 -11   80 ps
Stratix-II      ? (can't find any data)

Routing delay:

I can do this with fpga_editor in Xilinx. How to do it for Stratix-II?

--
/* jhallen@world.std.com (192.74.137.5) */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
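The ratios in the post can be re-derived from the clock-to-out table. The short Python sketch below does only that arithmetic; the dictionary keys and the `ratio` helper are names made up for illustration, and the delays are the figures quoted in the post, not fresh measurements.

```python
# Clock-to-out delays (ns) copied from the table in the post above.
tco_ns = {
    "Stratix-II -3": 2.46,
    "Stratix-II -4": 2.828,
    "Stratix-II -5": 3.393,
    "Xilinx-V4 -11": 1.83,
    "Xilinx-V4 -10": 2.10,
    "Xilinx-V2 -4": 2.65,
}


def ratio(slower, faster):
    """Return how many times longer the delay of `slower` is than that of `faster`."""
    return tco_ns[slower] / tco_ns[faster]


# Slowest speed grade of each new family against the other.
slowest_grades = ratio("Stratix-II -5", "Xilinx-V4 -10")
# Fastest speed grade of each new family.
fastest_grades = ratio("Stratix-II -3", "Xilinx-V4 -11")
# Slowest Stratix-II against the part currently in use.
vs_current = ratio("Stratix-II -5", "Xilinx-V2 -4")

print(f"slowest grades: Stratix-II / V4 = {slowest_grades:.2f}")
print(f"fastest grades: Stratix-II / V4 = {fastest_grades:.2f}")
print(f"Stratix-II -5 / V2 -4           = {vs_current:.2f}")
```

With the quoted handbook numbers this reproduces the 1.62 figure for the slowest grades and gives about 1.34 for the fastest grades. Both ratios would change if the Stratix-II delays are re-characterized.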
V4 vs. Stratix-II...
Started by ●May 12, 2005
Reply by ●May 12, 2005
Hi Joseph,

I stopped reading data sheets since they're way too big and the information is never organized the way I need to have it. So I tend to simply write little test cases and let the tools tell me what I need to know. I would personally just compile the design with your new constraints in both ISE and Quartus II (v5 has just been released) and see who comes out best.

> Altera M4K vs. Xilinx Block RAM clock to out delay, non-registered
> outputs:
>
> Stratix-II -3   2.46 ns
> Stratix-II -4   2.828 ns
> Stratix-II -5   3.393 ns
>
> Xilinx-V4 -11   1.83 ns
> Xilinx-V4 -10   2.10 ns
>
> Xilinx-V2 -4    2.65 ns (current part)

I suggest you re-check Stratix-II timing with Quartus II 5.0. Altera has been doing some re-characterization which seemingly hasn't made it to the handbook yet. In an M4K I am using in a Stratix II I'm getting 1.85ns for a -3 part and 2.4ns for a -5 part.

> LUT Delay:
> Stratix-II ? (can't find any data)

Well, it kind of varies between (off the cuff) 83ps and 400ps depending on the input that changes and the mode the ALM is in. Easy to check in Quartus with, for example, an 8-input AND or so. I'm getting cell delays between 0.047 and 0.404ns depending on the mode and the input of the ALM (see below on how to do this).

> Carry delay:
>
> Xilinx-V2 -4    106ps
> Xilinx-V4 -10   90 ps
> Xilinx-V4 -11   80 ps
> Stratix-II      ? (can't find any data)
>
> Routing delay:
>
> I can do this with fpga_editor in Xilinx. How to do it for Stratix-II ?

Open the timing analyzer. Right-click a path and select "List Paths" from the menu. When expanding the messages in the status window you should get detailed info on both cell and routing delay of the path.

Best regards,
Ben
Reply by ●May 12, 2005
Hi Joseph,

Remember that in Q II 5.0 the M4K performance has increased from 400 to 550 MHz. It looks like you're using the out-of-date numbers for tCO. The new ones should be ~ 1.88 ns (I'm guessing).

There are a few ways to find the routing delays in Q II. The most detailed way is to open the Timing Floorplanner (Assignments/Timing Closure Floorplan), right-click a used logic cell, and choose Locate>Chip Editor. From here you can multi-select resources, choose View/Show Delays, right-click, and choose "Generate Connections Between Nodes". You can show the actual routes used with View/Highlight Routing.

The easier way is to stay in the Timing Floorplanner, Ctrl-click the stuff you want to find delays for, make sure View/Routing/"Show Routing Delays" is selected, and choose View/Routing/"Show Paths Between Nodes".

Interesting ... the Stratix II handbook doesn't have LUT timing params. I was sure they were there for Stratix. Well, it shouldn't be too difficult with Chip Editor ... maybe someone gets an answer before I do ...

-- Pete
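For scale, the 400 MHz and 550 MHz figures in Pete's reply can be turned into clock periods. The Python sketch below does that conversion and then, purely as an assumption, scales the old -3 handbook tCO by the same factor. Maximum clock rate and clock-to-out are separate parameters, so the scaled value is a ballpark cross-check of the guessed ~1.88 ns, not a datasheet number.

```python
def period_ns(fmax_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / fmax_mhz


old_period = period_ns(400)  # M4K rating before Quartus II 5.0
new_period = period_ns(550)  # M4K rating quoted for Quartus II 5.0

# ASSUMPTION (not from any handbook): tCO shrinks by the same factor as the period.
old_tco_ns = 2.46  # Stratix-II -3 figure from the original post
scaled_tco_ns = old_tco_ns * new_period / old_period

print(f"period at 400 MHz: {old_period:.3f} ns")
print(f"period at 550 MHz: {new_period:.3f} ns")
print(f"old -3 tCO scaled by the same factor: {scaled_tco_ns:.3f} ns")
```

Under that assumption the scaled -3 value comes out near 1.79 ns, the same neighborhood as the ~1.88 ns guess and the 1.85 ns Ben reports from Quartus II 5.0.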
Reply by ●May 12, 2005
Joseph,

I just saw a presentation that shows that V4 is faster on all interconnect paths (by as much as 500 ps for long paths) except the immediate neighbor paths, where we are just ever so slightly slower than S2 neighbor paths.

I also saw LUT comparisons, which took 8 slides, with animations, as comparing the 4LUTs to the ALM-LUT is not trivial: you have to look at each and every input to output delay. And then you have to make a guess as to how your logic will get synthesized. Yes, we are faster for 4 LUT (most inputs), and they are faster for wider functions (but not all inputs).

For example: S2 4LUT input delays to output (in order): 155ps, 382ps, 360ps, 275ps. V4 4LUT: 165ps, 165ps, 165ps, 165ps. (fastest speed grades, both companies).

Then there is the interconnect. V4 is 500 ps faster for full chip routes, 400 ps faster for 1/2 chip routes, 100-200 ps faster for a few CLBs, LABs, and 100-200ps for neighbor routes. Some very short routes are 30ps better in S2.

Below 32 bits, S2 is slightly better for an adder, and over 32 bits, V4 is better. Same for carry chain, where S2 is ~ 200 ps better at ~ 16 bits, and V4 is >500ps better at 48 bits, and longer carry chains (equal at 24 bits).

In our suite of test designs, we come out ~9% faster (on average) with a +/- 4% error margin. Of course some designs will be faster than that, and some slower, too. We generally favor wider arithmetic, and pipelining, where S2 favors empty designs, and small arithmetic functions. We tend to excel when the design gets full, and complex (like it does at the end of your project!).

BRAM functionality depends a lot on the use of registers, as use of the fabric registers really slows things down (and takes more power) compared with using the registers built into the BRAM.

Of course, anything you can direct into the DSP48s will just scream, and outperform anything S2 has.
I think that the newsgroup here will basically tell you to try a design in both architectures, and play with the constraints to see how well it does. Or, what I prefer, is to contact the FAEs of the respective companies, and ask them to show you how your design will perform (let them drive the tools). Or, do both. Austin
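As a rough check on the figures quoted in the post above, the Python sketch below summarizes the per-input 4LUT delays and fits a straight line through the two carry-chain points given (S2 about 200 ps better at 16 bits, equal at 24 bits). The linear model and the function names are assumptions made for illustration; nothing in the thread says carry delay scales linearly.

```python
# Per-input 4LUT delays (ps), fastest speed grades, as quoted in the post above.
s2_lut_ps = [155, 382, 360, 275]
v4_lut_ps = [165, 165, 165, 165]


def summarize(delays_ps):
    """Best, worst, and mean input-to-output delay in ps."""
    return {
        "best": min(delays_ps),
        "worst": max(delays_ps),
        "mean": sum(delays_ps) / len(delays_ps),
    }


s2_summary = summarize(s2_lut_ps)
v4_summary = summarize(v4_lut_ps)


def s2_carry_advantage_ps(bits):
    """S2 advantage over V4 in ps (negative means V4 is ahead).

    ASSUMPTION: a straight line through the two quoted points,
    (16 bits, +200 ps) and (24 bits, 0 ps).
    """
    slope_ps_per_bit = (0.0 - 200.0) / (24 - 16)  # -25 ps per bit
    return 200.0 + slope_ps_per_bit * (bits - 16)


print("S2 4LUT:", s2_summary)
print("V4 4LUT:", v4_summary)
print("S2 carry advantage at 48 bits:", s2_carry_advantage_ps(48), "ps")
```

On these numbers only one of the four S2 inputs beats the flat 165 ps quoted for V4, and the straight-line fit puts V4 about 600 ps ahead at 48 bits, which agrees with the ">500ps" claim. Whether that holds for a real design depends on how synthesis assigns signals to LUT inputs.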
Reply by ●May 12, 2005
Austin Lesea wrote:
> Joseph,
>
> I just saw a presentation that shows that V4 is faster on all
> interconnect paths (by as much as 500 ps for long paths) except the
> immediate neighbor paths, where we are just ever so slightly slower than
> S2 neighbor paths.

...(lots of numbers deleted)...

Without detailing what you're comparing (i.e., which device at which speed grade) none of this is meaningful.

Tommy -- not affiliated with either fighting bulls.
Reply by ●May 12, 2005
Tommy,

I thought I was clear: fastest speed grade, S2 and V4.

Austin

Tommy Thorn wrote:
> Austin Lesea wrote:
>
>> Joseph,
>>
>> I just saw a presentation that shows that V4 is faster on all
>> interconnect paths (by as much as 500 ps for long paths) except the
>> immediate neighbor paths, where we are just ever so slightly slower
>> than S2 neighbor paths.
>
> ...(lots of numbers deleted)...
>
> Without detailing what you're comparing (i.e., which device at which
> speed grade) none of this is meaningful.
>
> Tommy -- not affiliated with either fighting bulls.
Reply by ●May 13, 2005
Austin Lesea wrote:
<snip>
> For example: S2 4LUT input delays to output (in order): 155ps, 382ps,
> 360ps, 275ps. V4 4LUT: 165ps, 165ps, 165ps, 165ps. (fastest speed
> grades, both companies).

Since this is side-by-side, I was wondering why Xilinx specs all paths the same. Is that actually the worst path, and then the SW is free to use any path? [but your physical speed margin might change on a re-route]

Or is there really such a difference in the implementation that Xilinx's end up precisely identical, and Altera's vary over 2:1?

-jg
Reply by ●May 13, 2005
Joseph,

I agree with Ben. With so many variables and so much marketing B.S., your best bet is to compile using both a V4 and SII. I've found that performance is highly dependent on implementation, synthesis tools, and how full the device is. These are all variables outside of your FPGA vendor selection.

You also note that you're probably going with the slowest speed grade, so I assume cost is an issue. A true comparison cannot be made with cost included. In addition, you should also consider whether EasyPath for Xilinx or Hardcopy for Altera are alternatives to help lower your cost.

Finally, I would like to make one point about interconnect. Who cares if V4 or SII is slightly faster? It's the routing software that is going to make the major difference. Whichever software requires me to do the least amount of floorplanning is the one that wins. Also, how well does the software perform as the chip gets full? Personally, I think the floorplanning tools of ISE are easier to use than Quartus. However, I think Quartus does a much better job at placement and routing as a design gets very full (>90% utilization).

John
Reply by ●May 13, 2005
Jim,

I have been corrected by many. No, they are not all the same (in the hardware, and as an IC designer, I already knew that). However, in the past they were treated as all equal (for efficiency, finding and using the faster path is not necessarily a big benefit).

I do not know if the paths are treated the same or not (on the 4LUT) in V4 p&r. I am sure someone will tell me (now).

I think the point I was trying to make is that the 4LUT is faster than the ALM for a class of functions (4 inputs or less), and slower for wider functions (on some pins). So, the quality of the synthesis, followed by the place and route (constraints), will make a huge difference in the performance.

I have been told that every design that is better in S2 can, after some work, be made even better in V4. I do not doubt that Altera can, and does, make the exact same claim.

I disagree that the ultimate (best) performance in S2 is better, as that is not what our research has shown. Again, Altera has their own suite of XX designs that they use to benchmark their device, and they also make exactly the same claim.

Given the state of the marketing wars (see the "mine is...." thread), I think I'll stay safely in the engineering camp, and say: if you are really adamant about comparing the two, go take your finished design, and run it through both design tools, and make your own decision. Our FAEs are available to help you with that chore.

And please take into account that we offer: DSP48, EMAC, PPC, FIFO-BRAM that can be used to even greater advantage.

Austin
Reply by ●May 13, 2005
Austin Lesea wrote:
> Jim,
> ...
>
> I disagree that the ultimate (best) performance in S2 is better, as that
> is not what our research has shown. Again, Altera has their own suite
> of XX designs that they use to benchmark their device, and they also
> make exactly the same claim.

Austin,

to settle this argument once and for all, why not take a bunch of designs that are freely available on OpenCores, and present utilization and performance reports without doing any tweaking of the designs?

There are many VHDL and Verilog designs available on OpenCores, from CPUs to crypto cores to communication cores. Both companies could present their own results, including a script showing how to reproduce the results, in case somebody wanted to double check.

If you could agree to do this for Xilinx, and perhaps we get a volunteer from the Altera camp, we can openly choose some designs ...

Best Regards,
rudi
=============================================================
Rudolf Usselmann, ASICS World Services, http://www.asics.ws
Your Partner for IP Cores, Design, Verification and Synthesis
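If such a side-by-side run were done, the per-design results would still need to be reduced to one number. One possible way, sketched below in Python, is the geometric mean of per-design Fmax ratios. The function name, the design names, and every Fmax value here are placeholders invented for illustration; none of them come from a real tool run or from this thread.

```python
import math


def geomean_speedup(results):
    """Geometric mean of fmax_a / fmax_b over all designs.

    `results` maps a design name to a (fmax_a_mhz, fmax_b_mhz) pair.
    A value above 1.0 means tool flow A is faster on average.
    """
    ratios = [fmax_a / fmax_b for fmax_a, fmax_b in results.values()]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# PLACEHOLDER numbers only, not measurements from any real compile.
placeholder_results = {
    "design_1": (100.0, 100.0),
    "design_2": (200.0, 100.0),
    "design_3": (100.0, 200.0),
}

speedup = geomean_speedup(placeholder_results)
print(f"geometric-mean speedup of A over B: {speedup:.3f}")
```

A geometric mean is used in this sketch because it treats a design that runs twice as fast and one that runs half as fast as cancelling out, which an arithmetic mean of ratios would not do. Either vendor could reasonably prefer a different summary.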