V4 vs. Stratix-II...

Started by Joseph H Allen May 12, 2005
Rudi, nice idea, but it won't work, with the two companies involved.
Many years ago, there was PREP,  with a very similar idea. It died
because the FPGA manufacturers could not resist the temptation to
tinker with the results ( I used the words "lied and cheated"). Our
"friends" presented designs with "virtual" flip-flops, to improve the
packing density. It became one big shouting match.

The stakes are just too high for either of the marketing departments to
admit "defeat", and there are too many subtle aspects of designing with
FPGAs, hardware and software.
"Everybody is the winner" will be the unavoidable outcome.

It seems that the user community likes the intense competition and
diversity.
And we like the fact that FPGAs have not become a commodity where price
is the only differentiator. There is still lots of room for creativity
and innovation.
Peter Alfke

Warning:  Ranty, opinionated (and quite probably wrong):


Stupid question on the X vs A urination (in a hurricane) contest:

How much does performance really matter?

First, how many FPGA tasks are not defined by an external clock or
clocks?  If you are doing GigE, your clock is 125 MHz (8 bit path) or
62.5 MHz (16 bit path).  The PCI-X bus is 33/66/99/133 MHz.
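
The GigE figures above are just the line rate divided by the datapath width. A minimal sketch of that arithmetic, assuming 1000 Mbit/s of decoded data on the internal path (the function name is made up for illustration):

```python
def required_clock_mhz(data_rate_mbps: float, path_width_bits: int) -> float:
    """Clock (MHz) needed to carry data_rate_mbps over a path_width_bits-wide bus."""
    return data_rate_mbps / path_width_bits


# Assumption: 1000 Mbit/s of payload data must cross the internal datapath.
for width in (8, 16):
    print(f"{width}-bit path: {required_clock_mhz(1000, width)} MHz")
```

The point is that the interface, not the fabric, fixes the target clock; a faster LUT does not change the requirement.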

Second, how many designs have single-cycle latency requirements?  PCI
does, but your part either can or can't make the PCI spec with the
provided IP core (so that's a pass/fail metric, not a performance
metric).

If the task is latency bound overall, then performance matters.  But
otherwise, just add more registers & pipeline more finely.
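
The trade-off in "pipeline more finely" can be sketched with a toy model. The assumptions are mine: the logic splits evenly across stages, and each register adds a fixed overhead (clock-to-out plus setup). The 20 ns and 0.5 ns values are purely illustrative, not measurements from any device.

```python
def pipeline_timing(logic_ns: float, stages: int, reg_overhead_ns: float = 0.5):
    """Return (fmax_mhz, latency_ns) for logic split evenly into `stages` stages.

    Toy model: clock period = per-stage logic delay + fixed register overhead.
    """
    period_ns = logic_ns / stages + reg_overhead_ns
    return 1000.0 / period_ns, stages * period_ns


# Hypothetical 20 ns block of combinational logic.
for n in (1, 2, 4, 8):
    fmax, latency = pipeline_timing(20.0, n)
    print(f"{n} stage(s): fmax = {fmax:6.1f} MHz, latency = {latency:5.1f} ns")
```

Under this model, throughput scales with the number of stages while total latency only creeps up, which is why raw fabric speed matters mainly for latency-bound designs.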

Thus I personally wonder whether the primary focus of the pissing match
should be mostly about tools (both the vendor tools and support for
third-party tools, especially easy floorplanning, datapath-aware
placement, & retiming), density ($/LE), and features (Brand X has a
big lead here), rather than whose LUT is 10% faster on what functions,
and whose interconnect might be slightly faster on some designs and
slower on others.
-- 
Nicholas C. Weaver.  to reply email to "nweaver" at the domain
icsi.berkeley.edu
"Rudolf Usselmann" <russelmann@hotmail.com> wrote in message
news:d62jct$gjj$1@nobel.pacific.net.sg...
> Austin Lesea wrote:
>
> > Jim,
> >
> > ...
> >
> > I disagree that the ultimate (best) performance in S2 is better, as that
> > is not what our research has shown.  Again, Altera has their own suite
> > of XX designs that they use to benchmark their device, and they also
> > make exactly the same claim.
>
> Austin,
>
> to settle this argument once and for all, why not take a bunch
> of designs that are freely available on OpenCores, and present
> utilization and performance reports without doing any tweaking
> of the designs?  There are many VHDL and Verilog designs available
> on OpenCores, from CPUs to crypto cores to communication cores.
>
> Both companies could present their own results, including
> a script showing how to reproduce the results, in case somebody
> wanted to double check.
>
> If you could agree to do this for Xilinx, and perhaps we get a
> volunteer from the Altera camp, we can openly choose some designs ...
>
> Best Regards,
> rudi

Rudi,

it would not work that way, and you would get no support for the idea
(officially, at least) from any FPGA vendor.  There is just too much at
stake.

But some companies do something similar: they have test environments
that are run against the latest tools from multiple FPGA vendors.  Those
are the companies that design FPGA/ASIC tools.  To my knowledge most of
those companies are pissed at the FPGA companies, because their share of
the bread is shrinking as the FPGA vendor tools get better (or include
new functionality), and I think there are some other problems as well.

Anyway, those companies run testbenches.  For a slightly different
reason, but I think they pretty much 'see' and 'know' the differences
between the FPGA fabrics from different vendors.  But all that
benchmarking stays strictly inside those companies, and there is no
public info.

FPGA benchmarking in the open has failed.  It is virtually impossible
to do without some kind of bias, and the results are not usable without
very strict explanations of the circumstances under which the comparison
is valid.  The HDL-to-fabric mapping (the whole process) is too complex,
and there are too many small things that may or may not have an impact
on the results.

Antti
with his last 2 cents :)

Rudi,

The problem is that without any regard to device specific features, the 
results will vary by a tremendous amount.

Austin

I recently wasted several hours trying to get a project from ISE 6.1i to 
*load* correctly in ISE 6.3.03i.

I always seem to lose more time to the tools (either due to bugs or the 
tools being downright awful compared to a software compilation toolflow 
at providing easily useful data etc.) than I do to meeting timing 
requirements, or adapting a design to live with a lower clock.

So yes, in my view I'm more interested in toolflows lately, although with 
only two fish in the pond hardware-wise, and in both cases the hardware and 
tools being intimately linked (at lower levels, and at higher levels for 
those on restricted budgets), there is sadly little choice :-(

An interesting aside: I've been getting involved with work on using FPGAs 
in high performance computing, coming from a background in both.  People 
I meet who come from a software/HPC background with no FPGA experience 
tend to be appalled by the FPGA software flows.

I'm reasonably convinced that a 'proper' implementation of the modular 
design stuff from Xilinx (i.e. not relying on the disappearing 
tristate bus emulation from the Virtex architecture) would make the tools 
more usable, not least by reducing raw hardware and time requirements for 
big PARs.

Mind you it could be argued that the reason I have the luxury to be pissed 
at the tools is because the hardware is now at a state where it does what 
I want most of the time  :-)

 ---

cds


Thank you all.  This has been very helpful.


-- 
/*  jhallen@world.std.com (192.74.137.5) */               /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
I think you are somewhat missing the point with the A & X question, in
that you ask the wrong question.

It's not who has the best architecture or which one is fastest.  It
actually doesn't really matter: for 99% of designs, as Austin has pointed
out before, either is good enough.  And if you're in the 1% where it
matters, then nothing you do will give you a good enough idea until you
try to fit the final FF or CLB, and even then your design will be so
customised that an A design is almost impossible to translate to X and
vice versa.

What really matters is what price X's or A's FAE will sell you the parts
at, what support they will give you, and what evaluation boards are around
that cover some if not all of your needs.

The decision at my work came down to which company gave us the best
discount.  That happened to be Xilinx.  It also happened that they do bus
LVDS, which we are using, so our design naturally forced A out anyway; we
just didn't tell anyone :-)

If you are building a one-off then it really doesn't matter anyway.  Use a
dartboard and a blindfold; it will be as accurate as a detailed study.  For
a one-off, just choose an eval board with a largish device, get it all
working and see how big it is, then choose a device twice the size required
(for the inevitable fixups).

my two cents

Simon

Nicholas Weaver wrote:
<snip>
> Thus I personally wonder whether the primary focus of the pissing match
> should be mostly about tools (both the vendor tools and support for
> third-party tools, especially easy floorplanning, datapath-aware
> placement, & retiming), density ($/LE), and features (Brand X has a
> big lead here), rather than whose LUT is 10% faster on what functions,
> and whose interconnect might be slightly faster on some designs and
> slower on others.

Or the number of user designs that broke on the latest Vxyz release?

-jg

Hi Joseph,

First, I must stress that comparing "micro parameters" is difficult at best 
and dangerous at worst.  There are fairly arbitrary decisions made during 
timing modeling about where you lump various delays.  For example, where 
does "LUT delay" begin and end -- is it at the output of the 1st stage 
buffer after the multiplexor before the LUT?  Or is that multiplexor's delay 
included as part of LUT delay?

The Stratix/Stratix II/Cyclone/Cyclone II/Max II timing models are 
sufficiently complicated that there is little point to making datasheet 
entries for various internal timing parameters.  For example, the ALM is 
fairly complicated and depending on how your logic is synthesized and 
exactly how the router chooses to hook it up, your delay can vary 
considerably.  So your best bet is to look at real circuits with real timing 
constraints, since Quartus II will do its best to put the critical signals 
on the fastest paths.  That said...

As some posters have already pointed out, RAM speeds have increased in 
Quartus 5.0.  The latest comparison I've seen shows us with a Tco advantage 
vs. Virtex-4 when the RAM output registers are used, and a slight 
disadvantage when the RAM is unregistered -- in either case a few hundred ps 
difference.

As for LUT delays, here are the latest numbers I've got for a fastest speed 
grade 7-input LUT (the ALM can do some functions of 7 inputs, and all 
functions of 6 inputs), as well as for a 4-LUT (the ALM can do two 
independent 4-LUTs).

Input    7-LUT    4-LUT
A        378 ps   366 ps
B        357 ps   228 ps
C        240 ps   225 ps
D        240 ps   53 ps
E        144 ps
F         53 ps
G        234 ps

According to Austin's post, Virtex-4 (fastest speed grade -- I dare you to 
try to buy one ;-)) shows 165 ps across-the-board (seems bogus to me, but 
what do I know).  So which LUT is faster based on this data?  Well, it 
depends on how we lumped our delays into logic vs. routing (see above).  It 
also depends on how often Quartus II will manage to route your critical 
signal on the fast LUT inputs -- usually it does a very good job of this.
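
To make the "it depends" concrete, here is a minimal sketch that compares the per-input numbers in the table above against the flat 165 ps figure. The dictionary and function names are invented for illustration; the delays are the ones listed in this post, and nothing here accounts for how delay is lumped between logic and routing.

```python
# Per-input delays (ps) copied from the table in this post.
ALM_7LUT_PS = {"A": 378, "B": 357, "C": 240, "D": 240, "E": 144, "F": 53, "G": 234}
ALM_4LUT_PS = {"A": 366, "B": 228, "C": 225, "D": 53}
V4_LUT_PS = 165  # flat figure quoted from Austin's post


def summarize(delays_ps: dict, reference_ps: int) -> dict:
    """Summarize per-input delays and list the inputs faster than the reference."""
    values = list(delays_ps.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "faster_inputs": sorted(p for p, d in delays_ps.items() if d < reference_ps),
    }


for name, table in (("7-LUT", ALM_7LUT_PS), ("4-LUT", ALM_4LUT_PS)):
    print(name, summarize(table, V4_LUT_PS))
```

Which side "wins" flips depending on whether the critical signal lands on one of the fast inputs, which is exactly why a single headline LUT delay says so little.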

The other critical component for logic fabric performance is the routing. 
Based on an analysis of routing delay between registers placed a varying 
distance apart in the X- and Y-directions, we've found that we have a ~20% 
delay advantage (fastest speed grade vs. fastest speed grade).  Of course, 
even this type of study has its caveats -- how do you normalize distance to 
take into account differences in logic density?

Stratix II employs a low-k inter-metal dielectric (k = 2.9) vs. Virtex-4's 
"reduced-k" dielectric (k = 3.6), giving us a ~20% metal capacitance 
advantage.  If you set aside architectural and circuit differences, to first 
order you'd expect this to translate into a performance advantage for 
Stratix II.
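
The ~20% figure follows from the ratio of the two k values, assuming metal capacitance scales linearly with the dielectric constant and that geometry is identical (a first-order assumption, as noted above). A quick check:

```python
def capacitance_reduction(k_low: float, k_ref: float) -> float:
    """Fractional capacitance reduction, assuming C is proportional to k."""
    return 1.0 - k_low / k_ref


# k values as quoted in this post: 2.9 (Stratix II) vs. 3.6 (Virtex-4).
print(f"{capacitance_reduction(2.9, 3.6):.1%}")  # prints 19.4%
```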

Regards,

Paul Leventis
Altera Corp.


> Then there is the interconnect.  V4 is 500 ps faster for full chip routes,
> 400 ps faster for 1/2 chip routes, 100-200 ps faster for a few CLBs, LABs,
> and 100-200 ps for neighbor routes.  Some very short routes are 30 ps better
> in S2.

I would guess that you did not normalize to take into account packing 
density.  How do you define a "short" route?  Do you multiply the # of CLBs 
and # of LABs by the right ratio of logic?  I'd argue that 1 LAB = 8 ALMs = 
~10-10.5 slices (based on our density analysis).  Anyway, the average 
distance of a hop in a critical path is roughly 3 LABs, so short connections 
are the most important.  Our data shows a performance advantage in hops of 
this length.

> Of course, anything you can direct into the DSP48s will just scream, and
> outperform anything S2 has.

That's interesting... did you miss the news that we've increased Stratix II 
DSP performance to 550 MHz in Quartus II 5.0?  Not to mention that the S2 
DSP can do 36-bit multiplies in hardware (vs. 18-bit for DSP48)... but I 
will not digress into a feature pissing contest.

> I think that the newsgroup here will basically tell you to try a design in
> both architectures, and play with the constraints to see how well it does.

On this, I agree with Austin.  Kick the tires.  Just be sure to set timing 
constraints before doing so, and also make sure not to use "toy" designs 
(neither tool is particularly well optimized for very small designs in very 
large chips).  And beware numerical noise -- placement & routing is a 
heuristic.  If you perturb any aspect of the input, the output can change 
due to random differences in algorithm outcome.

Regards,

Paul Leventis
Altera Corp.
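
One way to guard against the numerical noise mentioned above is to run each tool several times with small perturbations (different seeds, for example) and only trust a difference that exceeds the run-to-run spread. A minimal sketch of that check follows. The fmax values are hypothetical placeholders, not measurements from either vendor's parts, and the "gap larger than spread" rule is just one simple convention.

```python
from statistics import median


def compare_runs(fmax_a, fmax_b):
    """Compare two sets of fmax results (MHz) from repeated place-and-route runs."""
    med_a, med_b = median(fmax_a), median(fmax_b)
    # Worst-case run-to-run spread seen on either side.
    spread = max(max(fmax_a) - min(fmax_a), max(fmax_b) - min(fmax_b))
    gap = abs(med_a - med_b)
    return {
        "median_a": med_a,
        "median_b": med_b,
        "gap": gap,
        "spread": spread,
        "meaningful": gap > spread,  # difference exceeds the noise band
    }


# Hypothetical numbers, for illustration only.
device_a = [201.0, 195.5, 208.2, 199.1, 203.4]
device_b = [204.8, 198.0, 210.5, 196.3, 206.1]
print(compare_runs(device_a, device_b))
```

With these made-up numbers the medians differ by a few MHz while individual runs vary by more than 10 MHz, so the comparison would be reported as not meaningful.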