Performance claims

Started by Austin Lesea December 7, 2004
All,

http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf

For anyone interested in how V4 really stacks up.

Austin
Austin Lesea wrote:
> All,
>
> http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf
>
> For anyone interested in how V4 really stacks up.

Stacks up to what?  FPGA-90 is no product that I am aware of.  Why can't
Xilinx use the name of the competition part?  Otherwise this is a pretty
pointless paper.

-- 
Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX

Austin Lesea wrote:
> All,
>
> http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf
>
> For anyone interested in how V4 really stacks up.
>
> Austin

Recheck Table 2. The VHDL code is swapped.

Kolja Sulimma

Austin,
Are the blocks of code misplaced in Table 2? I'd say the code on the left 
had two levels of pipeline.
Cheers, Syms.
"Austin Lesea" <austin@xilinx.com> wrote in message 
news:cp4k3n$puu1@cliff.xsj.xilinx.com...
> All,
>
> http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf
>
> For anyone interested in how V4 really stacks up.
>
> Austin

Hi Austin,

I just had a quick look, and there seems to be a mistake in table 2, p.5
(Verilog descriptions should be swapped for one stage vs two stage 
pipeline).

Regards,

Steven


Austin Lesea wrote:
> All,
>
> http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf
>
> For anyone interested in how V4 really stacks up.
>
> Austin

"Austin Lesea" <austin@xilinx.com> wrote in message
news:cp4k3n$puu1@cliff.xsj.xilinx.com...
> All,
>
> http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf
>
> For anyone interested in how V4 really stacks up.
>
> Austin

There was one good pointer in the above Xilinx white paper!  It's on page 6:
www.opencores.org !  :)

And yes, it looks like Stratix just got a new name: "FPGA-90nm"!  LOL.  If
"FPGA-90nm" is now a reference/alias to Altera Stratix, then it's a good ad
for them!  Or?

Antti

Symon,

Checking.....

Austin

Symon wrote:
> Austin,
> Are the blocks of code misplaced in Table 2? I'd say the code on the left
> had two levels of pipeline.
> Cheers, Syms.
> "Austin Lesea" <austin@xilinx.com> wrote in message
> news:cp4k3n$puu1@cliff.xsj.xilinx.com...
>
>>All,
>>
>>http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf
>>
>>For anyone interested in how V4 really stacks up.
>>
>>Austin

Yes,

Code is swapped in the table.

Will be fixed shortly.

Thank you to all who caught it.

It is not supposed to be a test!

Austin

steven derrien wrote:
> Hi Austin,
>
> I just had a quick look, and there seems to be a mistake in table 2, p.5
> (Verilog descriptions should be swapped for one stage vs two stage
> pipeline).
>
> Regards,
>
> Steven
>
>
> Austin Lesea wrote:
>
>> All,
>>
>> http://www.xilinx.com/bvdocs/whitepapers/wp218.pdf
>>
>> For anyone interested in how V4 really stacks up.
>>
>> Austin

[In my attempt to text-format my first posting, I somehow double spaced
it... weird.  If I fail this time, my descent into management is complete.]

I would like to offer some clarification of points raised in this
whitepaper, first in summary and then in some detail.  I will occasionally
refer to our web-based performance seminar
(http://seminar2.techonline.com/s/altera_dec0704) for further details.

* Constraints.  The clock constraint methodology we employ matches
  that outlined in the whitepaper.  It is good to see that both
  companies can agree on something!

* High-Effort Compiles.  We run the ISE software in the mode that
  yields the highest results across our benchmark set.  We also run a
  seed sweep ("multi-pass") for ISE at the end of the process.

* Retiming.  ISE does not offer physical synthesis during place and
  route.  Quartus II does.  We do not use XST (and hence XST retiming)
  since we find this results in a far greater disadvantage for Xilinx
  than when we use a common synthesis tool (Synplicity in this case).

* Block Performance.  Maximum block toggle rates are pretty worthless
  if the fabric that stitches the blocks together can't keep up.  Our
  design set includes a variety of types of resources including RAMs
  and DSPs, yet yields +39% performance advantage.  Why?  Our blocks
  have comparable propagation delays, which it turns out matter more,
  and our logic & routing are substantially faster.  Also, our Fmax
  limits have increased in Quartus II 4.2 and will continue to increase
  as we complete our detailed characterization process.

* Design entry.  Good advice that applies to any modern FPGA (Stratix
  II and Virtex-4).

* Speed Grades.  We compare to what's available in the software.  If
  users know how much faster a -12 device will be (we do not), they
  can derate our 39% average performance advantage accordingly.

Clock Constraints
^^^^^^^^^^^^^^^^^
    We appear to agree on how to constrain clocks.
    For synthesis, we employ the flow suggested by Synplify to optimize
multiple clock designs.  This results in optimization of all clock domains.
Are there other ways to do it?  Probably -- but since Synplify Pro 7.7 is
a common denominator in our comparisons, it is hard to see how changing this
would affect the 39% average performance advantage that we see for Stratix
II.
    For ISE, as outlined in the web-seminar (slide #9) and other locations,
we constrain each clock independently and iterate to find the best such
(tight) constraints.  As you suggest, we do not look at paths that cross
clock domains (difficult to do in an apples-to-apples way).  We do not
over-constrain ISE, as we have found this degrades Xilinx performance.  Slide #10
shows the results of the iterative constraint process for one design (with
two clocks); I think it highlights the rigour and correctness of this
process.
    I should point out that for Quartus II, we don't need to jump through
hoops since applying a global 1 GHz constraint on the clocks will result in
each clock being optimized as best as possible.
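
    As a rough illustration of the iterative constraint process described
above for ISE, here is a minimal Python sketch for a single clock.  It is
not the actual benchmarking flow: mock_pnr is a hypothetical stand-in for a
place-and-route run, its 5.0 ns limit and response curve are made up, and
the "re-constrain at what the last run achieved" rule is an assumption about
how such an iteration might be done.

```python
from typing import Callable


def mock_pnr(target_ns: float) -> float:
    """Hypothetical stand-in for one place-and-route run on one clock.

    Returns the achieved clock period in ns for a requested period.
    The shape is an assumption: loose targets leave speed unused, and
    targets tighter than the design's true limit make the result worse.
    """
    true_best_ns = 5.0  # made-up design limit
    if target_ns >= true_best_ns:
        return true_best_ns + 0.3 * (target_ns - true_best_ns)
    return true_best_ns + 0.5 * (true_best_ns - target_ns)


def iterate_constraint(run_pnr: Callable[[float], float],
                       start_ns: float,
                       tol_ns: float = 0.05,
                       max_runs: int = 20) -> float:
    """Re-constrain one clock at the period the last run achieved,
    stopping once requested and achieved periods agree within tol_ns."""
    target = start_ns
    best = float("inf")
    for _ in range(max_runs):
        achieved = run_pnr(target)
        best = min(best, achieved)
        if abs(achieved - target) <= tol_ns:
            break
        target = achieved  # tighten (or relax) to what was achieved
    return best


best_period = iterate_constraint(mock_pnr, start_ns=8.0)
print(f"best period {best_period:.3f} ns, Fmax {1000.0 / best_period:.1f} MHz")
```

    Starting from a loose constraint and tightening toward the achieved
value approaches the limit from above, so the loop never asks for more than
the last run delivered; in this mock, asking for more would make results
worse, which mirrors the over-constraining remark above.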

Synthesis/P&R Effort
^^^^^^^^^^^^^^^^^^^^
    On the P&R front, we use the ISE settings that yield the best
performance results across our benchmark set.  We also run a seed-sweep (or
"multi-pass" compile) using ISE at the end of our iterative process.
    For synthesis, we have no reason to believe that enabling a high-effort
mode in Synplicity would change the conclusions of our comparison, since we
are using the same synthesis tool for both Stratix II and Virtex-4.
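
    For readers unfamiliar with a seed sweep, the idea can be sketched in a
few lines of Python: compile the same design several times with different
placer seeds and keep the best result.  Everything here is illustrative;
mock_fmax_mhz is a hypothetical stand-in for a full compile, and the
200 MHz +/- 10 MHz spread is invented.

```python
import random


def mock_fmax_mhz(seed: int) -> float:
    """Hypothetical stand-in for a full compile: Fmax varies with the seed."""
    return 200.0 + random.Random(seed).uniform(-10.0, 10.0)


def seed_sweep(compile_fn, seeds):
    """Compile once per seed and return (best_seed, best_fmax_mhz)."""
    results = {seed: compile_fn(seed) for seed in seeds}
    best_seed = max(results, key=results.get)
    return best_seed, results[best_seed]


best_seed, best_fmax = seed_sweep(mock_fmax_mhz, range(1, 6))
print(f"best seed {best_seed}: {best_fmax:.1f} MHz")
```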

Register Retiming/Physical Synthesis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    Quartus II can perform physical synthesis optimizations during
place-and-route.  These algorithms have access to detailed placement and
timing information, enabling further optimization that synthesis just can't
know about.  ISE does not provide any such optimizations.  Note: We always
include Tco and Tsu constraints, so our re-timer will not violate I/O timing
to improve core speed.
    We did not use Synplicity's retiming options during these comparisons,
and are in the process of evaluating how the comparison changes when we use
these options.  While one might guess that these optimizations would reduce
Quartus' physical synthesis upside, register retiming is only one of the
many algorithms employed in Quartus physical synthesis and is responsible
for a very small part of the +39% performance advantage we see.
    I'm told that ISE also offers some sort of retiming option during
synthesis with XST.  We find that using XST yields much worse Xilinx results
(which make us look much better), so we do not use XST, and hence do not use
that retiming option.

Block Performance
^^^^^^^^^^^^^^^^^
    Our benchmarking results address overall performance across real
designs.  These designs contain RAMs, DSP/MAC/Multipliers, adders, counters,
and other such building blocks in a large variety of sizes and varying
quantities.  We do not claim that Stratix II is 39% faster on all building
blocks, but rather that when you put it all together Stratix II is 39%
faster.
    Why is this?  Fundamentally, the logic and routing of Stratix II is
significantly faster -- and you need logic & routing to stitch together the
blocks.  Also, critical paths often start or end on a RAM/DSP, and are very
rarely just a RAM/DSP toggling in isolation.  The timing microparameters of
the RAM/DSP are quite comparable between the two families.  According to the
Virtex-4 data sheet, the DSP microparameters are faster in the -12 device
and we will certainly rerun the analysis when Xilinx releases software that
enables this fastest speed grade.
    Our Fmax limit is not simply 1/Tco.  The block toggle rate limits
imposed by Quartus II are selected based on characterization to guarantee
operation of our devices in all environments, under all noise and switching
conditions.  When you clock a block very quickly, you start getting
interesting effects that can affect operation.  As we complete the
characterization of hard IP blocks, we will raise these limits.  The Quartus
II 4.2 software introduces higher Fmax limits than stated in this table, and
further increases are likely in future software releases.
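
    One way to picture the interaction between path delays and block toggle
rate limits is the small Python sketch below.  It assumes the reported Fmax
for a clock is simply the lower of the path-delay limit and the lowest cap
among the hard blocks on that clock; that model and the cap values are
assumptions for illustration, not figures from either vendor's software.

```python
def reported_fmax_mhz(critical_path_ns: float, block_limits_mhz: dict) -> float:
    """Fmax a tool might report: the slower of the path-delay limit and
    the lowest toggle-rate cap among the hard blocks on that clock."""
    path_limit_mhz = 1000.0 / critical_path_ns
    return min([path_limit_mhz, *block_limits_mhz.values()])


# Made-up caps, for illustration only.
limits = {"ram": 400.0, "dsp": 450.0}
print(reported_fmax_mhz(2.0, limits))  # path allows 500 MHz, so the RAM cap wins
print(reported_fmax_mhz(4.0, limits))  # path delay is the limit here
```

    In the second call the block caps are irrelevant, which is the point
made above: raising a toggle rate limit only helps designs whose logic and
routing are already fast enough to reach it.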

Speed Grades
^^^^^^^^^^^^
    I believe we have addressed this in numerous forums.  We use the
available speed grades in the software.  We can't compare to something we
can't get our hands on.  Users can derate our +39% average performance
result by the difference between our fastest and medium speed grade to get a
flavour for how things will compare if & when a fast Virtex-4 speed grade is
made available.
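
    The derating arithmetic can be sketched as follows, assuming speed
ratios simply multiply: if one device is 39% faster on average and the other
later gains some fraction s from a faster speed grade, the remaining
advantage is (1 + 0.39) / (1 + s) - 1.  The speedup values in the loop are
hypothetical; no actual -12 speed grade figure is implied.

```python
def derated_advantage(advantage: float, competitor_speedup: float) -> float:
    """Remaining fractional advantage after the competitor speeds up.

    Both arguments are fractions (0.39 means 39%).  Assumes the speed
    ratios multiply, which is a simplification.
    """
    return (1.0 + advantage) / (1.0 + competitor_speedup) - 1.0


# Hypothetical speed-grade gains, for illustration only.
for speedup in (0.00, 0.10, 0.15, 0.20):
    remaining = derated_advantage(0.39, speedup)
    print(f"{speedup:.0%} faster grade -> {remaining:.1%} remaining advantage")
```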

Regards,

Paul Leventis
Altera Corp.