Best FPGA for floating point performance

Started by Marc Battyani August 19, 2005
Hello,

Has anybody already made a comparison of the high performance FPGAs (Stratix
II, V4, ?) relative to double precision floating point performance (add,
mult, div, etc.)?

It's for an HPC application.

Thanks

Marc


Marc,

IEEE floating point standard?  You need to be more specific.

Does it need to integrate with a processor?

I believe the Xilinx IBM 405 Power PC using the APU interface in Virtex 
4 with the floating point IP core provides the best and fastest performance.

Especially since no other FPGA vendor has a hardened processor to 
compete with us.

If all you want is the floating point processing, without a 
microprocessor, then I think you will find similar performance between 
Xilinx and our competition, with us (of course) claiming the superior 
performance edge.

It would not surprise me at all to see them also post claiming they are 
superior.

For a specific floating point core, with a given precision, for given 
features, it would be pretty easy to benchmark, so there is very little 
wiggle room here for marketing nonsense.

I would be interested to hear from others (not competitors) about what 
floating point cores they use, and how well they perform (as you 
obviously are interested).

Austin


Marc Battyani (Marc.Battyani@fractalconcept.com) wrote:
: Hello,

: Has anybody already made a comparison of the high performance FPGAs (Stratix
: II, V4, ?) relative to double precision floating point performance (add,
: mult, div, etc.)?

: It's for an HPC application.

Hi Marc,
     I don't have a comparison of various cores but a lot of info is out 
there in datasheets.

However, in an HPC application the performance of your maths cores may not 
be the bottleneck, rather it is likely to be a question of how fast can 
you interface the host system to the FPGA, how fast can you shunt data 
around between CPU, CPU RAM, FPGA and FPGA RAM etc.

The heavyweight HPC/FPGA hybrid systems I have seen, such as the Cray-XD1 
and SGI NUMAflex/Altix stuff use Xilinx FPGAs.

Although I wouldn't want to generalise for the whole field, other 
interested parties such as Nallatech and Starbridge Systems tend to go 
for Xilinx.

Certainly Xilinx seem to have a head start in the field (no thanks to 
their tools, from the word on the street :-). Possibly this has more to do 
with interfacing than FP core performance.

Not answering the original question, but there you go :-)

Cheers,
     Chris

(A strong believer in FPGA type stuff for HPC, although perhaps the 
granularity is less than optimal and the tools not very well suited, but 
hey it's early days.)

 : Thanks

: Marc


While an x86 or Cell cluster could whip an FPGA at IEEE FPU in raw clock
speed (I am not sure about cost though), you can flip the odds some by
defining your own numerics with a direct mapping to the plentiful
18-bit muls.

If I am not mistaken, IEEE is not the be-all and end-all of FPU and has
a certain number of detractors, especially in some fields, regarding rounding,
exceptions etc. If you do define your own FP set you can simulate it
fairly easily right in your HPC app and see if it gives comparable
results. For instance 1, 2 or 4 multipliers running a 37-bit mantissa might
be enough to avoid double IEEE; only you can figure that out.
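
The idea of simulating a custom format inside the existing application can be tried in a few lines before any hardware exists. What follows is only a minimal Python sketch, not something posted in the thread: `round_to_mantissa` and `dot` are hypothetical helper names, and the method (compute in double, then round every intermediate to the narrower mantissa) merely approximates a real narrow datapath, since it rounds twice and keeps the exponent range of a double.

```python
import math

def round_to_mantissa(x: float, bits: int) -> float:
    """Round x to `bits` significant bits (leading 1 included), ties to even."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scaled = m * float(1 << bits)     # exact: scaling by a power of two
    return math.ldexp(float(round(scaled)), e - bits)

def dot(xs, ys, bits=None):
    """Dot product; if `bits` is given, round every intermediate to that width."""
    def q(v):
        return v if bits is None else round_to_mantissa(v, bits)
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = q(acc + q(q(x) * q(y)))
    return acc

if __name__ == "__main__":
    # Arbitrary test vectors, chosen only for illustration.
    xs = [0.1 * i for i in range(1, 11)]
    ys = [1.0 / i for i in range(1, 11)]
    ref = dot(xs, ys)
    for bits in (24, 37, 53):
        approx = dot(xs, ys, bits)
        print(bits, approx, abs(approx - ref) / abs(ref))
```

Swapping the real kernel in for `dot` and comparing against the double-precision result gives a first feel for whether a 36 or 37-bit mantissa is tolerable for a given formula.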

I think I would even go for a custom CPU design with a highly serial 18x18
datapath and try to pump it as fast as the fabric will allow. I notice
that the soft core FPUs out there don't run anywhere near the 300 MHz
speeds being quoted for mul units. Perhaps the V4 500 MHz DSP block can
be microcoded into a decent FPU unit but as soon as you need the odd
features,

Anyway I think that's what I would do; if that doesn't work too well
then I would look at QinetiQ and other vendors. These links can be found on
the X and A sites.

So what is your app and what hardware are you running on?

"Austin Lesea" <austin@xilinx.com> wrote

> Marc,
>
> IEEE floating point standard?  You need to be more specific.

IEEE 754. It's for a computational accelerator. It will get values from a
general purpose processor (Xeon, Itanium, etc.) and send the results back in
the same format, though the internal computations could be done in another
format. The other stuff needed is pretty standard (PCI(-X or Express), DDR2,
etc.).

> Does it need to integrate with a processor?

No.

> I believe the Xilinx IBM 405 Power PC using the APU interface in Virtex
> 4 with the floating point IP core provides the best and fastest performance.
>
> Especially since no other FPGA vendor has a hardened processor to
> compete with us.

OK, that one is easy. ;-)

> If all you want is the floating point processing, without a
> microprocessor, then I think you will find similar performance between
> Xilinx and our competition, with us (of course) claiming the superior
> performance edge.

The idea is to hardwire some formulas, performing as many FLOPs concurrently
as possible. This is the only way to go faster than a very fast processor
like an Itanium II or even a simple Xeon.

> It would not surprise me at all to see them also post claiming they are
> superior.
>
> For a specific floating point core, with a given precision, for given
> features, it would be pretty easy to benchmark, so there is very little
> wiggle room here for marketing nonsense.
>
> I would be interested to hear from others (not competitors) about what
> floating point cores they use, and how well they perform (as you
> obviously are interested).

Sure! And this time it should be easy to get useful technical numbers.

Marc

JJ,

Perhaps you should read:

http://www.xilinx.com/bvdocs/ipcenter/data_sheet/floating_point.pdf

first?

At 429 MHz for a Virtex 4 for a square root, that is 56 clocks, or 130.5 
ns for the answer.  7.663 million floating point square roots per second.

And, if you need more, you can implement more than one core, and get 
more than one answer per 56 clocks....
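
The arithmetic above is easy to reproduce and to extend to several cores. The snippet below is a small Python sketch under stated assumptions: the 429 MHz clock and 56-clock figure come from the post, the function names are made up for illustration, and the eight-core case is an arbitrary example. It also assumes each core delivers one result every 56 clocks; a core that accepts a new operand every clock would have a different `cycles_per_result`, which should be checked against the datasheet.

```python
def latency_ns(clock_mhz: float, cycles: int) -> float:
    """Time for one result to emerge, in nanoseconds."""
    return cycles / clock_mhz * 1e3

def results_per_second(clock_mhz: float, cycles_per_result: int, cores: int = 1) -> float:
    """Throughput if each core delivers one result every `cycles_per_result` clocks."""
    return cores * clock_mhz * 1e6 / cycles_per_result

if __name__ == "__main__":
    print(f"{latency_ns(429, 56):.1f} ns per square root")
    print(f"{results_per_second(429, 56) / 1e6:.2f} M sqrt/s with one core")
    print(f"{results_per_second(429, 56, cores=8) / 1e6:.2f} M sqrt/s with eight cores")
```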

I am not aware of any x86 that can run quite that fast (even for one 
core).  Their claims are that the floating point hardware unit speeds up 
the software execution by at least a factor of 5.  We are talking here 
about a speedup of 80 to 100 times over using fixed point integer 
software to emulate a floating point square root....not a factor of 5!

Austin

"c d saunter" <christopher.saunter@durham.ac.uk> wrote :

> Marc Battyani (Marc.Battyani@fractalconcept.com) wrote:
> : Hello,
>
> : Has anybody already made a comparison of the high performance FPGAs (Stratix
> : II, V4, ?) relative to double precision floating point performance (add,
> : mult, div, etc.)?
>
> : It's for an HPC application.
>
> Hi Marc,
>      I don't have a comparison of various cores but a lot of info is out
> there in datasheets.
>
> However, in an HPC application the performance of your maths cores may not
> be the bottleneck, rather it is likely to be a question of how fast can
> you interface the host system to the FPGA, how fast can you shunt data
> around between CPU, CPU RAM, FPGA and FPGA RAM etc.

Yes, memory bandwidth is one of the bottlenecks, especially for the general
purpose processors.

> The heavyweight HPC/FPGA hybrid systems I have seen, such as the Cray-XD1
> and SGI NUMAflex/Altix stuff use Xilinx FPGAs.

Very interesting. In fact this is what we want to do (on a smaller scale
probably ;-) I find it somewhat depressing to see that Cray can't come up
with something much better than a bunch of FPGAs, but at the same time it's
very cool to have access to the same technology as Cray. Or even better, as
they seem to use Virtex II :)

> Although I wouldn't want to generalise for the whole field, other
> interested parties such as Nallatech and Starbridge Systems tend to go
> for Xilinx.

OK.

> Certainly Xilinx seem to have a head start in the field (no thanks to
> their tools, from the word on the street :-). Possibly this has more to do
> with interfacing than FP core performance.
>
> Not answering the original question, but there you go :-)

Well, in fact I'm interested in the whole HPC/FPGA question anyway.

> Cheers,
>      Chris
>
> (A strong believer in FPGA type stuff for HPC, although perhaps the
> granularity is less than optimal and the tools not very well suited, but
> hey it's early days.)

Sure, much fun anyway.

Marc

"JJ" <johnjakson@yahoo.com> wrote in message
news:1124484934.397020.194050@g43g2000cwa.googlegroups.com...

> While an x86 or Cell cluster could whip an FPGA at IEEE FPU in raw clock
> speed (I am not sure about cost though), you can flip the odds some by
> defining your own numerics with a direct mapping to the plentiful
> 18-bit muls.

Using a grid is fine when the problem can be parallelized with a rather
coarse granularity, but that's not always the case.

> If I am not mistaken, IEEE is not the be-all and end-all of FPU and has
> a certain number of detractors, especially in some fields, regarding rounding,
> exceptions etc. If you do define your own FP set you can simulate it
> fairly easily right in your HPC app and see if it gives comparable
> results. For instance 1, 2 or 4 multipliers running a 37-bit mantissa might
> be enough to avoid double IEEE; only you can figure that out.

Yes, I thought about using a 36-bit mantissa to reduce the number of hard
multipliers needed and the latency. The inputs/outputs need to be in IEEE 754
though.

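
The multiplier saving from a narrower mantissa can be estimated with a quick count. The Python sketch below is illustrative only: `hard_multipliers` is a hypothetical helper, and it assumes a full schoolbook product in which each operand is cut into chunks and every chunk pair gets its own multiplier block. The default chunk width of 18 bits follows the reasoning in this thread; if the 18x18 blocks are signed, only 17 bits per chunk may be usable for an unsigned mantissa, which is an assumption to check against the device datasheet. The 53-bit case is the IEEE 754 double significand including its implicit leading bit.

```python
def hard_multipliers(mantissa_bits: int, chunk_bits: int = 18) -> int:
    """Multiplier blocks for a full schoolbook mantissa product.

    Each operand is split into ceil(mantissa_bits / chunk_bits) chunks and
    every chunk of one operand is multiplied by every chunk of the other.
    """
    chunks = -(-mantissa_bits // chunk_bits)   # ceiling division
    return chunks * chunks

if __name__ == "__main__":
    # Columns: mantissa width, blocks with 18-bit chunks, blocks with 17-bit chunks.
    for bits in (24, 36, 37, 53):
        print(bits, hard_multipliers(bits, 18), hard_multipliers(bits, 17))
```

A real core may truncate low-order partial products or reuse one multiplier over several clocks, so the count is an upper-bound style estimate rather than a datasheet figure.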

> I think I would even go for a custom CPU design with a highly serial 18x18
> datapath and try to pump it as fast as the fabric will allow. I notice
> that the soft core FPUs out there don't run anywhere near the 300 MHz
> speeds being quoted for mul units. Perhaps the V4 500 MHz DSP block can
> be microcoded into a decent FPU unit but as soon as you need the odd
> features,
>
> Anyway I think that's what I would do; if that doesn't work too well
> then I would look at QinetiQ and other vendors. These links can be found on
> the X and A sites.
>
> So what is your app and what hardware are you running on?

The apps can be rather diverse. In fact, as Christopher pointed out, it looks
like we are doing some kind of small Cray-XD1 ;-)

As for the hardware, we are designing it.

Marc

Hi Austin

Very interesting, but V4 and S3E are still pretty darn new. I don't check on
them every 5 mins, but QinetiQ is definitely hot in this area (not
surprising given their (sq) roots at RSRE).

At some point I will do a detailed study of FPGA FPU design vs x86 FPU
numbers for my transputer project.

From the OP's website I can't guess what iron he'd use but the
application seems a bit clearer now.

Usually when I see HPC-FPGA, I might infer somebody working with
Opteron+Virtex-II Pro systems like the Cray and SGI kits, but it doesn't look
like it here.

Regards

JJ

JJ,

Something I just couldn't find anywhere was the actual performance of 
the x86 co-processor for something like a floating point square root.

We have clock cycles for each IEEE floating point operator, and the 
speed of the synthesized, placed and routed core for various families, 
from Spartan 3 to Virtex 4, in that pdf file.

I suppose uP software people don't really care about performance in 
terms of cycles or ns or mops....it's all about what game screen graphics 
are displayed in the coolest fashion....

Does anyone have a link to such a site that has 'real' data of floating 
point op performance?

Austin


