Hello,

I need the highest possible number of multiplication operations per second at low cost. I know that several factors affect the overall performance, but since I have no idea which FPGA chips might be worth considering, I'd like to ask which chip you think has the lowest ratio

R = (price of chip) * (delay time) / (number of multipliers)

18x18-bit multipliers seem to be quite common, so let's assume this design for the estimate.

For example, for the Spartan XC3S1000 (~$60, 24 multipliers, 4 ns delay) I get R = $10 per (billion multiplications/s). The Cyclone EP2C70 (~$230, 150 multipliers, 4 ns delay) has R = $6.13 per (billion multiplications/s).

Do other FPGAs exist that are maybe specialized for multiplication-intensive tasks and are therefore much cheaper?

Best regards,
Ryan
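The ratio in this post is easy to check numerically. The following is only a minimal sketch and not part of the original post: the function name `cost_per_gmult` is made up for illustration, and the prices and the 4 ns delay are the rough figures quoted in the post.

```python
def cost_per_gmult(price_usd, delay_ns, n_multipliers):
    """R = price * delay / multipliers, in $ per (billion multiplications/s).

    With the delay given in ns, one multiplier delivers 1/delay_ns billion
    products per second, so price * delay_ns / n_multipliers is already in
    $ per (1e9 multiplications/s).
    """
    return price_usd * delay_ns / n_multipliers


print(round(cost_per_gmult(60, 4, 24), 2))    # Spartan XC3S1000 -> 10.0
print(round(cost_per_gmult(230, 4, 150), 2))  # Cyclone EP2C70  -> 6.13
```

Note that this treats every multiplier as busy on every cycle, which is the same simplification the metric itself makes.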
high number of multipliers / low cost
Started by ●April 4, 2007
Reply by ●April 4, 2007
Look at the Cyclone III EP3C family that was just released.

<ryan_usenet@yahoo.com> wrote in message news:1175686533.726030.62610@d57g2000hsg.googlegroups.com...
> Hello,
> I need the highest possible number of multiplication operations per
> second at low cost. I know that several factors affect the overall
> performance, but since I have no idea which FPGA chips might be worth
> to be considered, I'd like to ask what you think is the chip with the
> lowest ratio
>
> R=(price of chip)* (delay time)/(number of multipliers)
>
> 18x18bit multipliers seem to be quite common, so lets assume this
> design for the estimate.
>
> For example for the Spartan XC3S1000 (~60$, 24 multipliers, 4ns delay)
> I have
> R= 10$ per (Billion multiplications/s). The Cyclone EP2C70 (~230$, 150
> multipliers, 4ns delay) has R=6.13$ per (Billion multiplications/s).
>
> Do other FPGAs exist that are maybe specialized for multiplication-
> intensive tasks and which therefore are much cheaper?
>
> Best regards,
> Ryan
Reply by ●April 4, 2007
<ryan_usenet@yahoo.com> wrote in message news:1175686533.726030.62610@d57g2000hsg.googlegroups.com...
> Hello,
> I need the highest possible number of multiplication operations per
> second at low cost.

Wow, FPGA marketing departments are going to be swarming all over you like it's Christmas!

Wait, are you sure you're real...? ;-)

-Ben-
<removes tongue from cheek>
Reply by ●April 4, 2007
ryan_usenet@yahoo.com wrote:
> Hello,
> I need the highest possible number of multiplication operations per
> second at low cost. I know that several factors affect the overall
> performance, but since I have no idea which FPGA chips might be worth
> to be considered, I'd like to ask what you think is the chip with the
> lowest ratio
>
> R=(price of chip)* (delay time)/(number of multipliers)
>
> 18x18bit multipliers seem to be quite common, so lets assume this
> design for the estimate.
>
> For example for the Spartan XC3S1000 (~60$, 24 multipliers, 4ns delay)
> I have
> R= 10$ per (Billion multiplications/s). The Cyclone EP2C70 (~230$, 150
> multipliers, 4ns delay) has R=6.13$ per (Billion multiplications/s).
>
> Do other FPGAs exist that are maybe specialized for multiplication-
> intensive tasks and which therefore are much cheaper?
>

Look at the Xilinx Virtex-5 SXT series.

The V5 SX95T, for example, has 640 multipliers running at 450 MHz. I don't know its cost, but even if it's $750 (and it's probably less), that would make R = $2.6 per (billion multiplications/s).

You can also look at the Spartan-3A DSP series. The 3SD3400A costs around $60 and can do 30 billion multiplications per second, so that would make R = 2.

Sylvain
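When a datasheet gives a clock rate rather than a delay, the same ratio can be computed from the multiplier count and frequency. A small sketch under the assumption of one product per multiplier per clock; `r_from_clock` is a hypothetical helper name, and the $750 price is the guess from the post above, not a quoted price.

```python
def r_from_clock(price_usd, n_multipliers, f_mhz):
    """$ per (billion multiplications/s), assuming every multiplier
    produces one product per clock cycle (fully pipelined, always busy)."""
    gmult_per_s = n_multipliers * f_mhz / 1000.0  # billions of products/s
    return price_usd / gmult_per_s


# Virtex-5 SX95T: 640 multipliers at 450 MHz, guessed price of $750
print(round(r_from_clock(750, 640, 450), 2))  # -> 2.6
```

For the 3SD3400A the post gives the throughput directly, so the ratio is simply 60 / 30 = 2.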
Reply by ●April 4, 2007
Thank you for your replies so far. It is no problem if the chip is ~$1000, because the aim is some 10^12 multiplications/s. The simplistic estimate needs 3 V5 SX95T chips or 27 EP2C70 chips. For easier design, a lower number of chips is preferred.

Is there some more comprehensive overview of FPGA prices, especially for the larger devices, than what I find at www.digikey.com?

Although your suggestions have already been useful, I'd appreciate more hints, maybe concerning exotic manufacturers that I have never heard of. Or are Xilinx and Altera definitely the only choices?

Best regards,
Ryan

(
> Wow, FPGA marketing departments are going to be swarming all over you like
> it's Christmas!
>
> Wait, are you sure you're real...? ;-)
>
> -Ben-

Why do you doubt this? Because the question sounds stupid? I don't pretend that I am close to production of the planned FPGA board, but I need an idea of what will be possible at what cost.)
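The chip counts can be reproduced with a few lines. This is a rough sketch only: `chips_needed` is a made-up name, the clock rates are the ones mentioned in the thread (450 MHz for the SX95T, and 4 ns, i.e. 250 MHz, for the EP2C70), and every multiplier is assumed to be busy on every cycle.

```python
import math


def chips_needed(target_mult_per_s, n_multipliers, f_mhz):
    """Return (exact ratio, rounded-up chip count) for a throughput target,
    assuming one product per multiplier per clock."""
    per_chip = n_multipliers * f_mhz * 1e6  # products per second per chip
    ratio = target_mult_per_s / per_chip
    return ratio, math.ceil(ratio)


ratio, count = chips_needed(1e12, 640, 450)   # V5 SX95T
print(round(ratio, 2), count)                 # -> 3.47 4
ratio, count = chips_needed(1e12, 150, 250)   # EP2C70
print(round(ratio, 2), count)                 # -> 26.67 27
```

Rounding up strictly gives 4 SX95T devices for a full 10^12 multiplications/s; the figure of 3 in the post fits the looser "some 10^12" target.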
Reply by ●April 4, 2007
<ryan_usenet@yahoo.com> wrote in message news:1175696938.323383.273760@o5g2000hsb.googlegroups.com...
>
> Although your suggestions have already been useful, I'd appreciate
> more hints maybe concerning exotic manufacturers that I have never
> heard of. Or are Xilinx and Altera definitely the only choices?
>
> Best regards,
> Ryan
>

Hi Ryan,
Don't forget FPGAs have other resources apart from 'hard' multipliers. Check this page on Mr. Andraka's website about distributed arithmetic. You can make a _lot_ of multipliers out of the ordinary fabric of FPGAs.
HTH, Syms.
http://www.andraka.com/distribu.htm
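For readers unfamiliar with the term, the idea behind distributed arithmetic can be sketched in software. This is only a behavioural model under stated assumptions (constant coefficients, unsigned inputs, one input bit processed per step), not code from the linked page; the names `da_lut` and `da_dot` are made up for illustration.

```python
def da_lut(coeffs):
    """Precompute the lookup table: entry `addr` holds the sum of the
    coefficients whose bit is set in `addr`."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << k)]


def da_dot(coeffs, xs, width):
    """Bit-serial sum of products sum(c * x) for unsigned `width`-bit xs,
    using only table lookups, shifts and adds (no multiplications)."""
    lut = da_lut(coeffs)
    acc = 0
    for b in range(width):             # one 'clock' per input bit
        addr = 0
        for i, x in enumerate(xs):     # gather bit b of every input
            addr |= ((x >> b) & 1) << i
        acc += lut[addr] << b          # shift-and-accumulate
    return acc


coeffs = [3, -5, 7, 2]
xs = [10, 200, 33, 255]
print(da_dot(coeffs, xs, 8) == sum(c * x for c, x in zip(coeffs, xs)))
```

The table plays the role of the FPGA's LUTs, which is why the technique suits fixed-coefficient sums of products such as FIR filters rather than general multiplication of two variables.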
Reply by ●April 4, 2007
Ryan,

Your metric is extremely simple. Perhaps too simple?

What is it that you wish to do? 1E12 multiplies per second is a bit too simplistic. You need to consider the 'care and feeding' of this 'monster'.

What is the resolution required? 18x18? Or 9x9? Or really 25x18?

Is it all multiplies, and no accumulates? Hard to imagine a problem where no addition whatsoever is required. Multipliers alone may not suffice. What about accumulators? What number of bits?

The DSP48 blocks in the Xilinx architectures are intended for most common needs. The DSP48 in V5 also has the traditional 4-bit control (16-function) ALU as part of the DSP block. Very wide AND, OR, XOR, etc. are all provided.

You should also consider how much SRAM is on chip. If you have insufficient RAM, you cannot "feed your monster."

Similarly, I/O is required to feed the RAM and get the results. You need to consider if the I/O is a bottleneck (limits your performance).

Austin
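To put a number on "feeding the monster", here is a back-of-the-envelope sketch. It is an illustration under assumptions, not a statement about any particular design: the function name is made up, and the `reuse` parameter is a hypothetical knob for how many multiplications each fetched operand takes part in once it is on chip.

```python
def operand_bandwidth_tbit_s(mult_per_s, bits_per_operand=18,
                             operands_per_mult=2, reuse=1):
    """Input bandwidth in Tbit/s needed to supply the multipliers.

    reuse = 1 means every operand is fetched fresh from off chip;
    larger values model operands held in on-chip RAM and reused.
    """
    bits_per_s = mult_per_s * operands_per_mult * bits_per_operand / reuse
    return bits_per_s / 1e12


print(operand_bandwidth_tbit_s(1e12))              # no reuse -> 36.0
print(operand_bandwidth_tbit_s(1e12, reuse=1000))  # heavy on-chip reuse
```

With no reuse, 10^12 18x18 multiplications per second would need 36 Tbit/s of operands, so the achievable rate depends heavily on how much data can stay in on-chip RAM.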
Reply by ●April 4, 2007
On 4 Apr., 17:22, "Symon" <symon_bre...@hotmail.com> wrote:
> Hi Ryan,
> Don't forget FPGAs have other resources apart from 'hard' multipliers.

Well, part of the other logic is of course needed for accumulation, registers etc. If large amounts of logic are left free, my plan is to use them as additional 'software' multipliers. However, I have no idea how efficient these multipliers built from logic blocks can be.

As far as I know, you have the choice of implementing either parallel or sequential 'software' multipliers. The parallel architecture has the advantage that it can calculate 1 product per clock, but it requires lots of logic, so not many of them could be used. The sequential architecture needs about 1/n of the logic space, but finishes only after about n clocks, so I don't expect either of these solutions to add significantly to the performance of the dedicated 'hardware' multipliers.

If the sequential multiplier could be used in 'pipeline' operation (meaning the result is delayed by n clocks, but then a new result is produced in each cycle), this would be interesting, but I guess it cannot?

Thanks
Ryan
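The sequential architecture described here can be modelled in a few lines. This is a behavioural sketch only, assuming unsigned operands and one partial product per clock; `sequential_multiply` is a made-up name and does not correspond to any vendor core.

```python
def sequential_multiply(a, b, width=18):
    """Unsigned shift-and-add multiplier: one partial product per clock.

    Returns (product, clocks_used). A single adder is reused on every
    cycle, which is why the logic cost is roughly 1/width of a fully
    parallel array, at the price of `width` clocks per result.
    """
    acc = 0
    clocks = 0
    for i in range(width):      # each iteration models one clock cycle
        if (b >> i) & 1:
            acc += a << i       # add the shifted multiplicand
        clocks += 1
    return acc, clocks


product, clocks = sequential_multiply(12345, 54321)
print(product == 12345 * 54321, clocks)  # -> True 18
```

Because the one adder is occupied for all 18 cycles, a new operand pair cannot enter until the previous product is finished, which is the limitation the question is about.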
Reply by ●April 4, 2007
<ryan_usenet@yahoo.com> wrote in message news:1175686533.726030.62610@d57g2000hsg.googlegroups.com...
> Hello,
> I need the highest possible number of multiplication operations per
> second at low cost. I know that several factors affect the overall
> performance, but since I have no idea which FPGA chips might be worth
> to be considered, I'd like to ask what you think is the chip with the
> lowest ratio
>
> R=(price of chip)* (delay time)/(number of multipliers)
>
> 18x18bit multipliers seem to be quite common, so lets assume this
> design for the estimate.
>
> For example for the Spartan XC3S1000 (~60$, 24 multipliers, 4ns delay)
> I have
> R= 10$ per (Billion multiplications/s). The Cyclone EP2C70 (~230$, 150
> multipliers, 4ns delay) has R=6.13$ per (Billion multiplications/s).
>
> Do other FPGAs exist that are maybe specialized for multiplication-
> intensive tasks and which therefore are much cheaper?
>
> Best regards,
> Ryan

Another chip to consider: the Lattice ECP2(M) series. The Lattice approach was to stuff the inexpensive chips with DSP resources, since many massively parallel algorithms can't afford the super-performance chips. Their sysDSP blocks can be configured as eight 9-bit, four 18-bit, or one 36-bit multiplier per block. A MAC structure is included for FIR-style applications. For 18-bit multipliers, an 88-multiplier solution has a "marketing" price around $35 (ECP2-70). I think your 18-bit R value is about $1.60, but I have a little difficulty figuring out the multiplier speeds.

The families with the most multipliers at reasonable cost - Altera Cyclone III, Xilinx Spartan-3DSP, Lattice ECP2 - are all rather new. The costs might be higher in the near term than marketing announcements would suggest. You really need to get conversations going with the sales reps from these three companies so that they can work the numbers toward your goal. Once they understand your pipeline needs, they should be able to give you a truly attainable frequency and a price or price range that you can compare against your economy of scale.

Something else that might interest you is the FPOA from Mathstar. Their "Field Programmable Object Array" has MACs, ALUs, and register files as the distributed elements, capable of 1 GHz speeds according to their literature. Since this product is far outside my technical needs, I haven't delved too far into it, but your application might be one of the few FPGA-style designs that can seriously leverage this not-so-mainstream technology.
Reply by ●April 4, 2007
<ryan_usenet@yahoo.com> wrote in message news:1175701234.081630.27560@d57g2000hsg.googlegroups.com...
> On 4 Apr., 17:22, "Symon" <symon_bre...@hotmail.com> wrote:
>
> As far as I know you have the choice of
> implementing either parallel or sequential 'software' multipliers. The
> parallel architecture has the advantage that it can calculate 1
> product per clock, but requires lots of logic, so that not many of
> them could be used.

You would have to define "lots". Also, you might calculate one product per clock with a fabric-based multiplier, but it will be a pretty long clock cycle.

> If the sequential multiplier could be used in 'pipeline' operation
> (meaning the result is delayed by n clocks, but then a new result is
> produced in each cycle), this would be interesting, but I guess, it
> cannot?

No, but the "parallel" multiplier architecture could. Indeed, it must be, or your clock speed would be ridiculously slow.

It would be interesting to know what you think "a multiplier" is, i.e. what word length, and are you talking about fixed-point or floating-point operations?

-Ben-
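The point about pipelining the parallel architecture can be illustrated with a small behavioural model. It is a sketch under assumptions, not HDL and not any vendor's implementation: the multiplier is unrolled into one add stage per multiplier bit with a register after each stage, operands are unsigned, and the name `pipelined_multiply` and the 4-bit width are chosen only to keep the trace short.

```python
def pipelined_multiply(pairs, width=4):
    """Model an unrolled shift-and-add multiplier with one register
    stage per multiplier bit. Returns a list of (clock, product)."""
    stages = [None] * width          # pipeline registers: (a, b, acc)
    feed = list(pairs)
    results = []
    clock = 0
    while feed or any(s is not None for s in stages):
        clock += 1
        new_in = feed.pop(0) if feed else None
        incoming = (new_in[0], new_in[1], 0) if new_in else None
        nxt = [None] * width
        for i in range(width):
            # Stage i takes its input from the register before it
            src = incoming if i == 0 else stages[i - 1]
            if src is not None:
                a, b, acc = src
                if (b >> i) & 1:
                    acc += a << i    # stage i adds partial product i
                nxt[i] = (a, b, acc)
        stages = nxt
        if stages[-1] is not None:   # a finished product leaves the pipe
            results.append((clock, stages[-1][2]))
    return results


print(pipelined_multiply([(3, 5), (7, 9), (15, 15)]))
# -> [(4, 15), (5, 63), (6, 225)]
```

The first product appears after 4 clocks (the pipeline depth), and after that one product emerges on every clock, while each stage only has to fit a single addition into the clock period.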




