http://www.dailytech.com/article.aspx?newsid=1920

So... I do see a possibility here.
DRC has announced its newest FPGA that drops into AMD's Socket 940
Started by ●April 28, 2006
Reply by ●April 28, 2006
Jan Panteltje wrote:
> http://www.dailytech.com/article.aspx?newsid=1920
>
> So... I do see a possibility here.

Definitely cool, but only where an FPGA is truly handy, e.g. grid work. I think servers [e.g. SSL work] are better served by two processors than by one processor and the FPGA. 8x200MHz only provides 400MB/sec traffic to the CPU so really this is useful for tasks which either totally reside on the FPGA side of the board or have really high latency (e.g. PK work). The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth more than an Opteron 275 (even more for the 285s) in the same socket.

I'd see use for this in animation work though, where an FPGA can raytrace a scene much faster than a CPU can and the work is high latency.

Still cool though. Good to see people using the 940 socket for more than just Opterons :-)

Tom
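The 400MB/sec figure above can be reproduced with a little arithmetic. This is only a sketch: it assumes "8x200MHz" means an 8-bit-wide HyperTransport link clocked at 200 MHz with two transfers per clock (double data rate), counted in one direction. The function name is made up for illustration.

```python
def ht_bandwidth_mb_per_s(width_bits, clock_mhz, transfers_per_clock=2):
    """Peak one-direction link bandwidth in MB/s (1 MB = 10**6 bytes).

    Assumes a double-data-rate link: two transfers per clock cycle.
    """
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock


# 8-bit link at 200 MHz, as quoted in the post
print(ht_bandwidth_mb_per_s(8, 200))
```

Under those assumptions the result matches the 400MB/sec quoted in the post; a wider or faster link scales the number linearly.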
Reply by ●April 28, 2006
tomstdenis@gmail.com writes:
> The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth
> more than an Opteron 275 (even more for the 285s) in the same socket.

What about the number of AES/sec?
Reply by ●April 28, 2006
tomstdenis@gmail.com wrote:
: Jan Panteltje wrote:
: > http://www.dailytech.com/article.aspx?newsid=1920

<snip>

: 8x200MHz only provides 400MB/sec traffic to the CPU so really this is
: useful for tasks which either totally reside on the FPGA side of the
: board or have really high latency (e.g. PK work).

Sitting on the HT bus like that offers residence about as close as you can get to a mainstream CPU. Given the new HT3 stuff - faster and links possible over 1 meter - i.e. directly joining blades - I really like this approach. Especially given the memory architecture that goes along with HT/Opterons. It's bringing mainstream CPUs and FPGAs back into the point to point multiple interconnect world of the TigerSHARCs and the old TI C40s.

It feels a bit like a resurgence of the old British Transputer, except with gate arrays mixing with CPUs on an equal footing in terms of connectivity.

cds
Reply by ●April 28, 2006
Paul Rubin wrote:
> tomstdenis@gmail.com writes:
> > The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth
> > more than an Opteron 275 (even more for the 285s) in the same socket.
>
> What about the number of AES/sec?

If it were triggered independently it'd be worth it. A 2.6GHz processor [less than half the cost of this FPGA] can do up to 10,156,250 AES-128-ECB/sec with plain C code (2.6GHz divided by the ~256 cycles a block takes). That's roughly 160MB/sec (about 155MiB/sec) of throughput.

Now, if you could have this thing trigger automatically, for instance with an APIC that responds to interrupts from a network controller, that would be a boost.

The typical AES core takes ~14 cycles to encrypt, but in FPGAs they normally run at a couple hundred MHz at most [usually topping out between 100 and 200MHz]. 200MHz is 13 times less than 2.6GHz, so 14 FPGA cycles are equivalent to 182 cycles at 2.6GHz. This is less than the 256 cycles that the Opteron takes but only marginally so.

For the cost of it you'd be better served by dropping another Opteron in. A single 285 could top out at 20.3M AES/sec across its two cores, which is way more than the typical FPGA can hope to achieve.

Where this would fly, I think, is on PDU work as I described, tying directly to the network controller. You really need higher latency work.

It should also be trivial to get ECC [especially binary field] PK much faster and lower latency on an FPGA than on the typical Opteron.

Tom
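The numbers in this post can be checked with a few lines of arithmetic. This is a sketch using only the figures quoted above (2.6GHz, ~256 cycles per block on the Opteron, ~14 cycles at 200MHz on the FPGA); the variable names are illustrative, and the dual-core figure assumes both cores of a 285 scale perfectly.

```python
CPU_HZ = 2_600_000_000        # Opteron clock quoted in the post
CPU_CYCLES_PER_BLOCK = 256    # cycles per AES-128 block in plain C (from the post)
AES_BLOCK_BYTES = 16          # AES block size
FPGA_HZ = 200_000_000         # optimistic FPGA clock from the post
FPGA_CYCLES_PER_BLOCK = 14    # typical non-pipelined AES core (from the post)

# Single-core software throughput
cpu_blocks_per_s = CPU_HZ // CPU_CYCLES_PER_BLOCK
cpu_mb_per_s = cpu_blocks_per_s * AES_BLOCK_BYTES / 1e6      # decimal megabytes
cpu_mib_per_s = cpu_blocks_per_s * AES_BLOCK_BYTES / 2**20   # binary mebibytes

# FPGA latency per block, expressed in CPU clock cycles
clock_ratio = CPU_HZ // FPGA_HZ
fpga_latency_in_cpu_cycles = FPGA_CYCLES_PER_BLOCK * clock_ratio

# Two cores, assuming perfect scaling (an assumption, not a measurement)
dual_core_blocks_per_s = 2 * cpu_blocks_per_s

print(cpu_blocks_per_s, cpu_mb_per_s, round(cpu_mib_per_s, 1))
print(clock_ratio, fpga_latency_in_cpu_cycles, dual_core_blocks_per_s)
```

The per-block latency comparison (182 versus 256 CPU cycles) is what makes the non-pipelined FPGA core only marginally faster per block.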
Reply by ●April 28, 2006
c d saunter wrote:
> tomstdenis@gmail.com wrote:
>
> : Jan Panteltje wrote:
> : > http://www.dailytech.com/article.aspx?newsid=1920
>
> <snip>
>
> : 8x200MHz only provides 400MB/sec traffic to the CPU so really this is
> : useful for tasks which either totally reside on the FPGA side of the
> : board or have really high latency (e.g. PK work).
>
> Sitting on the HT bus like that offers residence about as close as you can
> get to a mainstream CPU. Given the new HT3 stuff - faster and links
> possible over 1 meter - i.e. directly joining blades - I really like this
> approach. Especially given the memory architecture that goes along with
> HT/Opterons. It's bringing mainstream CPUs and FPGAs back into the point
> to point multiple interconnect world of the TigerSHARCs and the old TI
> C40s.

HT links are not solely designed for speed. Latency is the key. 16 lanes of PCIe can compete just fine with a 16x16 1GHz HT link in terms of bandwidth.

Oddly enough the best tasks for this are things which don't return back to back [e.g. raytrace a scene].

What this does open the door for though is for mixed architecture systems. E.g. synthesize a MIPS core in the FPGA and map the DDR controller on to it. Then you have x86 and MIPS in the same system. That'd be cool.

Tom
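The bandwidth comparison can be sketched as below. The assumptions are mine, not the poster's: "16 lanes of PCIe" is taken as first-generation PCIe (2.5 GT/s per lane with 8b/10b encoding), and "16x16 1GHz HT" as a link 16 bits wide in each direction with two transfers per clock. Both figures are peak, per direction, ignoring protocol overhead.

```python
def pcie_gen1_gb_per_s(lanes):
    """Peak one-direction PCIe 1.x bandwidth in GB/s.

    Each lane signals at 2.5 GT/s; 8b/10b encoding carries 8 data bits
    in every 10 bits on the wire.
    """
    raw_gbit_per_s = 2.5 * lanes
    data_gbit_per_s = raw_gbit_per_s * 8 / 10
    return data_gbit_per_s / 8  # bits -> bytes


def ht_gb_per_s(width_bits, clock_ghz, transfers_per_clock=2):
    """Peak one-direction HyperTransport bandwidth in GB/s (DDR assumed)."""
    return width_bits / 8 * clock_ghz * transfers_per_clock


print(pcie_gen1_gb_per_s(16))  # x16 PCIe
print(ht_gb_per_s(16, 1.0))    # 16-bit HT at 1 GHz
```

Under these assumptions the two links come out the same, which is consistent with the point that HT's advantage lies in latency more than raw bandwidth.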
Reply by ●April 28, 2006
tomstdenis@gmail.com wrote:
: HT links are not solely designed for speed. Latency is the key. 16
: lanes of PCIe can compete just fine with a 16x16 1GHz HT link in terms
: of bandwidth.

: Oddly enough the best tasks for this are things which don't return back
: to back [e.g. raytrace a scene].

I wouldn't call that odd - a modern CPU hiding behind caches with long pipelines is always going to struggle with low latency back/forwards/back/forwards shared tasks with an FPGA/Clearspeed/xxx. Certainly interesting things happen with FPGA silicon and CPU silicon coupled in a SOC or on an FPGA, but the clock rates are far below a dedicated CPU.

On the serial / parallel issue I have a leaning towards parallel for simplicity when it comes to the FPGA code and latency, although serial has benefits for physical complexity and routing. Also it feels like they leapfrog each other every few months in terms of bandwidth! The world is squeezing itself down a thin pipe these days though...

: What this does open the door for though is for mixed architecture
: systems. E.g. synthesize a MIPS core in the FPGA and map the DDR
: controller on to it.

: Then you have x86 and MIPS in the same system.

: That'd be cool.

An awful lot of cool things are on their way...
Reply by ●April 28, 2006
tomstdenis@gmail.com writes:
> The typical AES core takes ~14 cycles to encrypt, but in FPGAs they
> normally run at a couple hundred MHz at most [usually topping out
> between 100 and 200MHz]. 200MHz is 13 times less than 2.6GHz, so 14
> FPGA cycles are equivalent to 182 cycles at 2.6GHz. This is less than
> the 256 cycles that the Opteron takes but only marginally so.

I'd think if you're going to use such an expensive and exotic approach at all, you'd pipeline it to get one AES operation per cycle, maybe even more than one if you're doing something like EAX mode, or CTR mode on a large block in parallel.
Reply by ●April 28, 2006
Paul Rubin wrote:
> tomstdenis@gmail.com writes:
> > The typical AES core takes ~14 cycles to encrypt, but in FPGAs they
> > normally run at a couple hundred MHz at most [usually topping out
> > between 100 and 200MHz]. 200MHz is 13 times less than 2.6GHz, so 14
> > FPGA cycles are equivalent to 182 cycles at 2.6GHz. This is less than
> > the 256 cycles that the Opteron takes but only marginally so.
>
> I'd think if you're going to use such an expensive and exotic approach
> at all, you'd pipeline it to get one AES operation per cycle, maybe
> even more than one if you're doing something like EAX mode, or CTR
> mode on a large block in parallel.

Even with pipelining you're still on a fairly limited bus. At best you top out at whatever the bus between the two actually is. Keep in mind this is an FPGA and not an ASIC, so chances are it won't clock that high anyway. My 200MHz quote is just a really really optimistic quote.

From what I recall from my past job you'd be lucky to get something complicated like a PDU clocking higher than PCI freq [33MHz]. So while you could get an AES core to ~100MHz, it would only be doing CTR mode at most.

Block ciphers are not where this will shine, especially when the other processor is an Opteron.

The trick to making good use of something like an FPGA isn't serial speed. Even if you designed a custom RISC ALU on the FPGA it'd probably clock around 50MHz. Even with the best ISA you can craft for it, the Opteron could EMULATE the thing faster than you could run it. Where the FPGA will shine is for tasks with a LOT of parallel computation. Think like 16 FPU pipelines or a single cycle GF(2) multiplier, etc, etc, etc.

Other tasks where this would shine would be custom DSP filters, e.g. offload MPEG work. A FIR or IIR filter of significant delay [e.g. accuracy] could be constructed in a pipeline to get 1 sample/cycle at decent clock rates.

Tom
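The "you top out at whatever the bus is" argument can be put in numbers. A rough sketch, assuming a fully pipelined core that finishes one 16-byte AES block per clock and the 400MB/sec link figure quoted earlier in the thread; the function names are hypothetical.

```python
AES_BLOCK_BYTES = 16


def pipelined_core_mb_per_s(clock_mhz, blocks_per_cycle=1):
    """Raw throughput of a pipelined AES core in MB/s (1 MB = 10**6 bytes)."""
    return clock_mhz * blocks_per_cycle * AES_BLOCK_BYTES


def delivered_mb_per_s(clock_mhz, bus_mb_per_s):
    """Throughput seen by the CPU: the slower of the core and the bus."""
    return min(pipelined_core_mb_per_s(clock_mhz), bus_mb_per_s)


BUS_MB_PER_S = 400  # link figure quoted earlier in the thread

for clock_mhz in (33, 100, 200):
    print(clock_mhz,
          pipelined_core_mb_per_s(clock_mhz),
          delivered_mb_per_s(clock_mhz, BUS_MB_PER_S))
```

With these inputs even a 33MHz pipelined core would already saturate a 400MB/sec link, so extra core speed only helps when the data stays on the FPGA side of the board.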
Reply by ●April 28, 2006
tomstdenis,

http://www.xilinx.com/bvdocs/ipcenter/data_sheet/Helion_Standard_AES_AllianceCORE_data_sheet.pdf

shows some of the claimed clock rates for their AES encrypt/decrypt IP core: 257 MHz (V2 Pro) to 252 MHz (V4). Throughput in b/s is ~2 to 3X the clock rate (per this datasheet). Other cores run just shy of 200 MHz.

Other data from this same vendor makes claims of up to 20 Gb/s for throughput of their 'fast' FPGA based AES encryptors and decryptors.

At one time we made a 10 Gb/s decryptor to prove that distributing full resolution theater real time movies could be done with one FPGA in the 'projector.' This prevents piracy by decrypting the movie at the projector itself (at no time is the full digital information available for copying). This was back in the Virtex II days, so the 20 Gb/s claim is perfectly reasonable for V4 today (IMO).

There are a number of other IP vendors with encryptors and decryptors for our FPGAs.

http://xgoogle.xilinx.com/search?output=xml_no_dtd&ie=UTF-8&oe=UTF-8&client=iplocator&proxystylesheet=iplocator&site=IPLocator&filter=0&_ResultsView=Standard&num=25&q=aes&as_q=&getfields=*&newSearch=http://www.xilinx.com/xlnx/xebiz/search/ipsrch.jsp&formAction=http://www.xilinx.com/cgi-bin/search/iplocator.pl&IPCategory=&IPSubcategory=&sGlobalNavPick=&sSecondaryNavPick=&requiredfields=IPProducts&partialfields=

or

http://tinyurl.com/hajhj

Austin

tomstdenis@gmail.com wrote:
> Paul Rubin wrote:
>
>>tomstdenis@gmail.com writes:
>>
>>>The typical AES core takes ~14 cycles to encrypt, but in FPGAs they
>>>normally run at a couple hundred MHz at most [usually topping out
>>>between 100 and 200MHz]. 200MHz is 13 times less than 2.6GHz, so 14
>>>FPGA cycles are equivalent to 182 cycles at 2.6GHz. This is less than
>>>the 256 cycles that the Opteron takes but only marginally so.
>>
>>I'd think if you're going to use such an expensive and exotic approach
>>at all, you'd pipeline it to get one AES operation per cycle, maybe
>>even more than one if you're doing something like EAX mode, or CTR
>>mode on a large block in parallel.
>
> Even with pipelining you're still on a fairly limited bus. At best you
> top out at whatever the bus between the two actually is. Keep in mind
> this is an FPGA and not an ASIC, so chances are it won't clock that high
> anyway. My 200MHz quote is just a really really optimistic quote.
>
> From what I recall from my past job you'd be lucky to get something
> complicated like a PDU clocking higher than PCI freq [33MHz]. So while
> you could get an AES core to ~100MHz, it would only be doing CTR mode at
> most.
>
> Block ciphers are not where this will shine, especially when the other
> processor is an Opteron.
>
> The trick to making good use of something like an FPGA isn't serial
> speed. Even if you designed a custom RISC ALU on the FPGA it'd
> probably clock around 50MHz. Even with the best ISA you can craft for
> it, the Opteron could EMULATE the thing faster than you could run it.
> Where the FPGA will shine is for tasks with a LOT of parallel
> computation. Think like 16 FPU pipelines or a single cycle GF(2)
> multiplier, etc, etc, etc.
>
> Other tasks where this would shine would be custom DSP filters, e.g.
> offload MPEG work. A FIR or IIR filter of significant delay [e.g.
> accuracy] could be constructed in a pipeline to get 1 sample/cycle at
> decent clock rates.
>
> Tom
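To put the vendor throughput claims next to the software numbers from earlier in the thread, the bit rates can be converted to AES blocks per second. This is a sketch only: it takes the 10 and 20 Gb/s figures at face value as payload rates, and reuses the ~20.3M blocks/sec estimate for a dual-core Opteron 285 given earlier; the names are illustrative.

```python
AES_BLOCK_BITS = 128

# Software estimate from earlier in the thread: 2 cores x (2.6 GHz / 256 cycles)
OPTERON_285_BLOCKS_PER_S = 2 * (2_600_000_000 // 256)


def blocks_per_s(gbit_per_s):
    """AES-128 blocks per second for a given payload bit rate in Gb/s."""
    return gbit_per_s * 1_000_000_000 // AES_BLOCK_BITS


for rate in (10, 20):
    fpga_blocks = blocks_per_s(rate)
    ratio = fpga_blocks / OPTERON_285_BLOCKS_PER_S
    print(rate, fpga_blocks, round(ratio, 2))
```

On these figures a 20 Gb/s core would process several times more blocks per second than the dual-core estimate, provided the data path feeding it can keep up.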





