FPGARelated.com
Forums

DRC has announced its newest FPGA that drops into AMD's Socket 940

Started by Jan Panteltje April 28, 2006
http://www.dailytech.com/article.aspx?newsid=1920

So... I do see a possibility here.
Jan Panteltje wrote:
> http://www.dailytech.com/article.aspx?newsid=1920
>
> So... I do see a possibility here.
Definitely cool. But only where an FPGA is truly handy, e.g. grid work. I think servers [e.g. SSL work] are better served with two processors than with one processor and the FPGA.

8x200MHz only provides 400MB/sec of traffic to the CPU, so really this is useful for tasks which either reside totally on the FPGA side of the board or have really high latency (e.g. PK work).

The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth more than an Opteron 275 (even more for the 285s) in the same socket.

I'd see use for this in animation work though, where an FPGA can raytrace a scene much faster than a CPU can and the work is high latency.

Still cool though. Good to see people using the 940 socket for more than just Opterons :-)

Tom
tomstdenis@gmail.com writes:
> The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth
> more than an Opteron 275 (even more for the 285s) in the same socket.
What about the number of AES/sec?
tomstdenis@gmail.com wrote:

: Jan Panteltje wrote:
: > http://www.dailytech.com/article.aspx?newsid=1920

<snip>

: 8x200Mhz only provides 400MB/sec traffic to the CPU so really this is
: useful for tasks which either totally reside on the FPGA side of the
: board or have really high latency (e.g. PK work).

Sitting on the HT bus like that offers residence about as close as you can 
get to a mainstream CPU.  Given the new HT3 stuff - faster, and links 
possible over 1 meter, i.e. directly joining blades - I really like this 
approach, especially given the memory architecture that goes along with 
HT/Opterons.  It's bringing mainstream CPUs and FPGAs back into the 
point-to-point, multiple-interconnect world of the TigerSHARCs and the old 
TI C40s.

It feels a bit like a resurgence of the old British Transputer, except with 
gate arrays mixing with CPUs on an equal footing in terms of connectivity.

cds
Paul Rubin wrote:
> tomstdenis@gmail.com writes:
> > The FPGA would have to beat ~3500 RSA-1024/sec before it would be worth
> > more than an Opteron 275 (even more for the 285s) in the same socket.
>
> What about the number of AES/sec?
If it were triggered independently it'd be worth it. A 2.6GHz processor [less than half the cost of this FPGA] can do up to 10,156,250 AES-128-ECB/sec with plain C code. That's roughly 160MiB/sec of throughput.

Now, if you could have this thing trigger automatically, for instance by having an APIC that responds to interrupts from a network controller, that would be a boost.

The typical AES core takes ~14 cycles to encrypt, but cores in FPGAs normally run at a couple hundred MHz at most [usually topping out between 100 and 200MHz]. 200MHz is 13 times slower than 2.6GHz, so those 14 cycles are equivalent to 182 cycles at 2.6GHz. This is less than the 256 cycles that the Opteron takes, but only marginally so.

For the cost of it you'd be better served by dropping another Opteron in. A single 285 [two cores at 2.6GHz] could top out at 20.3M AES/sec, which is way more than the typical FPGA can hope to achieve.

Where this would fly, I think, is on PDU work as I described, tying directly to the network controller. You really need higher latency work. It should also be trivial to get ECC [especially binary field] PK much faster and lower latency on an FPGA than on the typical Opteron.

Tom
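
For anyone who wants to check the arithmetic in this post, here is a minimal sketch. Every input is a figure quoted in the post (2.6GHz, 256 cycles per block in software, 200MHz, ~14 cycles per block in the FPGA core); the variable names are only illustrative and nothing here is a measurement.

```python
# Back-of-envelope check of the AES figures quoted in the post.
# Inputs are the poster's numbers, not benchmark results.

CPU_HZ = 2.6e9               # Opteron clock
CPU_CYCLES_PER_BLOCK = 256   # software AES-128 cost per 16-byte block
FPGA_HZ = 200e6              # optimistic FPGA clock
FPGA_CYCLES_PER_BLOCK = 14   # typical iterative AES core
BLOCK_BYTES = 16             # AES block size (128 bits)

# Software throughput on one core.
cpu_blocks_per_s = CPU_HZ / CPU_CYCLES_PER_BLOCK
cpu_mb_per_s = cpu_blocks_per_s * BLOCK_BYTES / 1e6      # decimal megabytes
cpu_mib_per_s = cpu_blocks_per_s * BLOCK_BYTES / 2**20   # binary mebibytes

# One FPGA cycle lasts as long as this many CPU cycles.
clock_ratio = CPU_HZ / FPGA_HZ
fpga_block_in_cpu_cycles = FPGA_CYCLES_PER_BLOCK * clock_ratio

# Throughput of a single non-pipelined FPGA core at the optimistic clock.
fpga_blocks_per_s = FPGA_HZ / FPGA_CYCLES_PER_BLOCK

print(f"CPU core : {cpu_blocks_per_s:,.0f} blocks/s "
      f"({cpu_mb_per_s:.1f} MB/s, {cpu_mib_per_s:.1f} MiB/s)")
print(f"FPGA core: {fpga_blocks_per_s:,.0f} blocks/s, "
      f"one block costs {fpga_block_in_cpu_cycles:.0f} CPU-cycle equivalents")
```

The per-block comparison (182 versus 256 cycle equivalents) is what makes the single-core advantage "marginal"; doubling the CPU figure for a dual-core part is what tips it the other way.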
c d saunter wrote:
> tomstdenis@gmail.com wrote:
>
> : Jan Panteltje wrote:
> : > http://www.dailytech.com/article.aspx?newsid=1920
>
> <snip>
>
> : 8x200Mhz only provides 400MB/sec traffic to the CPU so really this is
> : useful for tasks which either totally reside on the FPGA side of the
> : board or have really high latency (e.g. PK work).
>
> Sitting on the HT bus like that offers residence about as close as you can
> get to a mainstream CPU. Given the new HT3 stuff - faster and links
> possible over 1 meter - i.e. directly joining blades - I really like this
> aproach. Especially given the memory architecture that goes along with
> HT/Opterons. It's bringing mainstream CPUs and FPGAs back into the point
> to point multiple interconnect world of the TigerSHARCs and the old TI
> C40s.
HT links are not solely designed for speed. Latency is the key. 16 lanes of PCIe can compete just fine with a 16x16 1GHz HT link in terms of bandwidth.

Oddly enough, the best tasks for this are things which don't return back to back [e.g. raytracing a scene].

What this does open the door for, though, is mixed-architecture systems. E.g. synthesize a MIPS core in the FPGA and map the DDR controller on to it.

Then you have x86 and MIPS in the same system.

That'd be cool.

Tom
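
The bandwidth comparison can be checked with a short calculation. This is a sketch under assumptions that are not stated in the post: first-generation PCIe (2.5 GT/s per lane with 8b/10b encoding) and double-data-rate HyperTransport signaling (two transfers per clock). The function names are made up for illustration, and all figures are per direction.

```python
def pcie_gen1_bytes_per_s(lanes):
    """Payload bandwidth of a PCIe 1.x link, one direction.

    Assumes 2.5 GT/s per lane and 8b/10b encoding, i.e. 8 payload
    bits for every 10 bits on the wire.
    """
    wire_bits_per_s = lanes * 2.5e9
    payload_bits_per_s = wire_bits_per_s * 8 / 10
    return payload_bits_per_s / 8


def ht_bytes_per_s(width_bits, clock_hz):
    """Raw bandwidth of a HyperTransport link, one direction.

    Assumes double-data-rate signaling: two transfers per clock.
    """
    return width_bits * clock_hz * 2 / 8


pcie_x16 = pcie_gen1_bytes_per_s(16)   # 16 lanes of PCIe
ht_16bit_1ghz = ht_bytes_per_s(16, 1e9)  # 16-bit HT link at 1 GHz
ht_8bit_200mhz = ht_bytes_per_s(8, 200e6)  # the 8x200MHz link from earlier in the thread

print(f"PCIe x16 (gen 1): {pcie_x16 / 1e9:.1f} GB/s per direction")
print(f"HT 16-bit @ 1 GHz: {ht_16bit_1ghz / 1e9:.1f} GB/s per direction")
print(f"HT 8-bit @ 200 MHz: {ht_8bit_200mhz / 1e6:.0f} MB/s per direction")
```

Under these assumptions the two wide links come out level on raw bandwidth, which is consistent with the point that latency, not bandwidth, is what distinguishes HT here.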
tomstdenis@gmail.com wrote:

: HT links are not solely designed for speed.  Latency is the key.  16
: lanes of PCIe can compete just fine with a 16x16 1Ghz HT link in terms
: of bandwidth.

: Oddly enough the best tasks for this are things which don't return back
: to back [e.g. raytrace a scene].

I wouldn't call that odd - a modern CPU hiding behind caches with long 
pipelines is always going to struggle with low latency 
back/forwards/back/forwards shared tasks with an FPGA/Clearspeed/xxx.  
Certainly interesting things happen with FPGA silicon and CPU 
silicon coupled in an SoC or on an FPGA, but the clock rates are far below 
those of a dedicated CPU.  

On the serial / parallel issue I have a leaning towards parallel for 
simplicity when it comes to the FPGA code and latency, although serial has 
benefits for physical complexity and routing.  Also it feels like they 
leapfrog each other every few months in terms of bandwidth!  The world is 
squeezing itself down a thin pipe these days though...

: What this does open the door for though is for mixed architecture
: systems.  E.g. synthesize a MIPS core in the FPGA and map the DDR
: controller on to it.

: Then you have x86 and MIPS in the same system.  

: That'd be cool.

An awful lot of cool things are on their way...
tomstdenis@gmail.com writes:
> The typical AES core takes ~14 cycles to encrypt but in FPGAs normally
> run at most at a couple hundred MHz at most [usually topping out
> between 100 and 200Mhz at most]. 200Mhz is 13 times less than 2.6Ghz
> which is equivalent to 182 cycles at 2.6Ghz. This is less than the 256
> cycles that the Opteron takes but only marginally so.
I'd think if you're going to use such an expensive and exotic approach at all, you'd pipeline it to get one AES operation per cycle, maybe even more than one if you're doing something like EAX mode, or CTR mode on a large block in parallel.
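
A rough sketch of what "one AES operation per cycle" would buy, and where the link would cap it. The 200MHz clock and the 400MB/sec link figure are the ones quoted earlier in the thread; treating the link figure as per-direction and assuming each 16-byte block crosses the link once each way are my assumptions, and the names are illustrative only.

```python
# Throughput of a hypothetical fully pipelined AES core versus the link feeding it.

FPGA_HZ = 200e6          # optimistic FPGA clock quoted in the thread
BLOCK_BYTES = 16         # AES block size (128 bits)
LINK_BYTES_PER_S = 400e6 # link figure quoted in the thread (assumed per direction)

# A fully pipelined core retires one block every clock cycle.
core_blocks_per_s = FPGA_HZ
core_gbit_per_s = core_blocks_per_s * BLOCK_BYTES * 8 / 1e9

# The link can only deliver this many blocks per second.
link_blocks_per_s = LINK_BYTES_PER_S / BLOCK_BYTES

# End-to-end rate is set by whichever is slower.
sustained_blocks_per_s = min(core_blocks_per_s, link_blocks_per_s)

print(f"pipelined core : {core_blocks_per_s:,.0f} blocks/s ({core_gbit_per_s:.1f} Gb/s)")
print(f"link limit     : {link_blocks_per_s:,.0f} blocks/s")
print(f"sustained      : {sustained_blocks_per_s:,.0f} blocks/s")
```

With these inputs the core itself is not the bottleneck; whether that matters depends on whether the data has to come from the CPU side at all.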
Paul Rubin wrote:
> tomstdenis@gmail.com writes:
> > The typical AES core takes ~14 cycles to encrypt but in FPGAs normally
> > run at most at a couple hundred MHz at most [usually topping out
> > between 100 and 200Mhz at most]. 200Mhz is 13 times less than 2.6Ghz
> > which is equivalent to 182 cycles at 2.6Ghz. This is less than the 256
> > cycles that the Opteron takes but only marginally so.
>
> I'd think if you're going to use such an expensive and exotic approach
> at all, you'd pipeline it to get one AES operation per cycle, maybe
> even more than one if you're doing something like EAX mode, or CTR
> mode ona large block in parallel.
Even with pipelining you're still on a fairly limited bus. At best you top out at whatever the bus between the two actually is. Keep in mind this is an FPGA and not an ASIC, so chances are it won't clock that high anyway. My 200MHz figure is a really, really optimistic one. From what I recall from my past job, you'd be lucky to get something complicated like a PDU clocking higher than PCI frequency [33MHz]. So while you could get an AES core to ~100MHz, it would only be doing CTR mode at most.

Block ciphers are not where this will shine, especially when the other processor is an Opteron.

The trick to making good use of something like an FPGA isn't serial speed. Even if you designed a custom RISC ALU on the FPGA it'd probably clock at around 50MHz. Even with the best ISA you could craft for it, the Opteron could EMULATE the thing faster than you could run it. Where the FPGA will shine is in tasks with a LOT of parallel computation. Think 16 FPU pipelines or a single-cycle GF(2) multiplier, etc, etc, etc.

Other tasks where this would shine would be custom DSP filters, e.g. offloading MPEG work. A FIR or IIR filter of significant delay [e.g. accuracy] could be constructed in a pipeline to get 1 sample/cycle at decent clock rates.

Tom
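
To illustrate the 1 sample/cycle point, here is a small software model of a transposed-form FIR filter, the structure that maps naturally onto a hardware pipeline. This is a sketch, not anyone's actual design: the function name is hypothetical, each loop iteration stands in for one clock cycle, and the multiplies that run in sequence here would all happen in parallel in an FPGA.

```python
def fir_transposed(coeffs, samples):
    """Model a transposed-form FIR: one sample in, one sample out per 'cycle'."""
    n = len(coeffs)
    regs = [0] * (n - 1)  # pipeline registers holding partial sums
    out = []
    for x in samples:  # one iteration models one clock cycle
        # In hardware every tap multiplies the same input at the same time.
        products = [c * x for c in coeffs]
        # The output is the first product plus the oldest partial sum.
        y = products[0] + (regs[0] if regs else 0)
        # Each register takes its tap's product plus the next register's
        # previous value; ascending order keeps regs[i + 1] un-updated.
        for i in range(n - 2):
            regs[i] = products[i + 1] + regs[i + 1]
        if regs:
            regs[-1] = products[-1]
        out.append(y)
    return out


# An impulse pushed through the filter reads back the coefficients.
print(fir_transposed([1, 2, 3], [1, 0, 0, 0]))
```

The number of taps only changes how many multipliers and registers are used, not the sample rate, which is why a long filter costs an FPGA area rather than time.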
tomstdenis,

http://www.xilinx.com/bvdocs/ipcenter/data_sheet/Helion_Standard_AES_AllianceCORE_data_sheet.pdf

Shows some of the claimed clock rates for their AES encrypt/decrypt IP 
core.  257 MHz (V2 Pro) to 252 MHz (V4).  Throughput in b/s is ~ 2 to 3 
X the clock rate (per this datasheet).  Other cores run just shy of 200 MHz.

Other data from this same vendor makes claims of up to 20 Gb/s for 
throughput of their 'fast' FPGA based AES encryptors and decryptors.

At one time we made a 10 Gb/s decryptor to prove that distributing 
full-resolution, real-time theater movies could be done with one FPGA in 
the 'projector.'  This prevents piracy by decrypting the movie at the 
projector itself (at no time is the full digital information available 
for copying).

This was back in the Virtex II days, so the 20 Gb/s claim is perfectly 
reasonable for V4 today (IMO).

There are a number of other IP vendors with encryptors and decryptors 
for our FPGAs.

http://xgoogle.xilinx.com/search?output=xml_no_dtd&ie=UTF-8&oe=UTF-8&client=iplocator&proxystylesheet=iplocator&site=IPLocator&filter=0&_ResultsView=Standard&num=25&q=aes&as_q=&getfields=*&newSearch=http://www.xilinx.com/xlnx/xebiz/search/ipsrch.jsp&formAction=http://www.xilinx.com/cgi-bin/search/iplocator.pl&IPCategory=&IPSubcategory=&sGlobalNavPick=&sSecondaryNavPick=&requiredfields=IPProducts&partialfields=
or
http://tinyurl.com/hajhj

Austin

tomstdenis@gmail.com wrote:

> Paul Rubin wrote:
>
>>tomstdenis@gmail.com writes:
>>
>>>The typical AES core takes ~14 cycles to encrypt but in FPGAs normally
>>>run at most at a couple hundred MHz at most [usually topping out
>>>between 100 and 200Mhz at most]. 200Mhz is 13 times less than 2.6Ghz
>>>which is equivalent to 182 cycles at 2.6Ghz. This is less than the 256
>>>cycles that the Opteron takes but only marginally so.
>>
>>I'd think if you're going to use such an expensive and exotic approach
>>at all, you'd pipeline it to get one AES operation per cycle, maybe
>>even more than one if you're doing something like EAX mode, or CTR
>>mode ona large block in parallel.
>
>
> Even with pipelining you're still on a fairly limited bus. At best you
> top out at whatever the bus between the two actually is. Keep in mind
> this is an FPGA and not ASIC. So chances are it won't clock that high
> anyways. My 200Mhz quote is just a really really optimistic quote.
> From what I recall from my past job you'd be lucky to get something
> complicated like a PDU clocking higher than PCI freq [33Mhz]. So while
> you could get an AES core ~100Mhz it would only be doing CTR mode at
> most.
>
> Block ciphers are not where this will shine. Specially when the other
> processor is an Opteron.
>
> The trick to making good use of something like an FPGA isn't serial
> speed. Even if you designed a custom RISC ALU on the FPGA it'd clock
> probably around 50Mhz. Even with the best ISA you can craft for it the
> Opteron could EMULATE the thing faster than you could run it. Where
> the FPGA will shine is for tasks with a LOT of parallel computation.
> Think like 16 FPU pipelines or a single cycle GF(2) multiplier, etc,
> etc, etc.
>
> Other tasks where this would shine would be custom DSP filters, e.g.
> offload MPEG work. A FIR or IIR filter of significant delay [e.g.
> accuracy] could be constructed in a pipeline to get 1 sample/cycle at
> decent clock rates.
>
> Tom
>