Looking for fast AES cores with low latency

Started by Allan Herriman September 18, 2007
Hi,

Since the initial rash of AES / Rijndael cores a few years ago, I
haven't seen much research at the high speed end.

Does anyone know how low the latency is for a recent high-end core in
a current FPGA family?
A quick web search reveals plenty of heavily pipelined implementations
with poor latency, but none that are really quick in terms of latency.

Thanks,
Allan
On Sep 18, 5:35 pm, Allan Herriman <allanherri...@hotmail.com> wrote:
> Hi,
>
> Since the initial rash of AES / Rijndael cores a few years ago, I
> haven't seen much research at the high speed end.
>
> Does anyone know how low the latency is for a recent high-end core in
> a current FPGA family?
> A quick web search reveals plenty of heavily pipelined implementations
> with poor latency, but none that are really quick in terms of latency.
>
> Thanks,
> Allan

What kind of frequency / latency are you looking for?

Most cores can pretty easily be "de-pipelined" to reduce
latency, at the cost of a lower clock frequency ...

Sylvain

On Sep 19, 12:34 am, "Sylvain Munaut <Some...@SomeDomain.com>"
<246...@gmail.com> wrote:
> On Sep 18, 5:35 pm, Allan Herriman <allanherri...@hotmail.com> wrote:
>
> > Hi,
> >
> > Since the initial rash of AES / Rijndael cores a few years ago, I
> > haven't seen much research at the high speed end.
> >
> > Does anyone know how low the latency is for a recent high-end core in
> > a current FPGA family?
> > A quick web search reveals plenty of heavily pipelined implementations
> > with poor latency, but none that are really quick in terms of latency.
> >
> > Thanks,
> > Allan
>
> What kind of frequency / latency are you looking for ?
>
> Most core can pretty easily be "de-pipelinined" to diminish
> latency but degrade frequency ...
>
> Sylvain

I implemented the AES algorithm several months ago and tried to find the
highest achievable clock frequency. However, if the GF (Galois field)
calculation is used, the FPGA resource cost may be lower.

Hi Allan,
the minimum latency of an AES core (at a reasonable clock frequency) is
limited by the number of rounds (iterations) needed. That number depends
mainly on the key length.
  128 Bit Key : Round Number 10
  192 Bit Key : Round Number 12
  256 Bit Key : Round Number 14

There is an initial Round 0, but the latency of that can be eliminated 
by design. So the latency for a simple AES-128 Core will always be at 
least 10 clock cycles.
If you have enough chip area to unroll the rounds, only the initial 
latency (for the first conversion) needs that number of clock cycles. 
All following blocks are calculated on each following clock cycle 
because of the data pipelining in the unrolled architecture.
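
The cycle counts described above can be put into a small sketch. It assumes one round per clock cycle and that round 0 costs no extra cycle, as stated; the function names are made up for illustration.

```python
# Illustrative model only: one AES round per clock cycle, round 0 folded in.
ROUNDS = {128: 10, 192: 12, 256: 14}  # key length in bits -> number of rounds

def cycles_iterative(n_blocks: int, key_bits: int) -> int:
    """Iterative core: each block occupies the single round unit for all rounds."""
    return n_blocks * ROUNDS[key_bits]

def cycles_unrolled(n_blocks: int, key_bits: int) -> int:
    """Fully unrolled, pipelined core: the first block takes `rounds` cycles,
    every further block completes one cycle later."""
    if n_blocks == 0:
        return 0
    return ROUNDS[key_bits] + (n_blocks - 1)

for bits in (128, 192, 256):
    print(bits, cycles_iterative(100, bits), cycles_unrolled(100, bits))
```

The pipelined figure only applies when consecutive blocks do not depend on each other.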

You may take a look at this paper:
http://www.i3m.hs-bremen.de/internet/download/elis/aes_i3m_overview.pdf

Please keep in mind that the clock frequencies given in this paper are
examples only for the old Virtex-E FPGAs. Current FPGAs perform much better.

Best regards
   Eilert


Allan Herriman wrote:
> Hi,
>
> Since the initial rash of AES / Rijndael cores a few years ago, I
> haven't seen much research at the high speed end.
>
> Does anyone know how low the latency is for a recent high-end core in
> a current FPGA family?
> A quick web search reveals plenty of heavily pipelined implementations
> with poor latency, but none that are really quick in terms of latency.
>
> Thanks,
> Allan

On Wed, 19 Sep 2007 01:35:11 +1000, Allan Herriman
<allanherriman@hotmail.com> wrote:

>Hi,
>
>Since the initial rash of AES / Rijndael cores a few years ago, I
>haven't seen much research at the high speed end.
>
>Does anyone know how low the latency is for a recent high-end core in
>a current FPGA family?
>A quick web search reveals plenty of heavily pipelined implementations
>with poor latency, but none that are really quick in terms of latency.

I did some tests today...

I unrolled our (conventional) 14 round implementation into one big
mess of combinatorial logic with FFs at either end and ran it through
the tools:

V5, using 8.2 software: Par spat the dummy after six hours, claiming
it was too hard.
I added a bunch of area constraints. It's still running.

StratixII gave sixty-something ns (=14MHz clock) in the slowest speed
grade, but that was without timing constraints. A version with a 30ns
clock constraint is still running.

14MHz results in feedback modes giving about 1.8Gb/s encryption
throughput. I guess that's enough for GbEthernet, but we already know
GbE can be done with a conventional pipelined AES implementation.

I'll post tomorrow on the results.

Regards,
Allan
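
As a quick check on the throughput figure: with the whole cipher in a single clock cycle, a feedback mode finishes one 128-bit block per clock, so throughput is the block size times the clock frequency. A minimal sketch (the function name is hypothetical, and the 30 ns line only shows what that constraint would give if it were met):

```python
BLOCK_BITS = 128  # AES block size in bits

def feedback_throughput_gbps(clock_hz: float) -> float:
    """Throughput of a fully unrolled, single-cycle core: one block per clock."""
    return BLOCK_BITS * clock_hz / 1e9

print(feedback_throughput_gbps(14e6))       # 14 MHz clock, about 1.8 Gb/s
print(feedback_throughput_gbps(1 / 30e-9))  # hypothetical: 30 ns constraint met
```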
On Sep 19, 9:59 pm, Allan Herriman <allanherri...@hotmail.com> wrote:
> On Wed, 19 Sep 2007 01:35:11 +1000, Allan Herriman
> <allanherri...@hotmail.com> wrote:
>
> >Hi,
> >
> >Since the initial rash of AES / Rijndael cores a few years ago, I
> >haven't seen much research at the high speed end.
> >
> >Does anyone know how low the latency is for a recent high-end core in
> >a current FPGA family?
> >A quick web search reveals plenty of heavily pipelined implementations
> >with poor latency, but none that are really quick in terms of latency.
>
> I did some tests today...
>
> I unrolled our (conventional) 14 round implementation into one big
> mess of combinatorial logic with FFs at either end and ran it through
> the tools:
>
> V5, using 8.2 software: Par spat the dummy after six hours, claiming
> it was too hard.
> I added a bunch of area constraints. It's still running.
>
> StratixII gave sixty-something ns (=14MHz clock) in the slowest speed
> grade, but that was without timing constraints. A version with a 30ns
> clock constraint is still running.
>
> 14MHz results in feedback modes giving about 1.8Gb/s encryption
> throughput. I guess that's enough for GbEthernet, but we already know
> GbE can be done with a conventional pipelined AES implementation.
>
> I'll post tomorrow on the results.
>
> Regards,
> Allan

Allan, do you want to encrypt the data from the GbEthernet interface? Is
the GbEthernet interface on the same board as the FPGA? If not, even if
you find the maximum frequency for the AES algorithm, you should also
consider the delay introduced by the OS.

Hi Allan.
You hit the point.
That is exactly why there are no such designs in real life.

You always can trade comb. delay vs. latency, but you have to look for 
the solution that suits your needs the best.

Now look at your example result with 14 MHz. Theoretical data throughput
is the same as in an iterative design running at 14*14 MHz (196 MHz), which
I think is a clock frequency that can be achieved by modern FPGAs.
But: With the iterative design you save about 90% area and don't have to 
worry so much about moving the data from one clock domain to another. In 
the best possible case you can run the AES and all other circuits at the 
same (high) clock frequency.

The same thing is also valid for the S-Boxes in an AES design. They are
often made with block RAMs, out of convenience. But there are solutions
published that use very small combinatorial circuits. These solutions
have the disadvantage of large delays (20 to 30 ns), thus reducing the
clock rate of the whole AES design. Now what do you do in such a case?
Find out how to pipeline that solution. If you can increase the clock
frequency into a range where it fits into the overall design, you can
save all the valuable and rare BRAMs. It may cost you some clock cycles
of additional latency, but it depends on the application whether that is
a problem or not.

So back to your original posting's title:
Complex cores with low latency have high combinatorial delays.
The problems that arise from such solutions are in most cases larger
than the benefits, if there are any at all.

Have a nice synthesis
   Eilert

Allan Herriman wrote:
...snip...
> I did some tests today...
>
> I unrolled our (conventional) 14 round implementation into one big
> mess of combinatorial logic with FFs at either end and ran it through
> the tools:
>
...snip...
> 14MHz results in feedback modes giving about 1.8Gb/s encryption
> throughput. I guess that's enough for GbEthernet, but we already know
> GbE can be done with a conventional pipelined AES implementation.

On Thu, 20 Sep 2007 08:12:03 +0200, backhus <nix@nirgends.xyz> wrote:

>Hi Allan.
>You hit the point.
>That is exactly why there are no such designs in real life.

Ah yes. I have come to the same conclusion.

A few years ago, I designed what I believe was the first 10Gb/s AES256
encryptor on the market. It used CTR mode, because that was the only
mode suitable to run at those rates in the FPGAs that were available
then.
I recall thinking that feedback modes (e.g. CFB) would be possible at
10Gb/s in FPGAs in a few years' time. I'll try this test again when the
next generation of FPGAs comes out.

For the crypto naive: The throughput of a block cypher with feedback
is determined by the delay through the block cypher calculation.
Pipelining is good for getting impressive clock numbers, but it
actually hurts throughput.

http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation

Thanks to all who responded,
Allan.
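
The remark about pipelining can be illustrated with a rough timing model. The sketch below assumes a total combinational delay through the cipher plus a fixed register overhead per pipeline stage; the 70 ns and 1 ns figures and all names are invented for illustration and are not measurements from this thread.

```python
BLOCK_BITS = 128  # AES block size in bits

def clock_period_ns(comb_delay_ns: float, stages: int, reg_overhead_ns: float) -> float:
    """Clock period when the logic is split evenly into `stages` pipeline stages."""
    return comb_delay_ns / stages + reg_overhead_ns

def ctr_throughput_gbps(comb_delay_ns: float, stages: int, reg_overhead_ns: float) -> float:
    """No feedback (e.g. CTR): a new block can enter the pipeline every clock."""
    return BLOCK_BITS / clock_period_ns(comb_delay_ns, stages, reg_overhead_ns)

def feedback_throughput_gbps(comb_delay_ns: float, stages: int, reg_overhead_ns: float) -> float:
    """Feedback (e.g. CFB): the next block must wait for the whole pipeline."""
    latency_ns = stages * clock_period_ns(comb_delay_ns, stages, reg_overhead_ns)
    return BLOCK_BITS / latency_ns  # bits per ns equals Gb/s

# Assumed figures: 70 ns of logic, 1 ns of register overhead per stage.
for stages in (1, 14, 56):
    print(stages,
          round(ctr_throughput_gbps(70.0, stages, 1.0), 2),
          round(feedback_throughput_gbps(70.0, stages, 1.0), 2))
```

In this model every added stage raises the clock rate and the CTR throughput, but lengthens the total latency and so lowers the feedback-mode throughput.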
Allan Herriman wrote:

(snip)

> A few years ago, I designed what I believe was the first 10Gb/s AES256
> encryptor on the market. It used CTR mode, because that was the only
> mode suitable to run at those rates in the FPGAs that were available
> then.

(snip)

> For the crypto naive: The throughput of a block cypher with feedback
> is determined by the delay through the block cypher calculation.
> Pipelining is good for getting impressive clock numbers, but it
> actually hurts throughput.

You should be able to process multiple data streams, though.
(Similar to the multithreading processors popular a few
years ago.) I would expect that anyone needing such high
speed would have more than one document to encrypt or
decrypt.

> http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation
-- glen
On Fri, 21 Sep 2007 09:52:39 -0800, glen herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:

>Allan Herriman wrote:
>
>(snip)
>
>> A few years ago, I designed what I believe was the first 10Gb/s AES256
>> encryptor on the market. It used CTR mode, because that was the only
>> mode suitable to run at those rates in the FPGAs that were available
>> then.
>(snip)
>
>> For the crypto naive: The throughput of a block cypher with feedback
>> is determined by the delay through the block cypher calculation.
>> Pipelining is good for getting impressive clock numbers, but it
>> actually hurts throughput.
>
>You should be able to process multiple data streams, though.
>(Similar to the multithreading processors popular a few
>years ago.) I would expect that anyone needed such high
>speed would have more than one document to encrypt or
>decrypt.
>
>> http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation

I wish it could work that way. But the problem is that you don't use
10Gb/s encryptors to encrypt "documents", just a single continuous
stream (or context) at 10Gb/s. Well, at least that's the way our
customers use them.

On a brighter note, it is possible to interleave CFB, so that each
"engine" only has to process every 2nd 128-bit block (for a two-way
interleave). This is discussed in Schneier.

Regards,
Allan
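
A minimal sketch of the interleaving idea, assuming full-block (128-bit) CFB. The block function is a SHA-256-based stand-in, not AES, chosen only so that the example runs with the standard library; the function names and the use of one IV per engine are illustrative assumptions.

```python
import hashlib

BLOCK = 16  # bytes, i.e. 128 bits

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for the AES forward function (NOT AES, NOT secure)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def interleaved_cfb_encrypt(key, ivs, plaintext_blocks):
    """Block i is chained to ciphertext block i - len(ivs), so engine
    i % len(ivs) only has to handle every len(ivs)-th block."""
    ways = len(ivs)
    feedback = list(ivs)  # one feedback register per engine
    out = []
    for i, p in enumerate(plaintext_blocks):
        c = xor(p, toy_block_encrypt(key, feedback[i % ways]))
        feedback[i % ways] = c
        out.append(c)
    return out

def interleaved_cfb_decrypt(key, ivs, ciphertext_blocks):
    """Inverse of the above; CFB uses the forward function in both directions."""
    ways = len(ivs)
    feedback = list(ivs)
    out = []
    for i, c in enumerate(ciphertext_blocks):
        out.append(xor(c, toy_block_encrypt(key, feedback[i % ways])))
        feedback[i % ways] = c
    return out

key = b"k" * 16
ivs = [bytes([1]) * BLOCK, bytes([2]) * BLOCK]  # two-way interleave
blocks = [bytes([n]) * BLOCK for n in range(6)]
ct = interleaved_cfb_encrypt(key, ivs, blocks)
assert interleaved_cfb_decrypt(key, ivs, ct) == blocks
```

With a two-way interleave, each engine has two block times to finish its encryption before its result is needed for feedback.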