comp.arch.fpga | RISC-V Support in FPGA| page 7

On Thu, 04 May 2017 10:56:56 -0700, Kevin Neilson wrote:

>> I use Vivado to do GF multiplications that wide using purely
>> behavioural VHDL.  BTW, A straightforward behavioural implementation
>> will *not* give good results with a wide bus.
>> I believe the problem is that most tools (in particular Vivado) do a
>> poor job of synthesising xor trees with a massive fanin (e.g. >> 100
>> bits). The optimisers have a poor complexity (I guess at least O(N^2),
>> but it might be exponential) wrt the size of the function.
>> 
>> You can use all sorts of mathematical tricks to make it work without
>> need to go "low level".
>> For example, to deal with large fanin, partition your 512 bit input
>> into N slices of 512/N bits each.  Use N multipliers, one for each
>> slice, put a keep (or equivalent) attribute on the outputs, then xor
>> the outputs together.  This gives the same result, uses about the same
>> number of LUTs,
>> but gives the optimiser in the tool a chance to do a good job.
>> 
>> 
>> I use the same GF multiplier code in ISE and Quartus, too (but not on
>> buses that wide).
>> 
>> The entire flow is in VHDL and works in any LRM-compliant tool.  It's
>> parameterised, too, so I don't need to rewrite for a different bus
>> width.
>> 
>> 
>> I've been using similar approaches in VHDL since the turn of the
>> century and have never been burned.
>> 
>> YMMV.
>> 
>> Regards,
>> Allan
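
The slicing trick quoted above relies on the multiplier being linear over GF(2): XORing the results for each input slice gives the same answer as processing the whole bus at once. Below is a minimal Python sketch of that identity, not the poster's VHDL. The random 512x512 matrix, the slice count of 8, and the function names are all assumptions made for illustration.

```python
import random

def apply_linear(cols, x):
    """Apply a GF(2)-linear map: XOR together cols[i] for every set bit i of x."""
    y = 0
    i = 0
    while x:
        if x & 1:
            y ^= cols[i]
        x >>= 1
        i += 1
    return y

def apply_sliced(cols, x, width, n_slices):
    """Same map, computed as n_slices partial results XORed together.

    Assumes width is an exact multiple of n_slices.
    """
    slice_bits = width // n_slices
    slice_mask = (1 << slice_bits) - 1
    y = 0
    for s in range(n_slices):
        # Keep only this slice's input bits, in their original positions.
        part = x & (slice_mask << (s * slice_bits))
        # In hardware, this partial result is where the keep attribute would go.
        y ^= apply_linear(cols, part)
    return y

rng = random.Random(0)           # fixed seed so the check is repeatable
WIDTH, N_SLICES = 512, 8         # illustrative values only
cols = [rng.getrandbits(WIDTH) for _ in range(WIDTH)]
x = rng.getrandbits(WIDTH)
assert apply_sliced(cols, x, WIDTH, N_SLICES) == apply_linear(cols, x)
```

The sketch only shows that the partitioned form is functionally equivalent; whether it helps synthesis depends on the tool honouring the boundaries between slices.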
> 
> I used to do big GF matrix multiplications in which you could set
> parameters for the field size and field generator poly, etc.  Vivado
> just gets bogged down.  Now I just expand that into a GF(2) matrix in
> Matlab and dump it to a parameter and all Vivado has to know how to do
> is XOR.
> 
> I also have problems with the wide XORs.  Multiplication by a big GF(2)
> matrix means a wide XOR for each column.  Vivado tries to share LUTs
> with common subexpressions across the columns.  Too much sharing.  That
> sounds like a good thing, but it's not smart enough to know how much
> it's impacting timing.  You save LUTs, but you end up with a routing
> mess and too many levels of logic and you don't come close to meeting
> timing at all.  So then I have to make a generate loop and put
> subsections of the matrix in separate modules and use directives to
> prevent optimizing across boundaries.  (KEEPs don't work.)  It's all a
> pain.  But then I end up with something a little bigger but which meets
> timing.
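
The expansion described above (done in Matlab by the poster) can be sketched in Python: multiplication by a constant in GF(2^m) is linear, so it reduces to a bit matrix in which each output bit is the XOR of a fixed set of input bits. The field GF(2^8), the polynomial 0x11B, the constant 0x57, and all function names are assumptions chosen for illustration, not values from the thread.

```python
def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m); poly includes the x^m term."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:          # degree reached m: reduce by the field polynomial
            a ^= poly
    return r

def gf2_matrix(c, poly, m):
    """Return out_masks, where out_masks[j] selects the input bits
    that are XORed to form output bit j of the product c * x."""
    out_masks = [0] * m
    for i in range(m):
        image = gf_mul(c, 1 << i, poly, m)   # contribution of input bit i
        for j in range(m):
            if (image >> j) & 1:
                out_masks[j] |= 1 << i
    return out_masks

def xor_only_mul(out_masks, x):
    """Evaluate the product using nothing but AND-masking and parity (XOR)."""
    y = 0
    for j, mask in enumerate(out_masks):
        parity = bin(x & mask).count("1") & 1
        y |= parity << j
    return y

M, POLY, CONST = 8, 0x11B, 0x57   # illustrative field and constant
masks = gf2_matrix(CONST, POLY, M)
assert all(xor_only_mul(masks, x) == gf_mul(CONST, x, POLY, M)
           for x in range(1 << M))
```

Once the masks are dumped into a constant or parameter, the synthesis tool never sees the field arithmetic, only one XOR per output bit.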


I thought about my historical code some more, and I realised that I did 
have some examples of behavioural GF multipliers that didn't work as well 
as the same function expressed as a bunch of wide xors.

The particular example I'm thinking of had a 128 in, 128 xor tree that 
really shouldn't be any harder to synth than a CRC.  It's a linear 
mapping stage in an SP block cipher (like AES, but not AES (which has a 
relatively weak mixing function)).

Vivado gave (IIRC) 11 or 12 levels of logic rather than the expected 3 
levels of logic.  Hmmm.  The revised source code (expressed as a bunch of 
xors) produced 4 levels of logic, and routed to speed.

BTW, I used my VHDL testbench for the original function to write out the 
VHDL for the xor tree.
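
As a rough illustration of writing out the xor tree as source text, the Python sketch below emits one VHDL concurrent assignment per output bit. The poster used a VHDL testbench for this; Python is substituted here, and the signal names `d` and `q`, the mask format, and the function name are hypothetical.

```python
def vhdl_xor_assignments(out_masks, in_name="d", out_name="q"):
    """Return VHDL assignment lines, one per output bit.

    out_masks[j] has bit i set when input bit i feeds output bit j.
    """
    lines = []
    for j, mask in enumerate(out_masks):
        terms = [f"{in_name}({i})"
                 for i in range(mask.bit_length()) if (mask >> i) & 1]
        # An output with no contributing inputs is a constant zero.
        rhs = " xor ".join(terms) if terms else "'0'"
        lines.append(f"{out_name}({j}) <= {rhs};")
    return lines

example_masks = [0b1011, 0b0110, 0b0000]   # tiny made-up matrix
for line in vhdl_xor_assignments(example_masks):
    print(line)
```

For a 128-bit mapping the same function would produce 128 such lines, which can be pasted into an architecture body or written to a file.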

 
> I really wish there were a way to use the carry chains for wide XORs.

I think that carry chains (and similar structures) became less important 
for wide functions once six input LUTs became commonplace.

The Xilinx DSP48E2 has a wide xor mode that I think can give a 96 input 
xor in a single DSP48E2 slice.  I've never tried it.

Regards,
Allan

> The particular example I'm thinking of had a 128 in, 128 xor tree that 
> really shouldn't be any harder to synth than a CRC.  It's a linear 
> mapping stage in an SP block cipher (like AES, but not AES (which has a 
> relatively weak mixing function)).
> 
> Vivado gave (IIRC) 11 or 12 levels of logic rather than the expected 3 
> levels of logic.  Hmmm.  The revised source code (expressed as a bunch of 
> xors) produced 4 levels of logic, and routed to speed.
> 
Same here.  I have constant multiplier matrices, each with a column
weight of about 160, so I end up with a 160-input XOR for each column.
Ideally that would be log6(160) = 2.8, i.e. 3 levels of logic.  I have
to use very low-level code, and even then Vivado shares subexpressions
too much and I end up with 6 levels unless I isolate column groups in
different modules.  If I isolate each column in its own module I can get
the 3 levels.  Isolating column groups also means they are placed as a
group, which reduces wirelengths.
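
The log6 estimate can be checked with a few lines of Python. This is a best-case sketch that assumes a balanced tree of 6-input LUTs with no sharing between outputs; the function name is made up for illustration.

```python
def lut_levels(fanin, lut_inputs=6):
    """Minimum LUT levels needed to XOR `fanin` signals into one."""
    levels = 0
    signals = fanin
    while signals > 1:
        # Each level reduces groups of up to lut_inputs signals to one.
        signals = -(-signals // lut_inputs)   # ceiling division
        levels += 1
    return levels

print(lut_levels(160))   # 160-input XOR per column
print(lut_levels(128))   # 128-input xor tree
```

Both cases come out at 3 levels, matching the figures quoted in the thread; anything beyond that comes from the tool's subexpression sharing rather than from the function itself.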

> The Xilinx DSP48E2 has a wide xor mode that I think can give a 96 input 
> xor in a single DSP48E2 slice.  I've never tried it.

Yeah, I looked into this at one point but decided against it for a few reasons.  I thought a nice feature would be to be able to turn off the carries in the DSP48 and then you could use them for GF multipliers.  I have used DSP48s as GF(2) accumulators and I've used them as transposers to extract column data from rows stored in RAMs.
