RISC-V Support in FPGA

Started by rickman April 28, 2017
On Thu, 04 May 2017 10:56:56 -0700, Kevin Neilson wrote:

>> I use Vivado to do GF multiplications that wide using purely
>> behavioural VHDL. BTW, A straightforward behavioural implementation
>> will *not* give good results with a wide bus.
>> I believe the problem is that most tools (in particular Vivado) do a
>> poor job of synthesising xor trees with a massive fanin (e.g. 100
>> bits). The optimisers have a poor complexity (I guess at least O(N^2),
>> but it might be exponential) wrt the size of the function.
>>
>> You can use all sorts of mathematical tricks to make it work without
>> need to go "low level".
>> For example, to deal with large fanin, partition your 512 bit input
>> into N slices of 512/N bits each. Use N multipliers, one for each
>> slice, put a keep (or equivalent) attribute on the outputs, then xor
>> the outputs together. This gives the same result, uses about the same
>> number of LUTs,
>> but gives the optimiser in the tool a chance to do a good job.
>>
>> I use the same GF multiplier code in ISE and Quartus, too (but not on
>> buses that wide).
>>
>> The entire flow is in VHDL and works in any LRM-compliant tool. It's
>> parameterised, too, so I don't need to rewrite for a different bus
>> width.
>>
>> I've been using similar approaches in VHDL since the turn of the
>> century and have never been burned.
>>
>> YMMV.
>>
>> Regards,
>> Allan
>
> I used to do big GF matrix multiplications in which you could set
> parameters for the field size and field generator poly, etc. Vivado
> just gets bogged down. Now I just expand that into a GF(2) matrix in
> Matlab and dump it to a parameter and all Vivado has to know how to do
> is XOR.
>
> I also have problems with the wide XORs. Multiplication by a big GF(2)
> matrix means a wide XOR for each column. Vivado tries to share LUTs
> with common subexpressions across the columns. Too much sharing. That
> sounds like a good thing, but it's not smart enough to know how much
> it's impacting timing. You save LUTs, but you end up with a routing
> mess and too many levels of logic and you don't come close to meeting
> timing at all. So then I have to make a generate loop and put
> subsections of the matrix in separate modules and use directives to
> prevent optimizing across boundaries. (KEEPs don't work.) It's all a
> pain. But then I end up with something a little bigger but which meets
> timing.
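As a rough illustration of the two ideas in the quoted text (expanding a constant GF multiplication into a GF(2) matrix, and slicing the input so that each slice gets its own smaller xor network), here is a minimal Python sketch. It is not either poster's code: the function names are made up for this example, and the field GF(2^8) with the AES polynomial 0x11B is only an assumption chosen to keep the check small. The slicing works because the map is linear over GF(2), so xoring the per-slice partial results gives the same answer as the full product.

```python
def gf_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m); poly is the field polynomial incl. the x^m term."""
    r = 0
    for i in range(m):                      # carry-less (polynomial) multiply
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * m - 2, m - 1, -1):   # reduce modulo poly, top bit first
        if (r >> i) & 1:
            r ^= poly << (i - m)
    return r


def const_mul_matrix(c, m, poly):
    """GF(2) matrix of x -> c*x, stored as columns: column j is c * x^j as a bitmask."""
    return [gf_mul(c, 1 << j, m, poly) for j in range(m)]


def apply_matrix(cols, x):
    """Matrix-vector product over GF(2): xor the columns selected by the set bits of x."""
    y = 0
    for j, col in enumerate(cols):
        if (x >> j) & 1:
            y ^= col
    return y


def apply_sliced(cols, x, n_slices):
    """Same product, computed as n_slices partial products that are xored at the end.

    Assumes len(cols) is divisible by n_slices.
    """
    width = len(cols) // n_slices
    y = 0
    for s in range(n_slices):
        mask = ((1 << width) - 1) << (s * width)
        y ^= apply_matrix(cols, x & mask)   # each slice only sees its own input bits
    return y


# Hypothetical constant and field, for illustration only.
cols = const_mul_matrix(0x57, 8, 0x11B)
assert apply_sliced(cols, 0x83, 4) == gf_mul(0x57, 0x83, 8, 0x11B)
```

In hardware terms each call to `apply_matrix` on a masked slice corresponds to one of the N smaller multipliers, and the final xor of the partial results is the stage that sits behind the keep attributes or module boundaries.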
I thought about my historical code some more, and I realised that I did have some examples of behavioural GF multipliers that didn't work as well as the same function expressed as a bunch of wide xors.

The particular example I'm thinking of had a 128 in, 128 xor tree that really shouldn't be any harder to synth than a CRC. It's a linear mapping stage in an SP block cipher (like AES, but not AES (which has a relatively weak mixing function)).

Vivado gave (IIRC) 11 or 12 levels of logic rather than the expected 3 levels of logic. Hmmm. The revised source code (expressed as a bunch of xors) produced 4 levels of logic, and routed to speed.

BTW, I used my VHDL testbench for the original function to write out the VHDL for the xor tree.
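For readers who want to try the "write out the xor tree" step, here is a small sketch of the idea in Python rather than in a VHDL testbench (which is what was actually used above). It takes a GF(2) matrix stored as column bitmasks and emits one flat VHDL xor assignment per output bit. The function name, the signal names `x` and `y`, and the matrix layout are all assumptions made for this example.

```python
def xor_tree_vhdl(cols, n_out, in_name="x", out_name="y"):
    """Emit one VHDL concurrent assignment per output bit of a GF(2) matrix.

    cols[j] is a bitmask: bit i set means input bit j feeds output bit i.
    """
    lines = []
    for i in range(n_out):
        terms = [f"{in_name}({j})" for j, col in enumerate(cols) if (col >> i) & 1]
        rhs = " xor ".join(terms) if terms else "'0'"   # empty row drives constant zero
        lines.append(f"{out_name}({i}) <= {rhs};")
    return lines


# Tiny hypothetical 3-input, 2-output matrix, just to show the output format.
for line in xor_tree_vhdl([0b01, 0b11, 0b10], n_out=2):
    print(line)
```

The generated lines are meant to be pasted into (or written directly to) an architecture body; whether the synthesis tool then maps them to the expected number of levels still has to be checked in the timing report.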
> I really wish there were a way to use the carry chains for wide XORs.
I think that carry chains (and similar structures) became less important for wide functions once six input LUTs became commonplace.

The Xilinx DSP48E2 has a wide xor mode that I think can give a 96 input xor in a single DSP48E2 slice. I've never tried it.

Regards,
Allan
> The particular example I'm thinking of had a 128 in, 128 xor tree that
> really shouldn't be any harder to synth than a CRC. It's a linear
> mapping stage in an SP block cipher (like AES, but not AES (which has a
> relatively weak mixing function)).
>
> Vivado gave (IIRC) 11 or 12 levels of logic rather than the expected 3
> levels of logic. Hmmm. The revised source code (expressed as a bunch of
> xors) produced 4 levels of logic, and routed to speed.
Same here. I have constant multiplier matrices and each has a column weight of about 160, so I end up with a 160-input XOR for each column. Ideally that would be log6(160) = 2.8, i.e. 3 levels of 6-input LUTs. First I have to use very low-level code, and even then Vivado shares subexpressions too much and I end up with 6 levels unless I isolate column groups in different modules. If I isolate each column in its own module I can get the 3 levels. Isolating column groups also means they are placed as a group, which reduces wirelengths.
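The level estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch, assuming an ideal tree in which every 6-input LUT xors up to six signals and nothing is shared between columns; the function name is made up for the example.

```python
import math


def lut_levels(fanin, lut_inputs=6):
    """Levels of logic for an ideal xor tree built from LUTs with lut_inputs inputs."""
    levels = 0
    remaining = fanin
    while remaining > 1:
        remaining = -(-remaining // lut_inputs)  # ceiling division: signals left after this level
        levels += 1
    return levels


print(round(math.log(160, 6), 2))  # the fractional estimate, about 2.8
print(lut_levels(160))             # whole levels for a 160-input xor
print(lut_levels(128))             # whole levels for a 128-input xor
```

Both the 160-input and the 128-input cases come out at 3 levels under these assumptions, which matches the "expected 3 levels" figure in the thread; anything above that in a timing report points to sharing or mapping choices made by the tool.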
> The Xilinx DSP48E2 has a wide xor mode that I think can give a 96 input
> xor in a single DSP48E2 slice. I've never tried it.
Yeah, I looked into this at one point but decided against it for a few reasons. I thought a nice feature would be to be able to turn off the carries in the DSP48 and then you could use them for GF multipliers. I have used DSP48s as GF(2) accumulators and I've used them as transposers to extract column data from rows stored in RAMs.