I've been looking at the various core/macro generators and they all seem horribly large and slow, almost like student designs. Has anyone seriously taken a good look at hand fitting multipliers and squarers into Altera/Xilinx FPGA's?
What's the smallest/fastest CLB/Slice/Lut based unsigned integer 64 bit by 64 bit multiplier/squarer design?
Started by ●June 2, 2007
Reply by ●June 2, 20072007-06-02
In article <1180769858.872726.132740@i13g2000prf.googlegroups.com>, Totally_Lost <air_bits@yahoo.com> wrote:>I've been looking at the various core/macro generators and they all >seem horribly large and slow, almost like student designs. Has anyone >seriously taken a good look at hand fitting multipliers and squarers >into Altera/Xilinx FPGA's?I believe Dan Bernstein has, in the context of fast factorisation machines, but I don't think he's published yet. http://cr.yp.to/djb.html is his rather disorganised Web page. Tom
Reply by ●June 4, 20072007-06-04
Totally_Lost wrote:> I've been looking at the various core/macro generators and they all > seem horribly large and slow, almost like student designs. Has anyone > seriously taken a good look at hand fitting multipliers and squarers > into Altera/Xilinx FPGA's? >If you haven't already, you might look at the multiplication in FPGAs page on my website at http://www.andraka.com/multipli.htm. The partial product LUT multipliers are about as efficient as you are going to get with the FPGA LUT structure. If you can use multiple cycles, then you can combine that with sequential multiplication to make the multiplier smaller at the expense of more clock cycles per product.
Reply by ●June 5, 20072007-06-05
In article <an29i.92548$vE1.9245@newsfe24.lga>, Ray Andraka <ray@andraka.com> wrote:>Totally_Lost wrote: >> I've been looking at the various core/macro generators and they all >> seem horribly large and slow, almost like student designs. Has anyone >> seriously taken a good look at hand fitting multipliers and squarers >> into Altera/Xilinx FPGA's? >> > >If you haven't already, you might look at the multiplication in FPGAs >page on my website at http://www.andraka.com/multipli.htm. The partial >product LUT multipliers are about as efficient as you are going to get >with the FPGA LUT structure.I agree that partial-product LUT multipliers are pretty much optimal for multiplying three-bit quantities. For wide multipliers, I think the relevant literature is the multi-precision arithmetic stuff by people like Knuth; you can multiply two 64-bit numbers by using three 32x32 multipliers in parallel. Relevant search terms are Karatsuba and Toom-Cook. (trying to multiply 2^32A+B by 2^32C+D: compute AC, BD, and (A+B)*(C+D), then do 2^64*AC + BD + 2^32*((A+B)*(C+D)-AC-BD). This takes one 32-bit add delay in which you get A+B and C+D one 33x33->66 multiply delay for (A+B)(C+D), parallelly AC and BD one 64-bit add delay to get AC+BD one 66-bit subtract delay to get ((A+B)*(C+D)-AC-BD) one 64-bit add delay to add the top 34 bits of that product to AC and you can decompose the 33x33->66 multiply in the same kind of way, and if you're clever you can probably pipeline the wide additions) Toom-Cook lets you multiply two (A*B)-bit numbers by using (2A-1) multipliers a bit wider than B, and a lot of rather strange additional tidying-up circuitry. The idea is that multiplying numbers is a lot like multiplying polynomials, and you can reconstruct the coefficients of a polynomial of degree 2d from its values at 2d+1 places by multiplying by a constant matrix; pick the places to evaluate to make the coefficients in the matrix nice. Tom
Reply by ●June 5, 20072007-06-05
On Jun 4, 6:45 pm, Ray Andraka <r...@andraka.com> wrote:> Totally_Lost wrote: > > I've been looking at the various core/macro generators and they all > > seem horribly large and slow, almost like student designs. Has anyone > > seriously taken a good look at hand fitting multipliers and squarers > > into Altera/Xilinx FPGA's? > > If you haven't already, you might look at the multiplication in FPGAs > page on my website athttp://www.andraka.com/multipli.htm. The partial > product LUT multipliers are about as efficient as you are going to get > with the FPGA LUT structure. If you can use multiple cycles, then you > can combine that with sequential multiplication to make the multiplier > smaller at the expense of more clock cycles per product.Hi Ray, I certainly had took another cruise past that web site, as it's certainly a useful resource. The logic depths to build a wide 64x64=>128 bit multiplier that way do get a bit excessive, of which some pipelining does help a bit. The last time I faced this for a smaller 20x20=>40 bit multiplier I did better with an odd carry-save construction using the carry chain in a non-standard way, but took a couple days in the fpga editor to pack cleanly. The 64x64 isn't looking nearly as fun, so I was looking to see if someone has already been down this path.
Reply by ●June 6, 20072007-06-06
On Jun 5, 11:56 am, Thomas Womack <twom...@chiark.greenend.org.uk> wrote:> I agree that partial-product LUT multipliers are pretty much optimal for > multiplying three-bit quantities. For wide multipliers, I think the > relevant literature is the multi-precision arithmetic stuff by people like > Knuth; you can multiply two 64-bit numbers by using three 32x32 multipliers > in parallel. Relevant search terms are Karatsuba and Toom-Cook.Thanks Tom, I'll explore optimizations around that structural model as well, and compare with the carry-save model optimized around the extra half adders and mux in the carry chain hardware that worked well for the 20x20 multiplier.
Reply by ●June 6, 20072007-06-06
On Jun 4, 6:45 pm, Ray Andraka <r...@andraka.com> wrote:> Totally_Lost wrote: > > I've been looking at the various core/macro generators and they all > > seem horribly large and slow, almost like student designs. Has anyone > > seriously taken a good look at hand fitting multipliers and squarers > > into Altera/Xilinx FPGA's? > > If you haven't already, you might look at the multiplication in FPGAs > page on my website athttp://www.andraka.com/multipli.htm. The partial > product LUT multipliers are about as efficient as you are going to get > with the FPGA LUT structure. If you can use multiple cycles, then you > can combine that with sequential multiplication to make the multiplier > smaller at the expense of more clock cycles per product.Hi Ray ... Just for reference, what would you consider an optimal number of LUTs/Slices using this approach for an 8x8=>16 and 16x16=>32 multiplier? Thanks
Reply by ●June 7, 20072007-06-07
On 6 Jun., 08:14, Totally_Lost <air_b...@yahoo.com> wrote:> On Jun 4, 6:45 pm, Ray Andraka <r...@andraka.com> wrote:> Hi Ray ... Just for reference, what would you consider an optimal > number of LUTs/Slices using this approach for an 8x8=>16 and 16x16=>32 > multiplier? ThanksTricks like FFT based multiplications or Toom-Cook often are not beneficial in the range up to 64 bits because any improvement in LUTs (bothh area and depth) are killed by the area and delay required by the irregular routing. Therefor almost any hardware multiplier around is based on addition trees that are sum up partial products. (booth, wallace-tree, array, sequential, etc.) All of these architectures require N*N/K 1-bit adders when multiplying N bit numbers in K clock cycles. It turns out that all these architectures can be created from each other by just swapping wires around. This means, that if you are the type that uses RLOCs you can optimze the routing after placement has been fixed to equalize the critical path. (See our paper on that topic: http://eis.eit.uni-kl.de/eis/research/publications/papers/iccd04.pdf ) A script that implements this with a greedy algorithm is not extremely difficult to write. Kolja Sulimma
Reply by ●June 8, 20072007-06-08
On Jun 7, 10:24 am, "comp.arch.fpga" <ksuli...@googlemail.com> wrote:> All of these architectures require N*N/K 1-bit adders when multiplying > N bit numbers in K clock cycles.Hi Kolja, Without a doubt the product of two numbers of length M and N, yields at most M*N full adders, and LUT's, by most decompositions we teach students ... or N*N = N^2 in your form. For the product of two 64 bits, that yield roughly 4096 LUTs using the basic student forms which are taught in most EE courses. Actually, it's (N-1) rows of adders of length N, or 4064 full adders or LUTs. However, that is about what we expect from students, and novice engineers which fail to study best practice algorithms, and think about the problem in a bit more detail. Karatsuba's form requires two products of (N/2) plus a product of ((N/ 2)+1) bits, plus three more adders of length (M/2)+1, plus some careful merging of the data streams. Roughly 992+992+1056+99=3139 full adders, or LUTs which is a significant savings over N^2. A good student would produce this result without question. An excellent student would recursively implement the two 32 bit multiplies using Karatsuba's form as well, saving roughly another 378 full adders or LUTs, for a rough total of 2761 LUTs. And probably one more time to pickup another hundred or so full adders, or LUTs, but also to better constrain the input routing and locality. One of the side effects of using Karatsuba's form, is that it reduces the fan-in required for the input digits, and greatly relieves input routing over a basic student form solution where every input term is in a full N bit row or column. It's with this starting point, significantly better than N^2, (about 60% the size of the basic student adder tree from) that we can do even better with careful optimization and layout, using a host of salty dog tricks to implement a a slightly better variation of this model.
Reply by ●June 8, 20072007-06-08
"Totally_Lost" <air_bits@yahoo.com> wrote in message news:1180769858.872726.132740@i13g2000prf.googlegroups.com...> I've been looking at the various core/macro generators and they all > seem horribly large and slow, almost like student designs. Has anyone > seriously taken a good look at hand fitting multipliers and squarers > into Altera/Xilinx FPGA's?This has been an interesting thread. But I still wonder why you want to implement a LUT-based 64x64 multiplier? The reason the multiplier generator core in Xilinx Coregen doesn't use a whole lot of tricks in this configuration is that it seems like a very unlikely use case, and not one worth concentrating a lot of effort on. Ever since hard multiplier/DSP blocks began appearing in FPGAs, they have been the best building block for these large multiplier structures. Maybe you can share some details of why exactly you need such a macro? Cheers, -Ben-





