FPGARelated.com

Question about partial multiplication result in transposed FIR filter

Started by fl September 25, 2015
On 9/27/2015 6:44 PM, Tim Wescott wrote:
> On Sun, 27 Sep 2015 18:28:12 -0400, rickman wrote:
>
>> On 9/27/2015 4:08 PM, Tim Wescott wrote:
>>> On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:
>>>
>>>> On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
>>>>>> Hi,
>>>>>>
>>>>>> When I read a tutorial on FIR implementation on FPGA, I am not clear
>>>>>> about "partial results can be used for many multiplications
>>>>>> (regardless of symmetry)". That slide may be based on multipliers
>>>>>> built from logic cells in the FPGA, not on a dedicated MAC. Anyhow,
>>>>>> I don't know why "partial results can be used for many
>>>>>> multiplications (regardless of symmetry)". I only know that the
>>>>>> symmetry of the FIR coefficients can be used to save 50% of the
>>>>>> multipliers.
>>>>>>
>>>>>> Could you tell me how to understand the partial results?
>>>>>>
>>>>>> Thanks,
>>>>>
>>>>> Can we read that tutorial?
>>>>>
>>>>> Kaz
>>>>> ---------------------------------------
>>>>> Posted through http://www.FPGARelated.com
>>>>
>>>> Here is the link:
>>>> https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w
>>>>
>>>> Thanks,
>>>
>>> Well, the guy throws out one unfounded statement and then never
>>> supports it. In a live presentation you could raise your hand and ask
>>> about it. Can you send him an email?
>>>
>>> I can see ways that, given a predetermined set of coefficients, you
>>> may be able to get the gate count down, but that's not really within
>>> the scope of the talk.
>>
>> I think it is clear that he is talking about a hard-coded set of
>> coefficients. If the coefficients are variables in registers, there is
>> no efficient way to optimize the multipliers.
>> With fixed coefficients and multipliers in the fabric, this would be
>> an important optimization. He discusses using fabric for multipliers
>> in later slides. It is not unreasonable to expect the tools to do this
>> optimization automatically.
>
> I totally missed that -- yes, one would hope that in 2015 the tools
> would be able to figure out how to optimize fixed multipliers.
>
> This is a tangent, but it makes me wonder -- I saw a paper ages ago
> that was basically saying that if addition and subtraction are equally
> costly, then you can optimize a multiplication by using both -- i.e.,
> if you're multiplying (x) by 11110001111b, you can either sum eight
> shifted terms, or you can do (x << 11) - (x << 7) + (x << 4) - x,
> cutting the number of terms in half.
>
> So -- do modern optimizers do this when multiplying by fixed
> coefficients, or not?
I can't imagine they wouldn't. But mostly, multiplications of any type are done using the hardware multipliers available in the vast majority of FPGAs.

--
Rick
You might be referring to the technique of expressing a number in CSD (Canonical Signed Digit) form, which minimizes the number of nonzero digits in the number.
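The CSD idea can be sketched in a few lines (the code and the function name are illustrative, not from any poster): the constant is recoded with digits {-1, 0, +1} so that no two adjacent digits are nonzero, which minimizes the add/subtract terms needed for a fixed-coefficient multiply.

```python
def to_csd(n):
    """Recode a non-negative integer into canonical signed-digit (CSD) form.

    Returns a list of (sign, bit_position) terms such that
    n == sum(sign << pos), with no two adjacent nonzero digits.
    """
    terms = []
    pos = 0
    while n:
        if n & 1:
            if n & 2:                    # n % 4 == 3: emit -1 and carry up
                terms.append((-1, pos))
                n += 1
            else:                        # n % 4 == 1: emit +1
                terms.append((+1, pos))
                n -= 1
        n >>= 1
        pos += 1
    return terms

# 0b11110001111 has eight 1 bits; CSD needs only four signed terms,
# matching Tim's (x << 11) - (x << 7) + (x << 4) - x example.
csd = to_csd(0b11110001111)
assert csd == [(-1, 0), (+1, 4), (-1, 7), (+1, 11)]
assert sum(s << p for s, p in csd) == 0b11110001111
```

A fixed-coefficient multiplier built this way needs one adder or subtractor per nonzero CSD digit instead of one per 1 bit of the constant.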
I just built an optimized Galois Field vector multiplier, which multiplies a vector by a scalar.  As an experiment, I split it into two parts, one that was common to all elements of the vector, and then the parts that were unique to each element of the vector.  I had assumed this is something the synthesizer would do anyway, but I was surprised to find that writing it the way I did cut down the number of LUTs by a big margin.
On 10/3/2015 3:37 PM, Kevin Neilson wrote:
> I just built an optimized Galois Field vector multiplier, which multiplies a vector by a scalar. As an experiment, I split it into two parts, one that was common to all elements of the vector, and then the parts that were unique to each element of the vector. I had assumed this is something the synthesizer would do anyway, but I was surprised to find that writing it the way I did cut down the number of LUTs by a big margin.
Why is it using LUTs instead of multipliers? Are these numbers too small and too many to use the multipliers efficiently?

--
Rick
On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:

> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>> I just built an optimized Galois Field vector multiplier, which
>> multiplies a vector by a scalar. As an experiment, I split it into two
>> parts, one that was common to all elements of the vector, and then the
>> parts that were unique to each element of the vector. I had assumed
>> this is something the synthesizer would do anyway, but I was surprised
>> to find that writing it the way I did cut down the number of LUTs by a
>> big margin.
>
> Why is it using LUTs instead of multipliers? Are these numbers too
> small and too many to use the multipliers efficiently?
Kevin is talking about Galois Field multipliers. The integer multiplier blocks are useless for that.

I occasionally implement cryptographic primitives in FPGAs. These often use a combination of linear mixing functions and (other stuff which provides nonlinearity, which I don't need to talk about here). The linear mixing functions often contain GF multiplications (sometimes by constants).

It's possible to express the linear mixing functions in HDL behaviourally (e.g. including things that are recognisably GF multipliers), and it's also possible to express them as a sea of XOR gates. My experience with the latest tools from Xilinx is that for a particular 128 bit mixing function I was getting three times as many levels of logic from the behavioural source as I was from the sea-of-XOR-gates source, even though both described the same function.

BTW, one runs into similar problems when calculating CRCs of wide buses. The CRCs also simplify to XOR trees, but in this case we are calculating the remainder after a division, rather than a multiplication. (And yes, the integer DSP blocks are useless for this too.)

Regards,
Allan
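Allan's point that CRCs flatten into XOR trees follows from linearity over GF(2): with a zero initial value and no final XOR, the CRC of the XOR of two equal-length messages equals the XOR of their CRCs, so each output bit is a fixed XOR of input bits. A minimal bit-serial sketch (the polynomial and names are my own choices, not from the thread):

```python
def crc8(bits, poly=0x07, width=8):
    """Bit-serial CRC with init = 0 and no final XOR: the remainder of the
    message polynomial divided by the generator, all arithmetic in GF(2)."""
    mask = (1 << width) - 1
    reg = 0
    for b in bits:
        fb = ((reg >> (width - 1)) & 1) ^ b        # feedback = MSB xor input
        reg = ((reg << 1) & mask) ^ (poly if fb else 0)
    return reg

# Linearity: crc(a ^ b) == crc(a) ^ crc(b) for equal-length messages,
# which is why a wide-bus CRC reduces to one XOR tree per output bit.
a = [1, 0, 1, 1, 0, 0, 1, 0] * 4
b = [0, 1, 1, 0, 1, 0, 1, 1] * 4
assert crc8([x ^ y for x, y in zip(a, b)]) == crc8(a) ^ crc8(b)
```

Real CRC standards add a nonzero init and final XOR, which makes the map affine rather than linear, but the XOR-tree structure per output bit is unchanged.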
On 10/4/2015 2:01 AM, Allan Herriman wrote:
> On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:
>
>> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>>> I just built an optimized Galois Field vector multiplier, which
>>> multiplies a vector by a scalar. As an experiment, I split it into two
>>> parts, one that was common to all elements of the vector, and then the
>>> parts that were unique to each element of the vector. I had assumed
>>> this is something the synthesizer would do anyway, but I was surprised
>>> to find that writing it the way I did cut down the number of LUTs by a
>>> big margin.
>>
>> Why is it using LUTs instead of multipliers? Are these numbers too
>> small and too many to use the multipliers efficiently?
>
> Kevin is talking about Galois Field multipliers. The integer multiplier
> blocks are useless for that.
We are talking modulo-2 multiplies at every bit, otherwise known as AND gates with no carry? I'm a bit fuzzy on this.

Now I'm confused by Kevin's description. If the vector is multiplied by a scalar, what parts are common and what parts are unique? What parts of this are fixed vs. variable? The only parts a tool can optimize are the fixed operands. Or I am totally missing the concept.
> I occasionally implement cryptographic primitives in FPGAs. These often
> use a combination of linear mixing functions and (other stuff which
> provides nonlinearity, which I don't need to talk about here). The
> linear mixing functions often contain GF multiplications (sometimes by
> constants).
>
> It's possible to express the linear mixing functions in HDL
> behaviourally (e.g. including things that are recognisably GF
> multipliers), and it's also possible to express them as a sea of XOR
> gates.
>
> My experience with the latest tools from Xilinx is that for a
> particular 128 bit mixing function I was getting three times as many
> levels of logic from the behavioural source as I was from the sea of
> XOR gates source, even though both described the same function.
I don't know enough of how the tools work to say what is going on. I just know that when I dig into the output of the tools for poorly synthesized code, I find that my code doesn't specify the simple structure I had imagined it did. So I fix my code. :) Mostly this has to do with trying to use the carry out of the top of adders. Sometimes I get two adders.
> BTW, one runs into similar problems when calculating CRCs of wide
> buses. The CRCs also simplify to XOR trees, but in this case we are
> calculating the remainder after a division, rather than a
> multiplication. (And yes, the integer DSP blocks are useless for this
> too.)
Yep, you don't want a carry, so don't even think about using adders or multipliers. They are not adders and multipliers in every type of algebra.

--
Rick
Not only can I not use the DSP blocks, but the carry chains don't work either.  The only way to do a big XOR is with LUTs, and at my clock speeds, I can only do about 3 LUT levels.  At least there are plenty of BRAMs in Virtex parts now, so it's easy to do logs, inverses, and exponentiation.
> We are talking modulo 2 multiplies at every bit, otherwise known as AND
> gates with no carry? I'm a bit fuzzy on this.
>
> Now I'm confused by Kevin's description. If the vector is multiplied by
> a scalar, what parts are common and what parts are unique? What parts
> of this are fixed vs. variable? The only parts a tool can optimize are
> the fixed operands. Or I am totally missing the concept.
Say you're multiplying a scalar a by a vector [b c d]. Let's say we're using the field GF(8), so a is 3 bits. Now a can be thought of as ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a and alpha is the primitive element of the field. Then a*b or a*c or a*d is just a sum of some combination of those 3 values in the parentheses, depending upon the locations of the 1s in b, c, or d. So you can premultiply the three values in the parentheses (the common part) and then take sums of subsets of those three (the individual parts). It's all a bunch of XORs at the end.

This is just a complicated way of saying that by writing the HDL at a more explicit level, the synthesizer is better able to find common factors and use a lot fewer gates.
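Kevin's factoring can be sketched in a few lines. The field polynomial x^3 + x + 1 is my own (conventional) choice, since the thread never names one: the common part is the three partial products a*alpha^0, a*alpha^1, a*alpha^2, and each output is an XOR of the subset selected by that coefficient's bits.

```python
IRRED = 0b1011  # x^3 + x + 1, an irreducible polynomial defining GF(8)

def times_alpha(v):
    """Multiply a GF(8) element by the primitive element alpha (i.e. by x)."""
    v <<= 1
    return v ^ IRRED if v & 0b1000 else v

def gf8_scale_vector(a, vec):
    """Multiply every element of vec by the scalar a, sharing the partial
    products a*alpha^0, a*alpha^1, a*alpha^2 across all outputs."""
    partial = [a, times_alpha(a), times_alpha(times_alpha(a))]  # common part
    out = []
    for b in vec:                    # per-element part: XOR a bit-selected subset
        p = 0
        for i in range(3):
            if (b >> i) & 1:
                p ^= partial[i]
        out.append(p)
    return out

# alpha = 0b010; alpha^3 reduces to alpha + 1 = 0b011 in this field
assert gf8_scale_vector(0b010, [1, 2, 4]) == [2, 4, 3]
```

In hardware the three partials are built once, and each additional vector element costs only a few XORs; that shared common part is the LUT saving Kevin describes.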
On 10/5/2015 1:29 PM, Kevin Neilson wrote:
>> We are talking modulo 2 multiplies at every bit, otherwise known as
>> AND gates with no carry? I'm a bit fuzzy on this.
>>
>> Now I'm confused by Kevin's description. If the vector is multiplied
>> by a scalar, what parts are common and what parts are unique? What
>> parts of this are fixed vs. variable? The only parts a tool can
>> optimize are the fixed operands. Or I am totally missing the concept.
>
> Say you're multiplying a by a vector [b c d]. Let's say we're using
> the field GF(8) so a is 3 bits. Now a can be thought of as
> ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
> alpha is the primitive element of the field. Then a*b or a*c or a*d
> is just a sum of some combination of those 3 values in the
> parentheses, depending upon the locations of the 1s in b, c, or d.
> So you can premultiply the three values in the parentheses (the
> common part) and then take sums of subsets of those three (the
> individual parts). It's all a bunch of XORs at the end. This is
> just a complicated way of saying that by writing the HDL at a more
> explicit level, the synthesizer is better able to find common factors
> and use a lot fewer gates.
Ok, I'm not at all familiar with GFs. I see now a bit of what you are saying. But to be honest, I don't know that the tools would have any trouble with the example you give. The tools are pretty durn good at optimizing... *but*... there are two things to optimize for, size and performance. They are sometimes mutually exclusive, sometimes not. If you ask the tool to give you the optimum size, I don't think you will do better if you code it differently, while describing *exactly* the same behavior.

If you ask the tool to optimize for speed, the tool will feel free to duplicate logic if it allows higher performance, for example, by combining terms in different ways. Or less logic may require a longer chain of LUTs, which will be slower. LUT sizes in FPGAs don't always match the logical breakdown, so speed or size can vary a lot depending on the partitioning.

--
Rick
On 07/10/15 05:02, rickman wrote:
> On 10/5/2015 1:29 PM, Kevin Neilson wrote:
>>> We are talking modulo 2 multiplies at every bit, otherwise known as
>>> AND gates with no carry? I'm a bit fuzzy on this.
>>>
>>> Now I'm confused by Kevin's description. If the vector is multiplied
>>> by a scalar, what parts are common and what parts are unique? What
>>> parts of this are fixed vs. variable? The only parts a tool can
>>> optimize are the fixed operands. Or I am totally missing the concept.
>>
>> Say you're multiplying a by a vector [b c d]. Let's say we're using
>> the field GF(8) so a is 3 bits. Now a can be thought of as
>> ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
>> alpha is the primitive element of the field. Then a*b or a*c or a*d
>> is just a sum of some combination of those 3 values in the
>> parentheses, depending upon the locations of the 1s in b, c, or d.
>> So you can premultiply the three values in the parentheses (the
>> common part) and then take sums of subsets of those three (the
>> individual parts). It's all a bunch of XORs at the end. This is
>> just a complicated way of saying that by writing the HDL at a more
>> explicit level, the synthesizer is better able to find common factors
>> and use a lot fewer gates.
>
> Ok, I'm not at all familiar with GFs. I see now a bit of what you are
> saying. But to be honest, I don't know the tools would have any trouble
> with the example you give. The tools are pretty durn good at
> optimizing... *but*... there are two things to optimize for, size and
> performance. They are sometimes mutually exclusive, sometimes not. If
> you ask the tool to give you the optimum size, I don't think you will
> do better if you code it differently, while describing *exactly* the
> same behavior.
>
> If you ask the tool to optimize for speed, the tool will feel free to
> duplicate logic if it allows higher performance, for example, by
> combining terms in different ways. Or less logic may require a longer
> chain of LUTs which will be slower. LUT sizes in FPGAs don't always
> match the logical breakdown so that speed or size can vary a lot
> depending on the partitioning.
GF(8) is a Galois Field with 8 elements. That means that there are 8 different elements, and they can be written in different ways. Sometimes people like to be completely abstract and write 0, 1, a, b, c, d, e, f.
<http://www.wolframalpha.com/input/?i=GF%288%29>

Others will want to use the polynomial forms (which are used in the definition of GF(8)) and write: 0, 1, x, x+1, x^2, x^2+1, x^2+x, x^2+x+1.
<http://www.math.uic.edu/~leon/mcs425-s08/handouts/field.pdf>
That form is easily written in binary: 000, 001, 010, 011, 100, etc.

One key point about a GF (or any field) is that you can combine elements through addition, subtraction, multiplication, and division (except by 0) and get another element of the field. To make this work, however, addition and multiplication (and therefore their inverses) are very different from how you normally define them.

A GF is always of size p^n - in this case, p is 2 and n is 3. Your elements are sequences of n coefficients from Z/p (the integers modulo p). Addition is just pairwise addition of the coefficients, modulo p. For the case p = 2 (which is commonly used for practical applications, because it suits computing so neatly), this means you can hold your elements as a simple binary string of n digits, and addition is just xor.

Multiplication is a little more complex. Treat your elements as polynomials (as shown earlier), and multiply them. The coefficients of each power of x are reduced modulo p. Then the whole thing is reduced modulo the field's defining irreducible degree-n polynomial (in this case, x^3 + x + 1). This always gets you back to another element in the field.

The real magic here is that the multiplication table (excluding 0) forms a Latin square - every element appears exactly once in each row and column. This means that division "works".
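The Latin-square claim is easy to confirm by brute force. A sketch of the multiply described above, using shift-and-XOR followed by reduction modulo x^3 + x + 1 (one conventional choice of irreducible polynomial):

```python
IRRED = 0b1011  # x^3 + x + 1

def gf8_mul(a, b):
    """Carryless (GF(2)) polynomial multiply of two GF(8) elements,
    reduced modulo x^3 + x + 1."""
    p = 0
    for i in range(3):
        if (b >> i) & 1:
            p ^= a << i            # shift-and-XOR partial products
    for i in (5, 4, 3):            # fold degrees >= 3 back into the field
        if p & (1 << i):
            p ^= IRRED << (i - 3)
    return p

# Every row (and, by commutativity, every column) of the nonzero
# multiplication table is a permutation of the seven nonzero elements,
# so division always has a unique answer.
nonzero = list(range(1, 8))
for a in nonzero:
    assert sorted(gf8_mul(a, b) for b in nonzero) == nonzero
```

For GF(8) the same function could just as well be tabulated; the loop above is the "generate logic and reduce it to gates" route.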
It's easy to get a finite field like this for size p, where p is prime - it's just the integers modulo p, and you can use normal addition and multiplication modulo p. But the beauty of GFs is that they give you fields of other sizes - in particular, size 2^n.

Back to the task in hand - implementing GF(8) on an FPGA. Addition is just xor - the tools should not have trouble optimising that! Multiplication in GF is usually done with lookup tables. For a field as small as GF(8), you would have a table covering all pairs of elements. This could be handled in a small "ROM" with 6 address bits and 3 data bits, or the tools could generate logic and then reduce it to simple gates.

For larger fields, such as the very useful GF(2^8), you will probably use lookup tables for the logs and antilogs, and use them for multiplication and division.

A key area of practical application for GFs is in error detection and correction. A good example is RAID6 (two redundant disks in a RAID array) - there is an excellent paper on the subject by the guy who implemented RAID6 in Linux, using GF(2^8). It gives a good introduction to Galois fields.
<https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
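The log/antilog scheme for GF(2^8) can be sketched as two tables indexed the way the BRAMs would be. The polynomial 0x11d and generator 0x02 are the choices used by Linux RAID6 in the linked paper; the variable names are my own.

```python
POLY = 0x11d  # x^8 + x^4 + x^3 + x^2 + 1, the polynomial used by Linux RAID6

# Build antilog (EXP) and log (LOG) tables from powers of the generator 0x02.
EXP = [0] * 255
LOG = [0] * 256
v = 1
for i in range(255):
    EXP[i] = v
    LOG[v] = i
    v <<= 1                      # multiply by x ...
    if v & 0x100:
        v ^= POLY                # ... and reduce modulo the field polynomial

def gf256_mul(a, b):
    """Multiply in GF(2^8) by adding logs modulo 255 -- one lookup per
    operand plus one antilog lookup, i.e. a few BRAM reads in an FPGA."""
    if a == 0 or b == 0:
        return 0                 # zero has no log; handle it separately
    return EXP[(LOG[a] + LOG[b]) % 255]

assert gf256_mul(0x02, 0x02) == 0x04
assert gf256_mul(0x80, 0x02) == 0x1d   # x^8 reduces to x^4 + x^3 + x^2 + 1
```

Division works the same way with a subtraction of logs, which is why Kevin notes that plentiful BRAMs make logs, inverses, and exponentiation easy.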