FPGARelated.com

Question about partial multiplication result in transposed FIR filter

Started by fl September 25, 2015
On 9/27/2015 6:44 PM, Tim Wescott wrote:
> On Sun, 27 Sep 2015 18:28:12 -0400, rickman wrote:
>
>> On 9/27/2015 4:08 PM, Tim Wescott wrote:
>>> On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:
>>>
>>>> On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
>>>>>> Hi,
>>>>>>
>>>>>> When I read a tutorial on FIR implementation on FPGA, I am not clear
>>>>>> about "partial results can be used for many multiplications
>>>>>> (regardless of symmetry)". That slide may be based on multipliers
>>>>>> built from logic cells in the FPGA, not on a dedicated MAC. Anyhow,
>>>>>> I don't know why "partial results can be used for many
>>>>>> multiplications (regardless of symmetry)". I only know that the
>>>>>> symmetry of the FIR coefficients can be used to save 50% of the
>>>>>> multipliers.
>>>>>>
>>>>>> Could you tell me how to understand the partial results?
>>>>>>
>>>>>> Thanks,
>>>>>
>>>>> Can we read that tutorial?
>>>>>
>>>>> Kaz
>>>>> ---------------------------------------
>>>>> Posted through http://www.FPGARelated.com
>>>>
>>>> Here is the link:
>>>> https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w
>>>>
>>>> Thanks,
>>>
>>> Well, the guy throws out one unfounded statement and then never
>>> supports it. In a live presentation you could raise your hand and ask
>>> about it. Can you send him an email?
>>>
>>> I can see ways that, given a predetermined set of coefficients, you
>>> may be able to get the gate count down, but that's not really within
>>> the scope of the talk.
>>
>> I think it is clear that he is talking about a hard-coded set of
>> coefficients. If the coefficients are variables in registers, there is
>> no efficient way to optimize the multipliers.
>> With fixed coefficients and multipliers in the fabric, this would be
>> an important optimization. He discusses using fabric for multipliers
>> in later slides. It is not unreasonable to expect the tools to do this
>> optimization automatically.
>
> I totally missed that -- yes, one would hope that in 2015 the tools
> would be able to figure out how to optimize fixed multipliers.
>
> This is a tangent, but it makes me wonder -- I saw a paper ages ago
> that was basically saying that if addition and subtraction are equally
> costly, then you can optimize a multiplication by using both -- i.e.,
> if you're multiplying (x) by 11110001111b, you can either sum eight
> shifted terms, or you can do (x << 11) - (x << 7) + (x << 4) - x,
> cutting the number of terms in half.
>
> So -- do modern optimizers do this when multiplying by fixed
> coefficients, or not?
I can't imagine they wouldn't. But mostly, multiplications of any type are done using the hardware multipliers available in the vast majority of FPGAs.

--
Rick
You might be referring to the technique of expressing a number in CSD (Canonical Signed Digit) form, which minimizes the number of nonzero digits in the number.
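The CSD idea can be sketched in a few lines (the code and the function name are illustrative, not from any poster): the constant is recoded with digits {-1, 0, +1} so that no two adjacent digits are nonzero, which minimizes the add/subtract terms needed for a fixed-coefficient multiply.

```python
def to_csd(n):
    """Recode a non-negative integer into canonical signed-digit (CSD) form.

    Returns a list of (sign, bit_position) terms such that
    n == sum(sign << pos), with no two adjacent nonzero digits.
    """
    terms = []
    pos = 0
    while n:
        if n & 1:
            if n & 2:                    # n % 4 == 3: emit -1 and carry up
                terms.append((-1, pos))
                n += 1
            else:                        # n % 4 == 1: emit +1
                terms.append((+1, pos))
                n -= 1
        n >>= 1
        pos += 1
    return terms

# 0b11110001111 has eight 1 bits; CSD needs only four signed terms,
# matching Tim's (x << 11) - (x << 7) + (x << 4) - x example.
csd = to_csd(0b11110001111)
assert csd == [(-1, 0), (+1, 4), (-1, 7), (+1, 11)]
assert sum(s << p for s, p in csd) == 0b11110001111
```

A fixed-coefficient multiplier built this way needs one adder or subtractor per nonzero CSD digit instead of one per 1 bit of the constant.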
I just built an optimized Galois Field vector multiplier, which multiplies a vector by a scalar.  As an experiment, I split it into two parts, one that was common to all elements of the vector, and then the parts that were unique to each element of the vector.  I had assumed this is something the synthesizer would do anyway, but I was surprised to find that writing it the way I did cut down the number of LUTs by a big margin.
On 10/3/2015 3:37 PM, Kevin Neilson wrote:
> I just built an optimized Galois Field vector multiplier, which multiplies a vector by a scalar. As an experiment, I split it into two parts, one that was common to all elements of the vector, and then the parts that were unique to each element of the vector. I had assumed this is something the synthesizer would do anyway, but I was surprised to find that writing it the way I did cut down the number of LUTs by a big margin.
Why is it using LUTs instead of multipliers? Are these numbers too small and too many to use the multipliers efficiently?

--
Rick
On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:

> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>> I just built an optimized Galois Field vector multiplier, which
>> multiplies a vector by a scalar. As an experiment, I split it into two
>> parts, one that was common to all elements of the vector, and then the
>> parts that were unique to each element of the vector. I had assumed
>> this is something the synthesizer would do anyway, but I was surprised
>> to find that writing it the way I did cut down the number of LUTs by a
>> big margin.
>
> Why is it using LUTs instead of multipliers? Are these numbers too
> small and too many to use the multipliers efficiently?
Kevin is talking about Galois Field multipliers. The integer multiplier blocks are useless for that.

I occasionally implement cryptographic primitives in FPGAs. These often use a combination of linear mixing functions and (other stuff which provides nonlinearity, which I don't need to talk about here). The linear mixing functions often contain GF multiplications (sometimes by constants).

It's possible to express the linear mixing functions in HDL behaviourally (e.g. including things that are recognisably GF multipliers), and it's also possible to express them as a sea of XOR gates. My experience with the latest tools from Xilinx is that for a particular 128 bit mixing function I was getting three times as many levels of logic from the behavioural source as I was from the sea-of-XOR-gates source, even though both described the same function.

BTW, one runs into similar problems when calculating CRCs of wide buses. The CRCs also simplify to XOR trees, but in this case we are calculating the remainder after a division, rather than a multiplication. (And yes, the integer DSP blocks are useless for this too.)

Regards,
Allan
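Allan's point that CRCs flatten into XOR trees follows from linearity over GF(2): with a zero initial value and no final XOR, the CRC of the XOR of two equal-length messages equals the XOR of their CRCs, so each output bit is a fixed XOR of input bits. A minimal bit-serial sketch (the polynomial and names are my own choices, not from the thread):

```python
def crc8(bits, poly=0x07, width=8):
    """Bit-serial CRC with init = 0 and no final XOR: the remainder of the
    message polynomial divided by the generator, all arithmetic in GF(2)."""
    mask = (1 << width) - 1
    reg = 0
    for b in bits:
        fb = ((reg >> (width - 1)) & 1) ^ b        # feedback = MSB xor input
        reg = ((reg << 1) & mask) ^ (poly if fb else 0)
    return reg

# Linearity: crc(a ^ b) == crc(a) ^ crc(b) for equal-length messages,
# which is why a wide-bus CRC reduces to one XOR tree per output bit.
a = [1, 0, 1, 1, 0, 0, 1, 0] * 4
b = [0, 1, 1, 0, 1, 0, 1, 1] * 4
assert crc8([x ^ y for x, y in zip(a, b)]) == crc8(a) ^ crc8(b)
```

Real CRC standards add a nonzero init and final XOR, which makes the map affine rather than linear, but the XOR-tree structure per output bit is unchanged.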
On 10/4/2015 2:01 AM, Allan Herriman wrote:
> On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:
>
>> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>>> I just built an optimized Galois Field vector multiplier, which
>>> multiplies a vector by a scalar. As an experiment, I split it into two
>>> parts, one that was common to all elements of the vector, and then the
>>> parts that were unique to each element of the vector. I had assumed
>>> this is something the synthesizer would do anyway, but I was surprised
>>> to find that writing it the way I did cut down the number of LUTs by a
>>> big margin.
>>
>> Why is it using LUTs instead of multipliers? Are these numbers too
>> small and too many to use the multipliers efficiently?
>
> Kevin is talking about Galois Field multipliers. The integer multiplier
> blocks are useless for that.
We are talking modulo-2 multiplies at every bit, otherwise known as AND gates with no carry? I'm a bit fuzzy on this.

Now I'm confused by Kevin's description. If the vector is multiplied by a scalar, what parts are common and what parts are unique? What parts of this are fixed vs. variable? The only parts a tool can optimize are the fixed operands. Or I am totally missing the concept.
> I occasionally implement cryptographic primitives in FPGAs. These often
> use a combination of linear mixing functions and (other stuff which
> provides nonlinearity, which I don't need to talk about here). The
> linear mixing functions often contain GF multiplications (sometimes by
> constants).
>
> It's possible to express the linear mixing functions in HDL
> behaviourally (e.g. including things that are recognisably GF
> multipliers), and it's also possible to express them as a sea of XOR
> gates.
>
> My experience with the latest tools from Xilinx is that for a
> particular 128 bit mixing function I was getting three times as many
> levels of logic from the behavioural source as I was from the sea of
> XOR gates source, even though both described the same function.
I don't know enough of how the tools work to say what is going on. I just know that when I dig into the output of the tools for poorly synthesized code, I find that my code doesn't specify the simple structure I had imagined it did. So I fix my code. :) Mostly this has to do with trying to use the carry out of the top of adders. Sometimes I get two adders.
> BTW, one runs into similar problems when calculating CRCs of wide
> buses. The CRCs also simplify to XOR trees, but in this case we are
> calculating the remainder after a division, rather than a
> multiplication. (And yes, the integer DSP blocks are useless for this
> too.)
Yep, you don't want a carry, so don't even think about using adders or multipliers. They are not adders and multipliers in every type of algebra.

--
Rick
Not only can I not use the DSP blocks, but the carry chains don't work either.  The only way to do a big XOR is with LUTs, and at my clock speeds, I can only do about 3 LUT levels.  At least there are plenty of BRAMs in Virtex parts now, so it's easy to do logs, inverses, and exponentiation.
> We are talking modulo 2 multiplies at every bit, otherwise known as AND
> gates with no carry? I'm a bit fuzzy on this.
>
> Now I'm confused by Kevin's description. If the vector is multiplied by
> a scalar, what parts are common and what parts are unique? What parts
> of this are fixed vs. variable? The only parts a tool can optimize are
> the fixed operands. Or I am totally missing the concept.
Say you're multiplying a scalar a by a vector [b c d]. Let's say we're using the field GF(8), so a is 3 bits. Now a can be thought of as ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a and alpha is the primitive element of the field. Then a*b or a*c or a*d is just a sum of some combination of those 3 values in the parentheses, depending upon the locations of the 1s in b, c, or d. So you can premultiply the three values in the parentheses (the common part) and then take sums of subsets of those three (the individual parts). It's all a bunch of XORs at the end.

This is just a complicated way of saying that by writing the HDL at a more explicit level, the synthesizer is better able to find common factors and use a lot fewer gates.
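Kevin's factoring can be sketched in a few lines. The field polynomial x^3 + x + 1 is my own (conventional) choice, since the thread never names one: the common part is the three partial products a*alpha^0, a*alpha^1, a*alpha^2, and each output is an XOR of the subset selected by that coefficient's bits.

```python
IRRED = 0b1011  # x^3 + x + 1, an irreducible polynomial defining GF(8)

def times_alpha(v):
    """Multiply a GF(8) element by the primitive element alpha (i.e. by x)."""
    v <<= 1
    return v ^ IRRED if v & 0b1000 else v

def gf8_scale_vector(a, vec):
    """Multiply every element of vec by the scalar a, sharing the partial
    products a*alpha^0, a*alpha^1, a*alpha^2 across all outputs."""
    partial = [a, times_alpha(a), times_alpha(times_alpha(a))]  # common part
    out = []
    for b in vec:                    # per-element part: XOR a bit-selected subset
        p = 0
        for i in range(3):
            if (b >> i) & 1:
                p ^= partial[i]
        out.append(p)
    return out

# alpha = 0b010; alpha^3 reduces to alpha + 1 = 0b011 in this field
assert gf8_scale_vector(0b010, [1, 2, 4]) == [2, 4, 3]
```

In hardware the three partials are built once, and each additional vector element costs only a few XORs; that shared common part is the LUT saving Kevin describes.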
On 10/5/2015 1:29 PM, Kevin Neilson wrote:
>> We are talking modulo 2 multiplies at every bit, otherwise known as
>> AND gates with no carry? I'm a bit fuzzy on this.
>>
>> Now I'm confused by Kevin's description. If the vector is multiplied
>> by a scalar, what parts are common and what parts are unique? What
>> parts of this are fixed vs. variable? The only parts a tool can
>> optimize are the fixed operands. Or I am totally missing the concept.
>
> Say you're multiplying a by a vector [b c d]. Let's say we're using
> the field GF(8) so a is 3 bits. Now a can be thought of as
> ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
> alpha is the primitive element of the field. Then a*b or a*c or a*d
> is just a sum of some combination of those 3 values in the
> parentheses, depending upon the locations of the 1s in b, c, or d.
> So you can premultiply the three values in the parentheses (the
> common part) and then take sums of subsets of those three (the
> individual parts). It's all a bunch of XORs at the end. This is
> just a complicated way of saying that by writing the HDL at a more
> explicit level, the synthesizer is better able to find common factors
> and use a lot fewer gates.
Ok, I'm not at all familiar with GFs. I see now a bit of what you are saying. But to be honest, I don't know that the tools would have any trouble with the example you give. The tools are pretty durn good at optimizing... *but*... there are two things to optimize for, size and performance. They are sometimes mutually exclusive, sometimes not. If you ask the tool to give you the optimum size, I don't think you will do better if you code it differently, while describing *exactly* the same behavior.

If you ask the tool to optimize for speed, the tool will feel free to duplicate logic if it allows higher performance, for example, by combining terms in different ways. Or less logic may require a longer chain of LUTs, which will be slower. LUT sizes in FPGAs don't always match the logical breakdown, so speed or size can vary a lot depending on the partitioning.

--
Rick
On 07/10/15 05:02, rickman wrote:
> On 10/5/2015 1:29 PM, Kevin Neilson wrote:
>>> We are talking modulo 2 multiplies at every bit, otherwise known as
>>> AND gates with no carry? I'm a bit fuzzy on this.
>>>
>>> Now I'm confused by Kevin's description. If the vector is multiplied
>>> by a scalar, what parts are common and what parts are unique? What
>>> parts of this are fixed vs. variable? The only parts a tool can
>>> optimize are the fixed operands. Or I am totally missing the concept.
>>
>> Say you're multiplying a by a vector [b c d]. Let's say we're using
>> the field GF(8) so a is 3 bits. Now a can be thought of as
>> ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
>> alpha is the primitive element of the field. Then a*b or a*c or a*d
>> is just a sum of some combination of those 3 values in the
>> parentheses, depending upon the locations of the 1s in b, c, or d.
>> So you can premultiply the three values in the parentheses (the
>> common part) and then take sums of subsets of those three (the
>> individual parts). It's all a bunch of XORs at the end. This is
>> just a complicated way of saying that by writing the HDL at a more
>> explicit level, the synthesizer is better able to find common factors
>> and use a lot fewer gates.
>
> Ok, I'm not at all familiar with GFs. I see now a bit of what you are
> saying. But to be honest, I don't know the tools would have any trouble
> with the example you give. The tools are pretty durn good at
> optimizing... *but*... there are two things to optimize for, size and
> performance. They are sometimes mutually exclusive, sometimes not. If
> you ask the tool to give you the optimum size, I don't think you will
> do better if you code it differently, while describing *exactly* the
> same behavior.
>
> If you ask the tool to optimize for speed, the tool will feel free to
> duplicate logic if it allows higher performance, for example, by
> combining terms in different ways. Or less logic may require a longer
> chain of LUTs which will be slower. LUT sizes in FPGAs don't always
> match the logical breakdown so that speed or size can vary a lot
> depending on the partitioning.
GF(8) is a Galois Field with 8 elements. That means that there are 8 different elements, and they can be written in different ways. Sometimes people like to be completely abstract and write 0, 1, a, b, c, d, e, f.
<http://www.wolframalpha.com/input/?i=GF%288%29>

Others will want to use the polynomial forms (which are used in the definition of GF(8)) and write: 0, 1, x, x+1, x^2, x^2+1, x^2+x, x^2+x+1.
<http://www.math.uic.edu/~leon/mcs425-s08/handouts/field.pdf>
That form is easily written in binary: 000, 001, 010, 011, 100, etc.

One key point about a GF (or any field) is that you can combine elements through addition, subtraction, multiplication, and division (except by 0) and get another element of the field. To make this work, however, addition and multiplication (and therefore their inverses) are very different from how you normally define them.

A GF is always of size p^n - in this case, p is 2 and n is 3. Your elements are sequences of n coefficients from Z/p (the integers modulo p). Addition is just pairwise addition of the coefficients, modulo p. For the case p = 2 (which is commonly used for practical applications, because it suits computing so neatly), this means you can hold your elements as a simple binary string of n digits, and addition is just xor.

Multiplication is a little more complex. Treat your elements as polynomials (as shown earlier), and multiply them. The coefficients of each power of x are reduced modulo p. Then the whole thing is reduced modulo the field's defining irreducible degree-n polynomial (in this case, x^3 + x + 1). This always gets you back to another element in the field.

The real magic here is that the multiplication table (excluding 0) forms a Latin square - every element appears exactly once in each row and column. This means that division "works".
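The Latin-square claim is easy to confirm by brute force. A sketch of the multiply described above, using shift-and-XOR followed by reduction modulo x^3 + x + 1 (one conventional choice of irreducible polynomial):

```python
IRRED = 0b1011  # x^3 + x + 1

def gf8_mul(a, b):
    """Carryless (GF(2)) polynomial multiply of two GF(8) elements,
    reduced modulo x^3 + x + 1."""
    p = 0
    for i in range(3):
        if (b >> i) & 1:
            p ^= a << i            # shift-and-XOR partial products
    for i in (5, 4, 3):            # fold degrees >= 3 back into the field
        if p & (1 << i):
            p ^= IRRED << (i - 3)
    return p

# Every row (and, by commutativity, every column) of the nonzero
# multiplication table is a permutation of the seven nonzero elements,
# so division always has a unique answer.
nonzero = list(range(1, 8))
for a in nonzero:
    assert sorted(gf8_mul(a, b) for b in nonzero) == nonzero
```

For GF(8) the same function could just as well be tabulated; the loop above is the "generate logic and reduce it to gates" route.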
It's easy to get a finite field like this for size p, where p is prime - it's just the integers modulo p, and you can use normal addition and multiplication modulo p. But the beauty of GFs is that they give you fields of other sizes - in particular, size 2^n.

Back to the task in hand - implementing GF(8) on an FPGA. Addition is just xor - the tools should not have trouble optimising that! Multiplication in GF is usually done with lookup tables. For a field as small as GF(8), you would have a table covering all pairs of elements. This could be handled in a small "ROM" with 6 address bits and 3 data bits, or the tools could generate logic and then reduce it to simple gates.

For larger fields, such as the very useful GF(2^8), you will probably use lookup tables for the logs and antilogs, and use them for multiplication and division.

A key area of practical application for GFs is in error detection and correction. A good example is RAID6 (two redundant disks in a RAID array) - there is an excellent paper on the subject by the guy who implemented RAID6 in Linux, using GF(2^8). It gives a good introduction to Galois fields.
<https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
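The log/antilog scheme for GF(2^8) can be sketched as two tables indexed the way the BRAMs would be. The polynomial 0x11d and generator 0x02 are the choices used by Linux RAID6 in the linked paper; the variable names are my own.

```python
POLY = 0x11d  # x^8 + x^4 + x^3 + x^2 + 1, the polynomial used by Linux RAID6

# Build antilog (EXP) and log (LOG) tables from powers of the generator 0x02.
EXP = [0] * 255
LOG = [0] * 256
v = 1
for i in range(255):
    EXP[i] = v
    LOG[v] = i
    v <<= 1                      # multiply by x ...
    if v & 0x100:
        v ^= POLY                # ... and reduce modulo the field polynomial

def gf256_mul(a, b):
    """Multiply in GF(2^8) by adding logs modulo 255 -- one lookup per
    operand plus one antilog lookup, i.e. a few BRAM reads in an FPGA."""
    if a == 0 or b == 0:
        return 0                 # zero has no log; handle it separately
    return EXP[(LOG[a] + LOG[b]) % 255]

assert gf256_mul(0x02, 0x02) == 0x04
assert gf256_mul(0x80, 0x02) == 0x1d   # x^8 reduces to x^4 + x^3 + x^2 + 1
```

Division works the same way with a subtraction of logs, which is why Kevin notes that plentiful BRAMs make logs, inverses, and exponentiation easy.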