On 23/10/15 20:55, Kevin Neilson wrote:
> David, I like your explanation of mapping the field elements to
> something abstract, like 0, 1, a, b, c, d, e, f.  I've only seen one
> textbook that mentioned something like this (it used Greek letters)
> and I found it very helpful to realize that the field elements are
> only arbitrarily *mapped* to integers and are not equivalent to
> them.

For non-algebraists, the use of numbers and "addition" and
"multiplication" in field theory can be very confusing - "normal" people
are too used to identities such as "two times x equals x plus x" that
have no equivalent in fields such as GF(2^n).  Using letters or other
symbols helps make things clear.

> 
> I don't use ROMs for field multiplication.  A GF(2048) multiplier
> takes about 64 6-input LUTs, as I recall, with about 3 levels of
> logic.  A fixed multiplier (multiplying by a constant) is more like
> 20 LUTs.  I use ROMs for division (find the reciprocal), cubing, etc.
> If you use ROMs for multiplication it induces a lot of latency,
> because you have 2-3 cycles for each blockRAM, and you have to do a
> log and antilog lookup, so you can end up with 5+ cycles of latency. 
> Kevin
> 

There's always a balance between throughput, latency and resource usage,
depending on your requirements.  My experience with these is in software
rather than FPGA, where the balances are a little different.

David,
I like your explanation of mapping the field elements to something abstract, like 0, 1, a, b, c, d, e, f.  I've only seen one textbook that mentioned something like this (it used Greek letters) and I found it very helpful to realize that the field elements are only arbitrarily *mapped* to integers and are not equivalent to them.

I don't use ROMs for field multiplication.  A GF(2048) multiplier takes about 64 6-input LUTs, as I recall, with about 3 levels of logic.  A fixed multiplier (multiplying by a constant) is more like 20 LUTs.  I use ROMs for division (find the reciprocal), cubing, etc.  If you use ROMs for multiplication it induces a lot of latency, because you have 2-3 cycles for each blockRAM, and you have to do a log and antilog lookup, so you can end up with 5+ cycles of latency.
Kevin

On Tue, 06 Oct 2015 23:02:27 -0400, rickman wrote:

[snip] 
> Ok, I'm not at all familiar with GFs.  I see now a bit of what you are
> saying.  But to be honest, I don't know the tools would have any trouble
> with the example you give.  The tools are pretty durn good at
> optimizing... *but*... there are two things to optimize for, size and
> performance.  They are sometimes mutually exclusive, sometimes not.  If
> you ask the tool to give you the optimum size, I don't think you will do
> better if you code it differently, while describing *exactly* the same
> behavior.

You seem to have misinterpreted my earlier post in this thread in which
I had two implementations of *exactly* the same function that were giving 
very different speed and area results after synthesis.

I think what you say (about the coding not mattering and the tools doing 
a good job) is true for "small" functions.
Once you get past a certain level of complexity, the tools (at least the 
Xilinx ones) seem to do a poor job if the function has been coded the 
canonical way.  Recoding as a tree of XORs can give better results.

This is a fault in the synthesiser and is independent of the speed vs 
area optimisation setting.

Regards,
Allan

On 07/10/15 05:02, rickman wrote:
> On 10/5/2015 1:29 PM, Kevin Neilson wrote:
>>>
>>> We are talking modulo 2 multiplies at every bit, otherwise known as
>>> AND gates with no carry?  I'm a bit fuzzy on this.
>>>
>>> Now I'm confused by Kevin's description.  If the vector is
>>> multiplied by a scalar, what parts are common and what parts are
>>> unique?  What parts of this are fixed vs. variable?  The only parts
>>> a tool can optimize are the fixed operands.  Or I am totally
>>> missing the concept.
>>>
>> Say you're multiplying a by a vector [b c d].  Let's say we're using
>> the field GF(8) so a is 3 bits.  Now a can be thought of as (
>> a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
>> alpha is the primitive element of the field.  Then a*b or a*c or a*d
>> is just a sum of some combination of those 3 values in the
>> parentheses, depending upon the locations of the 1s in b, c, or d.
>> So you can premultiply the three values in the parentheses (the
>> common part) and then take sums of subsets of those three (the
>> individual parts).  It's all a bunch of XORs at the end.  This is
>> just a complicated way of saying that by writing the HDL at a more
>> explicit level, the synthesizer is better able to find common factors
>> and use a lot fewer gates.
> 
> Ok, I'm not at all familiar with GFs.  I see now a bit of what you are
> saying.  But to be honest, I don't know the tools would have any trouble
> with the example you give.  The tools are pretty durn good at
> optimizing... *but*... there are two things to optimize for, size and
> performance.  They are sometimes mutually exclusive, sometimes not.  If
> you ask the tool to give you the optimum size, I don't think you will do
> better if you code it differently, while describing *exactly* the same
> behavior.
> 
> If you ask the tool to optimize for speed, the tool will feel free to
> duplicate logic if it allows higher performance, for example, by
> combining terms in different ways.  Or less logic may require a longer
> chain of LUTs which will be slower.  LUT sizes in FPGAs don't always
> match the logical breakdown so that speed or size can vary a lot
> depending on the partitioning.
> 

GF(8) is a Galois field with 8 elements.  These elements can be written
in different ways.  Sometimes people like to be completely abstract and
write 0, 1, a, b, c, d, e, f.

<http://www.wolframalpha.com/input/?i=GF%288%29>

Others will want to use the polynomial forms (which are used in the
definition of GF(8)) and write: 0, 1, x, x+1, x^2, x^2+1, x^2+x, x^2+x+1.

<http://www.math.uic.edu/~leon/mcs425-s08/handouts/field.pdf>

That form is easily written in binary: 000, 001, 010, 011, 100, etc.

One key point about a GF (or any field) is that you can combine elements
through addition, subtraction, multiplication, and division (except by
0) and get another element of the field.  To make this work, however,
addition and multiplication (and therefore their inverses) are very
different from how you normally define them.

A GF is always of size p^n - in this case, p is 2 and n is 3.  Each
element is an ordered tuple of n values from Z/p (the integers modulo p).
Addition is just pairwise addition of the tuple entries, modulo p.  For the case
p = 2 (which is commonly used for practical applications, because it
suits computing so neatly), this means you can hold your elements as a
simple binary string of n digits, and addition is just xor.
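
As a rough illustration of the addition rule, here is a minimal Python sketch.  The name gf_add is made up for this example, and elements are assumed to be held as plain integers whose bits are the n coefficients.

```python
def gf_add(a: int, b: int) -> int:
    """Add two GF(2^n) elements: coefficient-wise addition modulo 2, i.e. xor."""
    return a ^ b

# x + x is 0 for every element, so "two times x" is not a useful idea here.
assert all(gf_add(x, x) == 0 for x in range(8))

# Subtraction is the same operation as addition: adding b twice undoes itself.
assert gf_add(gf_add(0b101, 0b011), 0b011) == 0b101
```

Note that no carries move between bit positions, which is why ordinary adders are of no help.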

Multiplication is a little more complex.  Treat your elements as
polynomials (as shown earlier), and multiply them.  The coefficient of
each x^i is reduced modulo p.  Then the whole thing is reduced modulo the
field's defining irreducible degree n polynomial (in this case, x^3 + x +
1).  This always gets you back to another element in the field.
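
The two steps can be sketched in Python as follows.  This is only an illustration: the names gf8_mul, POLY and N are invented here, and elements are assumed to be 3-bit integers with bit i holding the coefficient of x^i.

```python
POLY = 0b1011  # x^3 + x + 1, the defining polynomial for this GF(8)
N = 3          # degree of the defining polynomial

def gf8_mul(a: int, b: int) -> int:
    """Multiply two GF(8) elements held as 3-bit integers."""
    product = 0
    # Step 1: polynomial multiplication with coefficients taken modulo 2.
    for i in range(N):
        if (b >> i) & 1:
            product ^= a << i
    # Step 2: reduce modulo the defining polynomial, highest degree first.
    for degree in range(2 * N - 2, N - 1, -1):
        if (product >> degree) & 1:
            product ^= POLY << (degree - N)
    return product

# x^2 * x = x^3, which reduces to x + 1
assert gf8_mul(0b100, 0b010) == 0b011
```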

The real magic here is that the multiplication table (excluding 0) forms
a Latin square - every element appears exactly once in each row and
column.  This means that division "works".
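
The Latin square property is easy to check by brute force for a field this small.  The sketch below assumes the defining polynomial x^3 + x + 1 and uses made-up helper names (gf8_mul, gf8_div); the division by search is only there to show why uniqueness matters, not as a practical method.

```python
def gf8_mul(a, b):
    """GF(8) multiply, assuming the defining polynomial x^3 + x + 1 (0b1011)."""
    p = 0
    for i in range(3):
        if (b >> i) & 1:
            p ^= a << i
    for d in (4, 3):
        if (p >> d) & 1:
            p ^= 0b1011 << (d - 3)
    return p

nonzero = range(1, 8)
table = [[gf8_mul(a, b) for b in nonzero] for a in nonzero]

# Every row and every column is a permutation of the non-zero elements.
assert all(sorted(row) == list(nonzero) for row in table)
assert all(sorted(col) == list(nonzero) for col in zip(*table))

def gf8_div(a, b):
    """Find the q with q * b == a; it is unique because of the Latin square."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(8)")
    if a == 0:
        return 0
    return next(q for q in nonzero if gf8_mul(q, b) == a)
```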

It's easy to get a finite field like this for size p, where p is prime -
it's just the integers modulo p, and you can use normal addition and
multiplication modulo p.  But the beauty of GF's is that they give you
fields of different sizes - in particular, size 2^n.

Back to the task in hand - implementing GF(8) on an FPGA.  Addition is
just xor - the tools should not have trouble optimising that!
Multiplication in GF is usually done with lookup tables.  For a field as
small as GF(8), you would have a table with all the products.  This
could be handled in a 64x3 bit "ROM" (6 address bits in, 3 data bits
out), or the tools could generate logic and then reduce it to simple
gates (for example, with the defining polynomial x^3 + x + 1, bit 0 of
the product of a and b is a0.b0 xor a1.b2 xor a2.b1).  For larger
fields, such as the very useful GF(2^8), you will
probably use lookup tables for the logs and antilogs, and use them for
multiplication and division.
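
A minimal software sketch of the log/antilog approach for GF(2^8) might look like this.  The field polynomial 0x11D and the generator x (0x02) are assumptions made for the example, and the table and function names are invented here.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed; other polynomials work too)

# antilog[i] = g^i and log[g^i] = i, where g = x (0x02) generates the
# non-zero elements for this choice of polynomial.
antilog = [0] * 255
log = [0] * 256  # log[0] is never used: zero has no logarithm
value = 1
for i in range(255):
    antilog[i] = value
    log[value] = i
    value <<= 1
    if value & 0x100:
        value ^= POLY

def gf256_mul(a, b):
    """Multiply by adding logarithms modulo 255."""
    if a == 0 or b == 0:
        return 0
    return antilog[(log[a] + log[b]) % 255]

def gf256_div(a, b):
    """Divide by subtracting logarithms modulo 255."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    if a == 0:
        return 0
    return antilog[(log[a] - log[b]) % 255]
```

In software the two lookups and a small add are cheap.  In an FPGA each lookup would be a memory access, which is where the extra latency discussed elsewhere in this thread comes from.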

A key area of practical application for GF's is in error detection and
correction.  A good example is for RAID6 (two redundant disks in a RAID
array) - there is an excellent paper on the subject by the guy who
implemented RAID 6 in Linux, using GF(2^8).  It gives a good
introduction to Galois fields.

<https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>

On 10/5/2015 1:29 PM, Kevin Neilson wrote:
>>
>> We are talking modulo 2 multiplies at every bit, otherwise known as
>> AND gates with no carry?  I'm a bit fuzzy on this.
>>
>> Now I'm confused by Kevin's description.  If the vector is
>> multiplied by a scalar, what parts are common and what parts are
>> unique?  What parts of this are fixed vs. variable?  The only parts
>> a tool can optimize are the fixed operands.  Or I am totally
>> missing the concept.
>>
> Say you're multiplying a by a vector [b c d].  Let's say we're using
> the field GF(8) so a is 3 bits.  Now a can be thought of as (
> a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
> alpha is the primitive element of the field.  Then a*b or a*c or a*d
> is just a sum of some combination of those 3 values in the
> parentheses, depending upon the locations of the 1s in b, c, or d.
> So you can premultiply the three values in the parentheses (the
> common part) and then take sums of subsets of those three (the
> individual parts).  It's all a bunch of XORs at the end.  This is
> just a complicated way of saying that by writing the HDL at a more
> explicit level, the synthesizer is better able to find common factors
> and use a lot fewer gates.

Ok, I'm not at all familiar with GFs.  I see now a bit of what you are 
saying.  But to be honest, I don't know the tools would have any trouble 
with the example you give.  The tools are pretty durn good at 
optimizing... *but*... there are two things to optimize for, size and 
performance.  They are sometimes mutually exclusive, sometimes not.  If 
you ask the tool to give you the optimum size, I don't think you will do 
better if you code it differently, while describing *exactly* the same 
behavior.

If you ask the tool to optimize for speed, the tool will feel free to 
duplicate logic if it allows higher performance, for example, by 
combining terms in different ways.  Or less logic may require a longer 
chain of LUTs which will be slower.  LUT sizes in FPGAs don't always 
match the logical breakdown so that speed or size can vary a lot 
depending on the partitioning.

-- 

Rick

> 
> We are talking modulo 2 multiplies at every bit, otherwise known as AND 
> gates with no carry?  I'm a bit fuzzy on this.
> 
> Now I'm confused by Kevin's description.  If the vector is multiplied by 
> a scalar, what parts are common and what parts are unique?  What parts 
> of this are fixed vs. variable?  The only parts a tool can optimize are 
> the fixed operands.  Or I am totally missing the concept.
> 
Say you're multiplying a by a vector [b c d].  Let's say we're using the field GF(8) so a is 3 bits.  Now a can be thought of as ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and alpha is the primitive element of the field.  Then a*b or a*c or a*d is just a sum of some combination of those 3 values in the parentheses, depending upon the locations of the 1s in b, c, or d.  So you can premultiply the three values in the parentheses (the common part) and then take sums of subsets of those three (the individual parts).  It's all a bunch of XORs at the end.  This is just a complicated way of saying that by writing the HDL at a more explicit level, the synthesizer is better able to find common factors and use a lot fewer gates.
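
One possible reading of that split, sketched in Python rather than HDL.  The polynomial x^3 + x + 1, the choice of alpha as the element x, and the names times_alpha and scale_vector are all assumptions made for the illustration.

```python
POLY = 0b1011  # assumed defining polynomial x^3 + x + 1; alpha is the element x

def times_alpha(v):
    """Multiply a GF(8) element by alpha: shift left, reduce if degree 3 appears."""
    v <<= 1
    return v ^ POLY if v & 0b1000 else v

def scale_vector(a, vector):
    """Multiply every element of vector by the scalar a."""
    # Common part: a*alpha^0, a*alpha^1, a*alpha^2, computed once per vector.
    shared = [a]
    for _ in range(2):
        shared.append(times_alpha(shared[-1]))
    # Individual part: each product is the XOR of the shared terms selected
    # by the 1 bits of that vector element.
    results = []
    for element in vector:
        acc = 0
        for j in range(3):
            if (element >> j) & 1:
                acc ^= shared[j]
        results.append(acc)
    return results

print(scale_vector(0b011, [0b001, 0b010, 0b100, 0b111]))
```

The shared list is the logic that would be built once, and the inner loop corresponds to the per-element XORs.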

Not only can I not use the DSP blocks, but the carry chains don't work either.  The only way to do a big XOR is with LUTs, and at my clock speeds, I can only do about 3 LUT levels.  At least there are plenty of BRAMs in Virtex parts now, so it's easy to do logs, inverses, and exponentiation.

On 10/4/2015 2:01 AM, Allan Herriman wrote:
> On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:
>
>> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>>> I just built an optimized Galois Field vector multiplier, which
>>> multiplies a vector by a scalar.  As an experiment, I split it into two
>>> parts, one that was common to all elements of the vector, and then the
>>> parts that were unique to each element of the vector.  I had assumed
>>> this is something the synthesizer would do anyway, but I was surprised
>>> to find that writing it the way I did cut down the number of LUTs by a
>>> big margin.
>>
>> Why is it using LUTs instead of multipliers?  Are these numbers too
>> small and too many to use the multipliers efficiently?
>
>
> Kevin is talking about Galois Field multipliers.  The integer multiplier
> blocks are useless for that.

We are talking modulo 2 multiplies at every bit, otherwise known as AND 
gates with no carry?  I'm a bit fuzzy on this.

Now I'm confused by Kevin's description.  If the vector is multiplied by 
a scalar, what parts are common and what parts are unique?  What parts 
of this are fixed vs. variable?  The only parts a tool can optimize are 
the fixed operands.  Or I am totally missing the concept.

> I occasionally implement cryptographic primitives in FPGAs.  These often
> use a combination of linear mixing functions and (other stuff which
> provides nonlinearity, which I don't need to talk about here).  The
> linear mixing functions often contain GF multiplications (sometimes by
> constants).
>
> It's possible to express the linear mixing functions in HDL behaviourally
> (e.g. including things that are recognisably GF multipliers), and it's
> also possible to express them as a sea of XOR gates.
>
> My experience using the latest tools from Xilinx is that for a particular
> 128 bit mixing function I was getting three times as many levels of logic
> from the behavioural source as I was from the sea of XOR gates
> source, even though both described the same function.

I don't know enough of how the tools work to say what is going on.  I 
just know that when I dig into the output of the tools for poorly 
synthesized code, I find the problems are that my code doesn't specify 
the simple structure I had imagined it did.  So I fix my code.  :)

Mostly this has to do with trying to use the carry out the top of 
adders.  Sometimes I get two adders.

> BTW, one runs into similar problems when calculating CRCs of wide buses.
> The CRCs also simplify to XOR trees, but in this case we are calculating
> the remainder after a division, rather than a multiplication.  (And yes,
> the integer DSP blocks are useless for this too.)

Yep, you don't want a carry, so don't even think about using adders or 
multipliers.  They are not adders and multipliers in every type of algebra.

-- 

Rick

On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:

> On 10/3/2015 3:37 PM, Kevin Neilson wrote:
>> I just built an optimized Galois Field vector multiplier, which
>> multiplies a vector by a scalar.  As an experiment, I split it into two
>> parts, one that was common to all elements of the vector, and then the
>> parts that were unique to each element of the vector.  I had assumed
>> this is something the synthesizer would do anyway, but I was surprised
>> to find that writing it the way I did cut down the number of LUTs by a
>> big margin.
> 
> Why is it using LUTs instead of multipliers?  Are these numbers too
> small and too many to use the multipliers efficiently?

Kevin is talking about Galois Field multipliers.  The integer multiplier 
blocks are useless for that.

I occasionally implement cryptographic primitives in FPGAs.  These often 
use a combination of linear mixing functions and (other stuff which 
provides nonlinearity, which I don't need to talk about here).  The 
linear mixing functions often contain GF multiplications (sometimes by 
constants).

It's possible to express the linear mixing functions in HDL behaviourally 
(e.g. including things that are recognisably GF multipliers), and it's 
also possible to express them as a sea of XOR gates.

My experience using the latest tools from Xilinx is that for a particular 
128 bit mixing function I was getting three times as many levels of logic 
from the behavioural source as I was from the sea of XOR gates 
source, even though both described the same function.
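
To make the two descriptions concrete, here is a small Python sketch (not HDL, and not the mixing function mentioned above) that flattens a constant GF multiplier into XOR equations.  GF(2^8) with polynomial 0x11D is assumed purely for illustration, and all names are invented.

```python
N = 8
POLY = 0x11D  # assumed field polynomial, for illustration only

def gf_mul(a, b):
    """Behavioural GF(2^8) multiply: shift-and-add with reduction by POLY."""
    result = 0
    for _ in range(N):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << N):
            a ^= POLY
    return result

def xor_equations(constant):
    """For y = constant * x, list which input bits feed each output bit."""
    # Column j is the product of the constant with the single input bit x_j.
    columns = [gf_mul(constant, 1 << j) for j in range(N)]
    return [[j for j in range(N) if (columns[j] >> i) & 1] for i in range(N)]

def apply_equations(equations, x):
    """Evaluate the flattened XOR network on an input value."""
    y = 0
    for i, taps in enumerate(equations):
        bit = 0
        for j in taps:
            bit ^= (x >> j) & 1
        y |= bit << i
    return y

for i, taps in enumerate(xor_equations(0x02)):
    print(f"y{i} = " + " ^ ".join(f"x{j}" for j in taps))
```

Both forms compute the same function.  The printed equations are the kind of thing one would paste into the HDL as the "sea of XOR gates" version.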

BTW, one runs into similar problems when calculating CRCs of wide buses.  
The CRCs also simplify to XOR trees, but in this case we are calculating 
the remainder after a division, rather than a multiplication.  (And yes, 
the integer DSP blocks are useless for this too.)
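
As a sketch of why a wide-bus CRC collapses to XOR trees: the update is linear over GF(2), so the taps for each next-state bit can be found by pushing single bits through a bit-serial model.  The CRC polynomial, the bus width and all names below are assumptions chosen to keep the example small.

```python
CRC_BITS = 8
CRC_POLY = 0x07   # x^8 + x^2 + x + 1, chosen only as a small example
BUS_BITS = 16     # hypothetical bus width

def crc_serial(state, data):
    """Reference model: shift BUS_BITS data bits through the CRC, MSB first."""
    for k in reversed(range(BUS_BITS)):
        feedback = ((state >> (CRC_BITS - 1)) ^ (data >> k)) & 1
        state = (state << 1) & ((1 << CRC_BITS) - 1)
        if feedback:
            state ^= CRC_POLY
    return state

def xor_tree_taps():
    """For each next-state bit, list the state bits and data bits XORed into it."""
    state_cols = [crc_serial(1 << j, 0) for j in range(CRC_BITS)]
    data_cols = [crc_serial(0, 1 << k) for k in range(BUS_BITS)]
    taps = []
    for i in range(CRC_BITS):
        s = [j for j in range(CRC_BITS) if (state_cols[j] >> i) & 1]
        d = [k for k in range(BUS_BITS) if (data_cols[k] >> i) & 1]
        taps.append((s, d))
    return taps

def crc_parallel(state, data, taps):
    """One clock of the flattened XOR network: a whole bus word at once."""
    nxt = 0
    for i, (s, d) in enumerate(taps):
        bit = 0
        for j in s:
            bit ^= (state >> j) & 1
        for k in d:
            bit ^= (data >> k) & 1
        nxt |= bit << i
    return nxt
```

Each entry of the taps list is one XOR tree.  With a wide bus those trees get large, which is where the number of logic levels starts to matter.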

Regards,
Allan