
modulo 2**32-1 arith

Started by Ilya Kalistru December 15, 2015
> If that is your requirement, then the other solutions are wrong, like
> mine.
They are. But they give me ideas and approaches.
> In your original post you say you want a function to produce "modulo
> 2**32-1". The description you supply is *not* "modulo 2**32-1". It
> sounds like you have an existing implementation that may or may not be
> correct but you have been tasked with duplicating it. If I were tasked
> with this I would go through channels to get verification that the
> specification is correct. It wouldn't be the first time there was an
> error in an implementation that works most of the time, but fails 1 time
> in 2^32 in this case.
In my original post I was wrong. I forgot about this oddity and gave you the wrong description. This specification is 26 years old and I have tons of implementations :). I don't think it's a good idea to try to get verification. It's an old standard set by the government and that would be a very long way to go. Especially when the part of the work you are responsible for hasn't been done yet.
> What happens downstream if your function spits out a number that is not
> in the range of "modulo 2**32-1"?
Nothing special. It's just a way to generate pseudorandom data. It could be done this way or another way, but it has to be the same on all devices. The only difference would be security, and I'm not going to second-guess an algorithm designed by professional mathematicians; that's not my business.
> The problem I found was the input FFs were shoved into the IOB (a common
> optimization). This spreads the FFs around the IOs which are spaced far
> apart. You need to tell the tools not to do that *or* place additional
> FFs in the circuit so the ones in the IOBs are not part of the path
> being timed. Maybe you already have IOB FFs disabled.
I think it's pointless to shove FFs into the IOBs when there's a constraint saying we don't care about the timing between the ports and the FFs. Why would the tools do that?
I haven't disabled IOB FFs.
> This is absolutely a modulo function. It just treats 2^32-1 as == 0
> (which it is modulo 2^32-1).
Maybe. Let's not fight about mathematical definitions. :)
On 12/20/2015 1:39 PM, Richard Damon wrote:
> On 12/20/15 3:57 AM, Ilya Kalistru wrote:
>> Ok. It's wrong function to produce the modulo 2**n-1. But I don't need
>> correct modulo function, I need function which gives us that:
>> A+B if A+B<2**32
>> A+B-2**32+1 if A+B>=2**32
>> It's absolutely different from modulo function, but it's what they use
>> in the description of algorithm. It's not my responsibility to change
>> algorithm and this algorithm is used in software implementation
>> already and if I changed this function to correct modulo function my
>> hardware implementation would be incompatible with software
>> implementation and wouldn't comply to requirements of government.
>
> This is absolutely a modulo function. It just treats 2^32-1 as == 0
> (which it is modulo 2^32-1).
That's the problem. Some of the algorithms provided create mod 2^32-1 while others can include 2^32-1 in the output and so *are not* mod 2^32-1. To get mod 2^32-1 you need the extra 1 included in the intermediate sum whose carry decides whether a 1 is added to the output sum. The way Ilya is specifying the algorithm, that 1 is not included in the deciding sum, so 2^32-1 will show up in the output instead of being converted to 0 as it should be for the mod function. -- Rick
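(To make the distinction concrete, here is a minimal unregistered VHDL sketch of the two selects being discussed. The entity and signal names are made up for illustration, chosen to match the style of the code posted later in the thread; the only difference between the variants is which carry bit drives the multiplexer.)

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_mod_demo is
    port (
        A, B : in  unsigned(31 downto 0);
        C    : out unsigned(31 downto 0)
    );
end entity;

architecture rtl of add_mod_demo is
    signal Sum    : unsigned(32 downto 0);
    signal Sum_P1 : unsigned(32 downto 0);
begin
    Sum    <= resize(A, 33) + resize(B, 33);
    Sum_P1 <= resize(A, 33) + resize(B, 33) + 1;

    -- Variant from the specification ("end around carry"):
    -- the select is the carry of A+B, so 2**32-1 can appear in C.
    C <= Sum_P1(31 downto 0) when Sum(32) = '1' else Sum(31 downto 0);

    -- True mod 2**32-1 would select on the carry of A+B+1 instead,
    -- folding a result of 2**32-1 back to 0:
    -- C <= Sum_P1(31 downto 0) when Sum_P1(32) = '1' else Sum(31 downto 0);
end architecture;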
On 12/20/2015 2:07 PM, Ilya Kalistru wrote:
>> If that is your requirement, then the other solutions are wrong, like
>> mine.
>
> They are. But they give me ideas and approaches.
>
>> In your original post you say you want a function to produce "modulo
>> 2**32-1". The description you supply is *not* "modulo 2**32-1". It
>> sounds like you have an existing implementation that may or may not be
>> correct but you have been tasked with duplicating it. If I were tasked
>> with this I would go through channels to get verification that the
>> specification is correct. It wouldn't be the first time there was an
>> error in an implementation that works most of the time, but fails 1 time
>> in 2^32 in this case.
>
> In my original post I was wrong. I forgot about this oddity and gave you
> the wrong description. This specification is 26 years old and I have tons
> of implementations :). I don't think it's a good idea to try to get
> verification. It's an old standard set by the government and that would
> be a very long way to go. Especially when the part of the work you are
> responsible for hasn't been done yet.
Ok, if they have been using this method for a long time I guess the out-of-range value does not cause a problem. But certainly if this were being used in a critical application (such as cryptography) it would be a major concern, since I am almost positive the larger algorithm is expecting numbers in the more restricted range. Otherwise why specify it as mod 2^n-1? In any event, remove the " + 1" from the end of the sum I offered and you should have the behavior you are looking for.
>> What happens downstream if your function spits out a number that is not
>> in the range of "modulo 2**32-1"?
>
> Nothing special. It's just a way to generate pseudorandom data. It could
> be done this way or another way, but it has to be the same on all
> devices. The only difference would be security, and I'm not going to
> second-guess an algorithm designed by professional mathematicians;
> that's not my business.
This sounds familiar. I think I was in a conversation about this algorithm in another group sometime not too long ago. The mod 2^n-1 rings that bell.
>> The problem I found was the input FFs were shoved into the IOB (a common
>> optimization). This spreads the FFs around the IOs which are spaced far
>> apart. You need to tell the tools not to do that *or* place additional
>> FFs in the circuit so the ones in the IOBs are not part of the path
>> being timed. Maybe you already have IOB FFs disabled.
>
> I think it's pointless to shove FFs into the IOBs when there's a
> constraint saying we don't care about the timing between the ports and
> the FFs. Why would the tools do that?
> I haven't disabled IOB FFs.
I think I'm not being clear. In my P&R the input FFs are automatically placed into the IOBs. I don't know if the output FFs are also placed in the IOBs; I didn't dig that info out of the report once I saw the inputs were not right. This placement distorts the timing numbers because the routes have to be much longer to reach the widely spread IOBs. I don't know if your tool is doing this or not. In my case I would either need to find the setting that prevents placing the FFs in IOBs or add extra FFs to the test design so that the routes I want to time are all between fabric FFs.

This has nothing to do with timing rules. In fact, if you add a timing rule that says to ignore timing between the IOB FFs and the internal FFs, it may not time your paths at all. But I expect your tool is not using the IOB FFs, so you aren't seeing this problem.

Please don't feel you need to continue this conversation if you have gotten from it what you need. I don't want to be bugging you with nagging details. -- Rick
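(For readers following along: the extra layer of FFs rickman describes is the tool-agnostic way to keep IOB placement out of the measured path, since the adder path then runs between fabric FFs only. The sketch below is an illustration with made-up entity and signal names, not code from the thread; on Xilinx tools, a packing attribute or property on the ports is the other route he alludes to.)

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_wrap_test is
    port (
        clk        : in  std_logic;
        A_in, B_in : in  unsigned(31 downto 0);
        C_out      : out unsigned(31 downto 0)
    );
end entity;

architecture rtl of add_wrap_test is
    -- The first register stage may be packed into the IOBs by the tools;
    -- the second stage is always in the fabric, so the adder path
    -- (A, B -> C) is timed between fabric FFs.
    signal A_io, B_io  : unsigned(31 downto 0);
    signal A, B        : unsigned(31 downto 0);
    signal Sum, Sum_P1 : unsigned(32 downto 0);
    signal C           : unsigned(31 downto 0);
begin
    Sum    <= resize(A, 33) + resize(B, 33);
    Sum_P1 <= resize(A, 33) + resize(B, 33) + 1;

    process(clk)
    begin
        if rising_edge(clk) then
            A_io <= A_in;   -- stage that may land in the IOBs
            B_io <= B_in;
            A    <= A_io;   -- extra fabric stage
            B    <= B_io;
            if Sum(32) = '1' then
                C <= Sum_P1(31 downto 0);
            else
                C <= Sum(31 downto 0);
            end if;
            C_out <= C;     -- extra stage on the way back out
        end if;
    end process;
end architecture;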
On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
> I thank you all for this very interesting and useful discussion. There
> is a lot to think about.
> But I should apologize: later on it turned out that the definition of
> "modulo 2**32-1" used in the algorithm description is other than I had
> thought. It is weird from the mathematical point of view:
> A+B if A+B<2**32
> A+B-2**32+1 if A+B>=2**32
So in fact this is what I originally suggested, i.e. "end around carry."
This is often used in checksum algorithms. It's the equivalent of
(without the type / size changes):

sum <= A + B;
out <= sum(31 downto 0) + sum(32);

Note that if you do the whole thing in one cycle by looping the carry
out of a 32-bit full adder back to its carry in, you have not only two
passes through a 32-bit carry chain (worst case) but also the
non-dedicated routing in the loopback connection from carry out to
carry in. If your device supports a full 64-bit addition in one cycle,
you could work around the routing issue by using a 64-bit addition on
replicated inputs like:

sum <= (A & A) + (B & B);
out <= sum(63 downto 32);

Here the low 32 bits of the adder just provide the original carry out,
and the upper half of the adder lets you use the dedicated carry chain
routing, assuming your device is large enough to support 64 bits in one
column. If you can use two clocks of latency, then your pipeline scheme
below should work best:
> I managed to meet timings by using Sum(32) to decide what sum I should use.
>
> Sum <= resize(A, 33)+resize(B, 33);
> Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
>
> process(clk)
> begin
>     if rising_edge(clk) then
>         A <= A_in;
>         B <= B_in;
>
>         if Sum(32) = '1' then
>             C <= resize(Sum_P1, 32);
>         else
>             C <= resize(Sum, 32);
>         end if;
>     end if;
> end process;
>
> In my case (Artix-7) this code gives me a critical path of LUT2 then 8 CARRY4 then LUT3 and it's pretty fast (3.09 ns estimated before the routing step). It consumes 96 LUTs estimated.
>
> If I use
> Sum_P1 <= Sum+1;
> instead, it gives me a critical path of LUT2 + 9 CARRY4 + LUT2 + 8 CARRY4 and it doesn't meet timings (4.232 ns estimated before the routing step). It consumes 64 LUTs estimated.
>
> If we return to our original task with a proper modulo 2**32-1 addition and use Sum_P1(32) for the select input of the multiplexer
> Sum <= resize(A, 33)+resize(B, 33);
> Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
>
> process(clk)
> begin
>     if rising_edge(clk) then
>         A <= A_in;
>         B <= B_in;
>
>         if Sum_P1(32) = '1' then
>             C <= resize(Sum_P1, 32);
>         else
>             C <= resize(Sum, 32);
>         end if;
>     end if;
> end process;
>
> we would have exactly the same results on Artix-7 as with the "weird" addition. (And it makes sense.)
>
> In the case of Sum_P1 <= Sum+1; we have LUT2 + 7 CARRY4 + LUT + 2 CARRY4 + LUT2 and we don't meet timings (4.444 ns estimated before the routing step). It consumes 96 LUTs estimated.
>
> If we do it like this:
> Y1 <= resize(A, 33) + resize(B, 33) + 1;
> Y <= resize(A + B + resize(Y1(32 downto 32), 33), 32);
>
> process(clk)
> begin
>     if rising_edge(clk) then
>         A <= A_in;
>         B <= B_in;
>
>         C <= Y;
>     end if;
> end process;
>
> we have in the critical path LUT2 + 8*CARRY4 + LUT2 + 8*CARRY4 and we don't meet timings (4.338 ns estimated before the routing step). It consumes 64 LUTs estimated.
-- Gabor
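(A self-contained version of the replicated-input trick Gabor describes above might look like the following sketch; the entity name is made up and this is an illustration, not code from the thread. The low half of the 64-bit add produces the carry of A+B, and the dedicated carry chain feeds it straight into the high half.)

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_end_around_64 is
    port (
        A, B : in  unsigned(31 downto 0);
        C    : out unsigned(31 downto 0)
    );
end entity;

architecture rtl of add_end_around_64 is
    signal sum64 : unsigned(63 downto 0);
begin
    -- Replicate each operand into both halves of a 64-bit word.
    -- The low half computes A+B; its carry ripples through the
    -- dedicated carry chain into the high half, which therefore
    -- holds A+B plus the end-around carry (so 2**32-1 can appear
    -- in the output, matching the specification's behaviour).
    sum64 <= (A & A) + (B & B);
    C     <= sum64(63 downto 32);
end architecture;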
On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
> On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
>> I thank you all for this very interesting and useful discussion. There
>> is a lot to think about.
>> But I should apologize: later on it turned out that the definition of
>> "modulo 2**32-1" used in the algorithm description is other than I had
>> thought. It is weird from the mathematical point of view:
>> A+B if A+B<2**32
>> A+B-2**32+1 if A+B>=2**32
>
> So in fact this is what I originally suggested, i.e. "end around carry."
> This is often used in checksum algorithms. It's the equivalent of
> (without the type / size changes):
>
> sum <= A + B;
> out <= sum(31 downto 0) + sum(32);
>
> Note that if you do the whole thing in one cycle by looping the carry
> out of a 32-bit full adder back to its carry in, you have not only two
> passes through a 32-bit carry chain (worst case) but also the
> non-dedicated routing in the loopback connection from carry out to
> carry in.
Another problem with this method is that it creates a combinatorial loop which prevents proper static timing analysis. It is very efficient with the logic however. I'm just very curious about why this is the desired algorithm while it is being called "mod 2^32-1", but obviously we will never know the origin of this. -- Rick
On 12/20/2015 11:23 PM, rickman wrote:
> On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
>> On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
>>> I thank you all for this very interesting and useful discussion. There
>>> is a lot to think about.
>>> But I should apologize: later on it turned out that the definition of
>>> "modulo 2**32-1" used in the algorithm description is other than I had
>>> thought. It is weird from the mathematical point of view:
>>> A+B if A+B<2**32
>>> A+B-2**32+1 if A+B>=2**32
>>
>> So in fact this is what I originally suggested, i.e. "end around carry."
>> This is often used in checksum algorithms. It's the equivalent of
>> (without the type / size changes):
>>
>> sum <= A + B;
>> out <= sum(31 downto 0) + sum(32);
>>
>> Note that if you do the whole thing in one cycle by looping the carry
>> out of a 32-bit full adder back to its carry in, you have not only two
>> passes through a 32-bit carry chain (worst case) but also the
>> non-dedicated routing in the loopback connection from carry out to
>> carry in.
>
> Another problem with this method is that it creates a combinatorial loop
> which prevents proper static timing analysis. It is very efficient with
> the logic however.
>
> I'm just very curious about why this is the desired algorithm while it
> is being called "mod 2^32-1", but obviously we will never know the
> origin of this.
I have always heard this method referred to as "end around carry," however if you are using this as an accumulator, i.e. A is the next data input and B is the result of the previous sum, then it is in effect the same as taking the sum of inputs modulo 2**32 - 1, with the only difference being that the final result can be equal to 2**32 - 1 where it would normally be zero. Intermediate results equal to 2**32-1 do not change the final outcome vs. doing a true modulo operator. -- Gabor
On 12/21/2015 11:26 AM, Gabor Szakacs wrote:
> On 12/20/2015 11:23 PM, rickman wrote:
>> On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
>>> On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
>>>> I thank you all for this very interesting and useful discussion. There
>>>> is a lot to think about.
>>>> But I should apologize: later on it turned out that the definition of
>>>> "modulo 2**32-1" used in the algorithm description is other than I had
>>>> thought. It is weird from the mathematical point of view:
>>>> A+B if A+B<2**32
>>>> A+B-2**32+1 if A+B>=2**32
>>>
>>> So in fact this is what I originally suggested, i.e. "end around carry."
>>> This is often used in checksum algorithms. It's the equivalent of
>>> (without the type / size changes):
>>>
>>> sum <= A + B;
>>> out <= sum(31 downto 0) + sum(32);
>>>
>>> Note that if you do the whole thing in one cycle by looping the carry
>>> out of a 32-bit full adder back to its carry in, you have not only two
>>> passes through a 32-bit carry chain (worst case) but also the
>>> non-dedicated routing in the loopback connection from carry out to
>>> carry in.
>>
>> Another problem with this method is that it creates a combinatorial loop
>> which prevents proper static timing analysis. It is very efficient with
>> the logic however.
>>
>> I'm just very curious about why this is the desired algorithm while it
>> is being called "mod 2^32-1", but obviously we will never know the
>> origin of this.
>
> I have always heard this method referred to as "end around carry,"
> however if you are using this as an accumulator, i.e. A is the next
> data input and B is the result of the previous sum, then it is in
> effect the same as taking the sum of inputs modulo 2**32 - 1, with
> the only difference being that the final result can be equal to
> 2**32 - 1 where it would normally be zero. Intermediate results
> equal to 2**32-1 do not change the final outcome vs. doing a true
> modulo operator.
I guess I see what you are saying, even if this only applies to an accumulator rather than taking the sum of two arbitrary numbers. When a sum is equal to 2**N-1 the carry is essentially delayed until the next cycle. But... if the next value to be added is 2**N-1 (I don't know if this is possible) a carry will happen, but there should be a second carry which will be missed. So as long as the input domain is restricted to numbers from 0 to 2**N-2, this will work if the final result is adjusted to zero when it equals 2**N-1. -- Rick
 
> I'm interested. Placement will have a huge effect. I looked at routed
> results for my implementation and I saw one route of 4.5 ns. Looks like
> they shoved the input FFs into the IOBs, so the design can't all be
> close together as the IO is scattered around the chip perimeter. I'd
> need to kill that or add another layer of FFs so the routes of interest
> are all internal FFs which could be co-located. Still doesn't say they
> *will* be close together. I find tools to be hard to manage in that
> regard unless you do floor-planning. I much prefer a slow clock... lol
Today I've tried the new method in the real design and it turned out that it works quite well. The biggest negative timing slack was about -0.047 ns or so. If I can take advantage of single-cycle operation, I'll try to implement it there.
Total delay of the critical path there is about 4 ns; it's bigger than in my simple test mostly because of worse routing to the LUTs at the beginning and at the end of the dedicated CARRY network.
On 12/21/2015 2:33 PM, Ilya Kalistru wrote:
>> I'm interested. Placement will have a huge effect. I looked at routed
>> results for my implementation and I saw one route of 4.5 ns. Looks like
>> they shoved the input FFs into the IOBs, so the design can't all be
>> close together as the IO is scattered around the chip perimeter. I'd
>> need to kill that or add another layer of FFs so the routes of interest
>> are all internal FFs which could be co-located. Still doesn't say they
>> *will* be close together. I find tools to be hard to manage in that
>> regard unless you do floor-planning. I much prefer a slow clock... lol
>
> Today I've tried the new method in the real design and it turned out
> that it works quite well. The biggest negative timing slack was about
> -0.047 ns or so. If I can take advantage of single-cycle operation,
> I'll try to implement it there.
> Total delay of the critical path there is about 4 ns; it's bigger than
> in my simple test mostly because of worse routing to the LUTs at the
> beginning and at the end of the dedicated CARRY network.
Lots of times routing is half the total delay in a design. In this case it may be less because there are just two routes and the carry chain. But the point is that a bad routing delay in just one place can dominate. -- Rick