Reply by rickman December 29, 2015
On 12/29/2015 12:17 PM, Kevin Neilson wrote:
>>> This is somewhat unrelated, but I just remembered that recently I had to make a circuit to find modulo-24. I tried several things, but what ended up being fastest and smallest, by far, was mathematically rephrasing the expression, something like this:
>>>
>>> reg [11:0] x;
>>> reg [4:0] x_mod_24;
>>> ...
>>> x_mod_24 = ((x>>3)%3)<<3 | x[2:0];
>>
>> Interesting. I guess a divide by 24 is a lot more complex than a divide
>> by 3 even though it doesn't have to be.
>
> Doing a division is one way to do it, but if you look at what the synthesizer does to do mod-24, it's something like this:
>
> x_mod_24_stage1 = x[11]*( (1<<11)%24) + x[10]*( (1<<10)%24) ... + x[3:0];
>
> adding up the mod-24 values due to each bit (and then adding the last 3 bits, which are already mod-24). These are all 5-bit values. After you add these, the result may be bigger than 24 and there's a second and third stage of this process. When you do it the "rephrased" way, the mod-3 logic is a lot smaller (the sums are all 2 bits) and has fewer stages.
But clearly there is no need for it to be more complex. What you describe above is essentially a product. I think the last term should be x[2:0], which is the three lsbs. These bits of the output are only affected by the three lsbs of the input. The upper bits of the input can only affect the upper bits of the output. So clearly there is no reason for the tool to calculate the entire 5 bits of the output based on the entire N bits of the input. -- Rick
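The lsb observation is easy to confirm numerically. This is an editorial sketch (not part of the thread): because 24 = 8 * 3, x - (x mod 24) is always a multiple of 8, so reducing mod 24 never disturbs the low three bits.

```python
# Exhaustive check over the 12-bit range used in the thread: the low
# 3 bits of x mod 24 depend only on the low 3 bits of x.
for x in range(1 << 12):
    assert (x % 24) & 0b111 == x & 0b111
print("low 3 bits pass straight through mod 24 for all 12-bit x")
```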
Reply by Kevin Neilson December 29, 2015
> > This is somewhat unrelated, but I just remembered that recently I had to make a circuit to find modulo-24. I tried several things, but what ended up being fastest and smallest, by far, was mathematically rephrasing the expression, something like this:
> >
> > reg [11:0] x;
> > reg [4:0] x_mod_24;
> > ...
> > x_mod_24 = ((x>>3)%3)<<3 | x[2:0];
>
> Interesting. I guess a divide by 24 is a lot more complex than a divide
> by 3 even though it doesn't have to be.
Doing a division is one way to do it, but if you look at what the synthesizer does to do mod-24, it's something like this:

x_mod_24_stage1 = x[11]*( (1<<11)%24) + x[10]*( (1<<10)%24) ... + x[3:0];

adding up the mod-24 values due to each bit (and then adding the last 3 bits, which are already mod-24). These are all 5-bit values. After you add these, the result may be bigger than 24 and there's a second and third stage of this process. When you do it the "rephrased" way, the mod-3 logic is a lot smaller (the sums are all 2 bits) and has fewer stages.
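The staged per-bit reduction described above can be modeled behaviorally. This is an editorial sketch of the idea as the post describes it, not the synthesizer's actual netlist; the number of stages and the final correction are assumptions.

```python
def mod24_bitwise(x):
    """Fold per-bit residues: each set bit i >= 4 contributes
    (2**i) % 24, bits 3:0 pass straight through, and the stage is
    repeated until the value is small enough for a final correction."""
    while x >= 32:
        acc = x & 0xF                    # x[3:0] is already < 24
        for i in range(4, x.bit_length()):
            if (x >> i) & 1:
                acc += (1 << i) % 24     # 5-bit residue of this bit
        x = acc
    if x >= 24:                          # final correction stage
        x -= 24
    return x

# exhaustive check over the 12-bit range from the thread
assert all(mod24_bitwise(x) == x % 24 for x in range(1 << 12))
```

Each stage strictly shrinks the value (every residue is at most 16, well below the weight 2^i of the bit it replaces), which is why a fixed small number of stages suffices in hardware.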
Reply by rickman December 26, 2015
On 12/26/2015 9:01 PM, Kevin Neilson wrote:
>> Uh, if you read Ilya's posts the 64 bit carry chain was faster than the
>> muxed selection. Or did I misunderstand this?
>
> I hadn't seen the part about the 64-bit add. That's a nice idea, too. I wouldn't have expected it to be faster than the basic Shannon Expander, but with the quirks of getting data on and off the carry chain, it doesn't surprise me that it is.
I don't think it has anything to do with getting "data on and off the carry chain". The slow part of FPGAs is often just routing. The 64 bit add eliminates a lot of routing that is needed to implement the mux as well as eliminating the mux itself.
> This is somewhat unrelated, but I just remembered that recently I had to make a circuit to find modulo-24. I tried several things, but what ended up being fastest and smallest, by far, was mathematically rephrasing the expression, something like this:
>
> reg [11:0] x;
> reg [4:0] x_mod_24;
> ...
> x_mod_24 = ((x>>3)%3)<<3 | x[2:0];
Interesting. I guess a divide by 24 is a lot more complex than a divide by 3 even though it doesn't have to be. -- Rick
Reply by Kevin Neilson December 26, 2015
> Uh, if you read Ilya's posts the 64 bit carry chain was faster than the
> muxed selection. Or did I misunderstand this?
I hadn't seen the part about the 64-bit add. That's a nice idea, too. I wouldn't have expected it to be faster than the basic Shannon Expander, but with the quirks of getting data on and off the carry chain, it doesn't surprise me that it is.

This is somewhat unrelated, but I just remembered that recently I had to make a circuit to find modulo-24. I tried several things, but what ended up being fastest and smallest, by far, was mathematically rephrasing the expression, something like this:

reg [11:0] x;
reg [4:0] x_mod_24;
...
x_mod_24 = ((x>>3)%3)<<3 | x[2:0];
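The rephrasing leans on 24 = 8 * 3: shift out the low 3 bits, reduce what remains mod 3, shift back, and re-attach the low bits. Note the recombination must be an OR (or an add), not an AND, since the two fields occupy disjoint bits. A quick exhaustive check, as an editorial Python sketch:

```python
# x = 8q + r with r < 8, so x mod 24 = 8*(q mod 3) + r, and the two
# fields never overlap.  Check over the full 12-bit range in the post:
for x in range(1 << 12):
    rephrased = (((x >> 3) % 3) << 3) | (x & 0b111)
    assert rephrased == x % 24
print("rephrased mod-24 matches x % 24 for all 12-bit x")
```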
Reply by rickman December 26, 2015
On 12/25/2015 9:04 PM, Kevin Neilson wrote:
> This has all been pretty interesting to me because I may need to do
> this exact thing. I had been planning to do the "end around carry",
> as you call it, but if that is too slow, I can use Ilya's Shannon
> Expansion. Ilya's method (i.e., calculating both A+B and A+B+1 and
> choosing one based on the carry out) would be much faster at the
> expense of extra logic. It's just standard Shannon Expansion but I
> hadn't thought of it.
Uh, if you read Ilya's posts the 64 bit carry chain was faster than the muxed selection. Or did I misunderstand this?
> I might need to do this for Galois field arithmetic (which is
> probably the case for Ilya as well). One way to multiply GF numbers
> is to take the logs, add them together mod 2^m-1, and then take the
> inverse log, where 2^m is the size of the Galois field. In my case,
> m is only 10 or 11, so I can use lookup tables for the logs and
> antilogs.
>
> f*g = alog( mod( log(f)+log(g), 2^m-1 ) )
>
> where f,g are elements of GF(2^m)
>
> It's the same process as a slide rule. It doesn't matter if the
> result of the mod ends up being 2^m-1 instead of 0, because the
> alog() function is a lookup table and will still map alog(2^m-1) =
> alog(0) = 1.
>
> This method would be especially helpful if you are finding the
> product of several GF numbers. So far I've just been using actual GF
> multipliers. Sometimes they are big, but you can get the product in
> a single cycle, and the lookup tables entail latency. Also, if you
> have to add in a number before the next multiplication, you have to
> go back into the alog domain.
-- Rick
Reply by Kevin Neilson December 25, 2015
This has all been pretty interesting to me because I may need to do this exact thing.  I had been planning to do the "end around carry", as you call it, but if that is too slow, I can use Ilya's Shannon Expansion.  Ilya's method (i.e., calculating both A+B and A+B+1 and choosing one based on the carry out) would be much faster at the expense of extra logic.  It's just standard Shannon Expansion but I hadn't thought of it.
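The two alternatives compared here can be modeled behaviorally. This is an editorial Python sketch of the arithmetic (not RTL), using the thread's end-around-carry definition of mod 2^32-1, under which a result of exactly 2^32-1 is left as all-ones rather than wrapped to zero:

```python
N = 32
MASK = (1 << N) - 1   # 2^32 - 1

def add_end_around(a, b):
    # single pass down the carry chain, carry-out folded back in
    # (the thread's definition: A+B if A+B < 2^32, else A+B - 2^32 + 1)
    s = a + b
    return (s & MASK) + (s >> N)

def add_shannon(a, b):
    # Shannon expansion / carry-select: compute A+B and A+B+1 in
    # parallel, then pick one based on the carry-out of A+B
    s = a + b
    return (s + 1) & MASK if s > MASK else s

# the two forms agree for all 32-bit operand pairs; spot checks:
for a, b in [(0, 0), (MASK, 1), (MASK, MASK), (123456789, 987654321)]:
    assert add_end_around(a, b) == add_shannon(a, b)
```

The hardware trade-off is exactly as described: the Shannon form spends a second adder and a mux to avoid waiting for the carry-out before the sum can be corrected.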

I might need to do this for Galois field arithmetic (which is probably the case for Ilya as well).  One way to multiply GF numbers is to take the logs, add them together mod 2^m-1, and then take the inverse log, where 2^m is the size of the Galois field.  In my case, m is only 10 or 11, so I can use lookup tables for the logs and antilogs.

f*g = alog( mod( log(f)+log(g), 2^m-1 ) )

        where f,g are elements of GF(2^m)

It's the same process as a slide rule.  It doesn't matter if the result of the mod ends up being 2^m-1 instead of 0, because the alog() function is a lookup table and will still map alog(2^m-1) = alog(0) = 1.

This method would be especially helpful if you are finding the product of several GF numbers.  So far I've just been using actual GF multipliers.  Sometimes they are big, but you can get the product in a single cycle, and the lookup tables entail latency.  Also, if you have to add in a number before the next multiplication, you have to go back into the alog domain.
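The log/antilog multiply above can be sketched on a scaled-down field. This is an editorial Python model using GF(2^4) with the primitive polynomial x^4 + x + 1 (an assumption, standing in for the m = 10 or 11 case), including the end-around-carry exponent add whose possible 2^m-1 result the antilog table absorbs:

```python
M = 4
POLY = 0b10011            # x^4 + x + 1 (assumed primitive polynomial)
N = (1 << M) - 1          # 2^m - 1, the exponent modulus

# Build log/antilog tables by stepping through powers of the primitive
# element g = x (i.e. 2).
alog = [0] * (N + 1)
log = [0] * (1 << M)
a = 1
for i in range(N):
    alog[i] = a
    log[a] = i
    a <<= 1
    if a & (1 << M):
        a ^= POLY
alog[N] = alog[0]         # alog(2^m - 1) = alog(0) = 1, as the post notes

def add_mod_end_around(a, b):
    # end-around-carry add: may leave 2^m - 1 in place of 0,
    # which the extra alog table entry above absorbs
    s = a + b
    return (s & N) + (s >> M)

def gf_mul_tables(f, g):
    # slide-rule multiply: log, add exponents, antilog
    if f == 0 or g == 0:
        return 0
    return alog[add_mod_end_around(log[f], log[g])]

def gf_mul_direct(f, g):
    # reference: carry-less multiply reduced by POLY
    r = 0
    while g:
        if g & 1:
            r ^= f
        g >>= 1
        f <<= 1
        if f & (1 << M):
            f ^= POLY
    return r

assert all(gf_mul_tables(f, g) == gf_mul_direct(f, g)
           for f in range(16) for g in range(16))
```

For m = 10 or 11 the tables are 1K or 2K entries each, which is where the block-RAM lookup (and its latency, as noted) comes in.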
Reply by Richard Damon December 21, 2015
On 12/21/15 1:57 PM, rickman wrote:
> On 12/21/2015 11:26 AM, Gabor Szakacs wrote:
>> I have always heard this method referred to as "end around carry,"
>> however if you are using this as an accumulator, i.e. A is the next
>> data input and B is the result of the previous sum, then it is in
>> effect the same as taking the sum of inputs modulo 2**32 - 1, with
>> the only difference being that the final result can be equal to
>> 2**32 - 1 where it would normally be zero. Intermediate results
>> equal to 2**32-1 do not change the final outcome vs. doing a true
>> modulo operator.
>
> I guess I see what you are saying, even if this only applies to an
> accumulator rather than taking a sum of two arbitrary numbers. When a
> sum is equal to 2**N-1 the carry is essentially delayed until the next
> cycle. But... if the next value to be added is 2**N-1 (don't know if
> this is possible) a carry will happen, but there should be a second
> carry which will be missed. So as long as the input domain is
> restricted to numbers from 0 to 2**N-2 this will work if the final
> result is adjusted to zero when equal to 2**N-1.
All ones (the alternate representation of 0, i.e. 2^32-1) plus all ones gives a result of all ones because of the end-around carry; in effect the second carry is remembered by leaving the result as all ones.
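That corner case can be checked directly; a small editorial Python sketch:

```python
# All-ones plus all-ones under end-around carry yields all-ones again,
# so the "second carry" is effectively remembered by the all-ones
# representation of zero.
N = 32
ONES = (1 << N) - 1

def add_eac(a, b):
    s = a + b
    return (s & ONES) + (s >> N)   # fold the carry-out back in

assert add_eac(ONES, ONES) == ONES
```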
Reply by rickman December 21, 2015
On 12/21/2015 2:33 PM, Ilya Kalistru wrote:
>> I'm interested. Placement will have a huge effect. I look at routed
>> results for my implementation and I saw one route of 4.5 ns. Looks like
>> they shoved the input FFs into the IOBs, so the design can't all be
>> close together as the IO is scattered around the chip perimeter. I'd
>> need to kill that or add another layer of FFs so the routes of interest
>> are all internal FFs which could be co-located. Still doesn't say they
>> *will* be close together. I find tools to be hard to manage in that
>> regard unless you do floor-planning. I much prefer a slow clock... lol
>
> Today I've tried the new method in the real design and it turned out that it works quite well. The biggest negative timing slack was about -0.047 or so. If I can take advantage of single cycle operation, I'll try to implement it there.
> The total delay of the critical path there is about 4 ns, and it's bigger than in my simple test mostly because of worse routing to the LUTs at the beginning and at the end of the dedicated CARRY network.
Lots of times routing is half the total delay in a design. In this case it may be less because there are just two routes and the carry chain. But the point is that a bad routing delay in just one place can dominate. -- Rick
Reply by Ilya Kalistru December 21, 2015
 
> I'm interested. Placement will have a huge effect. I look at routed
> results for my implementation and I saw one route of 4.5 ns. Looks like
> they shoved the input FFs into the IOBs, so the design can't all be
> close together as the IO is scattered around the chip perimeter. I'd
> need to kill that or add another layer of FFs so the routes of interest
> are all internal FFs which could be co-located. Still doesn't say they
> *will* be close together. I find tools to be hard to manage in that
> regard unless you do floor-planning. I much prefer a slow clock... lol
Today I've tried the new method in the real design and it turned out that it works quite well. The biggest negative timing slack was about -0.047 or so. If I can take advantage of single cycle operation, I'll try to implement it there. The total delay of the critical path there is about 4 ns, and it's bigger than in my simple test mostly because of worse routing to the LUTs at the beginning and at the end of the dedicated CARRY network.
Reply by rickman December 21, 2015
On 12/21/2015 11:26 AM, Gabor Szakacs wrote:
> On 12/20/2015 11:23 PM, rickman wrote:
>> On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
>>> On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
>>>> I thank you all for this very interesting and useful discussion. There
>>>> is a lot to think about.
>>>> But I should apologize: later on it turns out that the definition of
>>>> "modulo 2**32-1" which is used in the algorithm description is other
>>>> than I had thought. It is weird from the mathematical point of view:
>>>> A+B if A+B<2**32
>>>> A+B-2**32+1 if A+B>=2**32
>>>
>>> So in fact this is what I originally suggested, i.e. "end around carry."
>>> This is often used in checksum algorithms. It's the equivalent of
>>> (without the type / size changes):
>>>
>>> sum <= A + B;
>>> out <= sum(31 downto 0) + sum(32);
>>>
>>> Note that if you do the whole thing in one cycle by looping the carry
>>> out of a 32-bit full adder to its carry in, you have not only two passes
>>> through a 32-bit carry chain (worst case) but also the non-dedicated
>>> routing in the loopback connection from carry out to carry in.
>>
>> Another problem with this method is that it creates a combinatorial loop
>> which prevents proper static timing analysis. It is very efficient with
>> the logic however.
>>
>> I'm just very curious about why this is the desired algorithm while it
>> is being called "mod 2^32-1", but obviously we will never know the
>> origin of this.
>
> I have always heard this method referred to as "end around carry,"
> however if you are using this as an accumulator, i.e. A is the next
> data input and B is the result of the previous sum, then it is in
> effect the same as taking the sum of inputs modulo 2**32 - 1, with
> the only difference being that the final result can be equal to
> 2**32 - 1 where it would normally be zero. Intermediate results
> equal to 2**32-1 do not change the final outcome vs. doing a true
> modulo operator.
I guess I see what you are saying, even if this only applies to an accumulator rather than taking a sum of two arbitrary numbers. When a sum is equal to 2**N-1 the carry is essentially delayed until the next cycle. But... if the next value to be added is 2**N-1 (don't know if this is possible) a carry will happen, but there should be a second carry which will be missed. So as long as the input domain is restricted to numbers from 0 to 2**N-2, this will work if the final result is adjusted to zero when equal to 2**N-1. -- Rick
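The accumulator claim is easy to test numerically. This editorial sketch uses a small width (N = 8 standing in for 32) and restricts inputs to 0..2**N-2 as discussed above:

```python
import random

N = 8
M = (1 << N) - 1   # 2**N - 1

def step(acc, data):
    # one accumulator cycle with end-around carry
    s = acc + data
    return (s & M) + (s >> N)

random.seed(1)
for _ in range(1000):
    xs = [random.randrange(M) for _ in range(50)]   # inputs in 0..2**N-2
    acc = 0
    for d in xs:
        acc = step(acc, d)
    expect = sum(xs) % M
    # the running result matches a true modulo, except that zero may
    # come out as its all-ones alias 2**N-1
    assert acc == expect or (acc == M and expect == 0)
```

Each fold subtracts exactly 2**N - 1 when a carry occurs, so the accumulator stays congruent to the true sum mod 2**N - 1 throughout; only the zero/all-ones aliasing at the end needs the final adjustment rickman describes.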