I am looking for an algorithm to calculate a floating point divide. There a= re a number of options, but the one that is most clear to me and easiest to= implement on the hardware I am designing seems to be the Newton=E2=80=93Ra= phson iterative method. I'm trying to understand how many iterations it wil= l take. The formula for that in the Wikipedia article assumes an initial es= timate for the reciprocal of 48/17 - D * 32/17. This would require 2 iterat= ions to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating p= oint numbers. My needs are for something between that range, but would like= to minimize the number of iterations... ideally one. I believe most implementations use a table lookup for the seed value of the= reciprocal of the divisor. In the FPGA I am using, a convenient size would= be 2k entries (11 bit address) of 18 bits. I'm wondering if this would be = sufficiently more accurate than the above estimate so that one iteration wo= uld be sufficient. I'm not clear on how to calculate this. I know in genera= l the formula converges rapidly with the error being squared, so I'm thinki= ng an initial value that is good to some 11 bits would produce more than 20= bits for one turn of the Newton-Raphson crank (20 bits being a level of ac= curacy called "good enough" for this application). But I'd like to have som= ething that verifies this rather than using a rule of thumb. At some point = I will probably have to justify the correctness of the calculations. Because of the hardware facilities available the calculations have intermed= iate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack ot= her than ToS which is the register in the ALU (>36 bits). I don't think 11 = bits is quite as good as required, but I'm thinking one crank of the NR mac= hine and I'm "good". However, I need to be able to show how good. Anyone k= now of good documentation on this matter? Oh, I guess I should mention later on I might be repeating this process for= a square root calculation. A measurement using a sensor that is at least t= emporarily deprecated requires a square root. I managed to show the sensor = was not capable of providing an accuracy at the low end that would produce = meaningful results. But if they find a better sensor they may want to add t= hat sensor as an option and the square root will be back. No! The HORROR! Whatever... It's just an algorithm... --=20 Rick C. - Get 1,000 miles of free Supercharging - Tesla referral code - https://ts.la/richard11209
Division Algorithms
Started by ●January 16, 2021
Reply by ●January 18, 20212021-01-18
On Saturday, January 16, 2021 at 2:31:36 PM UTC-7, gnuarm.del...@gmail.com = wrote:> I am looking for an algorithm to calculate a floating point divide. There=are a number of options, but the one that is most clear to me and easiest = to implement on the hardware I am designing seems to be the Newton=E2=80=93= Raphson iterative method. I'm trying to understand how many iterations it w= ill take. The formula for that in the Wikipedia article assumes an initial = estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 iter= ations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating= point numbers. My needs are for something between that range, but would li= ke to minimize the number of iterations... ideally one.=20>=20 > I believe most implementations use a table lookup for the seed value of t=he reciprocal of the divisor. In the FPGA I am using, a convenient size wou= ld be 2k entries (11 bit address) of 18 bits. I'm wondering if this would b= e sufficiently more accurate than the above estimate so that one iteration = would be sufficient. I'm not clear on how to calculate this. I know in gene= ral the formula converges rapidly with the error being squared, so I'm thin= king an initial value that is good to some 11 bits would produce more than = 20 bits for one turn of the Newton-Raphson crank (20 bits being a level of = accuracy called "good enough" for this application). But I'd like to have s= omething that verifies this rather than using a rule of thumb. At some poin= t I will probably have to justify the correctness of the calculations.=20>=20 > Because of the hardware facilities available the calculations have interm=ediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack = other than ToS which is the register in the ALU (>36 bits). I don't think 1= 1 bits is quite as good as required, but I'm thinking one crank of the NR m= achine and I'm "good". However, I need to be able to show how good. Anyone = know of good documentation on this matter?=20>=20 > Oh, I guess I should mention later on I might be repeating this process f=or a square root calculation. A measurement using a sensor that is at least= temporarily deprecated requires a square root. I managed to show the senso= r was not capable of providing an accuracy at the low end that would produc= e meaningful results. But if they find a better sensor they may want to add= that sensor as an option and the square root will be back. No! The HORROR!= =20>=20 > Whatever... It's just an algorithm...=20 >=20 > --=20 >=20 > Rick C.=20 >=20 > - Get 1,000 miles of free Supercharging=20 > - Tesla referral code - https://ts.la/richard11209You don't mention throughput or latency requirements, just that you want so= mething non-iterative. A resource I've used before is the Altera Advanced = Synthesis Cookbook. You can find it online and it has algorithms/source fo= r both "approximate" floating-point divider and root modules. These both s= acrifice a bit of precision for low gate count. The divider is pipelined a= nd has a latency of about 5 cycles. Maybe it could be helpful.
Reply by ●January 19, 20212021-01-19
On Monday, January 18, 2021 at 8:13:02 PM UTC-5, Kevin Neilson wrote:> On Saturday, January 16, 2021 at 2:31:36 PM UTC-7, gnuarm.del...@gmail.co=m wrote:=20> > I am looking for an algorithm to calculate a floating point divide. The=re are a number of options, but the one that is most clear to me and easies= t to implement on the hardware I am designing seems to be the Newton=E2=80= =93Raphson iterative method. I'm trying to understand how many iterations i= t will take. The formula for that in the Wikipedia article assumes an initi= al estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 i= terations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit float= ing point numbers. My needs are for something between that range, but would= like to minimize the number of iterations... ideally one.=20> >=20 > > I believe most implementations use a table lookup for the seed value of=the reciprocal of the divisor. In the FPGA I am using, a convenient size w= ould be 2k entries (11 bit address) of 18 bits. I'm wondering if this would= be sufficiently more accurate than the above estimate so that one iteratio= n would be sufficient. I'm not clear on how to calculate this. I know in ge= neral the formula converges rapidly with the error being squared, so I'm th= inking an initial value that is good to some 11 bits would produce more tha= n 20 bits for one turn of the Newton-Raphson crank (20 bits being a level o= f accuracy called "good enough" for this application). But I'd like to have= something that verifies this rather than using a rule of thumb. At some po= int I will probably have to justify the correctness of the calculations.=20> >=20 > > Because of the hardware facilities available the calculations have inte=rmediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stac= k other than ToS which is the register in the ALU (>36 bits). I don't think= 11 bits is quite as good as required, but I'm thinking one crank of the NR= machine and I'm "good". However, I need to be able to show how good. Anyon= e know of good documentation on this matter?=20> >=20 > > Oh, I guess I should mention later on I might be repeating this process=for a square root calculation. A measurement using a sensor that is at lea= st temporarily deprecated requires a square root. I managed to show the sen= sor was not capable of providing an accuracy at the low end that would prod= uce meaningful results. But if they find a better sensor they may want to a= dd that sensor as an option and the square root will be back. No! The HORRO= R!=20> >=20 > > Whatever... It's just an algorithm...=20 > >=20 > > --=20 > >=20 > > Rick C.=20 > >=20 > > - Get 1,000 miles of free Supercharging=20 > > - Tesla referral code - https://ts.la/richard11209 > You don't mention throughput or latency requirements, just that you want =something non-iterative. A resource I've used before is the Altera Advanced= Synthesis Cookbook. You can find it online and it has algorithms/source fo= r both "approximate" floating-point divider and root modules. These both sa= crifice a bit of precision for low gate count. The divider is pipelined and= has a latency of about 5 cycles. Maybe it could be helpful. I'm not looking for pipelined solutions because speed is not an issue. I d= on't want to bother with shift and subtract not because it is slow, but bec= ause each step in this design would be controlled by a ROM and while I don'= t expect to worry about the ROM size, if it uses 18 locations for each divi= de in the process I might need to focus on that. =20 The floating point NR approach seems to work ok. After considering the loo= k up table in detail it seems like it will work fine with one pass of the N= R algorithm, about 5 steps in all. Not bad.=20 --=20 Rick C. + Get 1,000 miles of free Supercharging + Tesla referral code - https://ts.la/richard11209