Forums

Division Algorithms

Started by gnua...@gmail.com January 16, 2021
I am looking for an algorithm to calculate a floating point divide. There a=
re a number of options, but the one that is most clear to me and easiest to=
 implement on the hardware I am designing seems to be the Newton=E2=80=93Ra=
phson iterative method. I'm trying to understand how many iterations it wil=
l take. The formula for that in the Wikipedia article assumes an initial es=
timate for the reciprocal of 48/17 - D * 32/17. This would require 2 iterat=
ions to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating p=
oint numbers. My needs are for something between that range, but would like=
 to minimize the number of iterations... ideally one.

I believe most implementations use a table lookup for the seed value of the=
 reciprocal of the divisor. In the FPGA I am using, a convenient size would=
 be 2k entries (11 bit address) of 18 bits. I'm wondering if this would be =
sufficiently more accurate than the above estimate so that one iteration wo=
uld be sufficient. I'm not clear on how to calculate this. I know in genera=
l the formula converges rapidly with the error being squared, so I'm thinki=
ng an initial value that is good to some 11 bits would produce more than 20=
 bits for one turn of the Newton-Raphson crank (20 bits being a level of ac=
curacy called "good enough" for this application). But I'd like to have som=
ething that verifies this rather than using a rule of thumb. At some point =
I will probably have to justify the correctness of the calculations.

Because of the hardware facilities available the calculations have intermed=
iate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack ot=
her than ToS which is the register in the ALU (>36 bits). I don't think 11 =
bits is quite as good as required, but I'm thinking one crank of the NR mac=
hine and I'm "good".  However, I need to be able to show how good. Anyone k=
now of good documentation on this matter?

Oh, I guess I should mention later on I might be repeating this process for=
 a square root calculation. A measurement using a sensor that is at least t=
emporarily deprecated requires a square root. I managed to show the sensor =
was not capable of providing an accuracy at the low end that would produce =
meaningful results. But if they find a better sensor they may want to add t=
hat sensor as an option and the square root will be back. No! The HORROR!

Whatever... It's just an algorithm...

--=20

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
On Saturday, January 16, 2021 at 2:31:36 PM UTC-7, gnuarm.del...@gmail.com =
wrote:
> I am looking for an algorithm to calculate a floating point divide. There=
are a number of options, but the one that is most clear to me and easiest = to implement on the hardware I am designing seems to be the Newton=E2=80=93= Raphson iterative method. I'm trying to understand how many iterations it w= ill take. The formula for that in the Wikipedia article assumes an initial = estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 iter= ations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating= point numbers. My needs are for something between that range, but would li= ke to minimize the number of iterations... ideally one.=20
>=20 > I believe most implementations use a table lookup for the seed value of t=
he reciprocal of the divisor. In the FPGA I am using, a convenient size wou= ld be 2k entries (11 bit address) of 18 bits. I'm wondering if this would b= e sufficiently more accurate than the above estimate so that one iteration = would be sufficient. I'm not clear on how to calculate this. I know in gene= ral the formula converges rapidly with the error being squared, so I'm thin= king an initial value that is good to some 11 bits would produce more than = 20 bits for one turn of the Newton-Raphson crank (20 bits being a level of = accuracy called "good enough" for this application). But I'd like to have s= omething that verifies this rather than using a rule of thumb. At some poin= t I will probably have to justify the correctness of the calculations.=20
>=20 > Because of the hardware facilities available the calculations have interm=
ediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack = other than ToS which is the register in the ALU (>36 bits). I don't think 1= 1 bits is quite as good as required, but I'm thinking one crank of the NR m= achine and I'm "good". However, I need to be able to show how good. Anyone = know of good documentation on this matter?=20
>=20 > Oh, I guess I should mention later on I might be repeating this process f=
or a square root calculation. A measurement using a sensor that is at least= temporarily deprecated requires a square root. I managed to show the senso= r was not capable of providing an accuracy at the low end that would produc= e meaningful results. But if they find a better sensor they may want to add= that sensor as an option and the square root will be back. No! The HORROR!= =20
>=20 > Whatever... It's just an algorithm...=20 >=20 > --=20 >=20 > Rick C.=20 >=20 > - Get 1,000 miles of free Supercharging=20 > - Tesla referral code - https://ts.la/richard11209
You don't mention throughput or latency requirements, just that you want so= mething non-iterative. A resource I've used before is the Altera Advanced = Synthesis Cookbook. You can find it online and it has algorithms/source fo= r both "approximate" floating-point divider and root modules. These both s= acrifice a bit of precision for low gate count. The divider is pipelined a= nd has a latency of about 5 cycles. Maybe it could be helpful.
On Monday, January 18, 2021 at 8:13:02 PM UTC-5, Kevin Neilson wrote:
> On Saturday, January 16, 2021 at 2:31:36 PM UTC-7, gnuarm.del...@gmail.co=
m wrote:=20
> > I am looking for an algorithm to calculate a floating point divide. The=
re are a number of options, but the one that is most clear to me and easies= t to implement on the hardware I am designing seems to be the Newton=E2=80= =93Raphson iterative method. I'm trying to understand how many iterations i= t will take. The formula for that in the Wikipedia article assumes an initi= al estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 i= terations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit float= ing point numbers. My needs are for something between that range, but would= like to minimize the number of iterations... ideally one.=20
> >=20 > > I believe most implementations use a table lookup for the seed value of=
the reciprocal of the divisor. In the FPGA I am using, a convenient size w= ould be 2k entries (11 bit address) of 18 bits. I'm wondering if this would= be sufficiently more accurate than the above estimate so that one iteratio= n would be sufficient. I'm not clear on how to calculate this. I know in ge= neral the formula converges rapidly with the error being squared, so I'm th= inking an initial value that is good to some 11 bits would produce more tha= n 20 bits for one turn of the Newton-Raphson crank (20 bits being a level o= f accuracy called "good enough" for this application). But I'd like to have= something that verifies this rather than using a rule of thumb. At some po= int I will probably have to justify the correctness of the calculations.=20
> >=20 > > Because of the hardware facilities available the calculations have inte=
rmediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stac= k other than ToS which is the register in the ALU (>36 bits). I don't think= 11 bits is quite as good as required, but I'm thinking one crank of the NR= machine and I'm "good". However, I need to be able to show how good. Anyon= e know of good documentation on this matter?=20
> >=20 > > Oh, I guess I should mention later on I might be repeating this process=
for a square root calculation. A measurement using a sensor that is at lea= st temporarily deprecated requires a square root. I managed to show the sen= sor was not capable of providing an accuracy at the low end that would prod= uce meaningful results. But if they find a better sensor they may want to a= dd that sensor as an option and the square root will be back. No! The HORRO= R!=20
> >=20 > > Whatever... It's just an algorithm...=20 > >=20 > > --=20 > >=20 > > Rick C.=20 > >=20 > > - Get 1,000 miles of free Supercharging=20 > > - Tesla referral code - https://ts.la/richard11209 > You don't mention throughput or latency requirements, just that you want =
something non-iterative. A resource I've used before is the Altera Advanced= Synthesis Cookbook. You can find it online and it has algorithms/source fo= r both "approximate" floating-point divider and root modules. These both sa= crifice a bit of precision for low gate count. The divider is pipelined and= has a latency of about 5 cycles. Maybe it could be helpful. I'm not looking for pipelined solutions because speed is not an issue. I d= on't want to bother with shift and subtract not because it is slow, but bec= ause each step in this design would be controlled by a ROM and while I don'= t expect to worry about the ROM size, if it uses 18 locations for each divi= de in the process I might need to focus on that. =20 The floating point NR approach seems to work ok. After considering the loo= k up table in detail it seems like it will work fine with one pass of the N= R algorithm, about 5 steps in all. Not bad.=20 --=20 Rick C. + Get 1,000 miles of free Supercharging + Tesla referral code - https://ts.la/richard11209