# Division Algorithms

Started by January 16, 2021
```I am looking for an algorithm to calculate a floating point divide. There a=
re a number of options, but the one that is most clear to me and easiest to=
implement on the hardware I am designing seems to be the Newton=E2=80=93Ra=
phson iterative method. I'm trying to understand how many iterations it wil=
l take. The formula for that in the Wikipedia article assumes an initial es=
timate for the reciprocal of 48/17 - D * 32/17. This would require 2 iterat=
ions to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating p=
oint numbers. My needs are for something between that range, but would like=
to minimize the number of iterations... ideally one.

I believe most implementations use a table lookup for the seed value of the=
reciprocal of the divisor. In the FPGA I am using, a convenient size would=
be 2k entries (11 bit address) of 18 bits. I'm wondering if this would be =
sufficiently more accurate than the above estimate so that one iteration wo=
uld be sufficient. I'm not clear on how to calculate this. I know in genera=
l the formula converges rapidly with the error being squared, so I'm thinki=
ng an initial value that is good to some 11 bits would produce more than 20=
bits for one turn of the Newton-Raphson crank (20 bits being a level of ac=
curacy called "good enough" for this application). But I'd like to have som=
ething that verifies this rather than using a rule of thumb. At some point =
I will probably have to justify the correctness of the calculations.

Because of the hardware facilities available the calculations have intermed=
iate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack ot=
her than ToS which is the register in the ALU (>36 bits). I don't think 11 =
bits is quite as good as required, but I'm thinking one crank of the NR mac=
hine and I'm "good".  However, I need to be able to show how good. Anyone k=
now of good documentation on this matter?

Oh, I guess I should mention later on I might be repeating this process for=
a square root calculation. A measurement using a sensor that is at least t=
emporarily deprecated requires a square root. I managed to show the sensor =
was not capable of providing an accuracy at the low end that would produce =
meaningful results. But if they find a better sensor they may want to add t=
hat sensor as an option and the square root will be back. No! The HORROR!

Whatever... It's just an algorithm...

--=20

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
```
```On Saturday, January 16, 2021 at 2:31:36 PM UTC-7, gnuarm.del...@gmail.com =
wrote:
> I am looking for an algorithm to calculate a floating point divide. There=
are a number of options, but the one that is most clear to me and easiest =
to implement on the hardware I am designing seems to be the Newton=E2=80=93=
Raphson iterative method. I'm trying to understand how many iterations it w=
ill take. The formula for that in the Wikipedia article assumes an initial =
estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 iter=
ations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit floating=
point numbers. My needs are for something between that range, but would li=
ke to minimize the number of iterations... ideally one.=20
>=20
> I believe most implementations use a table lookup for the seed value of t=
he reciprocal of the divisor. In the FPGA I am using, a convenient size wou=
ld be 2k entries (11 bit address) of 18 bits. I'm wondering if this would b=
e sufficiently more accurate than the above estimate so that one iteration =
would be sufficient. I'm not clear on how to calculate this. I know in gene=
ral the formula converges rapidly with the error being squared, so I'm thin=
king an initial value that is good to some 11 bits would produce more than =
20 bits for one turn of the Newton-Raphson crank (20 bits being a level of =
accuracy called "good enough" for this application). But I'd like to have s=
omething that verifies this rather than using a rule of thumb. At some poin=
t I will probably have to justify the correctness of the calculations.=20
>=20
> Because of the hardware facilities available the calculations have interm=
ediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stack =
other than ToS which is the register in the ALU (>36 bits). I don't think 1=
1 bits is quite as good as required, but I'm thinking one crank of the NR m=
achine and I'm "good". However, I need to be able to show how good. Anyone =
know of good documentation on this matter?=20
>=20
> Oh, I guess I should mention later on I might be repeating this process f=
or a square root calculation. A measurement using a sensor that is at least=
temporarily deprecated requires a square root. I managed to show the senso=
r was not capable of providing an accuracy at the low end that would produc=
e meaningful results. But if they find a better sensor they may want to add=
that sensor as an option and the square root will be back. No! The HORROR!=
=20
>=20
> Whatever... It's just an algorithm...=20
>=20
> --=20
>=20
> Rick C.=20
>=20
> - Get 1,000 miles of free Supercharging=20
> - Tesla referral code - https://ts.la/richard11209

You don't mention throughput or latency requirements, just that you want so=
mething non-iterative.  A resource I've used before is the Altera Advanced =
Synthesis Cookbook.  You can find it online and it has algorithms/source fo=
r both "approximate" floating-point divider and root modules.  These both s=
acrifice a bit of precision for low gate count.  The divider is pipelined a=
nd has a latency of about 5 cycles.  Maybe it could be helpful.
```
```On Monday, January 18, 2021 at 8:13:02 PM UTC-5, Kevin Neilson wrote:
> On Saturday, January 16, 2021 at 2:31:36 PM UTC-7, gnuarm.del...@gmail.co=
m wrote:=20
> > I am looking for an algorithm to calculate a floating point divide. The=
re are a number of options, but the one that is most clear to me and easies=
t to implement on the hardware I am designing seems to be the Newton=E2=80=
=93Raphson iterative method. I'm trying to understand how many iterations i=
t will take. The formula for that in the Wikipedia article assumes an initi=
al estimate for the reciprocal of 48/17 - D * 32/17. This would require 2 i=
terations to gain 16 bits of accuracy or 3 iterations for IEEE 32 bit float=
ing point numbers. My needs are for something between that range, but would=
like to minimize the number of iterations... ideally one.=20
> >=20
> > I believe most implementations use a table lookup for the seed value of=
the reciprocal of the divisor. In the FPGA I am using, a convenient size w=
ould be 2k entries (11 bit address) of 18 bits. I'm wondering if this would=
be sufficiently more accurate than the above estimate so that one iteratio=
n would be sufficient. I'm not clear on how to calculate this. I know in ge=
neral the formula converges rapidly with the error being squared, so I'm th=
inking an initial value that is good to some 11 bits would produce more tha=
n 20 bits for one turn of the Newton-Raphson crank (20 bits being a level o=
f accuracy called "good enough" for this application). But I'd like to have=
something that verifies this rather than using a rule of thumb. At some po=
int I will probably have to justify the correctness of the calculations.=20
> >=20
> > Because of the hardware facilities available the calculations have inte=
rmediate mantissas of 18 or 36 bits with 28 bits stored when in memory/stac=
k other than ToS which is the register in the ALU (>36 bits). I don't think=
11 bits is quite as good as required, but I'm thinking one crank of the NR=
machine and I'm "good". However, I need to be able to show how good. Anyon=
e know of good documentation on this matter?=20
> >=20
> > Oh, I guess I should mention later on I might be repeating this process=
for a square root calculation. A measurement using a sensor that is at lea=
st temporarily deprecated requires a square root. I managed to show the sen=
sor was not capable of providing an accuracy at the low end that would prod=
uce meaningful results. But if they find a better sensor they may want to a=
dd that sensor as an option and the square root will be back. No! The HORRO=
R!=20
> >=20
> > Whatever... It's just an algorithm...=20
> >=20
> > --=20
> >=20
> > Rick C.=20
> >=20
> > - Get 1,000 miles of free Supercharging=20
> > - Tesla referral code - https://ts.la/richard11209
> You don't mention throughput or latency requirements, just that you want =
something non-iterative. A resource I've used before is the Altera Advanced=
Synthesis Cookbook. You can find it online and it has algorithms/source fo=
r both "approximate" floating-point divider and root modules. These both sa=
crifice a bit of precision for low gate count. The divider is pipelined and=
has a latency of about 5 cycles. Maybe it could be helpful.

I'm not looking for pipelined solutions because speed is not an issue.  I d=
on't want to bother with shift and subtract not because it is slow, but bec=
ause each step in this design would be controlled by a ROM and while I don'=
t expect to worry about the ROM size, if it uses 18 locations for each divi=
de in the process I might need to focus on that. =20

The floating point NR approach seems to work ok.  After considering the loo=
k up table in detail it seems like it will work fine with one pass of the N=