FPGARelated.com
Forums

Efficient division algorithm?

Started by Nial Stewart February 19, 2008
I'm going to talk to a potential new client about using FPGAs to
accelerate part of their system.

As part of what needs done there could be a significant amount of
division(s) done.

Previously I've been able to multiply by a reciprocal then scale to
make division a double clock operation so this can be easily pipelined.
This is only achieveable if the divisor is pre-known and the
reciprocal can be pre-calculated.

With what's coming up I'm not sure that I can do this, I know that

Are there any clever techniques for streamlining divisions that
make them deterministic and don't use a big wodge of logic?


Thanks for any pointers,


Nial


Nial Stewart <nial*REMOVE_THIS*@nialstewartdevelopments.co.uk> wrote:
> I'm going to talk to a potential new client about using FPGAs to > accelerate part of their system.
> As part of what needs done there could be a significant amount of > division(s) done.
> Previously I've been able to multiply by a reciprocal then scale to > make division a double clock operation so this can be easily pipelined. > This is only achieveable if the divisor is pre-known and the > reciprocal can be pre-calculated.
> With what's coming up I'm not sure that I can do this, I know that
> Are there any clever techniques for streamlining divisions that > make them deterministic and don't use a big wodge of logic?
Do you need high throughput and short latency? If latency isn't that big issue, I have successfully used the serial devider from opencores.Otherwise there are more divider projects at opencores. -- Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt --------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
On Feb 19, 5:15=A0am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk> wrote:
> I'm going to talk to a potential new client about using FPGAs to > accelerate part of their system. > > As part of what needs done there could be a significant amount of > division(s) done. > > Previously I've been able to multiply by a reciprocal then scale to > make division a double clock operation so this can be easily pipelined. > This is only achieveable if the divisor is pre-known and the > reciprocal can be pre-calculated. > > With what's coming up I'm not sure that I can do this, I know that > > Are there any clever techniques for streamlining divisions that > make them deterministic and don't use a big wodge of logic? > > Thanks for any pointers, >
Check into the lpm_divide function that is part of the lpm package. It's synchronous and accepts new inputs on every clock. It is a standard, although people seem to rail against lpm as being mainly 'Altera'. Kevin Jennings
Nial Stewart wrote:

> Are there any clever techniques for streamlining divisions that > make them deterministic and don't use a big wodge of logic?
That's a trade-off with speed. A subtract loop is simple and deterministic but slow. A binary search multiply-subtract is possible with dsp blocks. -- Mike Treseler
> Do you need high throughput and short latency?
I don't know yet :-)
> If latency isn't that big > issue, I have successfully used the serial devider from opencores.Otherwise > there are more divider projects at opencores.
I'll have a look at Opencores, I hadn't thought of trying there. Thanks, Nial.
> Check into the lpm_divide function that is part of the lpm package. > It's synchronous and accepts new inputs on every clock. It is a > standard, although people seem to rail against lpm as being mainly > 'Altera'.
I'll have a look at this but my impression with this is that it uses a _lot_ of logic? Nial
Nial Stewart wrote:

> I'll have a look at this but my impression with this is that it > uses a _lot_ of logic?
It makes a_whole_lotta_luts in one lump. Requires a slow system clock or a multi-cycle constraint. Quartus will infer lpm_divide from from int or signed /. -- Mike Treseler
On Feb 19, 8:05=A0am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk> wrote:
> > Check into the lpm_divide function that is part of the lpm package. > > It's synchronous and accepts new inputs on every clock. =A0It is a > > standard, although people seem to rail against lpm as being mainly > > 'Altera'. > > I'll have a look at this but my impression with this is that it > uses a _lot_ of logic? > > Nial
That of course depends on how much you think is *a lot*. It does let you tradeoff latency for performance though and depending on how you set that parameter it uses different amounts of logic...if I recall correctly, 0 latency (i.e. a combinatorial implementation of division) results in the largest logic usage, a latency equal to the width of the numerator (or denominator, maybe? fuzzy brain neuron currently being accessed) resulted in a good tradeoff in my application where I had three dividers but needed somewhat high performance (75 MHz in a Stratix I). If you need to be able to process full divisions on a system clock cycle, it will do the trick. If your system clock can run much faster so you can take several clocks to do the divide a divider that works on single bits at a time might be more appropriate. Kevin Jennings
On Feb 19, 9:43=A0am, Mike Treseler <mike_trese...@comcast.net> wrote:
> Nial Stewart wrote: > > I'll have a look at this but my impression with this is that it > > uses a _lot_ of logic? > > It makes a_whole_lotta_luts in one lump. > Requires a slow system clock or a multi-cycle constraint.
I ran it at 75 MHz on a Stratix I without any multi-cycle constraints. You want to use the pipelining parameter to lpm_divide to tell it how many pipeline stages it can use. That affects how it implements the logic and allows for tradeoffs between the size of the lump of logic, the latency and the clock cycle but does not require specifying of any multi-cycle timing constraints. In any case, it shouldn't require any dramatic system clock sloooowing.
> Quartus will infer lpm_divide from from int or signed /. >
Doing so though will result in lpm_divide being parameterized with 0 latency which will result in the largest logic lump and the slowest clock cycle performance. Depending on the application it might be better to instantiate the lpm_divide and not have it inferred from "/". Kevin Jennings
On Feb 19, 8:02=A0am, Mike Treseler <mike_trese...@comcast.net> wrote:
> Nial Stewart wrote: > A binary search multiply-subtract is possible with dsp blocks. >
That's one that I've always wanted to try just to see how well it does, but haven't had the need to do so of late. One would expect it to get high throughput, low logic usage and probably be the best in today's world of nearly free DSP block multipliers in FPGAs. Any benchmark data? Kevin Jennings