FPGARelated.com

Multiply Accumulate FPGA/DSP

Started by bart May 3, 2005
I have been tasked with trying to implement an FFT algorithm in an
FPGA/DSP architecture.  The algorithm would be an N-point FFT with 1000
frequency bins.  Each frequency bin would require a multiply, by the
constant e^jx, and then accumulate every 1 microsecond.  This turns out
to be 1000 multiply accumulates happening in parallel every 1
microsecond.  Does anyone have experience doing something similar in an
FPGA/DSP and can they point me in the right direction as far as
choosing an FPGA/DSP development board?  Any help would be appreciated.

bart wrote:
> I have been tasked with trying to implement an FFT algorithm in an
> FPGA/DSP architecture. The algorithm would be an N-point FFT with 1000
> frequency bins. Each frequency bin would require a multiply, by the
> constant e^jx, and then accumulate every 1 microsecond. This turns out
> to be 1000 multiply accumulates happening in parallel every 1
> microsecond. Does anyone have experience doing something similar in an
> FPGA/DSP and can they point me in the right direction as far as
> choosing an FPGA/DSP development board? Any help would be appreciated.

1000 MACs in parallel ... that's a lot! Come on, just to store all the
accumulators in parallel, with something like 48-bit accumulators, that
would be 48000 registers ...

1 microsecond is 1000 ns, so the way to go is to have something like 20
units in parallel, each doing one MAC every 20 ns, which sounds a lot
better. Then use a block RAM per unit. Each block RAM would have to
"remember" 50 accumulators, which is not a problem.

Sylvain

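The trade-off described above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the bin count, sample period, and 48-bit accumulator width come from the thread, while the function name `plan` and the assumption that bins divide evenly among units are illustrative.

```python
# Back-of-envelope check of the time-multiplexing trade-off.
# Figures from the thread: 1000 bins, one new sample every 1000 ns,
# 48-bit accumulators. Names and layout here are illustrative only.

BINS = 1000
SAMPLE_PERIOD_NS = 1000
ACC_BITS = 48


def plan(units):
    """Return (accumulators per unit, ns available per MAC) for `units` MAC units."""
    if BINS % units:
        raise ValueError("this sketch assumes the bins divide evenly among units")
    bins_per_unit = BINS // units
    ns_per_mac = SAMPLE_PERIOD_NS / bins_per_unit
    return bins_per_unit, ns_per_mac


# Fully parallel: every accumulator bit is a register.
print("register bits if fully parallel:", BINS * ACC_BITS)

for units in (1000, 20):
    per_unit, ns = plan(units)
    print(f"{units:4d} units: {per_unit:3d} accumulators each, one MAC every {ns:g} ns")
```

Calling `plan` with other unit counts shows how quickly the required cycle time shrinks as hardware is removed.
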
In article <1115143492.028339.55630@f14g2000cwb.googlegroups.com>,
bart <larsonbr@gmail.com> wrote:
>I have been tasked with trying to implement an FFT algorithm in an
>FPGA/DSP architecture. The algorithm would be an N-point FFT with 1000
>frequency bins. Each frequency bin would require a multiply, by the
>constant e^jx, and then accumulate every 1 microsecond. This turns out
>to be 1000 multiply accumulates happening in parallel every 1
>microsecond.

Or ten MACs in parallel every ten nanoseconds; I'm imagining a little
circuit (two BRAMs, one multiplier) which reads the input, multiplies it
by a constant read from one block RAM, and adds it to an accumulator in
another, plus a sequencer over the block RAM locations, the whole thing
replicated ten times in an XC3S1000 (dev. boards are $200 or so from
www.xess.com).

Though you're using a complex multiplier, which is roughly four integer
multipliers, so you might have difficulty with ten-fold replication in
the 3S1000; and from what I've read here, running with only five-fold
replication, so a cycle time of 5ns, might require quite elaborate
design to get the speed sufficient; it might even be too fast for the
multipliers.

Have another circuit the other side which uses the other port on the
accumulator BRAMs to read out the accumulated data when the time comes.

This is a back-of-an-envelope design, I'd be really happy if someone
with actual FPGA experience could point out what's wrong with it.

Tom

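A behavioural sketch of one such unit may make the data flow easier to follow. It is not a hardware description and not the poster's design: the names `MacUnit`, `coeff_ram`, `acc_ram`, and `complex_mul` are hypothetical, plain Python integers stand in for fixed-point words, and each address holds a single fixed constant as the post describes (for a real DFT the constant would also depend on the sample index, which this sketch does not model).

```python
# Behavioural sketch of one time-multiplexed complex MAC unit.
# All names are hypothetical; integers stand in for fixed-point words.


def complex_mul(ar, ai, br, bi):
    """Complex multiply built from four real multiplies and two adds."""
    return ar * br - ai * bi, ar * bi + ai * br


class MacUnit:
    def __init__(self, coeffs):
        # coeff_ram: one (re, im) constant per address (first block RAM).
        self.coeff_ram = list(coeffs)
        # acc_ram: one (re, im) accumulator per address (second block RAM).
        self.acc_ram = [(0, 0)] * len(self.coeff_ram)

    def process_sample(self, xr, xi=0):
        """Sequencer: visit every address once for the current input sample."""
        for addr, (cr, ci) in enumerate(self.coeff_ram):
            pr, pi = complex_mul(xr, xi, cr, ci)
            sr, si = self.acc_ram[addr]
            self.acc_ram[addr] = (sr + pr, si + pi)

    def read_out(self):
        """Stand-in for the second RAM port: return the sums and clear them."""
        result = self.acc_ram
        self.acc_ram = [(0, 0)] * len(self.coeff_ram)
        return result


# Three addresses holding the constants 1, j and 2-3j; two input samples.
unit = MacUnit([(1, 0), (0, 1), (2, -3)])
unit.process_sample(5)
unit.process_sample(2, 1)
print(unit.read_out())
```

The inner loop is the part that has to finish within one sample period, which is where the cycle-time concern in the post comes from.
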
In article <1115143492.028339.55630@f14g2000cwb.googlegroups.com>,
bart <larsonbr@gmail.com> wrote:
>I have been tasked with trying to implement an FFT algorithm in an
>FPGA/DSP architecture. The algorithm would be an N-point FFT with 1000
>frequency bins. Each frequency bin would require a multiply, by the
>constant e^jx, and then accumulate every 1 microsecond.

I'd not call that an FFT; I'd call it a calculation of a thousand
points of a DFT. It may well be possible to do it with less than one
complex gigaMAC per second, by using an FFT, but regrettably I'm not
awake enough to remember how to do that filter transformation.

Tom

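To put rough numbers on that comparison, the sketch below counts complex multiplies per second for the direct approach and for a textbook radix-2 FFT. It assumes a 1024-point transform (the size the original poster gives later in the thread) computed on non-overlapping blocks; the thread does not say whether blocks overlap, so treat the FFT figure as indicative only. The variable names are illustrative.

```python
# Rough operation-rate comparison: direct per-bin MACs vs. a radix-2 FFT.
# N = 1024 and non-overlapping blocks are assumptions, not thread facts
# established at this point in the discussion.

BINS = 1000                      # bins computed directly
SAMPLES_PER_SECOND = 1_000_000   # one sample per microsecond
N = 1024                         # assumed transform length (power of two)

# Direct: every bin does one complex MAC per sample.
direct_rate = BINS * SAMPLES_PER_SECOND

# Textbook radix-2 FFT: (N/2) * log2(N) complex multiplies per block.
log2_n = N.bit_length() - 1
fft_mults_per_block = (N // 2) * log2_n
fft_rate = fft_mults_per_block * SAMPLES_PER_SECOND / N

print(f"direct DFT : {direct_rate:.3g} complex MACs per second")
print(f"radix-2 FFT: {fft_rate:.3g} complex multiplies per second")
print(f"ratio      : {direct_rate / fft_rate:.0f}x")
```

The FFT figure covers all N bins at once, so computing only 1000 of them directly gives up most of the saving.
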
This bloke seems to know what he's doing.
http://www.andraka.com/cores.htm
Some stuff here too
http://www.opencores.org/browse.cgi/by_category
Also, Google gave me over 44k hits for "FPGA FFT".

"bart" <larsonbr@gmail.com> wrote in message
news:1115143492.028339.55630@f14g2000cwb.googlegroups.com...
> I have been tasked with trying to implement an FFT algorithm in an
> FPGA/DSP architecture. The algorithm would be an N-point FFT with 1000
> frequency bins. Each frequency bin would require a multiply, by the
> constant e^jx, and then accumulate every 1 microsecond. This turns out
> to be 1000 multiply accumulates happening in parallel every 1
> microsecond. Does anyone have experience doing something similar in an
> FPGA/DSP and can they point me in the right direction as far as
> choosing an FPGA/DSP development board? Any help would be appreciated.

I just want to ask how you will enter your 1000 frequency pins, and how
many bits you are using to represent your frequency points. I mean, if
you have 8 bits per point then you need 8000 pins, which I think is too
many for any FPGA available.

I think you mean 1000 analog inputs, which also requires some form of
ADC, which also leads to the same problem.

Maybe you will enter them sequentially, which will take time to get
them into the FPGA.

Best regards

ahosyney wrote:
> I just want to ask how you will enter your 1000 frequency pins, and how

bin, not pin.

--
[100~Plax]sb16i0A2172656B63616820636420726568746F6E61207473754A[dZ1!=b]salax

I need 1000 frequency "bins", where each bin is a discrete frequency.
As Thomas Womack pointed out above, it is better defined as an N-point
DFT with 1000 frequency bins, where N = 1024.  For each sample, every
microsecond, there are 24 bits of data; let's call that x(n).  During
that microsecond there must be 1000 MACs in parallel to calculate the
N = 1024 DFT.  This would happen for 1024 samples to calculate the
N-point DFT.  I hope that is a better description.  Thanks for the input.
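
Written out as a worked equation, the per-sample multiply-accumulate is a running sum per bin. This assumes the usual DFT sign and phase convention; the thread only says "e^jx", so the exact form of x is an assumption. K (the set of 1000 bin indices, not specified in the thread) and A_k (the accumulator for bin k) are names introduced here for illustration.

```latex
\begin{aligned}
X[k]   &= \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N},
          \qquad N = 1024,\quad k \in K,\ |K| = 1000 \\
A_k(n) &= A_k(n-1) + x(n)\, e^{-j 2\pi k n / N},
          \qquad A_k(-1) = 0 \\
X[k]   &= A_k(N-1)
\end{aligned}
```

Under this convention the multiplier for bin k changes with n, but it depends only on kn mod N, so a single table of N values would cover every bin.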

Bart,
Consider time / frequency as a third dimension. You have a certain job
to do in a given time. Then look at the performance of your multipliers,
registers, etc., and you find that they will work at several hundred MHz.
Then get creative and do certain things sequentially, and other things
in parallel. You have an enormous amount of creative freedom, and
pipelining is essentially free in an FPGA.
Remember, any circuit that does not work close to its speed limit
represents waste.
Peter Alfke

Hi Peter,

> Remember, any circuit that does not work close to its speed limit
> represents waste.

Well, I've seen a fair share of 15-25ns CPLD designs, filled 60% and
running at 4 or 8MHz. Sometimes applications can simply be slow. And
developed, debugged and programmed in under an hour and a half. And,
especially nowadays, without a smaller or slower part that is any
cheaper.

But, that's good, isn't it? It would be horrible if the lower end of
the market couldn't take advantage of modern technology.

Best regards,

Ben