Reply by Ray Andraka September 19, 20032003-09-19
Several of the textbooks on FFTs such as Smith & Smith "Handbook of realtime FFTs"
describe higher radix FFTs.  I believe the Xilinx FFT core data sheet goes into a fair
amount of detail on a radix 4 kernel.  The easiest way to do hardware sharing is to use
one kernel and run the data through it in multiple passes.  You'll need to change the
twiddle factors (which are really a phase rotation) and the data ordering.  The twiddles
can be handled by changing the input to a multiplier using a table.  The data reordering
is doable by permuting the bits of your address counter.

DA is really a technique for doing multiply-accumulate, not just a multiply.   There are
many ways to do the multiplications, for example if speed is not an issue you can use a
scaling accumulator for a fairly compact multiplier.  See the page about multiplication on
my website.  Flip-flop count is not the whole story, you also have to account for the
combinatorial logic between flip-flops.  You could do a totally combinatorial multiplier
with no flip=flops, for example (although it would be quite slow if it is reasonably
sized)

PJ wrote:

> Hi Ray/Andyman, > > Thanks for replying. > > Ray .... > Is there a good explanation on radix-4 ffts ? If I can mod the code > quickly, I will certainly do it. As regarding shared hardware, I > thought about that when implementing my design, but I really couldn't > figure out a way to do it. Are you talking about reducing the number > of stages ? Did you use RAM to provide the hardware sharing ? > > I have let the synthesis tools generate the multiplier. I just used > the '*' function. This is one area I think I could make good space > saving. I have seen your articles about multipliers on your website. > Would a DA type multiplier save me much space ? > > What is the most efficient multiplier in terms of flop count, and is > there VHDL code or verilog code available. I'll gladly write it > myself, if I can get some pointers. I can trade speed for gate count. > My problem now is the lack of time I have to get this project > completed. > > Andyman > > I have implemented RAM in my design, and thankfully it synthesis to > RAM fine. However in saying that...that is for very small designs. > This certainly might be a problem. I am using leonardo spectrum, and > will give synplify a try next. > If you found any solutions...I would be glad to hear. > > Thank you both for your time and your replies. > > Good luck > PJ > > Ray Andraka <ray@andraka.com> wrote in message news:<3F664CBB.86459E77@andraka.com>... > > It very heavily depends on your implementation. The fact that it is > > radix2 and > > not radix 4 or higher tells me already that your implementation is not a > > very > > efficient one. A radix 4 kernel is very little added complexity over a > > radix 2 > > kernel and cuts the processing time and area considerable. Depending on > > your > > speed requirements and your design prowess, you can certainly fit a 128 > > point > > real-only FFT in a much smaller part than a 1M gate part. As your speed > > requirements decrease, you can take advantage of iterative or shared > > hardware > > to reduce the gate count considerably. I did a 4096 point design in a > > Xilinx > > XCV1000 that does the complex transform in 68 us (the floorplan and a > > brief > > description are on the gallery page on my website). That one only > > occupies > > about 40% of the FPGA and includes some floating point stuff as well as > > windowing multipliers and some other goodies. It takes intimate > > knowledge > > of the FPGA structure and of algorithms to achieve that level of density > > (the customer called the design "a work of art"), but even a novice can > > achieve a density/performance level of half that design with some > > carefully > > thought out design. > > > > PJ wrote: > > > > > Hello, > > > > > > I am implementing a 128 point real Radix-2 fft, data and coefficient > > > widths are 16 bit. > > > I am synthezising it for use in an FPGA. However, it is taking a very > > > long time to synthesize. (approx 3 days using Leonardo on a 2 GHz > > > machine with 512 MByte RAM) I am using a 20K1000 Altera FPGA. The ram > > > required by the fft will be internal to the FPGA > > > > > > Will this design take up all the space on the device. From past > > > experience, can someone give me an indication of what area of the > > > device the fft will occupy. > > > > > > Surely if it takes up most of the device, then it will be too big, as > > > I have other features to implement in the FPGA also !! > > > > > > Thank you > > > PJ > > > > -- > > --Ray Andraka, P.E. > > President, the Andraka Consulting Group, Inc. > > 401/884-7930 Fax 401/884-7950 > > email ray@andraka.com > > http://www.andraka.com > > > > "They that give up essential liberty to obtain a little > > temporary safety deserve neither liberty nor safety." > > -Benjamin Franklin, 1759
-- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
Reply by Andyman September 16, 20032003-09-16
> PJ wrote: > > > Hello, > > > > I am implementing a 128 point real Radix-2 fft, data and coefficient > > widths are 16 bit. > > I am synthezising it for use in an FPGA. However, it is taking a very > > long time to synthesize. (approx 3 days using Leonardo on a 2 GHz > > machine with 512 MByte RAM) I am using a 20K1000 Altera FPGA. The ram > > required by the fft will be internal to the FPGA
Hiya, I have noted with some synthesizers that if it cannot infer your RAM "correctly" it will try to build it out of registers (as opposed to the dedicated RAM on the FPGA). This can end up taking a _very_ long time for only reasonably large RAMs. You might consider synthesizing each module in your design separately until you find the part that causes the problem. If it isn't already structured thus it might be wise to do so as this may also help. Later, Andyman.
Reply by Ray Andraka September 15, 20032003-09-15
It very heavily depends on your implementation.  The fact that it is
radix2 and
not radix 4 or higher tells me already that your implementation is not a
very
efficient one.  A radix 4 kernel is very little added complexity over a
radix 2
kernel and cuts the processing time and area considerable.  Depending on
your
speed requirements and your design prowess, you can certainly fit a 128
point
real-only FFT in a much smaller part than a 1M gate part.  As your speed
requirements decrease, you can take advantage of iterative or shared
hardware
to reduce the gate count considerably.  I did a 4096 point design in a
Xilinx
XCV1000 that does the complex transform in 68 us (the floorplan and a
brief
description are on the gallery page on my website).  That one only
occupies
about 40% of the FPGA and includes some floating point stuff as well as
windowing multipliers and some other goodies.  It takes intimate
knowledge
of the FPGA structure and of algorithms to achieve that level of density
(the customer called the design "a work of art"), but even a novice can
achieve a density/performance level of half that design with some
carefully
thought out design.

PJ wrote:

> Hello, > > I am implementing a 128 point real Radix-2 fft, data and coefficient > widths are 16 bit. > I am synthezising it for use in an FPGA. However, it is taking a very > long time to synthesize. (approx 3 days using Leonardo on a 2 GHz > machine with 512 MByte RAM) I am using a 20K1000 Altera FPGA. The ram > required by the fft will be internal to the FPGA > > Will this design take up all the space on the device. From past > experience, can someone give me an indication of what area of the > device the fft will occupy. > > Surely if it takes up most of the device, then it will be too big, as > I have other features to implement in the FPGA also !! > > Thank you > PJ
-- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
Reply by PJ September 14, 20032003-09-14
Hello,

I am implementing a 128 point real Radix-2 fft, data and coefficient
widths are 16 bit.
I am synthezising it for use in an FPGA. However, it is taking a very
long time to synthesize. (approx 3 days using Leonardo on a 2 GHz
machine with 512 MByte RAM) I am using a 20K1000 Altera FPGA. The ram
required by the fft will be internal to the FPGA

Will this design take up all the space on the device. From past
experience, can someone give me an indication of what area of the
device the fft will occupy.

Surely if it takes up most of the device, then it will be too big, as
I have other features to implement in the FPGA also !!

Thank you
PJ