On Apr 9, 16:19, Pete <petersen.c...@gmail.com> wrote:
> I've noticed that the Xilinx FFT bit-accurate C simulation calls are
> very, very slow. Anyone else notice this?
I'm using the Matlab MEX-file function and noticed the slowness as
well. On top of that, my input data is real-only, and in that case the
function issues a warning in the Matlab console. It's really annoying
because it can't be turned off.
As for the speed issue, maybe you could consider building a
non-bit-accurate model yourself and using the bit-accurate model only
when really needed. That's what I did.
Patrick
Reply by Marty Ryba●April 10, 2008
"Pete" <petersen.curt@gmail.com> wrote in message
news:c85cbe7c-8693-498b-96bb-093b8c244ae3@8g2000hsu.googlegroups.com...
> I've noticed that the Xilinx FFT bit-accurate C simulation calls are
> very, very slow. Anyone else notice this?
<snip>
> The program is single-threaded right now, and I intend to write a
> multi-threaded version soon. But first, I decided to use a profiler
> on my single-threaded code to see what kind of speedup I should
> expect. I wasn't very surprised to find that 99% of my execution time
> was taking place inside the xilinx_ip_xfft_v5_0_bitacc_simulate()
> function. I WAS surprised to find that 55.6% of that function's
> execution time is being spent in malloc() and free() calls. I have
> some nice call graphs and an excel spreadsheet I'd be willing to share
> if someone from Xilinx would like to take a look.
> Petersen Curt
I don't have direct experience with the Xilinx simulation, but a couple
observations:
-- malloc() and free() don't belong in an FFT core; most optimized DSP
library calls make the user pre-build the "twiddle factors" table (via an
API call) and then the FFT routine can reuse that without needing to
recreate it every time. Saves a lot of CPU-expensive sine() and cosine()
calls as well. Xilinx could easily alter their API to provide that
functionality.
-- Bit-true emulation is expensive since you can no longer use the
IEEE-754 floating-point hardware built into the processor's FPU, but
have to do everything the "hard way".
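To make the "hard way" concrete: every arithmetic operation becomes integer work plus explicit rounding and saturation instead of a single FPU instruction. Here is a minimal Q1.15 multiply with round-half-up, one plausible illustration only; the real core's rounding and scaling rules are more involved.

```c
#include <stdint.h>

/* Multiply two Q1.15 fixed-point values, rounding half up and
   saturating the result -- several instructions where floating
   point would need one. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q2.30 product       */
    p += 1 << 14;                         /* round half up        */
    p >>= 15;                             /* rescale to Q1.15     */
    if (p > INT16_MAX) p = INT16_MAX;     /* saturate on overflow */
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}
```

Multiply that overhead by every butterfly in every transform and the order-of-magnitude slowdown versus floating point is unsurprising.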
I've done one simulation using the Matlab Fixed Point Toolbox. It
worked, and was pretty close to the Blackfin DSP I was emulating, but it ran
10 times slower than the floating point implementation. One problem with the
Fixed Point Toolbox is that they only provide "example" code that
implements a radix-2 transform. If you don't do it exactly that way,
you'll have to code the FFT routine yourself. I was using a radix-4
algorithm on the Blackfin, and some of the source code was in assembler. I
truncated and rounded my results after the FFT and did the other critical
calculations in the Toolbox. Maybe Xilinx could generate a reference FFT
function for the Matlab toolbox, but that might give away some IP on their
implementation.
Reply by Pete●April 9, 2008
I've noticed that the Xilinx FFT bit-accurate C simulation calls are
very, very slow. Anyone else notice this?
I am working on a hybrid fixed/floating-point digital signal
processing application, and I frequently call the bit-accurate
simulation function (anywhere from 1,000 to 1,000,000 times per
run). As I attempt to run longer simulations, that runtime is
becoming my critical path.
The program is single-threaded right now, and I intend to write a
multi-threaded version soon. But first, I decided to use a profiler
on my single-threaded code to see what kind of speedup I should
expect. I wasn't very surprised to find that 99% of my execution time
was taking place inside the xilinx_ip_xfft_v5_0_bitacc_simulate()
function. I WAS surprised to find that 55.6% of that function's
execution time is being spent in malloc() and free() calls. I have
some nice call graphs and an excel spreadsheet I'd be willing to share
if someone from Xilinx would like to take a look.
Why not move all of the memory allocations into the "create" and
"destroy" state calls of the program? If this were done, all memory
allocations could be performed once, when the state is created. A
speedup of over 2x could be realized with that change to the
function. On top of that, I fear that multi-threading my code won't
lead to much speedup, due to the spin-locks in the heap allocation
kernel calls.
Now before anyone replies to this thread and says, "but that's a lot
of memory allocations hanging around!"
My reply: If a user is worried about running out of memory, they can
create and destroy the state as needed. The memory is going to get
allocated anyway. The only difference will be better control of how
often the costly heap access kernel functions are called. I could see
the memory possibly being wasted when simulating a core with a
reconfigurable transform length. A pipelined 64K-point FFT would need
roughly 1GB. At that point, however, the math complexity will
probably start to eclipse the memory allocation penalties.
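The allocation-hoisting idea above can be sketched as follows. The struct and function names are hypothetical, not the actual Xilinx bit-accurate API, and the run function is a placeholder standing in for the real butterflies; the point is that all malloc/free traffic lives in create/destroy, so repeated simulate calls never touch the heap.

```c
#include <stdlib.h>
#include <string.h>

/* State object holding every scratch buffer the simulation needs. */
typedef struct {
    int n;             /* transform length (complex samples) */
    double *scratch;   /* interleaved re/im work buffer, reused */
} fft_sim_state;

static fft_sim_state *fft_sim_create(int n)
{
    fft_sim_state *s = malloc(sizeof *s);
    s->n = n;
    /* All allocation happens once, here. */
    s->scratch = malloc((size_t)n * 2 * sizeof *s->scratch);
    return s;
}

/* Placeholder for the bit-accurate transform: note there is no
   malloc/free on this hot path, only reuse of s->scratch. */
static void fft_sim_run(fft_sim_state *s, const double *in, double *out)
{
    memcpy(s->scratch, in, (size_t)s->n * 2 * sizeof *s->scratch);
    /* ... bit-accurate butterflies would operate on s->scratch ... */
    memcpy(out, s->scratch, (size_t)s->n * 2 * sizeof *out);
}

static void fft_sim_destroy(fft_sim_state *s)
{
    free(s->scratch);
    free(s);
}
```

A caller worried about memory can still bound the footprint by destroying and recreating the state between phases, exactly as argued above.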
I'd be more than happy to make the changes to the source myself, if
someone from Xilinx were kind enough to share it.
Cheers.
Petersen Curt
EM Photonics, Inc.
Newark, DE