FPGARelated.com
Forums

Interesting problems about high performance computing

Started by hits...@hit.edu.cn June 20, 2007
There is a software application which demands huge computation. My aim
is to port the software program into hardware. So first of all, I need
to evaluate the required volume of computation.

The situation is that the software program needs up to 22 hours on a
personal computer (P4 3.06 GHz, 2 GB RAM) to finish the computation,
and the operations are mostly floating-point multiplications and
additions. I need to finish the computation in hardware in several
minutes. Is that possible?

Another question is which kind of hardware platform is preferable. I
have found that the products provided by Nallatech (www.nallatech.com)
seem suitable for me. I am still wondering whether I should instead
choose a board with multiple DSP processors on it? I think DSP
processors are better at floating-point operations.

On Jun 19, 11:41 pm, "h...@hit.edu.cn" <h...@hit.edu.cn> wrote:
> There is a software application which demands huge computation. My aim
> is to port the software program into hardware. So first of all, I need
> to evaluate the required volume of computation.
Put some counters in the software version to total the required number of operations on each variable in the calculation, and you can get that answer fairly accurately.

Next, take a good look at the algorithm for concurrency and necessary serial operations. Any place a different data value is used in each iteration of a loop, you have the chance to calculate two or more values per loop iteration (unrolling the loop) to gain parallelism. If, however, there are serializations which allow no restructuring to parallelize, then the application might well be done best with a fast CPU/cache rather than a slower FPGA. Look at Amdahl's law regarding the possible speedup between the parallel and serial sections of a program. In some cases, even a lengthy serial operation may see some gains from pipelining, as another form of parallelization. So be constructive in looking for parallelism in the algorithm.

In some other cases, using a completely different algorithm which completes the task equally well may be the only means of gaining the required parallelism. This sometimes requires some thought, like replacing recursion with linear code to "unroll" the implied loops, or reordering the sequence of operations, such as loop combining where two or more loops process the same data points.
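
The counter suggestion can be sketched roughly as below. This is only an illustration: the `fadd`/`fmul` wrappers and the dot-product kernel are hypothetical stand-ins for the real program's operations.

```python
# Hypothetical sketch: wrap each arithmetic operation so a Counter
# totals how many times it runs, giving an operation-count estimate.
from collections import Counter

ops = Counter()

def fadd(a, b):
    ops["fadd"] += 1
    return a + b

def fmul(a, b):
    ops["fmul"] += 1
    return a * b

def dot(xs, ys):
    # Stand-in kernel: a dot product in place of the real computation.
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fadd(acc, fmul(x, y))
    return acc

dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(dict(ops))  # 3 multiplies and 3 adds for a length-3 dot product
```

Running the instrumented program once over representative input gives the totals to scale up.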
> The situation is that the software program will need up to 22 hours on
> personal computer (P4 3.06G, 2GB RAM) to finish the computation, while
> the operations mostly are floating-point multiplications and additions.
> I need to finish the computation by hardware in several minutes. Is it
> possible?
In general, most FPGA implementations see speed improvements between 2X and 200X, sometimes worse, rarely much better, unless the main software loop is very small and the algorithm has exceptional parallelism. The reason is that FPGA cycle times are relatively slow compared to GHz multi-issue pipelined CPUs. It doesn't really matter much whether the data is fixed or floating point; what matters is the degree of parallelism you can introduce and exploit in the FPGA.

You are looking for a speedup on the order of 22*60/3 = 440X, which rather implies that the algorithm must have a very small kernel that is extremely easy to parallelize, with little or no serial component (such as memory accesses), to meet your goals in an FPGA ... certainly possible with some careful work, but not necessarily for all algorithms or problems.
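
The 440X figure and the Amdahl's-law caveat are easy to check with a little arithmetic; the 1% serial fraction below is just an illustrative number, not a measurement of the poster's program.

```python
# Required speedup: 22 hours down to 3 minutes.
runtime_hours = 22
target_minutes = 3
required_speedup = runtime_hours * 60 / target_minutes
print(required_speedup)  # 440.0

def amdahl(serial_fraction, parallel_speedup):
    """Overall speedup when only the parallel portion is accelerated."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)

# Even with infinitely fast parallel hardware, 1% serial work caps the
# overall speedup at 100x -- well short of the 440x goal.
print(amdahl(0.01, float("inf")))
```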
> Another problem is that which kind of hardware platforms is
> preferable. I have checked that the products provided by
> Nallatech (www.nallatech.com) seem suitable for me. I am still
> wondering whether I should choose some boards with multiple DSP
> processors on it? I think DSP processors are better at floating-point
> operations.
Besides looking at FPGAs, you might also consider GPUs, Cell processors, and a different CPU architecture which has better multi-issue floating point, cache, or memory performance ... including tuning the application for exactly that CPU/memory configuration.

Have fun!
John
Hello
> Another problem is that which kind of hardware platforms is
> preferable. I have checked that the products provided by
> Nallatech (www.nallatech.com) seem suitable for me. I am still
> wondering whether I should choose some boards with multiple DSP
> processors on it? I think DSP processors are better at floating-point
> operations.
Perhaps this one could suit your needs: http://www.mathstar.com/products.html

Regards
Christian
Most engineers still don't realize the processing power of the FPGA. True,
today's processors operate at high clock frequencies, but people don't
truly understand how much of that raw speed is lost to processor overhead.

I created an FPGA process for handling real-time 1280-pixel 32-bit camera
scans to identify the leading edge of incoming documents, determine the
skew angle, and rotate the image in real time while the image was still
being scanned. This process originally required 8 high-speed DSPs with
significant propagation delay and large amounts of memory.

I have also converted many software processes to the FPGA environment.
Each has had dramatic improvements in speed that no processor could
ever come close to matching.

Here is a short list of problems your software program may be
experiencing.

  1. Almost 25% to 50% of your speed is lost to opcode fetch and memory
access. This number varies depending on CPU cache efficiency.
  2. Depending on the software program's size and memory access patterns,
you could be forcing excessive cache flushes and reloads, which can reduce
speed another 10 to 50%.
  3. You must then contend with program efficiency. Is the program
optimized for speed?
  4. Last, the efficiency of your software compiler. Is it making
extensive use of libraries or inline code?

When a program is converted to hardware, you eliminate items 1, 2, and 4.
You are left with item 3, program efficiency: how well it is translated
to hardware.

The first thing to do is reorder the statements in the program. Section
the program into stages and identify the loop/repetition structure. Each
stage has a dependency on the previous computation. This will usually
lead to each stage having multiple computations. These can be executed
in parallel in the FPGA, and many equations can be executed in just one
FPGA clock cycle. After reordering the statements into the proper order
for hardware conversion, recompile the program and run it to ensure it
still functions correctly. This is now your basic template for
conversion.
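
The stage-level parallelism described above can be illustrated with a toy reduction (illustrative only; the real program's stages will differ). Splitting one accumulator into independent "lanes" is the software analogue of the parallel adders an FPGA would instantiate, and the restructured version must still match the original, which is exactly what the recompile-and-verify step checks.

```python
# Baseline: a fully serial reduction -- one long dependency chain.
def serial_sum(xs):
    acc = 0.0
    for x in xs:
        acc += x
    return acc

# Restructured: independent accumulators ("lanes") that an FPGA could
# run in parallel, followed by a small final combining step.
def unrolled_sum(xs, lanes=4):
    accs = [0.0] * lanes
    for i, x in enumerate(xs):
        accs[i % lanes] += x   # each lane is an independent adder
    return sum(accs)           # final reduction combines the lanes

data = [float(i) for i in range(100)]
assert serial_sum(data) == unrolled_sum(data)  # same answer, more parallelism
```

(For general floating-point data the reassociation can change rounding slightly; here the integer-valued inputs keep both sums exact.)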

If the program uses amounts of data beyond the capacity of the FPGA, you
will need to create a multi-channel DMA controller w/cache. This
controller will provide access for each stage needing external memory.

Second, when using decimal calculations, determine the maximum acceptable
decimal error. Floating point offers many advantages but slows down
computation, consumes a large number of resources, and adds to the overall
complexity of the hardware. You should use fixed-point computation if
possible.

Fixed-point decimal accuracy (decimal portion, not including integer size)

 1 byte = 2 decimal places
 2 byte = 4 decimal places
 3 byte = 7 decimal places
 4 byte = 9 decimal places
 5 byte = 12 decimal places
 6 byte = 14 decimal places
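
The table above is consistent with digits = floor(bits × log10 2), i.e. how many full decimal places an n-byte fractional part can always resolve; a quick check:

```python
# Reproduce the bytes-to-decimal-places table from first principles:
# n fractional bits resolve floor(n * log10(2)) full decimal digits.
import math

for nbytes in range(1, 7):
    digits = math.floor(nbytes * 8 * math.log10(2))
    print(f"{nbytes} byte(s) -> {digits} decimal places")
# Prints 2, 4, 7, 9, 12, 14 -- matching the table above.
```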

Xilinx devices have built-in multipliers (18x18) which are fast and save
hardware. Division is possible but uses more resources.

A Spartan-3 can do the work. The Virtex-II and Virtex-4 are faster and
offer more memory and multipliers, with the addition of CPUs to help with
other tasks. A DSP does provide advantages over a standard processor, but
it does not compare to the raw power of an FPGA.

evansamuel@charter.net
We have a number of clients doing things similar to what you are doing,
using our development hardware, in quite a few application areas.

From a DSP-performance standpoint, it is hard for a DSP processor to beat
a well-designed and well-implemented FPGA design. Generally you should see
a massive increase in performance from an FPGA DSP engine compared to the
software running on a generic x86 processor.

The second part of your problem is getting data to and from the DSP
processor or FPGA. We see a lot of these applications because a lot of our
boards are PCI or PCI-E based and offer a high-bandwidth route to the PC,
but you may want to consider USB 2.0 as an alternative. It is usually not
as fast as PCI: often the USB controller itself effectively sits on a PCI
bus, so it is limited by that, if not by its own protocol handling. If you
do have a high data-bandwidth requirement, do be careful about what PCI
and PCI-E can deliver in reality. 20-50% of the bus/link speed is a
reasonable first-level guide to what can be achieved, and even then there
are lots of factors.

John Adair
Enterpoint Ltd.
www.enterpoint.co.uk

On 20 Jun, 06:41, "h...@hit.edu.cn" <h...@hit.edu.cn> wrote:
> There is a software application which demands huge computation. My aim
> is to port the software program into hardware. So first of all, I need
> to evaluate the required volume of computation.
>
> The situation is that the software program will need up to 22 hours on
> personal computer (P4 3.06G, 2GB RAM) to finish the computation, while
> the operations mostly are floating-point multiplications and additions.
> I need to finish the computation by hardware in several minutes. Is it
> possible?
>
> Another problem is that which kind of hardware platforms is
> preferable. I have checked that the products provided by
> Nallatech (www.nallatech.com) seem suitable for me. I am still
> wondering whether I should choose some boards with multiple DSP
> processors on it? I think DSP processors are better at floating-point
> operations.
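
The 20-50% rule of thumb for bus efficiency mentioned above translates into numbers like these. This is only a sketch: the nominal rates are for classic 32-bit/33 MHz PCI (133 MB/s) and a single PCI-E 1.x lane (250 MB/s per direction), and the 35% midpoint is an assumed figure, not a measurement.

```python
# Rough usable-throughput estimate from a nominal bus rate, using the
# 20-50% rule of thumb; 35% is an assumed mid-range efficiency.
def effective_bandwidth(raw_mb_s, efficiency=0.35):
    """Estimate usable MB/s on a PCI/PCI-E link from its raw rate."""
    return raw_mb_s * efficiency

print(effective_bandwidth(133))  # 32-bit/33 MHz PCI: ~47 MB/s usable
print(effective_bandwidth(250))  # one PCI-E 1.x lane: ~88 MB/s usable
```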
hitsx@hit.edu.cn wrote:

> There is a software application which demands huge computation. My aim
> is to port the software program into hardware. So first of all, I need
> to evaluate the required volume of computation.
> The situation is that the software program will need up to 22 hours on
> personal computer (P4 3.06G, 2GB RAM) to finish the computation, while
> the operations mostly are floating-point multiplications and additions.
> I need to finish the computation by hardware in several minutes. Is it
> possible?
If you give some hints as to what type of computation you are doing, you will get better answers.

Floating point, especially addition, is relatively expensive in an FPGA. You might consider the possibility of using fixed point, even if it is significantly wider. Sometimes block floating point works well, where one exponent applies to many (like a whole vector of) values. (The pre/post normalization needed for floating point is expensive.)

The actual implementation of the algorithm is usually very different in an FPGA. I recommend the systolic array architecture, but there are other possibilities.
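
A minimal block-floating-point sketch: one shared exponent for a whole vector, with fixed-point mantissas. This is an illustration of the general idea, not glen's specific proposal; the 16-bit mantissa width is an assumption.

```python
# Block floating point: quantize a vector to integer mantissas that
# all share a single exponent chosen from the largest magnitude.
import math

def to_block_float(values, mantissa_bits=16):
    """Return (mantissas, shared_exponent) for a vector of floats."""
    peak = max(abs(v) for v in values)
    exp = math.ceil(math.log2(peak)) if peak > 0 else 0
    scale = 2 ** (mantissa_bits - 1 - exp)
    mantissas = [round(v * scale) for v in values]
    return mantissas, exp

def from_block_float(mantissas, exp, mantissa_bits=16):
    """Dequantize back to floats."""
    scale = 2 ** (mantissa_bits - 1 - exp)
    return [m / scale for m in mantissas]

vec = [0.5, -1.25, 3.0]
m, e = to_block_float(vec)
print(from_block_float(m, e))  # recovers [0.5, -1.25, 3.0] exactly here
```

Arithmetic on the mantissas is then plain fixed-point work, with only one exponent to track per vector, which is why it maps well to FPGA multipliers.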
> Another problem is that which kind of hardware platforms is
> preferable. I have checked that the products provided by
> Nallatech (www.nallatech.com) seem suitable for me. I am still
> wondering whether I should choose some boards with multiple DSP
> processors on it? I think DSP processors are better at floating-point
> operations.
There are some systems convenient for implementing such designs. I have seen PCMCIA cards with an FPGA, though that might not be big enough for you. How many times do you need to run this problem, and how big is your budget?

You might also look at: http://www.drccomputer.com/

-- glen
Hello John,

Thank you for your quick reply!

To be specific, my program is related to 3D image reconstruction. The
input data is floating-point numbers in the form of a 3D array. I
represent it using the symbol g(x1,y1,z1), where the ranges of x1, y1, z1
are quite large. The output data is another 3D floating-point array. I
represent it using the symbol F(x2,y2,z2); similarly, the ranges of
x2, y2, z2 are also quite large.

The estimated memory usage is 2 GB or so, while the total calculations
required are listed below:

integer addition	2442 Giga operations
float add		814  Giga operations
float subtract		2424 Giga operations
float multiply		1610 Giga operations
float divide		809  Giga operations

I want these calculations done in one minute, so we can divide the
operations by 60 to get the required rate per second.
As a matter of fact, if the calculations could be done in 5 minutes,
that would be OK for me. So for the minimum requirement, divide the
above numbers by 300 (= 60*5).
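
The division described above works out to the following sustained rates (a sketch using the totals from this post):

```python
# Convert the operation totals into required sustained rates for the
# 1-minute goal (divide by 60 s) and the 5-minute fallback (300 s).
totals_gops = {            # totals in Giga-operations, from the post
    "int add": 2442, "float add": 814, "float sub": 2424,
    "float mul": 1610, "float div": 809,
}
for seconds in (60, 300):
    rates = {op: round(g / seconds, 1) for op, g in totals_gops.items()}
    print(f"{seconds} s budget:", rates, "(Gops/s)")
```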

What is your suggestion?

Hello glen:

Please refer to my reply to John. There is no problem with the budget.
And I would prefer to get the whole calculation done in 10 minutes, as
I said in my reply to John.

<hitsx@hit.edu.cn> wrote
> To be specific, my program is related to 3D image reconstruction. The
> input data is floating-point numbers in the form of a 3D array. I
> represent it using the symbol g(x1,y1,z1), where the ranges of x1, y1, z1
> are quite large. The output data is another 3D floating-point array. I
> represent it using the symbol F(x2,y2,z2); similarly, the ranges of
> x2, y2, z2 are also quite large.
>
> The estimated memory usage is 2 GB or so, while the total calculations
> required are listed below:
>
> integer addition 2442 Giga operations
> float add 814 Giga operations
> float subtract 2424 Giga operations
> float multiply 1610 Giga operations
> float divide 809 Giga operations
>
> I want these calculations done in one minute, so we can divide the
> operations by 60 to get the required rate per second.
> As a matter of fact, if the calculations could be done in 5 minutes,
> that would be OK for me. So for the minimum requirement, divide the
> above numbers by 300 (= 60*5).
Is it single or double precision floats? Divided by 300, the numbers are not very high:

10.7G fadd/fsub
5.3G fmul
2.7G fdiv

IMO your limiting factor will probably be the memory bandwidth. If you have enough memory bandwidth, then it's not a problem for a large FPGA. You can even go faster than that.

Marc
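
The memory-bandwidth concern can be made concrete with a rough worst-case estimate. This sketch assumes double precision and zero on-chip data reuse, i.e. two operand reads and one result write streamed from external memory per floating-point operation; real designs cache and reuse operands, so the true requirement sits somewhere below this ceiling.

```python
# Worst-case external-memory traffic at the 5-minute budget, assuming
# every float op moves 2 operands in and 1 result out, 8 bytes each.
DOUBLE_BYTES = 8
total_float_gops = 814 + 2424 + 1610 + 809   # totals from the earlier post
seconds = 300                                # the 5-minute budget
ops_per_second = total_float_gops * 1e9 / seconds
bytes_per_second = ops_per_second * 3 * DOUBLE_BYTES
print(round(bytes_per_second / 1e9), "GB/s worst case, with no operand reuse")
```

A figure of hundreds of GB/s is far beyond any single external memory of that era, which is exactly why on-chip buffering and data reuse decide whether the FPGA approach is feasible.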
It is double precision. Can you please explain to me why that matters?
Another question: since you mentioned that memory bandwidth could be an
issue, can you estimate the approximate required memory bandwidth?