There is a software application which demonds huge computation. My aim is to port the software program into hardware. So first of all, I need to evaluate the required volumn of computation. The situation is that the software program will need up to 22 hours on personal computer (P4 3.06G, 2GB RAM) to finish the computation, while the operations mostly are float-point multiplications and additions. I need to finish the computation by hardware in serveral minites. Is it possible? Another problem is that which kind of hardware platforms is perferable. I have checked that the production provided by Nallatech(www.nallatech.com) seems suitable for me. I am still wondering whether I should choose some boards with multiple DSP processors on it? I think DSP processors are better at floating point operations.
Interesting problems about high performance computing
Started by ●June 20, 2007
Reply by ●June 20, 20072007-06-20
On Jun 19, 11:41 pm, "h...@hit.edu.cn" <h...@hit.edu.cn> wrote:> There is a software application which demonds huge computation. My aim > is to port the software program into hardware. So first of all, I need > to evaluate the required volumn of computation.put some counters in the software version to total the required number of operations on each variable in the calculation, and you can get that answer fairly accurately. Next take a good look at the algorithm, for concurrency and necessary serial operations. Any place you have a different data value used in each iteration a loop, you have the chance to do calculate two or more values per loop iterations (unrolling the loop) to gain parallelism. If however, there are serializations which offer no restructuring to parallelize, then the application might well be done best with a fast cpu/cache rather than a slower FGPA. Look at Amdahl's law regarding possible speedup between parallel and serial sections of a program. In some cases, even a serial operation that is lengthy, may see some gains by pipelining, as another form of parallelization. So be constructive in looking for parallelism in the algorithm. In some other cases, using a completely different algorithm which completes the task equally, may be the only means for gaining the required parallelism. This sometimes requires some thought, like replacing recursion with linear code, to "unroll" the implied loops. or reordering the sequence over operations ... such as loop combining where one or more loops process the same data points.> The situation is that the software program will need up to 22 hours on > personal computer (P4 3.06G, 2GB RAM) to finish the computation, while > the operations mostly are float-point multiplications and additions. I > need to finish the computation by hardware in serveral minites. Is it > possible?In general, most FPGA implementations see speed improvements between 2X to 200X, sometimes worse, rarely much better, unless the main software loop is very small, and the algorithm has exceptional parallelism. The reason is that FPGA cycle times are relatively slow compared to GHz multi-issue pipelined CPUs. It doesn't really mater much if the data is fixed or floating point, what matters is the degree of parallelism you can introduce and exploit in the FPGA. You are looking for speed up's on the order of 22*60/3 = 440X which rather implies the calculations of the algorithm must have a very small kernel, and extremely easy to parallelize with little or no serial component (such as memory accesses) to meet your goals in an FPGA ... certainly possible with some careful work, but not necessarily for all algorithms or problems.> Another problem is that which kind of hardware platforms is > perferable. I have checked that the production provided by > Nallatech(www.nallatech.com) seems suitable for me. I am still > wondering whether I should choose some boards with multiple DSP > processors on it? I think DSP processors are better at floating point > operations.Beside looking at FPGAs, you might also consider GPUs, Cell processors, and a different CPU architecture which has better multi- issue floating point, cache, or memory performance ... including tuning the application for exactly that CPU/Memory configuration. Have fun! John
Reply by ●June 20, 20072007-06-20
Hello> Another problem is that which kind of hardware platforms is > perferable. I have checked that the production provided by > Nallatech(www.nallatech.com) seems suitable for me. I am still > wondering whether I should choose some boards with multiple DSP > processors on it? I think DSP processors are better at floating point > operations. >perhaps, this one could suit your needs: http://www.mathstar.com/products.html Regards Christian
Reply by ●June 20, 20072007-06-20
Most engineers still don't realize the processing power of the FPGA. True, today's processors operate at high clock frequencies. People don't truly understand how much of that raw speed is lost to processor overhead. I created a FPGA process for processing real time 1280pixel 32bit camera scans to identifiy the leading edge of incoming documents, determine the skew angle and rotate the image in realtime while the image was still being scanned. This process originally required 8 High speed DSP's with significant propagation delay and large amounts of memory. I have also converted many software processes to the FPGA enviroment. Each have had dramatic improvements in speed that no processor could ever come close to matching. Here is a short list of problems your software program may be experiencing. 1. Almost 25% to 50% your speed is lost to opcode and memory access. This number varies depending on CPU cache efficiency. 2. Depending on the software program size and memory access. You could be forcing excessive cache dumps and reload which can reduce speed another 10 to 50%. 3. You must then content with program efficiency. Is the program optimized for speed. 4. Last, the efficiency of your software compiler. Is it using extensive use of libraries or inline code. When a program is converted to hardware you eliminate items 1, 2, and 4. You are left with program efficiency, how well it is translated to hardware. The first thing to do is reorder the statements in the program. Section the program into stages and identify the loop/repitition structure. Each stage has a dependency on the previous computation. This will usually lead to each stage having multiple computations. These can be executed in parallel in the FPGA and many equations can be executed in just one FPGA clock cycle. After reordering the statements in the proper order for hardware conversion, recompile the program and run to insure in still functions correctly. This is now your basic template for conversion. If you use large amounts of data in the program beyond the capacity of the FPGA, you will need to create a multi channel DMA controller w/cache. This controller will provide access to each stage needing external memory. Second, when using decimal calculation, determine the maximum decimal error. Floating point offers many advantages but slows down computation, consume large number of resources and add to the overall complexity of the hardware. You should use fix point computation if possible. Fix point decimal accuracy (decimal portion not including integer size) 1 byte = 2 decimal places 2 byte = 4 decimal places 3 byte = 7 decimal places 4 byte = 9 decimal places 5 byte = 12 decimal places 6 byte = 14 decimal places Xilinx devices have built in multipliers (18x18) which are fast and save hardware. Division is possible but uses more resources. A spartan3 can do the work. The Virtex 2, and 4 are faster and offer more memory and multipliers with the addition of CPU's to help with other tasks. A DSP does provide avantages over a standard processor but does not compare to the raw power of the FPGA. evansamuel@charter.net
Reply by ●June 20, 20072007-06-20
The type of thing you are doing we have a number of clients doing similar things with our development hardware in quite a few application areas.>From DSP performance it is hard for a DSP processor to beat a welldesigned and implemented FPGA design. Generally you should see a massive increase in performance in FPGA DSP engine compared to the software running on a generic X86 processor. The second part of your problem is getting data to and from the DSP Processor or FPGA. We see a lot of these applications because a lot of our boards are PCI or PCI-E based and offer a high bandwidth route to the PC but you may want to consider USB2 as an alternative. It is not as performant as PCI usually. Often the USB controller itself effectively sits on a PCI bus so is limited by that if not it's own protocol handing. If you do have a high data bandwidth requirement do be careful of what PCI and PCI-E can deliver in reality. 20-50% of the bus/link speed is a reason first level guide of what can be achieved and even then there are lots of factors. John Adair Enterpoint Ltd. www.enterpoint.co.uk On 20 Jun, 06:41, "h...@hit.edu.cn" <h...@hit.edu.cn> wrote:> There is a software application which demonds huge computation. My aim > is to port the software program into hardware. So first of all, I need > to evaluate the required volumn of computation. > > The situation is that the software program will need up to 22 hours on > personal computer (P4 3.06G, 2GB RAM) to finish the computation, while > the operations mostly are float-point multiplications and additions. I > need to finish the computation by hardware in serveral minites. Is it > possible? > > Another problem is that which kind of hardware platforms is > perferable. I have checked that the production provided by > Nallatech(www.nallatech.com) seems suitable for me. I am still > wondering whether I should choose some boards with multiple DSP > processors on it? I think DSP processors are better at floating point > operations.
Reply by ●June 20, 20072007-06-20
hitsx@hit.edu.cn wrote:> There is a software application which demonds huge computation. My aim > is to port the software program into hardware. So first of all, I need > to evaluate the required volumn of computation.> The situation is that the software program will need up to 22 hours on > personal computer (P4 3.06G, 2GB RAM) to finish the computation, while > the operations mostly are float-point multiplications and additions. I > need to finish the computation by hardware in serveral minites. Is it > possible?If you give some hints as to what type of computation you are doing you will get better answers. Floating point, especially addition, is relatively expensive in an FPGA. You might consider the possibility of using fixed point, even if it is significantly wider. Sometimes block floating point works well, where one exponent applies to many (like a whole vector of) values. (The pre/post normalization needed for floating point is expensive.) The actual implementation of the algorithm is usually very different in an FPGA. I recommend the systolic array architecture, but there are other possibilities.> Another problem is that which kind of hardware platforms is > perferable. I have checked that the production provided by > Nallatech(www.nallatech.com) seems suitable for me. I am still > wondering whether I should choose some boards with multiple DSP > processors on it? I think DSP processors are better at floating point > operations.There are some systems convenient for implementing such designs. I have seen PCMCIA cards with an FPGA, though that might not be big enough for you. How many times do you need to run this problem, and how big is your budget? You might also look at: http://www.drccomputer.com/ -- glen
Reply by ●June 21, 20072007-06-21
Hello John, Thank you for you quick reply! To be specific, my program is related to 3D image reconstruction. The input data is float point numbers in the form of 3D array. I represent it using symbol: g(x1,y1,z1), while the number of x1,y1,z1 are quite large. The output data is another 3D float point array. I represent it using symbol: F(x2,y2,z2), similarily the number of x2,y2,z2 are also quite large. The estimated memory usage is 2GB or so, while the calculations required is listed below: integer addition 2442 Giga operations per second float add 814 Giga operations per second float substrac 2424 Giga operations per second float muliply 1610 Giga operations per second float divide 809 Giga operations per second I want these calculations done in one minite, so we can divide the operations by 60. As a matter of fact, if the calculations could be done in 5 minites, it is OK for me. So for minimum requirment, divide the above numbers by 300(=60*5). And your suggestion?
Reply by ●June 21, 20072007-06-21
Hello glen: Please refer to my reply to John. There is no problem about the budget. And I prefer to get the whole calculations done in 10 minites as I have said in my reply to John.
Reply by ●June 21, 20072007-06-21
<hitsx@hit.edu.cn> wrote> > To be specific, my program is related to 3D image reconstruction. The > input data is float point numbers in the form of 3D array. I represent > it using symbol: g(x1,y1,z1), while the number of x1,y1,z1 are quite > large. The output data is another 3D float point array. I represent it > using symbol: F(x2,y2,z2), similarily the number of x2,y2,z2 are also > quite large. > > The estimated memory usage is 2GB or so, while the calculations > required is listed below: > > integer addition 2442 Giga operations per second > float add 814 Giga operations per second > float substrac 2424 Giga operations per second > float muliply 1610 Giga operations per second > float divide 809 Giga operations per second > > I want these calculations done in one minite, so we can divide the > operations by 60. > As a matter of fact, if the calculations could be done in 5 minites, > it is OK for me. So for minimum requirment, divide the above numbers > by 300(=60*5).Is it single or double precision floats? Divided by 300 the numbers are not very high: 10.7G fadd/fsub 5.3G fmul 2.7G fdiv IMO your limiting factor will probably be the memory bandwidth. If you have enough memory bandwidth then it's not a problem for a large FPGA. You can even go faster than that. Marc
Reply by ●June 21, 20072007-06-21






