Building the 'uber processor'

Started by mikegw November 3, 2003
Hello all,

Firstly, I would like to say that, other than knowing what an FPGA is at the most
basic level, my knowledge of the subject is nil.  I am looking at this
from the side of an application that needs a solution.  I have seen add-on
boards for PCs that act as co-processors, and this is the interesting bit
to me.  Our research group is looking into building a computer (perhaps a
cluster) for calculating particle dynamics, similar to CFD in
application.  Our programs are in C/C++ running on Linux (any flavour will
do).

My questions are

a) Will FPGA co-processor board(s) offer a speed improvement in running
our simulation jobs over using a 'traditional' cluster (Mosix/Beowulf)?
Bearing in mind that ours will be the only job on the machine, can we
reconfigure our FPGA boards to speed up the calculation?

b) Can anyone recommend a good book that I can read so that I can
ask more informed questions?

Cheers

Mike


I don't know of any good books, but FPGAs can run rings around code,
especially if you can define what you want them to do.  That's the tricky
part.  As far as parallel processing is concerned, they will either blow your
mind or sit there flashing a light.  Xilinx are working on a Java compiler
for FPGAs.  I think it's a student partnership thing, so I'm not sure how good
it is, but it converts Java into hardware.

And FPGAs will eat any cluster (see above), but if you can't define
the problem in a way the FPGA can handle then it will be no faster.  FPGAs
are literally ORs, ANDs and flip-flops (latches), and that's what you need
to start with.  They also have adders and even processors, small memories
and stuff like that.  If you need large memory they can do that too; it's
hardware.  Want SDRAM?  Just connect it up and write a program to access it
(just don't forget to refresh it too :-)

There are already a number of super cluster FPGA projects around.  One of
the fusion reactor projects uses several hundred of them.  I read an
article about it once, but I don't remember the web site, sorry.


Simon

Thanks

Just so I understand you: if I want to "realise" my C code in an FPGA array, I can upload the code, the data and the processing array, run it, and download the data?

The code (not actually mine; I am just seeing if this is all possible) basically applies an equation to a data set, looping over all particles for each time step. The tricky bit (at least in the programming sense) is constantly calculating the relative positions of the particles in order to calculate their effect on each other.

I would really like it if there were a book that could take someone who has a C/C++ program and hold their hand through a whole "realisation" of that code.

Cheers

Mike

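For readers trying to picture the workload Mike describes, here is a minimal sketch of an all-pairs particle loop in Python. The thread never states the actual equation, so the inverse-square attraction, the unit mass, the Euler update and every name below (`pairwise_forces`, `time_step`, `strength`) are assumptions made purely for illustration.

```python
import math

def pairwise_forces(positions, strength=1.0):
    """Return the net force on each particle from every other particle.

    Uses a hypothetical inverse-square attraction purely for illustration.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # each pair is visited once: n*(n-1)/2 pairs
            delta = [positions[j][k] - positions[i][k] for k in range(3)]
            dist_sq = sum(c * c for c in delta)
            dist = math.sqrt(dist_sq)
            magnitude = strength / dist_sq
            for k in range(3):
                component = magnitude * delta[k] / dist
                forces[i][k] += component  # particle i is pulled towards j
                forces[j][k] -= component  # equal and opposite force on j
    return forces

def time_step(positions, velocities, dt, strength=1.0):
    """Advance all particles by one step (unit mass, simple Euler update)."""
    forces = pairwise_forces(positions, strength)
    for i in range(len(positions)):
        for k in range(3):
            velocities[i][k] += forces[i][k] * dt
            positions[i][k] += velocities[i][k] * dt

positions = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
velocities = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
forces = pairwise_forces(positions)
time_step(positions, velocities, dt=0.1)
```

The inner double loop is the "relative positions" step: its cost grows with the square of the particle count, and each pair can be evaluated independently of the others, which is the kind of structure that matters when judging whether hardware parallelism could help.
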
Hi Mike

Think of a coprocessor as a black box with input and output channels that sits in your PC. Its computing elements may run at a fraction of the speed of a 3GHz P4 at some things, or many orders of magnitude faster at others.

I am guessing that your app needs FP calculations, maybe IEEE, or maybe any ad hoc FP will do. IEEE FP is still costly to do in an FPGA, but see a previous post for some pointers. An ad hoc FP format may be all that's needed, but you would have to implement the same format in SW on an unaccelerated node to get the same results.

Where FPGA boards really shine is when you can arrange for them to sit in series with streaming data that may be orders of magnitude faster than a PC could normally handle. If your data is on HD and has to come through the PCI bus then you are IO bound. That may be OK if you can perform N million computations per word transferred, as in, say, crypto, but if you only need minimal computation per point, an FPGA can be the wrong solution.

Figure out how much parallelism you can extract. A P4 may run at 3GHz. An FPGA board may run at 50MHz to 200MHz; if you perform integer multiply-adds, that may limit you to 100MHz. So you need to be doing at least 30x more in parallel just to match one P4. If you can do an order of magnitude more in parallel than that, then you could be doing fine as long as you aren't IO bound.

Consider a faster PCI bus, which will get you a few times more throughput. Consider whether you can dump all the data once into onboard RAM on the PCI board, i.e. get the PC out of the equation except for basic system support.

Take a look at the TimeLogic Decypher board as an example of a bioinformatics application that gets accelerated at rates similar to what you are after, but AFAIK it's mostly pattern matching & integer computation.

Can't say I've heard of any books on this matter, as it's still an immature field!

Good luck

johnjaksonATusaDOTcom

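The break-even arithmetic in the post above can be written down as a small Python sketch. It assumes, as the post implicitly does, that the CPU completes one useful operation per clock; that figure and the function name `parallel_ops_to_match` are illustrative assumptions, not measurements.

```python
import math

def parallel_ops_to_match(cpu_hz, fpga_hz, cpu_ops_per_cycle=1.0):
    """Operations the FPGA must finish per clock just to equal the CPU's rate.

    cpu_ops_per_cycle is an assumed figure; real CPUs vary widely with the code.
    """
    return math.ceil(cpu_hz * cpu_ops_per_cycle / fpga_hz)

# Clock rates quoted in the post: a 3GHz P4 against a 50MHz to 200MHz FPGA design.
for fpga_hz in (50e6, 100e6, 200e6):
    needed = parallel_ops_to_match(3e9, fpga_hz)
    print(f"{fpga_hz / 1e6:.0f} MHz FPGA needs {needed} ops per clock to match")
```

At 100MHz this gives the 30x figure from the post; anything the design manages beyond that number is the actual speed-up, provided the data can be delivered fast enough.
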
Hi Mike,

mikegw wrote:
> a) Will a FPGA co-processor board(s) offer a speed improvement in running
> our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
> Bearing in mind that ours will be the only job on the machine so can we
> reconfigure our FPGA boards to speed calculation?

To parallel what Jon said earlier - the biggest gotcha that seems to bite people is IO bandwidth. It's not necessarily hard to develop highly pipelined FPGA designs that will crunch your numbers at 100M sample/sec, but can you keep it busy?

I read of an interesting approach a while ago - do a search for Pilchard, it's an FPGA coprocessor board developed at a Hong Kong university. Basically it fits in the standard PC memory module form factor, with custom Linux drivers to access it. The bandwidth on the memory bus is much greater than on PCI.

Regards,

John

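The "can you keep it busy?" question can be checked with back-of-envelope numbers, sketched here in Python. The 133 MB/s figure is the theoretical peak of a 32-bit, 33MHz PCI bus; the 4 bytes in and 4 bytes out per sample are assumptions chosen only for illustration, as are the function names.

```python
def required_bandwidth(samples_per_s, bytes_in, bytes_out):
    """Bytes per second the bus must move to keep the pipeline fed and drained."""
    return samples_per_s * (bytes_in + bytes_out)

def sustainable_sample_rate(bus_bytes_per_s, bytes_in, bytes_out):
    """Samples per second the bus can actually deliver and collect."""
    return bus_bytes_per_s / (bytes_in + bytes_out)

PCI_PEAK = 133e6        # theoretical peak of 32-bit/33MHz PCI, in bytes/s
PIPELINE_RATE = 100e6   # the 100M sample/sec pipeline from the post

needed = required_bandwidth(PIPELINE_RATE, bytes_in=4, bytes_out=4)
achievable = sustainable_sample_rate(PCI_PEAK, bytes_in=4, bytes_out=4)
print(f"bandwidth needed: {needed / 1e6:.0f} MB/s")
print(f"rate the bus can feed: {achievable / 1e6:.1f} M samples/s")
print(f"pipeline utilisation: {achievable / PIPELINE_RATE:.0%}")
```

Under these assumptions the pipeline would sit idle most of the time, which is why keeping the data on the board, or using a wider bus such as the memory bus, matters more than the raw compute rate.
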
"mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote in message news:<bo4na0$5qk$1@tomahawk.unsw.edu.au>...
> Hello all,
>
> Firstly I would like to say that other than knowing what a FPGA is on a most
> basic level my knowledge about the subject is nil. I am looking at this
> from an application that needs a solution. I have seen about the place add
> on boards for PC's that act as co-processors. This is the interesting bit
> to me. Our research group is looking into building a computer (cluster
> perhaps) for calculation of particle dynamics, similar to CFD in
> application. Our programs are in C/C++ running on Linux ( any flavour will
> do).

In München, Germany, there is a research group that uses Xilinx a lot. They do some kind of 'particle' search; I think the FPGAs are mostly used to filter the data coming from the experiment.

As you are also in a heavy research area, it may be a good idea to contact them. I have no addresses, but there are not so many nuclear labs, so the one I mentioned should be easy for you to find.

antti

"John Williams" <jwilliams@itee.uq.edu.au> wrote in message
news:bo6l2c$u39$1@bunyip.cc.uq.edu.au...
> Hi Mike,
>
> mikegw wrote:
> > a) Will a FPGA co-processor board(s) offer a speed improvement in running
> > our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
> > Bearing in mind that ours will be the only job on the machine so can we
> > reconfigure our FPGA boards to speed calculation?
>
> To parallel what Jon said earlier - the biggest gotcha that seems to
> bite people is IO bandwidth. It's not necessarily hard to develop
> highly pipelined FPGA designs that will crunch your numbers at 100M
> sample/sec, but can you keep it busy?

As we will be stepping through time, the input data (particle positions and other per-particle information) will be the output of the previous 'step'. The only bit that might be messy is calculating the relative distances between particles.

I think that these devices might be the way to go. It seems odd to me that we appear to be taking a step back to the old analogue computer days, when you 'built' your program.

> I read of an interesting approach a while ago - do a search for
> Pilchard, it's an FPGA coprocessor board developed at a Hong Kong
> university. Basically it fits in the standard PC memory module form
> factor, with custom Linux drivers to access it. The bandwidth on the
> memory bus is much greater than on PCI.

I took a look; it seems fairly interesting. Given my particular data set, I might be on the wrong track thinking of an accelerator card. Maybe what I need is a stand-alone device to which the input is uploaded and which is then left to run.

So much to learn.........

Mike

"mikegw" <mikegw20@hotmail.spammers.must.die.com> writes:
>As we will be stepping time, the data (particle information position etc...)
>will be the output of the previous 'step'. The only bit that might be messy
>is to calculate the relative distances between particles.
>
>I think that these devices might be the way to go. To me it seems odd that
>we seem to be taking a step back to the old analogue computer days when you
>'built' your program.

This is getting away from hardware, and you haven't said how much expertise you have there to apply to the problem, but I remember a series of books published by MIT Press in the '90s. Each was the summary of a different PhD thesis. One of them described breakthroughs in the simulation of many-body problems that led to orders-of-magnitude increases in simulation speed. I don't know whether those results would apply in your case or not.

It seems to be a general rule that hardware can speed up a problem k-fold, where k is usually a modestly small number. But finding a better algorithm can speed up a problem n-fold, where n is the number of items you have to deal with. With both you might get k*n.

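To make the "better algorithm" point concrete, here is a Python sketch comparing an all-pairs neighbour search with a cell list. This is a stand-in chosen for brevity, not the method from the thesis mentioned above, and it only helps under an assumption the thread does not confirm: that particles interact only within some cutoff distance. All names and the random test data are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

def dist_sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def neighbour_pairs_brute(points, cutoff):
    """Check every pair: n*(n-1)/2 distance tests."""
    limit = cutoff * cutoff
    return {(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if dist_sq(points[i], points[j]) <= limit}

def neighbour_pairs_cells(points, cutoff):
    """Bin 3D points into cubes of side `cutoff`; only test the 27 nearby cubes."""
    limit = cutoff * cutoff
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(int(c // cutoff) for c in p)].append(idx)
    pairs = set()
    for cell, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + o for c, o in zip(cell, offset))
            # .get avoids creating empty cells while iterating over the dict
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and dist_sq(points[i], points[j]) <= limit:
                        pairs.add((i, j))
    return pairs

rng = random.Random(0)
points = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(200)]
brute = neighbour_pairs_brute(points, 1.0)
cells = neighbour_pairs_cells(points, 1.0)
print(len(brute), brute == cells)
```

Both functions return the same set of pairs, but for a roughly uniform particle density the cell version does work proportional to the number of particles rather than to its square. Long-range interactions would need a different approach, such as the tree-based methods the many-body literature describes.
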
>So much to learn.........
Someone once said to me "it takes six or eight years to really learn something well, and you don't have very many six or eights, so don't you go waste one." Now I realize I really should have understood what he meant then.
On Mon, 3 Nov 2003 16:05:52 +1100, "mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote:
>Hello all,
Hi Mike,
>Firstly I would like to say that other than knowing what a FPGA is on a most
>basic level my knowledge about the subject is nil. I am looking at this
>from an application that needs a solution. I have seen about the place add
>on boards for PC's that act as co-processors. This is the interesting bit
>to me. Our research group is looking into building a computer (cluster
>perhaps) for calculation of particle dynamics, similar to CFD in
>application. Our programs are in C/C++ running on Linux ( any flavour will
>do).

So that we may better help you, please answer the following questions:

Is the arithmetic Floating Point (FP) or Integer? If mixed, what is the ratio of the two (e.g. 10000 integer ops to every floating point op)? (If the ratio is greater than 100000:1, could you do the integer stuff in the FPGAs, and the FP in a host x86 processor?)

If floating point: does it need to be IEEE FP (i.e. identical to a software execution on the same data set), or would a custom format do (floating point with N bits of mantissa, M bits of exponent, X guard bits, etc.)? What is the ratio of Mult, Div, Add, Sub, Sqrt, Sin, Cos, Exp, Log, ...? (Are integer approximations usable?)

For integer operations, how many bits of precision are needed? Is this precision required all the way through the algorithm, or can the precision be adjusted at each step?

How many arithmetic/logic ops per data item?

What is the data set size needed before calculations can start (e.g. 20 3D points, 10 scan lines, a 512 by 512 2D set, ...)?

Can the calculations be partitioned into multiple identical sets that perform the same operation on different parts of the total data set? If partitioning is possible, how much communication (number of data items) needs to pass between the separate calculation clusters? How often does this need to happen (what is the inter-processor bandwidth)?

How much local data is created while calculations take place? (What bandwidth is needed to support it?)

How much table/look-up data is required by the algorithm? (What bandwidth is needed to support it?)

Can the data be thought of as a continuous stream in and out, or is it one big chunk that must all arrive, then be calculated on until done, then spit out a result (what is the size of the input chunk and the output gems)? Is there a constant flow of chunks (size, arrival rate, expected FP/Int ops per chunk)?

Since you want an über processor, do you have an über hardware designer? (It takes considerable effort to create one of these, especially if what you start with is an über software designer. It is an order of magnitude easier to get a HW engineer to write passable SW than it is to get a SW programmer to design passable HW.)

Are you aware that SW is basically written for sequential execution, or extremely chunky parallelism (threads)? Hardware design (for über processors) typically requires ultra parallelism (100s to 1000s of operations running in parallel), which means that your algorithms will have to be totally re-arranged to match such application-specific hardware. Although this is daunting, there are hundreds of real-life systems that have done this (i.e. the answer to your basic question of whether it makes sense to consider FPGAs for creating an application-specific co-processor is YES). These successful systems were never achieved by just taking the SW (C/C++ for example) and re-crafting it as hardware. You will need to go back to the basics of the algorithm's intent, then design for the extreme parallelism that FPGAs offer. This is not always possible, as discussed by others who have answered your original question.

Are you thinking of a single co-processor board in a PC, or something more like a Beowulf cluster with each node having its own accelerator board?

There are many more such questions, but this would be a good start.

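One of the questions above, how many bits of precision are needed, can be explored in software before any hardware is designed. The Python sketch below models a signed fixed-point format with a chosen number of fractional bits; the helper names and the sample values are hypothetical, and the sketch says nothing about what precision Mike's simulation actually requires.

```python
def to_fixed(x, frac_bits):
    """Quantise a real number to an integer with `frac_bits` fractional bits."""
    return round(x * (1 << frac_bits))

def from_fixed(q, frac_bits):
    """Convert the fixed-point integer back to a float for comparison."""
    return q / (1 << frac_bits)

def fixed_mul(a, b, frac_bits):
    """Multiply two fixed-point values; the shift rescales the double-width product.

    Python's >> floors towards negative infinity, so low-order bits are dropped.
    """
    return (a * b) >> frac_bits

value = 0.1
for frac_bits in (8, 16, 24):
    error = abs(from_fixed(to_fixed(value, frac_bits), frac_bits) - value)
    print(f"{frac_bits} fractional bits: quantisation error {error:.2e}")
```

Running an existing simulation with its arithmetic routed through helpers like these gives a rough idea of where results start to drift as bits are removed, which in turn helps answer whether an integer or custom format could replace IEEE floating point at each step.
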
>My questions are
>
>a) Will a FPGA co-processor board(s) offer a speed improvement in running
>our simulation jobs over using a 'traditional' cluster (mosix/Bewoulf)?
>Bearing in mind that ours will be the only job on the machine so can we
>reconfigure our FPGA boards to speed calculation?

Can't answer this without far more information from you. See above :-)

Note that your "so can we reconfigure our FPGA boards to speed calculation?" is no trivial thing. The design of the hardware may take many months even if you have an über hardware designer.

>b) Can anyone recommend a good book that I can read and hopefully be able to
>ask more informed questions?

There is an annual conference held in Napa, California, where all the people who do this type of thing meet. It is the IEEE FCCM conference. You would be well served by looking at the titles in the proceedings for the last 7 years at http://www.fccm.org/ . You can probably get copies of the proceedings from the IEEE for way too much money.

>Cheers
Happiness to you too.
>Mike
Philip

Philip Freidin
Fliptronics

mikegw wrote:

> Just so I understand you, if I want to "realise" my c code in a FPGA
> array, I can upload the code, data and the processing array. Run it and
> download the data?
>
> The code (not actually mine I am just seeing if this is all possible) is
> basically applying an equation on a data set looping for all particles for
> each time step. The tricky bit (in at least the programming sense) is to
> constantly calculate the relative positions of each particle to calculate
> their effect on each other.

Mike,

Sure, you could put something like a processor into an FPGA, onto which you can download your code and data. But you will very likely not gain much from this, as you are still stuck with the "program code execution" paradigm.

Depending on the application, you might get a little gain by placing a very specialised processor into the FPGA that is optimised for your application. DSPs are a good example here: they have special features that make them very fast for some algorithms. This would also require a special compiler that compiles the code you want to reuse, optimised for your special processor. Even then you would probably need to code many things in assembly language, because there is no direct translation possible from a high-level language to a special machine feature. As far as I know, it is the same for DSPs.

However, you will achieve a real speed-up by throwing the processor concept overboard and thinking purely in terms of distributed state machines. This is a completely different thing from implementing an algorithm in some language. For a start, you have to be an experienced digital designer to do that. (Btw, the same is true for designing a special CPU, of course.)

Regards,

Mario
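One way to picture the difference between executing a program and building hardware is a software model of a small pipeline, sketched here in Python. It is an illustration only: the three stages (latch inputs, multiply, add) computing a*b+c, and the name `run_pipeline`, are assumptions for the example, and a real design would be written in a hardware description language rather than Python.

```python
def run_pipeline(samples):
    """Model a 3-stage pipeline computing a*b + c, one clock tick per loop pass.

    Every stage does its work on every clock, each on a different sample.
    Returns the results in input order and the number of clocks used.
    """
    stage1 = None  # register after stage 1: the latched (a, b, c)
    stage2 = None  # register after stage 2: (a*b, c)
    results = []
    clocks = 0
    stream = iter(samples)
    while len(results) < len(samples):
        clocks += 1
        # Compute every stage's next value from the current register contents,
        # then update all registers together, as a clock edge would.
        out = stage2[0] + stage2[1] if stage2 is not None else None
        next2 = (stage1[0] * stage1[1], stage1[2]) if stage1 is not None else None
        next1 = next(stream, None)
        if out is not None:
            results.append(out)
        stage1, stage2 = next1, next2
    return results, clocks

results, clocks = run_pipeline([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
print(results, clocks)
```

After the first two clocks spent filling the registers, a finished result appears on every clock, however many stages the calculation has. A processor would instead spend several instructions per sample, one after another, which is the paradigm the post suggests leaving behind.
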