FPGARelated.com
Forums

Sorting algorithm for FPGA available?

Started by Unknown July 19, 2006
Hello,

we have a Virtex4 FPGA and are looking for available hardware sorting
algorithms. I couldn't find anything at opencores.org, but I assume
work has been done on this subject.

The Virtex4 includes two PowerPC cores and lots of Block RAMs, so
everything needed seems to be there. Are there any existing projects,
and what speeds could be expected for sorting 64-bit integers?

thx Heiner

I'd suggest a binary-tree-based sorting algorithm with destination
vectors, so that once the final order is found, the data can be moved
in parallel by more than one unit (e.g. one per Block RAM), or even one
per cell in my case :-)

If you have no special demands concerning speed, you should be able
to run a software-based algorithm on the PowerPC core, using one of the
many implementations available on the net.

Hello,

There was a similar thread in this newsgroup.  If you are looking to solve a
specific problem and have performance targets, you should share those -- it
will help people give you more focused answers.

You do not need PowerPCs or BlockRAMs to sort integers.  Sure, you can write
a sequential sort algorithm and run it on a processor, but if you are
interested in performance, I think that approach may be one of the last
things you would want to do.  If you are interested in two examples of a
hardware sort, take a look at:

http://www.engr.sjsu.edu/crabill/module05.pdf

There is a bubble sort and an odd-even transposition sort.  The examples
sort a set of five 16-bit values.  They could certainly be extended to
handle fewer/more and/or wider/narrower inputs.  One example is pipelined,
and the pipeline depth can certainly be changed.  In fact, you can turn it
into a combinational sorter if you want.

If anyone expands the examples to sort 1 million 64-bit integers, please let
me know what your synthesis tool thinks about it.

Certainly, I am not advertising the examples as "the best" and I am no
"expert" in sorting.  Or anything.  But they might be a place for you to
start.  I only knew the bubble sort, but used Google to search for "parallel
sorting algorithms" because I figured there would be one which I could
implement in hardware.  The odd-even transposition sort is indeed quite
easy.

Eric

heinerlitz@gmx.de wrote:
> we have a Virtex4 FPGA and are looking for availlable hardware sorting
> algorithms. [...] Are there any projects and what speeds could be
> estimated for sorting 64 bit integers?
How many objects, and how many bits in each, for the sort?  How many clocks do you have to complete the sort?  Are the results accessed sequentially or as a table?  These are some of the important questions you need to answer before you can determine how best to sort the data.
heinerlitz@gmx.de wrote:
> we have a Virtex4 FPGA and are looking for availlable hardware sorting
> algorithms. [...] Are there any projects and what speeds could be
> estimated for sorting 64 bit integers?
It all depends on what type of data you have: is it already almost sorted, only roughly sorted, or fully random?  Look at Knuth's texts for the algorithms.

If you need to sort a huge array of integer keys, I favor the radix sort.  For 64-bit values, use 8 passes with 256 variable-length output buckets whose total size will be the same as the initial array.  The time approaches 16N memory cycles, which can use the fastest rate of memory access, potentially even at DDR rates.

It is just like sorting a telephone book: take the initial (say random) name list and put all the names beginning with 'a' into the 'a' bucket, and similarly for the other letters.  In this case you have 256 letters.  On the 2nd pass, inspect the 2nd letter.  Every time you have, say, 16 'a' words, you bag them and put them onto the 'a' pile.  You want to do mostly reads and mostly writes as bursts, rather than interleaving reads and writes.  BlockRAMs can be used as your letter bags.

It gets more interesting when you figure out how to manage the buckets.  BlockRAMs are useful for holding blocks of each letter.  One BlockRAM with 256 64-bit words would only hold a 1-word buffer for each of the 256 buckets, so 16 BlockRAMs would hold 256 buckets, each 16 words deep.  You want them to be as deep as possible to reduce the cost of bucket management.  You can think of these BlockRAM buckets as 256 soft FIFOs (16 words deep) in front of main memory.  Only when a letter bucket fills do you need to write it out and find a new place to write it.

One way to do that is to assume the input data is matched by a similar-sized output array which is broken into 16-word cells.  Add to that a linked list to manage the cells; then it is just a matter of managing 256 linked-list heads and a simple table that needs 1 word per cell.  If you get really good at this, with small buffers for inputs and outputs, you can effectively do this in place in main memory, though many will use separate in and out buffers.

Every time 16 words are read in, that cell is added to a free list.  On average, you write 1 of the 256 letter cells for every cell you read in.  Memory cost is therefore N + N/16 for 16-word-deep buffers.  Of course you should prototype this in C (or whatever) and observe its operation.  The HDL code will look similar.

John Jakson
transputer guy
Hi John,
I don't understand the following memory allocation issue you mentioned:
"Every time 16 words are read
in, that cell is added to a free list. On avg, you write 1 of the 256
letter cells for every cell you read in. Memory cost is therefore
N+N/16  for 16 word deep buffers. "

For example, what if all 1 million keywords start with the same one of
the 256 letters?

Could you please give more details? Or any references in Knuth, volume 3?

Thank you.

Weng


Weng Tianxiang wrote:
> For example, all 1 million keywords starts with same one of 256
> letters? Could you please give more details? Or any references in
> Knuth volumn 3 book?
Well, it's not that difficult to write a small C program to explore this.  Get a small rand function, like R250, fill up a 1M-word array, make a pointer table to every 16th word, and make 256 empty head pointers.  Set up 256 empty 16-word FIFOs.  Arrange the C code to use an FSM in a loop so that on every iteration, data moves from the array to 1 of the 256 buffers in 16-cycle bursts.  If any of the buckets fills, copy it to a write-back buffer.  During the 16-word read intervals, insert write-backs.

It would be much easier to just use 2x as much BlockRAM and let the buffers be 32 words deep.  While all buffers are less than 16 deep, keep reading in bursts.  When any buffer crosses 16, write out the previous 16 during a read interval.  Basically you have a DMA engine that is always reading or writing 16-word blocks, so the DRAM can be efficient.  You can change the 16 for another burst size by changing the bucket depths.  Use consts, of course.

It doesn't really matter if the letters are evenly distributed, or even mostly all the same; that only changes the routing of the bucket FIFOs into the output stream.  The input data of 64K 16-word entries, which all starts on 1 linked list, is going to be replaced by 64K 16-word blocks spread over 256 distinct lists, even if most might actually end up on 1 list for an uneven letter distribution.  On the next pass, the engine will iterate through each of these 256 input lists and write back, sort of in place, to the same lists.

I could see 2 memory systems in play: the data reads and writes to the 1M array, and a separate system to do the header linked-list management.  Every write-back of a bucket needs the linked-list FSM to track the 256 headers.  The HDL FSM will have pretty much the same structure.

Of course, writing a program in a newsgroup post is much easier than writing it for real, but really this isn't rocket science.  This is about a day's worth of C coding and testing.  Perhaps a basic text on linked lists might be more useful than Knuth, since that is really all this is.

John Jakson
transputer guy
Hi JJ,
Thank you.

Weng

JJ wrote:

> If you need to sort a huge array of integer keys I favor the radix
> sort, for 64 bit values, use 8 passes with 256 variable length output
> buckets whose total size will be same as initial array. [...] Every
> time you have say 16 'a' words you bag them and put them into the 'a'
> pile. [...] Blockrams can be used as your letter bags.
Actually, that would be bucket sort.  The disadvantage of bucket sort is that you need to manage many buckets of varying size, which is either complicated or wastes a lot of memory.

With radix sort you start, counterintuitively, by sorting on the LAST letter of the name (the least significant byte) to avoid all that hassle.  Each pass can be done by counting sort, with an internal BlockRAM of 256 entries for the counters.  In a V4 you can probably even hold 64K entries, so you can get along with 4 passes of 16 bits each.  See section 9.3 of Cormen, Leiserson and Rivest, "Introduction to Algorithms".

Smaller datasets can be sorted in hardware by systolic priority queues, as fast as you can input the data.

Kolja Sulimma
Kolja Sulimma wrote:
> Smaller datasets can be sorted in hardware by systolic priority queues
> as fast as you can input the data.
You can also sort the words bit-serially, using a log2(N) array of muxes for each serial input.  This strategy has a serial latency of log2(N) to the first bit, plus word-length clocks, and can be built as wide as you want.  It's easily reused as part of a block sort.

http://svn.sourceforge.net/viewvc/fpgac/trunk/fpgac/examples/pipelines/BitSer.c?revision=10&view=markup

The strategy is a two-input, two-output mux that passes bits through in pairs.  Each bit pair is compared, and on the first difference the mux selector is latched so that the low-value bit stream passes to the left.  The mux selectors are reset on a word flag.  With a log2 mux arrangement, each mux layer sorts with a stream offset of 2, 4, 8, ..., 2^N.

Depending on the FPGA, the self-selecting bit muxes are a few LUTs each, making the total cost of the sort engine N * log2(N) * K (where K is the number of LUTs per bit mux).