FPGARelated.com
Forums

Fast/low area Sorting hardware.

Started by Unknown June 7, 2005
Hi all ,
 I wanted to implement a fast and low area sorting algorithm in Verilog
RTL, does anyone have any suggestions?
Any links to IEEE papers, articles are higly welcome...

regards
Deepu John

I wanted to sort 48 8bit unsigned numbers

The question is how many values you want to sort... basically, this is 
iterative algorithm, so for a large set, a big memory and controller 
will be small but long.

as you posted on comp.arch.fpga, you might like this one:
http://www.xilinx.com/xcell/xl23/xl23_16.pdf

else googlize "rank order filter"

john.deepu@gmail.com wrote:
> Hi all , > I wanted to implement a fast and low area sorting algorithm in Verilog > RTL, does anyone have any suggestions? > Any links to IEEE papers, articles are higly welcome... > > regards > Deepu John >
Depends on how fast you want it and what the input order usually is.

Do it in C 1st and cost it so that every read and write into data array
is your memory access cost function and every compare done is your hw
compare cost function.

For a 2 input a,b = sort x,y
a,b <= (x<y)? x,y : y,x;

Arrange this into a merge sort should take 24*6*2 clocks using a DP
ram. Thats around 288 clocks worst case. If no swap, you could save
some clocks. If data is always sorted the time can be halved.

Quicksort would be a waste of time & HW design effort here.
Even bubble sorts are ok if you know the data is already mostly in
order most of the time. Sort only as fast as you actually need.

Since your data is also only 8bits wide the other obvious candidate is
counter sorting. Fill an array of 256 counters with 0.
Then scan the 48 inputs and do cntr[data[i]]++; Then scan the cntr list
and for each cntr value emit that index the value no of times. This
will take maybe 256 clocks to 0, 48 clocks to do the ++, then 256
clocks to scan the counts plus 48 more to emit the values. With dual
port you could get around 300 clocks with out using any compares. This
algorithm works very well when you have lots of data inputs with low
precisions, but not so well other way around. 48 inputs is a bit low
for this design.

johnjakson at usa dot com

Hi Stephane.
I wanted to sort 48 8bit unsigned numbers 

thanks
Deepu

john.deepu@gmail.com wrote:
> I wanted to sort 48 8bit unsigned numbers
And do you want the 384 bit result for a new set of numbers every 10 ns? Can you wait 1 ms for the result? Is it fine to keep the inputs and outputs in a BlockRAM or are you presenting them parallel?
<john.deepu@gmail.com> schrieb im Newsbeitrag
news:1118147185.854541.278530@g49g2000cwa.googlegroups.com...
> Hi Stephane. > I wanted to sort 48 8bit unsigned numbers
Just some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out the numbers, store them in a temporary register and compare. Bubble sort is usually the easiest but slowest approach in software. Maybe this is also true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You could use both ports aof a BRAm to simultaneously access two pieces of data to increase speed. Regards Falk
In article <3gm4uiFd4i81U2@individual.net> you wrote:

: <john.deepu@gmail.com> schrieb im Newsbeitrag
: news:1118147185.854541.278530@g49g2000cwa.googlegroups.com...
: > Hi Stephane.
: > I wanted to sort 48 8bit unsigned numbers

: Just some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out
: the numbers, store them in a temporary register and compare. Bubble sort is
: usually the easiest but slowest approach in software. Maybe this is also
: true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You
: could use both ports aof a BRAm to simultaneously access two pieces of data
: to increase speed.

Bubble sort should actually be quite fast - you can store all 48 values in
registers, then compare and swap if necessary odd pairs on odd clock cycles
and even pairs on even cycles.   After 48 cycles the registers should hold 
a sorted data set.

Probably this aproach is about as fast as you can get?  The speedup comes
from the fact that many many bubble sort opperations can occour in
parallel.  Is this the most hardware efficient sort that runs at this speed
though?

If possible data should be shifted sequentially into and out of (or at least
out of) the registers as a readout mux would be a resource hog.  Using a
Generate statement in VHDL (or equiv in Verilog) I'm guessing the sort will come
to less than 100 lines of code.

cheers
cds
"c d saunter" <christopher.saunter@durham.ac.uk> schrieb im Newsbeitrag
news:d84pvd$5pu$1@heffalump.dur.ac.uk...

> Bubble sort should actually be quite fast - you can store all 48 values in > registers, then compare and swap if necessary odd pairs on odd clock
cycles
> and even pairs on even cycles. After 48 cycles the registers should hold > a sorted data set.
This sounds like a almost full parallel approach. Could be quite fast, but also quite resouce hungry.
> Probably this aproach is about as fast as you can get? The speedup comes > from the fact that many many bubble sort opperations can occour in > parallel. Is this the most hardware efficient sort that runs at this
speed
> though? > > If possible data should be shifted sequentially into and out of (or at
least
> out of) the registers as a readout mux would be a resource hog. Using a
Thats why a BRAM is quite handy. Using both ports you can pull two datas out in one cycle, compare, and write them back on a second cycle. Regards Falk
Falk Brunner (Falk.Brunner@gmx.de) wrote:
: "c d saunter" <christopher.saunter@durham.ac.uk> schrieb im Newsbeitrag
: news:d84pvd$5pu$1@heffalump.dur.ac.uk...

: > Bubble sort should actually be quite fast - you can store all 48 values in
: > registers, then compare and swap if necessary odd pairs on odd clock
: cycles
: > and even pairs on even cycles.   After 48 cycles the registers should hold
: > a sorted data set.

: This sounds like a almost full parallel approach. Could be quite fast, but
: also quite resouce hungry.

Yup.  Unless the OP posts some details about desired performance it's impossible
to know which one is best...

: Thats why a BRAM is quite handy. Using both ports you can pull two datas out
: in one cycle, compare, and write them back on a second cycle.

Indeed.  There is a nice intermediate level of parallelism availible by using 
LUT RAMs to build units 16 words deep, each of which sort sequentially, with
every other set of sorts crossing the LUT RAM boundaries.  This would also be
the most complicated case to code :-)

cheers
cds