Hi all , I wanted to implement a fast and low area sorting algorithm in Verilog RTL, does anyone have any suggestions? Any links to IEEE papers, articles are higly welcome... regards Deepu John
Fast/low area Sorting hardware.
Started by ●June 7, 2005
Reply by ●June 7, 20052005-06-07
Reply by ●June 7, 20052005-06-07
The question is how many values you want to sort... basically, this is iterative algorithm, so for a large set, a big memory and controller will be small but long. as you posted on comp.arch.fpga, you might like this one: http://www.xilinx.com/xcell/xl23/xl23_16.pdf else googlize "rank order filter" john.deepu@gmail.com wrote:> Hi all , > I wanted to implement a fast and low area sorting algorithm in Verilog > RTL, does anyone have any suggestions? > Any links to IEEE papers, articles are higly welcome... > > regards > Deepu John >
Reply by ●June 7, 20052005-06-07
Depends on how fast you want it and what the input order usually is. Do it in C 1st and cost it so that every read and write into data array is your memory access cost function and every compare done is your hw compare cost function. For a 2 input a,b = sort x,y a,b <= (x<y)? x,y : y,x; Arrange this into a merge sort should take 24*6*2 clocks using a DP ram. Thats around 288 clocks worst case. If no swap, you could save some clocks. If data is always sorted the time can be halved. Quicksort would be a waste of time & HW design effort here. Even bubble sorts are ok if you know the data is already mostly in order most of the time. Sort only as fast as you actually need. Since your data is also only 8bits wide the other obvious candidate is counter sorting. Fill an array of 256 counters with 0. Then scan the 48 inputs and do cntr[data[i]]++; Then scan the cntr list and for each cntr value emit that index the value no of times. This will take maybe 256 clocks to 0, 48 clocks to do the ++, then 256 clocks to scan the counts plus 48 more to emit the values. With dual port you could get around 300 clocks with out using any compares. This algorithm works very well when you have lots of data inputs with low precisions, but not so well other way around. 48 inputs is a bit low for this design. johnjakson at usa dot com
Reply by ●June 7, 20052005-06-07
Reply by ●June 7, 20052005-06-07
john.deepu@gmail.com wrote:> I wanted to sort 48 8bit unsigned numbersAnd do you want the 384 bit result for a new set of numbers every 10 ns? Can you wait 1 ms for the result? Is it fine to keep the inputs and outputs in a BlockRAM or are you presenting them parallel?
Reply by ●June 7, 20052005-06-07
<john.deepu@gmail.com> schrieb im Newsbeitrag news:1118147185.854541.278530@g49g2000cwa.googlegroups.com...> Hi Stephane. > I wanted to sort 48 8bit unsigned numbersJust some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out the numbers, store them in a temporary register and compare. Bubble sort is usually the easiest but slowest approach in software. Maybe this is also true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You could use both ports aof a BRAm to simultaneously access two pieces of data to increase speed. Regards Falk
Reply by ●June 7, 20052005-06-07
In article <3gm4uiFd4i81U2@individual.net> you wrote: : <john.deepu@gmail.com> schrieb im Newsbeitrag : news:1118147185.854541.278530@g49g2000cwa.googlegroups.com... : > Hi Stephane. : > I wanted to sort 48 8bit unsigned numbers : Just some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out : the numbers, store them in a temporary register and compare. Bubble sort is : usually the easiest but slowest approach in software. Maybe this is also : true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You : could use both ports aof a BRAm to simultaneously access two pieces of data : to increase speed. Bubble sort should actually be quite fast - you can store all 48 values in registers, then compare and swap if necessary odd pairs on odd clock cycles and even pairs on even cycles. After 48 cycles the registers should hold a sorted data set. Probably this aproach is about as fast as you can get? The speedup comes from the fact that many many bubble sort opperations can occour in parallel. Is this the most hardware efficient sort that runs at this speed though? If possible data should be shifted sequentially into and out of (or at least out of) the registers as a readout mux would be a resource hog. Using a Generate statement in VHDL (or equiv in Verilog) I'm guessing the sort will come to less than 100 lines of code. cheers cds
Reply by ●June 7, 20052005-06-07
"c d saunter" <christopher.saunter@durham.ac.uk> schrieb im Newsbeitrag news:d84pvd$5pu$1@heffalump.dur.ac.uk...> Bubble sort should actually be quite fast - you can store all 48 values in > registers, then compare and swap if necessary odd pairs on odd clockcycles> and even pairs on even cycles. After 48 cycles the registers should hold > a sorted data set.This sounds like a almost full parallel approach. Could be quite fast, but also quite resouce hungry.> Probably this aproach is about as fast as you can get? The speedup comes > from the fact that many many bubble sort opperations can occour in > parallel. Is this the most hardware efficient sort that runs at thisspeed> though? > > If possible data should be shifted sequentially into and out of (or atleast> out of) the registers as a readout mux would be a resource hog. Using aThats why a BRAM is quite handy. Using both ports you can pull two datas out in one cycle, compare, and write them back on a second cycle. Regards Falk
Reply by ●June 7, 20052005-06-07
Falk Brunner (Falk.Brunner@gmx.de) wrote: : "c d saunter" <christopher.saunter@durham.ac.uk> schrieb im Newsbeitrag : news:d84pvd$5pu$1@heffalump.dur.ac.uk... : > Bubble sort should actually be quite fast - you can store all 48 values in : > registers, then compare and swap if necessary odd pairs on odd clock : cycles : > and even pairs on even cycles. After 48 cycles the registers should hold : > a sorted data set. : This sounds like a almost full parallel approach. Could be quite fast, but : also quite resouce hungry. Yup. Unless the OP posts some details about desired performance it's impossible to know which one is best... : Thats why a BRAM is quite handy. Using both ports you can pull two datas out : in one cycle, compare, and write them back on a second cycle. Indeed. There is a nice intermediate level of parallelism availible by using LUT RAMs to build units 16 words deep, each of which sort sequentially, with every other set of sorts crossing the LUT RAM boundaries. This would also be the most complicated case to code :-) cheers cds






