FPGARelated.com
Forums

16->5 "Sort"

Started by Kevin Neilson May 11, 2015
I'm trying to design a circuit (Virtex-7) which you might call either a pri=
ority encoder or a sorter.  This is what it should do:

<i>Given a 16-bit vector with 5 bits set, create a list of 5 4-bit encoded =
values of each bit set.  These needn't be in order.</i>

This turns out to be a lot harder than I thought.  Writing the behavioral R=
TL isn't hard, but Vivado synthesizes it to 16 levels of logic, and when I =
draw out an optimized version, I still get at least 5 levels (using 6-input=
 LUTs).  I'd like to do it with minimal latency, but I can only do about 3 =
levels of logic at my clock speed.  I've tried thinking about how I can use=
 the carry chain muxes but they don't seem to be helpful.  I can do a leadi=
ng-ones detector and encode the leading 1, but I'd probably have to pipelin=
e each stage so that would take 5 cycles.
On 5/11/2015 9:27 PM, Kevin Neilson wrote:
> I'm trying to design a circuit (Virtex-7) which you might call either > a priority encoder or a sorter. This is what it should do: > > <i>Given a 16-bit vector with 5 bits set, create a list of 5 4-bit > encoded values of each bit set. These needn't be in order.</i> > > This turns out to be a lot harder than I thought. Writing the > behavioral RTL isn't hard, but Vivado synthesizes it to 16 levels of > logic, and when I draw out an optimized version, I still get at least > 5 levels (using 6-input LUTs). I'd like to do it with minimal > latency, but I can only do about 3 levels of logic at my clock speed. > I've tried thinking about how I can use the carry chain muxes but > they don't seem to be helpful. I can do a leading-ones detector and > encode the leading 1, but I'd probably have to pipeline each stage so > that would take 5 cycles.
I'm not sure what you mean by "list". Logic circuits don't normally use lists. You want to know which five lines are set. It is easy to encode that into four bit values. But once you have the values what exactly do you want to do with them to provide them in a "list"? If you want them accessible on the same four data lines in sequence, what determines the timing of the sequence? What exactly is your interface? -- Rick
The list is a 20-bit vector comprising the 5 4-bit values.  This is put into a 20-bit wide FIFO; each of the 5 values will be processed simultaneously when read from the FIFO.  So another way you could describe this is a 16-bit/5-hot -> 20-bit encoder.
To be clearer, here's an example.

Input (bit 15 on left):

1 1 0 0  0 0 0 0  0 0 0 0  0 1 1 1 

20-bit Output List:
1111 1110 0010 0001 0000
On 5/11/2015 10:26 PM, Kevin Neilson wrote:
> To be clearer, here's an example. > > Input (bit 15 on left): > > 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 > > 20-bit Output List: > 1111 1110 0010 0001 0000
I am old school so I have to picture these things.... I see five priority encoders. The first priority encoder output is used to disable that input to the second priority encoder, etc. How did you code it? I can see how this would be a lot more than three levels of logic. The equations get quite long. When you talk about levels of logic I assume you mean layers of LUTs? I can see maybe each 16 input priority encoder being no more than 3 levels of LUTs, but all five layers with the inhibit logic... I don't think so. I expect the tools did a pretty good job of optimizing it and I don't easily see any way of using carry chains. -- Rick
rickman <gnuarm@gmail.com> wrote:

(snip)
>> Input (bit 15 on left):
>> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1
>> 20-bit Output List: >> 1111 1110 0010 0001 0000
> I am old school so I have to picture these things....
> I see five priority encoders. The first priority encoder output is used > to disable that input to the second priority encoder, etc.
> How did you code it? I can see how this would be a lot more than three > levels of logic. The equations get quite long. When you talk about > levels of logic I assume you mean layers of LUTs? I can see maybe each > 16 input priority encoder being no more than 3 levels of LUTs, but all > five layers with the inhibit logic... I don't think so. I expect the > tools did a pretty good job of optimizing it and I don't easily see any > way of using carry chains.
It is slightly easier than that. (Sometime ago I did this with 36 and 3.) Consider the case of only two bits high. Use one priority encoder the usual way, and one upside down. (That is, the highest and lowest set bit.) Now, as you said, subtract off the highest and lowest, and two more priority encoders, and finally subtract one more and the last one. But first I would add the logic to determine that only (or at least) five were set. I suspect that, as the OP noted, it is combinatorial hard. Consider the logic needed to, separately, compute each bit of the result. Even just the first one. It does, however, pipeline very well. In the one I was working on, it had a fairly fast clock (66MHz) but I could stand some levels of latency. The OP claims the need for low latency. -- glen
On 5/11/2015 7:26 PM, Kevin Neilson wrote:
> To be clearer, here's an example. > > Input (bit 15 on left): > > 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 > > 20-bit Output List: > 1111 1110 0010 0001 0000 >
64K x 20 ROM? How badly do you want one level of logic? Rob.
Rob Doyle wrote:
> On 5/11/2015 7:26 PM, Kevin Neilson wrote: >> To be clearer, here's an example. >> >> Input (bit 15 on left): >> >> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 >> >> 20-bit Output List: >> 1111 1110 0010 0001 0000 >> > > 64K x 20 ROM? > > How badly do you want one level of logic? > > Rob.
Two ROMs, 256 by 23 and 256 by 20. The first encodes the lower 8 bits into 0 to 5 4-bit numbers right-justified plus a 3-bit indication of how many bits were set. The second just encodes 0 to 5 4-bit numbers, left justified (you assume it has the remaining 1's not covered by the low 8 bits. Then one more level to select how many bits to take from each ROM based on the number of ones in the lower 8 bits (each output bit is only a 2:1 mux in this case - that's why the ROMs are justified right and left as noted). -- Gabor
GaborSzakacs <gabor@alacron.com> wrote:
> Rob Doyle wrote: >> On 5/11/2015 7:26 PM, Kevin Neilson wrote: >>> To be clearer, here's an example.
>>> Input (bit 15 on left):
>>> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1
>>> 20-bit Output List: >>> 1111 1110 0010 0001 0000
>> 64K x 20 ROM?
Hmm, that probably does work. Especially with the BRAM in many FPGAs. When I did it some years ago, it was three out of 36 bits, and ROMs weren't, and still aren't, that big.
>> How badly do you want one level of logic?
> Two ROMs, 256 by 23 and 256 by 20. The first encodes the > lower 8 bits into 0 to 5 4-bit numbers right-justified plus > a 3-bit indication of how many bits were set. The second > just encodes 0 to 5 4-bit numbers, left justified (you assume > it has the remaining 1's not covered by the low 8 bits. Then > one more level to select how many bits to take from each ROM > based on the number of ones in the lower 8 bits (each output > bit is only a 2:1 mux in this case - that's why the ROMs > are justified right and left as noted).
If not in BRAM, that seems a good way. How many levels of logic is a 256 bit ROM in FPGA LUTs? -- glen
On 5/12/2015 11:54 AM, glen herrmannsfeldt wrote:
> GaborSzakacs <gabor@alacron.com> wrote: >> Rob Doyle wrote: >>> On 5/11/2015 7:26 PM, Kevin Neilson wrote: >>>> To be clearer, here's an example. > >>>> Input (bit 15 on left): > >>>> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 > >>>> 20-bit Output List: >>>> 1111 1110 0010 0001 0000 > >>> 64K x 20 ROM? > > Hmm, that probably does work. Especially with the BRAM in many FPGAs. > > When I did it some years ago, it was three out of 36 bits, and ROMs > weren't, and still aren't, that big. > >>> How badly do you want one level of logic? > >> Two ROMs, 256 by 23 and 256 by 20. The first encodes the >> lower 8 bits into 0 to 5 4-bit numbers right-justified plus >> a 3-bit indication of how many bits were set. The second >> just encodes 0 to 5 4-bit numbers, left justified (you assume >> it has the remaining 1's not covered by the low 8 bits. Then >> one more level to select how many bits to take from each ROM >> based on the number of ones in the lower 8 bits (each output >> bit is only a 2:1 mux in this case - that's why the ROMs >> are justified right and left as noted). > > If not in BRAM, that seems a good way. > > How many levels of logic is a 256 bit ROM in FPGA LUTs?
Like most things, "it depends". The older devices have 16 bits per LUT, if any, and many of the newer devices have 64 bits per LUT. Nearly all devices have BRAMs so it's a no brainer by the split table method. Only trouble is BRAMs use a clock cycle... I'm just sayin'... -- Rick