FPGARelated.com
Forums

16->5 "Sort"

Started by Kevin Neilson May 11, 2015
rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)
>> Hmm, that probably does work. Especially with the BRAM in many FPGAs.
>> When I did it some years ago, it was three out of 36 bits, and ROMs >> weren't, and still aren't, that big.
(snip)
>> How many levels of logic is a 256 bit ROM in FPGA LUTs?
> Like most things, "it depends". The older devices have 16 bits per LUT, > if any, and many of the newer devices have 64 bits per LUT.
> Nearly all devices have BRAMs so it's a no brainer by the split table > method. Only trouble is BRAMs use a clock cycle... I'm just sayin'...
OK, going back the OP says that three levels of logic is fine. Doesn't say how many pipeline stages. I presume a clock is available for the BRAM, but I am not sure. -- glen
glen herrmannsfeldt wrote:
> GaborSzakacs <gabor@alacron.com> wrote: >> Rob Doyle wrote: >>> On 5/11/2015 7:26 PM, Kevin Neilson wrote: >>>> To be clearer, here's an example. > >>>> Input (bit 15 on left): > >>>> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 > >>>> 20-bit Output List: >>>> 1111 1110 0010 0001 0000 > >>> 64K x 20 ROM? > > Hmm, that probably does work. Especially with the BRAM in many FPGAs. > > When I did it some years ago, it was three out of 36 bits, and ROMs > weren't, and still aren't, that big. > >>> How badly do you want one level of logic? > >> Two ROMs, 256 by 23 and 256 by 20. The first encodes the >> lower 8 bits into 0 to 5 4-bit numbers right-justified plus >> a 3-bit indication of how many bits were set. The second >> just encodes 0 to 5 4-bit numbers, left justified (you assume >> it has the remaining 1's not covered by the low 8 bits. Then >> one more level to select how many bits to take from each ROM >> based on the number of ones in the lower 8 bits (each output >> bit is only a 2:1 mux in this case - that's why the ROMs >> are justified right and left as noted). > > If not in BRAM, that seems a good way. > > How many levels of logic is a 256 bit ROM in FPGA LUTs? > > -- glen >
In Xilinx 7-series: A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three muxes (two for each pair of LUTS and one to combine the result), so it would show up as 3 levels of logic, but it all routes internally to the slice and the muxes are really fast. Convincing the tools to use LUT memory is the fun part. Here's my test code: module simple_lut ( input wire [7:0] addr, output wire [7:0] data ); (* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255]; initial $readmemh ("../source/lutcontents.hex", lut_mem); assign data = lut_mem[addr]; endmodule -- Gabor
On 5/12/2015 2:03 PM, glen herrmannsfeldt wrote:
> rickman <gnuarm@gmail.com> wrote: > > (snip, I wrote) >>> Hmm, that probably does work. Especially with the BRAM in many FPGAs. > >>> When I did it some years ago, it was three out of 36 bits, and ROMs >>> weren't, and still aren't, that big. > > (snip) > >>> How many levels of logic is a 256 bit ROM in FPGA LUTs? > >> Like most things, "it depends". The older devices have 16 bits per LUT, >> if any, and many of the newer devices have 64 bits per LUT. > >> Nearly all devices have BRAMs so it's a no brainer by the split table >> method. Only trouble is BRAMs use a clock cycle... I'm just sayin'... > > OK, going back the OP says that three levels of logic is fine. > Doesn't say how many pipeline stages. > > I presume a clock is available for the BRAM, but I am not sure.
The OP said, "minimal latency" and I think I over spec'd that to mean combinatorial. So one clock for the BRAM and one clock for the muxing should be ok. Better than the 5 clock cycles he mentions. -- Rick
On 5/12/2015 2:09 PM, GaborSzakacs wrote:
> glen herrmannsfeldt wrote: >> GaborSzakacs <gabor@alacron.com> wrote: >>> Rob Doyle wrote: >>>> On 5/11/2015 7:26 PM, Kevin Neilson wrote: >>>>> To be clearer, here's an example. >> >>>>> Input (bit 15 on left): >> >>>>> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 >> >>>>> 20-bit Output List: >>>>> 1111 1110 0010 0001 0000 >> >>>> 64K x 20 ROM? >> >> Hmm, that probably does work. Especially with the BRAM in many FPGAs. >> >> When I did it some years ago, it was three out of 36 bits, and ROMs >> weren't, and still aren't, that big. >> >>>> How badly do you want one level of logic? >> >>> Two ROMs, 256 by 23 and 256 by 20. The first encodes the >>> lower 8 bits into 0 to 5 4-bit numbers right-justified plus >>> a 3-bit indication of how many bits were set. The second >>> just encodes 0 to 5 4-bit numbers, left justified (you assume >>> it has the remaining 1's not covered by the low 8 bits. Then >>> one more level to select how many bits to take from each ROM >>> based on the number of ones in the lower 8 bits (each output >>> bit is only a 2:1 mux in this case - that's why the ROMs >>> are justified right and left as noted). >> >> If not in BRAM, that seems a good way. >> How many levels of logic is a 256 bit ROM in FPGA LUTs? >> >> -- glen >> > > In Xilinx 7-series: > > A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three > muxes (two for each pair of LUTS and one to combine the result), so it > would show up as 3 levels of logic, but it all routes internally to the > slice and the muxes are really fast. > > Convincing the tools to use LUT memory is the fun part. Here's my test > code: > > module simple_lut > ( > input wire [7:0] addr, > output wire [7:0] data > ); > > (* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255]; > > initial $readmemh ("../source/lutcontents.hex", lut_mem); > > assign data = lut_mem[addr]; > > endmodule
It would be 43 x 4 or 172 LUTs. A fair amount plus the muxes to combine the outputs. Still, it is mostly in parallel so it should be fairly fast. -- Rick
The idea of working from both sides is useful.  Finding the 5th bit set is =
a lot harder than finding the 1st because you have to keep a running sum of=
 the bits already set so that eats up a few inputs of each LUT.  If I searc=
h from both ends the running sum would be no bigger than 2 (finding, say, 3=
 starting from the top and 2 starting from the bottom).  When I draw this o=
ut I still have at least 3 levels of logic plus a few levels of carry chain=
 mux, which I'd probably still have to pipeline one stage, but that's accep=
table.  Vivado surely won't use the carry chain mux unless I  instantiate i=
t anyway, and then it would be 5-6 levels of logic, so I'd definitely need =
to pipeline that one stage.
Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it would work:  the first output is the easiest.  You just find the leading 1 with a priority encoder and encode it.  You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on.  This requires 4 levels of LUTs.  One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that.  So that first output requires 4 LUTs x 4 bits.

But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic.  (I can't post pictures here, can I?)  So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux.  I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly.  But it just seems like there should be some easier way.
Yes, that would work.  I think it would be about about 180 LUTs, which is quite a bit.  It would probably work in one cycle:  there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.
I have plenty of BRAMs and don't mind using them, but they're a pain sometimes.  I have to use the output registers so they have 2 cycles of latency, and often I have to add another cycle just to route data to or from the BRAM column.  They come in handy, though.  Lately I've been using them for a lot of Galois arithmetic, such as lookup tables for 1/x.
On Tuesday, May 12, 2015 at 12:10:35 PM UTC-6, Gabor wrote:
> glen herrmannsfeldt wrote: > > GaborSzakacs <gabor@alacron.com> wrote: > >> Rob Doyle wrote: > >>> On 5/11/2015 7:26 PM, Kevin Neilson wrote: > >>>> To be clearer, here's an example. > > > >>>> Input (bit 15 on left): > > > >>>> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 > > > >>>> 20-bit Output List: > >>>> 1111 1110 0010 0001 0000 > > > >>> 64K x 20 ROM? > > > > Hmm, that probably does work. Especially with the BRAM in many FPGAs. > > > > When I did it some years ago, it was three out of 36 bits, and ROMs > > weren't, and still aren't, that big. > > > >>> How badly do you want one level of logic? > > > >> Two ROMs, 256 by 23 and 256 by 20. The first encodes the > >> lower 8 bits into 0 to 5 4-bit numbers right-justified plus > >> a 3-bit indication of how many bits were set. The second > >> just encodes 0 to 5 4-bit numbers, left justified (you assume > >> it has the remaining 1's not covered by the low 8 bits. Then > >> one more level to select how many bits to take from each ROM > >> based on the number of ones in the lower 8 bits (each output > >> bit is only a 2:1 mux in this case - that's why the ROMs > >> are justified right and left as noted). > > > > If not in BRAM, that seems a good way. > > > > How many levels of logic is a 256 bit ROM in FPGA LUTs? > > > > -- glen > > > > In Xilinx 7-series: > > A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three > muxes (two for each pair of LUTS and one to combine the result), so it > would show up as 3 levels of logic, but it all routes internally to the > slice and the muxes are really fast. > > Convincing the tools to use LUT memory is the fun part. Here's my test > code: > > module simple_lut > ( > input wire [7:0] addr, > output wire [7:0] data > ); > > (* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255]; > > initial $readmemh ("../source/lutcontents.hex", lut_mem); > > assign data = lut_mem[addr]; > > endmodule > > -- > Gabor
The ROM can be fast if you use the F7/F8 muxes built into the slice. I've found the key thing for V7 is to minimize the number of routes on the fabric. The F7/F8 muxes are slower than LUTs, I think, but since you don't have to route the connecting net onto the fabric you save a lot of time.
Kevin Neilson wrote:
> Yes, that would work. I think it would be about about 180 LUTs, which is quite a bit. It would probably work in one cycle: there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.
If you decide to pipeline it, a register placed after the 256-deep LUT will go into the same slice with the 4 LUTs, 2 F7 muxes and F8 mux. Then the final 2:1 would go after a standard fabric register, which has pretty small clock to Q (much better than BRAM without the output register). Even in a -2 Artix you can run above 500 MHz with this arrangement. -- Gabor