Reply by Kevin Neilson May 13, 2015
On Tuesday, May 12, 2015 at 5:37:27 PM UTC-6, rickman wrote:
> On 5/12/2015 2:57 PM, Kevin Neilson wrote:
> > Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it would work: the first output is the easiest. You just find the leading 1 with a priority encoder and encode it. You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on. This requires 4 levels of LUTs. One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that. So that first output requires 4 LUTs x 4 bits.
> >
> > But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic. (I can't post pictures here, can I?) So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux. I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly. But it just seems like there should be some easier way.
>
> Sorry, I just can't picture what you are doing. What is the "running sum" for? I think I might understand. You look at the first 5 inputs and output codes for all five positions. I'm not sure why you can't look at the first 6 inputs though. This outputs a three bit code of the number of 1's found. The second block looks at the next five inputs and outputs five codes. The last five bits would be like the second group and have a mux with the second group which in turn is what actually feeds the first mux. The first group would be one level of LUTs. The following two groups
>
> Let me try to draw this...
>
>       ,------,  3                        ,-----,
> 0-5   |      |--/------------------------|SEL  |
>  -->--|      | 20                        |     |  20
>       |      |--/------------------------| BUM*|--/-->--
>       '------'                           |     |
>                                      ,---|     |
>       ,------,  3        ,-----,     |   '-----'
> 6-10  |      |--/--------|SEL  |     |
>  -->--|      | 20        |     |     |
>       |      |--/--------|     |     |
>       '------'           |     |  20 |
>                          | BUM*|--/--'
>       ,------,           |     |
> 11-15 |      |           |     |
>  -->--|      | 20        |     |
>       |      |--/--------|     |
>       '------'           '-----'
>
> *Big, Ugly Mux
>
> The mux might be hard to work out and will surely be more than 1 level of LUTs.... unless you can use the magic muxes in the slice to combine multiple LUTs into a 6 input mux. You don't need any adders for the counts since each 3 bit count controls a separate mux. This might just work in three levels of LUTs if you can use multiple LUTs to form a 6 input mux.
>
> I just read your post where you said you were running at 350 MHz. I guess even this will have to be pipelined. But it should be less logic than the brute force distributed RAM approach. But who knows until the LUTs are counted? In essence this is the same thing I guess. It might work better with the larger front end blocks and just one mux.
>
> I'm very surprised the clock to out time on the V7 BRAM is 2.1ns. I think that is about the same number as the Spartan 3s from long ago. Am I mistaken?
>
> --
>
> Rick
The BRAM output is 2.1 ns, but if you use the output register (which I have to) it's 750 ps. Then the BRAM has 2 cycles of latency.

Yes, something like you show would work. The design I'd written up had the sums as inputs to the LUTs. So the top LUT could look at 6 bits (I said 5 originally because I was going to use the MUXCY but I abandoned that). Then the next LUT looks at 4 bits, and the other 2 inputs would be the 2-bit sum of the first 5 bits. And the next LUT looks at 4 more bits and also at a 2-bit sum of the first 10 bits. (This is for the 3rd encoded output so we're looking for the 3rd bit set.) I end up with 4 of these LUTs, 2 levels of LUTs to do the sums, and an F7/F8 mux afterward to pick one of the 4 LUTs. So that's 3 levels of LUTs and an F7/F8, which would work in 1 cycle. The whole thing would be about 100 LUTs.

I couldn't get that to work, though, because I can't get Vivado to synthesize anything right, and I was going to have to instantiate a lot of primitives (including the F7/F8 muxes). I couldn't even get Vivado to do the sums correctly. You should be able to find the 2-bit sum of up to 18 bits with 8 LUTs in 2 levels, but Vivado does 3 levels. It's pitiful.

I ended up doing something else. I did a trailing-one detector like this:

   wire [15:0] trailing_1 = ~(input_vec[15:0]-1) & input_vec[15:0];

This uses the carry chain. I think the idea is from Knuth. That gives you a 16-bit vector with just the trailing 1 set. You encode that for the 1st output. You do the same thing with a mirrored version of input_vec to do a leading-one detector and encode that for the 2nd output. Then you XOR those two vectors with the original to get a vector with just the 3 middle bits still set. You do another leading/trailing 1 detector, encode those two, and then XOR those out, and you have a vector with 1 bit set, which you encode. That all comes to 200 LUTs and I pipelined it for 3 cycles of latency. There's a lot of slack so I might be able to do it in 2 but I'm not sure if I want to risk it.
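[For illustration, here is a minimal, unpipelined Verilog sketch of one "peel" step of the scheme described above: isolate the trailing 1 with the carry-chain trick, isolate the leading 1 by mirroring the vector, and XOR both out to leave the middle bits. The module name, ports, and the assumption that at least two bits are set are illustrative only; the actual design also binary-encodes each one-hot mask and is pipelined over 3 cycles.]

   module peel_ends (
       input  wire [15:0] vec,            // e.g. input_vec from the post above
       output wire [15:0] trailing_1,     // one-hot mask of the lowest set bit
       output wire [15:0] leading_1,      // one-hot mask of the highest set bit
       output wire [15:0] middle          // remaining bits after both are removed
   );
       // bit-reverse helper so the same carry-chain trick finds the leading 1
       function [15:0] mirror;
           input [15:0] x;
           integer i;
           begin
               for (i = 0; i < 16; i = i + 1)
                   mirror[i] = x[15-i];
           end
       endfunction

       wire [15:0] rev = mirror(vec);

       assign trailing_1 = ~(vec - 1) & vec;           // maps onto the carry chain
       assign leading_1  = mirror(~(rev - 1) & rev);   // trailing 1 of the mirrored vector
       // assumes at least two bits are set (the 5-bits-set case discussed),
       // so the two masks never overlap
       assign middle     = vec ^ trailing_1 ^ leading_1;
   endmodule

[Applying the same step to "middle" gives the 2nd and 4th masks and, after one more XOR, the single remaining bit; each one-hot mask then feeds a small one-hot-to-binary encoder.]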
Reply by rickman May 12, 2015
On 5/12/2015 2:57 PM, Kevin Neilson wrote:
> Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it would work: the first output is the easiest. You just find the leading 1 with a priority encoder and encode it. You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on. This requires 4 levels of LUTs. One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that. So that first output requires 4 LUTs x 4 bits.
>
> But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic. (I can't post pictures here, can I?) So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux. I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly. But it just seems like there should be some easier way.
Sorry, I just can't picture what you are doing. What is the "running sum" for? I think I might understand. You look at the first 5 inputs and output codes for all five positions. I'm not sure why you can't look at the first 6 inputs though. This outputs a three bit code of the number of 1's found. The second block looks at the next five inputs and outputs five codes. The last five bits would be like the second group and have a mux with the second group which in turn is what actually feeds the first mux. The first group would be one level of LUTs. The following two groups

Let me try to draw this...

       ,------,  3                        ,-----,
 0-5   |      |--/------------------------|SEL  |
  -->--|      | 20                        |     |  20
       |      |--/------------------------| BUM*|--/-->--
       '------'                           |     |
                                      ,---|     |
       ,------,  3        ,-----,     |   '-----'
 6-10  |      |--/--------|SEL  |     |
  -->--|      | 20        |     |     |
       |      |--/--------|     |     |
       '------'           |     |  20 |
                          | BUM*|--/--'
       ,------,           |     |
 11-15 |      |           |     |
  -->--|      | 20        |     |
       |      |--/--------|     |
       '------'           '-----'

 *Big, Ugly Mux

The mux might be hard to work out and will surely be more than 1 level of LUTs.... unless you can use the magic muxes in the slice to combine multiple LUTs into a 6 input mux. You don't need any adders for the counts since each 3 bit count controls a separate mux. This might just work in three levels of LUTs if you can use multiple LUTs to form a 6 input mux.

I just read your post where you said you were running at 350 MHz. I guess even this will have to be pipelined. But it should be less logic than the brute force distributed RAM approach. But who knows until the LUTs are counted? In essence this is the same thing I guess. It might work better with the larger front end blocks and just one mux.

I'm very surprised the clock to out time on the V7 BRAM is 2.1ns. I think that is about the same number as the Spartan 3s from long ago. Am I mistaken?

--

Rick
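[As an illustration only, a behavioral Verilog model of the structure in the diagram above might look like the sketch below: each group of input bits becomes a left-justified list of 4-bit positions plus a count, and the two "BUM" stages merge the lists by shifting the lower list down by four times the count of the list above it. All names are invented, it assumes at most five bits are ever set, and it says nothing about how the logic maps onto LUTs and F7/F8 muxes, which is the actual question in the thread.]

   module group_mux_lister (
       input  wire [15:0] vec,
       output wire [19:0] list          // set-bit positions, highest first
   );
       // encode bits [hi:lo] of vec into {count, left-justified list of positions}
       function [22:0] grp;
           input [15:0]  v;
           input integer hi, lo;
           integer   i;
           reg [2:0]  n;
           reg [19:0] l;
           begin
               l = 20'd0;
               n = 3'd0;
               for (i = hi; i >= lo; i = i - 1)
                   if (v[i]) begin
                       l[19 - 4*n -: 4] = i[3:0];
                       n = n + 1'b1;
                   end
               grp = {n, l};
           end
       endfunction

       wire [22:0] g_hi  = grp(vec, 15, 11);   // upper group
       wire [22:0] g_mid = grp(vec, 10,  6);   // middle group
       wire [22:0] g_lo  = grp(vec,  5,  0);   // lower group

       // inner "BUM": middle group's list above the lower group's list
       wire [19:0] inner = g_mid[19:0] | (g_lo[19:0] >> (4 * g_mid[22:20]));
       // outer "BUM": upper group's list above everything else
       assign      list  = g_hi[19:0]  | (inner       >> (4 * g_hi[22:20]));
   endmodule

[For the example input quoted further down the thread (bits 15, 14, 2, 1, 0 set) this produces 1111 1110 0010 0001 0000.]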
Reply by Kevin Neilson May 12, 2015
On Tuesday, May 12, 2015 at 4:29:39 PM UTC-6, rickman wrote:
> On 5/12/2015 3:11 PM, Kevin Neilson wrote:
> > I have plenty of BRAMs and don't mind using them, but they're a pain sometimes. I have to use the output registers so they have 2 cycles of latency, and often I have to add another cycle just to route data to or from the BRAM column. They come in handy, though. Lately I've been using them for a lot of Galois arithmetic, such as lookup tables for 1/x.
>
> Why do you have to use the output registers? The clock to out time on a BRAM has always been very fast as is the setup time. The ones I've worked with were only slightly slower than a FF in the context of typical delays in logic and fabric. What is your clock speed?
>
> If you are working in a large part the LUTs are not an unreasonable way to implement this. Not sure how fast the resulting logic will be, but it should be in the same ballpark as the BRAM but purely combinatorial. Do you need to run faster than 100 MHz?
>
> --
>
> Rick
I'm using 350 MHz, or a period of 2.8 ns. The clk->out time for a V7 -1 BRAM (without output reg) is about 2.1 ns, so if I didn't use the BRAM output register, I'd barely have enough time to get the output across a net to a FF. And I know even that usually won't meet timing, because Vivado is fond of pulling the output registers out of my BRAMs and putting them into slices, I guess because it thinks it has extra slack and can give some of it to the next path. But then the net to the FF will be 600 ps and the path will fail. I have not figured out how to make Vivado stop doing this (except by instantiating BRAM primitives).
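[For what it's worth, below is a minimal sketch of inferring a block RAM read path with both the array register and a second register meant to land in the BRAM's dedicated output register. The ram_style attribute is a documented Vivado synthesis attribute for forcing block RAM, but whether the tool keeps the second register inside the BRAM rather than retiming it into the fabric, as described above, can still depend on version and settings, so this is not a guaranteed fix. Names, widths, and the contents file are made up.]

   module bram_rom_2cyc (
       input  wire        clk,
       input  wire [9:0]  addr,
       output reg  [19:0] dout
   );
       (* ram_style = "block" *) reg [19:0] mem [0:1023];
       reg [19:0] ram_q;

       initial $readmemh ("rom_contents.hex", mem);   // placeholder contents file

       always @(posedge clk) begin
           ram_q <= mem[addr];   // BRAM array read (first cycle of latency)
           dout  <= ram_q;       // intended to map to the BRAM output register
       end
   endmodule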
Reply by rickman May 12, 2015
On 5/12/2015 3:11 PM, Kevin Neilson wrote:
> I have plenty of BRAMs and don't mind using them, but they're a pain sometimes. I have to use the output registers so they have 2 cycles of latency, and often I have to add another cycle just to route data to or from the BRAM column. They come in handy, though. Lately I've been using them for a lot of Galois arithmetic, such as lookup tables for 1/x.
Why do you have to use the output registers? The clock to out time on a BRAM has always been very fast as is the setup time. The ones I've worked with were only slightly slower than a FF in the context of typical delays in logic and fabric. What is your clock speed?

If you are working in a large part the LUTs are not an unreasonable way to implement this. Not sure how fast the resulting logic will be, but it should be in the same ballpark as the BRAM but purely combinatorial. Do you need to run faster than 100 MHz?

--

Rick
Reply by GaborSzakacs May 12, 2015
Kevin Neilson wrote:
> Yes, that would work. I think it would be about 180 LUTs, which is quite a bit. It would probably work in one cycle: there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.
If you decide to pipeline it, a register placed after the 256-deep LUT will go into the same slice with the 4 LUTs, 2 F7 muxes and F8 mux. Then the final 2:1 would go after a standard fabric register, which has pretty small clock to Q (much better than BRAM without the output register). Even in a -2 Artix you can run above 500 MHz with this arrangement.

--
Gabor
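[A hypothetical registered variant of the simple_lut test module Gabor posts further down the thread (quoted in the next reply below) would just move the register to the LUT-RAM output so it can pack into the same slice as the four LUTs and the F7/F8 muxes; the final 2:1 select would then sit after this register. The module name is invented here.]

   module simple_lut_reg
     (
      input  wire       clk,
      input  wire [7:0] addr,
      output reg  [7:0] data
     );

      (* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];

      initial $readmemh ("../source/lutcontents.hex", lut_mem);

      always @(posedge clk)
        data <= lut_mem[addr];   // register intended to pack into the same slice

   endmodule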
Reply by Kevin Neilson May 12, 2015
On Tuesday, May 12, 2015 at 12:10:35 PM UTC-6, Gabor wrote:
> glen herrmannsfeldt wrote:
> > GaborSzakacs <gabor@alacron.com> wrote:
> >> Rob Doyle wrote:
> >>> On 5/11/2015 7:26 PM, Kevin Neilson wrote:
> >>>> To be clearer, here's an example.
> >>>>
> >>>> Input (bit 15 on left):
> >>>>
> >>>> 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1
> >>>>
> >>>> 20-bit Output List:
> >>>> 1111 1110 0010 0001 0000
> >>>
> >>> 64K x 20 ROM?
> >
> > Hmm, that probably does work. Especially with the BRAM in many FPGAs.
> >
> > When I did it some years ago, it was three out of 36 bits, and ROMs weren't, and still aren't, that big.
> >
> >>> How badly do you want one level of logic?
> >
> >> Two ROMs, 256 by 23 and 256 by 20. The first encodes the lower 8 bits into 0 to 5 4-bit numbers right-justified plus a 3-bit indication of how many bits were set. The second just encodes 0 to 5 4-bit numbers, left justified (you assume it has the remaining 1's not covered by the low 8 bits). Then one more level to select how many bits to take from each ROM based on the number of ones in the lower 8 bits (each output bit is only a 2:1 mux in this case - that's why the ROMs are justified right and left as noted).
> >
> > If not in BRAM, that seems a good way.
> >
> > How many levels of logic is a 256 bit ROM in FPGA LUTs?
> >
> > -- glen
>
> In Xilinx 7-series:
>
> A 256-bit LUT fits in 1 SLICEL. It uses all 4 64-bit LUTS and three muxes (two for each pair of LUTS and one to combine the result), so it would show up as 3 levels of logic, but it all routes internally to the slice and the muxes are really fast.
>
> Convincing the tools to use LUT memory is the fun part. Here's my test code:
>
> module simple_lut
>   (
>    input  wire [7:0] addr,
>    output wire [7:0] data
>   );
>
>    (* RAM_STYLE = "distributed" *) reg [7:0] lut_mem [0:255];
>
>    initial $readmemh ("../source/lutcontents.hex", lut_mem);
>
>    assign data = lut_mem[addr];
>
> endmodule
>
> --
> Gabor
The ROM can be fast if you use the F7/F8 muxes built into the slice. I've found the key thing for V7 is to minimize the number of routes on the fabric. The F7/F8 muxes are slower than LUTs, I think, but since you don't have to route the connecting net onto the fabric you save a lot of time.
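[As a rough structural sketch of what Kevin and Gabor are describing, one bit of a 256-deep ROM maps onto four LUT6 primitives plus the in-slice MUXF7/MUXF8 pair, so addr[7:6] never have to leave the slice. The INIT values below are placeholders (all zeros); in practice the synthesizer fills them in from an inferred distributed ROM like Gabor's example, and the module name is made up.]

   module rom256x1_slice (
       input  wire [7:0] addr,
       output wire       dout
   );
       wire q0, q1, q2, q3, f7_low, f7_high;

       // four 64x1 LUTs covering addr[5:0]; INIT values are placeholders
       LUT6 #(.INIT(64'h0000_0000_0000_0000)) lut_a (
           .O(q0), .I0(addr[0]), .I1(addr[1]), .I2(addr[2]),
           .I3(addr[3]), .I4(addr[4]), .I5(addr[5]));
       LUT6 #(.INIT(64'h0000_0000_0000_0000)) lut_b (
           .O(q1), .I0(addr[0]), .I1(addr[1]), .I2(addr[2]),
           .I3(addr[3]), .I4(addr[4]), .I5(addr[5]));
       LUT6 #(.INIT(64'h0000_0000_0000_0000)) lut_c (
           .O(q2), .I0(addr[0]), .I1(addr[1]), .I2(addr[2]),
           .I3(addr[3]), .I4(addr[4]), .I5(addr[5]));
       LUT6 #(.INIT(64'h0000_0000_0000_0000)) lut_d (
           .O(q3), .I0(addr[0]), .I1(addr[1]), .I2(addr[2]),
           .I3(addr[3]), .I4(addr[4]), .I5(addr[5]));

       // in-slice wide muxes select on addr[6] and addr[7] without leaving the slice
       MUXF7 f7a (.O(f7_low),  .I0(q0), .I1(q1), .S(addr[6]));
       MUXF7 f7b (.O(f7_high), .I0(q2), .I1(q3), .S(addr[6]));
       MUXF8 f8  (.O(dout),    .I0(f7_low), .I1(f7_high), .S(addr[7]));
   endmodule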
Reply by Kevin Neilson May 12, 2015
I have plenty of BRAMs and don't mind using them, but they're a pain sometimes.  I have to use the output registers so they have 2 cycles of latency, and often I have to add another cycle just to route data to or from the BRAM column.  They come in handy, though.  Lately I've been using them for a lot of Galois arithmetic, such as lookup tables for 1/x.
Reply by Kevin Neilson May 12, 2015
Yes, that would work.  I think it would be about 180 LUTs, which is quite a bit.  It would probably work in one cycle:  there is a LUT, F7/F8 mux, and a second level of LUT for the mux, and only 2 levels of net routed on the fabric.
Reply by Kevin Neilson May 12, 2015
Vivado made it 16 levels of logic, and I can't tell exactly what it's doing, but this is how I would expect it would work:  the first output is the easiest.  You just find the leading 1 with a priority encoder and encode it.  You can look at the first 5 bits with the first level, using the 6th LUT input for an input from the next level if none of those 5 bits are set, and so on.  This requires 4 levels of LUTs.  One could use the carry chain muxes to speed things up but you'd have to instantiate them because Vivado doesn't seem to know how to do that.  So that first output requires 4 LUTs x 4 bits.

But the 5th encoded output is harder, because you have to keep a running 3-bit sum of the number of set bits already encountered, so 3 bits of each LUT after the first are needed for the running sum, and the sum itself requires 2 levels of logic.  (I can't post pictures here, can I?)  So now you end up with what I calculate should be 7 levels of logic, or 3 levels of LUT and 5 levels of carry chain mux.  I could maybe do this if I pipeline it and I can get Vivado to synthesize it properly.  But it just seems like there should be some easier way.
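[To make the "running sum" concrete, here is a behavioral (not LUT-mapped) Verilog sketch: producing the k-th encoded output means tracking, while scanning from the top, how many set bits have been seen so far, and that extra state is exactly what eats LUT inputs. The module and parameter names are made up for illustration; K = 1 gives the easy leading-one case, K = 5 the hard one described above.]

   module kth_set_bit #(parameter K = 5) (
       input  wire [15:0] vec,
       output reg  [3:0]  pos
   );
       integer   i;
       reg [2:0] seen;     // running count of set bits already encountered
       always @* begin
           pos  = 4'd0;
           seen = 3'd0;
           for (i = 15; i >= 0; i = i - 1)
               if (vec[i]) begin
                   seen = seen + 1'b1;
                   if (seen == K)
                       pos = i[3:0];   // position of the K-th set bit from the top
               end
       end
   endmodule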
Reply by Kevin Neilson May 12, 2015
The idea of working from both sides is useful. Finding the 5th bit set is a lot harder than finding the 1st because you have to keep a running sum of the bits already set so that eats up a few inputs of each LUT. If I search from both ends the running sum would be no bigger than 2 (finding, say, 3 starting from the top and 2 starting from the bottom). When I draw this out I still have at least 3 levels of logic plus a few levels of carry chain mux, which I'd probably still have to pipeline one stage, but that's acceptable. Vivado surely won't use the carry chain mux unless I instantiate it anyway, and then it would be 5-6 levels of logic, so I'd definitely need to pipeline that one stage.