FPGARelated.com
Forums

2048-input OR gate?

Started by mk July 16, 2006
Hi everyone,
I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as
a first approximation, it would need 6 levels of 4 input lookup
tables. So far I have tried XST but it seems to be using the initial
512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the
MUXCYs? They seem to be quite fast at 45ns each but number of levels
is quite high. I'm curious what the timing would look like if I could
force it to use only LUT4s but I really don't want to code it by hand
and I am too lazy to write a perl script to do it either. Any
suggestions ?

Thanks.

PS Here is what I am using as a test module. I am trying to map it to
a virtex4-10.

module orlt(clk, in, out);
input clk;
input [2047:0] in;
output out;

reg [2047:0] inr;
reg out;
wire outw;

orl u0(inr, outw);

always @(posedge clk)
begin
        out <= outw;
        inr <= in;
end

endmodule

module orl(in, out);
input [2047:0] in;
output out;

wire out = |in[2047:0];
endmodule
mk wrote:
> Hi everyone,
> I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as
> a first approximation, it would need 6 levels of 4 input lookup
> tables. So far I have tried XST but it seems to be using the initial
> 512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the
> MUXCYs? They seem to be quite fast at 45ns each but number of levels
> is quite high. I'm curious what the timing would look like if I could
> force it to use only LUT4s but I really don't want to code it by hand
> and I am too lazy to write a perl script to do it either. Any
> suggestions ?

I was thinking, what is hard about this??? Then I looked at your code and realized that you are not using VHDL. In VHDL you can use a GENERATE statement to lay out the functional elements exactly how you want them with looping to define it all without the tedium. Isn't there a similar construct in Verilog?
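
Verilog-2001 does provide a generate construct. As a minimal sketch (not taken from any post in this thread), the six-level LUT4 tree from the original question could be laid out as below. The module name `orl_tree` and the wire names `l1` to `l5` are hypothetical, and the `KEEP` attribute is an assumption about what XST needs in order to leave the intermediate nets alone; without some such constraint the tool is free to re-optimize the logic back into carry chains.

```verilog
// Sketch only: explicit 4-input OR tree over 2048 bits.
// orl_tree and l1..l5 are illustrative names, not from the thread.
module orl_tree(in, out);
input  [2047:0] in;
output          out;

// One vector per tree level. KEEP is assumed to stop the tool
// from merging these nets; check the synthesis report to confirm.
(* KEEP = "TRUE" *) wire [511:0] l1;
(* KEEP = "TRUE" *) wire [127:0] l2;
(* KEEP = "TRUE" *) wire [31:0]  l3;
(* KEEP = "TRUE" *) wire [7:0]   l4;
(* KEEP = "TRUE" *) wire [1:0]   l5;

genvar i;
generate
    // Level 1: 512 four-input ORs, one LUT4 each
    for (i = 0; i < 512; i = i + 1) begin : lvl1
        assign l1[i] = |in[i*4 +: 4];
    end
    // Levels 2 to 5: each level ORs groups of four from the level below
    for (i = 0; i < 128; i = i + 1) begin : lvl2
        assign l2[i] = |l1[i*4 +: 4];
    end
    for (i = 0; i < 32; i = i + 1) begin : lvl3
        assign l3[i] = |l2[i*4 +: 4];
    end
    for (i = 0; i < 8; i = i + 1) begin : lvl4
        assign l4[i] = |l3[i*4 +: 4];
    end
    for (i = 0; i < 2; i = i + 1) begin : lvl5
        assign l5[i] = |l4[i*4 +: 4];
    end
endgenerate

// Level 6: final 2-input OR
assign out = |l5;
endmodule
```

This uses 512 + 128 + 32 + 8 + 2 + 1 = 683 LUTs if every level is preserved. Whether it is faster than the carry-chain mapping depends on routing, so it is best treated as an experiment to compare in the timing report.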
I am rusty on Verilog so can't remember if you have a generate
statement available, but another way to cut work is to have a layered
component such that the bottom level has, say, four 4-input OR gates in
it. The layer above has 4 of the super gate and so on. If you start at
the bottom with an OR gate instantiation and do the same all the way
up with component instantiations, the synthesiser won't be able to do
much to insert other gates.

The MUXCY is probably being used because the carry chain is a fast route
compared to general routing and can be used to make a wide OR function
with 2 or more LUTs. To a degree this may be the fastest way to get your
OR, but probably tempered with some imposed structure. As a guess, the
synthesiser is currently generating a number of 220-228 input OR gates
then putting the outputs together in another OR function.

John Adair
Enterpoint Ltd.
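
The layered-component idea in the post above can be sketched in Verilog as follows. This is an illustration only: the module names `or4` and `or16` are hypothetical, and only the bottom two layers are shown. The same pattern would repeat upward (four `or16` blocks plus one `or4` make a 64-input gate, and so on).

```verilog
// Leaf: one 4-input OR, which fits a single LUT4
module or4(in, out);
input  [3:0] in;
output       out;
assign out = |in;
endmodule

// Layer above: four or4 leaves plus one or4 to combine them (16 inputs)
module or16(in, out);
input  [15:0] in;
output        out;
wire   [3:0]  part;   // one partial result per leaf

or4 u0(in[3:0],   part[0]);
or4 u1(in[7:4],   part[1]);
or4 u2(in[11:8],  part[2]);
or4 u3(in[15:12], part[3]);
or4 u4(part,      out);
endmodule
```

Note that this only constrains the result if the synthesiser preserves the hierarchy. XST has a KEEP_HIERARCHY constraint that may be relevant here, but its effect on this design has not been checked.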

"mk" <kal*@dspia.*comdelete> wrote in message 
news:574jb2h3o7cv3viul4ghq7j3gt2pen6l2d@4ax.com...
> Hi everyone,
> I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as
> a first approximation, it would need 6 levels of 4 input lookup
> tables. So far I have tried XST but it seems to be using the initial
> 512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the
> MUXCYs? They seem to be quite fast at 45ns each but number of levels
> is quite high. I'm curious what the timing would look like if I could
> force it to use only LUT4s but I really don't want to code it by hand
> and I am too lazy to write a perl script to do it either. Any
> suggestions ?
>
> Thanks.
>
Your synthesiser is using the MUXCYs because it uses less resource (about
75% of the tree method) and is faster. If the MUXCY propagation delay was
45ns, I'd be worried, but it's really only 45ps! :-)

If you build a tree, it'll be slower. It's not just the LUT delay, it's all
that routing you need for a wide OR gate. To show it, you could try
synthesising a 2k XOR gate. Your synthesiser might struggle to implement
that with a carry structure.

HTH, Syms.

Symon wrote:
> "mk" <kal*@dspia.*comdelete> wrote in message
> news:574jb2h3o7cv3viul4ghq7j3gt2pen6l2d@4ax.com...
> > Hi everyone,
> > I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as
> > a first approximation, it would need 6 levels of 4 input lookup
> > tables. So far I have tried XST but it seems to be using the initial
> > 512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the
> > MUXCYs? They seem to be quite fast at 45ns each but number of levels
> > is quite high. I'm curious what the timing would look like if I could
> > force it to use only LUT4s but I really don't want to code it by hand
> > and I am too lazy to write a perl script to do it either. Any
> > suggestions ?
> >
> > Thanks.
> >
> Your synthesiser is using the MUXCYs because it uses less resource (about
> 75% of the tree method) and is faster. If the MUXCY propagation delay was
> 45ns, I'd be worried, but it's really only 45ps! :-) If you build a tree,
> it'll be slower. It's not just the LUT delay, it's all that routing you need
> for a wide OR gate. To show it, you could try synthesising a 2k XOR gate.
> Your synthesiser might struggle to implement that with a carry structure.

Is that 45 ps per LUT of the carry or 45 ps per CLB in the carry chain? If I use the 56 elements that the OP said, I get 2.52 ns total carry delay. That is pretty remarkable if it is correct. Increasing that to 45 ps per each of the 512 LUTs the carry delay is still only 23.04 ns. A combination approach combining say 16 LUTs with the carry then using an 8 input OR gate should be a bit faster. 16 carries is about the same speed as a LUT. I have not looked at the Virtex 4 architecture so I don't know for sure if this is needed or if the carry delay is 45 ps per CLB.
"rickman" <spamgoeshere4@yahoo.com> wrote in message 
news:1153146462.943222.181680@b28g2000cwb.googlegroups.com...
> Symon wrote:
>
> Is that 45 ps per LUT of the carry or 45 ps per CLB in the carry chain?
> If I use the 56 elements that the OP said, I get 2.52 ns total carry
> delay. That is pretty remarkable if it is correct.
>
Hi Rick,
Yes, that's 45ps per LUT. I believe the carry is actually implemented as a
two bit look ahead, so that each CLB is a two bit carry with delay of
90ps. But, now you mention it, I don't understand the 56 levels thing.
>
> Increasing that to 45 ps per each of the 512 LUTs the carry delay is
> still only 23.04 ns. A combination approach combining say 16 LUTs with
> the carry then using an 8 input OR gate should be a bit faster. 16
> carries is about the same speed as a LUT. I have not looked at the
> Virtex 4 architecture so I don't know for sure if this is needed or if
> the carry delay is 45 ps per CLB.
>
Thinking about it a bit harder, and after reading your post, I reckon the
synthesiser must be doing what you suggest, dividing the chain up into
sections, and oring together the output.

Cheers, Syms.

"Symon" <symon_brewer@hotmail.com> wrote in message 
news:44bba39a$1_2@x-privat.org...
> "rickman" <spamgoeshere4@yahoo.com> wrote in message
> news:1153146462.943222.181680@b28g2000cwb.googlegroups.com...
>> Symon wrote:
>>
>> Is that 45 ps per LUT of the carry or 45 ps per CLB in the carry chain?
>> If I use the 56 elements that the OP said, I get 2.52 ns total carry
>> delay. That is pretty remarkable if it is correct.
>>
> Hi Rick,
> Yes, that's 45ps per LUT. I believe the carry is actually implemented as a
> two bit look ahead, so that each CLB is a two bit carry with delay of
> 90ps. But, now you mention it, I don't understand the 56 levels thing.
>>
>> Increasing that to 45 ps per each of the 512 LUTs the carry delay is
>> still only 23.04 ns. A combination approach combining say 16 LUTs with
>> the carry then using an 8 input OR gate should be a bit faster. 16
>> carries is about the same speed as a LUT. I have not looked at the
>> Virtex 4 architecture so I don't know for sure if this is needed or if
>> the carry delay is 45 ps per CLB.
>>
> Thinking about it a bit harder, and after reading your post, I reckon the
> synthesiser must be doing what you suggest, dividing the chain up into
> sections, and oring together the output.
> Cheers, Syms.

More specifically, the synthesizer is probably splitting into two levels of
carry chains. Rather than 512 LUTs feeding a carry chain that's 128 rows
high (there are 2 carry chain paths in a CLB, 4 LUTs per carry chain), using
2 levels of carry chains, with the first at 5 MUXCY stages (32 inputs) and
the second at 6 MUXCY stages (64 inputs, specifying 64 initial carry
chains), the delay ends up being shorter still.

The Tbyp value, by the way, is about 103 ps in the Spartan3E (-5 speed
grade) and corresponds to 2 LUTs worth of carry chain since the bypass is on
a slice-by-slice basis.

*****
Dadgummit. The 8.2.01i speedprint numbers for Tbyp don't match my Timing
Analyzer numbers (which did seem to correspond in speedprint 8.1.03i). I've
submitted a case to Xilinx on this.
*****

In the Spartan3E -5 speed grade, for instance, using timing numbers from my
8.2.01i Timing Analyzer (a mixed bag of SliceM and SliceL values, so the
actual numbers will vary), the 6-level OR would end up

  Tcko+5*(Tnet+Tilo)+Tnet+Tfck
  = 0.567+6*Tnet+5*0.660+0.776
  = 4.643+6*Tnet

An average Tnet of 1 ns (routing to logic of 56% to 44%, which is much
better than what I'd expect for a wide distribution of inputs) gives

  = 10.643 ns

While a single carry chain across 128 CLB rows would be

  Tcko+Tnet+Topcyf+255*Tbyp+Tcinck
  = 0.567+Tnet+1.011+255*(0.103)+0.518
  = 28.361+Tnet

or probably under

  = 29.361 ns

Which is much worse than the tree, or than 2 levels of carry chains, which
would be

  Tcko+Tnet+Topcyf+2*Tbyp+Tnet+Topcyf+2*Tbyp+Tcinck
  = 0.567+Tnet+1.011+2*0.103+Tnet+1.011+2*0.103+0.518
  = 3.519 + 2*Tnet

or around

  = 5.519 ns

Two levels of carry chains use significantly fewer resources than an OR
tree while the delay is about half what the tree would need.

The key to the number of carry chains the tool generates for the longest
delay would be the number of Topcyf (or Topcyg) values in the path as
reported by Timing Analyzer.

Ain't optimization fun?

"Symon" <symon_brewer@hotmail.com> wrote in message
news:44bba39a$1_2@x-privat.org...

> Yes, that's 45ps per LUT. I believe the carry is actually implemented as a
> two bit look ahead, so that each CLB is a two bit carry with delay of 90ps.
> But, now you mention it, I don't understand the 56 levels thing.
> >
> > Increasing that to 45 ps per each of the 512 LUTs the carry delay is
> > still only 23.04 ns. A combination approach combining say 16 LUTs with
> > the carry then using an 8 input OR gate should be a bit faster. 16
> > carries is about the same speed as a LUT. I have not looked at the
> > Virtex 4 architecture so I don't know for sure if this is needed or if
> > the carry delay is 45 ps per CLB.
> >
> Thinking about it a bit harder, and after reading your post, I reckon the
> synthesiser must be doing what you suggest, dividing the chain up into
> sections, and oring together the output.

If you think about it just a tiny bit harder, the structure of the optimal
circuit comes down to an assessment of the relative performance of the LUT
delay + routing, and the carry chain delay. Intuitively, the best circuit
will have minimal disparity between the fastest and slowest path.

Say for the sake of argument that four stages of carry-OR takes as long as
one LUT-OR. Then an extremely coarse rendition of the fastest circuit to do
a big OR will look a bit like this (L = LUT, ^ = carry-mux OR, inputs [not
shown] on left):

  L-L-L-^   (top (result))
  L-L-L-^
  L-L-L-^
  L-L-L-^
    L-L-^
    L-L-^
    L-L-^
    L-L-^
      L-^
      L-^
      L-^
      L-^   (bottom)

The further up the carry chain you get, the more the inputs to the
carry-mux elements are just "waiting around" for the carry propagation.
Eventually it reaches the point where you can squeeze in an extra level of
LUTs in these higher stages, and thus reduce the total size of the carry
chain. Go further up, and you can afford two extra levels, and so on. I'd
hope that at least some tools are clever enough to exploit this.

(Note: in reality, the ratio of LUT:CY speed in this context is somewhere
in the 12:1 to 16:1 ballpark for most Xilinx architectures.)

Hope this makes sense... perhaps someone can take it a step further and
work out where the 56 levels thing really comes from (and thus deduce what
this particular synthesis tool believes the LUT:CY speed ratio is!).

Cheers,

-Ben-

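
To put rough numbers on the diagram above, here is a back-of-envelope count. It is a sketch under assumptions that are not stated in the post: every L is a LUT4, each run of k L's is read as a k-level LUT4 tree feeding one carry stage (so it covers 4^k inputs), and routing delay is ignored. With r carry stages per LUT delay, and K complete groups of r stages, the number of inputs covered is

$$
N(K) \;=\; r \sum_{k=1}^{K} 4^{k} \;=\; \frac{r\,(4^{K+1}-4)}{3}
$$

For the illustrative ratio r = 4 and the K = 3 groups drawn in the diagram:

$$
N(3) \;=\; 4\,(4 + 16 + 64) \;=\; 336 \ \text{inputs in 12 carry stages}
$$

A plain chain with a single LUT4 per stage would cover only 12 x 4 = 48 inputs in the same 12 stages. Continuing the pattern, N(4) = 1360, and one further stage fed by a five-level tree (1024 inputs) would pass 2048, so about 17 carry stages under these idealized assumptions. With the 12:1 to 16:1 ratio quoted in the note, the groups would be correspondingly longer.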
"Ben Jones" <ben.jones@xilinx.com> wrote in message 
news:e9gfcl$4mg1@cliff.xsj.xilinx.com...
> Hope this makes sense... perhaps someone can take it a step further and
> work out where the 56 levels thing really comes from (and thus deduce what
> this particular synthesis tool believes the LUT:CY speed ratio is!).
>
> Cheers,
>
> -Ben-
>
Hi Ben,
Thanks for that, it made sense to me. I think we might need to know what
part the design was in because the carry chain length is limited by the
number of rows in the FPGA. Smaller parts have smaller maximum length
chains.

Also, as a BTW, I see from the datasheet that the ORCY structure that was in
V2PRO has been dropped from the V4. That made wide gates even faster.

Cheers, Syms.

"John_H" <johnhandwork@mail.com> wrote in message 
news:t3Pug.5676$Oh1.1853@news01.roc.ny...
<snip>

> Which is much worse than the tree or for 2 levels of carry chains which
> would be
>
> Tcko+Tnet+Topcyf+2*Tbyp+Tnet+Topcyf+2*Tbyp+Tcinck
> = 0.567+Tnet+1.011+2*0.103+Tnet+1.011+2*0.103+0.518
> = 3.519 + 2*Tnet
> or around
> = 5.519 ns
>
> Two levels of carry chains use significantly fewer resources than an OR
> tree while the delay is about half what the tree would need.
>
> The key to the number of carry chains the tool generates for the longest
> delay would be the number of Topcyf (or Topcyg) values in the path as
> reported by Timing Analyzer.
>
> Ain't optimization fun?

I thought through this too quickly. The first stage in the example I was
drawing out could do 64-wide ORs with the first carry chain, which is 8
slices or 7*Tbyp, not 2*Tbyp. The second stage would be fed from 32 carry
chains, giving 4 slices of MUXCY-based OR for 3*Tbyp, not 2*Tbyp, so the
timing would be more like 6.137 ns, still significantly better than the LUT
tree.

I missed the 56 elements mentioned initially; this is probably just poor
partitioning, relying instead on a "maximum carry width" value.

I'd manually partition the OR into two sets based on the 2 levels of
carries. The generate can be used to shorthand the 32 intermediate values.
The KEEP attribute may be what's needed in XST - I use syn_keep=1 in the
Synplicity synthesizer. This synthesized okay but I didn't put a wrapper
around it to get it into a physical part (2k I/O is too much for me).

module orlt(clk, in, out);
input clk;
input [2047:0] in;
output out;

reg [2047:0] inr;
reg out;
wire outw;

orl u0(inr, outw);

// Register inputs and output so the OR is timed register to register
always @(posedge clk)
begin
        out <= outw;
        inr <= in;
end

endmodule

module orl(in, out);
input [2047:0] in;
output out;

// First level: 32 partial results, one per 64-input carry-chain OR
(* KEEP *) wire [31:0] mid;

generate
genvar i;
for( i=0; i<32; i=i+1)
begin : MUXCYtree
        assign mid[i] = |in[i*64 +: 64];
end
endgenerate

// Second level: OR the 32 partial results together
wire out = |mid[31:0];
endmodule
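
The corrected 6.137 ns figure at the top of this post can be reproduced from the Spartan-3E -5 numbers quoted earlier in the thread. The working below is a sketch that assumes the same average Tnet of 1 ns used for the earlier estimates, with 7 Tbyp in the first-level chain (64 inputs, 16 LUTs, 8 slices) and 3 Tbyp in the second (32 inputs, 8 LUTs, 4 slices):

$$
\begin{aligned}
T &= T_{cko} + T_{net} + T_{opcyf} + 7\,T_{byp} + T_{net} + T_{opcyf} + 3\,T_{byp} + T_{cinck} \\
  &= 0.567 + 2(1.011) + 10(0.103) + 0.518 + 2\,T_{net} \\
  &= 4.137 + 2\,T_{net} \\
  &\approx 6.137\ \text{ns} \quad (T_{net} = 1\ \text{ns})
\end{aligned}
$$

The two levels together cover 32 x 64 = 2048 inputs, which matches the partitioning in the code above.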