FPGARelated.com
Forums

LUT6 FPGAs and Carry Logic

Started by Jan Bruns February 14, 2012
Summary

This discussion explores the architectural transition from 4-input Look-Up Tables (LUT4) to 6-input LUTs (LUT6) in Xilinx FPGAs, specifically addressing changes in carry logic and slice structures like SLICEM and SLICEX.

The consensus is that while LUT6 architectures may seem to have more "wasted" logic in simple designs, they significantly optimize routing—the most expensive silicon resource—and improve software synthesis runtimes for complex modern designs.

  • Routing resources, rather than logic gates, comprise the majority of FPGA silicon area and cost.
  • LUT6 architectures are optimized to reduce the number of routing hops required for high fan-in logic.
  • The removal of certain carry-bypass options in newer LUT6 slices is offset by the efficiency gains in 6-input function mapping.
  • Research indicates that LUT sizes between 4 and 6 offer the best balance of area efficiency and timing performance.
  • Xilinx maintains LUT6 as the standard for newer families to ensure architecture consistency across Artix, Kintex, and Virtex lines.
Tags: FPGA Architecture, Xilinx Spartan, Logic Synthesis, Routing Optimization

Hello.

Some questions about Xilinx LUT6 FPGAs (my WebPack Toolchain is
a little outdated, and the newer LUT6-FPGAs don't seem to show
up correctly in fpga_editor).

* Is there really no carry-bypass option in LUT6-paths like the 
  CYMUX(any,cin,1) living in LUT4 paths, apart from
  constraining the LUTs?

* SLICEMs and SLICELs always have a carry chain, while SLICEXs 
  neither have a RAM nor a carry option?

* Within a CLB, SLICEMs are paired with SLICEXs (if there are
  SLICEXs in the device)? Sounds strange to me: if a LUT is
  configured to be dynamic, it is very likely that the additional
  carry logic goes unused, compared to static LUTs
  (with LUT4s, one rare reason for using it is to have the carry
  chain implement a post-invert option for the RAM...). Have you ever
  seen a dynamic LUT6 really gain something from also using carry?

* What about production? Does it look like Xilinx might stop selling
  and developing new LUT4-FPGAs in the near future? I personally
  don't have enough overview about these two FPGA classes, so I
  can't see the detailed pros and cons. 

Regards

Jan Bruns

-- 
A few photos: http://abnuto.de/gal/

Jan Bruns <jansaccount@arcor.de> writes:

> Hello.
>
> Some questions about Xilinx LUT6 FPGAs (my WebPack Toolchain is
> a little outdated, and the newer LUT6-FPGAs don't seem to show
> up correctly in fpga_editor).

The datasheets and user manuals show everything you would need, I think. See UG384, for example: pages 9-11 show the various slices in some detail.

> * Is there really no carry-bypass option in LUT6-paths like the
>   CYMUX(any,cin,1) living in LUT4 paths, apart from
>   constraining the LUTs?

I'm not sure what you mean by constraining the LUTs. There are various muxes shown in Figs. 3, 4 and 5 - can you achieve what you want with them?

> * SLICEMs and SLICELs always have a carry chain, while SLICEXs
>   neither have a RAM nor a carry option?

Correct.

> * Within a CLB, SLICEMs are paired with SLICEX (if there are
>   SLICEXs in the device)? Sounds strange to me: If a LUT is
>   configured to be dynamic, it is probably very likely that
>   additional Carry logic isn't used, compared to static LUTs
>   (with LUT4s, one rare reason to this is using the carry chain
>   to implement a post-invert option for the RAM...). Have you ever
>   seen a dynamic LUT6 really gain something in also using carry?

It seems to me that there are "big slices", "medium slices" and "small slices" - the silicon area taken up by the carry chain may well be "free" compared to the rest of the big/medium slices.

Additionally, SLICEMs can be used for dynamic filter-coefficient storage; the arithmetic logic is also useful then.

Xilinx will have pushed an awful lot of existing and potential designs through this architecture and decided it's a win overall. Whether it's a win for your particular designs and style is immaterial to them (unless you are an *enormous* customer!)

> * What about production? Does it look like Xilinx might stop selling
>   and developing new LUT4-FPGAs in the near future?

Selling... I doubt they'll stop selling Spartan 3 (for example) for a very long time yet - Xilinx have a long history of keeping old families going for many, many years after it was sensible to design them into new systems.

Developing... Spartan 3 (and E, A, ADSP) was the last LUT4 generation, so yes, I think it's stopped!

My understanding of the Series 7 goal is to make as much of the user-visible logic as possible identical across the three ranges (Artix, Kintex and Virtex). There are differences in power/speed tradeoffs and the mix of memory, DSP, gigabit IO, logic etc., but the fundamental blocks are the same throughout. That is unlike the V5/S3 era, when the LUTs, DSPs, BRAMs and IOs were all different between the two families!

> I personally don't have enough overview about these two FPGA classes,
> so I can't see the detailed pros and cons.

I'm not sure there's much to care about pros and cons. LUT6 is here, unless you want to design with relatively old chips.

Cheers,
Martin

-- 
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.co.uk/capabilities/39-electronic-hardware

Martin Thompson:
> Jan Bruns:
>> Some questions about Xilinx LUT6 FPGAs (my WebPack Toolchain is a
>> little outdated, and the newer LUT6-FPGAs don't seem to show up
>> correctly in fpga_editor).

> The datasheets and usermanuals show everything you would need I think
> though... see UG384 for example. Pages 9-11 show the various slices in
> some detail.

Thanks. I was looking at the wrong datasheets, then.

>> * Is there really no carry-bypass option in LUT6-paths like the
>> CYMUX(any,cin,1) living in LUT4 paths, apart from constraining the
>> LUTs?

> I'm not sure what you mean by constraining the LUTs. There are various
> muxes shown in Fig 3,4,5 - can you achieve what you want with them?

The select input of the main carry-select muxes is directly connected to the LUT output, without an option to put another signal on it. If you make use of the carry logic, the function you put on the LUT will always become part of the carry calculation.

Xilinx LUT4 FPGAs had an option to make the main carry-select mux always forward the cin to cout, no matter what the LUT said. This was pretty useful, because it was possible to have relatively huge logic feed the carry chain without ever crossing CLB boundaries.

For example, within a SLICE, it was possible to have one LUT act as a 16-bit RAM, and have it added (or whatever) to some external value on the other LUT. The RAM-LUT's output was not expected to connect directly to the carry logic, but it had relatively fast routes to the arithmetic.

However, I'd expect many reasons to use "partially populated" carry chains to be gone with LUT6.
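
The carry-mux behaviour described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the function names `muxcy` and `carry_chain` and the `bypass` flag are hypothetical and are not taken from any Xilinx primitive library. The model assumes each stage's carry mux passes cin when its select is 1 and emits a data input otherwise, and it treats the LUT4-era bypass as forcing that select to 1.

```python
def muxcy(sel: int, di: int, cin: int) -> int:
    """Carry mux: forward cin when sel is 1, otherwise emit the data input di."""
    return cin if sel else di


def carry_chain(lut_outputs, data_inputs, cin=0, bypass=None):
    """Propagate a carry through a chain of carry muxes, one per LUT.

    bypass[i] = True models the (assumed) LUT4-era option of forcing
    stage i's select to 1, so that stage forwards cin to cout no matter
    what its LUT computes. With bypass left as None, every select is
    driven by the LUT output, as described for the LUT6 slices.
    """
    if bypass is None:
        bypass = [False] * len(lut_outputs)
    carries = []
    for lut_out, di, skip in zip(lut_outputs, data_inputs, bypass):
        sel = 1 if skip else lut_out
        cin = muxcy(sel, di, cin)
        carries.append(cin)
    return carries


# Stage 0 generates a carry (LUT output 0, di = 1); stage 1 propagates it.
print(carry_chain([0, 1], [1, 0], cin=0))
# Same LUT contents, but stage 0 is bypassed and only forwards cin.
print(carry_chain([0, 1], [1, 0], cin=0, bypass=[True, False]))
```

In the second call the LUT of stage 0 is free to compute something unrelated to the carry, which is the "partially populated" use the post refers to.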

>> * Within a CLB, SLICEMs are paired with SLICEX (if there are
>> SLICEXs in the device)? Sounds strange to me: If a LUT is configured
>> to be dynamic, it is probably very likely that additional Carry logic
>> isn't used, compared to static LUTs (with LUT4s, one rare reason to
>> this is using the carry chain to implement a post-invert option for
>> the RAM...). Have you ever seen a dynamic LUT6 really gain something
>> in also using carry?

> It seems to me that there are "big slices", "medium slices" and "smal
> slices" - the silicon area taken up by the carry chain may well be
> "free" compared to the rest of the big/medium slices.

Hmm, sounds like that's only a theory of yours.

> Additionally, SLICEMs can be used for dynamic filter-coefficient
> storage, the arithmetic logic is also useful then.

Hm, what about details, then? A dynamically loadable LUT1 calculating "external signal xor stored bit"?

> Xilinx will have pushed an awful lot of existing and potential designs
> through this architecture and decided its a win overall.
> Whether it's a win for your particular designs and style is immaterial
> to them (unless you are an *enormous* customer!)

Compared to what? LUT4 vs. LUT6, given the same silicon process? What would you expect the term "win" to represent, then?

I don't believe there's no market for LUT4 FPGAs using a current silicon process.

>> * What about production? Does it look like Xilinx might stop selling
>> and developing new LUT4-FPGAs in the near future?

> Selling... I doubt they'll stop selling Spartan 3 (for example) for a
> very long time yet - Xilinx have a long history of keeping old families
> going for many many years after it was sensible to design them into new
> systems.

> Developing... Spartan 3(and E,A,ADSP) was the last LUT4 generation, so
> yes, I think it's stopped!

>> I personally don't have enough overview about these two FPGA classes,
>> so I can't see the detailed pros and cons.

> I'm not sure there's much to care about pros and cons. LUT6 is here,
> unless you want to design with relatively old chips.

Argh. So all these valuable customers have to rework all parts of their highly optimized, huge module databases, just because Xilinx engineers thought it might be less work for them to put LUT6 in silicon?

Regards

Jan Bruns

-- 
A few photos: http://abnuto.de/gal/

On Feb 15, 10:21 pm, Jan Bruns <jansacco...@arcor.de> wrote:

> I don't believe there's no market for LUT4 FPGAs using current
> silicon process.

Market: maybe. Reasonable facts to support such an architecture: no.

The problem here is that users tend to evaluate the capabilities of an FPGA mainly as logic, while really you pay mostly for routing. Logic is a very small portion of the silicon area. Of course the vendors don't publish the numbers, but university research suggests the area of the LUT and the LUT configuration is only a few percent of total area.

Therefore, when going from 4-LUT to 6-LUT you don't get a 4x area increase (16 entries to 64 entries) but more like a 60% increase (going from 4 inputs that must be routed to 6 inputs that must be routed, with routing area growing somewhat worse than linearly). This is offset by the fact that routing now gets a lot simpler.

Routing increases faster than linearly with the number of wires. Therefore, with bigger FPGAs the percentage of logic goes down. The optimum LUT size therefore tends to go up with technology improvements.

Research shows that the efficiency curve for FPGA technologies is relatively flat around the optimum. That is, for a given technology there are multiple LUT sizes that get you almost the same area efficiency. Because performance tends to be better for the larger LUTs, and because the software runtimes go down for larger LUTs (mapping is polynomial time, routing exponential), a typical design decision would be to choose the largest LUT size within the flat region of the curve, expecting that future implementations of the architecture would move the optimum spot in that direction.

This is exactly what FPGA vendors did: in the early 90s the sweet spot was consistently shown to be between 3-LUTs and 4-LUTs, so most vendors chose 4-LUTs. Newer research shows the flat region to run from 4-LUTs to 6-LUTs. While 4-LUTs would probably still be a good choice, it is clear that there must be a switch to 6-LUTs at some time, and one might just as well make the switch now and get much better EDA software run times.

Kolja Sulimma
cronologic.de
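
The "4x the entries but only about 60% more area" argument can be put into a toy model. This is a rough sketch, not a measured result: the function names are hypothetical, and both the 3% logic share and the routing exponent of 1.16 are assumptions picked only so that the numbers land near the "few percent" and "60%" figures quoted in the post.

```python
def lut_config_bits(k: int) -> int:
    """Number of configuration bits (truth-table entries) in a k-input LUT."""
    return 2 ** k


def relative_tile_area(k: int, base_k: int = 4,
                       logic_fraction: float = 0.03,
                       routing_exponent: float = 1.16) -> float:
    """Area of a k-LUT tile relative to a base_k-LUT tile (toy model).

    Assumptions, not data: the LUT itself takes `logic_fraction` of the
    base tile and scales with its configuration bits; the rest is routing
    and scales as (inputs ratio) ** routing_exponent, i.e. slightly worse
    than linearly in the number of inputs to be routed.
    """
    logic = logic_fraction * lut_config_bits(k) / lut_config_bits(base_k)
    routing = (1.0 - logic_fraction) * (k / base_k) ** routing_exponent
    return logic + routing


print(lut_config_bits(4), lut_config_bits(6))   # truth-table sizes
print(round(relative_tile_area(6), 2))          # area relative to a 4-LUT tile
```

Under these assumed parameters the LUT itself quadruples while the tile as a whole grows by roughly two thirds, because almost all of the area sits in the routing term.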
Jan Bruns <jansaccount@arcor.de> writes:

> Martin Thompson:
>> Jan Bruns:
>
>>> * Is there really no carry-bypass option in LUT6-paths like the
>>> CYMUX(any,cin,1) living in LUT4 paths, apart from constraining the
>>> LUTs?
>
>> I'm not sure what you mean by constraining the LUTs. There are various
>> muxes shown in Fig 3,4,5 - can you achieve what you want with them?
>
> The select-Input of the main Carry-Select Muxes is directly connected
> to the LUT-output, without an option to put another signal on it.
> If you make use of the Carry Logic, the function you put on the LUT
> will always become part of the Carry calculation.
>
> Xilinx LUT4 FPGAs had an option to make the main CarrySelect Mux always
> forward the cin to cout, no matter what the LUT said. This was pretty
> useful, because it was possible to make relatively huge logic feed
> the Carry Chain, without ever crossing CLB boundaries.
>
> For example, within a SLICE, it was possible to have one LUT act as
> 16-bit RAM, and have it added (or whatever) to some external value on
> the other LUT. The RAM-LUTs output was not expected to directly
> connect to the Carry Logic, but had relatively fast routes to
> the arithmetic then.
>
> However, I'd expect many reasons to use "partial populated"
> carry chains to be gone with LUT6.

Yes, I agree. No doubt there will be *some* designs which don't work out so well in the newer architectures.

>>> * Within a CLB, SLICEMs are paired with SLICEX (if there are
>>> SLICEXs in the device)? Sounds strange to me: If a LUT is configured
>>> to be dynamic, it is probably very likely that additional Carry logic
>>> isn't used, compared to static LUTs (with LUT4s, one rare reason to
>>> this is using the carry chain to implement a post-invert option for
>>> the RAM...). Have you ever seen a dynamic LUT6 really gain something
>>> in also using carry?
>
>> It seems to me that there are "big slices", "medium slices" and "smal
>> slices" - the silicon area taken up by the carry chain may well be
>> "free" compared to the rest of the big/medium slices.
>
> Hmn, sounds like that's only one theory of yours.

Well, yes, it is - you'll have to wait for someone from Xilinx for anything better than that :)

>> Additionally, SLICEMs can be used for dynamic filter-coefficient
>> storage, the arithmetic logic is also useful then.
>
> Hm, what about details, then?

Well, I only offer it as a possibility (I haven't done an actual comparison), but distributed-arithmetic FIR filters were what I was thinking of.
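
For readers unfamiliar with distributed arithmetic, the idea can be sketched in a few lines. This is a minimal illustration, assuming unsigned input samples and a single table covering all taps; the function names are hypothetical and nothing here is taken from a Xilinx core. The table plays the role of the reloadable LUT RAM holding coefficient sums, and the shift-and-add accumulation is the part that would use the arithmetic (carry) logic.

```python
def build_da_table(coeffs):
    """Precompute, for every address, the sum of the coefficients whose
    address bit is set. With n taps the table has 2**n entries, so
    4 taps fit a 16-entry table and 6 taps a 64-entry table."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(2 ** n)]


def da_dot_product(table, samples, width):
    """Bit-serial dot product of unsigned `width`-bit samples with the
    coefficients baked into `table` (one table lookup per bit position)."""
    acc = 0
    for bit in range(width):
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> bit) & 1) << i   # gather bit `bit` of every sample
        acc += table[addr] << bit           # weight the partial sum by 2**bit
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, 9, 0, 15]
table = build_da_table(coeffs)
direct = sum(c * x for c, x in zip(coeffs, samples))
print(da_dot_product(table, samples, width=4), direct)
```

Changing the filter coefficients at run time means rewriting the table contents, which is where a writable LUT next to an adder would come in.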

>> Xilinx will have pushed an awful lot of existing and potential designs
>> through this architecture and decided its a win overall.
>> Whether it's a win for your particular designs and style is immaterial
>> to them (unless you are an *enormous* customer!)
>
> Compared to what? LUT4 vs. LUT6, given the same silicon process?
> What would you expect the term "win" to represent, then?

Don't ask me - I'm not making the decisions. Ultimately, Xilinx presumably decided it was a "win" in business terms: "We'll make the most money doing it this way."

> I don't believe there's no market for LUT4 FPGAs using current
> silicon process.

No-one is saying there is not a market. Just that it's not big enough for Xilinx to be targeting it.

>>> * What about production? Does it look like Xilinx might stop selling
>>> and developing new LUT4-FPGAs in the near future?
>
>> Selling... I doubt they'll stop selling Spartan 3 (for example) for a
>> very long time yet - Xilinx have a long history of keeping old families
>> going for many many years after it was sensible to design them into new
>> systems.
>
>> Developing... Spartan 3(and E,A,ADSP) was the last LUT4 generation, so
>> yes, I think it's stopped!
>
>>> I personally don't have enough overview about these two FPGA classes,
>>> so I can't see the detailed pros and cons.
>
>> I'm not sure there's much to care about pros and cons. LUT6 is here,
>> unless you want to design with relatively old chips.
>
> Argh. So all these valuable customers have to rework all parts of
> their highly optimized, huge module database,

That's progress :) This is how bare-metal assembly-language programmers felt as processors developed and their highly tuned routines needed to be rewritten. Of course, the processors were faster and compilers were better, so the smart ones just wrote straightforward, portable C code, which turned out to be good enough most of the time. And that code was much more reusable.

> just because Xilinx engineers thought it might be less work for them
> to ever put LUT6 in silicon?

I'm sure it wasn't done on a whim! There are sound business reasons for how it's been done. It sounds like they just don't fit what you'd like :(

Cheers,
Martin

-- 
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.co.uk/capabilities/39-electronic-hardware

Martin Thompson <martin.j.thompson@trw.com> wrote:

(snip)
> Don't ask me - I'm not making the decisions. Ultimately, Xilinx
> presumably decided it was a "win" in business terms: "We'll make the
> most money doing it this way."

Well, they do have some competition. If they don't design and build what works for their customers, they will lose out.

>> I don't believe there's no market for LUT4 FPGAs using current
>> silicon process.

> No-one is saying there is not a market. Just that it's not
> big enough for Xilinx to be targetting it.

As I understand it, 6LUT is better for larger chips.

For smaller ones, it likely doesn't make so much difference. There is some advantage, as far as synthesis software goes, in keeping the number of different architectures to a minimum.

Still, 4LUT chips should be around for a while.

-- glen

Kolja Sulimma:
> The problem here is that users tend to evaluate the
> capabilites of an FPGA mainly as logic, while really
> you pay mostly for routing. Logic is a very small
> portion of the silicon area. Of course the vendors
> don't publish the numbers, but university research
> suggests the area of LUT and LUT configuration is
> only a few percent of total area.

That's what I expected. This becomes pretty obvious if you imagine a LUT2 FPGA, where everyone should intuitively understand that the entire silicon would be filled up with routing resources. And LUT4 can't be far off.

> Therefore when going from 4-LUT to 6-LUT you don't
> get a 4x area increase (16 entries to 64 entries)
> but more like a 60% increase (going from 4 inputs
> that must be routed to 6 inputs that must be routed
> in a somewhat worse than linear routing area).

So let's compare Spartans:

Spartan6  LUT6: about  7 ins, about 3 outs = 10 ports
Spartan3 Slice: about 10 ins, about 6 outs = 16 ports

Here the port count for the Spartan3 Slice doesn't include the FXMUX path, but does include the full XB/YB (I doubt this path has/needs full routing capabilities, anyway).

So from what you said about area when taking routing resources into account, the Spartan3 Slice might very well consume a little more area, although it has only about half the SRAM bits.

What do we get for that? For SLICEL, I think of:

Function                          LUT4                         LUT6
2*any 4 inp-func                  yes                          no
2*any 4 inp-func, paired invert   yes                          no
any 5-inp func                    yes                          yes
any 6-inp func                    no                           yes
MUX4                              yes                          yes
half/partial populated Carry      yes                          no
2 Bit full Adder                  yes                          yes
2 Bits of long Adder              yes                          one, but 2?
2 Bits of long MulAdder           yes                          one, but 2?
1 Bit ALU (fast Carry)            maybe                        maybe
 --with dual Ext-feedin           yes (paired with DPram)      no
Large Chain Logic                 8Bit/Slice                   6Bit/LUT
DblLUTed Chain Logic              no, BX only                  yes

For SLICEM, I also think of:

Function                          LUT4                         LUT6
64x1 RAM                          no                           yes
32x2 RAM                          no                           yes
32x1 RAM                          yes                          yes
16x2 RAM                          yes                          yes
16x1 RAM+Adder                    yes                          no

Well, for the SLICEM part, the LUT6 might be a better choice, but for SLICEL, I'd still prefer the LUT4, given 50% area overhead, although I'm missing a little more in the way of static MUXes and FF paths (independent clock inverters, or something).

Regards

Jan Bruns

-- 
A few photos: http://abnuto.de/gal/

On Feb 16, 7:07 am, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> Martin Thompson <martin.j.thomp...@trw.com> wrote:
>
> (snip)
>
> > Don't ask me - I'm not making the decisions. Ultimately, Xilinx
> > presumably decided it was a "win" in business terms: "We'll make the
> > most money doing it this way."
>
> Well, they do have some competition. If they don't design
> and build what works for their customers, they will lose out.
>
> >> I don't believe there's no market for LUT4 FPGAs using current
> >> silicon process.
> > No-one is saying there is not a market. Just that it's not
> > big enough for Xilinx to be targetting it.
>
> As I understand it, 6LUT is better for larger chips.
>
> For smaller ones, it likely doesn't make so much difference.
> There is some advantage as far as synthesis software of
> keeping a minimum number of different architectures.
>
> Still, 4LUT chips should be around for a while.
>
> -- glen

I believe that is what it comes down to. Given the fact that routing is a huge percentage of the chip area (and so cost), this becomes a more important factor as the chips get larger. After all, routing does go up at a faster rate than linear. So minimizing routing is more important in larger chips. The tradeoff provides for lower costs with LUT6 in larger devices.

The other side of the coin is more "wasted" logic when larger LUTs are underutilized. So it would seem that we have reached the point where the LUT6 is optimal for many, if not the vast majority, of designs.

I don't know that there is a performance penalty in using LUT6. I would expect that it is minimal, since the muxes in the LUTs are done with transmission gates with very little delay, but I don't really know. If so, the only issue then becomes cost. So if your design is one of the minority of designs that can indeed be done more efficiently in a LUT4 architecture, then you will pay a bit more for a LUT6-based part... but given the advantages of smaller feature size, you will likely get lower costs with the newer parts than by sticking with an old generation.

As to design reworks required to optimize a design for a newer part, I expect that would be done for speed and/or cost. My experience is that Xilinx is more than willing to help you with that, especially if it means a design win over a competitor. But would anyone really expect much lost ground going from a LUT4 design to a current LUT6 design? Software changes can greatly impact results, but I can't see needing to touch a design from a Spartan 3 to get it to run well in a newer device, given the large improvements in the hardware from using a much smaller process. I suppose if you have used hard constraints you may have to remove them. But you knew the risk when you used those features, no?

Rick

rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)
>> For smaller ones, it likely doesn't make so much difference.
>> There is some advantage as far as synthesis software of
>> keeping a minimum number of different architectures.

(snip)

> I believe that is what it comes down to. Given the fact that routing
> is a huge percentage of the chip area (and so cost) this becomes a
> more important factor as the chips get larger. After all, routing
> does go up at a faster rate than linear. So minimizing routing is
> more important in larger chips. The tradeoff provides for lower
> costs with LUT6 in larger devices.

> The other side of the coin is more "wasted" logic when larger LUTs are
> underutilized. So it would seem that we have reached the point where
> the LUT6 is optimal for many if not the vast majority of designs.

One case that I am interested in, though, is that 6LUT should be much better for building the MUX needed for barrel shifters. A 4LUT makes a two-input MUX, but a 6LUT can make a four-input MUX (four data inputs plus two select lines). Other than that, I haven't thought much about how useful different sizes are. The less logic between FFs, the less advantage to larger ones.
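
The mux point can be made concrete by generating the LUT truth table for a mux and counting LUTs in a logarithmic barrel shifter. This is only a sketch: the function names are hypothetical, the input ordering (data inputs on the low address bits, select lines on the high bits) is an assumption, and the LUT count ignores any dedicated slice muxes that a real device might also offer.

```python
import math


def mux_init(select_bits: int) -> int:
    """Truth table (INIT value) of a 2**select_bits-to-1 mux in one LUT.

    Assumed input ordering: data inputs on the low address bits,
    select lines on the high address bits.
    """
    n_data = 2 ** select_bits
    init = 0
    for addr in range(2 ** (n_data + select_bits)):
        sel = addr >> n_data            # which data input is selected
        if (addr >> sel) & 1:           # value of that data input
            init |= 1 << addr
    return init


def barrel_shifter_luts(width: int, select_bits_per_lut: int) -> int:
    """LUT count for a log-stage barrel shifter, one mux LUT per bit per stage."""
    shift_bits = math.ceil(math.log2(width))
    stages = math.ceil(shift_bits / select_bits_per_lut)
    return width * stages


print(hex(mux_init(1)))               # 2:1 mux, uses 3 LUT inputs
print(hex(mux_init(2)))               # 4:1 mux, uses all 6 LUT inputs
print(barrel_shifter_luts(32, 1))     # 2:1 mux per LUT (4LUT case)
print(barrel_shifter_luts(32, 2))     # 4:1 mux per LUT (6LUT case)
```

For a 32-bit shifter this simple count gives five stages of 2:1 muxes against three stages of 4:1 muxes, so fewer LUTs and fewer routing hops per bit in the 6LUT case.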

> I don't know that there is a performance penalty in using LUT6. I
> would expect that is minimal since the muxes in the LUTs are done with
> transmission gates with very little delay, but I don't really know.
> If so, the only issue then becomes cost. So if you design is one of
> the minority designs that can indeed be done more efficiently in a
> LUT4 architecture, then you will pay a bit more for a LUT6 based
> part... but given the advantages of smaller feature size you will
> likely get lower costs with the newer parts than sticking with an old
> generation.

Well, they have to be designed not to glitch when switching between entries with the same output value. That doesn't naturally happen with an SRAM. Also, with transmission gates you can't go through too many without a buffer, but presumably that is part of optimizing the cell.

-- glen

On Feb 16, 4:50 pm, Jan Bruns <jansacco...@arcor.de> wrote:

> Spartan6  LUT6: about  7 ins, about 3 outs = 10 ports
> Spartan3 Slice: about 10 ins, about 6 outs = 16 ports
>
> So from what you said about area with taking routing
> resources into account, the Spartan3 Slice might very
> well consume a little more area, although it has only
> about half the SRAM bits.

No. It will consume a lot more area if you include routing. Routing grows faster than linearly (look up "Rent exponent"). Of course it can cover more flexible circuit arrangements, because you can choose many more combinations of input signals with two 4-LUTs than with one 6-LUT (except if you have high-fan-in random logic). But the area is much larger.

The point is: it does not matter if a LUT-6 on average has lower utilization, as LUT area is virtually free. What matters is routing utilization.

There is research that clearly shows that, from an efficiency standpoint, the best FPGAs are those that can't achieve 100% LUT utilization because they have sparse routing. The reasons why vendors choose to provide lots of routing anyway are:

a) Customers don't understand this and tend to start whining when they don't get 100% LUT utilization, instead of being happy that they get better wire utilization. (Remember: wires are the expensive part.)
b) It gets hard to predict what can be implemented and what can't.
c) Software gets harder to write and slower with worse routing resources.

So you pay a premium to be able to reliably plan your design and to simplify marketing.

Back to LUT size: have a look at figure 3.3 in this:
http://www.eecg.utoronto.ca/~jayar/pubs/theses/Ahmed/EliasAhmed.pdf
Area is virtually constant in that analysis for LUT sizes from 4 to 6. But with LUT size 6 you get much better software runtimes.

Kolja
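
As a pointer for the "Rent exponent" remark, Rent's rule relates the number of external terminals T of a block of logic to the number of cells G inside it as T = t * G**p. A commonly cited follow-on estimate (usually attributed to Donath) is that, for p above 0.5 on a two-dimensional layout, average wire length grows roughly as G**(p - 0.5), so total wiring grows as G**(p + 0.5). The sketch below only evaluates those two formulas; the function names are hypothetical, and the values t = 4 and p = 0.7 are illustrative assumptions, not figures from the thread or from any vendor.

```python
def rent_terminals(cells: float, t: float = 4.0, p: float = 0.7) -> float:
    """Rent's rule: external terminals of a block containing `cells` cells.
    t (terminals per cell) and p (Rent exponent) are assumed values."""
    return t * cells ** p


def relative_total_wirelength(cells: float, p: float = 0.7) -> float:
    """Total wirelength up to a constant factor, assuming average wire
    length scales as cells ** (p - 0.5); this form only applies for p > 0.5."""
    if p <= 0.5:
        raise ValueError("this estimate assumes a Rent exponent above 0.5")
    return cells * cells ** (p - 0.5)


small, large = 1_000, 4_000
growth = relative_total_wirelength(large) / relative_total_wirelength(small)
print(rent_terminals(small), rent_terminals(large))
print(round(growth, 2))   # wiring growth for a 4x increase in cell count
```

With these assumed values, quadrupling the cell count more than quadruples the estimated wiring, which is the sense in which routing "grows faster than linearly".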