FPGARelated.com
Forums

Why the second flip-flop in Virtex-6?

Started by Nathan Bialke February 2, 2009
Hello,

In case anyone hasn't already seen, Xilinx has some preliminary
information about Virtex-6 and Spartan-6 online here -
http://www.xilinx.com/products/v6s6.htm .

I do have a question about Virtex-6 and its one LUT6/two flip-flop
architecture. I'm struggling to think of why a user would have any use
for that second flip-flop. It seems to me that the second flip-flop
only has use when the LUT6 is split into two LUT5s. However, the
family overview indicates that a dual-LUT5 has the same restriction as
in Virtex-5 - the inputs to the dual-LUT5s have to be the same. I know
in my designs I don't tend to get many LUT5s synthesized, so I'm not
sure how often that actually happens. The only other case I can even
think of is to use the second flip-flop as solely a storage element,
but without the ability to drive the clock enable input of the flop by
some sort of combinational signal (i.e., an address decode) without
"spending" the associated LUT6, its use seems very limited.

I am very cognizant of the fact that the people here and at Xilinx are
smarter than me. So, I figured that I'd give them a chance to explain
the design choice. I'm always interested in ways to use FPGA resources
more effectively.

Thanks!

- Nathan
Nathan Bialke wrote:
> Hello,
>
> In case anyone hasn't already seen, Xilinx has some preliminary
> information about Virtex-6 and Spartan-6 online here -
> http://www.xilinx.com/products/v6s6.htm .
>
> I do have a question about Virtex-6 and its one LUT6/two flip-flop
> architecture. I'm struggling to think of why a user would have any use
> for that second flip-flop. It seems to me that the second flip-flop
> only has use when the LUT6 is split into two LUT5s. However, the
> family overview indicates that a dual-LUT5 has the same restriction as
> in Virtex-5 - the inputs to the dual-LUT5s have to be the same. I know
> in my designs I don't tend to get many LUT5s synthesized, so I'm not
> sure how often that actually happens. The only other case I can even
> think of is to use the second flip-flop as solely a storage element,
> but without the ability to drive the clock enable input of the flop by
> some sort of combinational signal (i.e., an address decode) without
> "spending" the associated LUT6, its use seems very limited.
>
> I am very cognizant of the fact that the people here and at Xilinx are
> smarter than me. So, I figured that I'd give them a chance to explain
> the design choice. I'm always interested in ways to use FPGA resources
> more effectively.

I haven't looked at the new Xilinx architecture, but it sounds like the capabilities provided by shrinking geometries have reached the point of saying bye-bye to the highly regarded LUT4. We discussed this recently here and it was mentioned that as geometries continue to shrink, it will become more advantageous to provide more capability in the logic with relatively reduced routing. By "relatively" reduced, I mean they will not have as much routing per "gate equivalent" as the logic cells get more complex. It actually makes sense to do this and I am a bit surprised that it has taken this long to get to the LUT5/6 instead of the LUT4/5.

In the meantime they have added first memory blocks, then MACs, and finally more functional DSP units. Along the way the I/O has been souped up with high speed serdes, but that is not really related to the logic/routing mix.

The question of LUTn vs FF ratio is a tradeoff. Different designs have different needs. It is not at all uncommon for comms designs to use naked FFs for delay elements. Other designs might use only a fraction as many FFs as LUTs, although I don't know how this will change with 6 LUTs. It is very common for multiple 4 LUTs to feed a single FF. So it is not surprising that more FFs are included for a given number of 6 LUTs.

So what other functional blocks will be coming along as the densities continue to increase? Some MCU devices provide complex serial I/O units that can flexibly drive multiple serial I/O interface types. Likewise they often provide widely capable timer functions. These specific functions may not have wide utility in FPGAs, but I expect some types of generically capable logic other than LUTs, memory and DSP blocks will be identified and implemented. I think that LUTs can only go so far. The problem of selling the routing and giving the logic for free limits how low the price can get and therefore use in the highest volume applications.

Rick

Nathan Bialke <nathan@bialke.com> wrote:
 
> In case anyone hasn't already seen, Xilinx has some preliminary
> information about Virtex-6 and Spartan-6 online here -
> http://www.xilinx.com/products/v6s6.htm .

> I do have a question about Virtex-6 and its one LUT6/two flip-flop
> architecture. I'm struggling to think of why a user would have any use
> for that second flip-flop. It seems to me that the second flip-flop
> only has use when the LUT6 is split into two LUT5s. However, the
> family overview indicates that a dual-LUT5 has the same restriction as
> in Virtex-5 - the inputs to the dual-LUT5s have to be the same.

If the inputs to the two LUT5s are the same, that doesn't mean the outputs are the same. I could easily imagine designs that would take five inputs, put them through two different blocks of logic, and register the two outputs.

-- glen

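To make the shared-input case concrete, here is a minimal Python sketch of one LUT6 used as two LUT5s that see the same five inputs, with each output going to its own flip-flop. It is a simplified software model, not a Xilinx primitive or tool API: the function names, the choice of parity and majority as the two logic functions, and the layout (first function in the lower 32 table bits, second in the upper 32) are all assumptions made for illustration.

```python
def pack_dual_lut5(func_a, func_b):
    """Pack two 5-input truth tables into one 64-bit table (assumed layout)."""
    init = 0
    for addr in range(32):
        bits = tuple((addr >> i) & 1 for i in range(5))
        if func_a(bits):
            init |= 1 << addr          # lower 32 bits hold the first function
        if func_b(bits):
            init |= 1 << (32 + addr)   # upper 32 bits hold the second function
    return init


def read_dual_lut5(init, bits):
    """Look up both outputs for the same five input bits."""
    addr = sum(bit << i for i, bit in enumerate(bits))
    return (init >> addr) & 1, (init >> (32 + addr)) & 1


def parity(bits):
    """Hypothetical first function: XOR of the five inputs."""
    return sum(bits) % 2


def majority(bits):
    """Hypothetical second function: 1 when at least three inputs are 1."""
    return int(sum(bits) >= 3)


INIT = pack_dual_lut5(parity, majority)


def clock_edge(bits):
    """Values the two flip-flops would capture on a clock edge."""
    return read_dual_lut5(INIT, bits)


if __name__ == "__main__":
    print(clock_edge((1, 1, 1, 0, 0)))
```

Both outputs depend on the same five signals yet carry different information, so both flip-flops do useful work in this case.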
FYI, to obtain the number of logic cells in a Spartan-6 device, you
multiply the number of slices by 6.4, even though there are only 4
LUTs in each slice. Basically, Xilinx expects each 6-input LUT to be
1.6 times more efficient than a 4-input LUT. Simple arithmetic tells
us that it's 1.5, but I suppose that the people at Xilinx have their
reasons. Time will tell if that comparison is valid, though I suppose
it depends on the kind of logic you use, whether or not you can split
your logic into two 5-input LUTs, etc. Plus, only 25% of the slices will
be able to implement SRL32. I don't know why, but it remains to be seen
what kind of impact this will have.
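
The arithmetic behind those two numbers can be written out as a short Python sketch. The 4 LUTs per slice and the 6.4 multiplier come from the post above; the slice count of 1000 is a made-up example value, not a real device figure.

```python
LUTS_PER_SLICE = 4           # from the post: four 6-input LUTs per slice
LOGIC_CELLS_PER_SLICE = 6.4  # multiplier quoted in the post


def logic_cells(slices):
    """Logic-cell count as described in the post: slices times 6.4."""
    return slices * LOGIC_CELLS_PER_SLICE


# Weight given to one LUT6 relative to one LUT4 by the 6.4 multiplier.
implied_lut6_weight = LOGIC_CELLS_PER_SLICE / LUTS_PER_SLICE

# The "simple arithmetic" alternative: ratio of input counts, 6 / 4.
input_count_ratio = 6 / 4

if __name__ == "__main__":
    hypothetical_slices = 1000  # illustrative only
    print(logic_cells(hypothetical_slices))
    print(implied_lut6_weight, input_count_ratio)
```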

On the bright side, there are now integrated memory controllers for
DDR, DDR-II and DDR-3, apparently up to 800 MHz DDR.

So far, it seems promising.

My 2 cents

Nathan Bialke wrote:

> Hello,
>
> In case anyone hasn't already seen, Xilinx has some preliminary
> information about Virtex-6 and Spartan-6 online here -
> http://www.xilinx.com/products/v6s6.htm .
>
> I do have a question about Virtex-6 and its one LUT6/two flip-flop
> architecture. I'm struggling to think of why a user would have any use
> for that second flip-flop.

Hopefully the tools will be able to use it. :)

They have split to three slice types: Slice M, L, X - and the most comprehensive has SRL32, 2 x SRL16, 2 x 32 bit RAM - with those dual choices, it also points to a dual-output / dual flip-flop cell being needed to handle their outputs.

All sounds logical to me.

-jg

On 2 Feb., 21:33, Benjamin Couillard <benjamin.couill...@gmail.com> wrote:
> FYI, to obtain the number of logic cells in a Spartan-6 device, you
> multiply the number of slices by 6.4, even though there are only 4
> LUTs in each slice. Basically, Xilinx expects each 6-input LUT to be
> 1.6 times more efficient than a 4-input LUT. Simple arithmetic tells
> us that it's 1.5,

How is that? Why should a 6 LUT be 1.5 times a 4 LUT?

If all functions were equally likely to occur in a netlist, then the ratio would be 4x. (There are 2**(2**N) functions of N inputs. It can be shown that at least half of those functions require at least O(2**N) gates to be implemented.) Also, a 6-LUT is 4x as large as a 4-LUT.

On the other hand, adders are among the most common functions in real world circuits, and for these even a 4-LUT is wasted area and a 6-LUT gains nothing.

What you really need to compare is how many gates of a typical netlist can be covered by a single LUT on average.

The reason why it makes sense to spend 4x the logic area to cover only 1.6x the number of gates is that in today's FPGAs most of the area is spent on routing. So blowing up the logic does not hurt the area a lot and actually makes routing simpler.

Kolja Sulimma

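The counting argument can be checked with a few lines of Python. This is only a sketch of the numbers quoted in the post: an N-input LUT is modeled as a plain truth table with 2**N configuration bits, which ignores everything else in a real logic cell, and the function names are made up for illustration.

```python
def truth_table_bits(n_inputs):
    """Configuration bits in an N-input LUT modeled as a plain truth table."""
    return 2 ** n_inputs


def boolean_function_count(n_inputs):
    """Number of distinct Boolean functions of N inputs."""
    return 2 ** truth_table_bits(n_inputs)


# Storage ratio between a 6-LUT and a 4-LUT: 64 bits versus 16 bits.
storage_ratio = truth_table_bits(6) // truth_table_bits(4)  # 4

if __name__ == "__main__":
    for n in (4, 6):
        print(n, truth_table_bits(n), boolean_function_count(n))
    print("6-LUT / 4-LUT storage ratio:", storage_ratio)
```

The 4x figure is the storage cost. The 1.6x figure discussed in the thread is a claim about how much of a typical netlist each LUT covers, which this sketch does not attempt to measure.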
Kolja Sulimma <ksulimma@googlemail.com> wrote:
> On 2 Feb., 21:33, Benjamin Couillard <benjamin.couill...@gmail.com> wrote:
>> FYI, to obtain the number of logic cells in a Spartan-6 device, you
>> multiply the number of slices by 6.4, even though there are only 4
>> LUTs in each slice. Basically, Xilinx expects each 6-input LUT to be
>> 1.6 times more efficient than a 4-input LUT. Simple arithmetic tells
>> us that it's 1.5,

> What you really need to compare is how many gates of a typical netlist
> can be covered by a single LUT on average.

Well, the ratio of the number of 4LUTs to the number of 6LUTs to cover an average netlist. That makes it independent of the actual 'gate' count.

-- glen

On Feb 2, 3:33 pm, Benjamin Couillard <benjamin.couill...@gmail.com> wrote:
> FYI, to obtain the number of logic cells in a Spartan-6 device, you
> multiply the number of slices by 6.4, even though there are only 4
> LUTs in each slice. Basically, Xilinx expects each 6-input LUT to be
> 1.6 times more efficient than a 4-input LUT.

This is all too bizarre. No other company in the world counts "logic cells" or "slices". No one but Xilinx knows what a "logic cell" is and I don't know what a slice is in the Spartan 6 parts because I can't open the Spartan 6 overview in Acrobat 6. For whatever reason I can't install Acrobat Reader 8 and Sumatra PDF seems to be considered a virus by my AVS and it won't let it run.

Regardless, counting logic cells and slices is pointless. Counting LUT4s was a valid comparison because most FPGAs used them. But with the introduction of LUT6s in Altera and now Xilinx parts, the jig is up and we can't compare across families with these logic elements.

Trying to compare parts with different LUT sizes (or different mixtures of LUTs and FFs) depends too much on the details of your design. Different applications require different mixtures of logic and FFs, just like you can't compare processor speeds based on the CPU MHz or the memory bandwidth.

> Simple arithmetic tells
> us that it's 1.5, but I suppose that the people at Xilinx have their
> reasons.

You mean the marketing people?...

> Time will tell if that comparison is valid, though I suppose
> it depends on the kind of logic you use, whether or not you can split
> your logic into two 5-input LUTs, etc. Plus, only 25% of the slices will
> be able to implement SRL32. I don't know why, but it remains to be seen
> what kind of impact this will have.

Using a LUT as an SRL requires added logic in the LUT. No design will use a large fraction of the LUTs as SRLs. So why add the logic in all LUTs? The only real impact is in how to route to the SRLs that are implemented.

> On the bright side, there are now integrated memory controllers for
> DDR, DDR-II and DDR-3, apparently up to 800 MHz DDR.

Is that in the Virtex only or also the Spartan parts?

Rick

On 2 Feb, 21:59, rickman <gnu...@gmail.com> wrote:
> On Feb 2, 3:33 pm, Benjamin Couillard <benjamin.couill...@gmail.com> wrote:
>
> > FYI, to obtain the number of logic cells in a Spartan-6 device, you
> > multiply the number of slices by 6.4, even though there are only 4
> > LUTs in each slice. Basically, Xilinx expects each 6-input LUT to be
> > 1.6 times more efficient than a 4-input LUT.
>
> This is all too bizarre. No other company in the world counts "logic
> cells" or "slices". No one but Xilinx knows what a "logic cell" is
> and I don't know what a slice is in the Spartan 6 parts because I
> can't open the Spartan 6 overview in Acrobat 6. For whatever reason I
> can't install Acrobat Reader 8 and Sumatra PDF seems to be considered
> a virus by my AVS and it won't let it run.

Yeah, of course, you can't really compare a 6-input LUT with a 4-input LUT. Plus Altera seems to have a more flexible 6-input LUT than Xilinx, so I wonder how we can compare exactly 2 FPGAs from 2 different companies.

> Regardless, counting logic cells and slices is pointless. Counting
> LUT4s was a valid comparison because most FPGAs used them. But with
> the introduction of LUT6s in Altera and now Xilinx parts, the jig is
> up and we can't compare across families with these logic elements.
>
> Trying to compare parts with different LUT sizes (or different
> mixtures of LUTs and FFs) depends too much on the details of your
> design. Different applications require different mixtures of logic
> and FFs, just like you can't compare processor speeds based on the CPU
> MHz or the memory bandwidth.
>
> > Simple arithmetic tells
> > us that it's 1.5, but I suppose that the people at Xilinx have their
> > reasons.
>
> You mean the marketing people?...

Yeah, but I hope that they have benchmarks, like FFTs, memory controllers, MicroBlaze, etc., that will give us an idea of how efficient the new 6-input LUT is.

> > Time will tell if that comparison is valid, though I suppose
> > it depends on the kind of logic you use, whether or not you can split
> > your logic into two 5-input LUTs, etc. Plus, only 25% of the slices will
> > be able to implement SRL32. I don't know why, but it remains to be seen
> > what kind of impact this will have.
>
> Using a LUT as an SRL requires added logic in the LUT. No design will
> use a large fraction of the LUTs as SRLs. So why add the logic in all
> LUTs? The only real impact is in how to route to the SRLs that are
> implemented.

Yeah, you're right I guess.

> > On the bright side, there are now integrated memory controllers for
> > DDR, DDR-II and DDR-3, apparently up to 800 MHz DDR.
>
> Is that in the Virtex only or also the Spartan parts?
>
> Rick

Spartan-6 only. I suppose that they determined that the cost of adding embedded memory controllers was worth it. Furthermore, it seems to be a multiport memory controller, so the wrapping needed might be minimal. Of course, time will tell.

Nathan Bialke wrote:
> Hello,
>
> In case anyone hasn't already seen, Xilinx has some preliminary
> information about Virtex-6 and Spartan-6 online here -
> http://www.xilinx.com/products/v6s6.htm .
>
> I do have a question about Virtex-6 and its one LUT6/two flip-flop
> architecture. I'm struggling to think of why a user would have any use
> for that second flip-flop. It seems to me that the second flip-flop
> only has use when the LUT6 is split into two LUT5s. However, the
> family overview indicates that a dual-LUT5 has the same restriction as
> in Virtex-5 - the inputs to the dual-LUT5s have to be the same. I know
> in my designs I don't tend to get many LUT5s synthesized, so I'm not
> sure how often that actually happens. The only other case I can even
> think of is to use the second flip-flop as solely a storage element,
> but without the ability to drive the clock enable input of the flop by
> some sort of combinational signal (i.e., an address decode) without
> "spending" the associated LUT6, its use seems very limited.
>
> I am very cognizant of the fact that the people here and at Xilinx are
> smarter than me. So, I figured that I'd give them a chance to explain
> the design choice. I'm always interested in ways to use FPGA resources
> more effectively.
>
> Thanks!

Perhaps it is something to do with Altera having used a larger LUT connected to two FFs for some time now (since the Stratix II, but not on the Cyclones). Altera's is a bit more advanced (8-input LUT that can be split in many ways, but still with a maximum full LUT of 6 inputs), but it sounds a little "me too" from Xilinx. Of course, if it really does give faster or more compact designs, then "me too" is the right move!