Hi,

A thought crossed my mind ...

I've been working a lot on Virtex-4 lately, and getting fast (~300-350 MHz) logic for the datapath isn't really hard. But making the control stuff go that fast is a whole lot trickier: just a 10-bit comparator becomes "a lot" at that speed ... and some control signals have high fanout, which brings the net delay into the 1 - 1.5 ns range, about half of the period ...

So what if, every now and then in the FPGA fabric, there was a small cluster of, say, 1 CLB with "Super LUTs" that would have much faster logic (but no special functions like SRL and distributed RAM) and "bigger" drivers to charge/discharge the net faster to propagate the controls?

Maybe it's infeasible for some reason, it's just a thought ...

Sylvain
Adding "super-LUTs" to FPGA, good idea ?
Started by ●December 9, 2005
Reply by ●December 9, 2005
"Sylvain Munaut" <com.246tNt@tnt> wrote in message news:4399cf94$0$9070$ba620e4c@news.skynet.be...
> Hi,
>
> A thought crossed my mind ...
>
> I've been working much on Virtex4 lately and getting fast (~300-350 MHz)
> logic for the datapath isn't really hard. But making the control stuff
> go that fast is a whole lot more tricky, just a 10 bits comparator
> becomes "a lot" at that speed ... and some control signals have high
> fanout and that brings the net delay in the 1 - 1.5 ns range which is
> half of the period ...
>
> So what if every now and then in the FPGA fabric, there was a small
> cluster of like 1 CLB with "Super LUTs" that would have a whole lot
> faster logic (but no special func like SRL and distributed ram) and
> "bigger" drivers to charge/discharge the net faster to propagate the
> controls.
>
> Maybe it's un-feasible for some reason, it's just a thought ...
>
> Sylvain

I guess Altera would claim they have it in the Stratix ALM.

AL
(Antti Lukats)
Reply by ●December 9, 2005
>> So what if every now and then in the FPGA fabric, there was a small
>> cluster of like 1 CLB with "Super LUTs" that would have a whole lot
>> faster logic (but no special func like SRL and distributed ram) and
>> "bigger" drivers to charge/discharge the net faster to propagate the
>> controls.
>>
>> Maybe it's un-feasible for some reason, it's just a thought ...
>>
>> Sylvain
>
> I guess altera would claim they have it in the stratix ALm
> AL
> (Antti Lukats)

They do?
I'm gonna check that out ...

Sylvain
Reply by ●December 9, 2005
"Sylvain Munaut" <com.246tNt@tnt> wrote in message news:4399e34b$0$10953$ba620e4c@news.skynet.be...
>>> So what if every now and then in the FPGA fabric, there was a small
>>> cluster of like 1 CLB with "Super LUTs" that would have a whole lot
>>> faster logic (but no special func like SRL and distributed ram) and
>>> "bigger" drivers to charge/discharge the net faster to propagate the
>>> controls.
>>>
>>> Maybe it's un-feasible for some reason, it's just a thought ...
>>>
>>> Sylvain
>>
>> I guess altera would claim they have it in the stratix ALm
>> AL
>> (Antti Lukats)
>
> They do ?
> I'm gonna check that out ...
>
> Sylvain

Not quite, but they claim to have 7-input LUT capabilities for better logic optimization.

Antti
Reply by ●December 9, 2005
>> A thought crossed my mind ...
>>
>> I've been working much on Virtex4 lately and getting fast (~300-350 MHz)
>> logic for the datapath isn't really hard. But making the control stuff
>> go that fast is a whole lot more tricky, just a 10 bits comparator
>> becomes "a lot" at that speed ... and some control signals have high
>> fanout and that brings the net delay in the 1 - 1.5 ns range which is
>> half of the period ...
>>
>> So what if every now and then in the FPGA fabric, there was a small
>> cluster of like 1 CLB with "Super LUTs" that would have a whole lot
>> faster logic (but no special func like SRL and distributed ram) and
>> "bigger" drivers to charge/discharge the net faster to propagate the
>> controls.

I think if you look at the logic that is not making speed, it is probably using the carry chain (comparators over 7 bits do, for example). General logic is quite fast in V4. The carry chain is very slow by comparison, which has been a beef of mine. Simply speeding up the carry chain so that reasonably sized adders (16-24 bits) can run at speeds similar to the block RAMs and DSP slices would make all the difference. (Yes Austin, I know the "simply" isn't all that easy.)

You already do have "super LUTs" in the Virtex-4. They are called RAMB16, and can be used for logic functions with up to 14 inputs, at clock rates of 400 MHz in a -10 part.

The other option you have is to optimize your control logic to reduce the reliance on difficult structures such as carry. For example, if your control is using a compare to decode a count, consider instead using a down counter so that the terminal count is the most significant bit. Also consider other counter architectures, such as linear feedback shift register counters, to eliminate wide logic functions.
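The two counter suggestions above can be sketched as a small behavioural model. This is only an illustration in Python, not HDL and not anyone's actual design: the function names, the reload value of `n_cycles - 2`, and the 4-bit polynomial x^4 + x^3 + 1 are all assumptions picked for the example.

```python
def down_counter_flags(n_cycles, width, total_cycles):
    """Model a down counter whose terminal count is simply its MSB.

    The register is loaded with n_cycles - 2 and decremented every clock.
    When it wraps below zero the MSB becomes 1; that single bit is the
    terminal-count flag, so no wide comparator is needed to decode it.
    Assumes 1 <= n_cycles <= 2**(width - 1) + 1.
    """
    mask = (1 << width) - 1
    msb = 1 << (width - 1)
    reload = (n_cycles - 2) & mask
    count = reload
    flags = []
    for _ in range(total_cycles):
        tc = bool(count & msb)          # terminal count = MSB only
        flags.append(tc)
        count = reload if tc else (count - 1) & mask
    return flags


def lfsr4_step(state):
    """Advance a 4-bit Fibonacci LFSR (x^4 + x^3 + 1) by one clock.

    The next-state logic is a single 2-input XOR, with no carry chain.
    """
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & 0xF


def lfsr4_period(seed=1):
    """Count how many clocks it takes the LFSR to return to its seed."""
    state, steps = lfsr4_step(seed), 1
    while state != seed:
        state = lfsr4_step(state)
        steps += 1
    return steps
```

In the down-counter model the flag repeats every `n_cycles` clocks. The LFSR visits every non-zero 4-bit state before repeating, so it can stand in for a counter wherever the count order does not matter.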
Reply by ●December 9, 2005
Ray Andraka wrote:
>>> A thought crossed my mind ...
>>>
>>> I've been working much on Virtex4 lately and getting fast (~300-350 MHz)
>>> logic for the datapath isn't really hard. But making the control stuff
>>> go that fast is a whole lot more tricky, just a 10 bits comparator
>>> becomes "a lot" at that speed ... and some control signals have high
>>> fanout and that brings the net delay in the 1 - 1.5 ns range which is
>>> half of the period ...
>>>
>>> So what if every now and then in the FPGA fabric, there was a small
>>> cluster of like 1 CLB with "Super LUTs" that would have a whole lot
>>> faster logic (but no special func like SRL and distributed ram) and
>>> "bigger" drivers to charge/discharge the net faster to propagate the
>>> controls.
>
> You already do have "super LUTs" in the Virtex4. They are called
> RAMB16, and can be used for logic functions with up to 14 inputs, at
> clock rates of 400 MHz in a -10 part.

... Well, 400 MHz if you register both sides and don't have too much logic before and after. A block RAM without an output register has about 2.1 ns clock-to-out and around 0.5 ns net delay after it. With an output register it's 0.9 ns clock-to-out. But sometimes you just can't afford 1 or 2 clock cycles of latency ...

And here I was referring more to the drive strength than to the number of input nets. For example, if you have to generate a clock enable combinatorially (just a single LUT level, but still) and it controls something like 50 FFs, the net takes about 1.5 ns to propagate ... half of my period ...

> The other option you do have is to optimize your control logic to reduce
> the reliance on difficult structures such as carry. For example, if
> your control is using a compare to decode a count, consider instead
> using a down counter so that the terminal count is the most significant
> bit. Also consider other counter architectures, such as linear feedback
> shift register counters to eliminate wide logic functions.

Well, yes, optimizing control is good but sometimes very hard ... I've basically spent the last few days doing just that to finally meet timing.

My comparators are not for counters but to detect an "empty" condition in a FIFO-like block ("FIFO-like" because it's quite a bit more complicated than a simple FIFO).

Sylvain
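To put the figures in this post side by side, here is a rough timing-budget calculation. It is only a back-of-the-envelope sketch: the delays are the approximate numbers quoted above, the function names are made up for illustration, and setup time, clock skew and downstream logic are deliberately ignored.

```python
def period_ns(f_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / f_mhz


def slack_ns(f_mhz, clk_to_out_ns, net_ns):
    """Period left over after clock-to-out and net delay.

    Simplification: setup time, skew and any logic after the net
    are not modelled, so real slack would be smaller.
    """
    return period_ns(f_mhz) - clk_to_out_ns - net_ns


# Block RAM at 400 MHz, using the approximate delays quoted in the post.
unregistered = slack_ns(400, clk_to_out_ns=2.1, net_ns=0.5)  # negative: misses timing
registered = slack_ns(400, clk_to_out_ns=0.9, net_ns=0.5)    # positive: room to spare

# Share of the period consumed by a 1.5 ns high-fanout net.
share_at_300 = 1.5 / period_ns(300)
share_at_350 = 1.5 / period_ns(350)
```

With these inputs the unregistered block RAM path already exceeds the 2.5 ns period at 400 MHz, while the registered one leaves roughly a nanosecond, and the 1.5 ns net eats about half of the period in the 300-350 MHz range.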
Reply by ●December 10, 2005
Agreed about the BRAM speed. You pretty much have to use the DO_REG output register for a 400 MHz design in a -10 part. There shouldn't be any logic between the previous register and the inputs to the BRAM; the outputs can go through a single level of logic, but then placement is critical.

As I said, the real stumbling block for fast fabric stuff is the carry chain. If you are using an SX part, you can use the DSP48s to get faster arithmetic, but at a considerable cost. I stand by my contention that if the carry chains were faster (more specifically, the time to get on and off them), you'd probably find it a lot easier to make timing in your design.
Reply by ●December 10, 2005
Sylvain Munaut wrote:
> And here I was more referring to the drive strength than the number of
> input nets. For example if you have to generate a clock ena
> combinatorially (just a single LUT level but still) and it controls like
> 50 FFs, the net take like 1.5 ns propagation ... half of my period ...

I don't know if your initial idea is feasible or not, but it sounds good to me.

In the meantime, you can reduce the fanout by using logic duplication. If you duplicate the signal so that each copy drives only half the flip-flops, that should improve your timing (at an increased cost in area). You can do that in one of two ways:

1. Manually (in your code) create two signals, and set options so that your synthesis tool does not optimize away the redundant logic.
2. Turn on logic duplication, and hope the synthesis tool recognizes that the critical path can be improved by duplicating that piece of logic.

Fred
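As a rough picture of what duplication does to the fanout numbers, the sketch below splits the 50 flip-flops from the example across replicated drivers. It is a toy model only: `split_fanout` and the `ff0`..`ff49` names are hypothetical, and a real tool would group loads by placement rather than round-robin.

```python
def split_fanout(loads, copies):
    """Assign each load to one of `copies` replicated drivers, round-robin."""
    groups = [[] for _ in range(copies)]
    for i, load in enumerate(loads):
        groups[i % copies].append(load)
    return groups


# 50 flip-flops driven by one clock-enable signal, as in the example above.
flip_flops = [f"ff{i}" for i in range(50)]

# Duplicating the enable logic once halves the load on each copy.
groups = split_fanout(flip_flops, 2)
fanout_per_copy = [len(g) for g in groups]
```

Each copy of the enable now drives 25 loads instead of 50. How much net delay that actually saves depends on placement and routing, which this model does not attempt to capture.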
Reply by ●December 10, 2005
Sylvain Munaut wrote:
> So what if every now and then in the FPGA fabric, there was a small
> cluster of like 1 CLB with "Super LUTs" that would have a whole lot
> faster logic (but no special func like SRL and distributed ram) and
> "bigger" drivers to charge/discharge the net faster to propagate the
> controls.

Well, there are a couple of 14-input LUTs in their newer devices. The speed is about 2 ns in Virtex-4. They call them BRAMs.

Kolja Sulimma
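To make the "BRAM as a 14-input LUT" idea concrete: a block RAM with 14 address bits and 1 data bit is a 16384-entry truth table. The sketch below generates such a table for an assumed example function, an equality compare of two 7-bit operands packed into the address. The function names and the address packing are illustrative choices, not anything from the posts above.

```python
def build_rom(func, n_inputs=14):
    """Tabulate a 1-bit function of n_inputs bits as ROM contents."""
    return [1 if func(addr) else 0 for addr in range(1 << n_inputs)]


def eq7(addr):
    """Example function: compare two 7-bit operands packed into 14 bits."""
    a = addr & 0x7F            # low 7 address bits
    b = (addr >> 7) & 0x7F     # high 7 address bits
    return a == b


# One entry per address; entry is 1 exactly where the two operands match.
rom = build_rom(eq7)
```

Any function of up to 14 inputs fits the same way. A compare of two 10-bit operands would need 20 address bits and so would not fit in a single table like this, though a 10-bit compare against a constant would.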
Reply by ●December 10, 2005





