HDL tricks for better timing closure in FPGAs

Started by JeDi June 5, 2008
Hi,

I am working with FPGAs and trying to take advantage of the
parallelism that is available in them. My design is in Verilog HDL and
it uses a lot of the FPGA's resources: ~80% of the
available LABs (or LEs or ALE), ~50% of the interconnect and ~80% of the
available on-chip memory. I am also required to run this design at a
very high clock rate, almost equal to the on-chip memory clock rate. With
utilization this high, it's becoming very difficult to meet the timing
requirements !!

So, my question is regarding Verilog HDL coding - Is there any
recommended coding style which improves timing closure (or in other
words makes it easier for the tools to meet timing ) ??

If yes, please point me to the right location.

Thanks in advance.
JeDi
"JeDi" <jaydev.shelat@gmail.com> wrote in message 
news:5927360c-b4ca-488f-9019-cee1a0ed569f@a9g2000prl.googlegroups.com...
> So, my question is regarding Verilog HDL coding - Is there any
> recommended coding style which improves timing closure (or in other
> words makes it easier for the tools to meet timing ) ??

No, there are no 'coding styles' in any language that will do anything to
improve clock cycle performance.

To improve timing you change your design to pipeline the processing. To
pipeline you break up a 'big' computation (i.e. one that takes a long time
and therefore becomes a critical timing path) into smaller ones that take
multiple clock cycles.

Kevin Jennings
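
A minimal Verilog sketch of the pipelining described above, assuming a hypothetical four-input adder. The module name, port names and widths are illustrative and do not come from the thread. The single long add is split into two shorter adds, each followed by a register.

```verilog
// Hypothetical example: sum of four operands, pipelined into two stages.
// Names and widths are assumptions made for illustration only.
module sum4_pipelined #(
    parameter W = 16
) (
    input  wire         clk,
    input  wire [W-1:0] a,
    input  wire [W-1:0] b,
    input  wire [W-1:0] c,
    input  wire [W-1:0] d,
    output reg  [W+1:0] sum   // two extra bits so the total cannot overflow
);
    // Stage 1 registers: each holds one partial sum (one extra carry bit).
    reg [W:0] ab;
    reg [W:0] cd;

    always @(posedge clk) begin
        // Stage 1: two independent adds, one adder deep each.
        ab  <= a + b;
        cd  <= c + d;
        // Stage 2: combine the partial sums registered on the previous clock.
        sum <= ab + cd;
    end
endmodule
```

The result appears two clocks after the inputs instead of one, and the stage 1 registers cost extra flip-flops, so the shorter register-to-register path is paid for in latency and area.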
I disagree; however, I would include 'pipelining' as part of the coding 
style/trick.  You can also try to code such that the critical path(s) will 
have small enough blocks of logic between flip-flops to enable timing 
closure.  You may add attributes to signals to try to coerce the synthesizer 
into doing 'the right thing'; if it doesn't come automatically, you might do 
some low-level coding, synthesize that, and then use the resultant EDIF file 
as a black box to the next level up.

Before troubling with all that, though, do some bottom-up evaluations, 
particularly of things you feel will have trouble meeting timing.  The tools 
will often produce sub-optimal solutions when trying to solve many 
simultaneous, conflicting requirements (resource location, timing, ...), and 
thus have trouble for the total design.  Generally, the timing performance 
achieved at the chip level is less optimal than at the block level, so make 
sure your blocks will meet timing.  If they do, and the whole doesn't, you 
might try incremental design techniques, where you solve one problem, build 
on it for the next, etc...  If they don't, you can try re-coding and/or 
re-architecting to get the block(s) to meet timing, and then try the whole. 
Solve the relatively simple problems first... and sometimes the big problems 
become simple.

Other (non-coding) tricks:  location constraints, multi-cycle constraints, 
...

JTW
Hi jtw !!

"I disagree; however, I would include 'pipelining' as part of the
coding style/trick.  You can also try to code such that the critical
path(s) will have small enough blocks of logic between flip-flops to
enable timing closure."

I tried pipelining - mainly to break large combinational blocks into
smaller ones. But the problem I run into is that the logic
utilization increases by almost 15-20% on top of the already high
utilization !!! This creates a situation where the tool is not able to
place everything close enough to meet timing, because the
interconnect delay of the FPGA now becomes the bottleneck !!!

"You may add attributes to signals to try to coerce the synthesizer
into doing 'the right thing'; if it doesn't come automatically, you
might do some low-level coding, synthesize that, and then use the
resultant edif file as a black box to the next level up"

This is what I am trying now, to see what impact it has -
fingers crossed !!! Thanks for the suggestion though.

"The tools will often produce sub-optimal solutions when trying to
solve many simultaneous, conflicting requirements (resource location,
timing, ...), and thus have trouble for the total design.  Generally,
the timing performance achieved at the chip level is less optimal than
at the block level, so make sure your blocks will meet timing. Other
(non-coding) tricks:  location constraints, multi-cycle constraints,"

So true !! I have seen exactly this - at block level everything is
fine .. but at chip level ... performance starts deteriorating !! I have
tried the resource allocation and timing constraints also. Though
these improve the frequency a little, I think the biggest constraint
comes from the very high logic utilization, and hence the tool is not
able to concentrate its efforts efficiently !!! However, I am working
on a few things ... let's see if the efforts pay dividends !!!

Thanks for the suggestions. If you come across anything else, please
point me to it !!

JeDi
What's the difference between target and the result freq.?


"Aiken" <aikenpang@gmail.com> wrote in message 
news:0eaf6519-6902-42c2-8686-79be14ba6ba8@t54g2000hsg.googlegroups.com...
>>What's the difference between target and the result freq.?
What yer want v. what yer get
On Jun 6, 12:48 pm, JeDi <jaydev.she...@gmail.com> wrote:
> Hi jtw !!
>
> I tried pipelining - mainly to break large combinational blocks into
> smaller ones. But, the problem I run into is that the logic
> utilization increase almost 15-20 % more than the already high
> utilization !!!

Then you have the following options (in no particular order):
1. Choose a faster speed grade part.
2. Choose a larger part and keep pipelining.
3. Design better algorithms that can be implemented with better performance.
4. Slow the clock down.

> " You may add attributes to signals to try to coerce the synthesizer
> into doing 'the right thing'; if it doesn't come automatically, you
> might do some low-level coding, synthesize that, and then use the
> resultant edif file as a black box to the next level up"
>
> This is what I am trying to do now and seeing what impact does this
> have - fingers crossed !!! Thanks for the suggestion though.
>

Unless you're just missing by a little bit, don't expect to attribute your
way to happiness, you'll most likely be sadly disappointed...after spending
a (possibly) considerable amount of time trying to get it to work. Uncross
your fingers, let the blood flow through and go back to one of the four
suggestions given previously...or give it the old college try and hope for
the best.

> "The tools will often produce sub-optimal solutions when trying to
> solve many simultaneous, conflicting requirements (resource location,
> timing, ...), and thus have trouble for the total design. Generally,
> the timing performance achieved at the chip level is less optimal than
> at the block level, so make sure your blocks will meet timing. Other
> (non-coding) tricks: location constraints, multi-cycle constraints,"
>
> So true !! I have seen exactly this - block level everything is
> fine .. but chip level ... performance starts deteriorating !! I have
> tried the resource allocation and timing constraints also. Though
> these improve the frequency a little, I think the biggest constraint
> comes from the very high logic utilization and hence the tool is not
> able to concentrate its efforts efficiently !!! However, I am working
> on a few things ... lets see if the efforts pay dividends !!!
>

Saying that each block has good performance but tying them all together is
a problem caused by 'sub-optimal' placement is without any basis. While it
could be a contributor, the most likely cause is not the synthesis tool but
the algorithm you're trying to implement.

Look at your worst case timing path. If you see it going through a whole
bunch of levels of logic, then it's not the synthesis tool's poor placement,
it's your logic. If you see only one or two levels of logic and unreasonably
long delays then it is either the synthesis tool (as you suggest) or you
have an unrealistic expectation of what kind of clock speed you can expect
to run at.

Good luck

Kevin Jennings
"KJ" <kkjennings@sbcglobal.net> wrote in message 
news:f1ac06eb-bd12-4ff5-a9bd-f88d2abb7104@k37g2000hsf.googlegroups.com...
On Jun 6, 12:48 pm, JeDi <jaydev.she...@gmail.com> wrote:
.. snip..

> "The tools will often produce sub-optimal solutions when trying to
> solve many simultaneous, conflicting requirements (resource location,
> timing, ...), and thus have trouble for the total design. Generally,
> the timing performance achieved at the chip level is less optimal than
> at the block level, so make sure your blocks will meet timing. Other
> (non-coding) tricks: location constraints, multi-cycle constraints,"
>
> So true !! I have seen exactly this - block level everything is
> fine .. but chip level ... performance starts deteriorating !! I have
> tried the resource allocation and timing constraints also. Though
> these improve the frequency a little, I think the biggest constraint
> comes from the very high logic utilization and hence the tool is not
> able to concentrate its efforts efficiently !!! However, I am working
> on a few things ... lets see if the efforts pay dividends !!!
>

Saying that each block has good performance but tying them all together is
a problem caused by 'sub-optimal' placement is without any basis. While it
could be a contributor, the most likely cause is not the synthesis tool but
the algorithm you're trying to implement.

Look at your worst case timing path. If you see it going through a whole
bunch of levels of logic, then it's not the synthesis tool's poor placement,
it's your logic. If you see only one or two levels of logic and unreasonably
long delays then it is either the synthesis tool (as you suggest) or you
have an unrealistic expectation of what kind of clock speed you can expect
to run at.

Good luck

Kevin Jennings

-----

I have seen specific cases where the synthesizer gets carried away
optimizing away redundant logic; e.g., several instances of the same logic.
When the individual block is sent through synthesis and then place & route,
everything is fine; when the chip, containing multiple copies of the logic,
is sent through, the synthesizer perceives (correctly) redundant logic.
Unfortunately, in my case it made par work much harder, because of the
increase in fanout (more spacing/distribution than number of loads.) When I
synthesized the block independently, and then did the top-level sim with the
several instances appearing as black boxes, the chip-level place & route
improved significantly, achieving timing closure.

I often run low-level blocks through preliminary synthesis & par, even when
I don't do black-box instantiation, to give me that realistic expectation.
I try to find out the limits of performance here, not just that it meets my
'top-level' timing; if it just barely meets at the low level, I expect
trouble later.... It also gives me the opportunity to experiment with
different optimization schemes (coding style, re-architecting, synthesis
directives, etc.) with a quick turnaround.

In many of my designs, some blocks may be used 4 or more times; the more
times, the more relevant to optimize for utilization, particularly when
pushing the limits (space/routing/speed.)

JTW
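
For the fanout problem described above, a lighter-weight alternative to black-boxing (this is not what JTW did) is to duplicate a high-fanout register by hand and ask the tool not to merge the copies. The sketch below is only an illustration. The module and signal names are hypothetical, and the attributes are tool-specific: `preserve` is a Quartus synthesis attribute and `dont_touch` a Vivado one. Whether either one actually blocks duplicate-register merging in a given tool version is an assumption to confirm in the tool manual.

```verilog
// Hypothetical example: two copies of one enable register, each intended to
// drive loads in a different region of the chip.
module dup_enable (
    input  wire clk,
    input  wire en_in,
    output wire en_a,   // copy for one group of loads
    output wire en_b    // copy for another group of loads
);
    // Tool-specific attributes (assumptions, verify against your tool):
    // "preserve" is the Quartus spelling, "dont_touch" the Vivado one.
    // A tool that does not recognise an attribute normally ignores it.
    (* preserve, dont_touch = "true" *) reg en_a_r;
    (* preserve, dont_touch = "true" *) reg en_b_r;

    always @(posedge clk) begin
        // Logically identical registers; without the attributes a
        // synthesizer may merge them back into one.
        en_a_r <= en_in;
        en_b_r <= en_in;
    end

    assign en_a = en_a_r;
    assign en_b = en_b_r;
endmodule
```

Each copy costs one extra flip-flop, which is usually cheap next to the routing delay of a single register fanned out across the whole device.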
JeDi wrote:
> I tried pipelining - mainly to break large combinational blocks into
> smaller ones.

And there it is... Stop making large combinatorial blocks. As others have
pointed out, make 1 or 2 levels of logic, then a flip flop.

Here's an example that I see often in image processing: you're processing a
raster-scan image, so you need to know when you're at the end of an image
line. You make a counter to count out the pixels on a line, then you pepper
the control equations with a comparison of the row counter to the limit
value. It might meet timing when synthesized and placed in unit testing, but
as the chip grows, the placement of these equations has to compete with the
placement of all other logic, and it no longer meets timing.

The worst case timing path is the counter increment signal (hopefully a
register output), through the counter, and through the carry chain, the
counter output goes through the comparator, the comparator output combines
with the rest of the control logic, then mercifully shows up at the D of a
flip-flop.

Now, if you had just registered the comparator output instead, the increment
logic, counter logic and comparator delays would no longer contribute to
your timing problem because they all happened in the previous clock cycle.
Yes, this means that you have to compensate for the flip-flop delay of the
comparator, but you just work that out in the design before you code (or
you'll work that out anyway when the design doesn't meet timing).

---
Joe Samson
Pixel Velocity
Joseph Samson wrote:
> The worst case timing path is the counter increment signal
> (hopefully a register output), through the counter, and through the
> carry chain, the counter output goes through the comparator, the
> comparator output combines with the rest of the control logic, then
> mercifully shows up at the D of a flip-flop.

Oops! Those counter outputs are registered! In my defense, it was 7AM. The
principle remains, though. Think of your logic register to register, not as
large combinatorial chunks.

---
Joe Samson
Pixel Velocity
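
The end-of-line example can be sketched in Verilog roughly as follows. This is only an illustration under stated assumptions: the module name, port names, counter width and the `LINE_LEN` parameter are hypothetical and not from the original posts. The comparison is made one count early and registered, so downstream control logic sees a flip-flop output instead of the comparator.

```verilog
// Hypothetical pixel-column counter with a registered end-of-line flag.
// Names, widths and LINE_LEN are assumptions made for illustration only.
module line_counter #(
    parameter WIDTH    = 11,
    parameter LINE_LEN = 1280   // pixels per line; must be >= 2 and fit in WIDTH bits
) (
    input  wire             clk,
    input  wire             rst,        // synchronous reset, active high
    input  wire             pixel_en,   // high for one clock per pixel
    output reg  [WIDTH-1:0] col,        // current pixel column
    output reg              last_pixel  // registered: high while col == LINE_LEN-1
);
    always @(posedge clk) begin
        if (rst) begin
            col        <= {WIDTH{1'b0}};
            last_pixel <= 1'b0;
        end else if (pixel_en) begin
            if (last_pixel) begin
                // End of line: wrap to column 0 and clear the flag.
                col        <= {WIDTH{1'b0}};
                last_pixel <= 1'b0;
            end else begin
                col        <= col + 1'b1;
                // Compare one count early so that the registered flag is
                // high exactly when col holds LINE_LEN-1.
                last_pixel <= (col == LINE_LEN - 2);
            end
        end
    end
endmodule
```

Control equations then use `last_pixel` directly rather than repeating `col == LINE_LEN - 1`, so the comparator delay stays out of their register-to-register path. Comparing against `LINE_LEN - 2` is the "compensate for the flip-flop delay" step mentioned in the post.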