XST Help - Device Utilization Woes

Started by Brandon August 18, 2005
Hello,

I'm synthesizing a design in XST and I'm having a hard time figuring
out what's consuming all of the device's resources.

I wrote mostly structural VHDL, so I decided to synthesize each
component separately to get a better idea of the low level utilization.
I haven't seen any option in XST to see a hierarchical analysis of
area... Anyway, I estimated the resource consumption of my design,
excluding routing, the FSM, and some other small amounts of logic and
multiplexing:

       Slice Count    Slice FFs        4-input LUTs
       -----------    ---------        ------------
used:  10936          29048            12406
total: 23616          47232            47232
       -----------    ---------        ------------
       46.31%         61.50%           26.27%

Here are the actual numbers from the synthesis report:
 Number of Slices:                   45523  out of  23616   192% (*)
 Number of Slice Flip Flops:         22611  out of  47232    47%
 Number of 4 input LUTs:             78378  out of  47232   165% (*)


When looking in the synthesis report, I noticed some warnings
indicating that duplicate FFs were removed, so that explains the
reduction in FF count. However, I cannot explain the HUGE increase in
LUT and Slice usage. What can I infer from this?

The report also tells me that some of my 6-bit counter signals are
being replicated (once or twice). What is the cause of this? High
fan-out?
<SNIP>
FlipFlop cnt_dout_ins_cnt_v_0 has been replicated 2 time(s)
FlipFlop cnt_dout_ins_cnt_v_1 has been replicated 1 time(s)
FlipFlop cnt_hreg_ins0_cnt_v_0 has been replicated 2 time(s)
FlipFlop cnt_hreg_ins0_cnt_v_1 has been replicated 1 time(s)
FlipFlop cnt_hreg_ins10_cnt_v_0 has been replicated 2 time(s)
FlipFlop cnt_hreg_ins10_cnt_v_1 has been replicated 1 time(s)
FlipFlop cnt_hreg_ins11_cnt_v_0 has been replicated 2 time(s)
</SNIP>

Is there any way to decipher the cell usage count, perhaps? Does anyone
have a URL that includes an explanation of all the cell names? I also
checked the macro statistics and everything is accounted for in that
table. 

Thanks.
-Brandon

Hi Brandon,
The floorplanner tool might help you track down where most of your usage is.
HTH, Syms. 


Brandon,

I would suggest taking a look at the synthesis warnings. Maybe you
instantiated the same component twice, or maybe you selected the wrong
device size...

But if you did everything OK, then the only remaining explanation is that
the estimate you made is (sometimes) not close to the actual placement
results.

Any details on this?

Vladislav

Replication usually comes from two sources:
1/ the fanout, as you suggest
2/ speed improvement
3/ borg

I would suggest there is little hope for your design as you have too much
logic. The correct answer is to get a bigger device :-)

However, take a look at any memories: if they are distributed rather than
block RAM, they will eat LUTs.
Then look at shift registers... SRL16?
Think about what you are trying to achieve and see if there's a simpler
solution.

Simon

I checked all the warnings and none of them seem significant.

OK, I'm trying to synthesize a design for an N-tap complex MAC FIR. The
design serially loads the complex coefficients (currently 64 taps) via
a 64-deep, 16-bit-wide shift register. Here is the synthesis blurb
for that unit:

Synthesizing Unit <srsipo>.
    Related source file is
"/../../../Modeltech_6.0c/projects/espfep/work/srsipo.vhd".
    Found 1024-bit register for signal <d_r>.
INFO:Xst:738 - HDL ADVISOR - 1024 flip-flops were inferred for signal
<d_r>. You may be trying to describe a RAM in a way that is
incompatible with block and distributed RAM resources available on
Xilinx devices, or with a specific template that is not supported.
Please review the Xilinx resources documentation and the XST user
manual for coding guidelines. Taking advantage of RAM resources will
lead to improved device usage and reduced synthesis time.
    Summary:
	inferred 1024 D-type flip-flop(s).
Unit <srsipo> synthesized.

Since I was worried about the size of this unit, I synthesized it alone
and noticed that it consumed very few slices/FFs (2%, 4%), so I don't
think that this is the problem. This HDL ADVISOR message is just an FYI,
correct? I do not believe I'd get any benefit from using the on-chip
RAM resources?

Anyway, here is the final report for the design, if it helps.

=========================================================================
*                            Final Report                               *
=========================================================================
Final Results
RTL Top Level Output File Name     : mac_fircplx_wrapper.ngr
Top Level Output File Name         : mac_fircplx_wrapper
Output Format                      : NGC
Optimization Goal                  : Area
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 102

Macro Statistics :
# Registers                        : 3651
#      1-bit register              : 2817
#      16-bit register             : 258
#      32-bit register             : 256
#      33-bit register             : 128
#      38-bit register             : 128
#      6-bit register              : 64
# Counters                         : 2
#      6-bit up counter            : 2
# Multiplexers                     : 130
#      16-bit 64-to-1 multiplexer  : 130
# Adders/Subtractors               : 320
#      33-bit adder carry in       : 128
#      38-bit adder                : 128
#      6-bit subtractor            : 64
# Multipliers                      : 256
#      16x16-bit multiplier        : 256
# Xors                             : 128
#      1-bit xor3                  : 128

Cell Usage :
# BELS                             : 159480
#      BUF                         : 4
#      GND                         : 1
#      INV                         : 367
#      LUT1                        : 64
#      LUT2                        : 4304
#      LUT3                        : 73793
#      LUT4                        : 217
#      MUXCY                       : 9164
#      MUXF5                       : 33281
#      MUXF6                       : 16640
#      MUXF7                       : 8320
#      MUXF8                       : 4160
#      VCC                         : 1
#      XORCY                       : 9164
# FlipFlops/Latches                : 22611
#      FDC                         : 130
#      FDCE                        : 22114
#      FDCPE                       : 15
#      FDP                         : 64
#      FDPE                        : 288
# Clock Buffers                    : 1
#      BUFGP                       : 1
# IO Buffers                       : 101
#      IBUF                        : 67
#      OBUF                        : 34
# MULTs                            : 256
#      MULT18X18                   : 256
=========================================================================

Device utilization summary:
---------------------------

Selected Device : 2vp50ff1152-5

 Number of Slices:                   45523  out of  23616   192% (*)
 Number of Slice Flip Flops:         22611  out of  47232    47%
 Number of 4 input LUTs:             78378  out of  47232   165% (*)
 Number of bonded IOBs:                102  out of    692    14%
 Number of MULT18X18s:                 256  out of    232   110% (*)
 Number of GCLKs:                        1  out of     16     6%

WARNING:Xst:1336 -  (*) More than 100% of Device resources are used


=========================================================================

I'm not worried about the multiplier over-utilization. I'll probably
just reduce the number of taps once I get the other numbers down...
I've yet to find any info on the Xilinx cell primitives, i.e. what's the
difference between FDC, FDCE, FDP, etc.? Does anyone have any technical
documentation on these? If anyone is interested, I could provide a
design schematic...

Much thanks,
-Brandon

There are 130 16-bit wide, 64-to-1 multiplexers in your design. That is
where your excess logic utilization is coming from. What are they for?

Each 64-to-1 mux is going to be 32 LUTs plus some MUXFXs per bit.
32*16*130 = 66560 LUTs.

Regards,
John McCaskill

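The multiplexer arithmetic above can be checked against the Cell Usage table with a few lines of Python. This is only a rough sketch: it assumes each LUT acts as a 2-to-1 mux and that each MUXF5 to MUXF8 level halves the number of remaining signals. The helper name `mux_cost` is made up for illustration; the `reported` numbers are copied from the final report.

```python
# Rough cell estimate for wide multiplexers on a 4-input-LUT architecture.
# Assumption: one LUT per pair of inputs, then MUXF5..MUXF8 each halve
# the number of signals that are still left to combine.

def mux_cost(inputs, width, count):
    """Return estimated cell counts for `count` muxes of `inputs`-to-1, `width` bits wide."""
    signals = inputs // 2                      # first level: one LUT per input pair
    cost = {"LUT": signals * width * count}
    for level in ("MUXF5", "MUXF6", "MUXF7", "MUXF8"):
        signals //= 2
        if signals == 0:
            break
        cost[level] = signals * width * count
    return cost

estimate = mux_cost(inputs=64, width=16, count=130)

# Figures from the "Cell Usage" section of the report (LUT3 covers the whole design).
reported = {"LUT": 73793, "MUXF5": 33281, "MUXF6": 16640, "MUXF7": 8320, "MUXF8": 4160}

for cell, value in estimate.items():
    print(f"{cell:6s} estimated {value:6d}   reported {reported[cell]:6d}")
```

The MUXF5 to MUXF8 estimates land within one cell of the reported counts, which supports the multiplexers as the dominant cost. The reported LUT3 total covers the whole design, so it is naturally higher than the mux-only estimate of 66560.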
I believe it is the Xilinx "Libraries Guide" you are after for FDC,
FDCE etc. documentation. Look in the ISE install directory and you
should find it under xilinx\doc\usenglish\books\docs\lib\lib.pf. Or in
ISE 7.x go to Help->online documentation, then in the left-hand pane you
will see "Libraries Guide".

What is the sample rate and number of bits in your input data? You may
have already considered it, but a "distributed arithmetic" filter uses
far fewer FPGA resources than a fully parallel filter. The cost is in
sample rate. A distributed arithmetic approach to the FIR filter allows
you to trade off sample rate and FPGA resource usage. 
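
To make the trade-off concrete, here is a small Python model of the distributed arithmetic idea. It is a textbook-style sketch, not the Xilinx core: the function name `da_fir_output` and the example values are made up, and it assumes signed integer samples that fit in `bits`-bit two's complement. The multipliers are replaced by one lookup table, and each output takes `bits` cycles, one per bit plane of the input samples.

```python
def da_fir_output(taps, samples, bits):
    """Compute sum(taps[k] * samples[k]) bit-serially, distributed-arithmetic style."""
    n = len(taps)
    # Lookup table: entry `addr` holds the sum of the taps whose address bit is set.
    lut = [sum(taps[k] for k in range(n) if (addr >> k) & 1)
           for addr in range(1 << n)]
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]        # two's complement bit patterns
    acc = 0
    for b in range(bits):                      # one bit plane per clock cycle
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k        # bit b of sample k drives address bit k
        partial = lut[addr] << b
        # The MSB of a two's complement word carries negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc

taps = [3, -1, 4, 2]
samples = [5, -7, 0, -8]                       # all fit in 4-bit two's complement
print(da_fir_output(taps, samples, bits=4))    # 6, same as the direct dot product
```

The table has 2^N entries for N taps, so a 64-tap filter would have to split it into several smaller tables and add their outputs. The `bits` cycles per output are where the sample rate cost comes from.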

Regards
Andrew

John,

I need those multiplexers to multiplex the coefficients h[0] through
h[63] to each MAC b input. There are 64 complex MACs, so I need
64x2=128, 64 to 1 multiplexers for the complex 'b' input. There are two
64 to 1 multiplexers to multiplex the complex accumulator outputs s[0]
through s[63] to the output y.

Here is how the timing goes for first two samples:
__@ t = 0__
b[0] <= h[0]    y[0] = x[0]h[0]
b[1] <= h[1]           x[0]h[1]
.
.
.
b[63] <= h[63]         x[0]h[63]

__@ t = 1__
b[0] <= h[63]          x[1]h[63]
b[1] <= h[0]    y[1] = x[1]h[0]+x[0]h[1]
.
.
.
b[63] <= h[62]         x[1]h[62]

So, I have a counter for each multiplexer that controls which MAC gets
which filter coefficient. Is there another way I can do this with less
hardware, but 100% throughput?
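
For readers following the schedule above, here is a small Python model of it. It is a sketch under a few assumptions: real-valued integers instead of complex data, zero history before t = 0, and made-up function names. MAC k receives coefficient h[(k - t) mod N] at time t, which matches the t = 0 and t = 1 columns above, and the MAC that has just received h[0] drives the output and is then cleared.

```python
def rotating_mac_fir(h, x):
    """Model N parallel MACs whose coefficient input rotates by one tap per sample."""
    n = len(h)
    acc = [0] * n
    y = []
    for t, sample in enumerate(x):
        for k in range(n):
            acc[k] += sample * h[(k - t) % n]  # the value the 64-to-1 mux selects
        done = t % n                           # this MAC just multiplied by h[0]
        y.append(acc[done])                    # output mux picks the finished sum
        acc[done] = 0                          # clear it for the next output
    return y

def direct_fir(h, x):
    """Reference: y[t] = sum over j of h[j] * x[t - j], with x[t] = 0 for t < 0."""
    return [sum(h[j] * x[t - j] for j in range(len(h)) if t - j >= 0)
            for t in range(len(x))]

h = [1, 2, 3, 4]
x = [1, -1, 2, 0, 5, 3, -2, 7, 1]
print(rotating_mac_fir(h, x) == direct_fir(h, x))   # True
```

Note that the select pattern is a pure rotation: every MAC sees the same coefficient sequence, offset by one position from its neighbour.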

Andrew,

Originally I had looked at DA FIRs, but we can't really use a
multi-cycle approach. The sample rate for this real time system is 215
MSPS in the front end. We'll also end up interpolating later... We have
FIFOs, but I'm worried that they will fill up quickly, especially since
I'm not sure we'll be able to clock the FPGAs at 215 MHz.

Thanks,
-Brandon

Hi Brandon,
What is the signal you are analysing/filtering? Presumably you have
ruled out decimation to get the rate down?

Regards
Andrew

Brandon,

I'm not 100% clear on what your filter structure looks like, but for a
high throughput, large FIR like this, you should be using a transposed
direct form I structure. If you have fanout problems with this, and can
tolerate additional latency, you can use a systolic structure. Xilinx
has a nice depiction of both in (pp. 84-85):

http://www.xilinx.com/bvdocs/userguides/ug073.pdf

For either structure, you can simply string a shift register together
to load your tap coefficients. I hope this helps.
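
As a behavioural illustration of the transposed structure, here is a short Python model. It is a sketch with made-up names and integer data, not a description of the Xilinx implementation. Every multiplier sees the current input sample, each tap has a fixed coefficient, and the partial sums move one register toward the output per clock, so no coefficient multiplexers are involved.

```python
def transposed_fir(h, x):
    """Transposed-form FIR: the input is broadcast, partial sums ripple to the output."""
    n = len(h)
    state = [0] * n                            # state[n-1] stays 0 and ends the chain
    y = []
    for sample in x:
        y.append(h[0] * sample + state[0])
        # All registers update together, each using its neighbour's old value.
        state = [h[k + 1] * sample + state[k + 1] for k in range(n - 1)] + [0]
    return y

def direct_fir(h, x):
    """Reference: y[t] = sum over j of h[j] * x[t - j], with x[t] = 0 for t < 0."""
    return [sum(h[j] * x[t - j] for j in range(len(h)) if t - j >= 0)
            for t in range(len(x))]

h = [1, 2, 3, 4]
x = [1, -1, 2, 0, 5, 3, -2, 7, 1]
print(transposed_fir(h, x) == direct_fir(h, x))     # True
```

Because each tap keeps the same coefficient for the whole run, the coefficients can sit in a plain chain of registers that is shifted only while loading.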

cheers,
aaron