comp.arch.fpga | Downsizing Verilog synthesization.| page 2

Reply by Joseph Samson ● August 6, 2008

eromlignod wrote:
> Hi guys:
> 
> I'm prototyping an application using a Xilinx Spartan-3 development
> board.  I'm using this particular development kit because it is suited
> to the large amount of I/O I need.
> 
> I'm new to FPGA, so I have written the code in Verilog using almost
> exclusively a high-level, behavioural style.  The program works, but
> synthesizes using 99% of the available slices.  So if I try to change
> or improve the code, it often synthesizes to over 100% and kicks out
> an error.

Are you talking about synthesis or place and route? It's not an error 
(at least through 9.1i) to synthesize to more logic than is available in 
a part. My current design synthesizes to 105% of device resources. 
Mapping optimizes the design and gets rid of redundant or unused logic.

It's also not unusual to have 99% of the slices occupied. The tools 
prefer to spread the design over as many slices as possible.

The map report will show how many resources are really used (LUTs, FF, 
BRAM, Clocks, MULT.....).

---
Joe Samson

Reply by John McCaskill ● August 6, 2008

On Aug 6, 11:31 am, eromlignod <eromlig...@aol.com> wrote:
> On Aug 6, 11:24 am, eromlignod <eromlig...@aol.com> wrote:
>
> > On Aug 6, 10:49 am, Gabor <ga...@alacron.com> wrote:
>
> > > On Aug 6, 10:43 am, eromlignod <eromlig...@aol.com> wrote:
>
> > > > On Aug 6, 9:21 am, John McCaskill <jhmccask...@gmail.com> wrote:
>
> > > > > On Aug 6, 8:40 am, eromlignod <eromlig...@aol.com> wrote:
>
> > > > > > Hi guys:
>
> > > > > > I'm prototyping an application using a Xilinx Spartan-3 development
> > > > > > board. I'm using this particular development kit because it is suited
> > > > > > to the large amount of I/O I need.
>
> > > > > > I'm new to FPGA, so I have written the code in Verilog using almost
> > > > > > exclusively a high-level, behavioural style. The program works, but
> > > > > > synthesizes using 99% of the available slices. So if I try to change
> > > > > > or improve the code, it often synthesizes to over 100% and kicks out
> > > > > > an error.
>
> > > > > > I need to condense what I've got to give me some space to work with.
>
> > > > > > The application is basically a large number of high-speed pulse
> > > > > > inputs. I count them all independently and average several readings
> > > > > > over time for each to produce a 21-bit number. Each of these 21-bit
> > > > > > vectors (there are almost 100) is sent to a central processing module
> > > > > > that evaluates and compares them using simple arithmetic. Based on
> > > > > > these comparisons, another set of vectors is sent on to a couple of
> > > > > > modules that arrange them into a special synchronous serial output.
> > > > > > That's all it does.
>
> > > > > > Are there any standard tips or general guidelines that you might offer
> > > > > > to condense my synthesis? I have found, for example, that making the
> > > > > > vectors smaller doesn't really change the overall slice count, yet
> > > > > > commenting out a single line of the processing code can change it
> > > > > > drastically.
>
> > > > > > Any ideas or comments would be greatly appreciated.
>
> > > > > > Don
>
> > > > > Since you state that you run out of slices, I know that your design is
> > > > > larger than the FPGA can hold, but I would still point out that the
> > > > > slice utilization is a pessimistic view of how much of the FPGA you
> > > > > are using, the mapping stage spreads the logic out by default instead
> > > > > of packing it as tightly as possible. The Register and LUT
> > > > > utilization is an optimistic measure of how much of the FPGA you have
> > > > > left. You need to watch all of them to get a good idea of how full
> > > > > your design really is.
>
> > > > > You mention both a high speed pulse counting section that counts and
> > > > > averages over time, and then a processing section that sounds like it
> > > > > is slower. How much slower is it? If you can share resources over
> > > > > time in this section you could save resources.
>
> > > > > You can look in the reports to see how many adders, etc the tools
> > > > > inferred from your code. Your goal is to reduce that number to the
> > > > > minimum required to perform the comparisons. You have a range of
> > > > > options that depend on your constraints. At one end of the spectrum,
> > > > > just find any redundant calculations and rearrange your code to share
> > > > > those calculations. At the other end, you could use a soft processor
> > > > > such as a PicoBlaze to do the calculations in software.
>
> > > > > Regards,
>
> > > > > John McCaskill
> > > > > www.FasterTechnology.com
>
> > > > What sorts of operations are the biggest gate-hogs?
>
> > > > I have a lot of comparison "if" operations, counters, and non-blocking
> > > > assignments to convert lots of inputs into usable arrays. The
> > > > averagers each divide by 32 and I have another single divider toward
> > > > the end that divides by 256. Other than that, I'm not doing anything
> > > > very fancy. I have no multipliers (though I might like to add one),
> > > > no "for" loops, etc.
>
> > > > I do have a series of hard-coded standard values that I use for
> > > > comparison. They are in the form of parameters that are fed to each
> > > > of the input counter modules when they are instantiated in the top
> > > > module. I suppose these could be EPROM memories, but I haven't
> > > > figured out yet how to use the memory provided on the development
> > > > board.
>
> > > > Don
>
> > > What tools are you using for synthesis? If ISE / XST (webpack or
> > > foundation from Xilinx) which version?
>
> > > Things like divide by power of two should take no resources whatever
> > > (i.e. shift operators are basically wires). However a synthesis tool
> > > may look at the division operator and think you need a divider, which
> > > will take a lot of logic.
>
> > > Also since you seem to be register-heavy, see where you can
> > > use serial shift registers or memory instead of loose flip-flops.
> > > In Spartan 3 you get 16 stages of serial shift register or 16
> > > bits of distributed RAM from a single LUT site. Coding shift
> > > registers without a reset term allows the synthesizer to place
> > > them in these structures instead of flip-flops (which come
> > > one to a LUT site).
>
> > > Did you look at your map report or "design summary"? In
> > > the latest version of ISE the design summary can show you
> > > where your largest resource allocations come from.
>
> > > Regards,
> > > Gabor
>
> > Interesting.
>
> > Thanks Gabor! This may be very useful. I have a large number of 8-bit
> > vectors in my design. I have about 220 of them passing from one
> > module to another. They each begin as an "output reg [7:0]" in one
> > module and are all assigned to an array in the other module like this.
>
> > reg [7:0] array [219:0];
> > ...
> > y[0] <= array[0];
> > y[1] <= array[1];
> > y[2] <= array[3];
> > ...etc.
>
> > Is this bad form?
>
> > Don
>
> Oops. I meant for that code to be:
>
> input [7:0] y0;
> input [7:0] y1;
> ...
> reg [7:0] array [219:0];
> ...
> array[0] <= y0;
> array[1] <= y1;
> ...etc.
>
> Don

If you can map this onto a block ram, you will save quite a bit of
registers. Whether or not you can do this depends on if you can write
the vectors in one (or a few) at a time, and process them sequentially
in the time you have available.  How much time do you have to process
the vectors? Ns, us, ms ?

Regards,

John McCaskill
www.FasterTechnology.com
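
A minimal sketch of the block RAM idea above, assuming the 220 eight-bit vectors can be written one per clock and read back one at a time. The module and port names (`vector_store`, `we`, `waddr`, `raddr` and so on) are made up for illustration and do not come from Don's design. 220 x 8 = 1760 bits, which is well under the 18 Kbit of a single Spartan-3 block RAM. With a registered read and no reset on the array, the synthesizer can usually map a description like this onto RAM instead of 1760 loose flip-flops, but whether it picks block or distributed RAM is worth confirming in the synthesis report.

```verilog
// Hypothetical module: stores 220 eight-bit vectors in an inferred RAM.
// Names and port list are illustrative, not from the original design.
module vector_store (
    input  wire       clk,
    input  wire       we,      // write strobe: one vector per clock
    input  wire [7:0] waddr,   // which vector to update (0..219)
    input  wire [7:0] wdata,   // new 8-bit value for that vector
    input  wire [7:0] raddr,   // which vector the processing logic wants
    output reg  [7:0] rdata    // value appears one clock after raddr
);
    // No reset term on the array: resetting every word would generally
    // force the tools back to individual flip-flops.
    reg [7:0] mem [0:219];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        // Registered (synchronous) read, as block RAM requires.
        // Addresses 220..255 are outside the array and must not be used.
        rdata <= mem[raddr];
    end
endmodule
```

The cost is that the processing logic sees one vector per clock instead of all 220 at once, which is why the time available for processing matters.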

Reply by eromlignod ● August 6, 2008

On Aug 6, 11:50 am, John McCaskill <jhmccask...@gmail.com> wrote:
> If you can map this onto a block ram, you will save quite a bit of
> registers. Whether or not you can do this depends on if you can write
> the vectors in one (or a few) at a time, and process them sequentially
> in the time you have available. How much time do you have to process
> the vectors? Ns, us, ms ?

Ah, I think I'm following along now.  Are you talking about sending
the numbers over a single 8-bit vector wire one-at-a-time?  Hmmm.

The vectors are actually independent from each other and refresh at
various random rates, so a few usec here or there shouldn't make a
difference.  I'll give it a try!

Don

Reply by eromlignod ● August 6, 2008

On Aug 6, 11:43 am, Joseph Samson <u...@not.my.company> wrote:
> Are you talking about synthesis or place and route? It's not an error
> (at least through 9.1i) to synthesize to more logic than is available in
> a part. My current design synthesizes to 105% of device resources.
> Mapping optimizes the design and gets rid of redundant or unused logic.
>
> It's also not unusual to have 99% of the slices occupied. The tools
> prefer to spread the design over as many slices as possible.
>
> The map report will show how many resources are really used (LUTs, FF,
> BRAM, Clocks, MULT.....).

Right, I meant when I process all the way to place & route, not just
synthesis.  It is funny that it is at 99%.  In fact, it shows that all
but two of the slices are used (!!).

It does list usage as 76% related logic and 24% unrelated logic.  I'm
not sure how to remedy that.

Don

Reply by John_H ● August 6, 2008

On Aug 6, 10:36 am, eromlignod <eromlig...@aol.com> wrote:
<snip>
> Right, I meant when I process all the way to place & route, not just
> synthesis. It is funny that it is at 99%. In fact, it shows that all
> but two of the slices are used (!!).
>
> It does list usage as 76% related logic and 24% unrelated logic. I'm
> not sure how to remedy that.
>
> Don

The xilinx mapper will spread logic out with one LUT per slice until
it fills to nearly 100% of the slices then it will backfill the 2nd
LUT in each slice where conditions permit.  There's usually a good
stretch between 99% and 101%.

If you explained better what you're trying to do (signal quantity,
what the counts represent, frequencies of the signals and the system
clock) you might get better suggestions on how to code things.  Right
now most of us are taking stabs in the dark.

Reply by Jim Granville ● August 6, 2008

eromlignod wrote:

> Hi guys:
> 
> I'm prototyping an application using a Xilinx Spartan-3 development
> board.  I'm using this particular development kit because it is suited
> to the large amount of I/O I need.
> 
> I'm new to FPGA, so I have written the code in Verilog using almost
> exclusively a high-level, behavioural style.  The program works, but
> synthesizes using 99% of the available slices.  So if I try to change
> or improve the code, it often synthesizes to over 100% and kicks out
> an error.
> 
> I need to condense what I've got to give me some space to work with.
> 
> The application is basically a large number of high-speed pulse
> inputs.

Define 'high speed', and what timebase reading rate ?
What are the pulses coming from ?

To some, microseconds is high speed, to others, femtoseconds is high 
speed.....

> I count them all independently and average several readings
> over time for each to produce a 21-bit number.  

If you mean several readings from the same channel, a longer
count time will do that for free.

Are you reading frequency? (fixed time readout of the counters)

> Each of these 21-bit
> vectors (there are almost 100) is sent to a central processing module
> that evaluates and compares them using simple arithmetic.  Based on
> these comparisons, another set of vectors is sent on to a couple of
> modules that arrange them into a special synchronous serial output.
> That's all it does.

What sort of comparison, and what decision rates are you talking ?
Is that processing software, or hardware ?

Do you need 21 bits of precision, or just 21 bits of dynamic range ?

A quasi-log counter bus would drop the fan-out.
(So the 13 most significant bits plus a 3-bit exponent would mux on
16-bit data paths instead of 21-bit ones: 16/21, or 76% of the mux
logic, right there.)

-jg

Reply by John McCaskill ● August 6, 2008

On Aug 6, 12:31 pm, eromlignod <eromlig...@aol.com> wrote:
> On Aug 6, 11:50 am, John McCaskill <jhmccask...@gmail.com> wrote:
>
> > If you can map this onto a block ram, you will save quite a bit of
> > registers. Whether or not you can do this depends on if you can write
> > the vectors in one (or a few) at a time, and process them sequentially
> > in the time you have available. How much time do you have to process
> > the vectors? Ns, us, ms ?
>
> Ah, I think I'm following along now. Are you talking about sending
> the numbers over a single 8-bit vector wire one-at-a-time? Hmmm.
>
> The vectors are actually independent from each other and refresh at
> various random rates, so a few usec here or there shouldn't make a
> difference. I'll give it a try!
>
> Don

You are asking good questions, so there are multiple people here that
will be happy to help you out. However, you are asking for some low
level suggestions without giving enough high level detail.  The best
optimizations are the ones that you apply at the high level where you
have the most leverage.

If you can tell us more about what you are trying to do you will get
better responses. You said that you have almost 100 high speed
channels.

How many channels are there?
How fast are the pulses arriving on average?
Over what time is the average?
What is the air speed of an unladen swallow?
What is the minimum spacing between pulses?
How fast does your central processing module need to compare the
channels?

As Jim Granville pointed out, the various time bases of your problem
have a major impact on the potential solutions.

Regards,

John McCaskill
www.FasterTechnology.com

Reply by Nial Stewart ● August 7, 2008

> I'm intrigued by your answer, but don't fully understand what you
> propose.

You need to have a better understanding of what's generated by your code.
Remember you're describing hardware.

> My last serial generating module has a big 256 vector input that it is
> translating to a serial output that repeats the 256 bits over and
> over.  The code is basically something like this:
> input [255:0] invector;
> output serout;
> reg [7:0] x;
> always @(negedge shiftclock)
>    begin
>       x = x + 1;
>       serout = invector[x];
>    end

That (probably) creates a 256 bit vector and a massive mux to select
one of the bits.

In VHDL the following generates a big shift register which the tools
will find dead easy to place and route as each logical path is just
from one register to the next.....

if(rising_edge(clk)) then

    invector(254 downto 0) <= invector(255 downto 1);
    serout <= invector(0);

end if;

This should be easily translated to verilog.
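
One possible Verilog rendering of that shift register, as a sketch only. The `load` strobe and the internal `shreg` register are assumptions added here so that the input port is not assigned to and a new frame can be captured; the module name `serializer` is also made up. The register rotates instead of just shifting, so the 256 bits repeat over and over as in the original description, and the falling clock edge is kept from the quoted code.

```verilog
// Hypothetical serializer: sends invector out one bit per clock, LSB first,
// repeating the 256-bit frame until a new one is loaded.
module serializer (
    input  wire         shiftclock,
    input  wire         load,       // assumed strobe: capture a new frame
    input  wire [255:0] invector,
    output reg          serout
);
    reg [255:0] shreg;              // working copy that actually shifts

    always @(negedge shiftclock) begin
        if (load)
            shreg <= invector;                  // parallel load
        else
            shreg <= {shreg[0], shreg[255:1]};  // rotate right by one bit
        serout <= shreg[0];                     // only bit 0 drives the output
    end
endmodule
```

Each flip-flop now sees only a 2:1 choice (load or take its neighbour's value) instead of feeding a 256:1 mux. The parallel load does mean the chain stays in ordinary flip-flops, since the LUT-based shift registers Gabor mentioned cannot be loaded in parallel.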




Nial.

Reply by Gabor ● August 7, 2008

On Aug 6, 10:21 am, John McCaskill <jhmccask...@gmail.com> wrote:
[snip]
> > Don
>
> Since you state that you run out of slices, I know that your design is
> larger than the FPGA can hold, but I would still point out that the
> slice utilization is a pessimistic view of how much of the FPGA you
> are using, the mapping stage spreads the logic out by default instead
> of packing it as tightly as possible.  The Register and LUT
> utilization is an optimistic measure of how much of the FPGA you have
> left.  You need to watch all of them to get a good idea of how full
> your design really is.
[snip]
> Regards,
>
> John McCaskill
> www.FasterTechnology.com

On a related note, I had a design with about 60% LUT and flip-flop
utilization (Spartan 2) that would not fit until I checked "Disable
Register Ordering" in the mapping options.  If you're not exceeding
the capacity of the device by too much you can look at tuning
the tools a bit to help fit the design.

Regards,
Gabor

Reply by eromlignod ● August 7, 2008

On Aug 6, 8:05 pm, John McCaskill <jhmccask...@gmail.com> wrote:
> On Aug 6, 12:31 pm, eromlignod <eromlig...@aol.com> wrote:
>
> > On Aug 6, 11:50 am, John McCaskill <jhmccask...@gmail.com> wrote:
>
> > > If you can map this onto a block ram, you will save quite a bit of
> > > registers. Whether or not you can do this depends on if you can write
> > > the vectors in one (or a few) at a time, and process them sequentially
> > > in the time you have available. How much time do you have to process
> > > the vectors? Ns, us, ms ?
>
> > Ah, I think I'm following along now. Are you talking about sending
> > the numbers over a single 8-bit vector wire one-at-a-time? Hmmm.
>
> > The vectors are actually independent from each other and refresh at
> > various random rates, so a few usec here or there shouldn't make a
> > difference. I'll give it a try!
>
> > Don
>
> You are asking good questions, so there are multiple people here that
> will be happy to help you out. However, you are asking for some low
> level suggestions without giving enough high level detail. The best
> optimizations are the ones that you apply at the high level where you
> have the most leverage.
>
> If you can tell us more about what you are trying to do you will get
> better responses. You said that you have almost 100 high speed
> channels.
>
> How many channels are there?
> How fast are the pulses arriving on average?
> Over what time is the average?
> What is the air speed of an unladen swallow?
> What is the minimum spacing between pulses?
> How fast does your central processing module need to compare the
> channels?
>
> As Jim Granville pointed out, the various time bases of your problem
> have a major impact on the potential solutions.
>
> Regards,
>
> John McCaskill
> www.FasterTechnology.com

Well, I'm being a little cryptic because there are new patent
applications involved and I don't want to give too much away.  I can
tell you those things that are already covered in the first patent
though.

The device is a self-tuning piano.  You can read/listen about it
here...

New York Times:
http://query.nytimes.com/gst/fullpage.html?res=9800E1D8133FF931A35752C0A9659C8B63

NPR:
http://www.npr.org/templates/story/story.php?storyId=878091

New Scientist Magazine:
http://www.newscientist.com/article/dn3143-hotwired-piano-tunes-itself.html

The incoming signals are square waves at the fundamental frequency of
each of the 219 strings in the piano that are being magnetically
sustained (vibrate forever).  I only read 44 strings at a time, tune
them, then go on to the next 44, etc.  So there are 44 counting
modules and an output address signal to instruct the sustainer
circuits when to vibrate the next string.

I first convert the wave to a "period" wave that has an "on" time
equal to one period of the string's vibration.  I then use this wave
to enable counting of the 50-MHz system clock.  So I get a count of
how many clock ticks of the system clock occur for one period of
string vibration.  This takes up to 21 bits for the low strings.  I
average 32 of these numbers and calculate an error based on a stored
setpoint.  Currently I'm using a theoretical setpoint, but eventually
I will want to add the feature whereby a piano tech can hand-tune the
piano and then "store" his tuning numbers for subsequent use.
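
For illustration, the period count and the 32-sample average described above could be sketched as follows. This is not the actual code from the project: the module name `period_average`, the signal names, and the assumption that the "period" wave (`gate`) is already synchronized to the 50 MHz clock are all made up here. The point of the sketch is that dividing by 32 is written as dropping the five low bits of the running sum, which costs no logic, in line with the earlier remark in the thread that a division operator may be turned into a real divider.

```verilog
// Hypothetical sketch: count 50 MHz ticks while gate is high, then
// average 32 such counts by accumulating and discarding 5 LSBs.
module period_average (
    input  wire        clk,        // 50 MHz system clock
    input  wire        gate,       // high for one string period (assumed synchronous)
    output reg  [20:0] avg,        // average of the last 32 period counts
    output reg         avg_valid   // one-clock pulse when avg updates
);
    reg [20:0] count  = 0;   // ticks in the current period (wraps past 2^21 - 1)
    reg [25:0] sum    = 0;   // 21 bits plus 5 bits of growth for 32 samples
    reg [4:0]  n      = 0;   // samples accumulated so far
    reg        gate_d = 0;   // gate delayed one clock, for edge detection

    wire [25:0] total = sum + count;

    always @(posedge clk) begin
        gate_d    <= gate;
        avg_valid <= 1'b0;

        if (gate) begin
            count <= count + 1'b1;
        end else if (gate_d) begin          // falling edge: one period measured
            count <= 0;
            if (n == 5'd31) begin
                avg       <= total[25:5];   // divide by 32 = drop 5 LSBs
                avg_valid <= 1'b1;
                sum       <= 0;
                n         <= 0;
            end else begin
                sum <= total;
                n   <= n + 1'b1;
            end
        end
    end
endmodule
```

The error against the setpoint would then be a single subtraction on `avg`, done once per 32 periods.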

The output of each of these modules is a 16-bit, signed "error"
vector.  All 44 errors go to an "evaluation" module where it is
decided how much warmth needs to be applied to each string to tune it,
in the form of 219 PWM duty cycles (0  to 256 each).  These numbers
are sent to another two modules and translated to a synchronous serial
output where a separate power circuit decodes and uses them to produce
the individual PWM control lines.  Once the "in tune" value of PWM is
found for each string, it is simply maintained until the system is
switched off.  Actual adjustments only occur when the system is first
turned on.  Currently I can tune each set of 44 strings in about 20 to
30 seconds.

The whole system does indeed work just fine so far.  I'm just running
out of FPGA space, ostensibly because of my poorly-written code.  I
would like to add code to refine the tuning accuracy and time and
possibly add the "store" option, so I need all the space I can get.

Thanks for all of the excellent help so far!

Don A. Gilmore
Kansas City
