Downsizing Verilog synthesization.

Started by eromlignod August 6, 2008
eromlignod wrote:
> Hi guys: > > I'm prototyping an application using a Xilinx Spartan-3 development > board. I'm using this particular development kit because it is suited > to the large amount of I/O I need. > > I'm new to FPGA, so I have written the code in Verilog using almost > exclusively a high-level, behavioural style. The program works, but > synthesizes using 99% of the available slices. So if I try to change > or improve the code, it often synthesizes to over 100% and kicks out > an error.
Are you talking about synthesis or place and route? It's not an error (at least through 9.1i) to synthesize to more logic than is available in a part. My current design synthesizes to 105% of device resources. Mapping optimizes the design and gets rid of redundant or unused logic. It's also not unusual to have 99% of the slices occupied. The tools prefer to spread the design over as many slices as possible. The map report will show how many resources are really used (LUTs, FF, BRAM, Clocks, MULT.....). --- Joe Samson
On Aug 6, 11:31 am, eromlignod <eromlig...@aol.com> wrote:
> On Aug 6, 11:24 am, eromlignod <eromlig...@aol.com> wrote:
> > On Aug 6, 10:49 am, Gabor <ga...@alacron.com> wrote:
> > > On Aug 6, 10:43 am, eromlignod <eromlig...@aol.com> wrote:
> > > > On Aug 6, 9:21 am, John McCaskill <jhmccask...@gmail.com> wrote:
> > > > > On Aug 6, 8:40 am, eromlignod <eromlig...@aol.com> wrote:
> > > > > > Hi guys:
> > > > > >
> > > > > > I'm prototyping an application using a Xilinx Spartan-3 development board. I'm using this particular development kit because it is suited to the large amount of I/O I need.
> > > > > >
> > > > > > I'm new to FPGA, so I have written the code in Verilog using almost exclusively a high-level, behavioural style. The program works, but synthesizes using 99% of the available slices. So if I try to change or improve the code, it often synthesizes to over 100% and kicks out an error.
> > > > > >
> > > > > > I need to condense what I've got to give me some space to work with.
> > > > > >
> > > > > > The application is basically a large number of high-speed pulse inputs. I count them all independently and average several readings over time for each to produce a 21-bit number. Each of these 21-bit vectors (there are almost 100) is sent to a central processing module that evaluates and compares them using simple arithmetic. Based on these comparisons, another set of vectors is sent on to a couple of modules that arrange them into a special synchronous serial output. That's all it does.
> > > > > >
> > > > > > Are there any standard tips or general guidelines that you might offer to condense my synthesis? I have found, for example, that making the vectors smaller doesn't really change the overall slice count, yet commenting out a single line of the processing code can change it drastically.
> > > > > >
> > > > > > Any ideas or comments would be greatly appreciated.
> > > > > >
> > > > > > Don
> > > > >
> > > > > Since you state that you run out of slices, I know that your design is larger than the FPGA can hold, but I would still point out that the slice utilization is a pessimistic view of how much of the FPGA you are using, the mapping stage spreads the logic out by default instead of packing it as tightly as possible. The Register and LUT utilization is an optimistic measure of how much of the FPGA you have left. You need to watch all of them to get a good idea of how full your design really is.
> > > > >
> > > > > You mention both a high speed pulse counting section that counts and averages over time, and then a processing section that sounds like it is slower. How much slower is it? If you can share resources over time in this section you could save resources.
> > > > >
> > > > > You can look in the reports to see how many adders, etc the tools inferred from your code. Your goal is to reduce that number to the minimum required to perform the comparisons. You have a range of options that depend on your constraints. At one end of the spectrum, just find any redundant calculations and rearrange your code to share those calculations. At the other end, you could use a soft processor such as a PicoBlaze to do the calculations in software.
> > > > >
> > > > > Regards,
> > > > >
> > > > > John McCaskill www.FasterTechnology.com
> > > >
> > > > What sorts of operations are the biggest gate-hogs?
> > > >
> > > > I have a lot of comparison "if" operations, counters, and non-blocking assignments to convert lots of inputs into usable arrays. The averagers each divide by 32 and I have another single divider toward the end that divides by 256. Other than that, I'm not doing anything very fancy. I have no multipliers (though I might like to add one), no "for" loops, etc.
> > > >
> > > > I do have a series of hard-coded standard values that I use for comparison. They are in the form of parameters that are fed to each of the input counter modules when they are instantiated in the top module. I suppose these could be EPROM memories, but I haven't figured out yet how to use the memory provided on the development board.
> > > >
> > > > Don
> > >
> > > What tools are you using for synthesis? If ISE / XST (webpack or foundation from Xilinx) which version?
> > >
> > > Things like divide by power of two should take no resources whatever (i.e. shift operators are basically wires). However a synthesis tool may look at the division operator and think you need a divider, which will take a lot of logic.
> > >
> > > Also since you seem to be register-heavy, see where you can use serial shift registers or memory instead of loose flip-flops. In Spartan 3 you get 16 stages of serial shift register or 16 bits of distributed RAM from a single LUT site. Coding shift registers without a reset term allows the synthesizer to place them in these structures instead of flip-flops (which come one to a LUT site).
> > >
> > > Did you look at your map report or "design summary"? In the latest version of ISE the design summary can show you where your largest resource allocations come from.
> > >
> > > Regards,
> > > Gabor
> >
> > Interesting.
> >
> > Thanks Gabor! This may be very useful. I have a large number of 8-bit vectors in my design. I have about 220 of them passing from one module to another. They each begin as an "output reg [7:0]" in one module and are all assigned to an array in the other module like this.
> >
> > reg [7:0] array [219:0];
> > ...
> > y[0] <= array[0];
> > y[1] <= array[1];
> > y[2] <= array[3];
> > ...etc.
> >
> > Is this bad form?
> >
> > Don
>
> Oops. I meant for that code to be:
>
> input [7:0] y0;
> input [7:0] y1;
> ...
> reg [7:0] array [219:0];
> ...
> array[0] <= y0;
> array[1] <= y1;
> ...etc.
>
> Don
If you can map this onto a block ram, you will save quite a bit of registers. Whether or not you can do this depends on if you can write the vectors in one (or a few) at a time, and process them sequentially in the time you have available. How much time do you have to process the vectors? Ns, us, ms ? Regards, John McCaskill www.FasterTechnology.com
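For readers following along, here is a minimal sketch of what "mapping onto a block RAM" can look like in Verilog. It is not code from the thread: the module name, port names, and the 256-entry depth are assumptions chosen to hold the roughly 220 eight-bit vectors Don describes. The idea is that vectors are written one at a time and read back one at a time, so the storage can live in a RAM instead of about 1,760 loose flip-flops.

```verilog
// Sketch only: vector_store and all port names are hypothetical.
module vector_store (
    input  wire       clk,
    input  wire       we,      // write enable for the vector arriving now
    input  wire [7:0] waddr,   // index of the vector being written (0..219)
    input  wire [7:0] wdata,   // the 8-bit vector value
    input  wire [7:0] raddr,   // index stepped by the sequential processing logic
    output reg  [7:0] rdata    // value read back, one clock after raddr
);
    // 256 x 8 bits, no reset term on the array: a memory written and read
    // on a clock edge like this is the pattern synthesis tools typically
    // recognize as RAM.
    reg [7:0] mem [0:255];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        rdata <= mem[raddr];   // registered (synchronous) read
    end
endmodule
```

Whether the tools actually place this in block RAM rather than distributed RAM depends on the tool version and settings, so the synthesis report is the place to confirm it. The trade-off is the one John raises: only one vector can be read per clock, so the processing has to be sequential.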
On Aug 6, 11:50 am, John McCaskill <jhmccask...@gmail.com> wrote:
> If you can map this onto a block ram, you will save quite a bit of > registers. Whether or not you can do this depends on if you can write > the vectors in one (or a few) at a time, and process them sequentially > in the time you have available. How much time do you have to process > the vectors? Ns, us, ms ?
Ah, I think I'm following along now. Are you talking about sending the numbers over a single 8-bit vector wire one-at-a-time? Hmmm. The vectors are actually independent from each other and refresh at various random rates, so a few usec here or there shouldn't make a difference. I'll give it a try! Don
On Aug 6, 11:43 am, Joseph Samson <u...@not.my.company> wrote:
> Are you talking about synthesis or place and route? It's not an error > (at least through 9.1i) to synthesize to more logic than is available in > a part. My current design synthesizes to 105% of device resources. > Mapping optimizes the design and gets rid of redundant or unused logic. > > It's also not unusual to have 99% of the slices occupied. The tools > prefer to spread the design over as many slices as possible. > > The map report will show how many resources are really used (LUTs, FF, > BRAM, Clocks, MULT.....).
Right, I meant when I process all the way to place & route, not just synthesis. It is funny that it is at 99%. In fact, it shows that all but two of the slices are used (!!). It does list usage as 76% related logic and 24% unrelated logic. I'm not sure how to remedy that. Don
On Aug 6, 10:36 am, eromlignod <eromlig...@aol.com> wrote:
<snip>
> Right, I meant when I process all the way to place & route, not just > synthesis. It is funny that it is at 99%. In fact, it shows that all > but two of the slices are used (!!). > > It does list usage as 76% related logic and 24% unrelated logic. I'm > not sure how to remedy that. > > Don
The Xilinx mapper will spread logic out with one LUT per slice until it fills to nearly 100% of the slices, then it will backfill the 2nd LUT in each slice where conditions permit. There's usually a good stretch between 99% and 101%. If you explained better what you're trying to do (signal quantity, what the counts represent, frequencies of the signals and the system clock) you might get better suggestions on how to code things. Right now most of us are taking stabs in the dark.
eromlignod wrote:

> Hi guys: > > I'm prototyping an application using a Xilinx Spartan-3 development > board. I'm using this particular development kit because it is suited > to the large amount of I/O I need. > > I'm new to FPGA, so I have written the code in Verilog using almost > exclusively a high-level, behavioural style. The program works, but > synthesizes using 99% of the available slices. So if I try to change > or improve the code, it often synthesizes to over 100% and kicks out > an error. > > I need to condense what I've got to give me some space to work with. > > The application is basically a large number of high-speed pulse > inputs.
Define 'high speed', and what timebase reading rate ? What are the pulses coming from ? To some, microseconds is high speed, to others, femtoseconds is high speed.....
> I count them all independently and average several readings > over time for each to produce a 21-bit number.
If you mean several readings from the same channel, a longer count time will do that for free. Are you reading frequency? (fixed time readout of the counters)
> Each of these 21-bit > vectors (there are almost 100) is sent to a central processing module > that evaluates and compares them using simple arithmetic. Based on > these comparisons, another set of vectors is sent on to a couple of > modules that arrange them into a special synchronous serial output. > That's all it does.
What sort of comparison, and what decision rates are you talking ? Is that processing software, or hardware ? Do you need 21 bits of precision, or just 21 bits of dynamic range ? A quasi-log counter bus would drop the fan-out (so a 13 bit MSB and a 3 bit exponent would mux on 16 bit data paths, 16/21 or about 76% of the mux logic right there). -jg
On Aug 6, 12:31 pm, eromlignod <eromlig...@aol.com> wrote:
> On Aug 6, 11:50 am, John McCaskill <jhmccask...@gmail.com> wrote: > > > If you can map this onto a block ram, you will save quite a bit of > > registers. Whether or not you can do this depends on if you can write > > the vectors in one (or a few) at a time, and process them sequentially > > in the time you have available. How much time do you have to process > > the vectors? Ns, us, ms ? > > Ah, I think I'm following along now. Are you talking about sending > the numbers over a single 8-bit vector wire one-at-a-time? Hmmm. > > The vectors are actually independent from each other and refresh at > various random rates, so a few usec here or there shouldn't make a > difference. I'll give it a try! > > Don
You are asking good questions, so there are multiple people here that will be happy to help you out. However, you are asking for some low level suggestions without giving enough high level detail. The best optimizations are the ones that you apply at the high level where you have the most leverage. If you can tell us more about what you are trying to do you will get better responses. You said that you have almost 100 high speed channels. How many channels are there? How fast are the pulses arriving on average? Over what time is the average? What is the air speed of an unladen swallow? What is the minimum spacing between pulses? How fast does your central processing module need to compare the channels? As Jim Granville pointed out, the various time bases of your problem have a major impact on the potential solutions. Regards, John McCaskill www.FasterTechnology.com
> I'm intrigued by your answer, but don't fully understand what you > propose.
You need to have a better understanding of what's generated by your code. Remember you're describing hardware.
> My last serial generating module has a big 256 vector input that it is
> translating to a serial output that repeats the 256 bits over and
> over. The code is basically something like this:
>
> input [255:0] invector;
> output serout;
> reg [7:0] x;
> always @(negedge shiftclock)
> begin
>   x = x + 1;
>   serout = invector[x];
> end
That (probably) creates a 256 bit vector and a massive mux to select one of the bits. In VHDL the following generates a big shift register which the tools will find dead easy to place and route as each logical path is just from one register to the next.....

if(rising_edge(clk)) then
    invector(254 downto 0) <= invector(255 downto 1);
    serout <= invector(0);
end if;

This should be easily translated to Verilog.

Nial.
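One possible Verilog rendering of the shift-register idea, as a sketch rather than a drop-in replacement. It keeps Don's port names (`invector`, `serout`, `shiftclock`) but adds two things that are assumptions, not something from the thread: an internal register `shreg`, because a module input cannot be shifted in place, and a `load` strobe to capture a new frame. The register rotates instead of shifting so the 256 bits repeat over and over, as in Don's description.

```verilog
// Sketch only: 'load' and 'shreg' are hypothetical additions.
module serializer (
    input  wire         shiftclock,
    input  wire         load,       // pulse high for one clock to capture a new frame
    input  wire [255:0] invector,   // 256-bit frame to send
    output reg          serout      // serial output, bit 0 first
);
    reg [255:0] shreg;

    always @(negedge shiftclock) begin
        if (load)
            shreg <= invector;                   // parallel load of the frame
        else
            shreg <= {shreg[0], shreg[255:1]};   // rotate right so the frame repeats
        serout <= shreg[0];                      // register-to-register path only
    end
endmodule
```

Each bit now needs only a 2:1 choice (load or shift) in front of its flip-flop instead of a share of a 256:1 mux. Because of the parallel load, this will probably stay in ordinary flip-flops rather than the LUT-based shift registers Gabor mentions; avoiding the parallel load would be a separate redesign.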
On Aug 6, 10:21 am, John McCaskill <jhmccask...@gmail.com> wrote:
[snip]
> > Don > > Since you state that you run out of slices, I know that your design is > larger than the FPGA can hold, but I would still point out that the > slice utilization is a pessimistic view of how much of the FPGA you > are using, the mapping stage spreads the logic out by default instead > of packing it as tightly as possible. The Register and LUT > utilization is an optimistic measure of how much of the FPGA you have > left. You need to watch all of them to get a good idea of how full > your design really is.
[snip]
> Regards, > > John McCaskill www.FasterTechnology.com
On a related note, I had a design with about 60% LUT and flip-flop utilization (Spartan 2) that would not fit until I checked "Disable Register Ordering" in the mapping options. If you're not exceeding the capacity of the device by too much you can look at tuning the tools a bit to help fit the design. Regards, Gabor
On Aug 6, 8:05 pm, John McCaskill <jhmccask...@gmail.com> wrote:
> On Aug 6, 12:31 pm, eromlignod <eromlig...@aol.com> wrote:
> > On Aug 6, 11:50 am, John McCaskill <jhmccask...@gmail.com> wrote:
> > > If you can map this onto a block ram, you will save quite a bit of registers. Whether or not you can do this depends on if you can write the vectors in one (or a few) at a time, and process them sequentially in the time you have available. How much time do you have to process the vectors? Ns, us, ms ?
> >
> > Ah, I think I'm following along now. Are you talking about sending the numbers over a single 8-bit vector wire one-at-a-time? Hmmm.
> >
> > The vectors are actually independent from each other and refresh at various random rates, so a few usec here or there shouldn't make a difference. I'll give it a try!
> >
> > Don
>
> You are asking good questions, so there are multiple people here that will be happy to help you out. However, you are asking for some low level suggestions without giving enough high level detail. The best optimizations are the ones that you apply at the high level where you have the most leverage.
>
> If you can tell us more about what you are trying to do you will get better responses. You said that you have almost 100 high speed channels.
>
> How many channels are there?
> How fast are the pulses arriving on average?
> Over what time is the average?
> What is the air speed of an unladen swallow?
> What is the minimum spacing between pulses?
> How fast does your central processing module need to compare the channels?
>
> As Jim Granville pointed out, the various time bases of your problem have a major impact on the potential solutions.
>
> Regards,
>
> John McCaskill www.FasterTechnology.com
Well, I'm being a little cryptic because there are new patent applications involved and I don't want to give too much away. I can tell you those things that are already covered in the first patent though. The device is a self-tuning piano. You can read/listen about it here...

New York Times: http://query.nytimes.com/gst/fullpage.html?res=9800E1D8133FF931A35752C0A9659C8B63
NPR: http://www.npr.org/templates/story/story.php?storyId=878091
New Scientist Magazine: http://www.newscientist.com/article/dn3143-hotwired-piano-tunes-itself.html

The incoming signals are square waves at the fundamental frequency of each of the 219 strings in the piano that are being magnetically sustained (vibrate forever). I only read 44 strings at a time, tune them, then go on to the next 44, etc. So there are 44 counting modules and an output address signal to instruct the sustainer circuits when to vibrate the next string.

I first convert the wave to a "period" wave that has an "on" time equal to one period of the string's vibration. I then use this wave to enable counting of the 50-MHz system clock. So I get a count of how many clock ticks of the system clock occur for one period of string vibration. This takes up to 21 bits for the low strings. I average 32 of these numbers and calculate an error based on a stored setpoint. Currently I'm using a theoretical setpoint, but eventually I will want to add the feature whereby a piano tech can hand-tune the piano and then "store" his tuning numbers for subsequent use.

The output of each of these modules is a 16-bit, signed "error" vector. All 44 errors go to an "evaluation" module where it is decided how much warmth needs to be applied to each string to tune it, in the form of 219 PWM duty cycles (0 to 256 each). These numbers are sent to another two modules and translated to a synchronous serial output where a separate power circuit decodes and uses them to produce the individual PWM control lines.
Once the "in tune" value of PWM is found for each string, it is simply maintained until the system is switched off. Actual adjustments only occur when the system is first turned on. Currently I can tune each set of 44 strings in about 20 to 30 seconds. The whole system does indeed work just fine so far. I'm just running out of FPGA space, ostensibly because of my poorly-written code. I would like to add code to refine the tuning accuracy and time and possibly add the "store" option, so I need all the space I can get. Thanks for all of the excellent help so far! Don A. Gilmore Kansas City
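As an illustration of the "average 32 readings" step described above, here is a hedged sketch of an accumulate-and-shift averager. It is not Don's code: the module name, the `sample_valid` strobe, and the output names are assumptions. It follows Gabor's point from earlier in the thread that dividing by a power of two can be done by dropping low-order bits, with no `/` operator for the synthesizer to turn into a divider.

```verilog
// Sketch only: period_averager and its port names are hypothetical.
module period_averager (
    input  wire        clk,            // 50-MHz system clock
    input  wire        sample_valid,   // one-cycle pulse per new period count
    input  wire [20:0] sample,         // 21-bit period count
    output reg  [20:0] average,        // mean of the last 32 samples
    output reg         average_valid   // one-cycle pulse when 'average' updates
);
    // 21 bits plus 5 bits of growth holds the sum of 32 samples.
    // Power-up values via declaration initializers are an assumption
    // about tool support; a reset input would also work.
    reg [25:0] acc   = 26'd0;
    reg [4:0]  count = 5'd0;

    wire [25:0] sum = acc + sample;

    always @(posedge clk) begin
        average_valid <= 1'b0;
        if (sample_valid) begin
            if (count == 5'd31) begin
                average       <= sum[25:5];  // divide by 32: keep the upper 21 bits
                average_valid <= 1'b1;
                acc           <= 26'd0;
                count         <= 5'd0;
            end else begin
                acc   <= sum;
                count <= count + 5'd1;
            end
        end
    end
endmodule
```

With 44 of these in parallel, the adders and accumulators are still replicated 44 times. Since the strings update slowly relative to 50 MHz, sharing one adder across channels, in the spirit of John's time-sharing suggestion, may save more than this change alone.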