Downsizing Verilog synthesization.

Started by eromlignod August 6, 2008
Hi guys:

I'm prototyping an application using a Xilinx Spartan-3 development
board.  I'm using this particular development kit because it is suited
to the large amount of I/O I need.

I'm new to FPGA, so I have written the code in Verilog using almost
exclusively a high-level, behavioural style.  The program works, but
synthesizes using 99% of the available slices.  So if I try to change
or improve the code, it often synthesizes to over 100% and kicks out
an error.

I need to condense what I've got to give me some space to work with.

The application is basically a large number of high-speed pulse
inputs.  I count them all independently and average several readings
over time for each to produce a 21-bit number.  Each of these 21-bit
vectors (there are almost 100) is sent to a central processing module
that evaluates and compares them using simple arithmetic.  Based on
these comparisons, another set of vectors is sent on to a couple of
modules that arrange them into a special synchronous serial output.
That's all it does.

Are there any standard tips or general guidelines that you might offer
to condense my synthesis?  I have found, for example, that making the
vectors smaller doesn't really change the overall slice count, yet
commenting out a single line of the processing code can change it
drastically.

Any ideas or comments would be greatly appreciated.

Don
eromlignod wrote:

> The application is basically a large number of high-speed pulse
> inputs. I count them all independently and average several readings
> over time for each to produce a 21-bit number. Each of these 21-bit
> vectors (there are almost 100) is sent to a central processing module
> that evaluates and compares them using simple arithmetic. Based on
> these comparisons, another set of vectors is sent on to a couple of
> modules that arrange them into a special synchronous serial output.

Since the answer is shifted out in serial, maybe it could be constructed a bit at a time to save resources.

> Are there any standard tips or general guidelines that you might offer
> to condense my synthesis?

A basic trade is time for gates. A serial CRC is slower, but requires fewer resources than the parallel version, for example.

-- Mike Treseler
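As a rough illustration of the time-for-gates trade mentioned above, here is a minimal sketch of a bit-serial CRC. The module name, the port names and the CRC-8 polynomial 0x07 are assumptions picked for the example; nothing in the thread says which CRC, if any, the design uses.

```verilog
// Bit-serial CRC-8 sketch: consumes one message bit per clock.
// Polynomial 0x07 (x^8 + x^2 + x + 1) is an arbitrary choice for illustration.
module crc8_serial (
    input  wire       clk,
    input  wire       clear,    // synchronous clear of the CRC register
    input  wire       enable,   // high when data_in carries a valid bit
    input  wire       data_in,  // message bit, MSB first
    output reg  [7:0] crc
);
    // Feedback bit: incoming data XOR the bit about to be shifted out.
    wire fb = data_in ^ crc[7];

    initial crc = 8'h00;

    always @(posedge clk) begin
        if (clear)
            crc <= 8'h00;
        else if (enable)
            // Shift left by one and XOR in the polynomial taps when fb is 1.
            crc <= {crc[6:0], 1'b0} ^ ({8{fb}} & 8'h07);
    end
endmodule
```

This version needs eight flip-flops and a handful of XOR gates but takes eight clocks per byte. A byte-wide parallel version would finish in one clock at the cost of a wider XOR network on every CRC bit.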
eromlignod wrote:
<snip>
> Are there any standard tips or general guidelines that you might offer
> to condense my synthesis? I have found, for example, that making the
> vectors smaller doesn't really change the overall slice count, yet
> commenting out a single line of the processing code can change it
> drastically.
>
> Any ideas or comments would be greatly appreciated.
>
> Don

Time multiplexing can often help significantly. If you have 100 21-bit counters, the 2100 registers and associated muxing can take a lot of space. If you're not running near the limit of the part, you could increase the clock rate and share some counters in distributed memory. If things are really slow, you can go to BlockRAMs, eliminating the redundancy and reducing read mux logic significantly.

If you can't increase the processing frequency, you could still count just the LSbits locally and cycle through the counters, adding each small count into a BlockRAM's worth of full-width counter values and clearing it as you go. To cycle through 255 32-bit counters, you'd need 8-bit counters for each signal and a read-add-write cycle (using the dual-port mode) for each entry in your list. You end up only using half the BlockRAM for this extreme number of counters. It's more housekeeping but you use a fraction of the count resources.
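The read-add-write scheme described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the module and signal names, the streaming `upd_*` outputs, and the requirement that each `pulse` input has already been synchronized to `clk` and is high for exactly one clock per event. It has not been run through any particular synthesis tool.

```verilog
// Sketch: N small local counters harvested round-robin into one RAM of totals.
module shared_counters #(
    parameter N      = 255,  // number of pulse inputs (figure from the post above)
    parameter AW     = 8,    // address width, N <= 2**AW
    parameter SMALLW = 8,    // local counter width; must hold N counts
    parameter BIGW   = 32    // width of the totals kept in RAM
) (
    input  wire            clk,
    input  wire [N-1:0]    pulse,      // assumed: one-clock strobes, synchronous to clk
    output reg  [AW-1:0]   upd_addr,   // index of the total just updated
    output reg  [BIGW-1:0] upd_total,  // its new value
    output reg             upd_valid
);
    reg [SMALLW-1:0] small [0:N-1];    // local LSB counters (flip-flops)
    reg [BIGW-1:0]   total [0:N-1];    // intended to infer a dual-port RAM

    reg [AW-1:0]     scan    = 0;      // entry being read this clock
    reg [AW-1:0]     scan_d  = 0;      // entry being written this clock
    reg [SMALLW-1:0] harvest = 0;      // local count captured for scan_d
    reg [BIGW-1:0]   rd_data = 0;      // RAM read data for scan_d
    reg              stage2  = 1'b0;   // high once the pipeline holds valid data

    integer i, k;
    initial begin
        for (k = 0; k < N; k = k + 1) begin
            small[k] = 0;
            total[k] = 0;
        end
        upd_addr  = 0;
        upd_total = 0;
        upd_valid = 1'b0;
    end

    always @(posedge clk) begin
        // Local counters: the harvested one restarts, keeping a pulse that
        // arrives on the same clock so no event is lost.
        for (i = 0; i < N; i = i + 1) begin
            if (i == scan)
                small[i] <= pulse[i];
            else if (pulse[i])
                small[i] <= small[i] + 1'b1;
        end

        // Stage 1: read the running total and capture the local count.
        harvest <= small[scan];
        rd_data <= total[scan];
        scan_d  <= scan;
        stage2  <= 1'b1;
        scan    <= (scan == N-1) ? {AW{1'b0}} : scan + 1'b1;

        // Stage 2: write the sum back through the second RAM port.
        upd_valid <= stage2;
        if (stage2) begin
            total[scan_d] <= rd_data + harvest;
            upd_addr      <= scan_d;
            upd_total     <= rd_data + harvest;
        end
    end
endmodule
```

With 255 inputs the sweep takes 255 clocks, so an 8-bit local counter cannot overflow even if an input pulses on every clock. The local counters and the mux that reads them still cost fabric, but the wide totals move into memory.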
On Aug 6, 8:40 am, eromlignod <eromlig...@aol.com> wrote:
<snip>
Since you state that you run out of slices, I know that your design is larger than the FPGA can hold, but I would still point out that the slice utilization is a pessimistic view of how much of the FPGA you are using: the mapping stage spreads the logic out by default instead of packing it as tightly as possible. The Register and LUT utilization is an optimistic measure of how much of the FPGA you have left. You need to watch all of them to get a good idea of how full your design really is.

You mention both a high speed pulse counting section that counts and averages over time, and then a processing section that sounds like it is slower. How much slower is it? If you can share resources over time in this section you could save resources.

You can look in the reports to see how many adders, etc. the tools inferred from your code. Your goal is to reduce that number to the minimum required to perform the comparisons. You have a range of options that depend on your constraints. At one end of the spectrum, just find any redundant calculations and rearrange your code to share those calculations. At the other end, you could use a soft processor such as a PicoBlaze to do the calculations in software.

Regards,

John McCaskill
www.FasterTechnology.com
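To make the idea of sharing resources over time concrete, here is a minimal sketch in which one comparator steps through all the readings instead of one comparator per channel. The module name, ports, the placeholder reference value of 1000 and the use of a simple greater-than test are all assumptions; the real comparisons in the design are not described in the thread.

```verilog
// Sketch: a single comparator time-shared across N stored readings.
module shared_compare #(
    parameter N  = 100,  // number of channels ("almost 100" in the original post)
    parameter AW = 7,    // address width, N <= 2**AW
    parameter W  = 21    // reading width
) (
    input  wire          clk,
    input  wire          wr_en,      // store a new averaged reading
    input  wire [AW-1:0] wr_addr,
    input  wire [W-1:0]  wr_data,
    output reg  [AW-1:0] res_addr,   // channel the result belongs to
    output reg           res_above,  // 1 when reading > reference
    output reg           res_valid
);
    reg [W-1:0] reading [0:N-1];     // latest reading per channel
    reg [W-1:0] ref_mem [0:N-1];     // per-channel reference values

    reg [AW-1:0] idx    = 0;
    reg [AW-1:0] idx_d  = 0;
    reg [W-1:0]  a      = 0;
    reg [W-1:0]  b      = 0;
    reg          stage2 = 1'b0;

    integer k;
    initial begin
        for (k = 0; k < N; k = k + 1) begin
            reading[k] = 0;
            ref_mem[k] = 21'd1000;   // placeholder, not a value from the thread
        end
        res_addr  = 0;
        res_above = 1'b0;
        res_valid = 1'b0;
    end

    always @(posedge clk) begin
        if (wr_en)
            reading[wr_addr] <= wr_data;

        // Stage 1: fetch one reading and its reference per clock.
        a      <= reading[idx];
        b      <= ref_mem[idx];
        idx_d  <= idx;
        stage2 <= 1'b1;
        idx    <= (idx == N-1) ? {AW{1'b0}} : idx + 1'b1;

        // Stage 2: the only comparator in the module.
        res_addr  <= idx_d;
        res_above <= (a > b);
        res_valid <= stage2;
    end
endmodule
```

Each channel gets a fresh result every N clocks, which is only acceptable if the processing section really is much slower than the clock, as the question above asks.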
On Aug 6, 8:56 am, Mike Treseler <mtrese...@gmail.com> wrote:
<snip>
Mike:

I'm intrigued by your answer, but don't fully understand what you propose. You say that I should construct my serial signal a bit at a time, but how else can I? My last serial generating module has a big 256-bit vector input that it is translating to a serial output that repeats the 256 bits over and over. The code is basically something like this:

    input [255:0] invector;
    output serout;
    reg [7:0] x;

    always @(negedge shiftclock)
    begin
        x = x + 1;
        serout = invector[x];
    end

I'll bet there's a better way.

Don
On Aug 6, 9:21 am, John McCaskill <jhmccask...@gmail.com> wrote:
<snip>
What sorts of operations are the biggest gate-hogs?

I have a lot of comparison "if" operations, counters, and non-blocking assignments to convert lots of inputs into usable arrays. The averagers each divide by 32 and I have another single divider toward the end that divides by 256. Other than that, I'm not doing anything very fancy. I have no multipliers (though I might like to add one), no "for" loops, etc.

I do have a series of hard-coded standard values that I use for comparison. They are in the form of parameters that are fed to each of the input counter modules when they are instantiated in the top module. I suppose these could be EPROM memories, but I haven't figured out yet how to use the memory provided on the development board.

Don
On Aug 6, 10:43 am, eromlignod <eromlig...@aol.com> wrote:
<snip>
What tools are you using for synthesis? If ISE / XST (webpack or foundation from Xilinx) which version?

Things like divide by power of two should take no resources whatever (i.e. shift operators are basically wires). However a synthesis tool may look at the division operator and think you need a divider, which will take a lot of logic.

Also since you seem to be register-heavy, see where you can use serial shift registers or memory instead of loose flip-flops. In Spartan 3 you get 16 stages of serial shift register or 16 bits of distributed RAM from a single LUT site. Coding shift registers without a reset term allows the synthesizer to place them in these structures instead of flip-flops (which come one to a LUT site).

Did you look at your map report or "design summary"? In the latest version of ISE the design summary can show you where your largest resource allocations come from.

Regards,
Gabor
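The two coding points above can be sketched as follows. Both modules, their names, ports and widths are assumptions made for illustration. The first averages 32 samples and divides by dropping the five low bits rather than writing `/ 32`; the second is a 16-stage shift register written without a reset so that a tool is free to map it into a LUT-based shift register.

```verilog
// Sketch 1: average of 32 samples, dividing by 32 with a bit-select.
module avg32 #(
    parameter W = 16                  // sample width (hypothetical)
) (
    input  wire         clk,
    input  wire         sample_valid,
    input  wire [W-1:0] sample,
    output reg  [W-1:0] average,
    output reg          average_valid
);
    reg  [W+4:0] sum   = 0;           // 5 extra bits hold the sum of 32 samples
    reg  [4:0]   count = 0;
    wire [W+4:0] next_sum = sum + sample;

    initial begin
        average       = 0;
        average_valid = 1'b0;
    end

    always @(posedge clk) begin
        average_valid <= 1'b0;
        if (sample_valid) begin
            count <= count + 1'b1;            // wraps from 31 back to 0
            if (count == 5'd31) begin
                average       <= next_sum[W+4:5];  // drop 5 LSBs: divide by 32
                average_valid <= 1'b1;
                sum           <= 0;
            end else begin
                sum <= next_sum;
            end
        end
    end
endmodule

// Sketch 2: 16-stage delay line with no reset term.
module delay16 (
    input  wire clk,
    input  wire ce,   // clock enable
    input  wire d,
    output wire q
);
    reg [15:0] sr = 16'h0000;   // power-up value only; deliberately no reset

    always @(posedge clk)
        if (ce)
            sr <= {sr[14:0], d};

    assign q = sr[15];          // only the last stage is tapped
endmodule
```

The bit-select in `avg32` is pure wiring, so it should cost nothing regardless of how a given tool treats the `/` operator. Whether `delay16` actually lands in a shift-register LUT depends on the tool and its settings, so it is worth checking the synthesis report.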
On Aug 6, 10:49 am, Gabor <ga...@alacron.com> wrote:
<snip>
Interesting.

Thanks Gabor! This may be very useful. I have a large number of 8-bit vectors in my design. I have about 220 of them passing from one module to another. They each begin as an "output reg [7:0]" in one module and are all assigned to an array in the other module like this.

    reg [7:0] array [219:0];
    ...
    y[0] <= array[0];
    y[1] <= array[1];
    y[2] <= array[3];
    ...etc.

Is this bad form?

Don
On Aug 6, 11:24 am, eromlignod <eromlig...@aol.com> wrote:
<snip>
Oops. I meant for that code to be:

    input [7:0] y0;
    input [7:0] y1;
    ...
    reg [7:0] array [219:0];
    ...
    array[0] <= y0;
    array[1] <= y1;
    ...etc.

Don
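One possible way to avoid 220 separate ports and 220 hand-written assignments is to pass the bytes as a single flattened bus and unpack it in a loop with a Verilog-2001 indexed part-select. This is only a sketch under assumptions: the module name, the `y_flat` packing order, and the `sel`/`selected` read port (added so the array has a visible use) are all invented for the example.

```verilog
// Sketch: unpack a flattened bus of N bytes into an array with a loop.
module unpack_bytes #(
    parameter N = 220                 // number of 8-bit vectors
) (
    input  wire           clk,
    input  wire [8*N-1:0] y_flat,     // assumed packing: y0 in [7:0], y1 in [15:8], ...
    input  wire [7:0]     sel,        // which entry to read; keep below N
    output reg  [7:0]     selected
);
    reg [7:0] array [0:N-1];

    integer i;
    always @(posedge clk) begin
        // Same effect as array[0] <= y0; array[1] <= y1; ... written out by hand.
        for (i = 0; i < N; i = i + 1)
            array[i] <= y_flat[8*i +: 8];

        selected <= array[sel];
    end
endmodule
```

This changes how the code reads, not what it builds: the loop unrolls into the same 220 byte-wide registers, so by itself it should not be expected to reduce the slice count.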
eromlignod wrote:

> Mike:
>
> I'm intrigued by your answer, but don't fully understand what you
> propose. You say that I should construct my serial signal a bit at a
> time, but how else can I?

I meant to suggest arranging some sort of pipeline to work on the math while you are shifting the answer out.

-- Mike Treseler