Achieving required speed in Virtex-II Pro FPGA

Started by Unknown March 30, 2005
Hi, ALL!

Several months ago I did a schematic-based design implementing median
image filtering in an Altera EP1K30TC144-2. It ran at close to 150 MHz
without any explicit constraints except the target clock frequency. At
that time I did not need that much speed, because my device was
providing data at only 60 MHz or below.

Now we are busy with another device, capable of running at 150 MHz, and
we have an XC2VP4 speed grade 6 Xilinx FPGA as the data processing unit
within the device. I rewrote the design in VHDL. During verification,
the RTL schematic of the synthesized VHDL code looked exactly like the
schematic for the Altera ACEX-1K device. The only issue was speed: the
VHDL reincarnation of the median filter ran at only 134+ MHz. The
floorplan editor showed half of the chip polluted with the registers and
multiplexers of the design. I tried to set some constraints on the VHDL
code to reduce the area where this block is located. I spent about 6
hours playing with various placement/timing/routing attributes and
constraints but failed to get anything better.

So, is there any guide on constraint strategy? I read the constraints
guide, but there are too many choices. I managed to remove a couple of
setup errors by explicitly placing combinatorial logic and registers in
adjacent slices, but it would be a horrible idea to do manual chip
routing :(

With best regards,
Vladimir S. Mirgorodsky

v_mirgorodsky@yahoo.com wrote:

> Several months ago I did schematic based design, implementing median
> image filtering in Altera EP1K30TC144-2. It was running something close
> to 150MHz without any explicit constraints, except target clock
> frequency.

That is quite often all you need.

> Now we are busy with another device, capable to run at 150MHz and we
> have XC2VP4 speed grade 6 Xilinx FPGA as a data processing unit within
> the device. I rewrote design using VHDL language. During verification,
> RTL schematic of synthesized VHDL code was looking exact like schematic
> for Altera ACEX-1k device. The only issue was speed. VHDL reincarnation
> of median filter was running only 134+MHz.

You have changed not just design entry, but the device and the function.
There are lots of ways to drop fmax from "something close to 150MHz" to
134+ MHz. Constraints are for fine tuning. You could evaluate other
synthesis tools, but I expect you need to make some design tradeoffs.
Maybe a faster part, or use up extra resources for wider datapaths.

-- Mike Treseler

v_mirgorodsky@yahoo.com wrote:
[...]
> Now we are busy with another device, capable to run at 150MHz and we
> have XC2VP4 speed grade 6 Xilinx FPGA as a data processing unit within
> the device. I rewrote design using VHDL language. During verification,
> RTL schematic of synthesized VHDL code was looking exact like schematic
> for Altera ACEX-1k device. The only issue was speed. VHDL reincarnation
> of median filter was running only 134+MHz.

Howdy Vladimir,

I'd investigate the timing analyzer output and see what is holding the
design back. Do you have 5 failing paths or 100? Are there large fanouts
involved on some of the failing paths? Are there too many levels of
logic on some of the paths, and if so, can a pipeline stage be moved
forward or back to help break up the levels of logic?

Honestly, a -6 speed grade V2Pro should pretty easily meet 150+ MHz if
fanout and levels of logic are kept under control.

> So, is there any guide about constraints strategy? I read the guide
> about constraints, but there are too many choices. I managed to remove
> couple setup errors by explicit placing combinatorial logic and
> registers in adjacent slices, but it would be horrible idea to do
> manual chip routing :(

I agree, and in this case, I suspect it would be unnecessary. What
synthesis tool are you using? What clock speed are you telling the
tools? You might try a slightly faster target speed to see if it helps
you come much closer to meeting your period constraint, in addition to
lowering fanout limits and investigating the number of levels of logic.

Good luck,

Marc

Dear Marc,

Marc Randolph wrote:
> v_mirgorodsky@yahoo.com wrote:
> [...]
>
> I'd investigate the timing analyzer output and see what is
> holding the design back. Do you have 5 failing paths or
> 100? Are there large fanouts involved on some of the
> failing paths? Are there too many levels of logic on some
> of the paths, and if so, can a pipeline stage be moved
> forward or back to help break up the levels of logic?

I have only 10 timing errors. I have a 10-bit-wide data bus inside the
filter, delayed in SRL16 elements to save some triggers. The output of
the SRL16 goes directly to two comparators and two muxes driven by the
-ge and -lt output bits, and two bits of this bus violate the timing
requirements. By manual placement I managed to cure one bit, but got an
error in another. The timing report says I have about 5-7 logic levels
on the failing paths. The fan-out of the failing bits is 3-4 on average,
or at least that is what the tool reports.

Actually, I can add another pipeline stage between the 2-to-1 muxes and
the comparator outputs, but this will bring another 30+ triggers into
the design, which is not good. In the ACEX-1K such an optimization
brought the speed up to 200 MHz; actually, it was there from the
beginning, but we did not need that fast a solution and traded off speed
for area. It did run in the slower Altera chip, so what should I do to
get the same result out of the considerably faster Xilinx chip?

> Honestly, a -6 speed grade V2Pro should pretty easily meet
> 150+ MHz if fanout and levels of logic are kept under control.

This is the only hope :)

> > So, is there any guide about constraints strategy? I read the guide
> > about constraints, but there are too many choices. I managed to remove
> > couple setup errors by explicit placing combinatorial logic and
> > registers in adjacent slices, but it would be horrible idea to do
> > manual chip routing :(
>
> I agree, and in this case, I suspect it would be unnecessary. What
> synthesis tool are you using? What clock speed are you telling the
> tools? You might try a slightly faster target speed to see if it helps
> you come much closer to meeting your period constraint, in addition to
> lowering fanout limits and investigating the number of levels of logic.

I tried slightly faster clock constraints. Instead of 150 MHz I asked
the tool to PAR my design to meet something like 166+ MHz. The result
was exactly the same. 134+ MHz is some sort of hard border, which is
almost never crossed :(

I am using ISE 6.3 SP1 for all synthesis, placement and routing
operations.

With best regards,
Vladimir S. Mirgorodsky

Dear Mike Treseler,

It WAS running fast enough in the slower Altera chip, so it SHOULD run
at the same fmax or better in the faster Xilinx chip, right?

Regards,
Vladimir S. Mirgorodsky

<v_mirgorodsky@yahoo.com> wrote in message
news:1112252395.759518.274210@z14g2000cwz.googlegroups.com...
> Dear Mike Treseler,
>
> It WAS running fast enough in slower Altera chip, so it SHOULD run the
> same fmax or better in faster Xilinx chip, right?
>
> Regards,
> Vladimir S. Mirgorodsky

Wrong! Don't ever assume anything like that. The faster fmax can most
likely be achieved on the faster Xilinx part, but in general, if a
design reaches some fmax on one device, retargeting it to a new FPGA
architecture may require some adjustment to achieve comparable
performance. The way synthesis tools map the design to the FPGA is very
different from one architecture to another.

Antti

Dear Antti Lukats,

I am just curious: how does one optimize VHDL code for Xilinx versus
Altera? Yes, I know some elements may be implemented more efficiently in
Xilinx chips and others in Altera chips. You may target your design to
use one element or another, but generic triggers, multiplexers and
adders cannot be optimized for a particular FPGA architecture within the
VHDL language without using black-box primitives.

My concern about the Xilinx tools is that they do not give performance
comparable to the Altera tools with default settings.

With best regards,
Vladimir S. Mirgorodsky

v_mirgorodsky@yahoo.com wrote:
> Dear Marc,
>
> I have only 10 timing errors. I have a 10-bit wide data bus inside of
> the filter, delayed on SRL16 elements to save some triggers, the output
> of SRL16 goes directly to two comparators and two mux'es, driven by
> -ge and -lt output bits and I have two bits within this bus, violating
> timing requirements. Doing manual placing I managed to cure one bit,
> but got error in another. Timing report says I have about 5-7 logic
> levels on failing logic paths.

7 is quite a few, but probably not impossible. Unfortunately, that many
levels of logic, combined with almost any fanout, gives the tools a
chance to make very poor placement choices - as you've seen.

> The fan-out for erroneous bits is 3-4 average, or at least tool
> reports that.

Does the fanout come directly from the LUT that is used for the SRL, or
did the tools do the right thing and use a FF? It may or may not be
obvious from the timing report - you might have to use FPGA Editor to
check. Regardless, you might consider making your SRL one bit shorter
and forcing there to be another FF after the SRL. You might even go so
far as to fan out the output of the SRL to two or more FFs, and have
THOSE feed the rest of your logic. You may need a keep property on the
FFs to keep them from being optimized out.

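
The suggestion above (a delay line one stage shorter, followed by duplicated output flip-flops) could be sketched in VHDL roughly as follows. This is only an illustration under assumptions: the entity name, generics and port names are hypothetical, the 10-bit width is taken from the thread, and the attributes shown are the XST-style `keep` and `equivalent_register_removal` constraints, whose exact behaviour should be checked against the XST documentation for the tool version in use. Whether the two output registers really end up as separate FFs has to be confirmed in the map report or FPGA Editor.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: names, generics and ports are for illustration only.
entity delay_with_output_regs is
  generic (
    WIDTH : positive := 10;  -- bus width mentioned in the thread
    DEPTH : positive := 15   -- shift stages, one fewer than the total delay wanted
  );
  port (
    clk    : in  std_logic;
    din    : in  std_logic_vector(WIDTH-1 downto 0);
    dout_a : out std_logic_vector(WIDTH-1 downto 0);  -- copy driving part of the load
    dout_b : out std_logic_vector(WIDTH-1 downto 0)   -- copy driving the rest
  );
end entity delay_with_output_regs;

architecture rtl of delay_with_output_regs is
  type delay_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);

  -- No reset on the chain, so the synthesizer is free to infer SRL16s here.
  signal taps : delay_t;

  -- Final stage, duplicated so each copy sees a smaller fanout.
  signal reg_a, reg_b : std_logic_vector(WIDTH-1 downto 0);

  -- Assumed XST-style attributes intended to stop the duplicates from
  -- being merged or optimized away; verify against your tool's guide.
  attribute keep : string;
  attribute keep of reg_a : signal is "true";
  attribute keep of reg_b : signal is "true";
  attribute equivalent_register_removal : string;
  attribute equivalent_register_removal of reg_a : signal is "no";
  attribute equivalent_register_removal of reg_b : signal is "no";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Shift chain: DEPTH stages of plain delay.
      taps(0) <= din;
      for i in 1 to DEPTH-1 loop
        taps(i) <= taps(i-1);
      end loop;

      -- One extra register stage after the chain, in two copies.
      reg_a <= taps(DEPTH-1);
      reg_b <= taps(DEPTH-1);
    end if;
  end process;

  dout_a <= reg_a;
  dout_b <= reg_b;
end architecture rtl;
```

The total delay is DEPTH + 1 clock cycles, so DEPTH should be set one lower than the length of the original delay line. The comparators can then be fed from one copy and the muxes from the other.
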
> Actually, I can add another
> pipeline stage between 2-to-1 mux'es and comparator outputs, but this
> will bring another 30+ triggers into design, which is not good.

What are triggers? Do you mean FFs? They are basically free in most
FPGA designs, and are vital to high speed designs.

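
As a rough picture of the extra pipeline stage being discussed, the sketch below registers the comparator result and the data in one clock cycle and does the 2-to-1 selection in the next. It is not the poster's actual filter: the entity name, ports and the two-input sorting structure are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical two-input sorter with a register between compare and mux.
entity sort2_pipelined is
  generic (
    WIDTH : positive := 10  -- bus width mentioned in the thread
  );
  port (
    clk  : in  std_logic;
    a, b : in  std_logic_vector(WIDTH-1 downto 0);
    lo   : out std_logic_vector(WIDTH-1 downto 0);  -- smaller of a, b
    hi   : out std_logic_vector(WIDTH-1 downto 0)   -- larger of a, b
  );
end entity sort2_pipelined;

architecture rtl of sort2_pipelined is
  signal a_r, b_r : std_logic_vector(WIDTH-1 downto 0);  -- data delayed one cycle
  signal a_ge_b_r : std_logic;                           -- registered compare result
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: compare only, and carry the data along with the result.
      a_r <= a;
      b_r <= b;
      if unsigned(a) >= unsigned(b) then
        a_ge_b_r <= '1';
      else
        a_ge_b_r <= '0';
      end if;

      -- Stage 2: the 2-to-1 muxes are driven by a registered select,
      -- so the comparator and the mux no longer share one clock period.
      if a_ge_b_r = '1' then
        hi <= a_r;
        lo <= b_r;
      else
        hi <= b_r;
        lo <= a_r;
      end if;
    end if;
  end process;
end architecture rtl;
```

The cost is the extra registers for the delayed data plus one flag bit per comparator, and one additional clock cycle of latency through the stage.
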
> In
> ACEX-1K such optimization brought the speed up to 200MHz; actually, it
> was there from the beginning, but we did not need that fast solution
> and trade-off speed for area. It did run in slower Altera chip, what
> should I do to get the same result out of considerably faster Xilinx
> chip?

I'm not sure what you're asking. For the same design, the V2Pro part is
running considerably faster than the old ACEX part, is it not?

> I tried to use slightly faster clock constraints. Instead of 150MHz I
> asked the tool to PAR my design to meet something 166+MHz. The result
> was exactly the same. 134+MHz is some sort of hard border, which is
> almost never crossed :( I am using ISE 6.3 SP1 for all synthesis,
> routing and placement operations.

If possible, try to get ahold of Synplify from Synplicity for your
synthesis. They will often do evals so that your purchasing department
doesn't have to get involved until AFTER you see the (hopefully better)
results.

Good luck,

Marc

v_mirgorodsky@yahoo.com wrote:

> Hi, ALL!
>
> Several months ago I did schematic based design, implementing median
> image filtering in Altera EP1K30TC144-2. It was running something close
> to 150MHz without any explicit constraints, except target clock
> frequency. At that time I did not need that much speed, because my
> device was providing me data only 60MHz or below.
>
> Now we are busy with another device, capable to run at 150MHz and we
> have XC2VP4 speed grade 6 Xilinx FPGA as a data processing unit within
> the device. I rewrote design using VHDL language. During verification,
> RTL schematic of synthesized VHDL code was looking exact like schematic
> for Altera ACEX-1k device. The only issue was speed. VHDL reincarnation
> of median filter was running only 134+MHz.

Xilinx has dedicated carry logic that makes anything with carries
(adder, magnitude comparator) much faster. I know Spartan much better
than Virtex, so what I know may not apply. But, what I'm wondering is if
something you are doing in the schematic is excluding the use of the
dedicated carry function.

In one design I had to laboriously copy the way a Xilinx library macro
used the carry components when I made up a slightly different macro.
IIRC it also was a magnitude comparator, but I needed a greater than or
equal to function. All I can say is the thing works, but I don't
actually understand these carry components, I was just copying the thing
pretty blindly. But, the macro synthesizes to a much smaller and faster
instance on the Spartan chip when it uses these carry blocks.

Jon

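
In VHDL, the usual way to leave the carry-chain decision to the synthesizer is to write the magnitude comparison with an arithmetic type instead of building it from gates. The sketch below is a minimal example under assumptions: the entity and port names are hypothetical, and whether a given tool and version actually maps the comparison onto the dedicated carry logic has to be checked in the synthesis report or the technology schematic.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical registered "greater than or equal" comparator.
entity cmp_ge is
  generic (
    WIDTH : positive := 10  -- bus width mentioned in the thread
  );
  port (
    clk : in  std_logic;
    a   : in  std_logic_vector(WIDTH-1 downto 0);
    b   : in  std_logic_vector(WIDTH-1 downto 0);
    ge  : out std_logic     -- '1' one cycle after a >= b (unsigned)
  );
end entity cmp_ge;

architecture rtl of cmp_ge is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Written as a plain numeric comparison so the tool can choose
      -- its own implementation, including a carry chain if it sees fit.
      if unsigned(a) >= unsigned(b) then
        ge <= '1';
      else
        ge <= '0';
      end if;
    end if;
  end process;
end architecture rtl;
```

If the report shows the comparison built only from LUTs, that is a hint that something in the surrounding code or the synthesis options is steering the tool away from the carry resources.
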
Hi Vladimir,
In the Xilinx software the default settings are optimized for software
run time - not for effort level. Are you using a PC or Unix? I am more
familiar with the PC - on your PC, if you right-click on
"Synthesize - XST" and select Properties, you can choose the effort
level and various other options.

SRL16s are an extremely efficient way to use one LUT as a 16-deep shift
register. If you want to force the use of an actual flip-flop in order
to meet timing, you can place a reset on the last flip-flop. Registers
with resets are mapped to regular flip-flops; flip-flop chains with no
resets can take advantage of the SRL16 feature, which could save you a
lot of area.
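
A minimal sketch of that idea, assuming a synchronous reset and hypothetical entity, generic and port names: the chain has no reset and so remains a candidate for SRL16 inference, while the last stage has a reset and therefore cannot be absorbed into a shift-register LUT.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical delay line whose final stage is a resettable register.
entity delay_with_reset_tail is
  generic (
    WIDTH : positive := 10;  -- bus width mentioned in the thread
    DEPTH : positive := 15   -- stages without reset (SRL16 candidates)
  );
  port (
    clk  : in  std_logic;
    rst  : in  std_logic;    -- synchronous reset, used only by the last stage
    din  : in  std_logic_vector(WIDTH-1 downto 0);
    dout : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity delay_with_reset_tail;

architecture rtl of delay_with_reset_tail is
  type delay_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
  signal taps : delay_t;                               -- chain without reset
  signal tail : std_logic_vector(WIDTH-1 downto 0);    -- last stage, with reset
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Plain shift chain, no reset.
      taps(0) <= din;
      for i in 1 to DEPTH-1 loop
        taps(i) <= taps(i-1);
      end loop;

      -- Last stage: the reset is what keeps it out of the SRL16.
      if rst = '1' then
        tail <= (others => '0');
      else
        tail <= taps(DEPTH-1);
      end if;
    end if;
  end process;

  dout <= tail;
end architecture rtl;
```

The total delay is DEPTH + 1 cycles. If the design has no natural reset, rst can be driven from a signal that is asserted only at start-up; tying it to a constant would let the tool remove it again.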

- Vic

v_mirgorodsky@yahoo.com wrote:

> Dear Antti Lukats,
>
> I am just curious, how to optimize VHDL code to use with Xilinx versus
> Altera? Yes, I know, some elements may be created more efficiently in
> Xilinx chips, anothers - in Altera chips. You may target your design to
> use one or another element, but generic triggers, multiplexers and
> adders are not optimizable for certain FPGA architecture within VHDL
> language without using black box primitives.
>
> My concern about Xilinx tools is that they are not giving comparable
> performance versus Altera tools with default settings.
>
> With best regards,
> Vladimir S. MIrgorodsky