a clueless bloke tells Xilinx to get a move on

Started by Brannon October 5, 2006
Brannon wrote:

>>I have to wonder whether the writer of this letter looked at his own
>>design for the reasons PAR was taking too long. Did he keep the levels
>>of logic to a reasonable number for his desired timing target? Did he
>>duplicate logic to reduce high-fanout nets? Did he try any
>>floorplanning for critical parts of the design? Somehow I doubt it, yet
>>those things can make a several-orders-of-magnitude difference in the
>>time to run PAR.
>
> My logic and fanouts are fine. I confess, though, that I have never
> done floorplanning. I wouldn't even know where to start with it; I
> don't even know what level floorplanning is done at. I rarely use XST;
> I use my own EDIF generation tools. The tools I use tile out vast
> amounts of logic recursively; attaching location constraints to them is
> quite difficult (their names change with each compile, the flattened
> logic does not always look the same, etc.). My impression was that
> floorplanning required constraints at some level and that it is
> difficult using XST as well. Is that not true? I don't doubt that my
> top-level tool choice is hindering my ability to take full advantage of
> the Xilinx tools. How much time would it take to do a floorplan on a
> four-million-gate project? At what stage during the development process
> do you do it? I'm trying to improve my development time, not my
> retargetability time.
The time spent doing floorplanning depends on how hierarchical your design is and whether you take advantage of the hierarchy while doing your floorplanning. In your case, it sounds like you are building a design out of nearly identical (?) tiles, each comprised of Xilinx primitives. If that is the case, yours is an ideal candidate for floorplanning, as it is fairly easy to add the placement constraints to the EDIF netlist at the time of compilation. Most of my floorplanning is done in the source, where I floorplan the elements at that hierarchical level. When I get to the top-level design, the floorplanning amounts to placing a relatively small number of pre-placed tiles.

The levels-of-logic and fanout comment is relative to your projected speed. The more slack you have there, the faster the tool will complete. If the timing is tight, then it takes the tool much longer. Also, the automatic placer is notoriously bad at a few things: 1) placement of multiplier and BRAM blocks, 2) placement of data-width elements (i.e. it doesn't line up the bits horizontally in a chain of adders), and 3) placing LUTs that don't have registers between them; i.e. if you have two levels of logic between flip-flops, the second LUT (the one closest to the next flip-flop) is placed well, but the LUT feeding a LUT is often placed far away. This absolutely kills timing, and it takes the tool a very long time to anneal the LUTs back to a reasonable location. Keeping logic to one level between flip-flops, or hand-placing the LUTs, solves that. Unfortunately, LUT names and the logic distribution over a LUT tree are the things that change the most from synthesis run to synthesis run, so floorplanning LUT locations in multi-level logic is probably the hardest part of floorplanning.
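[Since the tiles are generated programmatically, the floorplan can be emitted in the same pass as the EDIF. Below is a minimal sketch of that idea, assuming a fixed grid of generated tiles; the "tile_X_Y" instance prefix and the tile/slice dimensions are made up for illustration. It writes one AREA_GROUP per tile into a UCF file, confining everything under that tile to its own block of slices.]

// Minimal sketch: emit one AREA_GROUP per generated tile so PAR starts from
// a floorplan rather than a cold placement. The "tile_X_Y" instance prefix
// and the tile/slice dimensions are hypothetical; adapt them to whatever
// names your EDIF generator actually produces.
#include <fstream>
#include <string>

int main() {
    const int kTileCols = 8, kTileRows = 8;  // hypothetical tile grid
    const int kSlicesX = 4, kSlicesY = 4;    // slices reserved per tile

    std::ofstream ucf("tiles.ucf");
    for (int ty = 0; ty < kTileRows; ++ty) {
        for (int tx = 0; tx < kTileCols; ++tx) {
            std::string tile = "tile_" + std::to_string(tx) + "_" + std::to_string(ty);
            int x0 = tx * kSlicesX, y0 = ty * kSlicesY;
            // Everything under this tile is confined to its own slice block.
            ucf << "INST \"" << tile << "/*\" AREA_GROUP = \"AG_" << tile << "\";\n"
                << "AREA_GROUP \"AG_" << tile << "\" RANGE = SLICE_X" << x0
                << "Y" << y0 << ":SLICE_X" << (x0 + kSlicesX - 1)
                << "Y" << (y0 + kSlicesY - 1) << ";\n";
        }
    }
    return 0;
}

[Because the constraints key off a stable tile prefix rather than leaf LUT names, they survive the name churn between compiles that makes per-LUT floorplanning so painful.]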
Brannon wrote:

> When I was using VCS to simulate complicated stuff before, it took
> several hours per run. I agree that the output was infinitely more
> useful. However, have you seen the prices on such EDIF tools? You can
> run a lot of 10-minute compiles before you pay for a $10k piece of
> software. And that's a cheap one.
>
> Symon wrote:
You might look at the Aldec simulator. The version I use does mixed language simulation including edif netlists. I don't know the current price, but I believe it is still well under $10K.
Brannon wrote:
> 2. Use a different algorithm. I understand that the tools currently
> rely on simulated annealing algorithms for placement and routing. This
> is probably a fine method historically, but we are arriving at the
> point where all paths are constrained and the paths are complex (not
> just vast in number). If there is no value in approximation, then the
> algorithm loses its value. Perhaps it is time to consider a
> branch-and-bound algorithm instead. This has the advantage of being
> easily threadable.
Threading will gain you a factor of 4 at most on a quad-core processor. When doing placement algorithm development, you are talking about orders of magnitude! Branch and bound would probably be the slowest possible way to do placement. I am talking billions of years here ;-)

Simulated annealing has the advantage that you can stop at any time, merely reducing the quality of the result. There was a post here recently that showed how Xilinx tool run time depends on the timing constraint set; the difference was larger than a factor of 4. Anyway, you are right that simulated annealing is old school, but I am sure that ISE starts with some kind of constructive placer (quadratic or recursive bipartitioning) and does not let the annealing do all the work; just the refinement, probably. Actually, Xilinx placement times are not bad compared to other tools.

Kolja Sulimma
(writing this while waiting for my computer to finish the benchmarks for my constant-delay ASIC placement PhD thesis)
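[To make the "stop at any time" property concrete, here is a toy anytime annealer; it is a sketch only and has nothing to do with how the actual ISE placer is written. Cells occupy slots on a 1-D line, the cost is the total length of randomly generated two-pin nets, and the loop runs until a wall-clock deadline. Shortening the deadline degrades only the reported wirelength, never the legality of the placement; a branch-and-bound search, by contrast, offers no usable answer until it finishes.]

// Toy anytime annealer: propose a swap of two cells, always accept downhill
// moves, accept uphill moves with probability exp(-delta/T). The deadline
// makes it an "anytime" algorithm: stopping early only costs quality.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

struct Net { int a, b; };  // two-pin nets, named by cell index

static int wirelength(const std::vector<int>& slot, const std::vector<Net>& nets) {
    int total = 0;
    for (const Net& n : nets) total += std::abs(slot[n.a] - slot[n.b]);
    return total;  // 1-D "wirelength": distance between the two endpoints
}

int main() {
    std::mt19937 rng(1);
    const int kCells = 200;
    std::vector<int> slot(kCells);
    for (int i = 0; i < kCells; ++i) slot[i] = i;  // arbitrary initial placement

    std::vector<Net> nets;
    std::uniform_int_distribution<int> pick(0, kCells - 1);
    for (int i = 0; i < 400; ++i) nets.push_back({pick(rng), pick(rng)});

    std::uniform_real_distribution<double> coin(0.0, 1.0);
    double T = 50.0;  // starting temperature
    int cost = wirelength(slot, nets);
    auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(2);
    while (std::chrono::steady_clock::now() < deadline && T > 0.01) {
        int i = pick(rng), j = pick(rng);
        std::swap(slot[i], slot[j]);
        int trial = wirelength(slot, nets);  // a real placer would update this incrementally
        if (trial <= cost || coin(rng) < std::exp((cost - trial) / T))
            cost = trial;                    // accept the move
        else
            std::swap(slot[i], slot[j]);     // reject: undo the swap
        T *= 0.9999;                         // slow geometric cooling
    }
    std::printf("final wirelength: %d\n", cost);
    return 0;
}

[A real placer would compute the cost delta per move instead of rescanning every net, and would tie the cooling schedule to the acceptance rate; both are omitted to keep the sketch short.]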
On 5 Oct 2006 22:56:10 +0200, "Symon" <symon_brewer@hotmail.com>
wrote:

>Hi Brannon,
>So, I guess we'd all like the tools to run faster, and you make some good
>suggestions.
>However, I wonder how often you _need_ to do a PAR cycle? Please excuse me
>if I'm teaching you to suck eggs, but I just want to check you've considered
>a development process where you simulate things before PAR. This way, your
>logic errors are found in the simulator, not the real hardware. If you like
>to try stuff out as you go, maybe you could run the PAR each evening before
>heading out to the pub, that's what I sometimes do. :-)
Excellent suggestion, but I find I usually need a drink *after* the PAR run.

Bob Perlman
Cambrian Design Works
http://www.cambriandesign.com
PeteS wrote:
> The crashes are, no doubt, because of the increasing complexity of each
> part of the process required to be evaluated by the tools. The *nix way
> was always 'do one thing and do it well', which used to exemplify the
> Xilinx tools. As they have got more complex, they have added things to
> each tool, such that they are now doing more than one thing. Adding such
> complexity adds exponentially to the sources of problems.
>
> I suggest each tool be completely re-evaluated - and if it's doing more
> than one thing, separate those things back out - to 'do one thing and do
> it well'.
When a vendor doesn't have time to do it right, there is always the shared Open Source development model where BOTH the vendor and the customers work to make the tools right, with a shared interest and investment.
Folks,

Brannon wrote:
> The following is an informal letter to Xilinx requesting their
> continued efforts to increase the speed of their software tools. If
> there are incorrect or missing statements, please correct me!
I thought this was a great post. To my eye, only one of the posts so far (Kolja's, I believe) has disagreed with any of Brannon's points regarding speedup. Several posts have made constructive suggestions for speedup with the existing tools. Nevertheless, the speedup suggestions in this thread (including an anonymous one: incremental compile) have merit independent of whether Brannon does floorplanning. You don't need to be a race car driver to point out potholes in the road.
> Dear Xilinx:
>
> As many of us spend numerous hours of our lives waiting for
> Map/Par/Bitgen to finish, I hereby petition Xilinx, Inc., to consider
> this issue (of their tool speed) to be of the highest priority. I am
> now scared to purchase newer chips because I fear that their increased
> size and complexity will only further delay my company's development
> times. Please, please, please invest the time and money to make the
> tools execute faster.
>
> Have you considered the following ideas for speeding up the process?
I'll be looking forward to seeing even just a basic yes/no response from Xilinx to this question!
> 1. The largest benefit to speed would be obtained through making the
> tools multithreaded. Upcoming multi-core processors will soon be
> available on all workstation systems. What is it that is taking Xilinx
> years on end to make their tools multithreaded? There is no excuse for
> this. I assume the tools are written in C/C++. Cross-platform C/C++
> threading libraries make thread management and synchronization easy
> (see boost.org).
>
> 2. Use a different algorithm. I understand that the tools currently
> rely on simulated annealing algorithms for placement and routing. This
> is probably a fine method historically, but we are arriving at the
> point where all paths are constrained and the paths are complex (not
> just vast in number). If there is no value in approximation, then the
> algorithm loses its value. Perhaps it is time to consider a
> branch-and-bound algorithm instead. This has the advantage of being
> easily threadable.
>
> 3. SIMD instructions are available on most modern processors. Are we
> taking full advantage of them? MMX, SSE1/2/3/4, etc.
>
> 4. Modern compilers have much improved memory management and code
> generation over those of previous years. Also, the underlying
> libraries for memory management and file I/O can have a huge impact on
> speed. Which compiler are you using? Which libraries are you using?
> Have you tried the latest PathScale or Intel compilers?
With regard to these two suggestions, it would be interesting to know what effective instruction throughput the code is achieving. Which is to say: examine the key time-consuming portion of the algorithm, add up the actual adds and multiplies and so on, and measure how long it takes. If the performance works out to around 100 MHz on a 3 GHz machine, well, there's probably room for improvement.
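[As a concrete, if simplistic, version of that measurement, the sketch below times a multiply-accumulate loop whose operation count is known and reports the effective arithmetic rate. The kernel and the two-ops-per-element count are stand-ins for whatever a profiler flags as the hot spot in the real tool.]

// Sketch of the measurement: time a kernel with a known op count, then
// report the effective arithmetic rate for comparison with the clock rate.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    float acc = 0.0f;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];                  // 2 floating-point ops per element
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.1f effective Mops/s (acc=%g)\n", 2.0 * n / secs / 1e6, acc);
    // On a 3 GHz machine, a figure near 100 Mops/s would suggest the loop is
    // memory- or dependency-bound: room for improvement, as Rajeev says.
    return 0;
}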
> 5. In recent discussions about the speed of the map tool, I learned
> that it took an unearthly five minutes simply to load and parse a 40 MB
> binary file on what is considered a fairly fast machine. It is
> obviously doing a number of sanity checks on the file that are likely
> unnecessary. It is also loading the chip description files at the same
> time. Even still, that seems slow to me. Can we expand the file format
> to include information about its own integrity? Can we increase the
> file caches? Are we using good, modern parser technology? Can we add
> command-line parameters that trade more memory usage for higher speed,
> and vice versa? Speaking of command-line parameters, the software
> takes almost three seconds just to show them. Why does it take that
> long to simply initialize?
>
> 6. Xilinx's chips are supposedly useful for acceleration. If so, make
> a PCIe x4 board that accelerates the tools using some S3 chips and
> SRAM. I'd pay $1000 for a board that gave a 5x improvement. (Okay, so
> that is way less than decent simulation tools cost; I confess I'm not
> willing to pay big dolla...)
This could be a great opportunity for Nallatech and Xilinx to work on together!
> 7. Is Xilinx making its money on software or hardware? If it is not
> making money on software, then consider making it open source. More
> eyes on the code mean more speed.
>
> Sincerely,
> An HDL peon
My two cents, -rajeev-
Rajeev,

Xilinx takes any suggestion seriously.

We, of all people, with the introduction of the Virtex-5 LX330, know
that we need to somehow make everything work better, and faster.

Note that due to the memory required, the LX330 can ONLY be compiled on
a 64-bit Linux machine... there are just too many logic cells, and too
much routing. 8 Gbytes is about what you need, and windoze can't handle
it (at all).

Regardless of the 'open source' debate, or any other postings, the FPGA
software is much, much larger and more complex than anyone seems to
comprehend (one of the greatest barriers to entry for anyone thinking
about competing). As such, I am hopeful that some of the musings here
will get folks thinking, but realistically I am doubtful, as the people
responsible are already trying, daily, to make things faster and better
as part of their job.

Austin
On 2006-10-09, Austin Lesea <austin@xilinx.com> wrote:
> Note that due to the memory required, the LX330 can ONLY be compiled on
> a 64-bit Linux machine... there are just too many logic cells, and too
> much routing. 8 Gbytes is about what you need, and windoze can't handle
> it (at all).
Wouldn't it be possible to do something like this:

* Divide the design into 4 roughly equal-sized parts.
* Divide the FPGA into 4 parts.
* Map each part of the design into one part of the FPGA. (For bonus
  points, run these 4 parts in parallel on different computers.)

These three steps should be doable with (optimistically) 2 GB of memory. Clock routing, I/O pins and connections between the different parts of the FPGA are not included in the above... Performing the final routing between the different parts of the FPGA should be doable by considering only the rows and columns closest to the boundaries of the different parts. This way the memory consumption is reduced as well. For best results it would be necessary to floorplan these connections as well.

I guess the partition support in ISE 8.2 can already (theoretically) do some of this; the only problem would be making sure that, for example, map/par considers only a part of the FPGA when implementing the design, instead of loading the ruleset for the entire FPGA.

On the other hand, memory is quite cheap compared to an LX330, I guess :)

/Andreas
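[A rough sketch of the parallel half of that flow. The per-partition command line is a hypothetical placeholder script ("implement_partition.sh"); a real flow would also need the per-quadrant floorplanning and boundary routing Andreas describes.]

// Sketch of Andreas's idea: implement four pre-floorplanned partitions at
// once, one child process per quadrant. "implement_partition.sh" is a
// hypothetical placeholder; a real flow would pin each partition to its
// quadrant with an AREA_GROUP RANGE and stitch the boundary nets afterwards.
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::vector<std::thread> workers;
    for (int q = 0; q < 4; ++q) {
        workers.emplace_back([q] {
            // Each worker shells out, so the four runs use separate address
            // spaces (and could just as well run on separate machines).
            std::string cmd = "./implement_partition.sh quadrant" + std::to_string(q);
            std::system(cmd.c_str());
        });
    }
    for (std::thread& w : workers) w.join();
    // Not shown: the final inter-quadrant routing pass, which only needs to
    // consider the rows and columns nearest each boundary.
    return 0;
}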
Andreas Ehliar wrote:

> On the other hand, memory is quite cheap compared to an LX330, I guess
> :)
How many LX330s make one Ferrari?

Tim
On Mon, 09 Oct 2006 08:55:58 -0700, Austin Lesea <austin@xilinx.com>
wrote:

>Rajeev,
>
>Xilinx takes any suggestion seriously.
>
>We, of all people, with the introduction of the Virtex-5 LX330, know
>that we need to somehow make everything work better, and faster.
>
>Note that due to the memory required, the LX330 can ONLY be compiled on
>a 64-bit Linux machine... there are just too many logic cells, and too
>much routing. 8 Gbytes is about what you need, and windoze can't handle
>it (at all).
Austin, you are mistaken. There has been a 64-bit version of Windows for more than a year now, and a new, updated version is going to be released before the end of the year (Vista is at the RC2 stage now). I'm looking forward to a 64-bit version of ISE on Vista 64, which I am running now.