CPU benchmark for Xilinx PAR

Started by Paul Gentieu September 13, 2005
Here's a benchmark for PAR (high effort level) running on two different CPUs. The design utilized about 40% of an XC2V4000-5 and had some difficult-to-meet timing constraints. PAR's peak memory usage was ~500 MB.

Intel Pentium D 830 (3.0 GHz), 2 GB RAM: Total CPU time to PAR completion: 2 hours 32 mins

AMD Athlon 64 4000+ (2.4 GHz), 2 GB RAM: Total CPU time to PAR completion: 1 hour 2 mins

I was blown away by the result. I was expecting a modest speed increase with the AMD- maybe 1.3x, if you go by the model number- but certainly not 2.5x. Based on this benchmark, the AMD CPU should actually be called a 7500+. :)
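
The ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the "+" model number is read as an equivalent Pentium clock in MHz, which the post implies but does not state, and the variable names are made up for illustration.

```python
# CPU times reported in the post, converted to minutes.
pentium_d_min = 2 * 60 + 32   # Pentium D 830: 2 hours 32 mins
athlon64_min = 1 * 60 + 2     # Athlon 64 4000+: 1 hour 2 mins

# Measured speedup of the Athlon 64 over the Pentium D on this PAR run.
speedup = pentium_d_min / athlon64_min

# Assumption: the model number is an equivalent Pentium clock in MHz, so the
# speedup "promised" by the rating is 4000 / 3000 for a 3.0 GHz Pentium.
expected_from_rating = 4000 / 3000

# Rating implied by this one benchmark under the same assumption.
implied_rating = 3000 * speedup

print(f"measured speedup:         {speedup:.2f}x")
print(f"suggested by model number: {expected_from_rating:.2f}x")
print(f"implied rating:            {implied_rating:.0f}+")
```

Under that reading the measured ratio is about 2.45x, so the implied rating comes out a little below the rounded "7500+" figure.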

The Pentium is a dual core and the AMD is a single, but the Xilinx software utilizes only one core so this is a fair comparison of raw processor speed.

The Pentium probably gets killed by its deep pipelines. I'd guess that PAR, like most real-world apps, consists mainly of spaghetti code rather than regular loops processing masses of similar data. So the Pentium spends a lot of its time flushing pipelines because of mispredicted branches and such. It probably suffers from its higher memory access latency as well.

It sure would be nice if Xilinx could make their software multithreaded... then an Athlon X2 4800+ would really scream. As it is, I'd guess that an Athlon FX-57 (2.8 GHz) will give the fastest PAR performance currently possible.

-Paul
Generally we use Athlon64-based machines and are very impressed. We have not 
done a comparison recently, but what we found in previous benchmarking is 
that some parts of the process run better on different processors. So 
if you have a really big design you may want to split the work between 2 
machines, one Intel-based and the other AMD. I'm sure that some of you 
"geek" script writers could figure out a script to automate this.

It would be interesting to try a single core Pentium Extreme against the 
FX-57.

John Adair
Enterpoint Ltd. - Home of Broaddown2. The Ultimate Spartan3 Development 
Board.
http://www.enterpoint.co.uk

Very interesting

I really doubt it's the branch behaviour, even though the Athlon series
has always been good on office-type twisty apps. For branchy code
segments that fit in the I-cache, these days the branches almost come
for free and guess right more often than not.

I'd hazard a guess it has more to do with the data set being very large
and missing the L1, L2 and TLBs way too often ("poor locality of
reference"). Even 1% misses, maybe less, may be enough to wreak havoc.

It's not difficult to create a simple data structure that holds millions
of items in a hash table and see even an Athlon xp2400 give up 300ns
avg accesses to each entry if all accesses appear random, rather than
the naive 1ns its L1 cache can actually do.

You can plot a graph of open random address width from 6 bits to 24 bits
and watch execution time go from 1ns to 4ns and then roughly step through
30ns, 100ns, 300ns for x[i] when i is coming from any old random number
generator and masked by the width field. Measured on an xp2400.

If this simple test were run on various cpus, we could see how the
caching really works for graduating locality disaster cases and choose
accordingly.
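
The test described above might look roughly like the following. This is a minimal sketch, not the original program: the function names and the sample count are made up, and the interpreter adds a large constant cost to every access. Expect the absolute numbers to sit well above the figures quoted, with only the upward trend at larger widths being meaningful. A C version of the same loop would get much closer to raw memory latency.

```python
import random
import time
from array import array


def masked_indices(width_bits, count, seed=1):
    """Return `count` pseudo-random indices masked down to `width_bits` bits."""
    rng = random.Random(seed)
    mask = (1 << width_bits) - 1
    return [rng.getrandbits(32) & mask for _ in range(count)]


def time_random_reads(width_bits, count=200_000):
    """Time x[i] reads over a table of 2**width_bits entries.

    Returns (average nanoseconds per read, checksum of the values read).
    """
    table = array("i", range(1 << width_bits))  # table[i] == i
    indices = masked_indices(width_bits, count)
    total = 0
    start = time.perf_counter()
    for i in indices:
        total += table[i]
    elapsed = time.perf_counter() - start
    return elapsed / count * 1e9, total


if __name__ == "__main__":
    # Sweep the open address width from 6 to 24 bits, as in the post.
    for width in range(6, 25, 2):
        ns_per_read, _ = time_random_reads(width)
        print(f"{width:2d} bits  {ns_per_read:8.1f} ns/access")
```

Running the sweep on several machines and comparing where the steps occur would give the per-CPU comparison suggested above.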

Now EDA software doesn't deliberately do this, but it might get some of
the same effect unintentionally, simply by having to walk immense graphs
and trees. Think about it: draw a graph with millions of nodes and try to
label it in such a way that it can be traversed with mostly low address
bit changes (high locality) when the nodes in the graph are allocated
in completely random fashion. Then think how many operations actually
get performed on each linked-list traversal. A lot of the time it might
be just passing through looking for something, the worst possible
situation: all fetch, no work.
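
To make the traversal point concrete, here is a small sketch that walks the same linked chain twice, once with nodes laid out in traversal order and once with the layout shuffled. The names `build_chain` and `walk` and the node count are hypothetical, and the interpreter overhead will mask much of the difference that compiled code would show. Treat it as an illustration of the access pattern rather than a measurement of PAR.

```python
import random
import time
from array import array


def build_chain(n, shuffled, seed=1):
    """Return a next-pointer table forming one cycle through all n nodes.

    With shuffled=False the successor of node k is k+1 (high locality);
    with shuffled=True the visiting order is a random permutation.
    """
    order = list(range(n))
    if shuffled:
        random.Random(seed).shuffle(order)
    nxt = array("i", [0]) * n
    for pos in range(n):
        nxt[order[pos]] = order[(pos + 1) % n]
    return nxt


def walk(nxt, start=0):
    """Follow next pointers from `start` until returning to it; count hops."""
    node, hops = nxt[start], 1
    while node != start:
        node = nxt[node]  # all fetch, no work
        hops += 1
    return hops


if __name__ == "__main__":
    n = 1 << 21  # increase to exceed the caches on a larger machine
    for shuffled in (False, True):
        chain = build_chain(n, shuffled)
        t0 = time.perf_counter()
        hops = walk(chain)
        dt = time.perf_counter() - t0
        label = "shuffled" if shuffled else "in order"
        print(f"{label:9s} {hops} hops, {dt / hops * 1e9:.1f} ns/hop")
```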

I don't imagine there is much EDA code that looks like beautiful DSP
media codec stuff with super straight line high locality SSE tuned
code.

I could be all wrong, but I think it's the Memory Wall effect and the
Opteron maybe does a better job of recovering. That also means a cpu
that concentrates on that aspect doesn't even need a clock advantage as
long as it tolerates poor locality better.

I wonder if it's possible to get stats from the cpu performance hardware
that show what the cpu is really doing in memory. That's a bit out of my
league.

I wonder if the EDA guys just crank out code or do they ever measure
algorithms on different x86 hardware at the cache level, curious?

I also wonder how much FPU is actually used and how so?

On a threaded cpu designed to work with threaded memory where there is
little memory wall (latency tolerance all around), it doesn't take much
hardware to design a processor element in FPGA that can match Athlon
xp300, and 10 or so ganged together can then match xp3000 but you get
40 odd threads to fill instead of waiting on cache misses. Me, I'd
rather fill the threads (occam style) than wait, but most are not of
that opinion (yet).

Now if EDA ever becomes highly concurrent, (some have done this in VLSI
EDA from simulation to P/R) it does make possible some real speed ups
when real threading becomes pervasive in cpus (not this 2,4 thread
nonsense).

johnjakson at usa dot ...
transputer2 at yahoo dot ...

That's consistent with what I've seen. Note that the 4000+ has a 1M cache, which is critical for the performance of EDA codes. For NCVerilog I've found that when recordvars is off there is a 2 to 1 difference between an A64 with a 1M cache and one with a 1/2M cache. I now have a 4400+ in addition to the 3400+ and the 3800+ shown on this page:

http://www.polybus.com/linux_hardware/index.htm

I haven't updated my benchmark page with the 4400+ results, but they are consistent with the other results. The 4400+ is about 10% faster than the 3400+ on single-threaded jobs like NC or Xilinx place and route, which is exactly what you would expect given that each core in the 4400+ runs at the same clock speed and has the same cache size (1M) as the 3400+, but it has dual memory channels vs. a single channel on the 3400+.
Paul,

You are not the first to be amazed by this result.
I can only add that I was not able to persuade my management to give me Dual 
AMD 64 due to some unfixable bug (in the management), so I have only P4 :(

I am sure that Xilinx software is always being developed && improved to 
match any future stuff
Vladislav

> It sure would be nice if Xilinx could made their software multithreaded... then an Athlon X2 4800+ would really scream. As it is, I'd guess that an Athlon FX-57 (2.8 GHz) will give the fastest PAR performance currently possible. > > -Paul
PAR is multithreaded, use the -m switch.
> PAR is multithreaded, use the -m switch.
The -m does not work on Windows, according to the documentation. This is silly because they should be using cross-platform code anyway. A decent Windows pthread library utilizing termination drivers is not that expensive. I fully agree that they should be using SSE, SSE2, SSE3, 3dNow, etc., along with utilizing Intel and AMD's math/DSP libraries. Even if they have to ship different EXEs for each processor it would totally be worth it.
"B. Joshua Rosen" <bjrosen@PleaseDontSpamMEpolybus.com> wrote in message
news:pan.2005.09.14.14.28.25.727255@PleaseDontSpamMEpolybus.com...
> > > It sure would be nice if Xilinx could make their software
> > > multithreaded... then an Athlon X2 4800+ would really scream. As it is,
> > > I'd guess that an Athlon FX-57 (2.8 GHz) will give the fastest PAR
> > > performance currently possible.
> > >
> > > -Paul
> >
> > PAR is multithreaded, use the -m switch.

When I used the -m switch a while back on our unix system, I was able to specify a node list for different hosts to run the multipass place & route on more than one machine, but I couldn't utilize multiple cores in one host. I also can't use more than one host (or core) for one long place & route job; the -m is specifically for multipass place & route (which, by the way, doesn't have the option to use multiple mapper seeds!).

John_H wrote:

> > > PAR is multithreaded, use the -m switch.
>
> When I used the -m switch a while back on our unix system, I was able to
> specify a node list for different hosts to run the multipass place & route
> on more than one machine but I couldn't utilize multiple cores in one host.
> I also can't use more than one host (or core) for one long place & route
> job; the -m is specifically for multipass place & route (which, by the way,
> doesn't have the option to use multiple mapper seeds!).

This PAR feature is called the Turns Engine and it was never designed to support multiple jobs on a single machine. You can get around this by using a hostname alias in the node list file, or by tricking PAR by using variations of mixed case in the node list file. For example, using a four-processor machine named "speedy", the following node list file would allow four concurrent jobs to run:

speedy
Speedy
SPeedy
SPEedy

Xilinx Answer Record 10511 covers this.

Regards,
Bret
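
For machines with more processors, the mixed-case entries could be generated rather than typed by hand. The sketch below is an illustration only: `case_variants` and `node_list_text` are hypothetical helper names, and it assumes the node list file holds one host name per line, which the thread does not confirm. The resulting file would be the one handed to PAR's -m switch as described above.

```python
def case_variants(hostname, count):
    """Return `count` spellings of `hostname` that differ only in letter case.

    Variant k uppercases the first k letters, so "speedy" yields
    speedy, Speedy, SPeedy, SPEedy, ...
    """
    base = hostname.lower()
    letter_positions = [i for i, ch in enumerate(base) if ch.isalpha()]
    if count > len(letter_positions) + 1:
        raise ValueError("hostname has too few letters for that many variants")
    variants = []
    for n_upper in range(count):
        chars = list(base)
        for pos in letter_positions[:n_upper]:
            chars[pos] = chars[pos].upper()
        variants.append("".join(chars))
    return variants


def node_list_text(hostname, jobs):
    """Build node list file contents (assumed format: one host per line)."""
    return "\n".join(case_variants(hostname, jobs)) + "\n"


if __name__ == "__main__":
    # Reproduces the four entries from the example for a machine named "speedy".
    print(node_list_text("speedy", 4), end="")
```

This simple scheme tops out at one more variant than the host name has letters, which is enough for the four-processor case discussed here.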
> That's consistent with what I've seen. Note the 4000+ has a 1M cache which > is critical for the performance of EDA codes. For NCVerilog I've found > that when recordvars is off there is a 2 to 1 difference between an A64 > with a 1M cache vs one with a 1/2M cache. I now have a 4400+ in addition > to the 3400+ and the 3800+ shown on this page, > > http://www.polybus.com/linux_hardware/index.htm
Interesting! Your experience seems consistent with relatively small RTL (behavioral) designs. For SOC/ASIC designs of any reasonable size, I've found the difference between 0.5MB and 1MB cache to be non-existent (because the working data set already exceeds the larger cache).

I think one problem is the difficulty in producing publishable benchmarks. In the case of NC-Verilog and even Verilog-XL, I've found benchmarks almost useless. It's easy to find a 'test case' which runs 30-40% faster on a puny Pentium3/S 1.26GHz (512K L2 cache) than on an Ultrasparc III 750MHz (Linux 32-bit vs Solaris 32-bit). And likewise, it's just as easy to find a second RTL test case where the USIII literally crushes the Pentium3 (2X as fast). For SDF backannotated gate-level simulations, the results even out, with the US3 marginally faster than the Pentium3. And the US3 has an 8MB CPU cache (don't remember whether it's L2 or L3), so it looks like the pace of design-database 'bloat' already outpaces CPU cache-size improvements.

In the case of Design-Compiler and Primetime, there seems to be less variation in runtimes (for a given design compared on 2 different platforms). For almost all of the customer Verilog RTL designs I've crunched through DC, Primetime, and Tetramax, the wimpy Pentium3 1.26GHz outperforms the Ultrasparc III 750MHz. I'd suspect Xilinx's PAR shares a performance profile similar to Design Compiler.

Incidentally, we've found the Athlon64 gets a 20-30% performance boost from 64-bit linux versions of EDA tools (vs 32-bit linux). This was our conclusion after re-running quite a few Verilog simulation and synthesis jobs. It's curious to see the Intel EM64T CPUs take a small performance hit in the same 64-bit linux apps! Aside from the small increase in RAM footprint, the 64-bit EDA tools on Athlon64 always come out ahead.

> I haven't updated my benchmark page with the 4400+ results but they
> are consistent with the other results. The 4400+ is about 10% faster than
> the 3400+ on single threaded jobs like NC or Xilinx place and route
> which is exactly what you would expect given that each core in the 4400+
> runs at the same clock speed and has the same cache size (1M) as the
> 3400+ but it has dual memory channels vs a single channel on the 3400+.

Looks like I'll need to rethink my upgrade plans. Originally I planned on 'cheapskating' on a system upgrade (AMD dual-core X2 3800+, 2.0GHz, 512K), but it seems FPGA tools benefit quite a bit from the extra 512K cache.