FPGARelated.com
Forums

Place-and-Route : Intel vs AMD

Started by Louis January 10, 2008
Has anybody heard of a recent benchmark comparing Intel and AMD high-end
processors on their Place-and-Route (PAR) performance? All I can find is
a 2005 post here stating that AMD exceeded Intel, probably due to
Intel's long pipeline, which is not suited to non-homogeneous processing
such as PAR:

http://www.fpga-faq.com/archives/89325.html#89336

With Intel apparently taking the lead in general-purpose processing with
its 45nm technology, is that statement still true? I'm basically looking
for the best workstation honest money can buy to run Xilinx's PAR tool.
Any suggestions? Thanks.
Louis wrote:
> Anybody heard of a recent benchmark comparing both Intel and AMD high-
> end processors regarding their Place-and-Route (PAR) performance?
(snip)
I think you'll find that Intel's Core 2 generation does a lot better than the previous one, because it has better memory latency and shorter pipelines than the P4, which just plain sucked. I don't have a recent benchmark, but I can tell you what I've seen in the past (early 2006 timeframe):

- multicore doesn't matter (unless you try to do other things while running.) Most current FPGA tools are still single-threaded.
- cache size matters more than anything. Going from a 512K to a 1024K cache cut the synthesis time by two-thirds. Intel probably has an advantage here, because they have shared caches; remember to only count the cache available to a single core.
- memory size and memory latency matter too. Get lots of fast RAM.
- the OS will manage memory better if it's a 64-bit OS. Running on 64-bit Linux seemed to be about 20-25% faster than 32-bit WinXP.

-hpa
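Taking hpa's figures at face value, the arithmetic behind them works out as follows (a sketch only; the 60-minute baseline is made up for illustration, and the 20-25% figure is read as a throughput gain):

```python
# Rough arithmetic behind the figures quoted above (a sketch, not a benchmark).

# "Going from a 512K to a 1024K cache cut the synthesis time by two-thirds":
# the new run takes 1/3 of the old time, i.e. roughly a 3x speedup.
old_time = 1.0
new_time = old_time * (1 - 2 / 3)
cache_speedup = old_time / new_time
print(f"speedup from the larger cache: {cache_speedup:.1f}x")

# "64-bit Linux seemed to be about 20-25% faster than 32-bit WinXP":
# taking the midpoint (22.5% faster) against a hypothetical 60-minute
# 32-bit build gives:
win32_minutes = 60.0  # assumed baseline, not from the thread
linux64_minutes = win32_minutes / 1.225
print(f"estimated 64-bit build time: {linux64_minutes:.1f} min")
```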
"Louis" <louis.dupont@gmail.com> wrote in message 
news:c04a5301-f9fc-4031-ade1-70f8050785ee@q39g2000hsf.googlegroups.com...
> Anybody heard of a recent benchmark comparing both Intel and AMD high-
> end processors regarding their Place-and-Route (PAR) performance?
(snip)
I can only give one example:

Altera Quartus II 7.2 SP1
EP2C50 design, about "70% full"

Last year's rig: AMD Athlon 64 X2 4800+ system, total compile time 13 minutes
This year's rig: Intel Core 2 Quad Q6600 system, total compile time 10 minutes

Both systems have 4GB DDR2 RAM at max speeds (CPU max speed for AMD, P35 northbridge max speed for Intel). Both systems use WD Raptor drives and the /3GB switch in boot.ini.

The difference is almost entirely in the placement. Quartus has a multi-processor option, and reports an average of 1.5 processors used out of a maximum of 2 for the AMD, and 1.7 out of a maximum of 4 for the Intel.

I wonder how the AMD Phenom quad-core doo-dah would perform? I am assuming it accesses main memory via a dedicated 128-bit port like the dual-core one. I think the Intel goes via the northbridge and uses "interleaved dual channel" (meaning what, I don't know). Sounds like a better channel to main memory.
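For reference, the two compile times quoted above work out to the following speedup (simple arithmetic, nothing vendor-specific):

```python
# Compile-time comparison from the two rigs quoted above.
amd_minutes = 13.0    # Athlon 64 X2 4800+
intel_minutes = 10.0  # Core 2 Quad Q6600

speedup = amd_minutes / intel_minutes          # 1.3x
reduction = 1.0 - intel_minutes / amd_minutes  # fraction of wall time saved
print(f"speedup: {speedup:.2f}x, time saved: {reduction:.0%}")
```

So the newer quad-core rig is about 1.3x faster, saving roughly 23% of the wall time on this design.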
On 2008-01-11, Gary Pace <abc@xyz.com> wrote:
> Altera Quartus 2 7.2SP1
> EP2C50 design, about "70% full"
>
> Last year's rig :
> AMD Athlon 64 X2 4800+ System, total compile time 13 minutes
>
> This year's rig :
> Intel Core Quad 6600+ system, total compile time 10 minutes
The Q6600 is multiplier-locked, but at the bottom of the available front-side bus speed range. As a result, it's easy to overclock, and people have gotten to 3GHz fairly easily. It'd be interesting to see how much effect that has on the build. If you want to go "legit" at those clock speeds there are QX6800 procs...

--
Ben Jackson AD7GD
<ben@ben.com>
http://www.ben.com/
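The 3GHz figure follows directly from the clocking arithmetic: on Core 2 parts, core frequency = FSB base clock x multiplier, and the Q6600's multiplier is locked at 9x on a 266 MHz base (quad-pumped to the advertised 1066 MT/s):

```python
# Core 2 era clocking: core frequency = FSB base clock x multiplier.
# The Q6600 ships with a locked 9x multiplier on a 266 MHz FSB base.
multiplier = 9
stock_fsb_mhz = 266.67
overclocked_fsb_mhz = 333.33  # the next common FSB step (1333 MT/s)

stock_ghz = multiplier * stock_fsb_mhz / 1000
oc_ghz = multiplier * overclocked_fsb_mhz / 1000
print(f"stock: {stock_ghz:.1f} GHz, FSB-overclocked: {oc_ghz:.1f} GHz")
```

Bumping the FSB one step, from 266 to 333 MHz, is exactly the "easy" 2.4 -> 3.0 GHz overclock people report.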
Gary Pace wrote:
> I wonder how the AMD Phenom quad core doo-dah would perform ? I am assuming
> it accesses main memory via a dedicated 128-bit port like the dual core one.
> I think the Intel goes via the northbridge, and uses "interleaved dual
> channel" (meaning what I don't know). Sounds like a better channel to main
> memory.
Anyone who uses a 128-bit path (dual channel) uses interleaving (running both in parallel) as long as you have the same amount of memory on each port. -hpa
"H. Peter Anvin" <hpa@zytor.com> writes:
> Anyone who uses a 128-bit path (dual channel) uses interleaving
> (running both in parallel) as long as you have the same amount of
> memory on each port.
With the possible exception of the Socket 1207 parts, for which documentation is not available, the AMD processors don't have dual memory channels, despite widespread claims. They have a single channel that can operate in either 64-bit or 128-bit width (plus optional ECC). Using 128-bit width has obvious benefits, but interleave is not one of them.
On Jan 10, 3:22 pm, "H. Peter Anvin" <h...@zytor.com> wrote:
> - multicore doesn't matter (unless you try to do other things while
>   running.) Most current FPGA tools are still single-threaded.
> - cache size matters more than anything. Going from a 512K to a 1024K
>   cache cut the synthesis time by two-thirds. Intel probably has
>   an advantage here, because they have shared caches; remember to only
>   count the cache available to a single core.
> - memory size and memory latency matters too. Get lots of fast RAM.
> - the OS will manage memory better if it's a 64 bit OS. Running on a
>   64-bit Linux seemed to run about 20-25% faster than 32-bit WinXP.
I mostly agree. I ran my own scaling experiments before settling on my current setup (using Windows Quartus II 7.X and my mix of designs). I don't have the numbers handy, but for me:

- the number of cores made very little difference,
- memory bandwidth, latency, and capacity made *no* measurable difference as long as I had enough (2 GiB+),
- L2 cache size was significant, and finally
- core frequency mattered most.

Basically, given a large enough L2 (4+ MiB), performance scaled linearly with clock frequency (Core 2 Duo). The only AMD part I had to compare with was pretty old (XP 3200+ / 2.0 GHz) and it didn't perform well (d'oh). I went for a conservatively over-clocked (~3.1 GHz) 4 MiB L2 Core 2 Duo.

P&R is one of the few problems left where we still don't have enough single-thread performance and where it is fully justifiable to spend more than half your budget on the CPU alone. (Music to Intel's ears. They just need to get the world hooked on FPGAs :-)

Tommy
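If performance really scales linearly with clock (given enough L2), the expected gain from a frequency bump is just the frequency ratio. A quick illustration of the model, using the clocks mentioned above and a made-up baseline build time:

```python
# Under the linear-with-clock model reported above, runtime ~ 1/frequency.
base_ghz = 2.4   # a stock Core 2 clock
oc_ghz = 3.1     # the overclock mentioned above
base_minutes = 12.0  # hypothetical baseline P&R time, for illustration only

predicted_minutes = base_minutes * base_ghz / oc_ghz
print(f"predicted time at {oc_ghz} GHz: {predicted_minutes:.1f} min "
      f"({base_ghz / oc_ghz:.0%} of the baseline time)")
```

That is, a ~29% clock bump should shave roughly a quarter off the build, which is in the same ballpark as the 13-vs-10-minute comparison elsewhere in the thread.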
Eric Smith wrote:
> With the possible exception of the Socket 1207 parts, for which
> documentation is not available, the AMD processors don't have dual
> memory channels, despite widespread claims. They have a single channel
> that can operate in either 64-bit or 128-bit width (plus optional ECC).
> Using 128-bit width has obvious benefits, but interleave is not one of
> them.
Well, fixed 128-bit width is pretty much the same thing as dual 64-bit with a fixed interleaving ratio. Specifically, interleaving at 8-byte boundaries. -hpa
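hpa's equivalence can be made concrete with an address-decoding sketch: with two 64-bit channels interleaved at 8-byte boundaries, consecutive 64-bit words simply alternate between channels, which is indistinguishable from one fixed 128-bit-wide path. (The decoding below is illustrative, not taken from any datasheet.)

```python
# Sketch of dual-channel address decoding with an 8-byte interleave:
# consecutive 64-bit words alternate between the two channels.
INTERLEAVE_BYTES = 8  # one 64-bit word per channel per beat

def channel_for(addr: int) -> int:
    """Return which of the two channels serves this physical address."""
    return (addr // INTERLEAVE_BYTES) % 2

# Sixteen consecutive 8-byte words alternate 0, 1, 0, 1, ...
pattern = [channel_for(w * 8) for w in range(16)]
print(pattern)  # [0, 1, 0, 1, ...]
```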
I wrote:
> With the possible exception of the Socket 1207 parts, for which
> documentation is not available, the AMD processors don't have dual
> memory channels, despite widespread claims. They have a single channel
> that can operate in either 64-bit or 128-bit width (plus optional ECC).
> Using 128-bit width has obvious benefits, but interleave is not one of
> them.
"H. Peter Anvin" wrote:
> Well, fixed 128-bit width is pretty much the same thing as dual 64-bit
> with a fixed interleaving ratio. Specifically, interleaving at 8-byte
> boundaries.
Historically, having two banks of interleaved memory meant that if bank 0 was busy reading or writing an even word address, bank 1 could start an access to ANY odd word address, not just n+1. It is my understanding that that was the reason for inventing interleave rather than simply making the memory word longer.

HPA suggested to me in private email that that sort of interleave didn't seem useful when the cache is transferring whole cache lines, to which I replied:
> It just means that you'd want the interleave size to be the cache line
> size. There's no technical reason that a CPU with two memory "channels"
> shouldn't do that. But as I mentioned previously, the AMD processors
> don't actually have two channels, just one wide channel.
>
> For a single core (single-threaded) doing that might not produce much
> benefit, but it certainly could be useful with multiple cores or
> threads.
>
> On a multi-socket Opteron system you effectively get that if you
> configure the memory controllers in the multiple chips for interleave
> rather than linear addressing, but you have the additional hypertransport
> latency for accessing memory attached to another socket's memory
> controller.
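Interleaving at cache-line granularity, as suggested above, changes only the boundary in the channel-select arithmetic: with 64-byte lines, whole lines alternate between channels, so each channel can serve a complete line fill for a different core. (The 64-byte line size is an assumption for illustration.)

```python
# Channel-select sketch with the interleave widened to one cache line.
LINE_BYTES = 64  # assumed cache line size

def channel_for_line(addr: int) -> int:
    """With line-sized interleave, whole lines alternate between channels."""
    return (addr // LINE_BYTES) % 2

# Every byte of one line maps to the same channel; adjacent lines alternate,
# so two cores filling different lines can use both channels at once.
line0 = {channel_for_line(a) for a in range(0, 64)}
line1 = {channel_for_line(a) for a in range(64, 128)}
print(line0, line1)  # {0} {1}
```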
Eric Smith wrote:

(snip)

> Historically, having two banks of interleaved memory meant that if bank 0
> was busy reading or writing an even word address, bank 1 could start an
> access to ANY odd word address, not just n+1. It is my understanding
> that that was the reason for inventing interleave rather than simply
> making the memory word longer.
I first learned about interleaving reading about the IBM 360/91, which has 16-way interleaved 750ns memory and a 60ns processor cycle time. I believe it is 64 bits wide. The design goal was one instruction per clock cycle, which tended to require one 64-bit doubleword per cycle (for 64-bit floating point operations). That was before cache (which first appeared on the 360/85).

-- glen
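The ratio glen describes is easy to check: with 16 banks of 750ns memory cycling out of phase, a new access can begin every 750/16 ≈ 47ns, just under the 60ns processor cycle (all figures as quoted above):

```python
# Effective access rate of 16-way interleaved memory, using the IBM 360/91
# figures quoted above: 750 ns bank cycle, 60 ns processor cycle.
bank_cycle_ns = 750.0
banks = 16
cpu_cycle_ns = 60.0

start_interval_ns = bank_cycle_ns / banks  # a new access can begin this often
print(f"one access can start every {start_interval_ns:.1f} ns "
      f"(CPU cycle: {cpu_cycle_ns} ns)")
assert start_interval_ns < cpu_cycle_ns  # can keep up with one word per cycle
```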