
Need to speed up Stratix compiles.

Started by Pete Fraser February 27, 2004
Paul Leventis (at home) wrote:
>> Seems to be a common misconception that 64bits just increases the amount
>> of addressable memory. More importantly for most applications is that twice
>> the data is moved or operated on per clock cycle.
>
> 64-bitness _is_ mostly about addressable memory -- it is rare that 64-bit
> integers help reduce run-time. Please see my previous postings on the topic
> and some of the replies to it:
>
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=Paul+Leventis+64-bit
I think the OP was referring to the wider datapaths. I don't know the cycle-level details of the AMD or Intel 64-bit parts, but an obvious and simple speed gain can come from a wider hardware fetch (even when running < 64-bit opcodes), followed by a simple check of whether the next opcode or data value is already in that block. This works in systems where the CPU must wait for slower downstream memories, and even the smaller single-chip microcontrollers are starting to do it: the Philips ARM microcontrollers, for example, use a 128-bit fetch. Clearly, random code or data will not be helped, but a large percentage of code is sequential.

I'm not sure whether the AMD/Intel offerings match the SIMD (single instruction, multiple data) features of other cores, but even without that, some hardware gains would be expected.

-jg
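To make that "wide fetch plus sequential-hit check" idea concrete, here is a toy Python model of it. The 128-bit block size, the cycle costs, and the instruction traces are illustrative assumptions only, not figures for any actual AMD, Intel or Philips part:

import random

# Toy model (assumed numbers, not real hardware): fetch 128-bit blocks
# from slow memory and serve further sequential 32-bit opcodes from the
# block already sitting in the fetch buffer.
BLOCK_BYTES = 16        # 128-bit fetch width (assumed)
OPCODE_BYTES = 4        # 32-bit opcodes
SLOW_FETCH_CYCLES = 4   # cost of going out to slow memory (assumed)
HIT_CYCLES = 1          # cost when the opcode is already buffered (assumed)

def fetch_cost(addresses):
    """Return total cycles for a sequence of opcode addresses."""
    cycles = 0
    buffered_block = None
    for addr in addresses:
        block = addr // BLOCK_BYTES
        if block == buffered_block:
            cycles += HIT_CYCLES          # next opcode is in the wide block
        else:
            cycles += SLOW_FETCH_CYCLES   # must fetch a new 128-bit block
            buffered_block = block
    return cycles

# Mostly sequential code benefits; randomly scattered fetches do not.
random.seed(0)
sequential = [i * OPCODE_BYTES for i in range(64)]
scattered = [random.randrange(0, 4096, OPCODE_BYTES) for _ in range(64)]

print("sequential:", fetch_cost(sequential), "cycles")
print("scattered: ", fetch_cost(scattered), "cycles")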
Hi Jim,

> I think the OP was referring to the wider datapaths.
> I don't know the cycle level details of the AMD or Intel 64 bit
> but an obvious and simple speed gain can come from a wider HW fetch
> (even running < 64 bit opcodes) and then a simple check if the next
> opcode / next data value is in that block.
Yes, wider memory interfaces and cache lines can help, but as you say, this is independent of op-code size. If I recall correctly, AMD and Intel processors already fetch 64-bit blocks, though this may have been increased. The latest motherboard chipsets for both families of processors use dual-channel DDR (128 bits wide), so I would not be surprised if they've increased the size of fetches.

As vendors introduce 64-bit capable processors (such as the Opteron), they often also enhance various aspects of the CPU architecture in ways that help both 32- and 64-bit code. And while the 64-bitness of x86-64 may not matter much for speed, the doubling of the register file etc. could result in faster performance.

It's every computer engineer's dream to be a processor architect, isn't it? :-)

Regards,

- Paul
On Tue, 2 Mar 2004 19:57:58 -0600, Kenneth Land wrote:

> Seems to be a common misconception that 64bits just increases the amount of
> addressable memory.
The only common misconception is that swapping in a 64-bit processor on a desktop PC will lead to a large performance increase. It doesn't (other than any gain from a higher clock speed, of course). Care to guess at the extra overhead of a 64-bit version of current OSs, by the way?
> More importantly for most applications is that twice
> the data is moved or operated on per clock cycle.
Data is only data if it's meaningful. The use of 64-bit arithmetic variables is comparatively rare in most applications. Certain scientific and CAD packages do make heavy use of 64-bit floats, but I doubt that's the case here (and high-end processors tend to use 80-bit data paths around the FPU anyway).

There's not a lot to be gained from accessing memory in 64-bit chunks if you're only interested in 32 of them (there is an effect on cache hits with vectors, but it's not measurably worthwhile in practice). There will be some effect on prefetch, but it depends on the state of the L1 and L2 caches and the instruction pipeline(s) themselves. Tests I've seen suggest an increase in memory bandwidth efficiency of only around 1-2% at best.

If you want a 64-bitter to really earn its corn, use it in something like a database server with 64GB of RAM and a multi-TB disk farm. Give the poor thing something *meaningful* to do with the extra 32 bits. You'd still need 64-bit software though.

-- Max
On Tue, 2 Mar 2004 20:05:37 -0600, Kenneth Land wrote:

> On the disk speed issue I have one data point. I upgraded my 1GHz PIII-M
> laptop drive from a slow 4200 RPM to the fastest 7200 RPM available (for
> laptops) and my Nios system build went from about 16 min. to about 15 min.
> Not worth the pain and expense of swapping the drive.
Not in a low-spec machine like that, no. The options in a laptop are limited, and there's no way to increase the disk controller bandwidth. But on a powerful workstation, installing a RAID with a high-bandwidth controller and drives such as U-320 SCSI can have a dramatic impact. As always, though, it depends on the application.
> On memory, I upgraded the memory in my 3.2 GHz P4 from 512MB to 1GB and there
> was no noticeable difference until I set the memory from 333MHz to 400MHz
> dual channel. Then my system build went from 5 min. to 4 min. - 20%.
That doesn't mean a lot. You only need to add more memory if you're running out of it ;o)

-- Max
Max <mtj2@btopenworld.com> writes:

> If you want a 64-bitter to really earn its corn, use it in something
> like a database server with 64GB of RAM and a multi-TB disk farm. Give
Or running synthesis, place & route, static timing analysis etc. on an ASIC design requiring 6GB RAM.

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
"Paul Leventis (at home)" wrote:
> Hi Jim,
>
> > I think the OP was referring to the wider datapaths.
> > I don't know the cycle level details of the AMD or Intel 64 bit
> > but an obvious and simple speed gain can come from a wider HW fetch
> > (even running < 64 bit opcodes) and then a simple check if the next
> > opcode / next data value is in that block.
>
> Yes, wider memory interfaces/cache data lines can help, but as you say, this
> is independent of op-code size. If I recall correctly, AMD and Intel
> processors already fetch 64-bit blocks, but this may have been increased.
> The latest m/b chipsets for both families of processors use dual-channel DDR
> (128-bits wide) and so I would not be surprised if they've increased the
> size of fetches.
>
> As vendors introduce 64-bit capable processors (such as Opteron), they often
> also enhance various aspects of the CPU architecture in ways that help both
> 32- and 64-bit code. And while the 64-bitness of x86-64 may not matter much
> for speed, the doubling of the register files etc. could result in faster
> performance.
>
> It's every computer engineer's dream to be a processor architect, isn't it?
> :-)
We can all speculate about the relative merits of processor enhancements, but these machines are very complex and the only real way to tell what helps is to try it. Since we are not all ancient Greeks philosophizing in our armchairs, it would be a good idea to pick a design and run it on a few different workstations, hopefully including an AMD64.

I have always been surprised that the FPGA vendors don't put some effort into evaluating platforms and releasing the results. I know this can be a bit of a can of worms, but every time I look at buying a new machine, the first question I research is how fast it will run the FPGA design software. Then I am often left speculating on my own, since I don't have much info to go on.

I seem to recall that there at least used to be some info available on how much memory was needed to optimize run time as a function of part size, but I haven't seen anything new on that in quite a while.

--
Rick "rickman" Collins
rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design
URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX
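In the absence of vendor numbers, a rough comparison is easy to run yourself: time the same batch compile a few times on each candidate machine and compare the medians. A minimal Python timing wrapper along those lines; the compile command is a placeholder, to be pointed at whatever script drives your own design through the tools:

import statistics
import subprocess
import time

# Placeholder command -- substitute the script that runs your own design.
COMMAND = ["./run_my_design_compile.sh"]
RUNS = 3

times = []
for i in range(RUNS):
    start = time.perf_counter()
    subprocess.run(COMMAND, check=True)   # run one full compile
    times.append(time.perf_counter() - start)
    print(f"run {i + 1}: {times[-1]:.1f} s")

print(f"median of {RUNS} runs: {statistics.median(times):.1f} s")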
"rickman" <spamgoeshere4@yahoo.com> wrote in message
news:404603D6.AA20F818@yahoo.com...

> We can all speculate about the relative merits of processor
> enhancements, but these machines are very complex and the only real way
> to tell what helps is to try it. Since we are not all ancient Greeks
> philosophizing in our armchairs, it would be a good idea to pick a
> design and to run it on a few different workstations, hopefully
> including an AMD64.
>
> I have always been surprised that the FPGA vendors don't put some effort
> into evaluating platforms and releasing the results.
I had assumed that had happened already. Silly me. Perhaps we'll just buy an AMD machine and see what it does, but I thought somebody might have tried that already. Anybody know how solid the Quartus II 4.0 Linux port is? I can't get an answer out of Altera.
On Wed, 03 Mar 2004 04:47:19 GMT, Paul Leventis (at home) wrote:

> Provided the peak memory consumption of Quartus for the compilation in
> question is less than the amount of physical memory in the system,
> increasing the amount of memory will not help compile time. For non-trivial
> designs, a Quartus compile will be most heavily influenced by CPU speed, and
> then by memory sub-system speed -- disk speed will have little influence.
I suspected that might be the case, but I wasn't quite sure. I'm more used to programming language tools that use library files extensively, where a fast disk system (or a big ramdisk) can give very worthwhile speed gains. Is there any possibility of making Quartus multi-threaded? That strikes me as the most likely way to get a dramatic performance increase, though I know it's not always easy to achieve with heuristic apps.
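Whether Quartus itself can be threaded is Altera's call, but one common way to get parallelism out of heuristic tools without touching the core algorithm is to run several independently seeded attempts at once and keep the best result. A minimal Python sketch of that idea only; the hill-climb and its cost function below are dummies standing in for a real placement metric, not anything Quartus exposes:

import random
from multiprocessing import Pool

def one_attempt(seed):
    """Run one independently seeded hill-climb and return (cost, seed)."""
    rng = random.Random(seed)
    state = [rng.random() for _ in range(100)]
    best = sum(x * x for x in state)       # dummy cost: sum of squares
    for _ in range(10_000):
        i = rng.randrange(len(state))
        old = state[i]
        state[i] = rng.random()            # propose a random move
        cost = sum(x * x for x in state)
        if cost < best:
            best = cost                    # accept the improving move
        else:
            state[i] = old                 # reject and restore
    return best, seed

if __name__ == "__main__":
    with Pool(processes=4) as pool:        # one worker per core (assumed 4)
        results = pool.map(one_attempt, range(8))   # 8 independent seeds
    best_cost, best_seed = min(results)
    print(f"best seed {best_seed} with cost {best_cost:.4f}")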
> CAD tools process a lot of data. I don't know if a Xeon (bigger cache) is
> much faster than a normal P4 (smaller cache), but I wouldn't be surprised if
> this were the case for the same reason that a Xeon processor is supposedly
> better for server applications -- bigger cache helps applications whose data
> set doesn't fit into the cache.
While the extra cache is important in itself, much of the Xeon's performance gain also comes from its greater degree of parallelism and deeper prefetch lookahead, which make better use of memory bandwidth throughout.

-- Max
Max <mtj2@btopenworld.com> writes:

> Is there any possibility of making Quartus multi-threaded? That
> strikes me as the most likely way to get a dramatic performance
> increase, though I know it's not always easy to achieve with heuristic
> apps.
I would like to see synthesis and place-and-route tools that I could run on a cluster of cheap PCs. I would be happy with less-than-linear speedups, e.g. using a 16-node cluster to get an 8x speedup.

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
On 03 Mar 2004 18:46:34 +0100, Petter Gustad wrote:

> I would like to see synthesis and place and route tools I could
> run on a cluster of cheap PCs. I would be happy with less than linear
> speedups, e.g. using a 16-node cluster to get an 8x speedup.
I doubt you'd get anywhere near that. Trying to implement those algorithms efficiently on the sort of loosely-coupled architecture you propose would be nigh-on impossible. It's not easy on a single SMP box, but it's doable. A quad Xeon (8 x CPU) box would cost less than four single decent-spec machines anyway.

-- Max
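A rough Amdahl's-law check puts numbers on that: to get an 8x speedup on 16 nodes, about 93% of the run would have to parallelise perfectly, with zero communication cost. A small Python illustration; the parallel fractions are assumed values for illustration, not measurements of any P&R tool:

# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction
# of the run that parallelises perfectly across n nodes.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Solving 8 = 1 / ((1 - p) + p/16) gives p = (1 - 1/8) / (1 - 1/16), about 0.933.
for p in (0.80, 0.90, 0.933, 0.99):
    print(f"p = {p:.3f}: {speedup(p, 16):.1f}x on 16 nodes")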