OK, the questions apply primarily to FPGA synthesis (Altera Quartus fitter for StratixII and HardCopyII), but I'm interested in feedback regarding all EDA tools in general.

Context: I'm suffering some long Quartus runtimes on their biggest StratixII and second-biggest HardCopyII device. Boss has given me permission to order a new desktop/workstation/server. Immediate goal is to speed up Quartus, but other long-term value considerations will be taken into account.

True or false?
--------------------
Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Place and Route (quartus_fit) is mostly double-precision floating-point?
Static Timing Analysis (TimeQuest) is mostly double-precision floating-point?
RTL simulation is mostly integer operations?
SDF / gate-level simulation is mostly double-precision floating-point?

AMD or Intel?
-------------------
Between AMD & Intel's latest multicore CPUs,
- Which offers the best integer performance?
- Which offers the best floating-point performance?
Specific models within the AMD/Intel family?

Assume cost is no object, and each uses its highest-performing memory interface, but disk access is (necessary evil) over a networked drive. (Small % of total runtime anyway.)

Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs. Windows? >2GB of RAM?
---------------------------------------------------------------------------------------------------------------------------------
Is Quartus (and the others) more efficient in any one particular environment? I prefer Linux, but the OS is now secondary to pure runtime performance (unless it is a major contributor). Can any of them make use of more than 2GB of RAM? More than 4GB? Useful limit on the number of processors/cores?

Any specific box recommendations?

Thanks a gig,
jj
Best CPU platform(s) for FPGA synthesis
Started by ●July 26, 2007
Reply by ●July 26, 2007
On Jul 26, 6:19 pm, jjohn...@cs.ucf.edu wrote:

> True or false?
> --------------------
> Logic synthesis (analyze/elaborate/map) is mostly integer operations?

Yes.

> Place and Route (quartus_fit) is mostly double-precision floating-point?

I don't know why they would use floating point if they don't have to.

> Static Timing Analysis (TimeQuest) is mostly double-precision floating-point?

I seriously doubt it. I don't see a need for floating point there when delays can use scaled integers.

> RTL simulation is mostly integer operations?

Yes.

> SDF / gate-level simulation is mostly double-precision floating-point?

No, or at least not in any implementation I am familiar with. All the delays are scaled up so that integers can be used for them.

In simulation (assuming something with state-of-the-art performance), the CPU operations themselves are not very important anyway. It is not compute-bound, it is memory-access-bound. What you need is big caches and fast access to memory for when the cache isn't big enough.

> Is Quartus (and the others) more efficient in any one particular
> environment? I prefer Linux, but the OS is now secondary to pure
> runtime performance (unless it is a major contributor). Can any of
> them make use of more than 2GB of RAM? More than 4GB?

64-bit Linux can make use of more than 4GB of RAM. But don't use 64-bit executables unless your design is too big for 32-bit tools, because they will run slower on the same machine.

> Useful limit on
> the number of processors/cores?

Most of these tools are not multi-threaded, so the only way you will get a speedup is if you have multiple jobs at the same time. Event-driven simulation in particular is not amenable to multi-threading, despite much wishful thinking for the last few decades.
Reply by ●July 27, 2007
> > Static Timing Analysis (TimeQuest) is mostly double-precision floating-point?
>
> I seriously doubt it. I don't see a need for floating point there
> when delays can use scaled integers.

Dynamic range?

Cheers,
Jon
Reply by ●July 27, 2007
I think that memory performance is the limiting factor for FPGA synthesis and P&R.

This machine had a single core AMD 64 processor which I recently replaced with a slightly faster dual core processor. I ran a fairly quick FPGA build through Quartus to get a time for a before and after comparison before I did the swap. The before and after times were exactly the same :-(

I think the amount and speed of memory is crucial, it's probably worth paying as much attention to that as to the processor.

Nial.
Reply by ●July 27, 2007
Nial Stewart wrote:

> I ran a fairly quick FPGA build through Quartus to get a time for a
> before and after comparison before I did the swap.

Did you change the setting "use up to x number of CPUs" (don't remember the exact name) somewhere in the project settings?

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply by ●July 27, 2007
On Jul 26, 6:19 pm, jjohn...@cs.ucf.edu wrote:

> AMD or Intel?
> -------------------
> Between AMD & Intel's latest multicore CPUs,
> - Which offers the best integer performance?
> - Which offers the best floating-point performance?
> Specific models within the AMD/Intel family?
>
> Assume cost is no object, and each uses its highest-performing memory
> interface, but disk access is (necessary evil) over a networked drive.
> (Small % of total runtime anyway.)
>
> Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
> Windows? >2GB of RAM?

If cost is no object, then go with the Intel quad-core running at 3 GHz: the QX6850. It has 8 MB of L2 cache in total (4 MB shared by each pair of cores), and cache size is, according to several reports in this forum, the single most important factor.

I would say go with 4GB of RAM, although if you're using the biggest chips, you might need more. Keep in mind that 32-bit Windows will only be able to use 3GB max of this 4GB, and each application will only be able to access 2GB max. So you might consider 64-bit Windows or 64-bit Linux if necessary.

Patrick
Reply by ●July 27, 2007
Jon Beniston <jon@beniston.com> writes:

>> > Static Timing Analysis (TimeQuest) is mostly double-precision floating-point?
>>
>> I seriously doubt it. I don't see a need for floating point there
>> when delays can use scaled integers.
>
> Dynamic range?

Not a likely problem. Even a 32bit int would be big enough for holding up to a ridiculous 4.3 seconds, assuming 1psec resolution.

As far as I know, everything in the simulate, synth, P&R, and STA chain can be performed with adequate resolution using integers. Crosstalk and inductive effects might require floating point help, but I would be surprised if even that could not be approximated well with fixed-point arithmetic.

Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
Reply by ●July 27, 2007
sharp@cadence.com writes:

> 64-bit Linux can make use of more than 4GB of RAM. But don't use 64-bit
> executables unless your design is too big for 32-bit tools,
> because they will run slower on the same machine.

Although that might be true for some specific cases, in general on Linux native 64-bit executables tend to run faster than 32-bit executables. But I haven't benchmarked 32-bit vs. 64-bit FPGA tools.
Reply by ●July 27, 2007
Thanks everyone, this is real interesting, but please don't stop posting if you have more insights to share!

FWIW, my runtimes in Quartus are dominated by P&R (quartus_fit); on Linux, they run about 20% faster on my 2005-era 64-bit Opteron than on my 2004-era 32-bit Xeon (both with a 32-bit build of Quartus).

Another test run of a double-precision DSP simulation (compiled C) ran substantially slower on the Opteron, which I thought was supposed to have better floating-point performance than Xeons of that era. Maybe it was just a case of the gcc -O5 optimization switches being totally tuned to Intel instead of AMD, or maybe my Quartus P&R step is primarily dominated by integer calculations.

I originally suspected P&R might have a lot of floating-point calculations (even prior to signal-integrity considerations) if they were doing any kind of physical synthesis (e.g., delay calculation based on distance and fanout); ditto for STA, because that's usually an integral part of the P&R loops. I also suspected that if floating-point operations (at least multiplies, add/subtract, and MACs) could be done in a single cycle, there would be no advantage to using integer arithmetic instead (especially if manual, or somewhat explicit integer scaling is required). On the other hand, in something like a router, you can get more exact location info wrt stuff like grid coordinates than you can with floating-point.

As far as dynamic range is concerned, I seem to recall that SystemC standardized on 64-bit time to run longer simulations, but SystemC is a different animal in that regard anyway. Nonetheless, I also seem to recall that its implementation of time was 64-bit integers (scaled), because the average FPU operations are really only linear over the 53-bit mantissa part. Assuming they want linear representation of time ticks, I can see the appeal of using 64-bit integers in simulation.
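The 53-bit mantissa point can be checked directly. The snippet below is only an illustrative sketch in Python (not taken from SystemC or any EDA tool); the helper name `survives_as_double` is made up for this example. It shows that an integer tick count stored in a double stops being exact just past 2**53, while an integer representation keeps counting.

```python
# A double has a 53-bit significand, so consecutive integer tick counts
# are only representable exactly up to 2**53.
LIMIT = 2 ** 53


def survives_as_double(ticks: int) -> bool:
    """Return True if the tick count is unchanged after a round trip through a double."""
    return int(float(ticks)) == ticks


# Hypothetical tick counts around the limit.
for ticks in (LIMIT - 1, LIMIT, LIMIT + 1):
    print(ticks, survives_as_double(ticks))

# With a 1 ps tick, 2**53 ticks is only about 9000 seconds of simulated time.
print(LIMIT * 1e-12, "seconds at 1 ps per tick")
```

The first two values survive the round trip and the third does not, which is the "only linear over the mantissa" effect: beyond that point a double can no longer distinguish adjacent ticks.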
As far as event-driven simulations are concerned, I totally understand how hard it is to make good use of multithreading or multiprocessing, because everything is so tightly coupled in that evaluate/update/reschedule loop. If you were working at a much higher level (behavioral/transaction), where the number of low-level events is lower and the computation behind "complex" events took up a much larger portion of the evaluate/update/reschedule loop, then multicore/multiprocessing solutions might be more effective for simulation. (Agree/disagree?)

It seems that as you get more coarse-grained with the simulation, that even distributed processing (multiple machines on a network) becomes more feasible. Obviously the scheduler has one "core" and has to reside in one CPU/memory space, but if it has less work to do, then it can handle less frequent communication with the event-processing CPUs in another space.

Back to Quartus in particular and Windows in general...

Quartus supports the new "number_of_cpus" or some similar variable, but only seems to use it in small sections of quartus_fit (I think Altera is just making their baby steps in this area). That appears to be related to the number of processors inside one box. If a single CPU is just hyperthreaded, the processor takes care of instruction distribution unrelated to a variable like number_of_cpus, right? And if there are two single-core processors in a box, obviously it will utilize "number_of_cpus=2" as expected. Does anyone know how that works with dual-core CPUs? i.e., if I have two quad-core CPUs in one box, will setting "number_of_cpus=7" make optimal use of 7 cores while leaving me one to work in a shell or window?

Does anyone know if Quartus makes better use of multiple processors in a partitioned bottom-up flow compared to a single top-down compile flow?

In 32-bit Windows, is that 3GB limit for everything running at one time? i.e., is 4GB a waste on a Windows machine?
Can it run multiple 2GB processes and go beyond 3 or 4GB? Or is 3GB an absolute O/S limit, and 2GB an absolute process limit in Windows?

In 32-bit Linux, can it run 4GB per process and as many simultaneous processes of that size as the virtual memory will support?

In going to 64-bit apps and O/S versions, should the tools run equally fast as long as the processor is truly 64-bit?

Thanks again for all the insights and interesting discussion.

jj
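To make the evaluate/update/reschedule coupling discussed in this post concrete, here is a toy event-driven kernel in Python. It is a sketch under simplifying assumptions, not how any commercial simulator is implemented: the names `simulate` and `inverter` are hypothetical, time is an integer tick count, and a single shared priority queue orders all events.

```python
import heapq
from itertools import count


def simulate(initial_events, processes, end_time):
    """Run a toy event-driven kernel.

    initial_events: iterable of (time, signal, value) tuples
    processes: dict mapping a signal name to callables sensitive to it;
               each callable takes (time, value) and returns an iterable
               of (delay, target_signal, new_value) tuples
    end_time: events scheduled after this tick are not processed
    """
    seq = count()  # tie-breaker so events at the same time pop in FIFO order
    queue = [(t, next(seq), s, v) for t, s, v in initial_events]
    heapq.heapify(queue)
    values = {}
    log = []
    while queue:
        time, _, signal, value = heapq.heappop(queue)
        if time > end_time:
            break
        if values.get(signal) == value:
            continue  # no value change, so nothing downstream is triggered
        values[signal] = value  # update
        log.append((time, signal, value))
        for proc in processes.get(signal, ()):  # evaluate
            for delay, target, new_value in proc(time, value):
                # reschedule: every new event goes back through the one queue
                heapq.heappush(queue, (time + delay, next(seq), target, new_value))
    return values, log


def inverter(time, value):
    """Hypothetical inverter from signal 'a' to signal 'y' with a 2-tick delay."""
    return [(2, "y", 1 - value)]


values, log = simulate([(0, "a", 0), (5, "a", 1)], {"a": [inverter]}, end_time=10)
print(values)
print(log)
```

Each iteration does very little work (one comparison, one store, a few pushes) and depends on the ordering of the single queue, which is one way to see why fine-grained gate-level events are hard to spread across cores. With heavier per-event computation, as in the behavioral/transaction case raised above, the work inside the `evaluate` step would dominate instead.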
Reply by ●July 27, 2007
On 27 Jul, 17:17, Kai Harrekilde-Petersen <k...@harrekilde.dk> wrote:

> Jon Beniston <j...@beniston.com> writes:
> >> > Static Timing Analysis (TimeQuest) is mostly double-precision floating-point?
> >>
> >> I seriously doubt it. I don't see a need for floating point there
> >> when delays can use scaled integers.
> >
> > Dynamic range?
>
> Not a likely problem. Even a 32bit int would be big enough for holding
> up to a ridiculous 4.3 seconds, assuming 1psec resolution.

I think you're a factor of 1000 out. For an ASIC STA, gate delays must be specified at a much finer resolution than 1ps.

Cheers,
Jon
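The "factor of 1000" remark is easy to verify with a few lines of arithmetic. The sketch below is illustrative Python, not part of any tool; the function name `max_time_seconds` is made up for this example, and it assumes an unsigned integer time value with a fixed tick size.

```python
def max_time_seconds(bits: int, resolution_s: float) -> float:
    """Largest time an unsigned integer of the given width can hold at the given tick size."""
    return (2 ** bits - 1) * resolution_s


ONE_PS = 1e-12
ONE_FS = 1e-15

# 32-bit at 1 ps: roughly 4.3 milliseconds, not 4.3 seconds.
print(f"32-bit @ 1 ps: {max_time_seconds(32, ONE_PS):.3g} s")
# 32-bit at 1 fs: roughly 4.3 microseconds.
print(f"32-bit @ 1 fs: {max_time_seconds(32, ONE_FS):.3g} s")
# 64-bit at 1 fs: on the order of hours.
print(f"64-bit @ 1 fs: {max_time_seconds(64, ONE_FS):.3g} s")
```

So a 32-bit scaled integer at 1 ps covers only a few milliseconds, and less still at the finer resolution mentioned for ASIC STA, whereas a 64-bit scaled integer leaves a comfortable range even at femtosecond ticks.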




