Best CPU platform(s) for FPGA synthesis

Started by Unknown July 26, 2007
OK, the questions apply primarily to FPGA synthesis (Altera Quartus
fitter for StratixII and HardCopyII), but I'm interested in feedback
regarding all EDA tools in general.


Context: I'm suffering some long Quartus runtimes on their biggest
StratixII and second-biggest HardCopyII device. Boss has given me
permission to order a new desktop/workstation/server. Immediate goal
is to speed up Quartus, but other long-term value considerations will
be taken into account.


True or false?
--------------------
Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Place and Route (quartus_fit) is mostly double-precision
floating-point?
Static Timing Analysis (TimeQuest) is mostly double-precision floating-
point?
RTL simulation is mostly integer operations?
SDF / gate-level simulation is mostly double-precision floating-point?


AMD or Intel?
-------------------
Between AMD & Intel's latest multicore CPUs,
- Which offers the best integer performance?
- Which offers the best floating-point performance?
Specific models within the AMD/Intel family?
Assume cost is no object, and each uses its highest-performing memory
interface, but disk access is (necessary evil) over a networked drive.
(Small % of total runtime anyway.)


Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
Windows? >2GB of RAM?
---------------------------------------------------------------------------------------------------------------------------------
Is Quartus (and the others) more efficient in any one particular
environment? I prefer Linux, but the OS is now secondary to pure
runtime performance (unless it is a major contributor). Can any of
them make use of more than 2GB of RAM? More than 4GB? Useful limit on
the number of processors/cores?


Any specific box recommendations?



Thanks a gig,

jj

On Jul 26, 6:19 pm, jjohn...@cs.ucf.edu wrote:
> True or false?
> --------------------
> Logic synthesis (analyze/elaborate/map) is mostly integer operations?
Yes.
> Place and Route (quartus_fit) is mostly double-precision
> floating-point?
I don't know why they would use floating point if they don't have to.
> Static Timing Analysis (TimeQuest) is mostly double-precision
> floating-point?
I seriously doubt it. I don't see a need for floating point there when delays can use scaled integers.
> RTL simulation is mostly integer operations?
Yes.
> SDF / gate-level simulation is mostly double-precision floating-point?
No, or at least not in any implementation I am familiar with. All the delays are scaled up so that integers can be used for them. In simulation (assuming something with state-of-the-art performance), the CPU operations themselves are not very important anyway. It is not compute-bound; it is memory-access-bound. What you need is big caches and fast access to memory for when the cache isn't big enough.
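As a rough illustration of the scaled-integer idea described above, the sketch below converts decimal nanosecond delays to integer ticks and sums them exactly. It is not how any particular simulator is implemented; the 1 ps tick size and the names `TICKS_PER_NS`, `to_ticks`, and `path_delay_ticks` are assumptions made for the example.

```python
from fractions import Fraction
from typing import Iterable

TICKS_PER_NS = 1000  # assumed resolution: 1 tick = 1 ps


def to_ticks(delay_ns: str) -> int:
    """Convert a decimal nanosecond delay (given as text) to integer ticks."""
    ticks = Fraction(delay_ns) * TICKS_PER_NS
    if ticks.denominator != 1:
        raise ValueError(f"{delay_ns} ns is finer than the tick resolution")
    return int(ticks)


def path_delay_ticks(delays_ns: Iterable[str]) -> int:
    """Sum the delays along a path exactly, using integer ticks."""
    return sum(to_ticks(d) for d in delays_ns)


if __name__ == "__main__":
    path = ["0.1", "0.2", "0.3"]
    # Integer ticks add up exactly.
    print(path_delay_ticks(path))
    # The same sum in binary floating point picks up rounding error.
    print(sum(float(d) for d in path) == 0.6)
```

The integer sum is exact by construction, whereas the floating-point sum of the same three delays does not compare equal to 0.6.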
> Is Quartus (and the others) more efficient in any one particular
> environment? I prefer Linux, but the OS is now secondary to pure
> runtime performance (unless it is a major contributor). Can any of
> them make use of more than 2GB of RAM? More than 4GB?
64-bit Linux can make use of more than 4GB of RAM. But don't use 64-bit executables unless your design is too big for 32-bit tools, because they will run slower on the same machine.
> Useful limit on
> the number of processors/cores?
Most of these tools are not multi-threaded, so the only way you will get a speedup is if you have multiple jobs at the same time. Event-driven simulation in particular is not amenable to multi-threading, despite much wishful thinking for the last few decades.
> > Static Timing Analysis (TimeQuest) is mostly double-precision
> > floating-point?
>
> I seriously doubt it. I don't see a need for floating point there
> when delays can use scaled integers.

Dynamic range?

Cheers,
Jon

I think that memory performance is the limiting factor for
FPGA synthesis and P&R.

This machine had a single core AMD 64 processor which I recently replaced with
a slightly faster dual core processor.

I ran a fairly quick FPGA build through Quartus to get a time for a
before and after comparison before I did the swap.

The before and after times were exactly the same :-(

I think the amount and speed of memory are crucial; it's probably
worth paying as much attention to that as to the processor.


Nial. 


Nial Stewart wrote:

> I ran a fairly quick FPGA build through Quartus to get a time for a
> before and after comparison before I did the swap.

Did you change the setting "use up to x number of CPUs" (don't remember the exact name) somewhere in the project settings?

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

On Jul 26, 6:19 pm, jjohn...@cs.ucf.edu wrote:
> AMD or Intel?
> -------------------
> Between AMD & Intel's latest multicore CPUs,
> - Which offers the best integer performance?
> - Which offers the best floating-point performance?
> Specific models within the AMD/Intel family?
>
> Assume cost is no object, and each uses its highest-performing memory
> interface, but disk access is (necessary evil) over a networked drive.
> (Small % of total runtime anyway.)
>
> Multi-core, multi-processor, or both? 32-bit or 64-bit? Linux vs.
> Windows? >2GB of RAM?

If cost is no object, then go with the Intel quad-core running at 3 GHz: QX6850. Each core has 2 MB of L2 cache (8MB total), which is, according to several reports in this forum, the single most important factor. I would say go with 4GB of RAM, although if you're using the biggest chips, you might need more. Keep in mind that Windows 32-bit will only be able to use 3GB max of this 4GB, and each application will only be able to access 2GB max. So you might consider Windows 64 bits or Linux 64 bits if necessary.

Patrick

Jon Beniston <jon@beniston.com> writes:

>> > Static Timing Analysis (TimeQuest) is mostly double-precision
>> > floating-point?
>>
>> I seriously doubt it. I don't see a need for floating point there
>> when delays can use scaled integers.
>
> Dynamic range?

Not a likely problem. Even a 32bit int would be big enough for holding up to a ridiculous 4.3 seconds, assuming 1psec resolution. As far as I know, everything in the simulate, synth, P&R, and STA chain can be performed with adequate resolution using integers. Crosstalk and inductive effects might require floating point help, but I would be surprised if even that can't be approximated well with fixed-point arithmetic.

Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

sharp@cadence.com writes:
> 64-bit Linux can make use of more than 4GB of RAM. But don't use
> 64-bit executables unless your design is too big for 32-bit tools,
> because they will run slower on the same machine.
Although that might be true for some specific cases, in general on Linux native 64-bit executables tend to run faster than 32-bit executables. But I haven't benchmarked 32-bit vs. 64-bit FPGA tools.
Thanks everyone, this is real interesting, but please don't stop
posting if you have more insights to share!

FWIW, my runtimes in Quartus are dominated by P&R (quartus_fit); on
Linux, they run about 20% faster on my 2005-era 64-bit Opteron than on
my 2004-era 32-bit Xeon (both with a 32-bit build of Quartus). Another
test run of a double-precision DSP simulation (compiled C) ran
substantially slower on the Opteron, which I thought was supposed to
have better floating-point performance than Xeons of that era. Maybe
it was just a case of the gcc -O5 optimization switches being totally
tuned to Intel instead of AMD, or maybe my Quartus P&R step is
primarily dominated by integer calculations.

I originally suspected P&R might have a lot of floating-point
calculations (even prior to signal-integrity considerations) if they
were doing any kind of physical synthesis (e.g., delay calculation
based on distance and fanout); ditto for STA, because that's usually
an integral part of the P&R loops. I also suspected that if floating-
point operations (at least multiplies, add/subtract, and MACs) could
be done in a single cycle, there would be no advantage to using
integer arithmetic instead (especially if manual, or somewhat explicit
integer scaling is required).

On the other hand, in something like a router, you can get more exact
location info wrt stuff like grid coordinates than you can with
floating-point. As far as dynamic range is concerned, I seem to recall
that SystemC standardized on 64-bit time to run longer simulations,
but SystemC is a different animal in that regard anyway. Nonetheless,
I also seem to recall that its implementation of time was 64-bit
integers (scaled), because the average FPU operations are really only
linear over the 53-bit mantissa part. Assuming they want linear
representation of time ticks, I can see the appeal of using 64-bit
integers in simulation.
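The point about the 53-bit mantissa can be checked directly. The sketch below is only an illustration; the helper names `double_keeps_tick` and `int64_keeps_tick` are made up for this example, and Python's unbounded integers stand in for a 64-bit counter by checking the range explicitly.

```python
LIMIT = 2 ** 53  # IEEE 754 doubles carry a 53-bit significand


def double_keeps_tick(t: int) -> bool:
    """True if advancing time t by one tick is still visible in a double."""
    return float(t) + 1.0 != float(t)


def int64_keeps_tick(t: int) -> bool:
    """True if t + 1 still fits in a signed 64-bit integer."""
    return t + 1 <= 2 ** 63 - 1


if __name__ == "__main__":
    for t in (LIMIT - 1, LIMIT):
        print(t, double_keeps_tick(t), int64_keeps_tick(t))
```

Below 2^53 ticks a double still resolves single ticks; at 2^53 a one-tick step is lost to rounding, while a 64-bit integer keeps counting uniformly up to 2^63 - 1.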

As far as event-driven simulations are concerned, I totally understand
how hard it is to make good use of multithreading or multiprocessing,
because everything is so tightly coupled in that evaluate/update/
reschedule loop. If you were working at a much higher level
(behavioral/transaction), where the number of low-level events is
lower and the computation behind "complex" events took up a much
larger portion of the evaluate/update/reschedule loop, then multicore/
multiprocessing solutions might be more effective for simulation.
(Agree/disagree?) It seems that as the simulation gets more
coarse-grained, even distributed processing (multiple machines on
a network) becomes more feasible. Obviously the scheduler has one
"core" and has to reside in one CPU/memory space, but if it has less
work to do, then it can handle less frequent communication with the
event-processing CPUs in another space.
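To make the evaluate/update/reschedule loop concrete, here is a minimal single-threaded sketch of an event-driven kernel. It is a toy under stated assumptions, not the design of any commercial simulator; `Simulator`, `schedule`, `run`, and `make_clock` are hypothetical names. It does show why the loop is hard to split across cores: every event is popped from one shared, time-ordered queue, and running an event may push new ones.

```python
import heapq
from itertools import count


class Simulator:
    """Toy event-driven kernel with a single time-ordered event queue."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._order = count()  # tie-breaker: same-time events run in FIFO order

    def schedule(self, delay, action):
        """Queue `action` to run `delay` ticks after the current time."""
        heapq.heappush(self._queue, (self.now + delay, next(self._order), action))

    def run(self, until):
        """Process events in time order up to and including time `until`."""
        while self._queue and self._queue[0][0] <= until:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)  # evaluate/update; the action may reschedule itself


def make_clock(half_period, trace):
    """Build an action that toggles a clock signal and reschedules itself."""
    state = {"value": 0}

    def toggle(sim):
        state["value"] ^= 1
        trace.append((sim.now, state["value"]))
        sim.schedule(half_period, toggle)

    return toggle


if __name__ == "__main__":
    sim = Simulator()
    trace = []
    sim.schedule(5, make_clock(5, trace))
    sim.run(until=20)
    print(trace)
```

In a coarser-grained model each `action` would do far more work per event, so the shared queue would be touched less often relative to the computation, which is the case argued above for multiprocessing.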

Back to Quartus in particular and Windows in general... Quartus
supports the new "number_of_cpus" or some similar variable, but only
seems to use it in small sections of quartus_fit (I think Altera is
just making their baby steps in this area).

That appears to be related to the number of processors inside one box.
If a single CPU is just hyperthreaded, the processor takes care of
instruction distribution unrelated to a variable like number_of_cpus,
right? And if there are two single-core processors in a box, obviously
it will utilize "number_of_cpus=2" as expected. Does anyone know how
that works with dual-core CPUs? I.e., if I have two quad-core CPUs in
one box, will setting "number_of_cpus=7" make optimal use of 7 cores
while leaving me one to work in a shell or window?

Does anyone know if Quartus makes better use of multiple processors in
a partitioned bottom-up flow compared to a single top-down compile
flow?

In 32-bit Windows, is that 3GB limit for everything running at one
time? i.e., is 4GB a waste on a Windows machine? Can it run multiple
2GB processes and go beyond 3 or 4GB? Or is 3GB an absolute O/S limit,
and 2GB an absolute process limit in Windows?

In 32-bit Linux, can it run 4GB per process and as many simultaneous
processes of that size as the virtual memory will support?

In going to 64-bit apps and O/S versions, should the tools run equally
fast as long as the processor is truly 64-bit?


Thanks again for all the insights and interesting discussion.


jj



On 27 Jul, 17:17, Kai Harrekilde-Petersen <k...@harrekilde.dk> wrote:
> Jon Beniston <j...@beniston.com> writes:
>
> >> > Static Timing Analysis (TimeQuest) is mostly double-precision
> >> > floating-point?
> >>
> >> I seriously doubt it. I don't see a need for floating point there
> >> when delays can use scaled integers.
>
> > Dynamic range?
>
> Not a likely problem. Even a 32bit int would be big enough for holding
> up to a ridiculous 4.3 seconds, assuming 1psec resolution.

I think you're a factor of 1000 out. For an ASIC STA, gate delays must be specified at a much finer resolution than 1ps.

Cheers,
Jon
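
The range figures under debate can be checked with a few lines of arithmetic. This is only a sketch; the function name `max_time_seconds` and the resolutions tried are assumptions chosen for illustration.

```python
def max_time_seconds(bits: int, resolution_s: float, signed: bool = False) -> float:
    """Largest time span representable by an integer tick counter."""
    ticks = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return ticks * resolution_s


if __name__ == "__main__":
    for bits, res, label in [(32, 1e-12, "1 ps"), (32, 1e-15, "1 fs"),
                             (64, 1e-12, "1 ps"), (64, 1e-15, "1 fs")]:
        print(f"{bits}-bit unsigned at {label}: {max_time_seconds(bits, res):.6g} s")
```

An unsigned 32-bit count of 1 ps ticks covers roughly 4.3 ms, and only about 4.3 µs at 1 fs, whereas a 64-bit count at 1 fs still covers several hours.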