Quartus II Synthesis - System Memory Issues for Large Stratix 10 Design

Started by Chris Adams October 29, 2021
Hello,

I have a Stratix 10 design that is based around an IP core generated using Intel's HLS. The core does some simple floating point operations and by itself uses very few resources (1 DSP, a few hundred flops etc).

This core sits inside a generate statement like this:

generate
    for (i = 0; i < SOMEBIGNUMBER; i = i + 1)
        myhlscore u0 (inputs, outputs);
...
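
For readers following along, the fragment above can be written out as a complete, compilable loop. This is only a sketch: the port names (clk, din, dout), the width parameter W, the wrapper name core_array and the placeholder body of myhlscore are assumptions for illustration, since the real HLS port list is not shown in the thread. The named generate block (g_core) gives every instance a distinct hierarchical path such as g_core[3].u0.

```verilog
// Hypothetical stand-in for the HLS-generated core.
// The real port list and behaviour are not given in the thread.
module myhlscore (
    input  wire        clk,
    input  wire [31:0] din,
    output reg  [31:0] dout
);
    always @(posedge clk)
        dout <= din;            // placeholder behaviour only
endmodule

// Wrapper that replicates the core SOMEBIGNUMBER times.
module core_array #(
    parameter SOMEBIGNUMBER = 4,    // small default; raise as needed
    parameter W             = 32    // assumed data width per core
) (
    input  wire                       clk,
    input  wire [SOMEBIGNUMBER*W-1:0] din,
    output wire [SOMEBIGNUMBER*W-1:0] dout
);
    genvar i;
    generate
        for (i = 0; i < SOMEBIGNUMBER; i = i + 1) begin : g_core
            // Each iteration instantiates the same module definition
            // and connects it to its own W-bit slice of the buses.
            myhlscore u0 (
                .clk  (clk),
                .din  (din [i*W +: W]),
                .dout (dout[i*W +: W])
            );
        end
    endgenerate
endmodule
```

Whether this form changes the tool's memory usage at all is not something the sketch claims; it only shows the structure being discussed.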


The design works and is proven in simulation and in hardware.

The problem comes when I try to increase the value of SOMEBIGNUMBER. Despite there being adequate resources, using values above 200 or so makes the synthesis tool run out of memory.

I cannot alleviate this easily by adding more memory - I already tried synthesizing on a computer with 256GB memory and a 200GB swap space and Quartus ate it all up before dying.

I'm using a .ip file from HLS right now. I'm wondering if there is some way to pre-synthesize the module and keep the results, or is there some way I need to write the generate statement so that it caches less? Perhaps there are some synthesis settings I can change?

Thanks,
C
Chris Adams <chris@chrisada.co.uk> wrote:
> I'm using a .ip file from HLS right now. I'm wondering if there is some
> way to pre-synthesize the module and keep the results, or is there some way
> I need to write the generate statement so that it caches less? Perhaps
> there are some synthesis settings I can change?

You can put the module in a design partition. That should mean the tools can reuse the block again if it's repeated, and won't try and flatten everything out before routing, which is probably what causes the out of memory errors.

Design Partition Planner is the tool to do this. I haven't watched it all, but this video seems to cover how to use partitions:
https://www.youtube.com/watch?v=AW9kev4lM7g

Beware that if you're using a tool that generates Verilog (not sure if HLS is one) you need to keep the structure through to the Verilog level. In other words, if your myhlscore() is the same Verilog module on each iteration of the generate loop that's fine, but if your HLS core is producing SOMEBIGNUMBER different Verilog modules, or flattening them to one enormous module, that's going to be a problem. It looks like your HLS is constrained to inside the module and your generate is in regular Verilog, so that's probably OK.

Theo

On Saturday, 30 October 2021 at 16:55:47 UTC-4, Theo wrote:
> You can put the module in a design partition. That should mean the tools
> can reuse the block again if it's repeated, and won't try and flatten
> everything out before routing, which is probably what causes the out of
> memory errors.
> [...]

Good idea. We tried using a design partition, but even the elaboration stage uses over 120GB of memory before crashing. I may try setting up an extremely large swap file.

Chris

Chris Adams <chris@chrisada.co.uk> wrote:

> Good idea. We tried using a design partition, but even the elaboration
> stage uses over 120GB of memory before crashing. I may try setting up an
> extremely large swap file.

I wonder if the generate statement is causing the trouble, and whether just having a pile of flat instantiations would be any different?

One other thing you could try, if the I/O isn't too troublesome, is a tree of instantiations. Module A2 contains two instances of the module A, module A4 contains two instances of A2, module A8 two of A4, etc. That way it keeps the complexity at each level of hierarchy down. If the synthesiser is blowing up at the elaboration stage it could help if the elaboration in a specific module is within limits. I've not tried it though.

I've had Stratix 10 builds need >16GB but <64GB, for what it's worth, but these weren't super full designs.

Theo

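
The tree of instantiations described above might look roughly like the following. It is a minimal sketch, not a tested remedy: the leaf module A is a hypothetical placeholder for the HLS core, and the port names and the 32-bit width are assumptions chosen only to make the example self-contained. Each level instantiates exactly two copies of the level below, so no single module contains more than two instances.

```verilog
// Hypothetical leaf standing in for the HLS core (ports are assumed).
module A (
    input  wire        clk,
    input  wire [31:0] din,
    output reg  [31:0] dout
);
    always @(posedge clk)
        dout <= din;            // placeholder behaviour only
endmodule

// Two leaves.
module A2 (
    input  wire        clk,
    input  wire [63:0] din,
    output wire [63:0] dout
);
    A lo (.clk(clk), .din(din[31:0]),  .dout(dout[31:0]));
    A hi (.clk(clk), .din(din[63:32]), .dout(dout[63:32]));
endmodule

// Two A2 blocks, i.e. four leaves.
module A4 (
    input  wire         clk,
    input  wire [127:0] din,
    output wire [127:0] dout
);
    A2 lo (.clk(clk), .din(din[63:0]),   .dout(dout[63:0]));
    A2 hi (.clk(clk), .din(din[127:64]), .dout(dout[127:64]));
endmodule

// Two A4 blocks, i.e. eight leaves; continue doubling as required.
module A8 (
    input  wire         clk,
    input  wire [255:0] din,
    output wire [255:0] dout
);
    A4 lo (.clk(clk), .din(din[127:0]),   .dout(dout[127:0]));
    A4 hi (.clk(clk), .din(din[255:128]), .dout(dout[255:128]));
endmodule
```

A count that is not a power of two would need a mix of levels at the top (for example one A128, one A64 and one A8 for 200 cores).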
On Tuesday, 9 November 2021 at 15:24:51 UTC-5, Theo wrote:
> One other thing you could try, if the I/O isn't too troublesome, is a tree
> of instantiations. module A2 contains two instances of the module A, module
> A4 contains two instances of A2, module A8 two of A4, etc.
> [...]

This is also a good idea. We tried this, but it didn't help either.

We were eventually able to get the design to complete map, but it required a 500GB swap file and took 2 days or so to build!

I don't think this is really a good solution; memory access on the system swap is extremely slow.

Anyway, the design now fails at the route stage, saying the design cannot be routed... Device resource usage is around 60% following map.

Chris

On Tuesday, 14 December 2021 at 08:35:06 UTC-5, Chris Adams wrote:
> We were eventually able to get the design to complete map, but it required
> a 500GB swap file, and took 2 days or so to build!!!
> [...]

Just a quick update: we were able to further reduce synthesis memory usage by lowering the number of cores used for compilation. We went down to 4 cores, and memory usage was below 128GB with only a small impact on build performance.

Chris Adams <chris@chrisada.co.uk> wrote:
> Just a quick update - We were able to further reduce synthesis memory
> usage by lowering the number of cores used for compilation. We went down
> to 4 cores and memory usage was below 128GB with only a small impact on
> build performance.

That's good to know, I hadn't thought of that. Makes sense - it reduces parallelism but also reduces the number of copies of the working set.

I get the impression Quartus spends a fair chunk of its time not being parallel (Amdahl's law) - Quartus Pro is a bit better at being parallel, but I think it still prefers fewer cores with a higher clock.

Theo