Bug #2252

Memory allocation failures with large page sizes during PME tuning

Added by John Eblen about 2 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
hard

Description

PME tuning creates a large data structure (a struct gmx_pme_t) for every cutoff that it tries, which is replicated on each PME node. These data structures are not freed during tuning, so memory usage grows with every cutoff tried. Normally this extra memory is not a problem, but it can cause failures when using large page sizes, at least on NERSC Cori KNL.
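
A minimal sketch of that pattern (hypothetical code, not the actual GROMACS implementation; only gmx_pme_t is a real name here): each cutoff the tuner tries gets its own PME setup, and all of them stay allocated so the tuner can switch back to an earlier setting.

    /* Hypothetical sketch of the tuning allocation pattern described above;
     * everything except gmx_pme_t is made up for illustration. */
    struct gmx_pme_t;                                            /* opaque: grids, FFT plans, ... */
    extern struct gmx_pme_t *pme_init_for_cutoff(double cutoff); /* hypothetical helper           */

    #define MAX_TUNING_STAGES 16

    struct pme_setup
    {
        double            coulomb_cutoff; /* cutoff tried at this stage */
        struct gmx_pme_t *pme;            /* kept for the whole run     */
    };

    static struct pme_setup setups[MAX_TUNING_STAGES];
    static int              n_setups = 0;

    /* Called once per cutoff tried; nothing is freed until the end of the run,
     * so per-rank memory use grows with every tuning stage. */
    static void try_next_cutoff(double cutoff)
    {
        setups[n_setups].coulomb_cutoff = cutoff;
        setups[n_setups].pme            = pme_init_for_cutoff(cutoff);
        n_setups++;
    }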

Here is some sample output on NERSC Cori KNL with 32 nodes and a system with about 248,000 atoms. Code is compiled to use 2M "huge pages."

command-line: srun -n 512 -N 32 --cpu_bind=cores -c 16 gmx_mpi mdrun -ntomp 4 -npme 256 -nstlist 35 -nsteps 2000 -resetstep 1000 -pin off -noconfout -v

...
step 0
step 100, remaining wall clock time: 24 s
step 140: timed with pme grid 128 128 128, coulomb cutoff 1.200: 66.2 M-cycles
step 210: timed with pme grid 112 112 112, coulomb cutoff 1.336: 69.6 M-cycles
step 280: timed with pme grid 100 100 100, coulomb cutoff 1.496: 63.6 M-cycles
step 350: timed with pme grid 84 84 84, coulomb cutoff 1.781: 85.9 M-cycles
step 420: timed with pme grid 96 96 96, coulomb cutoff 1.559: 68.8 M-cycles
step 490: timed with pme grid 100 100 100, coulomb cutoff 1.496: 68.3 M-cycles
libhugetlbfs [nid08887:140420]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
libhugetlbfs [nid08881:97968]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
libhugetlbfs [nid08881:97978]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
...

Notes
1) The problem doesn't occur if fewer cutoffs are tried.
2) It doesn't occur with '-notunepme'.
3) It is less likely to occur with fewer PME ranks.
4) The exact cause is not yet known. It may be worth looking into how huge pages are configured on Cori.

hemicellulose_logs.tar.gz (52.7 KB) hemicellulose_logs.tar.gz Logs of runs on hemicellulose (248,101 atoms) John Eblen, 09/20/2017 07:45 PM

History

#1 Updated by Berk Hess about 2 years ago

I don't understand what could cause this. The allocations are not page-locked, and I assume a non-page-locked allocation does not consume a whole page (mdrun does many allocations, so that can't be what is happening). PME does a few allocations using snew_aligned, which likely uses _mm_malloc, but I would think that also does not consume a whole page.
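
For reference, a minimal sketch (not GROMACS code) of what such an aligned allocation normally costs: the allocator over-allocates by roughly the alignment, e.g. 64 bytes, rather than by a whole page, unless the heap it draws from itself grows in (huge-)page-sized segments.

    #include <stdio.h>
    #include <xmmintrin.h> /* _mm_malloc / _mm_free */

    int main(void)
    {
        size_t n   = 1000 * sizeof(float);
        float *buf = _mm_malloc(n, 64); /* 64-byte aligned; overhead is ~alignment, not a page */

        printf("requested %zu bytes, got pointer %p\n", n, (void *)buf);
        _mm_free(buf);
        return 0;
    }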

#2 Updated by Szilárd Páll about 2 years ago

KNL has only 16 GB of MCDRAM and 68 cores (in Cori), so if he's running 1 thread per PME rank with, say, 20 PME ranks, at the 6th grid setting that is only about 820 MB per rank (16 GB / 20), and then we have not yet considered the memory requirements of the 48 PP ranks.

I'm not sure in which mode John is running; in cache mode this should in theory not happen, and it could only be a performance and timing issue, in that data has to get cached first (and fit) before reliable timings can be obtained.

#3 Updated by John Eblen about 2 years ago

Cori is configured to use huge pages for all malloc'ed memory (HUGETLB_MORECORE is set). Huge pages are also not limited ("overcommit" setting is at maximum) and are not swapped.

Thus, I suspect that the pool of huge pages is being exhausted, which exhausts all of memory at the same time.

I generally use cache mode, because it is the fastest to allocate on Cori. (To use flat mode, you normally have to wait for a reboot.)

I have noticed that PME tuning does more work when using huge pages. Without huge pages, the cutoff is never tried above 1.336.

#4 Updated by Szilárd Páll about 2 years ago

John Eblen wrote:

> Cori is configured to use huge pages for all malloc'ed memory (HUGETLB_MORECORE is set). Huge pages are also not limited ("overcommit" setting is at maximum) and are not swapped.

> Thus, I suspect that the pool of huge pages is being exhausted, which exhausts all of memory at the same time.

Admittedly, I'm not fully familiar with how that works. How does the pool of huge pages relate to the amount of memory? If you use MCDRAM in cache mode, mdrun is not running out of the 16 GB MCDRAM as I hypothesized, so I don't get exactly what's happening, but I'd like to understand it in order to plan how to avoid (or detect/preempt) it.

> I generally use cache mode, because it is the fastest to allocate on Cori. (To use flat mode, you normally have to wait for a reboot.)

That should be ~best at least in the high parallelization regime (as long as everything surely fits in MCDRAM).

> I have noticed that PME tuning does more work when using huge pages. Without huge pages, the cutoff is never tried above 1.336.

That's interesting, but it does not tell us much. Can you share logs of completed runs with and without huge pages, without tuning (plus with tuning, if any completed)?

#5 Updated by John Eblen about 2 years ago

Szilárd Páll wrote:

> Admittedly, I'm not fully familiar with how that works. How does the pool of huge pages relate to the amount of memory? If you use MCDRAM in cache mode, mdrun is not running out of the 16 GB MCDRAM as I hypothesized, so I don't get exactly what's happening, but I'd like to understand it in order to plan how to avoid (or detect/preempt) it.

My understanding is that the pool of huge pages is just a portion of memory set aside for special use. Systems can be configured to have, say, 200 huge pages, and applications can request huge pages by special commands. Systems can also specify an "overcommit" size, where huge pages are created on-the-fly once the pool is exhausted.

Cori's configuration, though, is extremely simple. There is no pool of huge pages. (I shouldn't have said "pool" before.) Instead, the "overcommit" is set at maximum, and it is also set so that every allocation uses huge pages. No special command is needed. Thus, effectively, all memory is allocated in 2M chunks. Therefore, an application that does lots of allocations, regardless of size, can quickly exhaust memory.
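
A quick way to inspect that setup on a node (a sketch, not part of GROMACS): check whether libhugetlbfs is redirecting the heap to huge pages via HUGETLB_MORECORE, and read the kernel's huge-page counters from /proc/meminfo, where HugePages_Surp counts the pages created on the fly through overcommit.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *morecore = getenv("HUGETLB_MORECORE");
        printf("HUGETLB_MORECORE = %s\n", morecore ? morecore : "(not set)");

        FILE *fp = fopen("/proc/meminfo", "r");
        if (fp == NULL)
        {
            return 1;
        }
        char line[256];
        while (fgets(line, sizeof(line), fp))
        {
            /* HugePages_Total/Free/Rsvd/Surp and Hugepagesize */
            if (strncmp(line, "HugePages_", 10) == 0 || strncmp(line, "Hugepagesize", 12) == 0)
            {
                fputs(line, stdout);
            }
        }
        fclose(fp);
        return 0;
    }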

>> I generally use cache mode, because it is the fastest to allocate on Cori. (To use flat mode, you normally have to wait for a reboot.)

> That should be ~best at least in the high parallelization regime (as long as everything surely fits in MCDRAM).

For GROMACS we see a minor improvement with flat mode, but it is too small to warrant waiting for reboots.

>> I have noticed that PME tuning does more work when using huge pages. Without huge pages, the cutoff is never tried above 1.336.

> That's interesting, but it does not tell us much. Can you share logs of completed runs with and without huge pages, without tuning (plus with tuning, if any completed)?

#6 Updated by John Eblen about 2 years ago

The tar archive contains 24 log files of various runs with and without huge pages on a ~240,000 atom system. The file names indicate the settings for each run.

#7 Updated by Szilárd Páll about 2 years ago

To me it looks like communication is slower without huge pages (though cycle counts can be misleading due to imbalance). Update is also slower by ~5%, but what stands out is that DD is >2x slower. I'm not sure whether this is inherent behavior on KNL.

#8 Updated by Erik Lindahl about 2 years ago

  • Priority changed from Normal to Low

This reminds me of the old story "SMITH: Doctor, it hurts when I do this. DALE: Don't do that."

There are certainly some things we allocate and should free, but during tuning we need to retain the possibility of moving back when performance goes down, so we can't release them immediately.
For now I don't think it's anything that can be fixed easily in GROMACS, apart from the trivial solution of disabling the tuning on KNL for large systems.

#9 Updated by Roland Schulz about 2 years ago

This is potentially caused by us fragmenting the memory a lot. If for each step of the tuning we allocate quite a bit of memory and free all but a small part of it (because that part continues to be needed), this would fragment the available memory, which matters much more with huge pages. This could be prevented by, e.g., changing the allocation order, but the theory should first be confirmed. NERSC/Cray (the huge-pages module is a Cray-specific solution) should have tools to measure memory usage and hopefully also fragmentation. John, if you think this is important enough to spend time on, can you talk to NERSC and ask them for help measuring those two things?
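
A toy illustration of that pattern (not GROMACS code, and the exact behavior depends on the allocator): each step allocates a large scratch block plus a small block that stays alive; after the scratch is freed, the surviving small block can pin an otherwise empty stretch of the heap, which hurts far more when the heap grows in 2M huge-page segments than with 4K pages.

    #include <stdlib.h>

    #define N_STEPS    8
    #define SCRATCH_SZ (64 * 1024 * 1024) /* freed again after each step */
    #define KEEP_SZ    (256 * 1024)       /* stays allocated for the run */

    int main(void)
    {
        void *kept[N_STEPS];

        for (int i = 0; i < N_STEPS; i++)
        {
            void *scratch = malloc(SCRATCH_SZ); /* per-step working memory      */
            kept[i]       = malloc(KEEP_SZ);    /* small piece that is retained */
            /* ... use scratch and kept[i] ... */
            free(scratch); /* if kept[i] sits above it in the heap, the freed
                              region cannot be returned to the OS and may not
                              fit the next, larger scratch request */
        }

        for (int i = 0; i < N_STEPS; i++)
        {
            free(kept[i]);
        }
        return 0;
    }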

#10 Updated by Mark Abraham almost 2 years ago

While I'm sure there are aspects of our memory allocation that can be improved (and this feedback is useful for us in choosing where to prioritise work), making the granularity of memory allocation effectively 2M means the system is not usable for some kinds of applications. There are a number of available workarounds:

  • using -notunepme (which we could do by default for KNL, or when huge pages are enabled - is there an env var to detect this with? see the example command after this list)
  • using more of gmx tune_pme
  • using fewer ranks (might not run optimally, but running is faster than crashing)
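
For example, the first workaround applied to the command line from the description would be:

srun -n 512 -N 32 --cpu_bind=cores -c 16 gmx_mpi mdrun -notunepme -ntomp 4 -npme 256 -nstlist 35 -nsteps 2000 -resetstep 1000 -pin off -noconfout -v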
