Bug #2014

GROMACS free energy memory footprint

Added by Tyler Reddy over 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Category: mdrun
Difficulty: uncategorized

Description

Erik asked me to open an issue related to my observations below (pasted from my original email). I'll endeavour to provide additional information as requested. As reproducing the issue requires a rather large file, I haven't included a 'minimum failing example' yet, but could probably cook up a non-confidential system / set of files with a bit of effort. For now, I'll provide the mdrun traceback and the email description sent to Erik.

Traceback:

    Program gmx mdrun, VERSION 5.1.2
    Source code file: /gromacs-5.1.2/src/gromacs/utility/smalloc.c, line: 227

    Fatal error:
    Not enough memory. Failed to realloc -8589934592 bytes for nlist->jjnr, nlist->jjnr=9988e010
    (called from file /gromacs-5.1.2/src/gromacs/mdlib/nbnxn_search.c, line 3392)
    For more information and tips for troubleshooting, please check the GROMACS
    website at http://www.gromacs.org/Documentation/Errors
    -------------------------------------------------------
    : Cannot allocate memory

----------
Email Description of Issue:

I've observed something interesting about the free energy calculation behaviour in GROMACS when extremely large numbers of particles are coupled (I have a legitimate use for this application). In short, ramping the lambda value from 0 to 1 for N particles causes an increase in the amount of physical memory (RAM) consumed by GROMACS.

As you enter the millions of coupled particles (which no normal end user would ever do) the memory footprint appears to exceed 128 GB of RAM, which is the most I have access to. Please see a brief description of my measurements below. I am happy to provide additional details regarding some aspects of the runs if you think you can help me get this running.

Unfortunately, coupling separately isn't really an option for me. Is this basically a no-go in GROMACS for now? That would be good to know! This is basically a scaling up of the Alchembed technique (http://pubs.acs.org/doi/abs/10.1021/ct501111d?src=recsys) for a different purpose.
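For reference, "ramping the lambda value from 0 to 1" here means GROMACS slow-growth free-energy coupling. A minimal sketch of the kind of .mdp settings involved (the values are illustrative assumptions only, not my actual run files):

    ; alchembed-style slow growth: lambda climbs from 0 to 1 over the run (sketch)
    integrator       = md
    nsteps           = 1000        ; 1000 steps * delta-lambda 0.001 ramps lambda to 1
    free-energy      = yes
    init-lambda      = 0           ; coupled molecules start non-interacting
    delta-lambda     = 0.001       ; per-step lambda increment (slow growth)
    couple-moltype   = W           ; hypothetical molecule type to couple
    couple-lambda0   = none        ; no nonbonded interactions at lambda = 0
    couple-lambda1   = vdw-q       ; full van der Waals + Coulomb at lambda = 1
    couple-intramol  = yes
    sc-alpha         = 0.5         ; soft-core, so overlapping particles don't blow up
    sc-power         = 1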

--------------------
The following observations using GROMACS 5.1.2 and lambda-phasing simulations (i.e., alchembed-style) on a node with 128 GB RAM suggest that I am actually hitting a physical memory ceiling:

coupling 1,584 CG particles (132 POPE residues) = ok
coupling 19,054 CG particles (1361 PRPE residues) = ok
coupling 40,843 CG particles (3713 DPSM residues) = ok
coupling 201,227 CG particles (15479 RUPE residues) = ok
coupling 436,887 CG particles (39717 RPSM residues) = ok
coupling 1,009,208 CG particles (126151 RHOL residues) = Fatal Error: Not Enough memory
coupling full SYSTEM (> 3M particles) = Fatal Error: Not Enough memory
coupling matching topology / full system with only 3 copies of each residue = ok
[residue names starting with R are position restrained, but forces generated are actually still sufficient to displace them]

I'm not really open to doing the alchemical phasings for each molecule type separately unless it is the only (i.e., last) resort, for various reasons. Full-system phasing clearly works at smaller system sizes. Any other suggestions to reduce the memory footprint? Does this surprise you?

One doesn't normally do free energy calculations on millions of particles (right?), so it may be that minimizing the memory footprint in that range hasn't been a priority.
----------

alchembed_round_1.log (14.5 KB) alchembed_round_1.log Tyler Reddy, 07/28/2016 05:37 PM
run_files_lambda_growth.tar (3.5 KB) run_files_lambda_growth.tar Tyler Reddy, 07/30/2016 06:01 PM

Related issues

Related to GROMACS - Feature #742: Enhancing the performance of the free energy code (New)
Related to GROMACS - Task #2010: Use size_t instead of int for indexing (New)
Related to GROMACS - Feature #1665: improve free energy non-bonded kernel performance (New)

Associated revisions

Revision c2444839 (diff)
Added by Berk Hess over 3 years ago

Reduce FE pair-list memory usage

Refs #2014.

Change-Id: Ia8bbc55eed7e3590e6944127ab94dd41f475e6a7

History

#1 Updated by Mark Abraham over 3 years ago

Thanks for the report.

The current free-energy kernels + list-building infrastructure are probably correct, but are very much a hack from the old group scheme into the Verlet scheme. We can do much better on multiple fronts, but making significant progress basically needs a re-write, and needs some other code work to happen first. That said, I can't think of a good reason such simulations should need anything like this amount of memory... My guess is that the expectation of only a small number of perturbed atoms has led to code that has completely separate data structures for each perturbed particle...

Is memory usage constant over time? Are the failing cases immediate, or do they fail after the simulation has progressed a bit? (You're doing slow growth, as far as I can tell so far.) If memory usage is not constant, then we'll likely find and fix a memory leak ASAP.

Sharing some log files (e.g. in a tarball attachment) might help us get some preliminary insight.

#2 Updated by Mark Abraham over 3 years ago

  • Related to Feature #742: Enhancing the performance of the free energy code added

#3 Updated by Tyler Reddy over 3 years ago

I've attached a log file for the 3-million-particle slow-growth memory failure case. The memory failure appears to occur at step 0, before any progress is made, as far as I can tell. I thought about doing some memory profiling, but that can be tricky. I might be able to come up with a simple failure example that involves hydrating a container with a bunch of water and just growing all the water particles simultaneously, if that would help.

#4 Updated by Mark Abraham over 3 years ago

The log file doesn't scream about any intrinsic problems. Why are you running on a single core? If there are other such mdrun processes on the node, then they're likely just making the memory problem worse.

Having such an mdp+top+gro would be good for our collection of "things users try to do that we don't think of." Berk will know offhand why the memory allocation pattern is the way it is, but he's still on holiday.

#5 Updated by Tyler Reddy over 3 years ago

I'm running on a single core for several reasons: 1) the purpose of the lambda-growth process is to resolve steric conflicts in a highly pathological starting configuration, where some particles may share the same coordinate space and move a fair bit during the growth process -- such movements may or may not be problematic for a domain-decomposed system; 2) as I'm still debugging the overall procedure (which involves other parts beyond this), introducing the additional complication of domain decomposition struck me as potentially confounding other issues; 3) it is the standard approach we use with this alchemical technique, which has so few integration steps that the time lost shouldn't be too bad, even for large systems.

Note that GROMACS is running in a docker container on a single node for all of my use cases, but this is reproducible on newer and older nodes with quite different architectures. No other mdrun processes were running on the nodes at the time (nobody else was even logged in for the log file provided -- it was an isolated node that only I was using to debug).

Ok, I'll see what I can do about getting a 'minimum failing example' for you--probably tomorrow.

#6 Updated by Tyler Reddy over 3 years ago

I've produced and confirmed a 'minimum failing example' which also has some additional debug information and observations on time / memory usage (see below & attached):

  • download the MARTINI water coordinate file and associated forcefield parameters:
    wget http://md.chem.rug.nl/images/applications/water/water.gro
    wget http://md.chem.rug.nl/images/parameters/ITP/martini_v2.1.itp
    
  • duplicate the last water bead in the coordinate file 2,000,000 times and adjust the particle count on the second line of the .gro file accordingly (2,000,400 particles total; a shell sketch of this step follows at the end of this comment)
  • use the provided .top and .mdp files to run the GROMACS 5.1.3 preprocessor command:
    gmx grompp -f alchembed_slow.mdp -c water.gro -p water.top -o alchembed_slow.tpr -maxwarn 99
  • run the slow lambda-growth procedure in GROMACS:
    gmx mdrun -v -s alchembed_slow.tpr -stepout 1
  • after waiting several minutes, no sign of a first step (2:39 pm); the laptop fan is fired up, but the gmx memory footprint looks pretty stable so far at less than half a GB
  • after roughly 1 hour (!!) the code finally printed the following:
    starting mdrun 'WATER'
    1000 steps,     10.0 ps.
    
  • at this stage the GROMACS memory footprint appears to have doubled to about 920 MB (3:43 pm)
  • 4:18 pm -- gmx memory footprint is still stable at 924 MB -- still no steps printed though
  • 4:56 pm -- gmx finally crashed (see traceback below, which actually contains a bit more debug info this time for whatever reason -- the debug breakpoint):
gmx(13757,0x700000081000) malloc: *** mach_vm_map(size=18446744065119617024) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

-------------------------------------------------------
Program gmx mdrun, VERSION 5.1.3
Source code file: /usr/local/module_software/gromacs-5.1.3/src/gromacs/utility/smalloc.c, line: 227

Fatal error:
Not enough memory. Failed to realloc -8589934592 bytes for nlist->jjnr, nlist->jjnr=a95eb000
(called from file /usr/local/module_software/gromacs-5.1.3/src/gromacs/mdlib/nbnxn_search.c, line 3392)
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
: Cannot allocate memory
  • Note that I ran this code on yet another machine (a mac retina laptop) and used all the available cores this time. This suggests the issue is reproducible in a variety of situations.
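For completeness, the duplication step above can be scripted. A minimal sketch (GNU sed assumed; a .gro file is a title line, an atom count, the atom lines, then the box line, and grompp takes atom identities from the topology, so the repeated atom/residue numbers in the copied lines are harmless):

    # Duplicate the last water bead 2,000,000 times (sketch; GNU sed assumed).
    box=$(tail -n 1 water.gro)                # the last line of a .gro holds the box vectors
    atom=$(tail -n 2 water.gro | head -n 1)   # the last atom line
    sed '$d' water.gro > water_big.gro        # everything except the box line
    yes "$atom" | head -n 2000000 >> water_big.gro
    echo "$box" >> water_big.gro
    sed -i '2s/.*/ 2000400/' water_big.gro    # fix the particle count on line 2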

#7 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #2014.
Uploader: Berk Hess
Change-Id: Ia8bbc55eed7e3590e6944127ab94dd41f475e6a7
Gerrit URL: https://gerrit.gromacs.org/6105

#8 Updated by Berk Hess over 3 years ago

  • Category set to mdrun
  • Status changed from New to In Progress
  • Assignee set to Berk Hess
  • Priority changed from High to Normal
  • Target version set to 2016.1

This is not a memory footprint issue, but rather the pair-list size running over max int. You have 3 million perturbed particles, which is too close to the limit for a single pair-list (i.e., a single core). Using more cores will avoid the issue. There is also a bug that increases the list size by a factor of 4; I pushed a fix for this to release-2016. The fix will make your system work, but it just pushes the issue to larger sizes. You can easily backport it to 5.1 if you need it there.
Here, and in some other places in the code, we should either add size checks against max int or switch to 64-bit ints (if that doesn't hurt performance, which I suspect it would in this case).
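To make the overflow concrete: -8589934592 is exactly -2^33, i.e. a 32-bit pair-list entry count that wrapped past max int to -2^31 and was then multiplied by sizeof(int) = 4 bytes for the realloc; the mach_vm_map size 18446744065119617024 in comment #6 is the same value read back as unsigned 64-bit. A standalone C sketch of the wrap and of the kind of max-int size check meant above (illustrative only, not GROMACS source; the variable names are hypothetical):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int n_entries = 1 << 29; /* pair-list entries needed, stored as a 32-bit int */
        int factor    = 4;       /* the factor-of-4 list-size bug                    */

        /* The kind of size check meant above: detect the overflow before it happens. */
        if (n_entries > INT_MAX / factor)
        {
            fprintf(stderr, "pair-list size would exceed max int; "
                            "use more ranks or 64-bit indexing\n");
        }

        /* Without the check: signed overflow (undefined behaviour, but it wraps to
         * -2^31 on typical platforms), and the byte count then matches the error. */
        int       wrapped = n_entries * factor;
        long long bytes   = (long long)wrapped * (long long)sizeof(int);
        printf("realloc request: %lld bytes\n", bytes); /* prints -8589934592 */
        return 0;
    }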

#9 Updated by Berk Hess over 3 years ago

  • Related to Task #2010: Use size_t instead of int for indexing added

#10 Updated by Tyler Reddy over 3 years ago

I don't see any commits in release-2016 since August 4, 2016. Does it take a while for them to show up on redmine / github?

#11 Updated by Berk Hess over 3 years ago

My gerrit change is linked on this redmine issue page, right above my first reply.

#12 Updated by Szilárd Páll about 3 years ago

  • Related to Feature #1665: improve free energy non-bonded kernel performance added

#13 Updated by Szilárd Páll about 3 years ago

@Tyler, have a look at #1665, which I've just linked. You may be leaving quite some performance on the table, as your runs are likely spending significant time in the FE kernels.

#14 Updated by Tyler Reddy about 3 years ago

My use case is pretty strange, but very useful to me. I think the entire run is basically FE, right? All steps are climbing up the lambda value a little bit. My impression is that it wasn't clear to the devs why I used a single core & that this limitation was also triggering the memory issue (which, ok, wasn't actually a memory issue). If it were true that the stability of all GROMACS (FE) simulations were identical irrespective of domain decomposition state, this would be very useful for my purposes.

For example, if the stability on 256 cores were always indistinguishable from the stability on a single core, even if particles are zipping around quite a bit in a larger container--that would be very powerful & I probably wouldn't have even encountered the original challenge here. But, as it stands, and I may not state this quite correctly--there is at least a modestly increased chance of a crash for a somewhat unstable system when it is decomposed over many cores versus a single core run.

#15 Updated by Mark Abraham about 3 years ago

Tyler Reddy wrote:

> My use case is pretty strange, but very useful to me. I think the entire run is basically FE, right? All steps are climbing up the lambda value a little bit. My impression is that it wasn't clear to the devs why I used a single core & that this limitation was also triggering the memory issue (which, ok, wasn't actually a memory issue). If it were true that the stability of all GROMACS (FE) simulations were identical irrespective of domain decomposition state, this would be very useful for my purposes.

Nobody needs the stability to be 100% regardless of the stability of the underlying simulation. I'll ignore the cases where the simulation is just broken and going to explode regardless of DD. Otherwise, implementing DD requires that the halo exchange has a characteristic distance range for the halo handled each step, and also for the range of nearest domain neighbours to which a particle could move over the lifetime of the decomposition. Because particles move, you have to re-do the decomposition, and you need to know where each particle might move before the next decomposition. You could engineer it for the nearest neighbour, or the nearest n neighbours, but the latter is inefficient unless the characteristic velocity is so large that a particle travels more than about the diameter of a domain. So yes, "stability" is necessarily a function of the number of domains in the decomposition for an efficient implementation. But for a system as large as yours, you can surely have tens of domains with no loss of stability resulting from the implementation of DD, provided the simulation is intrinsically stable.

> For example, if the stability on 256 cores were always indistinguishable from the stability on a single core, even if particles are zipping around quite a bit in a larger container--that would be very powerful & I probably wouldn't have even encountered the original challenge here. But, as it stands, and I may not state this quite correctly--there is at least a modestly increased chance of a crash for a somewhat unstable system when it is decomposed over many cores versus a single core run.

Is the physics of "zipping around" meaningful? If it's merely an equilibration phase, then the standard techniques apply, e.g. http://www.gromacs.org/Documentation/Terminology/Blowing_Up, including using a smaller time step.

#16 Updated by Mark Abraham about 3 years ago

  • Target version changed from 2016.1 to 2016.2

I'm not sure there's anything remaining here that we might do in a patch release, but the general issue in #1665 remains. I'll change this to a feature request, unless anyone has further thoughts.

#17 Updated by Tyler Reddy about 3 years ago

I don't think reducing the timestep or conventional equilibration techniques work if the starting coordinates are not physically sensible -- i.e., many particles occupying the same space. When building extremely large / complex systems, it can happen that particles clash with each other in ways that cannot be resolved using conventional techniques. For these cases, abusing the GROMACS free energy machinery to lambda-phase the particles in is quite useful, but this unusual use case seems to require using fewer cores to avoid a crash.

Nonetheless, the patch produced here allowed me to do what I needed to do, so thanks for that. Furthermore, the issue with the workflow is arguably at the molecular packing level & not the simulation level -- I'm basically using GROMACS to clean up the mess that is made by deficiencies in system building / packing programs that are not forcefield aware (i.e., pretty much all of them).

#18 Updated by Mark Abraham about 3 years ago

Tyler Reddy wrote:

> I don't think reducing the timestep or conventional equilibration techniques work if the starting coordinates are not physically sensible -- i.e., many particles occupying the same space.

They'd certainly have to be used in concert with other techniques in the kind of extreme case you're considering. But if the time a simulation takes matters (since we're talking about parallelization), then an 8-way DD that becomes possible because the timestep is half the size will still improve throughput compared to a single domain.

> When building extremely large / complex systems, it can happen that particles clash with each other in ways that cannot be resolved using conventional techniques. For these cases, abusing the GROMACS free energy machinery to lambda-phase the particles in is quite useful, but this unusual use case seems to require using fewer cores to avoid a crash.

OK, but that seems intrinsic to the way GROMACS is engineered for stable simulations, rather than a problem in the DD or FE modules.

> Nonetheless, the patch produced here allowed me to do what I needed to do, so thanks for that. Furthermore, the issue with the workflow is arguably at the molecular packing level & not the simulation level -- I'm basically using GROMACS to clean up the mess that is made by deficiencies in system building / packing programs that are not forcefield aware (i.e., pretty much all of them).

Yeah, filling space compactly is a tough problem, though. That's why gmx membed has a niche.

#19 Updated by Szilárd Páll about 3 years ago

  • Status changed from In Progress to Resolved

Tyler Reddy wrote:

> Nonetheless, the patch produced here allowed me to do what I needed to do, so thanks for that.

Good. In that case, I think we can mark this issue resolved.

> Furthermore, the issue with the workflow is arguably at the molecular packing level & not the simulation level -- I'm basically using GROMACS to clean up the mess that is made by deficiencies in system building / packing programs that are not forcefield aware (i.e., pretty much all of them).

If you think there are common and important use cases where such placement/packing issues could be better resolved through some special MD equilibration mode (whether by abusing the FE module or not), I suggest opening a separate feature request for further discussion (instead of converting this issue).

#20 Updated by Szilárd Páll about 3 years ago

BTW, with such large systems, as Mark pointed out, you should be able to use at least an 8-16-way DD; combined with, say, 4-8 threads/rank, that would let you scale to a few nodes instead of running on a single core.
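For example, a sketch of such a launch (the counts are illustrative and should be matched to the actual hardware; -ntmpi sets the number of thread-MPI ranks, i.e. DD domains, and -ntomp the OpenMP threads per rank):

    # 16 thread-MPI ranks (16-way DD) x 4 OpenMP threads per rank = 64 cores
    gmx mdrun -v -s alchembed_slow.tpr -ntmpi 16 -ntomp 4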

#21 Updated by Mark Abraham almost 3 years ago

  • Status changed from Resolved to Closed
