Bug #1897

Memory leak in mdrun

Added by Mark Abraham over 4 years ago. Updated over 4 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

While testing the fix for #1873, I observed release-5-0 with gmx_mpi mdrun -multi + DD + GPU + FE code consuming increasing amounts of memory until it was sent SIGTERM by a system process. I haven't tried it with up-to-date code yet. The system is a single methane in water, transforming one deuterium to hydrogen and one hydrogen to deuterium (just to check that the dEkin/dl term now works with DD).

e.g. on tcbs20 (with 24 cores), run:

gmx_mpi mdrun -multidir simulation-4-lambda-0 simulation-4-lambda-1 simulation-4-lambda-2 simulation-4-lambda-3 simulation-4-lambda-4 simulation-4-lambda-5 -gpu_id 000000000000000000000000 -pin on -v -dd 2 2 1 -dlb no -ntomp 1 -nsteps 500000
Screen Shot 2016-02-03 at 09.46.56.png (160 KB): Ganglia screenshot of tcbs20 (Mark Abraham, 02/03/2016 09:46 AM)
memory-leak.tbz (3.04 MB): repro files (Mark Abraham, 02/03/2016 09:55 AM)

Associated revisions

Revision b4915b27 (diff)
Added by Mark Abraham over 4 years ago

Warn about OpenMPI 1.8.6

Refs #1897

Change-Id: I529e59f095bb38a35569da57d395658e038e3f8f

History

#1 Updated by Berk Hess over 4 years ago

Did you try to run without replica exchange and/or without FE?
Both those components do some snew's at every step (most introduced by Michael). It wouldn't surprise me if there is a free missing.
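
For illustration, here is a minimal, self-contained sketch of the allocation pattern Berk describes, using stand-ins for GROMACS's snew()/sfree() macros from smalloc.h. The routine accumulate_step and the sizes used are hypothetical, not actual mdrun code:

#include <stdlib.h>

/* Minimal stand-ins for GROMACS's snew()/sfree() macros (smalloc.h);
 * the real macros also abort on allocation failure. */
#define snew(ptr, n) ((ptr) = calloc((n), sizeof(*(ptr))))
#define sfree(ptr)   free(ptr)

/* Hypothetical per-step accumulation routine. If the sfree() at the
 * end is forgotten, each MD step leaks one buffer and memory use
 * grows linearly with the step count. */
static void accumulate_step(int natoms)
{
    double *scratch;

    snew(scratch, natoms); /* allocated afresh every step */
    for (int i = 0; i < natoms; i++)
    {
        scratch[i] = 0.0;  /* ...per-step work... */
    }
    sfree(scratch);        /* the suspected missing free */
}

int main(void)
{
    for (int step = 0; step < 500000; step++) /* cf. -nsteps 500000 */
    {
        accumulate_step(3000);
    }
    return 0;
}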

#2 Updated by Mark Abraham over 4 years ago

Indeed, I need to try some simpler cases.

#3 Updated by Mark Abraham over 4 years ago

Berk Hess wrote:

> Did you try to run without replica exchange and/or without FE?

There's no replica exchange, merely aggregation into a multi-sim.

> Both those components do some snew's at every step (most introduced by Michael). It wouldn't surprise me if there is a free missing.

Yeah, but this is sd + mass-lambdas, so a nearly Michael-free zone...

#4 Updated by Berk Hess over 4 years ago

There is some snewing in the free-energy accumulation etc.

#5 Updated by Mark Abraham over 4 years ago

I can reproduce the leak when running on just a single CPU core (i.e. no multi, no GPU, no DD), and if I turn free-energy off then I don't see a leak (so far). Will poke some more tomorrow.

#6 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1897.
Uploader: Mark Abraham ()
Change-Id: I8328b30a12689795c7af2d12dfc94db11b78a03a
Gerrit URL: https://gerrit.gromacs.org/5618

#7 Updated by Mark Abraham over 4 years ago

  • Status changed from New to Rejected

I can find no evidence of a problem with the debugging output of the smalloc routines, nor with LeakSanitizer. The problem is certainly not related to the free-energy code. If I change the period between output steps of various kinds, then I can affect the total system memory used (as observed e.g. with watch -n 1 free -m) in the way you might expect for buffered file writing: writing no output leads to no growth in system memory used. I think my observations originate in the way the file-system buffering was implemented, and were only a problem because my runs were reasonably long, wrote lots of output, and there were many processes on the node because it was a multi-sim.
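
For reference, LeakSanitizer catches exactly the class of per-step malloc/snew leak that was suspected, so a clean run rules that class of bug out. A minimal sketch (the filename and compiler flags here are illustrative):

#include <stdlib.h>

/* Deliberately leaky toy program: LeakSanitizer reports the
 * unfreed 100-byte allocation when the process exits. */
int main(void)
{
    char *buf = malloc(100);
    (void)buf; /* never freed */
    return 0;
}

/* Build and run with, e.g.:
 *   gcc -g -fsanitize=leak leak.c -o leak && ./leak
 * which prints a report like "Direct leak of 100 byte(s) in 1 object(s)".
 * mdrun under the same tooling reported no such leaks. */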

#8 Updated by Mark Abraham over 4 years ago

I believe my observations were made with the OpenMPI 1.8.6 available as the default in our tcbs modules, which has known memory leaks, also observed by e.g. LAMMPS and CP2K users, per Google. I'll upload a warning for our build system, but I don't think we can do more than that.
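
The committed change (b4915b27 above) adds the warning to the build system. As an illustrative alternative only, and not the committed code, the affected release could also be detected at compile time from the version macros OpenMPI defines in mpi.h:

#include <mpi.h>

/* Sketch of a compile-time guard against the leaky release, using
 * OpenMPI's OMPI_*_VERSION macros. Note #warning is a common compiler
 * extension (standard only since C23), and the actual GROMACS fix
 * lives in the CMake build system, not in a source file like this. */
#if defined(OMPI_MAJOR_VERSION) && OMPI_MAJOR_VERSION == 1 && \
    OMPI_MINOR_VERSION == 8 && OMPI_RELEASE_VERSION == 6
#warning "OpenMPI 1.8.6 has known memory leaks; please use another version"
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}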

#9 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1897.
Uploader: Mark Abraham ()
Change-Id: I529e59f095bb38a35569da57d395658e038e3f8f
Gerrit URL: https://gerrit.gromacs.org/5717
