Memory leak in mdrun
While testing the fix for #1873, I observed release-5-0 running gmx_mpi mdrun -multi + DD + GPU + FE code consume increasing amounts of memory until it was sent SIGTERM by a system process. I haven't tried it with up-to-date code yet. The system is a single methane in water, transforming one deuterium to hydrogen and one hydrogen to deuterium (just to check that the dEkin/dl term works with DD now).
E.g. on tcbs20 (with 24 cores), run:
gmx_mpi mdrun -multidir simulation-4-lambda-0 simulation-4-lambda-1 simulation-4-lambda-2 simulation-4-lambda-3 simulation-4-lambda-4 simulation-4-lambda-5 -gpu_id 000000000000000000000000 -pin on -v -dd 2 2 1 -dlb no -ntomp 1 -nsteps 500000
#3 Updated by Mark Abraham almost 4 years ago
Berk Hess wrote:
Did you try to run without replica exchange and/or without FE?
There's no replica exchange, merely aggregation into a multi-sim.
Both those components do some snew's at every step (most introduced by Michael). It wouldn't surprise me if there is a free missing.
Yeah, but this is sd + mass-lambdas, so a nearly Michael-free zone...
#7 Updated by Mark Abraham almost 4 years ago
- Status changed from New to Rejected
I can find no evidence of a problem, either with debugging output from the smalloc routines or with LeakSanitizer. The problem is certainly not related to the free-energy code. If I change the period between output steps of various kinds, then I can affect the total system memory used (as observed e.g. with watch -n 1 free -m) in the way you might expect for buffered file writing: writing no output leads to no growth in system memory used. I think my observations originate in the way the file system buffering was implemented, and were only a problem because my runs were reasonably long, writing lots of output, and there were many processes on the node because it was a multi-sim.
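For anyone wanting to repeat this kind of check, here is a minimal sketch of how memory held by processes could be separated from memory held by file system buffering, assuming a Linux node with the usual /proc/meminfo fields. The function names are hypothetical and nothing here is part of GROMACS. The "used" figure is only an approximation of what free reports, and the exact formula differs between versions of free.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Key: value" line
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def split_used_and_cache(meminfo):
    """Return (used_kb, cache_kb).

    cache_kb sums buffers, page cache and reclaimable slab, i.e. memory the
    kernel can give back. used_kb is what remains after subtracting free
    memory and cache from the total (an assumed approximation).
    """
    cache = (meminfo.get("Buffers", 0)
             + meminfo.get("Cached", 0)
             + meminfo.get("SReclaimable", 0))
    used = meminfo["MemTotal"] - meminfo["MemFree"] - cache
    return used, cache


def sample_once(path="/proc/meminfo"):
    """Read the given meminfo file once and return (used_kb, cache_kb)."""
    with open(path) as handle:
        return split_used_and_cache(parse_meminfo(handle.read()))
```

Calling sample_once() at intervals during a run and comparing the two numbers would show whether the growth is in cache_kb (consistent with buffered file writing) or in used_kb (more suggestive of a real leak in some process).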
#8 Updated by Mark Abraham almost 4 years ago
I believe my observations were made with the OpenMPI 1.8.6 available as the default in our tcbs modules, which has known memory leaks that have also been observed by e.g. LAMMPS and CP2K users, per Google. I'll upload a warning for our build system, but I don't think we can do more than that.