Bug #1265

Memory leak?

Added by Michael Shirts almost 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: High
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I've found a memory leak that seems somewhat irreproducible. When run on my Mac, it's fine; when run on a Linux cluster, memory use eventually blew up to 4 GB (from about 300 MB at the start) after about 12 hours -- but it was a slow increase, 4-8 K every step (every ns?) or so.

So I don't know exactly which it was. The code was synched on 5/21/13 with release 4.6. It was compiled with gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (I haven't been able to get them to upgrade yet).

It was run on a single node with 16 cores (32 with hyperthreading):
mdrun -ntmpi 32 -dd 5 3 2 -npme 2 -deffnm solv.20 -dhdl solv.20.dhdl.xvg

I tried running with the same tpr and the same command (32 threads) on an OS X laptop to try to reproduce it, but failed -- memory did not increase once it hit its stride.
The local version was compiled with debug on, though, so that might also make a difference? Hard to tell with these memory things.

I'm happy to try some other things if people want to suggest, though I'm not 100% sure what the best option is -- hence asking before I try a lot of things that won't diagnose it.

Logfile build info:

Gromacs version: VERSION 4.6.2-dev-20130521-f78f0dc
GIT SHA1 hash: f78f0dc83f55d588f0fbc049af667519d9cf868e
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled
GPU support: disabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: SSE2
FFT library: fftw-3.3.3-sse2
Large file support: enabled
RDTSCP usage: disabled
Built on: Tue Apr 23 11:37:55 EDT 2013
Built by: [CMAKE]
Build OS/arch: Linux 2.6.32-279.19.1.el6.x86_64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
Build CPU family: 6 Model: 26 Stepping: 5
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/gcc GNU gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
C compiler flags: -msse2 -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wall -Wno-unused -Wunused-value -fomit-frame-pointer -funroll-all-loops -O3 -DNDEBUG

solv.20.tpr - tpr that has the problem (2.29 MB) Michael Shirts, 05/25/2013 03:57 PM

Associated revisions

Revision 1babb4bb (diff)
Added by Michael Shirts almost 4 years ago

Fix of memory leak

Fixes #1265 (probably)

freeing data in expanded ensemble code that should have been freed all along.

Change-Id: I115ee068c56e4edab8fcea828e60ee1386f00716
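
For readers unfamiliar with this class of bug, the sketch below illustrates the general pattern the commit message describes: a routine called every step allocates scratch data and never frees it, so the footprint grows by a small, constant amount per step. This is not the code from expanded.c; every name here (scratch_alloc, step_leaky, step_fixed, run_steps, live_blocks) is hypothetical and exists only to make the leak countable.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping: number of scratch blocks currently allocated. */
static long live_blocks = 0;

static double *scratch_alloc(size_t n)
{
    double *p = calloc(n, sizeof *p);
    if (p != NULL)
    {
        live_blocks++;
    }
    return p;
}

static void scratch_free(double *p)
{
    if (p != NULL)
    {
        live_blocks--;
        free(p);
    }
}

/* Leaky variant: allocates a per-step scratch array and loses the pointer. */
static double step_leaky(int nlambda)
{
    double *weights = scratch_alloc((size_t)nlambda);
    double  sum     = 0.0;
    int     i;

    if (weights == NULL)
    {
        return 0.0;
    }
    for (i = 0; i < nlambda; i++)
    {
        weights[i] = 1.0 / nlambda;
        sum       += weights[i];
    }
    return sum; /* weights is never freed: one block leaks per call */
}

/* Fixed variant: identical work, but the scratch array is released. */
static double step_fixed(int nlambda)
{
    double *weights = scratch_alloc((size_t)nlambda);
    double  sum     = 0.0;
    int     i;

    if (weights == NULL)
    {
        return 0.0;
    }
    for (i = 0; i < nlambda; i++)
    {
        weights[i] = 1.0 / nlambda;
        sum       += weights[i];
    }
    scratch_free(weights);
    return sum;
}

/* Run nsteps steps and return the number of blocks still allocated. */
long run_steps(int nsteps, int nlambda, int fixed)
{
    int step;

    live_blocks = 0;
    for (step = 0; step < nsteps; step++)
    {
        (void)(fixed ? step_fixed(nlambda) : step_leaky(nlambda));
    }
    return live_blocks;
}
```

Calling run_steps(400, 20, 0) leaves one outstanding block per step, while run_steps(400, 20, 1) leaves none. A leak of this shape is easy to miss in short runs because each step adds very little.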

History

#1 Updated by Michael Shirts almost 4 years ago

I've been able to get it to leak when running on my laptop with just 4 cores (it was leaking before, I just couldn't find it), so it's not quite as exotic as it first seemed. This will allow me to make a bit more progress, though if people have tips for finding memory leaks in GROMACS, let me know.

#2 Updated by Mark Abraham almost 4 years ago

I don't have that hash at all...

Yes, I do.

#3 Updated by Mark Abraham almost 4 years ago

Yeah, I have it leaking on two different machines. I'll poke it with valgrind after dinner.

#4 Updated by Mark Abraham almost 4 years ago

  • Status changed from New to Accepted

#5 Updated by Michael Shirts almost 4 years ago

Thanks! I spent a while on it, but couldn't get valgrind working on my machine, and couldn't find the leak by stepping through and trying to watch the memory footprint manually.

#6 Updated by Mark Abraham almost 4 years ago

Hmmm. valgrind hasn't been very clear. It mostly crashes, after complaining about an uninitialized value of nr in tMPI_Thread_setaffinity_single, but 4.6 also complains about that, so I don't think this is the problem.

#7 Updated by Mark Abraham almost 4 years ago

Found it with DDT, though I had to resort to a diff of memory snapshots at 200 and 400 steps :( Only affects expanded ensemble, it seems. I'll leave it to Michael to fix.

http://redmine.gromacs.org/projects/gromacs/repository/revisions/release-4-6/entry/src/mdlib/expanded.c#L1222
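
The snapshot-diff idea can be approximated without a commercial debugger, though much more coarsely. The sketch below is not what DDT does (DDT attributes allocations to call sites); it only compares the process's resident memory at two step counts, which is enough to confirm steady growth but not to locate it. It assumes Linux, where the second field of /proc/self/statm is the resident page count. The function names are hypothetical.

```c
#include <stdio.h>

/* Parse the resident-set size (in pages) from one line of /proc/<pid>/statm.
 * The second whitespace-separated field is the resident page count.
 * Returns -1 if the line cannot be parsed. */
long parse_statm_resident(const char *line)
{
    long size_pages;
    long resident_pages;

    if (sscanf(line, "%ld %ld", &size_pages, &resident_pages) != 2)
    {
        return -1;
    }
    return resident_pages;
}

/* Snapshot the current process's resident pages.
 * Linux-only; returns -1 if /proc/self/statm is unavailable. */
long current_resident_pages(void)
{
    char  buf[256];
    long  pages = -1;
    FILE *fp    = fopen("/proc/self/statm", "r");

    if (fp == NULL)
    {
        return -1;
    }
    if (fgets(buf, sizeof buf, fp) != NULL)
    {
        pages = parse_statm_resident(buf);
    }
    fclose(fp);
    return pages;
}

/* Average growth in pages per step between two snapshots. */
double growth_per_step(long pages_a, int step_a, long pages_b, int step_b)
{
    if (step_b == step_a)
    {
        return 0.0;
    }
    return (double)(pages_b - pages_a) / (double)(step_b - step_a);
}
```

Taking one snapshot with current_resident_pages() at step 200 and another at step 400, then passing both to growth_per_step(), gives a per-step growth figure. A value that stays positive across several such intervals suggests a per-step leak rather than start-up allocation.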

#8 Updated by Michael Shirts almost 4 years ago

Ugh, I should have been able to catch that. Sorry to have you go through the hassle! The fix is obvious, but I'm checking it a few ways first just to be sure; I will upload it soon.

#9 Updated by Michael Shirts almost 4 years ago

  • Status changed from Accepted to Fix uploaded

#10 Updated by Michael Shirts almost 4 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#11 Updated by Mark Abraham almost 4 years ago

  • Status changed from Resolved to Closed
