Bug #2470

memory leak in OpenCL runs with

Added by Szilárd Páll over 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

OpenCL runs suffer from a memory leak that, after a few hours of runtime, consumes all host memory. It seems to be related to GPU timing, as the leak is not present when timing is turned off. It likely surfaced as a result of #2468.

cl_memory_logs.tgz (2.96 MB) Aleksei Iupinov, 04/20/2018 01:47 PM

Associated revisions

Revision f2c9e778 (diff)
Added by Aleksei Iupinov over 1 year ago

Prevent OpenCL timing memory leak

Fixes #2470

Change-Id: I4917de697bee7df98da0037e9165e52e660f83a0

History

#1 Updated by Szilárd Páll over 1 year ago

Actually, even non-DD runs do leak memory, so there is something strange going on here. Disabling timing seems to strongly reduce or eliminate the effect; there is still a strange drift in memory usage, but it seems to plateau after a while. Need longer runs to assess it.

#2 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related DRAFT patchset '2' for Issue #2470.
Uploader: Aleksei Iupinov
Change-Id: gromacs~release-2018~I4917de697bee7df98da0037e9165e52e660f83a0
Gerrit URL: https://gerrit.gromacs.org/7730

#3 Updated by Szilárd Páll over 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Aleksei Iupinov

#4 Updated by Aleksei Iupinov over 1 year ago

  • Status changed from In Progress to Resolved

#5 Updated by Szilárd Páll over 1 year ago

  • Status changed from Resolved to Feedback wanted

Let me repeat my question from gerrit: have you verified that long runs do not show any indication of leaks?

#6 Updated by Aleksei Iupinov over 1 year ago

So I did a 1 rank/16 threads run of NB OpenCL, 96k water box on dev-haswell-gpu01 (with AMD GPU PRO drivers) for 4 hours - with versions before introduction of GpuRegionTimer, and after the fix for the release event (above). I looked at the system free memory, using the corresponding 4th column of the "free" output every second. Curves look the same, their difference is almost flat. By their incline, the system free memory decreases by about 26 MB/hour. Overall, I assume my fix restored the status quo, at least :-)
Log files with output of free kbytes are attached.
The version without GpuRegionTimer I actually ran much longer, so it has a flat part in the end (after gmx stopped).
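
The drift estimate described above (reading the incline of a free-memory curve) can be sketched as a least-squares fit. This is only an illustration: the function name `drift_mb_per_hour`, the assumption that samples are kB values taken at a fixed interval, and the 1 MB = 1024 kB conversion are all hypothetical and not taken from the attached logs.

```python
def drift_mb_per_hour(samples_kb, interval_s=1.0):
    """Least-squares slope of memory samples, converted to MB/hour.

    samples_kb : sequence of memory readings in kB (assumed evenly spaced)
    interval_s : seconds between consecutive readings (assumed constant)
    A negative result means the sampled quantity is shrinking over time.
    """
    n = len(samples_kb)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")

    # Sample times are 0, interval_s, 2*interval_s, ...
    times = [i * interval_s for i in range(n)]
    t_mean = sum(times) / n
    y_mean = sum(samples_kb) / n

    # Ordinary least squares: slope = cov(t, y) / var(t)
    cov = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, samples_kb))
    var = sum((t - t_mean) ** 2 for t in times)
    slope_kb_per_s = cov / var

    # Convert kB/s to MB/hour (1 MB taken as 1024 kB here).
    return slope_kb_per_s * 3600.0 / 1024.0
```

Fitting both runs separately and comparing the two slopes gives a number to go with the "curves look the same" observation, but it says nothing about which process is responsible for the drift.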

#7 Updated by Szilárd Páll over 1 year ago

Aleksei Iupinov wrote:

So I did a 1 rank/16 threads run of NB OpenCL, 96k water box on dev-haswell-gpu01 (with AMD GPU PRO drivers) for 4 hours - with versions before introduction of GpuRegionTimer, and after the fix for the release event (above). I looked at the system free memory, using the corresponding 4th column of the "free" output every second. Curves look the same, their difference is almost flat. By their incline, the system free memory decreases by about 26 MB/hour. Overall, I assume my fix restored the status quo, at least :-)

Fair enough; my question, however, is whether there is still a memory leak. :)

The trouble with the 4th column of free is that it is the true "free" memory, which does not include the disk cache, etc. Since mdrun writes at least to the log and checkpoint files, and possibly to the energy output, the cache grows during the run and "free" memory shrinks whether or not anything leaks. For that reason, it is not a good metric.

I suggest looking at the "used" column of free and/or the resident memory usage of the gmx process.
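
One way to follow the resident memory of the gmx process on Linux is to read the `VmRSS` field from `/proc/<pid>/status`. The snippet below is a minimal sketch under that assumption; the helper names `parse_vmrss_kb` and `read_rss_kb` are hypothetical and are not part of GROMACS.

```python
import re
from pathlib import Path


def parse_vmrss_kb(status_text: str) -> int:
    """Return the VmRSS value (in kB) found in the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found in status text")
    return int(match.group(1))


def read_rss_kb(pid: int) -> int:
    """Read the current resident set size of a process (Linux only)."""
    return parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())
```

Calling `read_rss_kb(pid)` at a fixed interval while mdrun is running gives a per-process curve that is not affected by the disk cache, unlike the system-wide "free" column.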

#8 Updated by Szilárd Páll about 1 year ago

  • Status changed from Feedback wanted to Resolved

#9 Updated by Paul Bauer about 1 year ago

  • Status changed from Resolved to Closed
