Bug #2470
memory leak in OpenCL runs with
Description
OpenCL runs suffer from a memory leak that, after a few hours of runtime, consumes all host memory. It seems to be related to GPU timing, as the leak is not present when timing is turned off. It likely surfaced as a result of #2468.
Associated revisions
History
#1 Updated by Szilárd Páll 11 months ago
Actually, even non-DD runs do leak memory, so there is something strange going on here. Disabling timing seems to strongly reduce or eliminate the effect; there is still a strange drift in memory usage, but it seems to plateau after a while. Need longer runs to assess it.
#2 Updated by Gerrit Code Review Bot 11 months ago
Gerrit received a related DRAFT patchset '2' for Issue #2470.
Uploader: Aleksei Iupinov (a.yupinov@gmail.com)
Change-Id: gromacs~release-2018~I4917de697bee7df98da0037e9165e52e660f83a0
Gerrit URL: https://gerrit.gromacs.org/7730
#3 Updated by Szilárd Páll 11 months ago
- Status changed from New to In Progress
- Assignee set to Aleksei Iupinov
#4 Updated by Aleksei Iupinov 11 months ago
- Status changed from In Progress to Resolved
Applied in changeset f2c9e7785cf77a6b6ede7a39654578185f4b285b.
#5 Updated by Szilárd Páll 10 months ago
- Status changed from Resolved to Feedback wanted
Let me repeat my question from gerrit: have you verified that long runs do not show any indication of leaks?
#6 Updated by Aleksei Iupinov 10 months ago
- File cl_memory_logs.tgz added
So I did a 1 rank/16 threads run of NB OpenCL, 96k water box on dev-haswell-gpu01 (with AMD GPU PRO drivers) for 4 hours - with versions before introduction of GpuRegionTimer, and after the fix for the release event (above). I looked at the system free memory, using the corresponding 4th column of the "free" output every second. Curves look the same, their difference is almost flat. By their incline, the system free memory decreases by about 26 MB/hour. Overall, I assume my fix restored the status quo, at least :-)
Log files with output of free kbytes are attached.
The version without GpuRegionTimer I actually ran much longer, so it has a flat part in the end (after gmx stopped).
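For reference, a drift figure such as the 26 MB/hour above can be estimated from a sampled series with a least-squares slope. The following is only a minimal sketch: it assumes the samples are plain kB values taken at a fixed interval (the actual layout of the attached logs is not described here), and the function name `drift_mb_per_hour` is hypothetical.

```python
def drift_mb_per_hour(samples_kb, interval_s=1.0):
    """Least-squares slope of a memory series, converted to MB/hour.

    samples_kb: memory readings in kB, one per sampling interval (assumed layout).
    interval_s: seconds between consecutive readings (1 s in the run above).
    A negative result means the sampled quantity is shrinking over time.
    """
    n = len(samples_kb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2.0           # mean of the sample indices 0..n-1
    mean_y = sum(samples_kb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_kb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope_kb_per_sample = cov / var
    # kB/sample -> kB/s -> kB/hour -> MB/hour
    return slope_kb_per_sample / interval_s * 3600.0 / 1024.0
```

Fitting the whole series rather than comparing the first and last readings makes the estimate less sensitive to short-lived spikes, though the flat tail recorded after gmx stopped would need to be cut off first.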
#7 Updated by Szilárd Páll 10 months ago
Aleksei Iupinov wrote:
So I did a 1 rank/16 threads run of NB OpenCL, 96k water box on dev-haswell-gpu01 (with AMD GPU PRO drivers) for 4 hours - with versions before introduction of GpuRegionTimer, and after the fix for the release event (above). I looked at the system free memory, using the corresponding 4th column of the "free" output every second. Curves look the same, their difference is almost flat. By their incline, the system free memory decreases by about 26 MB/hour. Overall, I assume my fix restored the status quo, at least :-)
Fair enough, my question is however, whether there is still a memory leak? :)
The trouble with the 4th column of free is that it is the true "free" memory, not including disk cache, etc. Since mdrun writes at least the log, checkpoint, and possibly energy output, the cache grows during the run and that number shrinks even without a leak, so it is not a good metric.
I suggest looking at the "used" column of free and/or the resident memory usage of the gmx process.
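A minimal sketch of how those two numbers could be extracted, assuming a Linux host where the "Mem:" line of free lists total, used, free in that order and where /proc/<pid>/status contains a "VmRSS:" line in kB. The helper names `used_kb_from_free` and `rss_kb_from_status` are hypothetical and not part of GROMACS.

```python
def used_kb_from_free(free_output):
    """Return the 'used' value (third column) from the 'Mem:' line of free output."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            # fields: ["Mem:", total, used, free, ...]
            return int(fields[2])
    raise ValueError("no 'Mem:' line found")


def rss_kb_from_status(status_text):
    """Return resident set size in kB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    raise ValueError("no 'VmRSS:' line found")
```

For a running gmx process the status text can be read with `open(f"/proc/{pid}/status").read()`, with `pid` being the process ID. Sampling the resident size of the process itself avoids counting cache growth caused by file output.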
#8 Updated by Szilárd Páll 6 months ago
- Status changed from Feedback wanted to Resolved
#9 Updated by Paul Bauer 6 months ago
- Status changed from Resolved to Closed
Prevent OpenCL timing memory leak
Fixes #2470
Change-Id: I4917de697bee7df98da0037e9165e52e660f83a0