Bug #3226

Regression tests segfault with GPU update and DD

Added by Paul Bauer about 2 months ago. Updated about 1 month ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

When offloading the update to the GPU and activating the development feature flags for GPU halo exchange and direct communication, some of the complex regression tests can segfault under the following conditions:
  • running as
    gmx mdrun -ntmpi 3 -npme 1   -pme gpu -notunepme
    
  • enabling all of the following developer flags:
    GMX_USE_GPU_BUFFER_OPS
    GMX_GPU_DD_COMMS
    GMX_GPU_PME_PP_COMMS
    

The failing tests are (at least) nbnxn_pme and nbnxn_rf.

Associated revisions

Revision 7289ee3f (diff)
Added by Paul Bauer about 2 months ago

Revert "Enable GPU update with DD when GPU comm features are enabled"

This reverts commit b88f3eadf90261555508f6a4df64ce184141a66a.

Reason for revert: Mark found the code segfaulting under some conditions with domain decomposition and update enabled, so this is not ready to be used yet.

Refs #3226

Change-Id: I45cf9564a87595d445057da071a8fca5c60f9d9a

Revision f8a8252c (diff)
Added by Artem Zhmurov about 2 months ago

Allow using GPU update with DD and update groups

The GPU update can now be enabled for the supported DD cases
with the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable. Also
added checks on whether the SHAKE algorithm was requested,
since SHAKE is not supported by the GPU update.

Refs. #3226, #3163.

Change-Id: I57e3ad3b8a571ec244989e888afd5cfcbaf9b75e

History

#1 Updated by Alan Gray about 2 months ago

  • Assignee changed from Alan Gray to Artem Zhmurov

To correct the description, this issue has nothing to do with the development feature flags, but is a bug in the GPU update code. The same failure occurs without the flags activated (where, to test, the assertion disabling use of GPU update with DD must be removed in the code). Reassigning to Artem to fix.

#2 Updated by Alan Gray about 2 months ago

I’ve traced the failure back to the lincs_cuda kernel: there is a line near the start of the kernel which calculates a reciprocal square root,

// Coordinates of the two constrained atoms, loaded from global memory
float3 xi = gm_x[i];
float3 xj = gm_x[j];
// PBC-corrected displacement and reciprocal constraint length;
// rlen diverges to +inf when dx is the zero vector
float3 dx = pbcDxAiuc(pbcAiuc, xi, xj);
float rlen = rsqrtf(dx.x * dx.x + dx.y * dx.y + dx.z * dx.z);

but for some threads gm_x[i] and gm_x[j] are both zero, so the distance between them is also zero and the reciprocal goes to infinity.

These threads all seem to correspond to high i,j (e.g. i=1040, j=1039, which is blockIdx.x=0, threadIdx.x=3 on rank 1). Is it possible that the LINCS kernel is simply operating on the halo region where it should be restricted to the local region, or something like that?
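
For illustration, here is a minimal standalone sketch of that failure mode (hypothetical demo code, not the GROMACS kernel): with both inputs at the origin the squared distance is zero, rsqrtf(0) returns +inf, and any subsequent multiplication of that result by the zero-length displacement yields NaN.

#include <cstdio>
#include <cuda_runtime.h>

// Demo only: reproduce the arithmetic described above in isolation.
__global__ void zeroDistanceDemo(float3 xi, float3 xj, float3* result)
{
    // No PBC handling here; the GROMACS kernel uses pbcDxAiuc() instead.
    float3 dx   = make_float3(xi.x - xj.x, xi.y - xj.y, xi.z - xj.z);
    float  rlen = rsqrtf(dx.x * dx.x + dx.y * dx.y + dx.z * dx.z);    // +inf when dx == (0,0,0)
    *result     = make_float3(rlen * dx.x, rlen * dx.y, rlen * dx.z); // inf * 0 -> NaN
}

int main()
{
    float3* d_result;
    cudaMalloc(&d_result, sizeof(float3));
    zeroDistanceDemo<<<1, 1>>>(make_float3(0, 0, 0), make_float3(0, 0, 0), d_result);
    float3 h_result;
    cudaMemcpy(&h_result, d_result, sizeof(float3), cudaMemcpyDeviceToHost);
    printf("%f %f %f\n", h_result.x, h_result.y, h_result.z); // prints "nan nan nan"
    cudaFree(d_result);
    return 0;
}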

#3 Updated by Artem Zhmurov about 2 months ago

I think the CUDA LINCS in this case just exposes the bug. If there is more than one domain, some of the coordinates of the constrained atoms are zero at step 0, before the update, even if everything runs on the CPU. These are atoms that are most likely at the border between domains (see printout below). With CPU buffer ops and update, the issue seems to disappear after step 0. This bug might have been introduced recently, and may not have any significant effect if these atoms are just ignored for one step. With the GPU update, as noted by @Alan, this leads to NaNs in the coordinates after LINCS is applied. I will investigate further tomorrow. However, my limited knowledge of how DD works is a significant factor, and I might need some help.

Step 0:
...
32-31-21 (3.083085, 1.387040, 0.557815) - (3.072986, 1.488941, 0.444135)
32-1032-21 (3.083085, 1.387040, 0.557815) - (0.000000, 0.000000, 0.000000)
1032-1033-21 (0.000000, 0.000000, 0.000000) - (0.000000, 0.000000, 0.000000)
1033-1034-21 (0.000000, 0.000000, 0.000000) - (0.000000, 0.000000, 0.000000)
...

Step 1:
...
32-31-21 (3.082746, 1.386903, 0.558610) - (3.073303, 1.488556, 0.444652)
32-1032-21 (3.082746, 1.386903, 0.558610) - (2.981956, 1.428162, 0.665010)
1032-1033-21 (2.981956, 1.428162, 0.665010) - (2.935971, 1.308863, 0.749046)
1033-1034-21 (2.935971, 1.308863, 0.749046) - (2.873982, 1.349012, 0.883040)
...

Legend: atom1-atom2-type (x1, y1, z1) - (x2, y2, z2)

#4 Updated by Paul Bauer about 2 months ago

Didn't we have the original assert on DD because of this issue?
I'll see if I can help with this in any way, but if there is no easy fix in sight we should continue to restrict update to single rank and have the option of either direct GPU communication or GPU update.

#5 Updated by Artem Zhmurov about 2 months ago

Paul Bauer wrote:

Didn't we have the original assert on DD because of this issue?
I'll see if I can help with this in any way, but if there is no easy fix in sight we should continue to restrict update to single rank and have the option of either direct GPU communication or GPU update.

It is not related to the GPU code path at all, as far as I can tell. For some reason the CPU update works fine with those zeros in the coordinates at step 0, and then everything normalizes at step 1. So the error is marginal - only some atoms at one step are not constrained. Or at least that seems to be the case.

#6 Updated by Artem Zhmurov about 2 months ago

Artem Zhmurov wrote:

Paul Bauer wrote:

Didn't we have the original assert on DD because of this issue?
I'll see if I can help with this in any way, but if there is no easy fix in sight we should continue to restrict update to single rank and have the option of either direct GPU communication or GPU update.

It is not related to the GPU code path at all, as far as I can tell. For some reason the CPU update works fine with those zeros in the coordinates at step 0, and then everything normalizes at step 1. So the error is marginal - only some atoms at one step are not constrained. Or at least that seems to be the case.

Never mind. I am getting zero coordinates even at non-zero steps. But, rather alarmingly, ignoring atoms with zero coordinates makes the test pass.

#7 Updated by Mark Abraham about 2 months ago

Paul Bauer wrote:

Didn't we have the original assert on DD because of this issue?

The behavior I reported was on a version that had enabled DD paths, so of course there was no assertion blocking DD behavior.

I'll see if I can help with this in any way, but if there is no easy fix in sight we should continue to restrict update to single rank and have the option of either direct GPU communication or GPU update.

We already did that with the reversion.

#8 Updated by Paul Bauer about 2 months ago

Mark Abraham wrote:

Paul Bauer wrote:

Didn't we have the original assert on DD because of this issue?

The behavior I reported was on a version that had enabled DD paths, so of course there was no assertion blocking DD behavior.

I'll see if I can help with this in any way, but if there is no easy fix in sight we should continue to restrict update to single rank and have the option of either direct GPU communication or GPU update.

We already did that with the reversion.

I was replying to Alan's comment that the error is also triggered in the update path alone if the assertion is removed.

#9 Updated by Alan Gray about 2 months ago

I have never been made aware of any reason why the GPU update should not work with DD - I don't know why there is still an assertion blocking the combination (beyond the fact that we are now aware of this bug, which we are working to fix).

#10 Updated by Artem Zhmurov about 2 months ago

  • Status changed from New to Closed

#11 Updated by Artem Zhmurov about 2 months ago

  • Status changed from Closed to In Progress

#12 Updated by Artem Zhmurov about 2 months ago

I think the problem is that when all bonds are constrained, two coupled constraints can end up being split between two domains. Hence, we cannot get away without communicating the atom coordinates between domains. So my idea of fitting all bond constraints into a single block, if they fit, will only work in the non-DD case. I think we should go with the easy solution and fall back to the CPU update when (1) all bonds are constrained and (2) there are multiple domains. Or disable the all-bonds case for the GPU update altogether. Does that sound reasonable?
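
As a rough sketch of that proposed fallback (the enum and function names below are illustrative assumptions, not the actual GROMACS API):

// Illustrative only: fall back to the CPU update when all bonds are constrained
// and there is more than one domain, since coupled constraints may then be
// split across a domain boundary.
enum class ConstraintMode { None, HBonds, AllBonds };

static bool canUseGpuUpdate(ConstraintMode constraints, int numDomains)
{
    if (constraints == ConstraintMode::AllBonds && numDomains > 1)
    {
        return false; // coupled constraints can span domains; use the CPU update path
    }
    return true;
}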

#13 Updated by Alan Gray about 2 months ago

I thought that the GPU update was already disabled for all-bonds cases and only enabled for h-bonds - I didn't realise the case failing here was all-bonds. Yes, I think it is sensible to fall back to CPU when we have all-bonds and DD. I don't know enough to comment on whether we should fall back on single-GPU - maybe @Berk can comment?

#14 Updated by Berk Hess about 2 months ago

The GPU DD does not implement constraint communication, so it should only be enabled when we do not have constraints or with update groups when constraints are present.
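
Restating that rule as a hedged sketch (the function and parameter names are illustrative, not GROMACS symbols):

static bool canEnableGpuHaloExchange(bool haveConstraints, bool useUpdateGroups)
{
    // Direct GPU communication has no constraint communication, so it is only
    // safe with no constraints at all, or when constraints are fully contained
    // in update groups.
    return !haveConstraints || useUpdateGroups;
}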

#15 Updated by Artem Zhmurov about 2 months ago

Berk Hess wrote:

The GPU DD does not implement constraint communication, so it should only be enabled when we do not have constraints or with update groups when constraints are present.

Is there a way to check if we are running with update groups in runner.cpp?

#16 Updated by Artem Zhmurov about 2 months ago

Artem Zhmurov wrote:

Berk Hess wrote:

The GPU DD does not implement constraint communication, so it should only be enabled when we do not have constraints or with update groups when constraints are present.

Is there a way to check if we are running with update groups in runner.cpp?

And also: will this logic still work if we don't have constraints?

#17 Updated by Mark Abraham about 2 months ago

Artem Zhmurov wrote:

Berk Hess wrote:

The GPU DD does not implement constraint communication, so it should only be enabled when we do not have constraints or with update groups when constraints are present.

Is there a way to check if we are running with update groups in runner.cpp?

Currently we run makeUpdateGroups() during the construction of DomainDecompositionBuilder around line 1123 of runner.cpp, but we need to know whether update-group support will work before we make decisions about useGpuForUpdate on line 989 and useDomainDecomposition on line 839 (and perhaps at some future time, even the number of thread-MPI ranks to spawn before that).

I suggest someone pulls makeUpdateGroups() out to before line 839, so that we can pass the resulting vector into the DD setup later on, and in the meantime can compute booleans from it that will allow us to make correct and consistent decisions in decideWhetherToUseGpuForUpdate().

That will automatically make the right decisions, e.g. in the absence of constraints and vsites.
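
A self-contained toy model of that ordering (everything below except the names makeUpdateGroups() and decideWhetherToUseGpuForUpdate() is a simplified placeholder; the real signatures and data structures in runner.cpp differ):

#include <vector>

// Toy model: build the update groups first, derive a boolean from them for the
// GPU-update decision, and hand the precomputed result to the DD setup later
// instead of recomputing it there.
struct UpdateGroups
{
    std::vector<int> groupSizes; // stand-in for the real per-molecule-type grouping
};

static UpdateGroups makeUpdateGroups()
{
    return {}; // placeholder; the real function analyses the topology
}

static bool decideWhetherToUseGpuForUpdate(bool haveUpdateGroups, bool haveDomainDecomposition)
{
    // Simplified rule for illustration: with DD, the GPU update needs usable update groups.
    return haveUpdateGroups || !haveDomainDecomposition;
}

int main()
{
    // Build the update groups before any GPU/DD decisions are taken...
    const UpdateGroups updateGroups     = makeUpdateGroups();
    const bool         haveUpdateGroups = !updateGroups.groupSizes.empty();

    // ...so that the decisions can consume booleans derived from them...
    const bool useDomainDecomposition = true; // decided from ranks/topology in reality
    const bool useGpuForUpdate =
            decideWhetherToUseGpuForUpdate(haveUpdateGroups, useDomainDecomposition);

    // ...and the same updateGroups object is later passed into the DD setup
    // instead of being recomputed during DomainDecompositionBuilder construction.
    (void)useGpuForUpdate;
    return 0;
}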

#18 Updated by Artem Zhmurov about 2 months ago

  • Status changed from In Progress to Resolved

https://gerrit.gromacs.org/#/c/gromacs/+/14640/ fixes the issue. The coordinates were not copied from the GPU for PP-PME communication.

#19 Updated by Paul Bauer about 1 month ago

  • Status changed from Resolved to Closed
