Bug #3100

crash with GPU comm DD

Added by Szilárd Páll 28 days ago. Updated 21 days ago.

Status: In Progress
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Fairly standard input: a 4-rank PP-only offload run crashes after thousands of steps, both with and without DLB and PME tuning. Input attached.

topol.tpr (7.94 MB) - Szilárd Páll, 09/20/2019 02:36 AM

Related issues

Related to GROMACS - Feature #2890: GPU Halo Exchange (New)

Associated revisions

Revision 2fb7901d (diff)
Added by Berk Hess 28 days ago

Fix DD 1 pulse checks without DLB

The check for requesting single pulse DD communication only
worked correctly with DLB.
Also added a check for 1 pulse with PME tuning.

Todo: PME tuned runs might still end up with multiple pulses
due to pressure scaling. This needs to be checked.

Fixes #3100

Change-Id: Icac5e37ef79385fd7c7cf5c19c7b40b2d685b95b

History

#1 Updated by Szilárd Páll 28 days ago

#2 Updated by Alan Gray 28 days ago

Trying to reproduce.

@Mark, change
6fb2a76 Use GPU halo exchange only when compatible DD is available
seems to have been over-zealous: this case (plus the cases I use for my development) is no longer actually triggering GPU halo exchange (even with the environment variable set).

Rolling back to before this change, I have tested with
gmx mdrun -s topol.tpr -ntomp 10 -pme cpu -nb gpu -ntmpi 4 -nsteps 100000 -v -notunepme -pin on -bonded gpu -dlb no -noconfout -gpu_id 0123
and it runs to completion.

@szilard, can you please let me know the mdrun command you used, which GPU you ran on, and after how many steps it crashed?

Thanks

#3 Updated by Alan Gray 28 days ago

@Mark in is1DAnd1PulseDD() you are only returning true if (dd.comm->maxpulse == 1), but in these cases dd.comm->maxpulse=0 so it is returning false. Should it instead be (dd.comm->maxpulse <= 1)?
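
For illustration, a minimal sketch of the proposed relaxation; the struct and function names below are stand-ins, not the actual GROMACS types, and the real is1DAnd1PulseDD() checks more than just the pulse count:

// Stand-in types; only the field named in the comment above is modelled.
struct DDCommSketch
{
    int maxpulse; // number of communication pulses; observed to be 0 here without DLB
};

struct DDSketch
{
    DDCommSketch* comm;
};

static bool hasAtMostOnePulse(const DDSketch& dd)
{
    // Was effectively: dd.comm->maxpulse == 1, which rejects the maxpulse == 0
    // case seen in these runs; the proposal is to accept it as well.
    return dd.comm->maxpulse <= 1;
}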

#4 Updated by Szilárd Páll 28 days ago

Alan Gray wrote:

@szilard, can you please let me know the mdrun command you used, which GPU you ran on, and after how many steps it crashed?

GMX_GPU_PME_PP_COMMS=1 GMX_USE_GPU_BUFFER_OPS=1 GMX_GPU_DD_COMMS=1\
gmx mdrun -nsteps 10000 -ntmpi 4

but it crashes with -dlb no as well as -notunepme. The run uses 4 ranks with 4x4 threads + 2 GPUs (so 2 ranks per GPU).

The number of steps it takes is in the thousands (I have seen 2000-6000). This is a machine with a P6000 and a GTX TITAN, but I doubt that matters.

#5 Updated by Alan Gray 28 days ago

Thanks - now reproduced. (Noting that PME-PP comms are not relevant here and the bug is still present without GMX_GPU_PME_PP_COMMS=1)

In reference to the previous, similar race condition we fixed in the GPU DD: we originally added a sync before both the coordinate and the force halo exchange, then removed it for the force. This case runs successfully if the force sync is reinstated. It needs more work to understand why that is the case.
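
For reference, the kind of force sync being discussed is a stream-ordering dependency along these lines; this is a hedged sketch with hypothetical stream and event names, not the actual GROMACS code:

#include <cuda_runtime.h>

// Make the stream that will pack and send the force halo wait (on the device, without
// blocking the CPU) until the force work already enqueued in the non-local stream has
// completed.
static void enqueueForceHaloDependency(cudaStream_t nonLocalStream, cudaStream_t haloStream)
{
    cudaEvent_t forcesReady;
    cudaEventCreateWithFlags(&forcesReady, cudaEventDisableTiming);
    cudaEventRecord(forcesReady, nonLocalStream);    // marks completion of prior work in nonLocalStream
    cudaStreamWaitEvent(haloStream, forcesReady, 0); // haloStream waits for that point
    cudaEventDestroy(forcesReady);                   // cleanup is deferred by the runtime until the event completes
}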

#6 Updated by Mark Abraham 28 days ago

Alan Gray wrote:

@Mark in is1DAnd1PulseDD() you are only returning true if (dd.comm->maxpulse == 1), but in these cases dd.comm->maxpulse=0 so it is returning false. Should it instead be (dd.comm->maxpulse <= 1)?

Berk's working on a fix for mdrun -dlb no at https://gerrit.gromacs.org/c/gromacs/+/13283

#7 Updated by Szilárd Páll 28 days ago

Alan Gray wrote:

(Noting that PME-PP comms are not relevant here and the bug is still present without GMX_GPU_PME_PP_COMMS=1)

Sure, I just triggered both in testing, but there was no PME offload so that didn't matter.

In reference to the previous, similar race condition we fixed in the GPU DD: we originally added a sync before both the coordinate and the force halo exchange, then removed it for the force.

Thanks for the update.

Are you referring to:
https://gerrit.gromacs.org/c/gromacs/+/12943/16/src/gromacs/domdec/gpuhaloexchange_impl.cu#318
?

I do not see why an MPI sync call would be necessary, given that there is an explicit sync with the local stream (https://gerrit.gromacs.org/c/gromacs/+/12943/16/src/gromacs/domdec/gpuhaloexchange_impl.cu#266) and implicit ordering on anything non-local in the non-local stream.

This case runs successfully if the force sync is reinstated. It needs more work to understand why that is the case.

I agree, we need a better understanding.

#8 Updated by Szilárd Páll 28 days ago

Szilárd Páll wrote:

I do not see why an MPI sync call would be necessary, given that there is an explicit sync with the local stream and implicit ordering on anything non-local in the non-local stream.

...and just as I typed that, it occurred to me: could it be a race with the bonded task, which does write non-local forces? But we put that task in the non-local stream with DD, so even that should not be an issue, I think (see forcerec.cpp:1988).

#9 Updated by Berk Hess 28 days ago

  • Status changed from New to Resolved

#10 Updated by Alan Gray 25 days ago

Are you referring to: https://gerrit.gromacs.org/c/gromacs/+/12943/16/src/gromacs/domdec/gpuhaloexchange_impl.cu#318?

Yes

I do not see why an MPI sync call would be necessary, given that there is an explicit sync with the local stream (https://gerrit.gromacs.org/c/gromacs/+/12943/16/src/gromacs/domdec/gpuhaloexchange_impl.cu#266) and implicit ordering on anything non-local in the non-local stream.

...and just as I typed that, it occurred to me: could it be a race with the bonded task, which does write non-local forces? But we put that task in the non-local stream with DD, so even that should not be an issue, I think (see forcerec.cpp:1988).

Agreed - still looking into it.

#11 Updated by Alan Gray 24 days ago

could it be a race with the bonded task, which does write non-local forces? But we put that task in the non-local stream with DD, so even that should not be an issue, I think

Bonded forces will depend on both the local and non-local X buffer ops kernels, right? If bonded is in the non-local stream, then there should be a sync to ensure the local X buffer ops has completed, which I think may be missing in the code (or do we have that somewhere?). I am not sure if that is the cause of this race; I guess it's possible that use of the GPU halo exchange makes it more likely that the ordering goes wrong.

#12 Updated by Alan Gray 24 days ago

  • Status changed from Resolved to In Progress

#13 Updated by Alan Gray 24 days ago

then there should be a sync to ensure the local X buffer ops has completed, which I think may be missing in the code (or do we have that somewhere?).

Actually, I think we already have this OK through the nbnxnInsertNonlocalGpuDependency() call at the end of the X buffer ops.

#14 Updated by Szilárd Páll 24 days ago

Alan Gray wrote:

then there should be a sync to ensure the local X buffer ops has completed, which I think may be missing in the code (or do we have that somewhere?).

Actually, I think we already have this OK through the nbnxnInsertNonlocalGpuDependency() call at the end of the X buffer ops.

Yes, that sync ensures all dependencies of the non-local interactions on local atoms are satisfied (the coordinate transfer, and the end-of-step tasks including force buffer clearing and rolling pruning).

#15 Updated by Szilárd Páll 24 days ago

#16 Updated by Mark Abraham 24 days ago

  • Target version changed from 2020-beta1 to 2020-beta2

#17 Updated by Alan Gray 23 days ago

It's looking like the issue is with the conditional D2H copy of nonlocal coordinate data after the X halo exchange:

if (domainWork.haveCpuBondedWork || domainWork.haveFreeEnergyWork)
{
  //non-local part of coordinate buffer must be copied back to host for CPU work
  nbv->launch_copy_x_from_gpu(as_rvec_array(x.unpaddedArrayRef().data()), Nbnxm::AtomLocality::NonLocal);
}

The condition is false for this case, but from debugging the issue it looks like it should be true. Can you think of any extra condition we may need in the if statement?

#18 Updated by Mark Abraham 23 days ago

@Alan on what step? With nstlist of what?

#19 Updated by Alan Gray 23 days ago

I’ve looked at it some more and there is a possibility that the above extra copy caused the run to succeed not because of required data access on the host but instead through the side effect of delaying the nonlocal stream. Let me narrow it down some more and I’ll report back.

#20 Updated by Alan Gray 22 days ago

I've found the bug. It's not in the GPU halo exchange, but was introduced in
c69e061 Decouple coordinates buffer management from buffer ops in NBNXM
(@artem)

The problem is that the new function to copy X from host to device for a given locality, when called in do_force to copy the local part, is actually copying extra atoms that encroach on the non-local part.
Originally, the number of local atoms was determined through gridSet.numRealAtomsLocal(). After the above change, the number of local atoms is instead determined through gridSet.grids()[0].atomIndexEnd(). These numbers don't match (at least for this test case), e.g. in one instance

gridSet.numRealAtomsLocal()=34085, gridSet.grids()[0].atomIndexEnd()=35328 

Using the nbnxn_get_atom_range() function to get the atom range in the same way as the original code fixes the problem, as far as I can tell (I've run with and without the fix several times and get consistent results).
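
To make the overlap concrete, here is a small self-contained illustration using the numbers reported above (this is not GROMACS code):

#include <cstdio>

// The H2D copy of the "local" coordinates covered [0, atomIndexEnd), while the GPU
// halo exchange writes the received non-local coordinates starting at
// numRealAtomsLocal, so the two operations touch the same atoms and the outcome
// depends on their ordering.
int main()
{
    const int numRealAtomsLocal = 34085; // reported gridSet.numRealAtomsLocal()
    const int gridAtomIndexEnd  = 35328; // reported gridSet.grids()[0].atomIndexEnd()

    std::printf("local copy and halo exchange overlap on %d atoms\n",
                gridAtomIndexEnd - numRealAtomsLocal);
    return 0;
}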

#21 Updated by Artem Zhmurov 22 days ago

Alan Gray wrote:

I've found the bug. It's not in the GPU halo exchange, but was introduced in
c69e061 Decouple coordinates buffer management from buffer ops in NBNXM
(@artem)

The problem is that the new function to copy X from host to device for a given locality, when called in do_force to copy the local part, is actually copying extra atoms that encroach on the non-local part.
Originally, the number of local atoms was determined through gridSet.numRealAtomsLocal(). After the above change, the number of local atoms is instead determined through gridSet.grids()[0].atomIndexEnd(). These numbers don't match (at least for this test case), e.g. in one instance
[...]

Using the nbnxn_get_atom_range() function to get the atom range in the same way as the original code fixes the problem, as far as I can tell (I've run with and without the fix several times and get consistent results).

Are you sure that nbnxn_get_atom_range(...) was used to get coordinate ranges? As far as I can tell, it was only used for forces. The getAtomRanges(..) I've introduced is just an extraction of the code that was there into a separate function. I wanted to combine the two functions, but they had slightly different logic, and my commit was intended to be just a refactoring. Anyhow, we need to handle these ranges properly. I think the class with the AtomLocality enum and the getAtomRanges(..) function should be the right solution. Quoting @Mark on this function:
"
Suggest also a todo that stuff like this should return a tuple (or a struct of first and size). For example

int first, size;
std::tie (first, size) = getAtomRanges&lt;AtomLocality::All&gt;();

with

template <> std::tuple&lt;int, int&gt;
StatePropagatorDataGpu::Impl::getAtomRanges&lt;AtomLocality::All&gt;() {
return assertIfElementsAreNegative(std::make_tuple&lt;int, int&gt;(0, numAtomsAll_));
}

is easy to use, code, and document (only a template parameter to document, no pointers to assert are not null, no default switch case, no wondering if the compiler inlines away the switch branch). In turn, that motivates templating the copyTo/FromDevice functions on AtomLocality (suggest another todo).
"

As a quick fix, though, reverting to just one correct getAtomRanges(...) function in NBNXM should do, I think.
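
For completeness, a self-contained, compilable version of the pattern suggested in the quote could look like the sketch below; the class and member names are hypothetical, not the GROMACS implementation:

#include <tuple>

enum class AtomLocality { Local, NonLocal, All };

// Sketch of the suggested interface: template the range query on the locality and
// return a (first, size) tuple, so there is no switch statement, no default case,
// and only the template parameter to document.
class AtomRangesSketch
{
public:
    AtomRangesSketch(int numAtomsLocal, int numAtomsAll) :
        numAtomsLocal_(numAtomsLocal), numAtomsAll_(numAtomsAll)
    {
    }

    template<AtomLocality locality>
    std::tuple<int, int> getAtomRanges() const;

private:
    int numAtomsLocal_;
    int numAtomsAll_;
};

template<>
std::tuple<int, int> AtomRangesSketch::getAtomRanges<AtomLocality::Local>() const
{
    return std::make_tuple(0, numAtomsLocal_);
}

template<>
std::tuple<int, int> AtomRangesSketch::getAtomRanges<AtomLocality::NonLocal>() const
{
    return std::make_tuple(numAtomsLocal_, numAtomsAll_ - numAtomsLocal_);
}

template<>
std::tuple<int, int> AtomRangesSketch::getAtomRanges<AtomLocality::All>() const
{
    return std::make_tuple(0, numAtomsAll_);
}

A caller would then write, e.g., auto [first, size] = ranges.getAtomRanges<AtomLocality::NonLocal>(); with the locality made explicit at the call site.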

#22 Updated by Alan Gray 22 days ago

OK, yes, the previous code was also using the same range for the coordinates. I don't understand why these are different; can someone explain? cr->dd->comm->atomRanges.numHomeAtoms(), which is used in the halo exchange, agrees with gridSet.numRealAtomsLocal().

#23 Updated by Artem Zhmurov 22 days ago

Alan Gray wrote:

OK, yes, the previous code was also using the same range for the coordinates. I don't understand why these are different; can someone explain? cr->dd->comm->atomRanges.numHomeAtoms(), which is used in the halo exchange, agrees with gridSet.numRealAtomsLocal().

As far as I can tell, it should be cr->dd->comm->atomRanges.numHomeAtoms() if there are no filler particles (which is the case right now). But @Berk is a much better source of knowledge here.

For now, I would use the right function for both coordinates and forces and remove the other function. It is not ideal, but it is a step forward.

#24 Updated by Alan Gray 22 days ago

Fix at https://gerrit.gromacs.org/c/gromacs/+/13401. @Szilard, can you just check that this also fixes your run?

#25 Updated by Szilárd Páll 21 days ago

Alan Gray wrote:

Fix at https://gerrit.gromacs.org/c/gromacs/+/13401. @Szilard, can you just check that this also fixes your run?

Yes, it looks like it does. However, as noted on Gerrit, so does https://gerrit.gromacs.org/c/gromacs/+/12919, which is preferred as it eliminates a lot of the technical debt introduced earlier.

#26 Updated by Szilárd Páll 21 days ago

Szilárd Páll wrote:

Yes, it looks like it does. However, as noted on Gerrit, so does https://gerrit.gromacs.org/c/gromacs/+/12919, which is preferred as it eliminates a lot of the technical debt introduced earlier.

Side note: 12919 does eliminate this issue as well in tests, but as it requires excessive syncing, the lack of crashes is a less certain indicator.
