Bug #1756

incorrect CUDA inter-stream synchronization on CC >=3.5 devices

Added by Szilárd Páll over 4 years ago. Updated about 4 years ago.

Status: Closed
Priority: High
Category: mdrun
Target version:
Affected version - extra info: 5.x
Affected version:
Difficulty: uncategorized

Description

The synchronization between the non-local and local GPU streams was set up with the assumption that, as GPUs used to have a single work queue, the order of issuing tasks into the device queue ensured an implicit dependency between tasks issued into different streams (in a deterministic order). Therefore, as the local H2D transfer is always issued first, the local coordinates were assumed to always arrive on the GPU before the non-local kernel could start.

However, with multiple hardware queues (Hyper-Q) the implicit dependency between tasks issued in different streams is broken, and the GPU scheduler is free to pick tasks from either the local or the non-local stream. As the non-local stream only synchronizes on an event that ensures that the preceding force buffer clearing is complete, the non-local (coordinate H2D and) kernel can start before the local coordinate H2D is complete.
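
For illustration only - a minimal CUDA sketch (not the GROMACS source; the stream, event, buffer, and kernel names are made up) of the kind of explicit cross-stream dependency that is needed once issue order alone no longer implies ordering:

    #include <cuda_runtime.h>

    // Hypothetical non-local kernel that consumes the coordinate/charge buffer.
    __global__ void nonlocalKernel(const float4 *xq) { (void)xq; }

    void issueStep(cudaStream_t localStream, cudaStream_t nonlocalStream,
                   float4 *xq_d, const float4 *xqLocal_h, size_t localBytes,
                   cudaEvent_t xReadyEvent)
    {
        // Local coordinate H2D copy, issued first into the local stream.
        cudaMemcpyAsync(xq_d, xqLocal_h, localBytes,
                        cudaMemcpyHostToDevice, localStream);

        // Record completion of the local H2D in the local stream ...
        cudaEventRecord(xReadyEvent, localStream);
        // ... and make the non-local stream wait on it explicitly; with
        // multiple hardware queues, issue order alone no longer provides
        // this dependency.
        cudaStreamWaitEvent(nonlocalStream, xReadyEvent, 0);

        nonlocalKernel<<<128, 128, 0, nonlocalStream>>>(xq_d);
    }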

The probability of this condition being triggered is likely increased by stream priorities, as well as by multiple tasks running on the GPU - which could be another application or multiple MPI ranks sharing the device.

The bug affects all GROMACS versions with GPU support running on any CC >=3.5 device (GK210 and later). The direct result of the bug is that, when triggered, non-local forces will be computed with wrong or outdated coordinates.

How likely this bug is to be triggered is still being investigated, and this will determine its severity.


Related issues

Related to GROMACS - Bug #1623: Mdrun crashes for GPU runs (maybe multi-rank issue) - Closed

Associated revisions

Revision 3fed3960 (diff)
Added by Mark Abraham over 4 years ago

Avoid GPU data race also with OpenCL

Implements the same change to non-local stream synchronization as now
used for CUDA.

Fixes #1756

Change-Id: I720edc0951f97dcff0bd477084fff45a149f01d9

History

#1 Updated by Szilárd Páll over 4 years ago

  • Status changed from New to In Progress

#2 Updated by Mark Abraham over 4 years ago

We have discussed a plan to add an extra local atom whose position can be set to a sentinel value at the end of the non-local force kernel, and which should be cleared by the local H2D transfer in the next step. Then, at the start of the next non-local force kernel, it can look for the sentinel value and set a flag in the force for that atom. The normal D2H local(?) f transfer will take that back to the host, which can later observe and log the frequency of this adverse race event. The trajectory will otherwise be normal, so we also have the ability to correlate the frequency with observables if that seems warranted. A device-side sketch of the idea follows below.
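
A rough sketch of the device-side part of this idea (illustrative only; the buffer names, the sentinel value, and where these helpers would be called from are all assumptions, not the actual GROMACS kernels):

    #include <cuda_runtime.h>

    #define SENTINEL_X 1.0e30f  // a value no real coordinate should ever take

    // Called at the end of the non-local force kernel: poison the extra atom.
    __device__ void markSentinel(float4 *xq, int sentinelAtom)
    {
        xq[sentinelAtom].x = SENTINEL_X;
    }

    // Called at the start of the next non-local force kernel: if the local H2D
    // of the new coordinates has already completed, the sentinel was overwritten;
    // if it is still present, the kernel raced ahead of the copy, so flag it in
    // the force output, which reaches the host with the normal D2H transfer.
    __device__ void checkSentinel(const float4 *xq, float4 *f, int sentinelAtom)
    {
        if (xq[sentinelAtom].x == SENTINEL_X)
        {
            f[sentinelAtom].x = 1.0f;  // the host later counts/logs these occurrences
        }
    }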

#3 Updated by Mark Abraham over 4 years ago

  • Status changed from In Progress to Resolved

https://gerrit.gromacs.org/#/c/4801/ fixed this potential race. Szilard's test code was completely unable to observe an on-device race, so we think that's a bullet dodged.

#4 Updated by Szilárd Páll over 4 years ago

Mark Abraham wrote:

https://gerrit.gromacs.org/#/c/4801/ fixed this potential race. Szilard's test code was completely unable to observe an on-device race, so we think that's a bullet dodged.

Note that I have only tested with 2-3 typical inputs and only in single-node, limited-concurrency runs. I will likely not be able to do larger-scale tests right now due to lack of access to bigger resources. Do you have access to the Garching machine (or, even better, a Cray where MPS surely works) to run a test with more concurrency?

#5 Updated by Szilárd Páll over 4 years ago

Szilárd Páll wrote:

Mark Abraham wrote:

https://gerrit.gromacs.org/#/c/4801/ fixed this potential race. Szilard's test code was completely unable to observe an on-device race, so we think that's a bullet dodged.

Note that I have only tested with 2-3 typical inputs and only in single-node, limited-concurrency runs. I will likely not be able to do larger-scale tests right now due to lack of access to bigger resources. Do you have access to the Garching machine (or, even better, a Cray where MPS surely works) to run a test with more concurrency?

PS: I forgot to add the "Fixes" string to the original commit, sorry about that. We could add it to the 5.0->5.1 merge commit so at least that gets linked here properly.

#6 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1756.
Uploader: Mark Abraham ()
Change-Id: I720edc0951f97dcff0bd477084fff45a149f01d9
Gerrit URL: https://gerrit.gromacs.org/4808

#7 Updated by Mark Abraham over 4 years ago

  • % Done changed from 0 to 100

#8 Updated by Rossen Apostolov over 4 years ago

  • Status changed from Resolved to Closed

#9 Updated by Szilárd Páll over 4 years ago

  • Related to Bug #1623: Mdrun crashes for GPU runs (maybe multi-rank issue) added

#10 Updated by Szilárd Páll over 4 years ago

  • Status changed from Closed to Feedback wanted

Update: in repeated, longer runs I did manage to trigger the race condition. It took some time because the frequency seems to depend on some conditions that I'm not entirely sure about, and it is possibly also affected by the way I have been counting the occurrences of the race condition.

While in a couple of tests even runs of up to 5e7 steps did not trigger the bug, the same code and input did trigger the race a number of times, at a frequency of about once per ~3e7 steps (for the DNA input attached to #1623). Another input has recently triggered it more frequently, but I suspect this could be caused by the nature of the code changes I made to record the event (in particular, that I used a non-pinned H2D transfer to place a sentinel value in the GPU xq buffer).

I am running a reproducer that should have a somewhat lower impact on GPU task execution (using a pinned H2D transfer; see the sketch below), and another approach for counting the race events is still being (re)considered.
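
For reference, the difference comes down to how the host buffer used for the sentinel upload is allocated; a minimal sketch, with hypothetical names and sentinel value:

    #include <cuda_runtime.h>

    // Place a sentinel value into one element of the device xq buffer using
    // pinned (page-locked) host memory: the async copy can then overlap with
    // other GPU work and avoids the extra staging copy that a pageable-memory
    // transfer needs, so it perturbs the stream scheduling less.
    void placeSentinel(float4 *xq_d, int sentinelAtom, cudaStream_t stream)
    {
        float4 *sentinel_h = nullptr;
        cudaMallocHost((void **)&sentinel_h, sizeof(float4));  // pinned allocation
        sentinel_h->x = 1.0e30f;                               // hypothetical sentinel value

        cudaMemcpyAsync(&xq_d[sentinelAtom], sentinel_h, sizeof(float4),
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);  // free the pinned buffer only after the copy is done
        cudaFreeHost(sentinel_h);
    }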

I'm reopening the issue to reflect that the investigation is ongoing. The effect of calculating forces with outdated coordinates (only by 1 step!) at the observed frequency is most likely negligible. However, I cannot exclude the possibility that the race may be triggered more frequently under different conditions (e.g. different hardware, software, or input system). Hence, I suggest explicitly mentioning this bug in the release notes of the coming release(s). Additionally, I am considering providing a patch that users can use to count the occurrences of the bug. Thoughts?

#11 Updated by Carsten Kutzner about 4 years ago

I posted my feedback at #1623.

#12 Updated by Gerrit Code Review Bot about 4 years ago

Gerrit received a related DRAFT patchset '2' for Issue #1756.
Uploader: Szilárd Páll ()
Change-Id: Ic8ed79057de209997538355d4eb1207cf328c156
Gerrit URL: https://gerrit.gromacs.org/4943

#13 Updated by Mark Abraham about 4 years ago

  • Status changed from Feedback wanted to Resolved

#14 Updated by Mark Abraham about 4 years ago

  • Status changed from Resolved to Closed
