Bug #2702

PME gather reduction race in OpenCL (and CUDA)

Added by Berk Hess 12 months ago. Updated 7 months ago.

Status: Accepted
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Affected version - extra info: 2018.3
Affected version:
Difficulty: uncategorized
Description

The complex/orientation_restraints test failed twice (as of filing) on the nvidia opencl build, e.g.:
http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/6908/OPTIONS=gcc-6%20openmp%20gpuhw=nvidia%20opencl-1.2%20clFFT-2.14%20mpi%20simd=avx2_256%20host=bs_nix1310,label=bs_nix1310/testReport/junit/(root)/complex/orientation_restraints/

The force differences are small and present on all atoms, which could be explained by incorrect PME long-range forces.


Related issues

Related to GROMACS - Bug #2642: mdrun with SIMD triggers floating point exceptions (Closed)

Associated revisions

Revision 577b4d23 (diff)
Added by Mark Abraham 12 months ago

Make PME OpenCL enabled only for AMD devices

Other vendor devices have known issues, but fixes
are not yet complete.

Refs #2702, #2719

Change-Id: I0d443229ffe4cee3bb4029f57502f9c7fba2574d

History

#1 Updated by Berk Hess 12 months ago

  • Related to Bug #2642: mdrun with SIMD triggers floating point exceptions added

#2 Updated by Szilárd Páll 12 months ago

I've not been tracking the failures on jenkins, so I'm not sure what the status is: do we have any evidence that this is indeed OpenCL-related? Paul mentioned today that he has seen such errors in non-GPU builds too.

#3 Updated by Mark Abraham 12 months ago

There are many suspected failing configurations. See #2642. An OpenCL configuration seems more susceptible to at least one failure mode.

#6 Updated by Berk Hess 12 months ago

The first detected deviation in the failure Paul linked is a force deviation on 4 atoms, only in the z-component:
f[ 33] (-4.82232e+02 -1.67281e+03 -9.86697e+02) - (-4.82230e+02 -1.67281e+03 -1.01275e+03)
f[ 41] ( 2.27018e+02 -4.49755e+02 4.39665e+02) - ( 2.27018e+02 -4.49755e+02 4.39083e+02)
f[ 297] (-8.72075e+02 -3.91538e+02 2.48470e+02) - (-8.72074e+02 -3.91542e+02 2.45434e+02)
f[ 489] (-3.79490e+02 4.31009e+02 7.80092e+02) - (-3.79490e+02 4.31009e+02 7.82264e+02)

These 4 atoms are protein atoms, not spatially close to each other or to box edges. That only the z component is off suggests some synchronization issue in the OpenCL PME gather kernel, where different iterations or threads/warps operate on different dimensions. I don't see any issue in the code though and we have not seen AMD issues yet, so I wonder if this is some nvidia opencl bug.
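
To illustrate the suspected failure mode, here is a minimal CUDA-style sketch. This is not the actual GROMACS gather kernel and every name in it is invented: a group of 64 threads reduces one force component at a time in shared memory without barriers, assuming the whole group executes in lock-step. That assumption holds for a 64-wide AMD wavefront but not for 32-wide NVIDIA warps, where the reading warp can run ahead of the writing warp, so exactly one component of the reduced force can come out slightly wrong.

#define GROUP_SIZE 64  // assumed to equal blockDim.x and the lock-step width

__global__ void racyComponentReduce(const float3 *gm_partial, float *gm_out)
{
    __shared__ float sm_red[GROUP_SIZE];

    const int    tid = threadIdx.x;
    const float3 f   = gm_partial[blockIdx.x * GROUP_SIZE + tid];

    for (int dim = 0; dim < 3; dim++)            // x, y, z reduced one at a time
    {
        sm_red[tid] = (dim == 0) ? f.x : (dim == 1) ? f.y : f.z;

        // No __syncthreads() in this loop: correct only if all GROUP_SIZE
        // threads run in lock-step. At s == 32 a thread reads a value written
        // by a thread in the other 32-wide NVIDIA warp, so the read can
        // happen before the write -- a data race.
        for (int s = GROUP_SIZE / 2; s > 0; s >>= 1)
        {
            if (tid < s)
            {
                sm_red[tid] += sm_red[tid + s];
            }
        }

        if (tid == 0)
        {
            gm_out[blockIdx.x * 3 + dim] = sm_red[0];
        }
        __syncthreads();  // an "extra" barrier like this only separates the
                          // x/y/z passes; it does not fix the race within a pass
    }
}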

#7 Updated by Szilárd Páll 12 months ago

Running tests to try to reproduce. It does indeed seem fishy, but as Berk noted offline, only the gather does component-wise treatment (which would be prone to repeatedly screwing up only one force component); this code was suspicious to begin with, as it has an extra unexplained barrier (which, in fact, I had wanted to remove because I believed it was not needed).

#8 Updated by Szilárd Páll 12 months ago

  • Category set to mdrun
  • Status changed from New to Accepted
  • Affected version changed from git master to 2019-beta1

Reproduced. So far I can't tell whether this is a subtle bug in the code (though the code is identical to the Fermi reduction in CUDA) or a compiler issue. What is clear, however, is that the seemingly unnecessary barrier (https://redmine.gromacs.org/projects/gromacs/repository/revisions/release-2019/entry/src/gromacs/ewald/pme-gather.clh#L129) just reduced the frequency of the issue.

#9 Updated by Berk Hess 12 months ago

But the issue still only occurs on nvidia and not on amd?
The only difference I can see is the warp size.

#10 Updated by Szilárd Páll 12 months ago

Berk Hess wrote:

But the issue still only occurs on nvidia and not on amd?
The only difference I can see is the warp size.

I've done more tests now:
- on AMD it works fine even without the "extra" barrier
- on NVIDIA Fermi it fails in at least one out of ten runs even with the "extra" barrier; on Pascal, 0 out of ~6000 runs failed. However, without the barrier it fails in 100% of the runs on both Kepler and Pascal.

#11 Updated by Szilárd Páll 12 months ago

The bug is in the gather reduction, present both in the shared-memory reduction in CUDA (code not used in the 2019 release, but used on Fermi in 2018) and in its reimplementation in OpenCL. I have a workaround, but the reduction code is not just incorrect, it is also overly convoluted and far from optimal, so we should consider a rewrite of this rather small function -- possibly for 2019, if that's OK in a beta. I'm quite busy, but if I get to it this week, it can make it into beta3; otherwise it's probably best if we just fix it and leave the rewrite for later.
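
For reference, one straightforward race-free shape such a reduction could take, under the same invented setup as the sketch in comment 6 (so this is only an illustration, not the workaround or the proposed rewrite): a block-wide barrier between the shared-memory steps makes correctness independent of the execution width, at the cost of synchronizing the whole block.

#define GROUP_SIZE 64  // assumed to equal blockDim.x

__global__ void barrieredComponentReduce(const float3 *gm_partial, float *gm_out)
{
    __shared__ float sm_red[GROUP_SIZE];

    const int    tid = threadIdx.x;
    const float3 f   = gm_partial[blockIdx.x * GROUP_SIZE + tid];

    for (int dim = 0; dim < 3; dim++)            // x, y, z reduced one at a time
    {
        sm_red[tid] = (dim == 0) ? f.x : (dim == 1) ? f.y : f.z;
        __syncthreads();                         // make all writes visible across warps

        for (int s = GROUP_SIZE / 2; s > 0; s >>= 1)
        {
            if (tid < s)
            {
                sm_red[tid] += sm_red[tid + s];
            }
            __syncthreads();                     // each step sees the previous step's sums
        }

        if (tid == 0)
        {
            gm_out[blockIdx.x * 3 + dim] = sm_red[0];
        }
        __syncthreads();                         // safe reuse of the buffer for the next component
    }
}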

#12 Updated by Gerrit Code Review Bot 12 months ago

Gerrit received a related patchset '2' for Issue #2702.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2019~I0d443229ffe4cee3bb4029f57502f9c7fba2574d
Gerrit URL: https://gerrit.gromacs.org/8653

#13 Updated by Szilárd Páll 10 months ago

Although we thought the original bug was not an issue for execution width >= 64, we seem to continue to get failing tests with forces that are slightly incorrect, and only in the third component:

http://jenkins.gromacs.org/view/Gerrit%20pre-submit/job/Matrix_PreSubmit_2019/362/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20host=bs_gpu01,label=bs_gpu01/testReport/junit/(root)/complex/orientation_restraints

A late conclusion for this week, and perhaps too late for RC1, but instead of patching it up it may be best to simply rewrite this reduction.

#14 Updated by Gerrit Code Review Bot 10 months ago

Gerrit received a related patchset '1' for Issue #2702.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~Id3143887167a194f7eef7825239fdb841f17ebdc
Gerrit URL: https://gerrit.gromacs.org/8805

#15 Updated by Szilárd Páll 10 months ago

  • Subject changed from Suspected OpenCL (nvidia?) PME issue to PME gather reduction race in OpenCL (and CUDA)
  • Affected version - extra info set to 2018.3

Also affects 2018 on Fermi with the shared memory reduction codepath as well as 2019 CUDA code which still contains this as a fallback.

#16 Updated by Szilárd Páll 10 months ago

Szilárd Páll wrote:

Also affects 2018 on Fermi with the shared memory reduction codepath as well as 2019 CUDA code which still contains this as a fallback.

A few more syncthreads are needed; change 8805 is not enough. I have an old draft that should have solved most of the issues, but I have not finalized it because it was still triggering race error reports from the cuda race checker tool. I need to first either understand why those races are reported (as false positives?) or fix them.

#17 Updated by Mark Abraham 7 months ago

Possibly related to #2897

#18 Updated by Mark Abraham 7 months ago

  • Related to Bug #2897: rotation/flex2 can still fail on cpu-only run on OpenCL build added

#19 Updated by Szilárd Páll 7 months ago

  • Related to deleted (Bug #2897: rotation/flex2 can still fail on cpu-only run on OpenCL build)

#20 Updated by Szilárd Páll 7 months ago

Mark Abraham wrote:

Possibly related to #2897

As noted there, it is not (that was a CPU run in which the rotation test failed).