AMD OpenCL fails release build in complex tests

Added by Paul Bauer 2 months ago. Updated 5 days ago.

The release build linked here failed two of the complex tests for AMD OpenCL:

00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkforce.out (1 errors) file(s) in orientation-restraints for orientation-restraints

00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkpot.out (4 errors), checkforce.out (3 errors) file(s) in tip4p_continue for tip4p_continue

Added by Berk Hess about 1 month ago

Temporary fix for OpenCL PME gather

There is a race on the z-component of the PME forces in the OpenCL
force reduction in the gather kernel. This change avoids that race.
But a better solution is a different, more efficient reduction.

Refs #2737
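
For readers unfamiliar with this class of bug, here is a minimal, generic sketch of a lock-step pairwise (tree) reduction over per-thread force contributions. It is not the GROMACS gather kernel and not the fix referenced above; the function names `tree_reduce` and `reduce_forces` are hypothetical. It only illustrates the idea that every reduction step must read values that the previous step has finished writing, which on a GPU is what a barrier between steps guarantees.

```python
def tree_reduce(values):
    """Sum `values` with a lock-step pairwise (tree) reduction.

    Each pass reads only from a snapshot of the previous pass. In this
    sequential sketch the snapshot is not strictly required; it stands in
    for the barrier a GPU kernel would need between reduction steps.
    """
    buf = list(values)
    n = len(buf)
    stride = 1
    while stride < n:
        prev = list(buf)  # "barrier": the previous step is fully written
        for i in range(0, n - stride, 2 * stride):
            buf[i] = prev[i] + prev[i + stride]
        stride *= 2
    return buf[0] if buf else 0.0


def reduce_forces(forces):
    """Reduce a list of (fx, fy, fz) contributions component by component."""
    return tuple(tree_reduce([f[dim] for f in forces]) for dim in range(3))


if __name__ == "__main__":
    # Illustrative contributions only, not data from the failing tests.
    contributions = [(1.0, 2.0, 3.0), (0.5, 0.5, 0.5), (2.0, 0.0, -1.0)]
    print(reduce_forces(contributions))
```

If one component (say z) were read before the preceding step had finished writing it, only that component of the reduced force would be wrong, which is the kind of symptom a missing synchronization point can produce.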

#1 Updated by Paul Bauer 2 months ago

  • Target version changed from 2019-beta2 to 2019-beta3

bumped to next beta

#2 Updated by Szilárd Páll 2 months ago

  • Status changed from New to Feedback wanted

As far as I can tell, the AMD OpenCL build runs on the ancient AMD slave which we should not use anymore -- it has a very outdated software stack (and hardware) which may or may not function correctly.

Hence, I'm not sure this is a bug, but I'll hold off on closing it.

That console output is rather confusing, BTW; is there a way to get separate outputs from separate builds?

#3 Updated by Mark Abraham 2 months ago

Yes, you can get the console log for individual builds if you look through the "Pipeline steps", but it's not exactly convenient. This shows that the new AMD GPU slave also has an issue. That issue seems more like OpenCL builds not being able to JIT-compile, but we'll need to inspect the mdrun.out files.

#4 Updated by Mark Abraham about 2 months ago

  • Target version changed from 2019-beta3 to 2019

#5 Updated by Gerrit Code Review Bot about 1 month ago

Gerrit received a related patchset '1' for Issue #2737.
Uploader: Berk Hess
Change-Id: gromacs~release-2019~I45068c9187873548dff585044d2c8541444e385c
Gerrit URL:

#6 Updated by Szilárd Páll about 1 month ago

On the dev-gpu02 machine it took me half a day and thousands of passes to reproduce a failure of the orientation restraints test, but this doesn't seem to show the same kind of error as the others (wrong force Z component); checkforce.out is attached.

On the build slave I could indeed reproduce the failure within 100-200 attempts, and with the fix I have already completed ~350 iterations, so the reduction error seems to be solved on AMD.

I'm still going to keep running the tests because the former failure is suspicious.
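
A repeat-until-failure loop of the kind described here can be sketched as follows. This is only an illustration: `run_until_failure` is a hypothetical helper, and the command in the usage part is a placeholder that always succeeds, not the actual regressiontest invocation used on the build slave.

```python
import subprocess
import sys


def run_until_failure(cmd, max_iterations):
    """Run `cmd` repeatedly.

    Returns the 1-based iteration of the first run that exits with a
    non-zero status, or None if all `max_iterations` runs passed.
    """
    for iteration in range(1, max_iterations + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return iteration
    return None


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real test command.
    placeholder_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
    first_failure = run_until_failure(placeholder_cmd, 3)
    print("first failing iteration:", first_failure)
```

For a failure that shows up once in a few hundred runs, the iteration cap needs to be well above the expected failure interval before a clean run says much about a fix.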

#7 Updated by Szilárd Páll about 1 month ago

I've been running tests for 5 days straight (both with and without PME offload) without any errors. I've no idea what is wrong with the release builds, and not being able to read the Jenkins output doesn't help either.

#8 Updated by Paul Bauer about 1 month ago

The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out

#9 Updated by Szilárd Páll about 1 month ago

Paul Bauer wrote:

The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out

This output doesn't seem to tell much, except that the errors do not resemble the previous ones. I can't reproduce this, although what I ran for 5+ days on both a dev machine and bs-gpu01 was the orientation restraints test. Now I have all nbnxn tests running; after a few hours there are still zero errors, so I'm puzzled.

#10 Updated by Berk Hess about 1 month ago

But these errors are very weird: most coordinates have a massive mismatch in only x, y or z. Some have a mismatch in 2 dimensions. So the coordinate data seems to have gotten corrupted. It might be possible that a single, varying component of one force contribution gets very large values.
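
The pattern described above (mismatch in one or two components per atom) can be tallied with a small script. The sketch below is an assumption-laden illustration: `count_mismatched_components` is a hypothetical helper, it takes plain lists of (x, y, z) tuples rather than parsing any GROMACS output format, and the tolerance is arbitrary.

```python
from collections import Counter


def count_mismatched_components(reference, test, tolerance):
    """Count atoms by how many of their x/y/z components mismatch.

    Returns a Counter mapping the number of mismatching components
    (1, 2 or 3) to the number of atoms showing that pattern. Atoms with
    no mismatch are not counted.
    """
    counts = Counter()
    for ref_xyz, test_xyz in zip(reference, test):
        n_bad = sum(abs(r - t) > tolerance for r, t in zip(ref_xyz, test_xyz))
        if n_bad:
            counts[n_bad] += 1
    return counts


if __name__ == "__main__":
    # Made-up coordinates for illustration, not taken from the test output.
    ref = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
    new = [(0.0, 0.0, 0.0), (1.0, 50.0, 1.0), (9.0, 2.0, -7.0)]
    print(count_mismatched_components(ref, new, tolerance=0.1))
```

A tally dominated by one-component mismatches would be consistent with the observation in this comment, whereas a genuine trajectory divergence would normally affect all three components of the atoms involved.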

#11 Updated by Berk Hess about 1 month ago

Force errors cannot explain this, since some atoms that are constrained are affected, whereas the atoms they are constrained to are not. So either the coordinate update in the constraints is buggy, or the copy-back of constrained coordinates to the state is buggy, or the coordinates get corrupted between constraining and trajectory writing.

#12 Updated by Szilárd Páll about 1 month ago

As noted on Gerrit, AFAICT the reason for the failures is that this verification has been buggy / not updated to the right branch and has been running master code.

Assuming the underlying issue is still the rare OpenCL gather reduction race that has already been fixed (and which I made less rare by running two concurrent regressiontests in an infinite loop on GPU0 of the AMD slave), the kind of errors shown in the above output file is still intriguing. However, given that I've been unable to reproduce any of these with code other than master (without the fix), I think we can assume the issue is a false positive until proven otherwise.

#13 Updated by Szilárd Páll about 1 month ago

Update: it seems that the release verification only triggers master builds in some cases; at least in some manually triggered builds it doesn't, e.g.

Note that the release verification scripts work differently and call make check with the regressiontest path passed to cmake (not clear at all until you dig through 13 MB of log :), so there is no direct gmxtest invocation. Consequently, the automatic launch results in N-way decomposition for the dd121 test that failed in my repro case -- so no OpenCL PME here. Instead, LJ/coul SR energies are off already at step 0. Full output attached.

#14 Updated by Paul Bauer 19 days ago

  • Target version changed from 2019 to 2019.1
  • Difficulty hard added
  • Difficulty deleted (uncategorized)

This will need to be bumped to the next point release. We have known issues that are documented now, so I guess this should be fine.

#15 Updated by Mark Abraham 5 days ago

  • Category set to mdrun
  • Assignee set to Szilárd Páll
  • Priority changed from Normal to Low

