Bug #2737

AMD OpenCL release build fails complex tests

Added by Paul Bauer 11 months ago. Updated 1 day ago.

Status: Accepted
Priority: Low
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: hard

Description

The release build linked here failed two of the complex tests for AMD OpenCL: http://jenkins.gromacs.org/view/Release/job/Release_workflow_master/170/consoleFull

00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkforce.out (1 errors) file(s) in orientation-restraints for orientation-restraints

00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkpot.out (4 errors), checkforce.out (3 errors) file(s) in tip4p_continue for tip4p_continue
Attachments:
checkforce.out (24.5 KB) - Szilárd Páll, 12/12/2018 12:17 PM
checkforce.out (546 KB) - Paul Bauer, 12/17/2018 02:26 PM
dd121.tar.gz (960 KB) - Szilárd Páll, 12/18/2018 11:36 PM
nbnxn_vsite.tar.gz (534 KB) - test failing with master on a 6-rank 2-GPU config on AMD + AMDGPU-PRO - Szilárd Páll, 10/01/2019 11:50 PM

Associated revisions

Revision a75919e4 (diff)
Added by Berk Hess 10 months ago

Temporary fix for OpenCL PME gather

There is a race on the z-component of the PME forces in the OpenCL
force reduction in the gather kernel. This change avoids that race,
but a better solution would be a different, more efficient reduction.

Refs #2737

Change-Id: I45068c9187873548dff585044d2c8541444e385c
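
As a purely illustrative aside for readers not familiar with work-group reductions: the sketch below shows, in generic OpenCL C, the kind of hazard the commit message describes. It is not the actual GROMACS gather kernel; the kernel name, the fx/fy/fz staging arrays, the f_out argument, and the assumed work-group size of at most 64 are all invented for the example. The point is only that when per-component partial forces are staged in local memory and combined by a tree reduction, any read that is not separated from the preceding writes by a work-group barrier can race, leaving one component (for example z) intermittently wrong.

    /* Illustrative only -- not GROMACS code. A per-component force
     * reduction over one work-group; assumes local size <= 64. */
    __kernel void gather_reduce_sketch(__global float *f_out)
    {
        __local float fx[64];
        __local float fy[64];
        __local float fz[64];
        const int t = get_local_id(0);

        /* Each work-item stages its partial force components. */
        fx[t] = 1.0f;
        fy[t] = 2.0f;
        fz[t] = 3.0f;

        /* Without this barrier the reads below can race with the writes
         * above, giving intermittent errors like those described here. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction; each step needs a barrier so the next step's
         * reads see the previous step's writes. */
        for (int s = get_local_size(0) / 2; s > 0; s >>= 1)
        {
            if (t < s)
            {
                fx[t] += fx[t + s];
                fy[t] += fy[t + s];
                fz[t] += fz[t + s];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (t == 0)
        {
            f_out[0] = fx[0];
            f_out[1] = fy[0];
            f_out[2] = fz[0];
        }
    }

The commit only states that the racy reduction was worked around pending a more efficient one; the barrier placement above is just a generic way to make such a reduction well defined, not a description of what change a75919e4 actually does.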

History

#1 Updated by Paul Bauer 11 months ago

  • Target version changed from 2019-beta2 to 2019-beta3

bumped to next beta

#2 Updated by Szilárd Páll 11 months ago

  • Status changed from New to Feedback wanted

As far as I can tell, the AMD OpenCL build runs on the ancient AMD slave which we should not use anymore -- it has a very outdated software stack (and hardware) which may or may not function correctly.

Hence, I'm not sure this is a bug, but I'll hold off on closing.

That console output is rather confusing, BTW; is there a way to get separate outputs from separate builds?

#3 Updated by Mark Abraham 11 months ago

Yes, you can get the console log for individual builds if you look through the "Pipeline steps", but it's not exactly convenient. Thus
http://jenkins.gromacs.org/job/Release_workflow_master/173/execution/node/107/log/?consoleFull shows that the new AMD GPU slave also has an issue. That issue looks more like the OpenCL builds not being able to JIT-compile, but we'll need to inspect the mdrun.out files.

#4 Updated by Mark Abraham 11 months ago

  • Target version changed from 2019-beta3 to 2019

#5 Updated by Gerrit Code Review Bot 10 months ago

Gerrit received a related patchset '1' for Issue #2737.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2019~I45068c9187873548dff585044d2c8541444e385c
Gerrit URL: https://gerrit.gromacs.org/8802

#6 Updated by Szilárd Páll 10 months ago

On the dev-gpu02 machine it took me half a day and thousands of passes to reproduce a failure of the orientation restraints test, but this does not seem to show the same kind of error as the others (wrong force z component); checkforce.out attached.

On the build slave I could indeed reproduce the failure within one to two hundred attempts, and with the fix I have already got ~350 iterations completed, so the reduction error seems to be solved on AMD.

I'm still going to keep running the tests because the former failure is suspicious.

#7 Updated by Szilárd Páll 10 months ago

I've been running tests for 5 days straight (both with and without PME offload) without any errors. I have no idea what is wrong with the release builds, and not being able to read the Jenkins output doesn't help either.

#8 Updated by Paul Bauer 10 months ago

The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out

#9 Updated by Szilárd Páll 10 months ago

Paul Bauer wrote:

The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out

This output doesn't seem to tell much, except that the errors do not resemble the previous ones. I can't reproduce this, although what I ran for 5+ days on both a dev machine and bs-gpu01 was the orientation restraints test. Now I have had all the nbnxn tests running for a few hours, but so far zero errors, so I'm puzzled.

#10 Updated by Berk Hess 10 months ago

But these errors are very weird: most coordinates have a massive mismatch in only x, y, or z. Some have a mismatch in two dimensions. So the coordinate data seems to have gotten corrupted. It might be that a single, varying component of one force contribution gets very large values.

#11 Updated by Berk Hess 10 months ago

Force errors cannot explain this, since some atoms that are constrained are affected, whereas the atoms they are constrained to are not. So either the coordinate update in the constraints is buggy, the copy-back of constrained coordinates to the state is buggy, or the coordinates get corrupted between constraining and trajectory writing.

#12 Updated by Szilárd Páll 10 months ago

As noted on Gerrit, AFAICT the reason for the failures is that this verification has been buggy / not updated to the right branch and has been running master code.

Assuming the underlying issue is still the already-fixed rare OpenCL gather reduction race (which I made less rare by running two concurrent regression tests in an infinite loop on GPU 0 of the AMD slave), the kind of errors the above output file shows is still intriguing. But given that I have been unable to reproduce any of these with code other than master (without the fix), I think we can assume the issue is a false positive until proven otherwise.

#13 Updated by Szilárd Páll 10 months ago

Update: it seems that the release verification only triggers master builds in some cases; at least in some manually triggered builds it doesn't, e.g. http://jenkins.gromacs.org/job/Release_workflow_master/236

Note that the release verification scripts work differently: they call make check with the regression test path passed to CMake (not clear at all until you dig through 13 MB of log :), so there is no direct gmxtest invocation. Consequently, the automatic launch results in an N-way decomposition for the dd121 test that failed in my repro case -- so no OpenCL PME here. Instead, the LJ/Coulomb SR energies are off already at step 0. Full output attached.

#14 Updated by Paul Bauer 10 months ago

  • Target version changed from 2019 to 2019.1
  • Difficulty hard added
  • Difficulty deleted (uncategorized)

This will need to be bumped to the next point release. We have known issues that are documented now, so I guess this should be fine.

#15 Updated by Mark Abraham 9 months ago

  • Category set to mdrun
  • Assignee set to Szilárd Páll
  • Priority changed from Normal to Low

#16 Updated by Mark Abraham 8 months ago

  • Target version changed from 2019.1 to 2019.2

#17 Updated by Mark Abraham 7 months ago

Possibly related to #2897 (later determined not to be related).

#18 Updated by Mark Abraham 7 months ago

  • Related to Bug #2897: rotation/flex2 can still fail on cpu-only run on OpenCL build added

#19 Updated by Szilárd Páll 7 months ago

  • Related to deleted (Bug #2897: rotation/flex2 can still fail on cpu-only run on OpenCL build)

#20 Updated by Szilárd Páll 6 months ago

The orientation restraints test fails on my laptop with an Intel iGPU and only nonbondeds offloaded, with some subtle checkforce (xfv) errors, so this is quite likely not PME OpenCL related (possibly not even OpenCL related?). However, this did not reproduce on either NVIDIA or AMD, with or without PME offload. The tip4p error I have still not managed to reproduce so far.

#21 Updated by Paul Bauer 6 months ago

  • Target version changed from 2019.2 to 2019.3

bumped again

#22 Updated by Szilárd Páll 4 months ago

Can you please retrigger these? As I noted above, this may not be related to OpenCL code at all (or at least not to OpenCL code that is new in 2019).

#23 Updated by Paul Bauer 4 months ago

I retriggered this and it works now

#24 Updated by Paul Bauer 4 months ago

  • Status changed from Feedback wanted to Resolved

this seems to have been resolved some time in the past. Can be kept open if we need to investigate further in the future.

#25 Updated by Szilárd Páll 4 months ago

Paul Bauer wrote:

this seems to have been resolved some time in the past. Can be kept open if we need to investigate further in the future.

I do not think it has. I've retriggered your change (PS11, when it was still based on the OpenCL gather fix) and 1 of 3 runs failed; here's the failing case, on both complex/dd121 and complex/nbnxn-ljpme-geometric:
http://jenkins.gromacs.org/job/Release_workflow_master/457/

Note again that AFAICT these runs do DD so they do not use PME OpenCL.

#26 Updated by Szilárd Páll 4 months ago

Note again that AFAICT these runs do DD so they do not use PME OpenCL.

Confirmed:
- dd121 uses 12 ranks, 3x4x1 decomposition
- nbnxn-ljpme-geometric uses 2 ranks
Hence there is only nonbonded offload in the configs in question, which do fail intermittently on bs-gpu01.

#27 Updated by Szilárd Páll 4 months ago

Update: I now have one instance where a manually run test failed on bs-gpu01; however, I have not been able to reproduce it on another dev machine with ROCm yet. I have started a long loop of tests to see what that gives.

#28 Updated by Paul Bauer 4 months ago

  • Status changed from Resolved to Feedback wanted
  • Target version changed from 2019.3 to 2019.4

There still seems to be something going on, but the release build seems to pass now.

#29 Updated by Paul Bauer 15 days ago

  • Target version changed from 2019.4 to 2019.5

Is this still an issue? Bumping.

#30 Updated by Szilárd Páll 14 days ago

  • Target version changed from 2019.5 to 2020-beta2

Re-targeting to 2020; a release build check on master is running here: http://jenkins.gromacs.org/view/Release/job/Release_workflow_master/599/

#32 Updated by Szilárd Páll 12 days ago

  • Status changed from Feedback wanted to Accepted

I cannot reproduce this on NVIDIA or Intel, only on AMD.

#33 Updated by Berk Hess 12 days ago

Could it be that this is the same issue that I fixed here?
https://gerrit.gromacs.org/c/gromacs/+/13560

#34 Updated by Berk Hess 12 days ago

Berk Hess wrote:

Could it be that this is the same issue that I fixed here?
https://gerrit.gromacs.org/c/gromacs/+/13560

The change I made only fixed an issue with PME on a GPU with a separate PME-only rank. That is not the case here.

#35 Updated by Berk Hess 12 days ago

The output in nbnxn_vsite.tar.gz shows that the LJ and Coulomb pair energies are already smaller by 2% and 20% at step 0. So pair interactions are likely missing.

#36 Updated by Berk Hess 12 days ago

Berk Hess wrote:

The output in nbnxn_vsite.tar.gz shows that the LJ and Coulomb pair energies are already smaller by 2% and 20% at step 0. So pair interactions are likely missing.

Maybe some non-local interactions are missing due to a synchronization issue?

#37 Updated by Berk Hess 5 days ago

One failure on Jenkins has a massive difference in the LJ energy at step 0:
LJ (SR) step 0: 1944.39, step 0: 1113.65
Coulomb (SR) step 0: -15234.3, step 0: -15265.1

This LJ difference (1944.39 - 1113.65 = 830.74) corresponds to 99.6% of the local LJ energy of one of the two zones. This could be a coincidence, but it does show that the difference must come partially, or maybe fully, from missing local interactions. My suspicion is that the local energies, and probably the forces, are used before the local kernel has finished.

#38 Updated by Szilárd Páll 1 day ago

No progress; I suggest bumping before beta2.
