Bug #2737

AMD OpenCL fails release build in complex tests

Added by Paul Bauer 5 months ago. Updated 7 days ago.

Status:
Feedback wanted
Priority:
Low
Category:
mdrun
Target version:
2019.2
Affected version - extra info:
Affected version:
Difficulty:
hard

Description

The release build linked here failed two of the complex tests for AMD OpenCL: http://jenkins.gromacs.org/view/Release/job/Release_workflow_master/170/consoleFull

00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkforce.out (1 errors) file(s) in orientation-restraints for orientation-restraints

00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkpot.out (4 errors), checkforce.out (3 errors) file(s) in tip4p_continue for tip4p_continue
checkforce.out (24.5 KB) Szilárd Páll, 12/12/2018 12:17 PM
checkforce.out (546 KB) Paul Bauer, 12/17/2018 02:26 PM
dd121.tar.gz (960 KB) Szilárd Páll, 12/18/2018 11:36 PM

Related issues

Related to GROMACS - Bug #2897: rotation/flex2 can still fail on OpenCL (New)

Associated revisions

Revision a75919e4 (diff)
Added by Berk Hess 3 months ago

Temporary fix for OpenCL PME gather

There is a race on the z-component of the PME forces in the OpenCL
force reduction in the gather kernel. This change avoids that race,
but a better solution is a different, more efficient reduction.

Refs #2737

Change-Id: I45068c9187873548dff585044d2c8541444e385c
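
For illustration only: below is a minimal, hypothetical sketch of a per-atom shared-memory force reduction, showing how dropping a barrier lets one thread read another's partial sum before it has been written -- the kind of single-component race the commit message above describes. It is written in CUDA for brevity; the actual GROMACS code is an OpenCL kernel and looks different, and all names and launch parameters here are invented.

    // Hypothetical sketch, not the GROMACS PME gather kernel.
    // Assumes blockDim.x == nContribPerAtom and is a power of two,
    // with shared memory sized as blockDim.x * sizeof(float3).
    __global__ void reduceForces(const float3* contrib, float3* fOut, int nContribPerAtom)
    {
        extern __shared__ float3 sBuf[];
        const int tid  = threadIdx.x;
        const int atom = blockIdx.x;

        // Each thread loads one force contribution for this atom.
        sBuf[tid] = contrib[atom * nContribPerAtom + tid];
        __syncthreads();

        // Tree reduction over the contributions. Omitting the barrier at the
        // end of the loop body lets a thread read sBuf[tid + offset] before
        // the owning thread has updated it, corrupting a single component of
        // the reduced force -- analogous to the z-component race above.
        for (int offset = blockDim.x / 2; offset > 0; offset >>= 1)
        {
            if (tid < offset)
            {
                sBuf[tid].x += sBuf[tid + offset].x;
                sBuf[tid].y += sBuf[tid + offset].y;
                sBuf[tid].z += sBuf[tid + offset].z;
            }
            __syncthreads();
        }

        // Thread 0 writes the fully reduced force for this atom.
        if (tid == 0)
        {
            fOut[atom] = sBuf[0];
        }
    }

A warp-shuffle or atomics-based reduction would avoid the shared-memory round trips entirely, which is presumably the "different, more efficient reduction" the commit message alludes to.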

History

#1 Updated by Paul Bauer 5 months ago

  • Target version changed from 2019-beta2 to 2019-beta3

bumped to next beta

#2 Updated by Szilárd Páll 5 months ago

  • Status changed from New to Feedback wanted

As far as I can tell, the AMD OpenCL build runs on the ancient AMD slave, which we should not use anymore -- it has a very outdated software stack (and hardware) which may or may not function correctly.

Hence, I'm not sure this is a bug, but I'll hold off on closing.

That console output is rather confusing, BTW; is there a way to get separate outputs from separate builds?

#3 Updated by Mark Abraham 5 months ago

Yes, you can get the console log for individual builds if you look through the "Pipeline steps", but it's not exactly convenient. Thus
http://jenkins.gromacs.org/job/Release_workflow_master/173/execution/node/107/log/?consoleFull shows that the new AMD GPU slave also has an issue. That issue looks more like the OpenCL builds not being able to JIT compile, but we'll need to inspect the mdrun.out files.

#4 Updated by Mark Abraham 4 months ago

  • Target version changed from 2019-beta3 to 2019

#5 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related patchset '1' for Issue #2737.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2019~I45068c9187873548dff585044d2c8541444e385c
Gerrit URL: https://gerrit.gromacs.org/8802

#6 Updated by Szilárd Páll 3 months ago

On the dev-gpu02 machine it took me half a day and thousands of passes to reproduce a failure of the orientation restraints test, but this one doesn't seem to show the same kind of error as the others (wrong force z-component); checkforce.out attached.

On the build slave I could indeed reproduce the failure within one to two hundred attempts, and with the fix I've already got ~350 iterations completed, so the reduction error seems to be solved on AMD.

I'm still going to keep running the tests because the former failure is suspicious.

#7 Updated by Szilárd Páll 3 months ago

I've been running tests for 5 days straight (both with and without PME offload) without any errors. I have no idea what is wrong with the release builds, and not being able to read the Jenkins output doesn't help either.

#8 Updated by Paul Bauer 3 months ago

The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out

#9 Updated by Szilárd Páll 3 months ago

Paul Bauer wrote:

The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out

This output doesn't seem to tell much, except that the errors do not resemble the previous ones. I can't reproduce this, although what I ran for 5+ days on both a dev machine and bs-gpu01 was the orientation restraints test. Now I have all the nbnxn tests running; a few hours in, zero errors so far, so I'm puzzled.

#10 Updated by Berk Hess 3 months ago

But these errors are very weird: most coordinates have a massive mismatch in only x, y or z. Some have a mismatch in two dimensions. So the coordinate data seems to have gotten corrupted. It might be that a single, varying component of one force contribution gets very large values.

#11 Updated by Berk Hess 3 months ago

Force errors cannot explain this, since some atoms that are constrained are affected, whereas the atoms they are constrained to are not. So either the coordinate update in the constraints is buggy, or the copy-back of constrained coordinates to the state is, or the coordinates get corrupted between constraining and trajectory writing.

#12 Updated by Szilárd Páll 3 months ago

As noted on Gerrit, AFAICT the reason for the failures is that this verification has been buggy / not updated to the right branch, and has been running master code.

Assuming the underlying issue is still the already-fixed rare OpenCL gather reduction race (which I made less rare by running two concurrent regression tests in an infinite loop on GPU0 on the AMD slave), the kind of errors the above output file shows is still intriguing. But given that I've been unable to reproduce any of these with anything other than master code (without the fix), I think we can assume the issue is a false positive until proven otherwise.

#13 Updated by Szilárd Páll 3 months ago

Update: it seems that the release verification only triggers master builds in some cases; at least in some manually triggered builds it doesn't, e.g. http://jenkins.gromacs.org/job/Release_workflow_master/236

Note that the release verification scripts work differently: they call make check with the regression test path passed to cmake (not clear at all until you dig through 13 MB of log :), so there is no direct gmxtest invocation. Consequently, the automatic launch results in an N-way domain decomposition for the dd121 test that failed in my repro case, so no OpenCL PME here. Instead, the LJ/Coulomb SR energies are off already at step 0. Full output attached.

#14 Updated by Paul Bauer 3 months ago

  • Target version changed from 2019 to 2019.1
  • Difficulty hard added
  • Difficulty deleted (uncategorized)

This will need to be bumped to the next point release. We have known issues that are documented now, so I guess this should be fine.

#15 Updated by Mark Abraham 2 months ago

  • Category set to mdrun
  • Assignee set to Szilárd Páll
  • Priority changed from Normal to Low

#16 Updated by Mark Abraham about 1 month ago

  • Target version changed from 2019.1 to 2019.2

#17 Updated by Mark Abraham 7 days ago

Possibly related to #2897

#18 Updated by Mark Abraham 7 days ago

  • Related to Bug #2897: rotation/flex2 can still fail on OpenCL added
