Bug #2737
AMD OpenCL fails release build in complex tests
Description
The release build linked here failed two of the complex tests for AMD OpenCL: http://jenkins.gromacs.org/view/Release/job/Release_workflow_master/170/consoleFull
00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkforce.out (1 errors) file(s) in orientation-restraints for orientation-restraints
00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkpot.out (4 errors), checkforce.out (3 errors) file(s) in tip4p_continue for tip4p_continue
Related issues
Associated revisions
History
#1 Updated by Paul Bauer about 2 years ago
- Target version changed from 2019-beta2 to 2019-beta3
bumped to next beta
#2 Updated by Szilárd Páll about 2 years ago
- Status changed from New to Feedback wanted
As far as I can tell, the AMD OpenCL build runs on the ancient AMD slave which we should not use anymore -- it has a very outdated software stack (and hardware) which may or may not function correctly.
Hence, I'm not sure this is a bug, but I'll wait with closing.
That console output is rather confusing, BTW; is there a way to get separate outputs from separate builds?
#3 Updated by Mark Abraham about 2 years ago
Yes, you can get the console log for individual builds if you look through the "Pipeline steps", but it's not exactly convenient. Thus http://jenkins.gromacs.org/job/Release_workflow_master/173/execution/node/107/log/?consoleFull shows that the new AMD GPU slave also has an issue. That issue looks more like the OpenCL builds not being able to JIT-compile, but we'll need to inspect the mdrun.out files.
#4 Updated by Mark Abraham about 2 years ago
- Target version changed from 2019-beta3 to 2019
#5 Updated by Gerrit Code Review Bot about 2 years ago
Gerrit received a related patchset '1' for Issue #2737.
Uploader: Berk Hess (hess@kth.se)
Change-Id: gromacs~release-2019~I45068c9187873548dff585044d2c8541444e385c
Gerrit URL: https://gerrit.gromacs.org/8802
#6 Updated by Szilárd Páll about 2 years ago
- File checkforce.out checkforce.out added
On the dev-gpu02 machine it took me half a day and thousands of passes to reproduce a failure of the orientation restraints test, but this doesn't seem to show the same kind of error as the others (a wrong force Z component); checkforce.out is attached.
On the build slave I could indeed reproduce the failure within one to two hundred attempts, while with the fix I already have ~350 iterations completed, so the reduction error seems to be solved on AMD.
I'm still going to keep running the tests because the former failure is suspicious.
#7 Updated by Szilárd Páll about 2 years ago
I've been running tests for 5 days straight (both with and without PME offload) without any errors. I have no idea what is wrong with the release builds, and not being able to read the Jenkins output doesn't help either.
#8 Updated by Paul Bauer about 2 years ago
- File checkforce.out checkforce.out added
The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out
#9 Updated by Szilárd Páll about 2 years ago
Paul Bauer wrote:
The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out
This output doesn't seem to tell us much, except that the errors do not resemble the previous ones. I can't reproduce this, although what I ran for 5+ days on both a dev machine and bs-gpu01 was the orientation restraints test. Now I have all the nbnxn tests running; after a few hours there are zero errors so far, so I'm puzzled.
#10 Updated by Berk Hess about 2 years ago
But these errors are very weird: most coordinates have a massive mismatch in only x, y or z. Some have a mismatch in two dimensions. So the coordinate data seems to have gotten corrupted. It might be that a single, varying component of one force contribution gets very large values.
#11 Updated by Berk Hess about 2 years ago
Force errors cannot explain this, since some atoms that are constrained are affected, whereas the atoms they are constrained to are not. So either the coordinate update in the constraints is buggy, or the copy-back of the constrained coordinates to the state is, or the coordinates get corrupted between constraining and trajectory writing.
#12 Updated by Szilárd Páll about 2 years ago
As noted on Gerrit, AFAICT the reason for the failures is that this verification has been buggy / not updated to the right branch and has been running master code.
Assuming the underlying issue is still the already-fixed, rare OpenCL gather reduction race (which I made less rare by running two concurrent regression test sets in an infinite loop on GPU0 on the AMD slave), the kind of errors the above output file shows is still intriguing. But given that I've been unable to reproduce any of these with code other than master (without the fix), I think we can assume the issue is a false positive until proven otherwise.
#13 Updated by Szilárd Páll about 2 years ago
- File dd121.tar.gz dd121.tar.gz added
Update: so it seems that the release verification only triggers master builds in some cases; at least in some manually triggered builds it doesn't, e.g. http://jenkins.gromacs.org/job/Release_workflow_master/236
Note that the release verification scripts work differently: they call make check with the regressiontest path passed to cmake (not clear at all until you dig through 13 MB of log :), so there is no direct gmxtest invocation. Consequently, the automatic launch results in an N-way decomposition for the dd121 test that failed in my repro case -- so no OpenCL PME here. Instead, the LJ/Coulomb SR energies are off already at step 0. Full output attached.
#14 Updated by Paul Bauer about 2 years ago
- Target version changed from 2019 to 2019.1
- Difficulty hard added
- Difficulty deleted (uncategorized)
This will need to be bumped to the next point release. We have known issues that are documented now, so I guess this should be fine.
#15 Updated by Mark Abraham about 2 years ago
- Category set to mdrun
- Assignee set to Szilárd Páll
- Priority changed from Normal to Low
#16 Updated by Mark Abraham almost 2 years ago
- Target version changed from 2019.1 to 2019.2
#17 Updated by Mark Abraham almost 2 years ago
Possibly related to #2897; later determined to be not related.
#18 Updated by Mark Abraham almost 2 years ago
- Related to Bug #2897: rotation/flex2 can still fail on cpu-only run on OpenCL build added
#19 Updated by Szilárd Páll almost 2 years ago
- Related to deleted (Bug #2897: rotation/flex2 can still fail on cpu-only run on OpenCL build)
#20 Updated by Szilárd Páll almost 2 years ago
The orientation restraints test fails on my laptop with an Intel iGPU and only nonbondeds offloaded, with some subtle checkforce (xfv) errors, so it is quite likely not PME OpenCL related (possibly not even OpenCL-related?). This, however, did not reproduce on either NVIDIA or AMD, with or without PME offload. I've still not managed to reproduce the tip4p error so far.
#21 Updated by Paul Bauer almost 2 years ago
- Target version changed from 2019.2 to 2019.3
bumped again
#22 Updated by Szilárd Páll over 1 year ago
Can you please retrigger these -- as I noted above, this may not be related to OpenCL code at all (or at least not to OpenCL code new in 2019).
#23 Updated by Paul Bauer over 1 year ago
I retriggered this and it works now
#24 Updated by Paul Bauer over 1 year ago
- Status changed from Feedback wanted to Resolved
this seems to have been resolved some time in the past. Can be kept open if we need to investigate further in the future.
#25 Updated by Szilárd Páll over 1 year ago
Paul Bauer wrote:
this seems to have been resolved some time in the past. Can be kept open if we need to investigate further in the future.
I do not think it has. I've retriggered your change (PS11, when it was still based on the OpenCL gather fix) and had 1 of 3 failures; here's the failing case, on both complex/dd121 and complex/nbnxn-ljpme-geometric:
http://jenkins.gromacs.org/job/Release_workflow_master/457/
Note again that AFAICT these runs do DD so they do not use PME OpenCL.
#26 Updated by Szilárd Páll over 1 year ago
Note again that AFAICT these runs do DD so they do not use PME OpenCL.
Confirmed:
- dd121 uses 12 ranks, 3x4x1 decomposition
- nbnxn-ljpme-geometric uses 2 ranks
hence there is only nonbonded offload in the configurations in question that do fail intermittently on bs-gpu01.
#27 Updated by Szilárd Páll over 1 year ago
Update: I now have one instance where a manually run test failed on bs-gpu01. However, I've not been able to reproduce it on another dev machine with ROCm yet; I've started a long loop of tests to see what that gives.
#28 Updated by Paul Bauer over 1 year ago
- Status changed from Resolved to Feedback wanted
- Target version changed from 2019.3 to 2019.4
There still seems to be something going on, but the release build seems to pass now.
#29 Updated by Paul Bauer over 1 year ago
- Target version changed from 2019.4 to 2019.5
is this still an issue? bumping
#30 Updated by Szilárd Páll over 1 year ago
- Target version changed from 2019.5 to 2020-beta2
Re-targeting to 2020; a release build check on master is running here: http://jenkins.gromacs.org/view/Release/job/Release_workflow_master/599/
#31 Updated by Szilárd Páll over 1 year ago
- File nbnxn_vsite.tar.gz nbnxn_vsite.tar.gz added
#32 Updated by Szilárd Páll over 1 year ago
- Status changed from Feedback wanted to Accepted
I cannot reproduce this on NVIDIA or Intel, only on AMD.
#33 Updated by Berk Hess over 1 year ago
Could it be that this is the same issue that I fixed here?
https://gerrit.gromacs.org/c/gromacs/+/13560
#34 Updated by Berk Hess over 1 year ago
Berk Hess wrote:
Could it be that this is the same issue that I fixed here?
https://gerrit.gromacs.org/c/gromacs/+/13560
The issue I fixed only concerned PME on a GPU with a PME-only rank. That is not the case here.
#35 Updated by Berk Hess over 1 year ago
The output in nbnxn_vsite.tar.gz shows that the LJ and Coulomb pair energies are smaller by 2% and 20% at step 0 already. So likely pair interactions are missing.
#36 Updated by Berk Hess over 1 year ago
Berk Hess wrote:
The output in nbnxn_vsite.tar.gz shows that the LJ and Coulomb pair energies are smaller by 2% and 20% at step 0 already. So likely pair interactions are missing.
Maybe some non-local interactions are missing due to a synchronization issue?
#37 Updated by Berk Hess over 1 year ago
One failure on Jenkins has a massive difference in the LJ energy at step 0:
LJ (SR) step 0: 1944.39, step 0: 1113.65
Coulomb (SR) step 0: -15234.3, step 0: -15265.1
This difference corresponds to 99.6% of the local LJ energy of one of the two zones. This could be coincidence. But it does show that the difference must come partially, or maybe fully, from missing local interactions. My suspicion is that the local energies, and probably forces, are used before the local kernel has finished.
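To make the suspected ordering problem concrete, here is a minimal host-side sketch (illustrative only, not the actual GROMACS code path; the function and parameter names are hypothetical). If the energy read-back does not depend on the local nonbonded kernel having completed, the host can observe stale or partial values, which would look exactly like missing local interactions at step 0.

#include <CL/cl.h>

/* Illustrative sketch of the suspected ordering bug; all names are hypothetical. */
void readLocalEnergies(cl_command_queue queue, cl_kernel localNbKernel,
                       cl_mem d_energies, float h_energies[2],
                       size_t globalSize, size_t localSize)
{
    cl_event localDone;
    clEnqueueNDRangeKernel(queue, localNbKernel, 1, NULL,
                           &globalSize, &localSize, 0, NULL, &localDone);

    /* Buggy pattern: a non-blocking read issued without a dependency on
     * localDone (e.g. on a different queue) can return the energy buffer
     * before the local kernel has finished writing it. */
    /*
    clEnqueueReadBuffer(otherQueue, d_energies, CL_FALSE, 0, 2 * sizeof(float),
                        h_energies, 0, NULL, NULL);
    */

    /* Safe pattern: make the read-back wait on the kernel event (or use a
     * blocking read on the same in-order queue). */
    clEnqueueReadBuffer(queue, d_energies, CL_TRUE, 0, 2 * sizeof(float),
                        h_energies, 1, &localDone, NULL);
    clReleaseEvent(localDone);
}

GROMACS schedules this differently, of course; the sketch is only meant to show the class of bug that the step-0 energy numbers above would be consistent with.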
#38 Updated by Szilárd Páll over 1 year ago
no progress, suggest bump before beta2
#39 Updated by Paul Bauer about 1 year ago
- Target version changed from 2020-beta2 to 2020-beta3
bump
#41 Updated by Paul Bauer about 1 year ago
at least in 2020 this seems to be resolved with https://gerrit.gromacs.org/c/gromacs/+/14623 and efca6b742351c6b66ded94d8c8a75874381737bf
#42 Updated by Paul Bauer about 1 year ago
- Status changed from Accepted to Resolved
#43 Updated by Paul Bauer about 1 year ago
- Status changed from Resolved to Closed
#44 Updated by Szilárd Páll 11 months ago
- Related to Bug #3405: intermittent OpenCL regressiontest failures added
Temporary fix for OpenCL PME gather
There is a race on the z-component of the PME forces in the OpenCL
force reduction in the gather kernel. This change avoids that race.
But a better solution is a different, more efficient reduction.
Refs #2737
Change-Id: I45068c9187873548dff585044d2c8541444e385c
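For reference, the race described in this commit message is of the following general shape (a minimal sketch only, not the actual PME gather kernel; the kernel name, work-group size and per-work-item contribution are made up): one work-item reads partial z-force values from local memory that other work-items of the same work-group may still be writing, because no barrier separates the writes from the reads.

/* Minimal sketch of a local-memory reduction race; not the GROMACS gather kernel. */
__kernel void gatherReduceSketch(__global float *g_forceZ)
{
    __local float sm_fz[64];            /* assumes a work-group size of 64 */
    const int tid = (int)get_local_id(0);

    sm_fz[tid] = (float)tid;            /* stand-in for a per-work-item fz contribution */

    /* Without this barrier, work-item 0 below may read sm_fz entries that
     * other work-items have not yet written -- the kind of race the
     * temporary fix avoids. */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (tid == 0)
    {
        float fz = 0.0f;
        for (int i = 0; i < 64; i++)
        {
            fz += sm_fz[i];
        }
        g_forceZ[get_group_id(0)] = fz;
    }
}

A more efficient approach, as the message notes, would be to restructure the reduction itself (for example a tree reduction with barriers between stages) rather than serializing it on one work-item.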