AMD OpenCL fails release build in complex tests
The release build linked here failed two of the complex tests for AMD OpenCL: http://jenkins.gromacs.org/view/Release/job/Release_workflow_master/170/consoleFull
00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkforce.out (1 errors) file(s) in orientation-restraints for orientation-restraints
00:42:32.391 [[gcc-8, gpuhw=amd, opencl-1.2, release]] FAILED. Check checkpot.out (4 errors), checkforce.out (3 errors) file(s) in tip4p_continue for tip4p_continue
Temporary fix for OpenCL PME gather
There is a race on the z-component of the PME forces in the OpenCL
force reduction in the gather kernel. This change avoids that race.
But a better solution is a different, more efficient reduction.
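For readers unfamiliar with this class of bug, here is a generic sketch of how a shared-memory tree reduction goes wrong when a step reads its partner's slot before the previous step's write is visible. It is plain Python emulating one bad schedule, not the GROMACS gather kernel. The function name tree_reduce and the stale-read model are assumptions made for illustration, and nothing here says that a missing barrier is the actual cause of the race.

```python
def tree_reduce(values, use_barrier):
    """Emulate a work-group tree reduction over a shared array.

    len(values) must be a power of two.
    use_barrier=True:  each step sees every write of the previous step.
    use_barrier=False: partner reads see the array as it was before the
                       previous step (one possible bad schedule).
    """
    shared = list(values)
    stale = list(values)  # snapshot taken before the most recent step
    stride = len(shared) // 2
    while stride > 0:
        source = shared if use_barrier else stale
        updated = list(shared)
        for i in range(stride):
            # "Thread" i adds its partner's partial sum to its own.
            updated[i] = shared[i] + source[i + stride]
        stale = shared
        shared = updated
        stride //= 2
    return shared[0]


partials = [1, 2, 3, 4, 5, 6, 7, 8]
correct = tree_reduce(partials, use_barrier=True)
racy = tree_reduce(partials, use_barrier=False)
print("with synchronization:   ", correct)
print("without synchronization:", racy)
```

On real hardware the wrong result would appear only for some schedules, which fits a failure that shows up rarely.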
#2 Updated by Szilárd Páll almost 2 years ago
- Status changed from New to Feedback wanted
As far as I can tell, the AMD OpenCL build runs on the ancient AMD slave which we should not use anymore -- it has a very outdated software stack (and hardware) which may or may not function correctly.
Hence, I'm not sure this is a bug, but I'll wait with closing.
That console output is rather confusing, BTW; is there a way to get separate outputs from separate builds?
#3 Updated by Mark Abraham almost 2 years ago
Yes, you can get the console log for individual builds if you look through the "Pipeline steps", but it's not exactly convenient. Thus
http://jenkins.gromacs.org/job/Release_workflow_master/173/execution/node/107/log/?consoleFull shows that the new AMD GPU slave also has an issue. That issue seems more like OpenCL builds not being able to jit, but we'll need to inspect the mdrun.out files.
#6 Updated by Szilárd Páll almost 2 years ago
On the dev-gpu02 machine it took me half a day and thousands of passes to reproduce a failure of the orientation restraints test, but this doesn't seem to show the same kind of error as the others (wrong force Z component); attached checkforce.out.
On the build slave I could indeed reproduce the failure within 100-200 attempts, and with the fix I have already completed ~350 iterations, so the reduction error seems to be solved on AMD.
I'm still going to keep running the tests because the former failure is suspicious.
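As a rough check on how much the ~350 clean iterations say, here is a small sketch. It reads the reproduction rate above as roughly one failure per 100 to 200 attempts (an interpretation of the figure in this comment) and assumes the runs are independent with a constant failure rate, which may not hold on a shared build machine. The helper name prob_all_pass is made up for this example.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass when each run
    fails with probability `failure_rate` (i.e. the bug is still there)."""
    return (1.0 - failure_rate) ** runs


runs_completed = 350
for attempts_per_failure in (100, 200):
    rate = 1.0 / attempts_per_failure
    chance = prob_all_pass(rate, runs_completed)
    print(f"1 failure per {attempts_per_failure} runs: "
          f"chance of {runs_completed} clean runs without a fix = {chance:.3f}")
```

Under these assumptions, a few hundred clean runs make the fix plausible, but they do not rule out a rarer residual failure, so it makes sense to keep the loop running.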
#9 Updated by Szilárd Páll almost 2 years ago
Paul Bauer wrote:
The release build has an additional failure in the complex/dd121 case, with the errors in checkforce.out
This output doesn't seem to tell much, except that the errors do not resemble the previous ones. I can't repro this, although what I ran for 5+ days on both a dev machine and bs-gpu01 was the orientation restraints tests. Now I have all nbnxn tests running, for a few hours, but so far zero errors, so I'm puzzled.
#10 Updated by Berk Hess almost 2 years ago
But these errors are very weird: most coordinates have a massive mismatch in only x, y or z. Some have a mismatch in 2 dimensions. So the coordinate data seems to have gotten corrupted. It might be possible that a single, varying component of one force contribution gets very large values.
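A per-component tally of this kind can be sketched as follows. The coordinates are made-up values purely for illustration, not data from the failing run, and the function name mismatched_components and the tolerance are assumptions.

```python
def mismatched_components(reference, test, tolerance):
    """Return {atom index: [axis labels]} for every atom where at least one
    coordinate component differs by more than `tolerance`."""
    axes = "xyz"
    report = {}
    for atom, (ref, new) in enumerate(zip(reference, test)):
        bad = [axes[d] for d in range(3) if abs(ref[d] - new[d]) > tolerance]
        if bad:
            report[atom] = bad
    return report


# Hypothetical coordinates (nm); atom 1 is off in y only, atom 2 in x and z.
reference = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]
test = [(1.0, 2.0, 3.0), (4.0, 55.0, 6.0), (70.0, 8.0, 90.0)]

report = mismatched_components(reference, test, tolerance=0.01)
for atom, bad_axes in report.items():
    print(f"atom {atom}: mismatch in {', '.join(bad_axes)}")
```

Counting how many atoms are off in one, two or three components makes the pattern described above easier to see at a glance.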
#11 Updated by Berk Hess almost 2 years ago
Force errors cannot explain this, since some atoms that are constrained are affected, whereas the atoms they are constrained to are not. So either the coordinate update in the constraints is buggy, or the copy-back of constrained coordinates to the state is, or the coordinates get corrupted in between constraining and trajectory writing.
#12 Updated by Szilárd Páll almost 2 years ago
As noted on gerrit, AFAICT the reason for the failures is that this verification has been buggy / not updated to the right branch and has been running master code.
Assuming the underlying issue is still the already fixed rare OpenCL gather reduction (which was made less rare by me running two concurrent regressiontests in an infinite loop on GPU0 on the AMD slave) it's still intriguing the kind of errors that the above output file shows, but given that I've been unable to reproduce any of these with code other than with master (without the fix), I think we can assume the issue is a false positive until proven otherwise.
#13 Updated by Szilárd Páll almost 2 years ago
Update: so it seems that the release verification only triggers master builds in some cases, at least in some manually triggered builds it doesn't; e.g. http://jenkins.gromacs.org/job/Release_workflow_master/236
Note that the release verification scripts work differently and call make check with the regressiontest path passed to cmake (not clear at all until you dig through 13Mb of log :), so no direct gmxtest invocation. Consequently, the automatic launch results in N-way decomposition for the dd121 test that failed in my repro case -- so no OpenCL PME here. Instead, LJ/coul SR energies are off already at step 0. Full output attached.
#20 Updated by Szilárd Páll over 1 year ago
The orientation restraints test fails on my laptop with an Intel iGPU and only nonbondeds offloaded, with some subtle checkforce (xfv) errors, so quite likely not PME OpenCL related (possibly not even OpenCL-related?). This however did not reproduce on either NVIDIA or AMD with or without PME offload. The tip4p error I've still not managed to repro so far.
#25 Updated by Szilárd Páll over 1 year ago
Paul Bauer wrote:
this seems to have been resolved some time in the past. Can be kept open if we need to investigate further in the future.
I do not think it has. I've retriggered your change (PS11 when it was still based on the OpenCL gather fix) and had 1 or 3 failures; here's the failing case on both
Note again that AFAICT these runs do DD so they do not use PME OpenCL.
#30 Updated by Szilárd Páll 12 months ago
- Target version changed from 2019.5 to 2020-beta2
re-targeting to 2020, release build on master running check here: http://jenkins.gromacs.org/view/Release/job/Release_workflow_master/599/
One failure on jenkins has a massive difference in LJ energy at step 0:
LJ (SR) step 0: 1944.39, step 0: 1113.65
Coulomb (SR) step 0: -15234.3, step 0: -15265.1
This difference corresponds to 99.6% of the local LJ energy of one of the two zones. This could be coincidence. But it does show that the difference must come partially, or maybe fully, from missing local interactions. My suspicion is that the local energies, and probably forces, are used before the local kernel has finished.
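The size of the mismatch can be worked out directly from the two columns quoted above. The sketch below assumes that "99.6% of the local LJ energy of one zone" means the LJ difference equals 0.996 times that zone's energy. The zone energy itself is not given in this thread, so the implied value is only a back-calculation, and which column is the reference run is not stated either.

```python
# Step-0 values copied from the comparison above (two runs, order as quoted).
lj_sr = (1944.39, 1113.65)
coulomb_sr = (-15234.3, -15265.1)

lj_diff = lj_sr[0] - lj_sr[1]
coulomb_diff = coulomb_sr[0] - coulomb_sr[1]

# Assumption: lj_diff = 0.996 * (local LJ energy of one zone).
implied_zone_lj = abs(lj_diff) / 0.996

print(f"LJ (SR) difference:        {lj_diff:.2f}")
print(f"Coulomb (SR) difference:   {coulomb_diff:.2f}")
print(f"implied zone LJ energy:    {implied_zone_lj:.2f}")
print(f"LJ difference relative to first value: {lj_diff / lj_sr[0]:.1%}")
```

The LJ difference is a large fraction of the total, while the Coulomb difference is comparatively small, which is what makes a whole missing contribution a more natural explanation than accumulated rounding.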