Task #2819

Task #2899: Update testing matrix versions for GROMACS 2020 release

figure out latest clang + native CUDA that works on our hardware

Added by Szilárd Páll 10 months ago. Updated about 2 months ago.

Status: Closed
Priority: Low
Category: testing
Difficulty: uncategorized

Description

Tried to enable clang 7 native CUDA with CUDA 9.1/9.2 builds in Jenkins and kept running into failing tests (see https://gerrit.gromacs.org/#/c/8663/). The failures suggest that it's a PME compilation issue.
I could not reproduce on CC 3.5, 5.2, and 6.1 hardware outside of Jenkins and given that the build slaves have rather old 3.0 hardware, I suspect this may be an issue specific to that arch.

TODO:
  • Confirm the issue and flag it in CMake.
  • Figure out which clang version we can use without a hardware upgrade.
  • (Consider a hardware upgrade; CC 3.0 is really dated anyway.)

Related issues

Related to GROMACS - Task #3011: misc upgrades of testing matrices (Closed)

Associated revisions

Revision b3c35087 (diff)
Added by Szilárd Páll about 2 months ago

Work around clang CUDA device code codegen bug

Some of the PME kernels (namely PME spread and solve) get miscompiled
when clang native device code compilation is used with assertions and
optimization on, causing errors in RelWithAssert builds.
As a workaround, this change disables optimization for these kernels when
assertions are enabled (and only if the device compiler is clang).

Refs #2819

Change-Id: I815fc86116ffb57c4d5803ce7fa4d260909ae7ba
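
A rough sketch of how a workaround of this kind could be expressed per function. It assumes clang's `optnone` function attribute and the `__CUDA__` macro that clang defines in CUDA mode. The macro name `EXAMPLE_CLANG_CUDA_OPTNONE` and the function below are hypothetical, and the actual change may well use a different mechanism (for example, compiler flags set from CMake).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro (not from the GROMACS source): expands to clang's
// optnone attribute only when clang compiles CUDA code with assertions
// enabled, and to nothing in every other configuration.
#if defined(__clang__) && defined(__CUDA__) && !defined(NDEBUG)
#    define EXAMPLE_CLANG_CUDA_OPTNONE __attribute__((optnone))
#else
#    define EXAMPLE_CLANG_CUDA_OPTNONE
#endif

// Stand-in for a function that should not be optimized in the affected
// configuration; the body is illustrative only.
EXAMPLE_CLANG_CUDA_OPTNONE
float exampleSumOfSquares(const float* values, std::size_t count)
{
    assert(values != nullptr || count == 0);
    float sum = 0.0F;
    for (std::size_t i = 0; i < count; ++i)
    {
        sum += values[i] * values[i];
    }
    return sum;
}
```

With any other compiler, or with `NDEBUG` defined, the macro is empty and the function is compiled as usual.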

Revision 922e24d5 (diff)
Added by Szilárd Páll about 2 months ago

Bump clang-cuda post-submit compiler/CUDA version

clang 8 + CUDA 10 is the latest working setup

Fixes #2819
Refs #3006

Change-Id: I3dd0bb5c667d2295593178d445260beaf0509277

History

#1 Updated by Szilárd Páll 10 months ago

  • Description updated (diff)

Note the low priority: just bump to 2019.1 if not addressed before release.

#2 Updated by Szilárd Páll 10 months ago

  • Tracker changed from Bug to Task
  • Status changed from New to In Progress
  • Affected version deleted (2019-rc1)

#3 Updated by Szilárd Páll 10 months ago

Looks like clang 6 + CUDA 9 doesn't work either on CC 3.0.

#4 Updated by Paul Bauer 10 months ago

  • Target version changed from 2019 to 2019.1

bumped

#5 Updated by Szilárd Páll 10 months ago

Update: I've been making the mistake of compiling in Release mode, in which all tests pass. However, the Jenkins config is RelWithAssert, and with that I can also confirm that the PME-GPU tests fail.

#6 Updated by Mark Abraham 10 months ago

Ok. What's the next move?

#7 Updated by Szilárd Páll 10 months ago

Mark Abraham wrote:

Ok. What's the next move?

Identifying what is different between RelWithAssert and other build types that can/does make unit tests fail. Suggestions would be welcome.

Alternatively we can flag/ignore broken RelWithAssert if cuda-clang is not considered important enough.

#8 Updated by Szilárd Páll 10 months ago

Experiment 1. Added NDEBUG around PME solve code to see if this eliminates the failure.

$ git diff | tail -n1000
diff --git a/src/gromacs/ewald/pme-gpu-internal.cpp b/src/gromacs/ewald/pme-gpu-internal.cpp
index 2a88a02..8271b27 100644
--- a/src/gromacs/ewald/pme-gpu-internal.cpp
+++ b/src/gromacs/ewald/pme-gpu-internal.cpp
@@ -1098,6 +1098,7 @@ void pme_gpu_spread(const PmeGpu    *pmeGpu,
 void pme_gpu_solve(const PmeGpu *pmeGpu, t_complex *h_grid,
                    GridOrdering gridOrdering, bool computeEnergyAndVirial)
 {
+#define NDEBUG
     const bool   copyInputAndOutputGrid = pme_gpu_is_testing(pmeGpu) || !pme_gpu_performs_FFT(pmeGpu);

     auto        *kernelParamsPtr = pmeGpu->kernelParams.get();
@@ -1199,6 +1200,7 @@ void pme_gpu_solve(const PmeGpu *pmeGpu, t_complex *h_grid,
                              0, pmeGpu->archSpecific->complexGridSize,
                              pmeGpu->archSpecific->pmeStream, pmeGpu->settings.transferKind, nullptr);
     }
+#undef NDEBUG
 }

 void pme_gpu_gather(PmeGpu                *pmeGpu,
diff --git a/src/gromacs/ewald/pme-solve.cu b/src/gromacs/ewald/pme-solve.cu
index c00ec9d..28419c5 100644
--- a/src/gromacs/ewald/pme-solve.cu
+++ b/src/gromacs/ewald/pme-solve.cu
@@ -49,6 +49,8 @@

 #include "pme.cuh" 

+#define NDEBUG
+
 /*! \brief
  * PME complex grid solver kernel function.
  *
diff --git a/src/gromacs/ewald/tests/pmesolvetest.cpp b/src/gromacs/ewald/tests/pmesolvetest.cpp
index e7549d8..542acfd 100644
--- a/src/gromacs/ewald/tests/pmesolvetest.cpp
+++ b/src/gromacs/ewald/tests/pmesolvetest.cpp
@@ -54,6 +54,8 @@

 #include "pmetestcommon.h" 

+#define NDEBUG
+
 namespace gmx
 {
 namespace test

This still produces errors:

$ bin/ewald-test --gtest_filter="SaneInput/PmeSolveTest.ReproducesOutputs/0*" 
Note: Google Test filter = SaneInput/PmeSolveTest.ReproducesOutputs/0*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SaneInput/PmeSolveTest
[ RUN      ] SaneInput/PmeSolveTest.ReproducesOutputs/0
/home/pszilard/projects/gromacs/gromacs-19/src/testutils/refdata.cpp:929: Failure
   In item: /Virial/Cell 0 0
    Actual: -0.11273560672998428
 Reference: 8.2591867446899414
Difference: 8.37192 (2129332110 single-prec. ULPs, rel. 1.01), signs differ
 Tolerance: abs. 0.00286102, 24 ULPs
Google Test trace:
/home/pszilard/projects/gromacs/gromacs-19/src/gromacs/ewald/tests/pmesolvetest.cpp:143: Testing solving (Coulomb, YZX, with energy/virial) with GPU (GPU #0: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible) for PME grid size 16 12 28, Ewald coefficients 2 0.7
/home/pszilard/projects/gromacs/gromacs-19/src/testutils/refdata.cpp:929: Failure
   In item: /Virial/Cell 0 0
    Actual: -0.11273560672998428
 Reference: 8.2591867446899414
Difference: 8.37192 (2129332110 single-prec. ULPs, rel. 1.01), signs differ
 Tolerance: abs. 0.00286102, 24 ULPs
Google Test trace:
/home/pszilard/projects/gromacs/gromacs-19/src/gromacs/ewald/tests/pmesolvetest.cpp:143: Testing solving (Coulomb, YZX, with energy/virial) with GPU (GPU #1: NVIDIA GeForce GTX 960, compute cap.: 5.2, ECC:  no, stat: compatible) for PME grid size 16 12 28, Ewald coefficients 2 0.7
[  FAILED  ] SaneInput/PmeSolveTest.ReproducesOutputs/0, where GetParam() = ({ 8, 0, 0, 0, 3.4, 0, 0, 0, 2 }, 12-byte object <10-00 00-00 0C-00 00-00 1C-00 00-00>, { (12-byte object <00-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 60-40 66-66 D6-40>), (12-byte object <07-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 20-C0 33-33 33-BF>), (12-byte object <03-00 00-00 05-00 00-00 07-00 00-00>, 8-byte object <A6-9B C4-BB 77-CC 2B-32>), (12-byte object <03-00 00-00 01-00 00-00 02-00 00-00>, 8-byte object <9A-99 19-3F CD-CC FC-40>), (12-byte object <06-00 00-00 02-00 00-00 04-00 00-00>, 8-byte object <CD-CC F0-41 CD-CC 1C-40>) }, 1.2, 2, 0.7, 4-byte object <00-00 00-00>) (192 ms)
[----------] 1 test from SaneInput/PmeSolveTest (192 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (460 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SaneInput/PmeSolveTest.ReproducesOutputs/0, where GetParam() = ({ 8, 0, 0, 0, 3.4, 0, 0, 0, 2 }, 12-byte object <10-00 00-00 0C-00 00-00 1C-00 00-00>, { (12-byte object <00-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 60-40 66-66 D6-40>), (12-byte object <07-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 20-C0 33-33 33-BF>), (12-byte object <03-00 00-00 05-00 00-00 07-00 00-00>, 8-byte object <A6-9B C4-BB 77-CC 2B-32>), (12-byte object <03-00 00-00 01-00 00-00 02-00 00-00>, 8-byte object <9A-99 19-3F CD-CC FC-40>), (12-byte object <06-00 00-00 02-00 00-00 04-00 00-00>, 8-byte object <CD-CC F0-41 CD-CC 1C-40>) }, 1.2, 2, 0.7, 4-byte object <00-00 00-00>)

 1 FAILED TEST

However, compiling the whole binary with NDEBUG does produce passing tests:

$ cmake . -DCMAKE_CXX_FLAGS='-DNDEBUG' && make ewald-test
$ bin/ewald-test --gtest_filter="SaneInput/PmeSolveTest.ReproducesOutputs/0*" 
Note: Google Test filter = SaneInput/PmeSolveTest.ReproducesOutputs/0*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SaneInput/PmeSolveTest
[ RUN      ] SaneInput/PmeSolveTest.ReproducesOutputs/0
[       OK ] SaneInput/PmeSolveTest.ReproducesOutputs/0 (131 ms)
[----------] 1 test from SaneInput/PmeSolveTest (131 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (320 ms total)
[  PASSED  ] 1 test.
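
One possible explanation for the difference (an assumption, not confirmed in this thread) is that the standard `assert` macro reads `NDEBUG` only at the point where `<cassert>` is included. A `#define NDEBUG` placed after the includes, as in the diff above, would then leave standard assertions active. The function names in this minimal sketch are hypothetical:

```cpp
#undef NDEBUG      // make the starting state explicit
#include <cassert> // assert is configured here, with NDEBUG undefined

#define NDEBUG // too late: the include above has already been processed

// Returns true because the assert expression is still evaluated.
bool lateDefineLeavesAssertActive()
{
    bool evaluated = false;
    assert((evaluated = true));
    return evaluated;
}

#include <cassert> // re-including re-reads NDEBUG, so assert is now a no-op

// Returns true because the assert expression is no longer evaluated.
bool reincludeDisablesAssert()
{
    bool evaluated = false;
    assert((evaluated = true));
    return !evaluated;
}

#undef NDEBUG
#include <cassert> // restore active assertions for any code that follows
```

Passing `-DNDEBUG` on the command line, as in the cmake invocation above, defines the macro before any header is read, so it takes effect everywhere.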

#9 Updated by Szilárd Páll 10 months ago

It seems that some in-kernel assertions are what screw up code generation (?); if I remove all three assertions nested in the data-dependent if (notZeroPoint) condition (source:src/gromacs/ewald/pme-solve.cu#L206), the errors are gone.

Thoughts anyone?
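
For readers who have not seen the code, the pattern described above has roughly the following shape. This is a host-side sketch with hypothetical names, not the GROMACS kernel. It only illustrates that active assertions add extra conditional failure paths inside a branch that is itself data-dependent.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the pattern: three assertions nested in a
// data-dependent branch. With NDEBUG defined the asserts disappear and
// only the division remains inside the branch.
float exampleScaleGridValue(float gridValue, float denom, bool notZeroPoint)
{
    float result = 0.0F;
    if (notZeroPoint)
    {
        assert(std::isfinite(denom));
        assert(denom != 0.0F);
        result = gridValue / denom;
        assert(std::isfinite(result));
    }
    return result;
}
```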

#10 Updated by Mark Abraham 9 months ago

  • Target version changed from 2019.1 to 2020

Retargeting to 2020, as this is not a user-facing issue.

#11 Updated by Szilárd Páll 3 months ago

  • Assignee set to Szilárd Páll

The issue is eliminated by clang 8; I'll upgrade the post-submit matrix to it.

#12 Updated by Mark Abraham 3 months ago

  • Target version changed from 2020 to 2020-infrastructure-stable

#13 Updated by Mark Abraham 3 months ago

  • Parent task set to #2899

#14 Updated by Szilárd Páll 3 months ago

There's one remaining issue: clang seems to emit intermediate PTX that the NVIDIA PTX compiler (ptxas) doesn't like, and ptxas emits these warnings:

 ptxas warning :  .debug_abbrev section not found
 ptxas warning :  .debug_info section not found

We should either suppress these (e.g. in the jenkins parser) or figure out how to disable them.
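
As an illustration of the first option, a filter that recognizes exactly these two messages could look like the sketch below. It is written in C++ with `std::regex` rather than in the Groovy used by the Jenkins parser, and the function name is hypothetical. It matches the two messages positively, so the remaining warnings can be selected by negating the result.

```cpp
#include <regex>
#include <string>

// Returns true for the two ptxas debug-section warnings quoted above and
// false for every other line. Whitespace around the colon is matched
// loosely because the log output pads it.
bool isIgnorablePtxasWarning(const std::string& line)
{
    static const std::regex pattern(
            R"(ptxas warning\s*:\s*\.debug_(abbrev|info) section not found)");
    return std::regex_search(line, pattern);
}
```

A caller would drop every line for which the function returns true and pass all other warnings through unchanged.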

#15 Updated by Szilárd Páll 3 months ago

I've started to implement an exception mechanism for these specific warnings in the Jenkins compiler warnings Groovy parser, but as I'm a Groovy noob I ran into hiccups and looked for an alternative. That's when I realized that we don't need debug symbols, just assertions, so we can use a RelWithAssert build type. Will switch the matrix.

#16 Updated by Szilárd Páll 3 months ago

Szilárd Páll wrote:

I've started to implement an exception mechanism for these specific warnings in the Jenkins compiler warnings Groovy parser, but as I'm a Groovy noob I ran into hiccups and looked for an alternative. That's when I realized that we don't need debug symbols, just assertions, so we can use a RelWithAssert build type. Will switch the matrix.

I take it back: PME is still incorrect with optimizations on, and in debug mode we get these warnings that I can't figure out how to skip in a sensible way (i.e. not a sequence of negated characters of the ".debug_abbrev section not found" string). Thoughts?

I would prefer not to have to debug PME code, but it might be useful to know why clang is miscompiling it.

#17 Updated by Mark Abraham 2 months ago

Szilárd Páll wrote:

Szilárd Páll wrote:

I've started to implement an exception mechanism for these specific warnings in the Jenkins compiler warnings Groovy parser, but as I'm a Groovy noob I ran into hiccups and looked for an alternative. That's when I realized that we don't need debug symbols, just assertions, so we can use a RelWithAssert build type. Will switch the matrix.

I take it back: PME is still incorrect with optimizations on, and in debug mode we get these warnings that I can't figure out how to skip in a sensible way (i.e. not a sequence of negated characters of the ".debug_abbrev section not found" string). Thoughts?

I would prefer not to have to debug PME code, but it might be useful to know why clang is miscompiling it.

The virtue of the clang_cuda build is clang compiling the device code. It's not essential to do so with optimizations on. Given we already have conditionality in the CMake code for clang_cuda build, we could find a way to append -O0 or something.

#18 Updated by Mark Abraham 2 months ago

Mark Abraham wrote:

The virtue of the clang_cuda build is clang compiling the device code. It's not essential to do so with optimizations on. Given we already have conditionality in the CMake code for clang_cuda build, we could find a way to append -O0 or something.

Or just test this config in Debug mode. We don't care about perf or codegen for this configuration.

#19 Updated by Szilárd Páll 2 months ago

  • Category set to testing
  • Status changed from In Progress to Fix uploaded

#20 Updated by Szilárd Páll about 2 months ago

  • Related to Task #3011: misc upgrades of testing matrices added

#21 Updated by Szilárd Páll about 2 months ago

  • Status changed from Fix uploaded to Resolved

#22 Updated by Szilárd Páll about 2 months ago

  • Status changed from Resolved to Closed
