Task #2819

figure out latest clang + native CUDA that works on our hardware

Added by Szilárd Páll 3 months ago. Updated 2 months ago.

Status: In Progress
Priority: Low
Assignee: -
Category: -
Target version:
Difficulty: uncategorized

Description

Tried to enable clang 7 native CUDA with CUDA 9.1/9.2 builds in Jenkins and kept running into failing tests (see https://gerrit.gromacs.org/#/c/8663/). The failures suggest that it's a PME compilation issue.
I could not reproduce the failures on compute capability (CC) 3.5, 5.2, and 6.1 hardware outside of Jenkins. Given that the build slaves have rather old CC 3.0 hardware, I suspect this may be an issue specific to that architecture.

TODO:
  • Confirm the issue and flag it in CMake.
  • Figure out which clang version we can use without a hardware upgrade.
  • (Consider a hardware upgrade; CC 3.0 is really dated anyway.)

History

#1 Updated by Szilárd Páll 3 months ago

  • Description updated (diff)

Note the low prio, just bump to .1 if not addressed before release.

#2 Updated by Szilárd Páll 3 months ago

  • Tracker changed from Bug to Task
  • Status changed from New to In Progress
  • Affected version deleted (2019-rc1)

#3 Updated by Szilárd Páll 3 months ago

Looks like clang 6 + CUDA 9 doesn't work either on CC 3.0.

#4 Updated by Paul Bauer 3 months ago

  • Target version changed from 2019 to 2019.1

bumped

#5 Updated by Szilárd Páll 3 months ago

Update: I've been making the mistake of compiling in Release mode, in which all tests pass. However, the Jenkins config is RelWithAssert, and with that build type I can also confirm that the PME-GPU tests fail.

#6 Updated by Mark Abraham 3 months ago

Ok. What's the next move?

#7 Updated by Szilárd Páll 3 months ago

Mark Abraham wrote:

Ok. What's the next move?

Identifying what differs between RelWithAssert and the other build types that can make (or does make) the unit tests fail. Suggestions would be welcome.

Alternatively we can flag/ignore broken RelWithAssert if cuda-clang is not considered important enough.

#8 Updated by Szilárd Páll 3 months ago

Experiment 1: added #define NDEBUG around the PME solve code to see whether this eliminates the failure.

$ git diff | tail -n1000
diff --git a/src/gromacs/ewald/pme-gpu-internal.cpp b/src/gromacs/ewald/pme-gpu-internal.cpp
index 2a88a02..8271b27 100644
--- a/src/gromacs/ewald/pme-gpu-internal.cpp
+++ b/src/gromacs/ewald/pme-gpu-internal.cpp
@@ -1098,6 +1098,7 @@ void pme_gpu_spread(const PmeGpu    *pmeGpu,
 void pme_gpu_solve(const PmeGpu *pmeGpu, t_complex *h_grid,
                    GridOrdering gridOrdering, bool computeEnergyAndVirial)
 {
+#define NDEBUG
     const bool   copyInputAndOutputGrid = pme_gpu_is_testing(pmeGpu) || !pme_gpu_performs_FFT(pmeGpu);

     auto        *kernelParamsPtr = pmeGpu->kernelParams.get();
@@ -1199,6 +1200,7 @@ void pme_gpu_solve(const PmeGpu *pmeGpu, t_complex *h_grid,
                              0, pmeGpu->archSpecific->complexGridSize,
                              pmeGpu->archSpecific->pmeStream, pmeGpu->settings.transferKind, nullptr);
     }
+#undef NDEBUG
 }

 void pme_gpu_gather(PmeGpu                *pmeGpu,
diff --git a/src/gromacs/ewald/pme-solve.cu b/src/gromacs/ewald/pme-solve.cu
index c00ec9d..28419c5 100644
--- a/src/gromacs/ewald/pme-solve.cu
+++ b/src/gromacs/ewald/pme-solve.cu
@@ -49,6 +49,8 @@

 #include "pme.cuh" 

+#define NDEBUG
+
 /*! \brief
  * PME complex grid solver kernel function.
  *
diff --git a/src/gromacs/ewald/tests/pmesolvetest.cpp b/src/gromacs/ewald/tests/pmesolvetest.cpp
index e7549d8..542acfd 100644
--- a/src/gromacs/ewald/tests/pmesolvetest.cpp
+++ b/src/gromacs/ewald/tests/pmesolvetest.cpp
@@ -54,6 +54,8 @@

 #include "pmetestcommon.h" 

+#define NDEBUG
+
 namespace gmx
 {
 namespace test

This still produces errors:

$ bin/ewald-test --gtest_filter="SaneInput/PmeSolveTest.ReproducesOutputs/0*" 
Note: Google Test filter = SaneInput/PmeSolveTest.ReproducesOutputs/0*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SaneInput/PmeSolveTest
[ RUN      ] SaneInput/PmeSolveTest.ReproducesOutputs/0
/home/pszilard/projects/gromacs/gromacs-19/src/testutils/refdata.cpp:929: Failure
   In item: /Virial/Cell 0 0
    Actual: -0.11273560672998428
 Reference: 8.2591867446899414
Difference: 8.37192 (2129332110 single-prec. ULPs, rel. 1.01), signs differ
 Tolerance: abs. 0.00286102, 24 ULPs
Google Test trace:
/home/pszilard/projects/gromacs/gromacs-19/src/gromacs/ewald/tests/pmesolvetest.cpp:143: Testing solving (Coulomb, YZX, with energy/virial) with GPU (GPU #0: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC:  no, stat: compatible) for PME grid size 16 12 28, Ewald coefficients 2 0.7
/home/pszilard/projects/gromacs/gromacs-19/src/testutils/refdata.cpp:929: Failure
   In item: /Virial/Cell 0 0
    Actual: -0.11273560672998428
 Reference: 8.2591867446899414
Difference: 8.37192 (2129332110 single-prec. ULPs, rel. 1.01), signs differ
 Tolerance: abs. 0.00286102, 24 ULPs
Google Test trace:
/home/pszilard/projects/gromacs/gromacs-19/src/gromacs/ewald/tests/pmesolvetest.cpp:143: Testing solving (Coulomb, YZX, with energy/virial) with GPU (GPU #1: NVIDIA GeForce GTX 960, compute cap.: 5.2, ECC:  no, stat: compatible) for PME grid size 16 12 28, Ewald coefficients 2 0.7
[  FAILED  ] SaneInput/PmeSolveTest.ReproducesOutputs/0, where GetParam() = ({ 8, 0, 0, 0, 3.4, 0, 0, 0, 2 }, 12-byte object <10-00 00-00 0C-00 00-00 1C-00 00-00>, { (12-byte object <00-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 60-40 66-66 D6-40>), (12-byte object <07-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 20-C0 33-33 33-BF>), (12-byte object <03-00 00-00 05-00 00-00 07-00 00-00>, 8-byte object <A6-9B C4-BB 77-CC 2B-32>), (12-byte object <03-00 00-00 01-00 00-00 02-00 00-00>, 8-byte object <9A-99 19-3F CD-CC FC-40>), (12-byte object <06-00 00-00 02-00 00-00 04-00 00-00>, 8-byte object <CD-CC F0-41 CD-CC 1C-40>) }, 1.2, 2, 0.7, 4-byte object <00-00 00-00>) (192 ms)
[----------] 1 test from SaneInput/PmeSolveTest (192 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (460 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SaneInput/PmeSolveTest.ReproducesOutputs/0, where GetParam() = ({ 8, 0, 0, 0, 3.4, 0, 0, 0, 2 }, 12-byte object <10-00 00-00 0C-00 00-00 1C-00 00-00>, { (12-byte object <00-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 60-40 66-66 D6-40>), (12-byte object <07-00 00-00 00-00 00-00 00-00 00-00>, 8-byte object <00-00 20-C0 33-33 33-BF>), (12-byte object <03-00 00-00 05-00 00-00 07-00 00-00>, 8-byte object <A6-9B C4-BB 77-CC 2B-32>), (12-byte object <03-00 00-00 01-00 00-00 02-00 00-00>, 8-byte object <9A-99 19-3F CD-CC FC-40>), (12-byte object <06-00 00-00 02-00 00-00 04-00 00-00>, 8-byte object <CD-CC F0-41 CD-CC 1C-40>) }, 1.2, 2, 0.7, 4-byte object <00-00 00-00>)

 1 FAILED TEST

However, compiling the whole binary with NDEBUG does produce passing tests:

$ cmake . -DCMAKE_CXX_FLAGS='-DNDEBUG' && make ewald-test
$ bin/ewald-test --gtest_filter="SaneInput/PmeSolveTest.ReproducesOutputs/0*" 
Note: Google Test filter = SaneInput/PmeSolveTest.ReproducesOutputs/0*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SaneInput/PmeSolveTest
[ RUN      ] SaneInput/PmeSolveTest.ReproducesOutputs/0
[       OK ] SaneInput/PmeSolveTest.ReproducesOutputs/0 (131 ms)
[----------] 1 test from SaneInput/PmeSolveTest (131 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (320 ms total)
[  PASSED  ] 1 test.
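
A possible reason why the in-source #define NDEBUG had no effect while -DNDEBUG on the command line did, offered as a side note rather than a confirmed diagnosis: the standard assert macro is defined (and redefined) at the point where <cassert> is included, based on whether NDEBUG is defined at that point. A #define NDEBUG placed after the includes, or inside a function body, leaves assert in its checking form, whereas a command-line definition precedes every include. The minimal sketch below shows this in plain host C++. All function names are hypothetical, it contains no GROMACS code, and whether the same mechanism explains the in-kernel assertions here is an assumption.

```cpp
// Host-only illustration; all names are hypothetical, not GROMACS code.
#undef NDEBUG
#include <cassert> // assert is defined in its checking form here

static int g_evaluations = 0;

// Helper with a visible side effect, so we can tell whether
// assert actually evaluated its argument.
static bool countedCheck()
{
    ++g_evaluations;
    return true;
}

#define NDEBUG // too late: assert was already defined as the checking version

int evaluationsWithLateDefine()
{
    g_evaluations = 0;
    assert(countedCheck()); // still evaluated despite NDEBUG being defined now
    return g_evaluations;
}

#include <cassert> // re-including with NDEBUG defined turns assert into a no-op

int evaluationsAfterReinclude()
{
    g_evaluations = 0;
    assert(countedCheck()); // compiled out, countedCheck() is never called
    return g_evaluations;
}

#undef NDEBUG
#include <cassert> // restore checking asserts for the rest of the translation unit
```

Called from any main(), evaluationsWithLateDefine() returns 1 and evaluationsAfterReinclude() returns 0, which matches the pattern seen in the two runs above: only a definition of NDEBUG that is in effect when the assert header is processed disables the assertions.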

#9 Updated by Szilárd Páll 3 months ago

It seems that some in-kernel assertions are what break code generation (?): if I remove all three assertions nested in the data-dependent if (notZeroPoint) condition (source:src/gromacs/ewald/pme-solve.cu#L206), the errors are gone.

Thoughts anyone?

#10 Updated by Mark Abraham 2 months ago

  • Target version changed from 2019.1 to 2020

Retargeting to 2020, as this is not a user-facing issue.
