Bug #2939

master post-submit failing in complex.nbnxn-ljpme-LB-geometric

Added by Szilárd Páll 4 months ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Category: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Since 05029e2, the Angle and/or the Ryckaert-Bellemans (RB) fields of the energy file fail the check for the nbnxn-ljpme-LB-geometric regression test, e.g.:

checkpot.out:
comparing energy file ./reference_s.edr and ener.edr

There are 46 and 47 terms in the energy files

enm[13] (- - Conserved En.)
There are 11 terms to compare in the energy files

Angle            step  20:       205.527,  step  20:      206.289
Ryckaert-Bell.   step  20:       29.3271,  step  20:      29.2583

Files read successfully

This only fails in post-submit, where multi-domain GPU offload is used and bonded offload is therefore triggered by default (as PME is not offloaded). This is likely an issue with the tolerances, but the GPU results should be looked at as well.

Recent failing tests:

http://jenkins.gromacs.org/job/Matrix_PostSubmit_master/1372/OPTIONS=gcc-5%20gpuhw=nvidia%20nranks=4%20gpu_id=1%20cuda-10.0%20no-hwloc%20release-with-assert%20host=bs_nix1204,label=bs_nix1204/testReport/(root)/complex/nbnxn_ljpme_LB_geometric/

http://jenkins.gromacs.org/job/Matrix_PostSubmit_master/1371/OPTIONS=gcc-5%20gpuhw=nvidia%20nranks=4%20gpu_id=1%20cuda-10.0%20no-hwloc%20release-with-assert%20host=bs_nix1204,label=bs_nix1204/testReport/(root)/complex/nbnxn_ljpme_LB_geometric/

ntmpi-4-gpu-checkpot.out (401 Bytes), Paul Bauer, 05/13/2019 04:39 PM
ntmpi-7-gpu-checkpot.out (269 Bytes), Paul Bauer, 05/13/2019 04:39 PM
ntmpi-8-cpu-checkpot.out (533 Bytes), Paul Bauer, 05/13/2019 04:39 PM
ntmpi-8-gpu-checkpot.out (269 Bytes), Paul Bauer, 05/13/2019 04:39 PM

History

#1 Updated by Szilárd Páll 4 months ago

Reproduced outside Jenkins too, with

perl gmxtest.pl -xml complex -nt 4

and

GROMACS version:    2020-dev-20190430-134798e
GIT SHA1 hash:      134798e80f0f28ba42ebb573ff78537e54c0de63

#2 Updated by Paul Bauer 4 months ago

I also just reproduced this locally with CUDA 9.1, but with slightly different values for the offending terms and only after running the test a few times.
@Szilard, any tips on how I could go about debugging this further?

#3 Updated by Paul Bauer 4 months ago

  • Status changed from New to Accepted

#4 Updated by Szilárd Páll 4 months ago

Paul Bauer wrote:

@Szilard, any tips on how I could go about debugging this further?

My rough idea would be to check under which parallelization conditions the mismatch is reproducible: only with DD + bonded offload, or also without DD but with bonded offload (i.e. -ntmpi 1 -bonded gpu, as in this case -bonded cpu is the default).
Let me know what you find; I won't have time to look into it myself in the coming 1-2 weeks due to other engagements.

#5 Updated by Paul Bauer 3 months ago

I have now tested this with the run configurations you suggested, using this script:

wd=$(pwd)
for nmpi in 1 2 3 4 5 6 7 8 ; do
    for offload in cpu gpu ; do
        # One working directory per rank count and bonded-offload setting
        mkdir -p "ntmpi-${nmpi}-bonded-${offload}"
        cd "ntmpi-${nmpi}-bonded-${offload}"
        cp "$wd"/*tpr "$wd"/*edr .
        echo "Checking ntmpi ${nmpi} and offload ${offload} with constant number of OpenMP threads of 2"
        # Repeat up to 50 times, since the mismatch does not show up on every run
        for i in {1..50} ; do
            gmx mdrun -ntmpi $nmpi -ntomp 2 -bonded $offload -notunepme &>mdrun.out
            gmx check -e reference_s.edr -e2 ener.edr -tol 0.001 -abstol 0.05 -lastener Potential >checkpot.out 2>checkpot.err
            # Each mismatching energy term is reported on a line containing "step"
            errors=$(grep -c step checkpot.out)
            if [ "$errors" -ne 0 ] ; then
                echo "Number of errors is $errors in iteration $i"
                break
            elif [ "$i" -eq 50 ] ; then
                # All 50 iterations passed: remove the accumulated backup files (#name#)
                rm -rf \#*
            fi
        done
        cd "$wd"
    done
done

The number of OpenMP threads is fixed at 2 so that all cases behave the same in that respect.

Results are here:

Checking ntmpi 1 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 1 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 2 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 2 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 3 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 3 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 4 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 4 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 3 in iteration 1
Checking ntmpi 5 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 5 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 6 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 6 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 7 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 7 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 1 in iteration 1
Checking ntmpi 8 and offload cpu with constant number of OpenMP threads of 2
Number of errors is 5 in iteration 2
Checking ntmpi 8 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 1 in iteration 1

Files are attached as well. I managed to get an error once with a single thread but have not been able to reproduce it so far.
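
To put the reported mismatches in relation to the tolerances, here is a minimal sketch that tests the Angle and Ryckaert-Bell. values from the description against the -tol 0.001 -abstol 0.05 settings used in the script above. The helper within_tol is hypothetical, and the criterion it applies (pass if either the relative or the absolute tolerance is met) is an assumption about how such a comparison typically works, not something taken from the GROMACS source.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of GROMACS.
# Usage: within_tol A B RTOL ATOL
# Returns 0 if A and B agree within EITHER the relative tolerance RTOL
# (measured against the mean magnitude of A and B) or the absolute
# tolerance ATOL; returns 1 otherwise. The criterion is an assumption.
within_tol() {
    awk -v a="$1" -v b="$2" -v rtol="$3" -v atol="$4" 'BEGIN {
        d = a - b; if (d < 0) d = -d                 # |a - b|
        s = (a < 0 ? -a : a) + (b < 0 ? -b : b)      # |a| + |b|
        exit !((2 * d <= s * rtol) || (d <= atol))
    }'
}

# Reference and test values as reported in the description of this issue.
for entry in "Angle 205.527 206.289" "Ryckaert-Bell. 29.3271 29.2583"; do
    set -- $entry
    if within_tol "$2" "$3" 0.001 0.05; then
        echo "$1: within tolerance"
    else
        echo "$1: outside tolerance"
    fi
done
```

Under this assumed criterion both terms fall outside the tolerance: the Angle difference is about 0.76 and the Ryckaert-Bell. difference about 0.07, each above the absolute tolerance of 0.05 and above 0.1% of the value.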

#7 Updated by Artem Zhmurov 3 months ago

In some cases, velocities are also off. Example (http://jenkins.gromacs.org/job/Matrix_OnDemand/724/OPTIONS=gcc-5%20gpuhw=nvidia%20nranks=4%20gpu_id=1%20cuda-10.0%20no-hwloc%20release-with-assert%20host=bs_nix1204,label=bs_nix1204/testReport/junit/(root)/complex/nbnxn_ljpme_LB_geometric/):

Regression
complex.nbnxn-ljpme-LB-geometric

Failing for the past 1 build (Since Failed#724 )
Took 0 ms.
Error Message
Errors in checkpot.out (1 errors), checkforce.out (1 errors) 
Stacktrace

checkpot.out:
comparing energy file ./reference_s.edr and ener.edr

There are 46 and 47 terms in the energy files

enm[13] (- - Conserved En.)
There are 11 terms to compare in the energy files

Ryckaert-Bell.   step  20:       29.3271,  step  20:      29.2595

Files read successfully

--------------------------------
checkforce.out:

v[    0] ( 3.82190e-05  2.32143e-05  4.64865e-05) - ( 8.11169e-05  4.94338e-05  9.85528e-05)
v[    1] ( 5.81662e-06  8.28132e-06 -5.22912e-06) - ( 1.10956e-05  1.58769e-05 -9.93199e-06)
v[    2] (-3.01374e-05  2.84497e-05 -1.75781e-05) - (-4.07619e-05  3.85687e-05 -2.37945e-05)
v[    3] (-4.19012e-05 -4.55146e-05  1.31545e-05) - (-1.01274e-04 -7.86563e-05 -3.50580e-06)
v[    4] ( 9.67434e-05  4.51581e-05 -1.01112e-04) - ( 1.10829e-04  2.65790e-05 -1.24159e-04)
v[    5] (-1.14873e-04 -1.92678e-05  1.10743e-04) - (-1.11022e-04 -1.35580e-05  1.07504e-04)
v[    6] ( 1.00374e-05 -6.46271e-05 -3.67005e-05) - ( 1.79006e-05 -5.98568e-05 -3.90244e-05)
v[    7] ( 2.76249e-06  3.12851e-05 -5.98766e-06) - ( 3.20643e-07  2.73925e-05 -4.50309e-06)
v[    8] (-1.76817e-05 -1.45096e-06 -6.15216e-05) - (-1.92906e-05 -1.56634e-06 -6.71429e-05)
v[    9] ( 5.71089e-05 -6.37113e-05  4.36481e-05) - ( 5.85135e-05 -6.52717e-05  4.47200e-05)
v[   10] (-7.50885e-05  3.10111e-05 -4.58133e-05) - (-7.63147e-05  3.44241e-05 -4.19702e-05)
v[   11] ( 8.81265e-05 -4.13817e-04 -1.55104e-04) - ( 8.70891e-05 -4.10036e-04 -1.53968e-04)
v[   12] (-1.16661e-04  2.44982e-04 -6.41148e-04) - (-1.19169e-04  2.41529e-04 -6.42103e-04)
v[   13] (-6.24408e-04 -2.68030e-04  7.36976e-04) - (-6.26274e-04 -2.68642e-04  7.35967e-04)
v[   14] ( 4.95552e-04 -3.37262e-04 -4.96085e-05) - ( 4.97243e-04 -3.36534e-04 -5.21813e-05)
v[   15] ( 3.25371e-04  3.23771e-04 -9.22419e-04) - ( 3.24985e-04  3.23386e-04 -9.21323e-04)
v[   16] ( 5.97346e-05  1.43612e-04  1.07300e-03) - ( 5.60530e-05  1.41558e-04  1.07392e-03)
v[   17] ( 3.33134e-04  2.08174e-04 -1.50028e-04) - ( 3.36819e-04  2.10446e-04 -1.51340e-04)
v[   18] ( 1.42048e-04  1.16935e-04  1.71837e-04) - ( 1.43429e-04  1.18107e-04  1.73371e-04)
v[   19] (-6.82770e-05 -1.05830e-04  5.34459e-05) - (-6.92111e-05 -1.06904e-04  5.34050e-05)
v[   20] (-3.48946e-05 -9.01594e-06 -7.51828e-05) - (-3.44894e-05 -7.93586e-06 -7.52799e-05)
v[   21] ( 9.54477e-05 -9.04110e-05 -1.44631e-05) - ( 9.93797e-05 -9.62242e-05 -1.52629e-05)
v[   22] (-1.28296e-04  2.74479e-05  9.31783e-06) - (-1.32999e-04  3.16080e-05  9.99477e-06)
v[   23] (-9.71807e-06  1.50228e-04  5.22361e-05) - (-7.80136e-06  1.48419e-04  5.07787e-05)
v[   24] ( 2.80659e-05 -8.09137e-05 -3.41859e-05) - ( 2.78088e-05 -7.80534e-05 -3.34443e-05)
v[   25] ( 1.25361e-04 -1.29479e-04 -5.14681e-05) - ( 1.24084e-04 -1.29845e-04 -5.07461e-05)
v[   26] (-6.21310e-05  1.55068e-04  7.19505e-06) - (-6.03985e-05  1.55792e-04  5.89743e-06)
v[   27] (-1.37482e-04  8.11113e-05  6.49504e-05) - (-1.39911e-04  8.11355e-05  6.64825e-05)
v[   28] (-1.71048e-05 -1.75463e-04  8.89317e-05) - (-1.75252e-05 -1.77019e-04  8.99376e-05)
v[   29] ( 1.63208e-06  1.93930e-04 -8.11589e-05) - ( 5.28352e-06  1.91935e-04 -8.33181e-05)
v[   30] ( 1.08665e-05 -1.33576e-04  3.17781e-05) - ( 9.14351e-06 -1.30849e-04  3.21592e-05)
v[   31] (-3.43823e-04  1.64988e-04 -4.84206e-05) - (-3.40158e-04  1.65210e-04 -4.87683e-05)
v[   32] (-2.98803e-04 -1.73145e-05  8.85885e-05) - (-3.06489e-04 -1.48491e-05  9.00484e-05)
v[   33] ( 4.13837e-05  5.56042e-05  3.93050e-05) - ( 3.18830e-05  4.55394e-05  4.88675e-05)
v[   34] ( 3.53090e-05 -2.48075e-04  3.91100e-05) - ( 3.61218e-05 -2.53770e-04  4.00083e-05)
v[   35] ( 2.31603e-04  7.97380e-05 -2.10578e-05) - ( 2.11370e-04  1.26287e-04 -9.63867e-05)
v[   36] (-9.75589e-05  2.86010e-06 -6.17516e-05) - (-7.17893e-05 -4.38211e-05  1.18293e-05)
v[   37] (-2.52811e-05  9.51696e-05  2.04383e-05) - (-1.59768e-05  1.07536e-04  1.22033e-05)
v[   38] ( 1.12089e-04 -5.42673e-06 -9.54357e-05) - ( 1.12320e-04 -6.36096e-06 -9.56390e-05)
v[   39] (-3.60630e-05 -4.90255e-05  2.86628e-05) - (-2.97641e-05 -5.48322e-05  2.21639e-05)
v[   40] ( 7.52181e-05  8.91778e-05 -5.66228e-05) - ( 7.07735e-05  9.67311e-05 -5.16049e-05)
v[   41] (-2.22031e-05 -4.58290e-05  1.23560e-04) - (-2.06177e-05 -4.54735e-05  1.27391e-04)
v[   42] (-3.10076e-06  1.70863e-05 -9.01438e-05) - (-4.75483e-07  2.03676e-05 -9.59646e-05)
v[   43] (-1.02326e-04 -8.81727e-05 -4.62480e-05) - (-1.08091e-04 -9.32737e-05 -4.35606e-05)
v[   44] ( 4.17640e-05  3.52450e-05  7.87176e-05) - ( 4.57714e-05  3.66875e-05  7.83296e-05)
v[   45] (-2.66810e-05 -2.56580e-05 -4.54502e-05) - (-2.97947e-05 -2.63001e-05 -4.36057e-05)
v[   46] ( 8.41690e-05  5.60919e-05  2.97270e-05) - ( 7.80422e-05  5.25156e-05  3.01309e-05)
v[   47] ( 1.94108e-05 -1.05462e-04 -2.55456e-05) - ( 2.27221e-05 -1.00400e-04 -2.58896e-05)
v[   48] (-1.00471e-04  5.42810e-05  5.29792e-05) - (-9.79143e-05  5.25165e-05  5.16167e-05)
v[   49] ( 2.34203e-05  1.90566e-05 -1.12018e-05) - ( 2.30160e-05  1.87254e-05 -1.10087e-05)
v[   50] ( 2.00862e-04  1.50599e-04  5.12998e-06) - ( 1.96730e-04  1.46180e-04  3.00885e-06)
v[   51] (-9.71415e-05 -1.19313e-04 -7.32232e-05) - (-9.39012e-05 -1.15337e-04 -7.07785e-05)

v[    4] ( 1.23373e-01  1.47249e-01  2.07036e-02) - ( 1.23320e-01  1.47011e-01  2.06277e-02)
v[    5] (-2.71846e-02  2.44431e-01 -5.42799e-02) - (-2.69355e-02  2.44430e-01 -5.44564e-02)
v[   39] (-5.17761e-02  2.06865e-01  1.88818e-02) - (-5.19735e-02  2.07007e-01  1.91502e-02)
v[   40] ( 2.15329e-01  1.11786e-01  3.17072e-01) - ( 2.15550e-01  1.11676e-01  3.16733e-01)
v[   45] ( 2.73470e-01 -4.14410e-01  2.97880e-01) - ( 2.74092e-01 -4.15157e-01  2.98151e-01)
v[   46] (-2.37029e-01  1.67531e-01  3.17544e-01) - (-2.36003e-01  1.66013e-01  3.17767e-01)
v[   47] (-7.77016e-02 -3.81583e-01 -2.34131e-01) - (-7.68766e-02 -3.83130e-01 -2.36025e-01)
v[   48] ( 3.58931e-01 -1.89605e-01  2.55897e-01) - ( 3.65229e-01 -1.99331e-01  2.70675e-01)
v[   49] (-8.48893e-02  3.07502e-01 -1.63603e-01) - (-8.47734e-02  2.49159e-01 -2.47983e-01)
v[   53] ( 3.85998e-01 -2.57216e-01 -5.79835e-03) - ( 3.85912e-01 -2.57476e-01 -5.76616e-03)
f[    3] ( 4.56915e+02 -5.44059e+02 -6.87115e+02) - ( 4.56747e+02 -5.44174e+02 -6.87120e+02)
f[    4] ( 1.57623e+02  8.71704e+02  4.34741e+02) - ( 1.57684e+02  8.71767e+02  4.34746e+02)
f[    5] (-1.04015e+03 -6.94798e+02  1.23352e+03) - (-1.04011e+03 -6.94741e+02  1.23351e+03)
f[    6] (-2.74076e+02 -7.04603e+02 -4.74374e+02) - (-2.74106e+02 -7.04668e+02 -4.74354e+02)
f[    7] ( 8.25692e+02  1.08274e+03  6.30820e+02) - ( 8.25697e+02  1.08268e+03  6.30738e+02)
f[    8] (-2.33392e+02  9.81907e+01  4.08810e+02) - (-2.33431e+02  9.82533e+01  4.08822e+02)
f[    9] (-3.40616e+02  5.33085e+02 -2.91028e+02) - (-3.40622e+02  5.33104e+02 -2.90957e+02)
f[   10] ( 3.66948e+02 -9.90795e+02 -1.02786e+03) - ( 3.66987e+02 -9.90861e+02 -1.02788e+03)
f[   12] (-2.42468e+01  6.29543e+02 -9.69893e+02) - (-2.41872e+01  6.29570e+02 -9.69893e+02)
f[   14] (-4.04757e+02 -5.71084e+02  6.78761e+01) - (-4.04822e+02 -5.71078e+02  6.79182e+01)
f[   20] ( 2.42875e+02 -4.40445e+02  3.03630e+02) - ( 2.42863e+02 -4.40378e+02  3.03587e+02)
f[   21] (-3.67315e+02  9.15026e+01 -3.77467e+02) - (-3.67242e+02  9.15095e+01 -3.77432e+02)
f[   22] ( 1.11269e+02  1.29791e+02 -3.60736e+01) - ( 1.11192e+02  1.29772e+02 -3.60648e+01)
f[   23] ( 8.31355e+01 -6.19152e+01  2.56658e+02) - ( 8.32332e+01 -6.18793e+01  2.56624e+02)
f[   24] (-8.44423e+01  4.82937e+01 -1.69570e+02) - (-8.44975e+01  4.82063e+01 -1.69560e+02)
f[   30] ( 5.61898e+01 -4.89037e+02  8.40488e+01) - ( 5.61614e+01 -4.89117e+02  8.40561e+01)
f[   33] ( 5.84155e+02 -4.61681e+02 -1.01464e+02) - ( 5.84065e+02 -4.61642e+02 -1.01492e+02)
f[   36] (-1.70262e+01 -1.26708e+01 -6.65787e+01) - (-1.69640e+01 -1.27226e+01 -6.65659e+01)
f[   37] ( 1.62544e+02  2.97955e+01  2.46282e+01) - ( 1.62501e+02  2.98786e+01  2.45910e+01)
f[   42] (-1.55798e+02  2.66295e+01  6.92210e+01) - (-1.55805e+02  2.66781e+01  6.93165e+01)
f[   43] (-7.57585e+01  4.37608e+01  4.69104e+01) - (-7.57396e+01  4.37630e+01  4.67860e+01)
f[   44] ( 6.36224e+01  1.45138e+02  2.05362e+02) - ( 6.35803e+01  1.45346e+02  2.05685e+02)
f[   45] ( 4.17827e+02  1.35040e+02 -5.41130e+02) - ( 4.18321e+02  1.33633e+02 -5.41153e+02)
f[   46] (-4.57374e+02 -9.78127e+01  5.37016e+02) - (-4.55926e+02 -9.92113e+01  5.39168e+02)
f[   47] (-2.84192e+02 -5.83224e+02 -2.21161e+02) - (-2.83312e+02 -5.75560e+02 -2.25527e+02)
f[   48] ( 2.25677e+02  6.89124e+02  1.31594e+02) - ( 2.22789e+02  6.80404e+02  1.30617e+02)
f[   49] ( 1.00691e+02 -2.76922e+02 -1.43526e+02) - ( 1.11041e+02 -3.68332e+02 -2.23872e+02)

Both files read correctly

--------------------------------

#8 Updated by Artem Zhmurov 3 months ago

I believe it appears only when one has an even number of ranks. Using Paul's script on my computer (50 tries for each case):

Checking ntmpi 1 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 1 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 2 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 2 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 3 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 3 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 4 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 4 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 5 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 5 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 6 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 6 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 7 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 7 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 8 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 5 in iteration 4
Checking ntmpi 8 and offload cpu with constant number of OpenMP threads of 2
Number of errors is 5 in iteration 12
Checking ntmpi 9 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 9 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 10 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 2 in iteration 4
Checking ntmpi 10 and offload cpu with constant number of OpenMP threads of 2
Number of errors is 2 in iteration 8
Checking ntmpi 11 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 11 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 12 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 1 in iteration 1
Checking ntmpi 12 and offload cpu with constant number of OpenMP threads of 2
Number of errors is 1 in iteration 7
Checking ntmpi 13 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 13 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 14 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 14 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 15 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 15 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 16 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 3 in iteration 1
Checking ntmpi 16 and offload cpu with constant number of OpenMP threads of 2
Number of errors is 5 in iteration 2
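
Logs like the one above are easier to compare across machines when condensed to one line per configuration. The following is a minimal sketch; summarize_runs is a hypothetical helper that assumes exactly the two line formats printed by the test script ("Checking ntmpi ..." and "Number of errors is ...").

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the test loop's output on stdin and prints
# one line per configuration, either
#   "<ntmpi> <offload> ok"
# or
#   "<ntmpi> <offload> FAIL iter=<iteration> errors=<count>"
summarize_runs() {
    awk '
        # Print the configuration collected so far, if there is one.
        function emit() { if (have) print ranks, offload, status }
        /^Checking ntmpi/ {
            emit()
            ranks = $3; offload = $6; status = "ok"; have = 1
        }
        /^Number of errors is/ {
            status = "FAIL iter=" $8 " errors=" $5
        }
        END { emit() }
    '
}

# Example with four configurations taken from the log above.
summarize_runs <<'EOF'
Checking ntmpi 7 and offload gpu with constant number of OpenMP threads of 2
Checking ntmpi 7 and offload cpu with constant number of OpenMP threads of 2
Checking ntmpi 8 and offload gpu with constant number of OpenMP threads of 2
Number of errors is 5 in iteration 4
Checking ntmpi 8 and offload cpu with constant number of OpenMP threads of 2
Number of errors is 5 in iteration 12
EOF
```

For this excerpt the summary marks both ntmpi 7 runs as ok and both ntmpi 8 runs as failing. Applied to the results in comment #5, the same summary would also list ntmpi 7 with GPU offload as failing, so the even-rank pattern may not hold on every machine.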

#9 Updated by Paul Bauer 3 months ago

I'm bisecting now to find where the issue was actually introduced; it is definitely after the 2019 release candidate.

#10 Updated by Paul Bauer 3 months ago

The first commit that can trigger the bug for me is abfa9ed502c2fce9bca256f66f35a2bf4e446e68.

#11 Updated by Paul Bauer 3 months ago

@Berk, can you check the commit above? I'm trying to understand the different check, but I don't think I understand the code well enough.

#12 Updated by Paul Bauer 3 months ago

  • Status changed from Accepted to Fix uploaded
  • Target version set to 2020-infrastructure-stable

#13 Updated by Berk Hess 3 months ago

This issue is not present in release-2019.
Szilard and I should come up with a proper solution.

#14 Updated by Szilárd Páll about 1 month ago

Fixed in 19ceb46

#15 Updated by Szilárd Páll about 1 month ago

  • Status changed from Fix uploaded to Closed
  • Assignee set to Szilárd Páll
