Bug #2424

LJ force and energy incorrect with cutoff-scheme=group and gcc 6

Added by Szilárd Páll about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: High
Assignee: Berk Hess
Category: mdrun
Target version: 2018.1
Affected version - extra info: 2016, likely all versions
Affected version: 2018
Difficulty: uncategorized

Description

$ gmx --version 
[...]
GROMACS:      gmx, version 2018.1-dev-20180226-4419ecb
Executable:   /home/pszilard/projects/gromacs/gromacs-18/build_gcc64_cuda91/bin/gmx
Data prefix:  /home/pszilard/projects/gromacs/gromacs-18 (source tree)
Working dir:  /home/pszilard/projects/gromacs/regressiontests
Command line:
  gmx --version

GROMACS version:    2018.1-dev-20180226-4419ecb
GIT SHA1 hash:      4419ecb01d2cb4635e9f4d71520fa39c000a01b2
Branched from:      cbbe6979875942bedfbc9a7f25b18ecc3b8a78eb (37 newer local commits)
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.7-sse2-avx-avx_128_fma-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
Built on:           2018-02-26 18:05:34
Built by:           pszilard@racoon [CMAKE]
Build OS/arch:      Linux 4.4.0-116-generic x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Build CPU family:   6   Model: 60   Stepping: 3
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /opt/tcbsys/gcc/6.4/bin/gcc-6 GNU 6.4.0
C compiler flags:    -march=core-avx2    -Wundef -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds 
C++ compiler:       /opt/tcbsys/gcc/6.4/bin/g++-6 GNU 6.4.0
C++ compiler flags:  -march=core-avx2    -std=c++11  -Wundef -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds 
CUDA compiler:      /opt/tcbsys/cuda/9.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
CUDA compiler flags:-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-use_fast_math;-D_FORCE_INLINES;; ;-march=core-avx2;-std=c++11;-Wundef;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wmissing-declarations;-Wall;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        9.0
CUDA runtime:       32.67

$ perl gmxtest.pl -xml all -ntomp 2 -npme 1 -nt 2
Will test on 2 OpenMP threads (if possible)
Will run PME tests using 1 separate PME ranks (if possible)
Will test on 2 thread-MPI ranks (if possible)
[...]
GROMACS:      gmx mdrun, version 2018.1-dev-20180226-4419ecb
Executable:   /home/pszilard/projects/gromacs/gromacs-18/build_gcc64_cuda91/bin/gmx
Data prefix:  /home/pszilard/projects/gromacs/gromacs-18 (source tree)
Working dir:  /home/pszilard/projects/gromacs/regressiontests
Command line:
  gmx mdrun -h

Thanx for Using GROMACS - Have a Nice Day

All 16 simple tests PASSED
FAILED. Check checkpot.out (12 errors), checkforce.out (51 errors) file(s) in cbt for cbt
FAILED. Check checkpot.out (14 errors), checkforce.out (53 errors) file(s) in reb for reb
FAILED. Check checkpot.out (63 errors), checkforce.out (615 errors) file(s) in tip4p for tip4p
3 out of 51 complex tests FAILED
FAILED. Check checkpot.out (4 errors) file(s) in nb_kernel_ElecCoul_VdwCSTab_GeomP1P1 for nb_kernel_ElecCoul_VdwCSTab_GeomP1P1
FAILED. Check checkpot.out (4 errors), checkforce.out (9 errors) file(s) in nb_kernel_ElecCoul_VdwCSTab_GeomW3P1 for nb_kernel_ElecCoul_VdwCSTab_GeomW3P1
FAILED. Check checkpot.out (4 errors), checkforce.out (453 errors) file(s) in nb_kernel_ElecCoul_VdwCSTab_GeomW3W3 for nb_kernel_ElecCoul_VdwCSTab_GeomW3W3
FAILED. Check checkpot.out (4 errors), checkforce.out (12 errors) file(s) in nb_kernel_ElecCoul_VdwCSTab_GeomW4P1 for nb_kernel_ElecCoul_VdwCSTab_GeomW4P1
FAILED. Check checkpot.out (4 errors), checkforce.out (420 errors) file(s) in nb_kernel_ElecCoul_VdwCSTab_GeomW4W4 for nb_kernel_ElecCoul_VdwCSTab_GeomW4W4
FAILED. Check checkpot.out (12 errors), checkforce.out (32 errors) file(s) in nb_kernel_ElecCSTab_VdwCSTab_GeomW4P1 for nb_kernel_ElecCSTab_VdwCSTab_GeomW4P1
FAILED. Check checkpot.out (2 errors) file(s) in nb_kernel_ElecEw_VdwCSTab_GeomP1P1 for nb_kernel_ElecEw_VdwCSTab_GeomP1P1
FAILED. Check checkpot.out (4 errors), checkforce.out (15 errors) file(s) in nb_kernel_ElecEw_VdwCSTab_GeomW3P1 for nb_kernel_ElecEw_VdwCSTab_GeomW3P1
FAILED. Check checkpot.out (4 errors), checkforce.out (453 errors) file(s) in nb_kernel_ElecEw_VdwCSTab_GeomW3W3 for nb_kernel_ElecEw_VdwCSTab_GeomW3W3
FAILED. Check checkpot.out (4 errors), checkforce.out (12 errors) file(s) in nb_kernel_ElecEw_VdwCSTab_GeomW4P1 for nb_kernel_ElecEw_VdwCSTab_GeomW4P1
FAILED. Check checkpot.out (4 errors), checkforce.out (420 errors) file(s) in nb_kernel_ElecEw_VdwCSTab_GeomW4W4 for nb_kernel_ElecEw_VdwCSTab_GeomW4W4
FAILED. Check checkpot.out (4 errors), checkforce.out (468 errors) file(s) in nb_kernel_ElecNone_VdwCSTab_GeomP1P1 for nb_kernel_ElecNone_VdwCSTab_GeomP1P1
FAILED. Check checkpot.out (4 errors) file(s) in nb_kernel_ElecRF_VdwCSTab_GeomP1P1 for nb_kernel_ElecRF_VdwCSTab_GeomP1P1
FAILED. Check checkpot.out (4 errors), checkforce.out (9 errors) file(s) in nb_kernel_ElecRF_VdwCSTab_GeomW3P1 for nb_kernel_ElecRF_VdwCSTab_GeomW3P1
FAILED. Check checkpot.out (4 errors), checkforce.out (453 errors) file(s) in nb_kernel_ElecRF_VdwCSTab_GeomW3W3 for nb_kernel_ElecRF_VdwCSTab_GeomW3W3
FAILED. Check checkpot.out (4 errors), checkforce.out (15 errors) file(s) in nb_kernel_ElecRF_VdwCSTab_GeomW4P1 for nb_kernel_ElecRF_VdwCSTab_GeomW4P1
FAILED. Check checkpot.out (4 errors), checkforce.out (426 errors) file(s) in nb_kernel_ElecRF_VdwCSTab_GeomW4W4 for nb_kernel_ElecRF_VdwCSTab_GeomW4W4
17 out of 142 kernel tests FAILED
FAILED. Check checkpot.out (179 errors), checkforce.out (2070 errors) file(s) in coulandvdwsequential_coul for coulandvdwsequential_coul
FAILED. Check checkpot.out (147 errors), checkforce.out (2082 errors) file(s) in coulandvdwsequential_vdw for coulandvdwsequential_vdw
FAILED. Check checkpot.out (160 errors), checkforce.out (2075 errors) file(s) in coulandvdwtogether for coulandvdwtogether
FAILED. Check checkpot.out (58 errors), checkforce.out (20290 errors) file(s) in restraints for restraints
FAILED. Check checkpot.out (25 errors), checkforce.out (2515 errors) file(s) in vdwalone for vdwalone
5 out of 10 freeenergy tests FAILED
All 12 rotation tests PASSED
All 0 extra tests PASSED
All 48 pdb2gmx tests PASSED
All 7 essential dynamics tests PASSED
Attachment: regressiontests.FAILED.tar.gz (22.5 MB), added by Szilárd Páll, 02/27/2018 03:03 PM

Associated revisions

Revision a46c96c6 (diff)
Added by Berk Hess about 1 year ago

Work around gcc bug with group kernel tables

The LJ-only interaction tables for the group kernels were incorrectly
copied with the combination of gcc-6, AVX and -O3.
Since this bug replaced -1/(6 r^6), 1/(12 r^12) by 1/r, -1/(6 r^6)
for odd table indices, this could not cause silent errors.

Fixes #2424

Change-Id: I021d4f2fe81bd635e6f02686118d661af8c444a6

History

#1 Updated by Szilárd Páll about 1 year ago

Triggered on verification of https://gerrit.gromacs.org/#/c/7617, reproduced locally with both gcc 6.4 and 6.3.

#2 Updated by Szilárd Páll about 1 year ago

  • Subject changed from some regression tests fail with gcc 6.4 to some regression tests fail with gcc 6

Reproduced with AVX_256, AVX2_128, AVX2_256, but not with SSE4.1. AVX codegen bug?

#3 Updated by Mark Abraham about 1 year ago

I was also looking into this earlier, and had no problem with a debug build on AVX_256. Trying alternatives.

#4 Updated by Szilárd Páll about 1 year ago

Interesting, the above were all release builds.

#5 Updated by Szilárd Páll about 1 year ago

Szilárd Páll wrote:

Interesting, the above were all release builds.

And indeed, a debug build works, and so does RelWithDebInfo, so -O2 seems to be fine. So it is perhaps an -O3 breakage with AVX.

#6 Updated by Berk Hess about 1 year ago

All failures seem to be with the group scheme, so it looks like something specific to the group non-bonded kernels.
How large are the differences?

#7 Updated by Szilárd Páll about 1 year ago

Had a quick look and the differences are fairly large, but not orders of magnitude (attached the output of the failed tests).

#8 Updated by Berk Hess about 1 year ago

  • Subject changed from some regression tests fail with gcc 6 to LJ force and energy incorrect with cutoff-scheme=group and gcc 6
  • Status changed from New to Accepted
  • Priority changed from Normal to High
  • Target version set to 2018.1

The LJ energies and forces are completely off, everything else seems to be correct. This is a rather serious bug.

#9 Updated by Berk Hess about 1 year ago

  • Status changed from Accepted to In Progress

When there is only one atomtype the results are correct, so my guess is that something goes wrong in the LJ type or parameter lookup.

#10 Updated by Berk Hess about 1 year ago

My last comment was incorrect.
The results are correct with analytical LJ. With tables, even a simple LJ system with one atom type is incorrect.

#11 Updated by Berk Hess about 1 year ago

The non-SIMD kernel is also incorrect.
I printed the table results in the plain-C kernel, which should be -1/(6 r^6) and 1/(12 r^12). For half of the pairs they are correct. For the other half, the repulsion result holds the correct dispersion value and the dispersion result is something different, but systematic. So there seems to be a mixup in the table type offset for half the pairs.
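
The kind of check described above can be sketched as follows. This is only an illustration, not the code that was used: the function names `lj_reference` and `classify_entry` are hypothetical, and each table entry is simplified to a single potential value per interaction, whereas the real group-kernel tables hold several values per point.

```python
def lj_reference(r):
    """Analytical values the LJ table is expected to hold at distance r."""
    dispersion = -1.0 / (6.0 * r**6)
    repulsion = 1.0 / (12.0 * r**12)
    return dispersion, repulsion


def classify_entry(r, disp, rep, rel_tol=1e-6):
    """Classify a (dispersion, repulsion) pair read from a table at distance r.

    Hypothetical helper: returns "correct" when both values match the
    analytical reference, "shifted" when the repulsion slot holds the
    dispersion value, and "other" otherwise.
    """
    ref_disp, ref_rep = lj_reference(r)

    def close(a, b):
        return abs(a - b) <= rel_tol * abs(b)

    if close(disp, ref_disp) and close(rep, ref_rep):
        return "correct"
    if close(rep, ref_disp):
        return "shifted"
    return "other"
```

Running `classify_entry` over every pair that a kernel looks up would separate the pairs that are fine from those that show the shifted pattern.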

#12 Updated by Aleksei Iupinov about 1 year ago

OK, just a wild guess since I'm not familiar with the code.
I see that some tables are created with ninteractions = 1 or 2, which is then multiplied by stride (of e.g. 4) for index/size calculation.
But some tables are (also?) made with function make_tables, which right away sets the default ninteractions = etiNR, which is "Total number of interaction types" and happens to be 3.
If the kernel assumes the table to have ninteractions 2, but it happened to actually have ninteractions 3, that would fit the symptoms, right?

#13 Updated by Berk Hess about 1 year ago

Very good guess!
I just noticed that the other incorrect potential is actually 1/r.
I now checked the indices and all even ones are correct and odd ones incorrect.
But if the stride is off, then the extracted potential cannot be for the correct r. What we actually have is that for even indices we have, correctly, -1/(6 r^6), 1/(12 r^12) and for odd indices 1/r, -1/(6 r^6). So something must be going wrong with the table filling/copying.
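
To make the distinction concrete, here is a minimal sketch of the two scenarios. It assumes a simplified layout with one value per interaction per table point (Coulomb, dispersion, repulsion), and the function names are hypothetical; the real tables are more involved.

```python
N_TRIPLE = 3  # Coulomb, dispersion, repulsion per table point (simplified)


def make_triple_table(r_values):
    """Flat table: [1/r, -1/(6 r^6), 1/(12 r^12)] for each r."""
    table = []
    for r in r_values:
        table += [1.0 / r, -1.0 / (6.0 * r**6), 1.0 / (12.0 * r**12)]
    return table


def read_with_wrong_stride(table, i):
    """Stride hypothesis: index a 3-wide table as if it were 2-wide.

    For i > 0 the two values come from the wrong point, or even from
    two different points, so they do not belong to the requested r.
    """
    return table[2 * i], table[2 * i + 1]


def read_with_offset_shift(table, i):
    """Observed pattern for odd i: correct point, columns shifted by one.

    Returns (1/r, -1/(6 r^6)) for the requested r instead of
    (-1/(6 r^6), 1/(12 r^12)).
    """
    return table[N_TRIPLE * i], table[N_TRIPLE * i + 1]
```

With `r_values = [1.0, 1.5, 2.0, 2.5]` and `i = 1`, the wrong-stride read mixes the repulsion at r = 1.0 with the Coulomb value at r = 1.5, while the offset-shift read returns two values that both belong to r = 1.5. Only the latter matches what was printed from the kernel.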

#14 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2424.
Uploader: Berk Hess
Change-Id: gromacs~release-2018~I021d4f2fe81bd635e6f02686118d661af8c444a6
Gerrit URL: https://gerrit.gromacs.org/7630

#15 Updated by Berk Hess about 1 year ago

  • Status changed from In Progress to Fix uploaded
  • Assignee set to Berk Hess
  • Affected version - extra info set to 2016, likely all versions
  • Affected version changed from 2018.1 to 2018

#16 Updated by Berk Hess about 1 year ago

A bug in gcc6 causes incorrect copying of the triple Coulomb-LJ tables to the LJ-only tables. The bug seems to only occur with AVX and -O3. gcc 7.1 does not seem to be affected.
Since this replaces -1/(6 r^6), 1/(12 r^12) by 1/r, -1/(6 r^6) for odd table indices, this could not cause silent errors.
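
As a rough illustration of why such an error cannot go unnoticed, the sketch below emulates the symptom on a toy table. It is not the GROMACS table code: the layout is simplified to one value per interaction per point, all function names are hypothetical, and the faulty copy merely reproduces the reported pattern rather than the compiler's actual behaviour.

```python
def make_triple_table(r_values):
    """Per point: [1/r, -1/(6 r^6), 1/(12 r^12)] (simplified layout)."""
    table = []
    for r in r_values:
        table.extend([1.0 / r, -1.0 / (6.0 * r**6), 1.0 / (12.0 * r**12)])
    return table


def copy_lj_columns(triple, n_points):
    """Intended copy: take dispersion and repulsion, skip Coulomb."""
    lj = []
    for i in range(n_points):
        lj.extend(triple[3 * i + 1 : 3 * i + 3])
    return lj


def copy_lj_columns_faulty(triple, n_points):
    """Emulates the reported symptom: odd points start one column too early."""
    lj = []
    for i in range(n_points):
        start = 3 * i + (1 if i % 2 == 0 else 0)
        lj.extend(triple[start : start + 2])
    return lj


def mismatched_points(reference, candidate, rel_tol=1e-6):
    """Indices of table points where either LJ value deviates from reference."""
    bad = []
    for i in range(len(reference) // 2):
        for k in (0, 1):
            ref, val = reference[2 * i + k], candidate[2 * i + k]
            if abs(val - ref) > rel_tol * abs(ref):
                bad.append(i)
                break
    return bad
```

For a table built from `[0.5, 0.6, 0.7, 0.8]`, comparing the faulty copy against the intended one flags exactly the odd points. At those points the dispersion slot holds a positive 1/r instead of a negative -1/(6 r^6), so the sign itself is wrong, which is consistent with the large LJ deviations seen in the regression tests.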

#17 Updated by Berk Hess about 1 year ago

  • Status changed from Fix uploaded to Resolved

#18 Updated by Mark Abraham about 1 year ago

  • Status changed from Resolved to Closed
