Bug #2504

gromacs 2018.1 doesn't run on KNL

Added by Carlo Camilloni about 1 year ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
analysis tools
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

I have tried to run a standard MD simulation on a KNL cluster but I get the following error:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25280 RUNNING AT r065c04s03
=   EXIT CODE: 132
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25280 RUNNING AT r065c04s03
=   EXIT CODE: 4
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
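By the usual shell convention, an exit status above 128 means the process was terminated by a signal whose number is the status minus 128. Under that convention, 132 would correspond to signal 4, which is SIGILL (illegal instruction) on Linux, consistent with the SIGILL that GDB reports later in this thread. A minimal sketch for decoding such a status follows; the helper name `decode_exit_status` is hypothetical, and it is an assumption that the MPI launcher reports the shell-style status unchanged.

```shell
# Decode a shell-style exit status (hypothetical helper, not part of GROMACS or MPI).
# Statuses above 128 conventionally mean "killed by signal (status - 128)".
decode_exit_status() {
    if [ "$1" -gt 128 ]; then
        # kill -l N prints the name of signal N without the SIG prefix
        echo "signal $(kill -l $(( $1 - 128 )))"
    else
        echo "exit $1"
    fi
}

decode_exit_status 132
```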


The code is run as:

mpiexec -np 32 mdrun_knl -s topol0 -nb cpu -v -maxh 23.9 -nsteps -1 >& log

It is compiled with Intel 2017. The same happens with Intel 2018, and when using FFTW instead of MKL.

This is the log

md.2018.1.log:

GROMACS version:    2018.1
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  AVX_512_KNL
FFT library:        Intel MKL
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.0
Tracing support:    disabled
Built on:           2018-05-17 11:33:17
Built by:           ccamillo@r000u06l01 [CMAKE]
Build OS/arch:      Linux 3.10.0-327.36.3.el7.x86_64 x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/bin/icc Intel 17.0.4.20170411
C compiler flags:    -xMIC-AVX512   -mkl=sequential  -std=gnu99  -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits
C++ compiler:       /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/bin/icpc Intel 17.0.4.20170411
C++ compiler flags:  -xMIC-AVX512   -mkl=sequential  -std=c++11   -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits


(it stops here)

The same TPR with the same setup on the same cluster works well with GROMACS 2016.5 compiled in the same way.

md.2016.5.log:

GROMACS version:    2016.5
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        disabled
SIMD instructions:  AVX_512_KNL
FFT library:        Intel MKL
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.0
Tracing support:    disabled
Built on:           Thu May 17 12:17:27 CEST 2018
Built by:           ccamillo@r000u06l01 [CMAKE]
Build OS/arch:      Linux 3.10.0-327.36.3.el7.x86_64 x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/bin/icc Intel 17.0.4.20170411
C compiler flags:    -xMIC-AVX512   -mkl=sequential  -std=gnu99  -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias
C++ compiler:       /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/bin/icpc Intel 17.0.4.20170411
C++ compiler flags:  -xMIC-AVX512   -mkl=sequential  -std=c++0x   -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias

Running on 1 node with total 68 cores, 272 logical cores
Hardware detected on host r065c06s01 (the node of MPI rank 0):
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
    Family: 6   Model: 87   Stepping: 1
    Features: aes apic avx avx2 avx512f avx512pf avx512er avx512cd clfsh cmov cx8 cx16 f16c fma htt lahf mmx msr nonstop_tsc pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX_512_KNL
    SIMD instructions selected at GROMACS compile time: AVX_512_KNL

  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket  0: [   0  68 136 204] [   1  69 137 205] [   2  70 138 206] [   3  71 139 207] [   4  72 140 208] [   5  73 141 209] [   6  74 142 210] [   7  75 143 211] [   8  76 144 212] [   9  77 145 213] [  10  78 146 214] [  11  79 147 215] [  12  80 148 216] [  13  81 149 217] [  14  82 150 218] [  15  83 151 219] [  16  84 152 220] [  17  85 153 221] [  18  86 154 222] [  19  87 155 223] [  20  88 156 224] [  21  89 157 225] [  22  90 158 226] [  23  91 159 227] [  24  92 160 228] [  25  93 161 229] [  26  94 162 230] [  27  95 163 231] [  28  96 164 232] [  29  97 165 233] [  30  98 166 234] [  31  99 167 235] [  32 100 168 236] [  33 101 169 237] [  34 102 170 238] [  35 103 171 239] [  36 104 172 240] [  37 105 173 241] [  38 106 174 242] [  39 107 175 243] [  40 108 176 244] [  41 109 177 245] [  42 110 178 246] [  43 111 179 247] [  44 112 180 248] [  45 113 181 249] [  46 114 182 250] [  47 115 183 251] [  48 116 184 252] [  49 117 185 253] [  50 118 186 254] [  51 119 187 255] [  52 120 188 256] [  53 121 189 257] [  54 122 190 258] [  55 123 191 259] [  56 124 192 260] [  57 125 193 261] [  58 126 194 262] [  59 127 195 263] [  60 128 196 264] [  61 129 197 265] [  62 130 198 266] [  63 131 199 267] [  64 132 200 268] [  65 133 201 269] [  66 134 202 270] [  67 135 203 271]
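The two logs show that the build host (Xeon E5-2697 v4) reports no AVX-512 features, while the KNL run host reports several. A small sketch for pulling the AVX-512 entries out of such a feature list follows; the helper name `avx512_features` is hypothetical, and using it on `/proc/cpuinfo` assumes the Linux flag names match the ones GROMACS prints.

```shell
# Print the AVX-512 entries of a space-separated CPU feature list read from stdin.
# (Hypothetical helper for illustration.)
avx512_features() {
    tr ' ' '\n' | grep '^avx512' | LC_ALL=C sort -u | tr '\n' ' ' | sed 's/ $//'
}

# Feature list copied from the "Hardware detected" section of the 2016.5 log above
knl_features="aes apic avx avx2 avx512f avx512pf avx512er avx512cd clfsh cmov cx8 cx16 f16c fma htt lahf mmx msr nonstop_tsc pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic"
echo "$knl_features" | avx512_features

# On a live Linux node (assumption: flag names match):
#   grep -m1 '^flags' /proc/cpuinfo | avx512_features
```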

Attachment: topol0.tpr (1.31 MB), Carlo Camilloni, 05/17/2018 01:54 PM

Associated revisions

Revision f8b78130 (diff)
Added by Roland Schulz about 1 year ago

Fix illegal instruction error on KNL

Fixes #2504

Change-Id: Ie2f55718f98d3dfbf3c312afa5141c77ead77a6d

History

#1 Updated by Mark Abraham about 1 year ago

  • Description updated (diff)

#2 Updated by Mark Abraham about 1 year ago

Your .tpr works fine for me with intel 2018 compilers (+ mkl,openmpi,hwloc) on our dev-knl01 node.

The timing suggests that the module(?) for GROMACS 2018.1 is preventing mdrun from resolving the hwloc library at run time. You might try ldd mdrun_knl to see if that is the case, or explicitly load an hwloc module (though I do not see such a module on Marconi).
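The ldd check suggested here could look roughly like the sketch below. `ldd` marks every shared library the dynamic loader cannot resolve with "not found"; the filter name `unresolved_libs` is hypothetical.

```shell
# Read ldd output on stdin and print the names of libraries that failed to resolve.
# (Hypothetical helper for illustration.)
unresolved_libs() {
    awk '/not found/ { print $1 }'
}

# Usage on the cluster, with the same modules loaded as in the job script:
#   ldd "$(command -v mdrun_knl)" | unresolved_libs
```

Empty output would suggest that all libraries, hwloc included, resolve at run time, so the failure would have to come from somewhere else.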

#3 Updated by Mark Abraham about 1 year ago

However, an otherwise identical build with icc 17.0.4.20170411 behaves very strangely, apparently starting 32 independent processes, each with 272 threads. That's also consistent with Carlo's observations. I suggest Carlo use/request Intel 2018 (I used 18.0.1.20171018).

#4 Updated by Roland Schulz about 1 year ago

For me it's fine with 17.0.4. You might want to compile without hwloc (-DGMX_HWLOC=no) to check whether that works. Also, you should get an error before the MPI error. Please check whether there is any output before the MPI error and copy it here. BTW: you don't want to run with the default number of OpenMP threads on KNL. You want to use "-ntomp 4".
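To illustrate the thread-count point: the node in the log above has 272 logical cores, so 32 ranks with "-ntomp 4" use 128 threads, whereas 32 ranks that each start 272 threads (the behaviour described in comment #3) would oversubscribe it heavily. A rough sketch of that arithmetic, with a hypothetical helper name:

```shell
# Usage: thread_budget RANKS THREADS_PER_RANK LOGICAL_CORES
# (Hypothetical helper: compares total threads against available logical cores.)
thread_budget() {
    total=$(( $1 * $2 ))
    if [ "$total" -gt "$3" ]; then
        echo "oversubscribed: $total threads on $3 logical cores"
    else
        echo "ok: $total threads on $3 logical cores"
    fi
}

thread_budget 32 4 272     # -np 32 with -ntomp 4
thread_budget 32 272 272   # 32 ranks each starting 272 threads
```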

#5 Updated by Carlo Camilloni about 1 year ago

There is not much text before the MPI error:

GROMACS: mdrun_knl, version 2018.1
Executable: /marconi/home/userexternal/ccamillo/opt/gromacs-2018.1/bin/mdrun_knl
Data prefix: /marconi/home/userexternal/ccamillo/opt/gromacs-2018.1
Working dir: /marconi_scratch/userexternal/ccamillo/test
Command line:
mdrun_knl -s topol0 -nb cpu -v -maxh 23.9 -nsteps -1

Back Off! I just backed up md.log to ./#md.log.9#

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25280 RUNNING AT r065c04s03
=   EXIT CODE: 132
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25280 RUNNING AT r065c04s03
=   EXIT CODE: 4
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

I will do some more tests and report back.

#6 Updated by Carlo Camilloni about 1 year ago

Test 1:
(I have added hwloc to the modules loaded explicitly in the script and passed -ntomp 2 explicitly. I was already setting the number of OpenMP threads in the Slurm script anyway, and this is not an issue with GROMACS 2016.5.)

module load intel intelmpi mkl hwloc
mpiexec -np 32 mdrun_knl -s topol0 -nb cpu -v -maxh 23.9 -nsteps -1 -ntomp 2 >& log

Same error messages

Test 2:
Recompiled with HWLOC OFF

GROMACS version: 2018.1
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: disabled
SIMD instructions: AVX_512_KNL
FFT library: Intel MKL
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-05-17 11:33:17
Built by: ccamillo@r000u06l01 [CMAKE]
Build OS/arch: Linux 3.10.0-327.36.3.el7.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/bin/icc Intel 17.0.4.20170411
C compiler flags: -xMIC-AVX512 -mkl=sequential -std=gnu99 -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits
C++ compiler: /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/bin/icpc Intel 17.0.4.20170411
C++ compiler flags: -xMIC-AVX512 -mkl=sequential -std=c++11 -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits

Same error message as before

I will test the Intel 2018 available on Marconi.

#7 Updated by Carlo Camilloni about 1 year ago

Unfortunately, it does not work with the available Intel 2018 either.

GROMACS version: 2018.1
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: disabled
SIMD instructions: AVX_512_KNL
FFT library: Intel MKL
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
Built on: 2018-05-18 13:21:43
Built by: ccamillo@r000u06l01 [CMAKE]
Build OS/arch: Linux 3.10.0-327.36.3.el7.x86_64 x86_64
Build CPU vendor: Intel
Build CPU brand: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
Build CPU family: 6 Model: 79 Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /cineca/prod/opt/compilers/intel/pe-xe-2018/binary/bin/icc Intel 18.0.2.20180210
C compiler flags: -xMIC-AVX512 -mkl=sequential -std=gnu99 -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits
C++ compiler: /cineca/prod/opt/compilers/intel/pe-xe-2018/binary/bin/icpc Intel 18.0.2.20180210
C++ compiler flags: -xMIC-AVX512 -mkl=sequential -std=c++11 -O3 -DNDEBUG -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits

log:

Back Off! I just backed up md.log to ./#md.log.13#

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 2883 RUNNING AT r065c01s03
=   EXIT CODE: 132
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 2883 RUNNING AT r065c01s03
=   EXIT CODE: 4
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

#8 Updated by Roland Schulz about 1 year ago

It might be hard to help without access to the cluster given that we can't reproduce it on other machines.
Ideas:
  • Try to run mdrun from an interactive slurm shell rather than starting it through a script. My hope is it shows more info in that case.
  • Build with "-g" (CXXFLAGS and CFLAGS) and run inside debugger.

Have you asked the Marconi admins?

#9 Updated by Carlo Camilloni about 1 year ago

I know. Anyway, for completeness:

Compiled in Debug mode, it works.

Compiled in RelWithDebInfo, it doesn't, and unfortunately the interactive mode doesn't give any additional information about the error.

It is likely a compiler issue, but an unfortunate one.

#10 Updated by Mark Abraham about 1 year ago

Or try using an Intel compiler with another MPI library.

#11 Updated by Carlo Camilloni about 1 year ago

I don't know if it is of any help, but after recompiling with OpenMPI and RelWithDebInfo and running it inside GDB, I get:

Tracepoint 1 at 0x63fc7b: file /marconi/home/userexternal/ccamillo/Codes/gromacs-2018.1/src/gromacs/hardware/identifyavx512fmaunits.cpp, line 244.

#12 Updated by Roland Schulz about 1 year ago

Could you try the Release build and change the SIMD_AVX_512_CXX_SUPPORTED in the CMakeCache.txt file to 0?

#13 Updated by Roland Schulz about 1 year ago

Could you also paste all (/much more) of your output from GDB? Did you set a tracepoint? In what context did you get that message?

#14 Updated by Carlo Camilloni about 1 year ago

Changing SIMD_AVX_512_CXX_SUPPORTED in the CMakeCache.txt file to 0 makes it work!

As for the GDB output, it was not much more than what I reported:

Reading symbols from /marconi/home/userexternal/ccamillo/opt/gromacs-2018.1/bin/mdrun_knl...done.
(gdb) run
Starting program: /marconi/home/userexternal/ccamillo/opt/gromacs-2018.1/bin/mdrun_knl
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/inspector/lib64/libstdc++.so.6
Missing separate debuginfo for /cineca/prod/opt/compilers/intel/pe-xe-2017/binary/inspector/lib64/libgcc_s.so.1
[New Thread 0x2aaab4359700 (LWP 38979)]
[New Thread 0x2aaab497d700 (LWP 39012)]
:-) GROMACS - mdrun_knl, 2018.1 (-:

GROMACS is written by:
[...]
Back Off! I just backed up md.log to ./#md.log.9#

Program received signal SIGILL, Illegal instruction.
checkDualAvx512FmaUnits () at /marconi/home/userexternal/ccamillo/Codes/gromacs-2018.1/src/gromacs/hardware/identifyavx512fmaunits.cpp:244
244 return (timeFmaAndShuf > 1.5 * timeFmaOnly);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.5.x86_64 infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64 libhfi1-0.5-27.el7.x86_64 libibumad-1.3.10.2-1.el7.x86_64 libibverbs-1.1.8-8.el7.x86_64 libipathverbs-1.3-2.el7.x86_64 libnl3-3.2.21-10.el7.x86_64 libpciaccess-0.13.4-2.el7.x86_64 libpsm2-10.2.235-1.x86_64 librdmacm-1.0.21-1.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 opensm-libs-3.3.19-1.el7.x86_64 sssd-client-1.13.0-40.el7_2.12.x86_64

(gdb) trace
Tracepoint 1 at 0x63fc7b: file /marconi/home/userexternal/ccamillo/Codes/gromacs-2018.1/src/gromacs/hardware/identifyavx512fmaunits.cpp, line 244.

#15 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2504.
Uploader: Roland Schulz
Change-Id: gromacs~release-2018~Ie2f55718f98d3dfbf3c312afa5141c77ead77a6d
Gerrit URL: https://gerrit.gromacs.org/7926

#16 Updated by Roland Schulz about 1 year ago

  • Status changed from New to Fix uploaded

Could you please try the fix I uploaded to Gerrit?

#17 Updated by Roland Schulz about 1 year ago

PS: Please undo the change to the CMakeCache.txt for the test.

#18 Updated by Carlo Camilloni about 1 year ago

Yes, your fix (patch set 2) works.

#19 Updated by Roland Schulz about 1 year ago

Thanks. Until we release 2018.2 you can use the CMakeCache work-around. That shouldn't have any side-effects.
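The CMakeCache work-around mentioned here (setting SIMD_AVX_512_CXX_SUPPORTED to 0, as in comments #12 and #14) can be applied by hand in an editor or scripted roughly as below. This is a sketch: the helper name is hypothetical, and it assumes the cache entry has the usual `NAME:TYPE=VALUE` form, keeping whatever type is already recorded.

```shell
# Set SIMD_AVX_512_CXX_SUPPORTED to 0 in the given CMakeCache.txt.
# (Hypothetical helper; assumes the standard NAME:TYPE=VALUE cache syntax.)
disable_avx512_cxx() {
    sed -E 's/^(SIMD_AVX_512_CXX_SUPPORTED:[A-Z]+)=.*/\1=0/' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Usage, from the GROMACS build directory, followed by a rebuild:
#   disable_avx512_cxx CMakeCache.txt && make && make install
```

As noted in comment #17, the change should be undone before testing a build that already contains the fix.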

#20 Updated by Carlo Camilloni about 1 year ago

Great, thanks!

#21 Updated by Roland Schulz about 1 year ago

  • Status changed from Fix uploaded to Resolved

#22 Updated by Mark Abraham about 1 year ago

  • Category set to analysis tools
  • Status changed from Resolved to Closed
  • Assignee set to Roland Schulz
  • Target version set to 2018.2
