Bug #3448

GMX 2020.1 - Multidir simulations can stop at different times when killed by job manager

Added by Daniel Kozuch 5 months ago. Updated 5 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Related to Bug #2440 (https://redmine.gromacs.org/issues/2440). I am using GROMACS 2020.1 patched with PLUMED 2.6. I am still seeing the "incompatible subsystems" error when restarting from checkpoint files, specifically when the job is killed by the job manager (SLURM in my case). I can provide all the run files if needed.

Here is the error report:

GROMACS: gmx mdrun, version 2020.1-MODIFIED
Executable: /home/dkozuch/programs/gromacs_201gp/bin/gmx_201gp
Data prefix: /home/dkozuch/programs/gromacs_201gp
Working dir: /scratch/gpfs/dkozuch/phase_effects/2f4k/tip4pIce/ptwte/v3/1bar/folded/t1/0
Process ID: 4255
Command line:
gmx_201gp mdrun -s 2f4k_sim -cpi 2f4k_sim -append no -deffnm 2f4k_sim -plumed plumed.dat -multidir 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 -replex 500 -ntomp 3 -pin on

GROMACS version: 2020.1-MODIFIED
This program has been built from source code that has been altered and does not match the code released as part of the official GROMACS version 2020.1-MODIFIED. If you did not intend to use an altered GROMACS version, make sure to download an intact source distribution and compile that before proceeding.
If you have modified the source code, you are strongly encouraged to set your custom version suffix (using -DGMX_VERSION_STRING_OF_FORK), which can help later with scientific reproducibility and also when reporting bugs.
Release checksum: 5cde61b9d46b24153ba84f499c996612640b965eff9a218f8f5e561f94ff4e43
Computed checksum: c88b2736bcaf07bce004173d41b5633c40a60da1c6e9d3a8732bf162ae8e9ca7
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: Intel MKL
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: hwloc-1.11.8
Tracing support: disabled
C compiler: /opt/intel/compilers_and_libraries_2019.1.144/linux/bin/intel64/icc Intel 19.0.0.20181018
C compiler flags: -march=core-avx2 -mkl=sequential -std=gnu99 -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits
C++ compiler: /opt/intel/compilers_and_libraries_2019.1.144/linux/bin/intel64/icpc Intel 19.0.0.20181018
C++ compiler flags: -march=core-avx2 -mkl=sequential -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits -qopenmp
CUDA compiler: /usr/local/cuda-10.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Wed_Oct_23_19:24:38_PDT_2019;Cuda compilation tools, release 10.2, V10.2.89
CUDA compiler flags:-std=c++14;-gencode;arch=compute_60,code=sm_60;-use_fast_math;;-march=core-avx2 -mkl=sequential -ip -funroll-all-loops -alias-const -ansi-alias -no-prec-div -fimf-domain-exclusion=14 -qoverride-limits -qopenmp
CUDA driver: 10.20
CUDA runtime: 10.20

Running on 4 nodes with total 112 cores, 112 logical cores, 16 compatible GPUs
Cores per node: 28
Logical cores per node: 28
Compatible GPUs per node: 4
All nodes have identical type(s) of GPUs
Hardware detected on host tiger-i19g6 (the node of MPI rank 0):
CPU info:
Vendor: Intel
Brand: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Family: 6 Model: 79 Stepping: 1
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic

Hardware topology: Full, with devices
Sockets, cores, and logical processors:
Socket 0: [ 0] [ 2] [ 4] [ 6] [ 8] [ 10] [ 12] [ 14] [ 16] [ 18] [ 20] [ 22] [ 24] [ 26]
Socket 1: [ 1] [ 3] [ 5] [ 7] [ 9] [ 11] [ 13] [ 15] [ 17] [ 19] [ 21] [ 23] [ 25] [ 27]
Numa nodes:
Node 0 (137338761216 bytes mem): 0 2 4 6 8 10 12 14 16 18 20 22 24 26
Node 1 (137438953472 bytes mem): 1 3 5 7 9 11 13 15 17 19 21 23 25 27
Latency:
0 1
0 1.00 2.10
1 2.10 1.00
Caches:
L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
L3: 36700160 bytes, linesize 64 bytes, assoc. 20, shared 14 ways
PCI devices:
0000:02:00.0 Id: 8086:24f0 Class: 0x0208 Numa: 0
0000:03:00.0 Id: 10de:15f8 Class: 0x0302 Numa: 0
0000:04:00.0 Id: 10de:15f8 Class: 0x0302 Numa: 0
0000:00:11.4 Id: 8086:8d62 Class: 0x0106 Numa: 0
0000:01:00.0 Id: 8086:1521 Class: 0x0200 Numa: 0
0000:01:00.1 Id: 8086:1521 Class: 0x0200 Numa: 0
0000:08:00.0 Id: 102b:0534 Class: 0x0300 Numa: 0
0000:00:1f.2 Id: 8086:8d02 Class: 0x0106 Numa: 0
0000:81:00.0 Id: 144d:a821 Class: 0x0108 Numa: 1
0000:82:00.0 Id: 10de:15f8 Class: 0x0302 Numa: 1
0000:83:00.0 Id: 10de:15f8 Class: 0x0302 Numa: 1
GPU info:
Number of GPUs detected: 4
#0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
#1: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
#2: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible
#3: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat: compatible

[ ... ]

Initializing Replica Exchange
Repl There are 32 replicas:
Multi-checking the number of atoms ... OK
Multi-checking the integrator ... OK
Multi-checking init_step+nsteps ...
init_step+nsteps is not equal for all subsystems
subsystem 0: 4056879
subsystem 1: 4056849
subsystem 2: 4056849
subsystem 3: 4056849
subsystem 4: 4056849
subsystem 5: 4056849
subsystem 6: 4056849
subsystem 7: 4056849
subsystem 8: 4056849
subsystem 9: 4056849
subsystem 10: 4056849
subsystem 11: 4056849
subsystem 12: 4056849
subsystem 13: 4056849
subsystem 14: 4056839
subsystem 15: 4056839
subsystem 16: 4056839
subsystem 17: 4056839
subsystem 18: 4056839
subsystem 19: 4056839
subsystem 20: 4056839
subsystem 21: 4056839
subsystem 22: 4056839
subsystem 23: 4056839
subsystem 24: 4056839
subsystem 25: 4056839
subsystem 26: 4056839
subsystem 27: 4056839
subsystem 28: 4056839
subsystem 29: 4056839
subsystem 30: 4056839
subsystem 31: 4056839

-------------------------------------------------------
Program: gmx mdrun, version 2020.1-MODIFIED
Source file: src/gromacs/mdrunutility/multisim.cpp (line 376)
MPI rank: 0 (out of 32)

Fatal error:
The 32 subsystems are not compatible
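
For context, the check that fails here (in src/gromacs/mdrunutility/multisim.cpp) gathers one value per simulation, init_step+nsteps as read from each replica's run input and checkpoint, across the masters of the 32 simulations and requires all values to be identical. In the log above, three distinct values survive (4056879, 4056849, 4056839): the replicas wrote their final checkpoints at different steps before being killed. The sketch below illustrates this kind of multi-simulation consistency check; it is a simplified stand-in, and the function and communicator names are hypothetical, not the actual GROMACS code.

#include <mpi.h>
#include <cstdint>
#include <cstdio>
#include <vector>

/* Sketch of a multi-simulation consistency check: every simulation's
 * master rank contributes one value, the values are gathered across the
 * masters' communicator, and any mismatch is reported per subsystem.
 * Names are hypothetical; this is not the GROMACS implementation. */
bool multiCheckInt64(MPI_Comm mastersComm, int64_t value, const char* name)
{
    int nsim = 0;
    int sim  = 0;
    MPI_Comm_size(mastersComm, &nsim);
    MPI_Comm_rank(mastersComm, &sim);

    std::vector<int64_t> values(nsim);
    MPI_Allgather(&value, 1, MPI_INT64_T,
                  values.data(), 1, MPI_INT64_T, mastersComm);

    bool allEqual = true;
    for (int i = 1; i < nsim; i++)
    {
        allEqual = allEqual && (values[i] == values[0]);
    }
    if (!allEqual && sim == 0)
    {
        fprintf(stderr, "%s is not equal for all subsystems\n", name);
        for (int i = 0; i < nsim; i++)
        {
            fprintf(stderr, "  subsystem %d: %lld\n", i,
                    static_cast<long long>(values[i]));
        }
    }
    return allEqual;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Toy stand-in for init_step+nsteps: give rank 0 a different value
     * to reproduce the kind of mismatch shown in the log above. */
    int64_t stepsAndInit = (rank == 0) ? 4056879 : 4056849;
    if (!multiCheckInt64(MPI_COMM_WORLD, stepsAndInit, "init_step+nsteps"))
    {
        MPI_Abort(MPI_COMM_WORLD, 1); /* "subsystems are not compatible" */
    }
    MPI_Finalize();
    return 0;
}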


Related issues

Related to GROMACS - Bug #2440: Multidir simulations can stop at different times when using mdrun -maxh (Closed)

History

#1 Updated by Szilárd Páll 5 months ago

  • Related to Bug #2440: Multidir simulations can stop at different times when using mdrun -maxh added

#2 Updated by Szilárd Páll 5 months ago

  • Subject changed from GMX 2020.1 - Multidir simulations can stop at different times when killed by job manager (Related to Bug #2440). to GMX 2020.1 - Multidir simulations can stop at different times when killed by job manager

#3 Updated by Szilárd Páll 5 months ago

Daniel Kozuch wrote:

Related to Bug #2440 (https://redmine.gromacs.org/issues/2440). I am using GROMACS 2020.1 patched with PLUMED 2.6. I am still seeing incompatible subsystems when restarting from checkpoint files, specifically if the job is killed by the job manager (SLURM in my case).

Can you please confirm that this also happens without PLUMED?

#4 Updated by Berk Hess 5 months ago

Yes, my fix should have fixed this for code that communicates between the simulations.
As PLUMED does not use an API with GROMACS, mdrun cannot know that it needs to synchronize checkpointing between the simulations.
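
The synchronization described here amounts to reducing a per-simulation "checkpoint now" signal across the masters of all simulations, so that every replica writes its checkpoint at the same global step. A rough sketch of such a signal reduction is below; the names are illustrative, and this is not the actual GROMACS implementation.

#include <mpi.h>
#include <cstdio>

/* Hypothetical inter-simulation signal coordination. Each simulation's
 * master rank sets wantCheckpoint when it decides to checkpoint (e.g.
 * -maxh is nearly reached, or SIGTERM was caught); the MPI_MAX
 * reduction propagates the signal so all simulations checkpoint at the
 * same synchronized step. */
bool anySimulationSignalsCheckpoint(MPI_Comm mastersComm, bool wantCheckpoint)
{
    int local  = wantCheckpoint ? 1 : 0;
    int global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MAX, mastersComm);
    return global != 0;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Toy example: only rank 0 raises the signal; after the reduction
     * every rank agrees to checkpoint at the same step. */
    bool doCheckpoint = anySimulationSignalsCheckpoint(MPI_COMM_WORLD, rank == 0);
    printf("rank %d: checkpoint now = %d\n", rank, doCheckpoint ? 1 : 0);
    MPI_Finalize();
    return 0;
}

A SIGKILL from the job manager, or a termination driven from inside PLUMED, never passes through such a reduction, so each replica keeps only the last checkpoint it wrote on its own schedule, which produces exactly the init_step+nsteps mismatch reported above.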
