Bug #3440
Multi-nodes run exits with error with openmpi/4.0.0
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Description
Not sure whether this is related to #3296, but I was trying to run a multi-simulation job on three nodes, each with 32 logical cores and 4 GPUs. The error output and tpr files are attached.
- Commands that reproduce the error
salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
module load gromacs/2020.1
module load openmpi/4.0.0
mpirun -np 96 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 1 -nsteps 10000
- Commands that run successfully
salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
module load gromacs/2020.1
module load openmpi/4.0.2
mpirun -np 96 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 1 -nsteps 10000
- Commands that work, but only utilize one node (they raise "WARNING: On rank 0: oversubscribing the available 32 logical CPU cores per node with 96 threads. This will cause considerable performance loss.")
salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
module load gromacs/2020.1
module load openmpi/4.0.0
mpirun -np 16 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 6 -nsteps 10000
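A quick way to check whether mpirun is actually spreading ranks across all three allocated nodes, independently of GROMACS, is to launch a trivial command and count ranks per host. A minimal sketch, assuming the same salloc session and openmpi/4.0.0 module as above:

# With --nodes=3 -n 96, roughly 32 ranks should land on each of the three nodes
mpirun -np 96 hostname | sort | uniq -c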
History
#2 Updated by Szilárd Páll 10 months ago
The linked output seems to indicate an hwloc crash, so it is probably not GROMACS-related. Can you try building against a different hwloc version?
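A quick way to see which hwloc versions are in play, assuming gmx_mpi and ompi_info from the loaded modules are on the PATH, might be:

# The GROMACS version header normally reports the hwloc support it was built with
gmx_mpi -version 2>&1 | grep -i hwloc
# Depending on the OpenMPI version, ompi_info may also report the hwloc it carries
ompi_info --all | grep -i hwloc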
Secondly, regarding the multi-node launch: have you checked, or do you know, whether the OpenMPI installation is configured with SLURM so that node allocations get passed on to the MPI launcher?
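One way to check this, assuming ompi_info from the openmpi/4.0.0 module is on the PATH: if OpenMPI was built with SLURM support, its process-launch (plm) and resource-allocation (ras) frameworks should list slurm components.

module load openmpi/4.0.0
ompi_info | grep -i slurm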
On our cluster I have used the manual approach: allocate the nodes first, then launch with mpirun --hostfile or --host.
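As an illustration of that manual route, a minimal sketch (the hostfile.txt name is arbitrary; 32 slots per node matches the reporter's hardware):

# Inside the salloc session: build a hostfile from the SLURM allocation
for h in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do echo "$h slots=32"; done > hostfile.txt
# Launch all 96 ranks across the allocated nodes explicitly
mpirun --hostfile hostfile.txt -np 96 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 1 -nsteps 10000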