Bug #3440

Multi-node run exits with error with openmpi/4.0.0

Added by Yuxuan Zhuang 5 months ago. Updated 5 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Not sure whether this is related to #3296, but I was trying to run a multi-simulation job on three nodes, each with 32 logical cores and 4 GPUs. The error output and tpr file are attached.

  • Commands that reproduce the error
    salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
    
    module load gromacs/2020.1
    module load openmpi/4.0.0
    
    mpirun -np 96 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 1 -nsteps 10000
    
  • Commands that run successfully
    salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
    
    module load gromacs/2020.1
    module load openmpi/4.0.2
    
    mpirun -np 96 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 1 -nsteps 10000
    
  • Commands that work but only utilize one node (see the diagnostic sketch after this list).
    (These raise: WARNING: On rank 0: oversubscribing the available 32 logical CPU cores per node with 96 threads. This will cause considerable performance loss.)
    salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
    
    module load gromacs/2020.1
    module load openmpi/4.0.0
    
    mpirun -np 16 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 6 -nsteps 10000
    
    
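A quick diagnostic for the single-node behaviour above (a sketch only, not something from the original report; the partition name and rank count are copied from the commands above) is to print which host each rank lands on:

    salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
    module load openmpi/4.0.0
    mpirun -np 96 hostname | sort | uniq -c

If only one hostname appears in the output, all ranks are being placed on a single node, i.e. the SLURM allocation is not reaching mpirun.
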
md.tpr (15.3 MB) - Yuxuan Zhuang, 03/11/2020 04:31 PM
error_openmpi_4_0_0.txt (151 KB) - Yuxuan Zhuang, 03/11/2020 04:31 PM

History

#1 Updated by Berk Hess 5 months ago

Which MPI version was GROMACS built with (I think this is written in the log file)? You can't freely mix MPI versions, but normally you would expect mixing minor versions to work.
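One hedged way to check this (the exact header text can vary between GROMACS builds, and walker1/md.log is only an example path based on the -multidir setup above):

    # The version header printed by gmx_mpi, and repeated at the top of the log file,
    # normally includes an "MPI library" line describing what the binary was built with.
    gmx_mpi --version | grep -i "MPI library"
    grep -i "MPI library" walker1/md.log
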

#2 Updated by Szilárd Páll 5 months ago

The linked output seems to indicate an hwloc crash, so this is probably not GROMACS-related. Can you try to build against a different OpenMPI version?

Secondly, regarding multi-node launching: have you checked, or do you know, whether the OpenMPI installation is configured with SLURM such that node allocations get passed to the MPI launcher?
On our cluster I've used the manual approach of allocating nodes first, then launching with mpirun --hostfile / --host.
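For reference, a minimal sketch of that manual approach under SLURM (assuming scontrol is available inside the allocation; hosts.txt is an illustrative file name, and slots=32 matches the 32 logical cores per node mentioned in the report):

    salloc -p lindahl --time=2:00:00 --nodes=3 -n 96
    module load gromacs/2020.1
    module load openmpi/4.0.0

    # Build an explicit hostfile from the SLURM allocation, one node per line with a slot count,
    # then hand it to mpirun instead of relying on the SLURM integration.
    scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/$/ slots=32/' > hosts.txt
    mpirun --hostfile hosts.txt -np 96 gmx_mpi mdrun -deffnm md -multidir walker{1..16} -ntomp 1 -nsteps 10000
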
