Bug #3296

multi run with >1 rank per simulation exits with MPI_ABORT

Added by Szilárd Páll 6 months ago. Updated 5 months ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

$ mpirun --mca mpi_abort_print_stack 1 -np 4 $gmx_mpi mdrun -v -multidir repl_{001..002} -nsteps 0 -s topol

GROMACS:      gmx mdrun, version 2020.1-dev-20200110-4879d39
Executable:   /nethome/pszilard-projects/gromacs/gromacs-20/build_AVX2_256_gcc8_cuda10.1_ompi400/bin/gmx_mpi
Data prefix:  /nethome/pszilard/projects/gromacs/gromacs-20 (source tree)
Working dir:  /nethome/pszilard-projects/gromacs/bench/LUMI-bench/aqp_ensemble/test_repl-128
Command line:
  gmx_mpi mdrun -v -multidir repl_001 repl_002 -nsteps 0 -s topol

Back Off! I just backed up md.log to ./#md.log.11#

Back Off! I just backed up md.log to ./#md.log.11#
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see
log).
Reading file topol.tpr, VERSION 2020.1-dev-20200108-e05cc33 (single precision)
Reading file topol.tpr, VERSION 2020.1-dev-20200108-e05cc33 (single precision)
Overriding nsteps with value passed on the command line: 0 steps, 0 ps
Overriding nsteps with value passed on the command line: 0 steps, 0 ps
Changing nstlist from 40 to 100, rlist from 1.2 to 1.287

Changing nstlist from 40 to 100, rlist from 1.2 to 1.287

[dev-purley01:30476] *** An error occurred in MPI_Allreduce
[dev-purley01:30476] *** reported by process [566165505,3]
[dev-purley01:30476] *** on communicator MPI_COMM_WORLD
[dev-purley01:30476] *** MPI_ERR_COMM: invalid communicator
[dev-purley01:30476] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dev-purley01:30476] ***    and potentially your MPI job)

Associated revisions

Revision 4d60bf59 (diff)
Added by Berk Hess 6 months ago

Correct fixed redmine issue id from 3297 to 3296

Recent commit 36a65816 said it fixed issue 3297, but this should
have been issue 3296.

Refs #3296

Change-Id: I85a95e818d3cda816211dc2aa8ddb32e9e0c69d4

History

#1 Updated by Szilárd Páll 6 months ago

(gdb) bt 
#0  0x00007f8f49ac6bd0 in PMPI_Allreduce () from /opt/tcbsys/openmpi/4.0.0/lib/libmpi.so.40
#1  0x00000000010e7106 in gmx_sumi_sim (nr=1, r=0x7ffdb99c8abc, ms=0x37f1670) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdrunutility/multisim.cpp:220
#2  0x0000000001133299 in gmx::(anonymous namespace)::countOverAllRanks (cr=0x3839020, ms=0x37f1670, countOnThisRank=0)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/taskassignment/taskassignment.cpp:206
#3  0x00000000011338f0 in gmx::GpuTaskAssignmentsBuilder::build (this=0x7ffdb99ca8cf, gpuIdsToUse=std::vector of length 0, capacity 0, 
    userGpuTaskAssignment=std::vector of length 0, capacity 0, hardwareInfo=..., cr=0x3839020, ms=0x37f1670, physicalNodeComm=..., nonbondedTarget=<incomplete type>, 
    pmeTarget=<incomplete type>, bondedTarget=<incomplete type>, updateTarget=<incomplete type>, useGpuForNonbonded=false, useGpuForPme=false, rankHasPpTask=true, 
    rankHasPmeTask=true) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/taskassignment/taskassignment.cpp:331
#4  0x00000000010ce53a in gmx::Mdrunner::mdrunner (this=0x7ffdb99cb540) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdrun/runner.cpp:1159
#5  0x0000000000412911 in gmx::gmx_mdrun (argc=9, argv=0x37c27a8) at /nethome/pszilard/projects/gromacs/gromacs-20/src/programs/mdrun/mdrun.cpp:270
#6  0x0000000000ceef17 in gmx::(anonymous namespace)::CMainCommandLineModule::run (this=0x37d9250, argc=9, argv=0x37c27a8)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/commandline/cmdlinemodulemanager.cpp:128
#7  0x0000000000cf09ca in gmx::CommandLineModuleManager::run (this=0x7ffdb99cc568, argc=9, argv=0x37c27a8)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/commandline/cmdlinemodulemanager.cpp:570
#8  0x00000000004100a8 in main (argc=10, argv=0x37c27a0) at /nethome/pszilard/projects/gromacs/gromacs-20/src/programs/gmx.cpp:59
(gdb) up
#1  0x00000000010e7106 in gmx_sumi_sim (nr=1, r=0x7ffdb99c8abc, ms=0x37f1670) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdrunutility/multisim.cpp:220
220         MPI_Allreduce(MPI_IN_PLACE, r, nr, MPI_INT, MPI_SUM, ms->mpi_comm_masters);
(gdb) p *ms 
$2 = {nsim = 2, sim = 0, mpi_group_masters = 0x37f16a0, mpi_comm_masters = 0x30ff760 <ompi_mpi_comm_null>, mpb = 0x0}
(gdb)

#2 Updated by Szilárd Páll 6 months ago

The issue is that gmx_multisim_t (specifically mpi_comm_masters) is used by countOverAllRanks() in GpuTaskAssignmentsBuilder::build(), whereas the multisim data is only initialized later, in simulatorBuilder.build().
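
The failure mode described above can be illustrated with a small MPI-free sketch. It is not the GROMACS code and not the fix from commit 36a65816: all names here (SketchComm, MultiSimSketch, sumOverMasters, countWithGuard) are hypothetical, and an empty std::optional stands in for the ompi_mpi_comm_null value seen in the gdb dump. Under those assumptions, it shows why reducing over a communicator that is not valid on the calling rank fails, and one possible defensive check.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an MPI communicator: it only stores the values
// contributed by its member ranks. An empty optional models MPI_COMM_NULL.
struct SketchComm
{
    std::vector<int> contributions;
};

// Hypothetical, reduced version of the multisim record shown in the gdb dump.
struct MultiSimSketch
{
    int                       nsim = 1;
    int                       sim  = 0;
    std::optional<SketchComm> mastersComm; // empty until setup has run
};

// Models an all-reduce sum over the masters communicator. Like MPI_Allreduce
// on an invalid communicator, it fails when the communicator is absent.
int sumOverMasters(const MultiSimSketch& ms)
{
    assert(ms.nsim >= 1);
    if (!ms.mastersComm.has_value())
    {
        throw std::runtime_error("invalid communicator");
    }
    int sum = 0;
    for (int value : ms.mastersComm->contributions)
    {
        sum += value;
    }
    return sum;
}

// Defensive variant (an assumption, not the upstream patch): a rank without
// a valid communicator keeps its local count instead of reducing.
int countWithGuard(const MultiSimSketch& ms, int countOnThisRank)
{
    if (!ms.mastersComm.has_value())
    {
        return countOnThisRank;
    }
    return sumOverMasters(ms);
}
```

Calling sumOverMasters() on a default-constructed MultiSimSketch throws, which mirrors the MPI_ERR_COMM abort in the report, while countWithGuard() returns the local count in that case. Whether a guard like this or a reordering of the initialization is the appropriate remedy depends on the real call sequence in runner.cpp.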

#3 Updated by Szilárd Páll 6 months ago

  • Private changed from Yes to No

#4 Updated by Berk Hess 6 months ago

  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess
  • Priority changed from Normal to High
  • Target version set to 2020.1

#5 Updated by Szilárd Páll 6 months ago

  • Status changed from Fix uploaded to Resolved

#6 Updated by Paul Bauer 5 months ago

  • Status changed from Resolved to Closed
