Bug #2989

(thread-) MPI setup hanging on bs_jetson_tk1

Added by Paul Bauer 4 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
core library
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

Running gmx mdrun leads to a hang during the thread-MPI setup phase since commit https://gerrit.gromacs.org/c/gromacs/+/11197


Related issues

Related to GROMACS - Task #2986: Post submit failing in two configurations (Closed)
Related to GROMACS - Task #2992: Split hw_opt in const user options and dynamic settings (New)

Associated revisions

Revision 5c15e1c2 (diff)
Added by Berk Hess 4 months ago

Fix MPI deadlock in affinity setting

Commit 96d28d6b introduced the possibility for deadlocks when
the affinity masks changed over time.

Fixes #2989

Change-Id: I454b38deb9ff11c90cdf4a19aaa80e79e8898df5
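
The pattern behind this fix can be illustrated with a short sketch (a hypothetical example, not the actual GROMACS patch): a collective call guarded by a condition derived from per-rank state deadlocks as soon as the ranks disagree, and the cure is to agree on the decision collectively before branching.

// Hypothetical sketch, not the actual GROMACS patch: a collective guarded by
// a condition computed from per-rank state hangs once ranks disagree.
#include <mpi.h>

// Buggy pattern: 'useAffinityLocal' comes from the locally detected affinity
// mask, which on this platform can differ between ranks.
void buggy(MPI_Comm comm, bool useAffinityLocal, int* data)
{
    if (useAffinityLocal) // true on one rank, false on another -> deadlock
    {
        MPI_Bcast(data, 1, MPI_INT, 0, comm);
    }
}

// Fixed pattern: first make the decision collectively, so every rank takes
// the same branch before any further communication.
void fixed(MPI_Comm comm, bool useAffinityLocal, int* data)
{
    int local  = useAffinityLocal ? 1 : 0;
    int global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_LAND, comm);
    if (global) // now identical on all ranks
    {
        MPI_Bcast(data, 1, MPI_INT, 0, comm);
    }
}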

History

#1 Updated by Paul Bauer 4 months ago

  • Related to Task #2986: Post submit failing in two configurations added

#2 Updated by Szilárd Páll 4 months ago

Reproduced. The issue seems to be that some tMPI ranks are stuck in the MPI_Scan() in analyzeThreadsOnThisNode(), while another is already in the next bcast collective:

Here's the backtrace.

$ gdb --args gmx mdrun -ntmpi 2 -ntomp 1 -nb cpu -notunepme -s topol 
[...]
(gdb) bt 
#0  tMPI_Event_wait (ev=0x398c8) at /nethome/pszilard/gromacs-master/src/external/thread_mpi/src/event.cpp:71
#1  0xb6cc9bdc in tMPI_Wait_for_others (cev=0x36fd4, myrank=0) at /nethome/pszilard/gromacs-master/src/external/thread_mpi/src/collective.cpp:522
#2  0xb6cc8ebc in tMPI_Bcast (buffer=0xbeffe9a0, count=8, datatype=0xb6fcbb24 <tmpi_byte>, root=0, comm=0x396b8)
    at /nethome/pszilard/gromacs-master/src/external/thread_mpi/src/bcast.cpp:98
#3  0xb6ae3b5a in gmx_bcast_sim (nbytes=8, b=0xbeffe9a0, cr=0x5ad90) at /nethome/pszilard/gromacs-master/src/gromacs/gmxlib/network.cpp:286
#4  0xb6bfcdb2 in gmx::Mdrunner::mdrunner (this=0xbeffede8) at /nethome/pszilard/gromacs-master/src/gromacs/mdrun/runner.cpp:1225
#5  0x00014e4a in gmx::gmx_mdrun (argc=12, argv=0xbefff608) at /nethome/pszilard/gromacs-master/src/programs/mdrun/mdrun.cpp:276
#6  0xb671bd98 in gmx::(anonymous namespace)::CMainCommandLineModule::run (this=0x2e148, argc=12, argv=0xbefff608)
    at /nethome/pszilard/gromacs-master/src/gromacs/commandline/cmdlinemodulemanager.cpp:133
#7  0xb671d034 in gmx::CommandLineModuleManager::run (this=0xbefff48c, argc=12, argv=0xbefff608)
    at /nethome/pszilard/gromacs-master/src/gromacs/commandline/cmdlinemodulemanager.cpp:589
#8  0x00012b7c in main (argc=13, argv=0xbefff604) at /nethome/pszilard/gromacs-master/src/programs/gmx.cpp:60

A simple explanation could be that the refactoring broke things: the hw_opt.thread_affinity != threadaffOFF check evaluates differently on the two ranks (affinity setting gets turned off because, for some reason, the mask on this slave thread is 0x1 by default).

I can't dig deeper at the moment, but if anybody wants to debug, here is a binary you can grab on dev-jetson01: /home/pszilard/gromacs-master/build_gcc8/bin/gmx
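
For anyone who wants to poke at this, a minimal standalone program (an illustration, not GROMACS code) can show which affinity mask a thread actually sees on Linux; sched_getaffinity() with pid 0 reports the mask of the calling thread:

#include <sched.h>   // sched_getaffinity(), CPU_* macros (glibc extension)
#include <cstdio>

int main()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread"; running this from each rank's thread
    // would show whether the detected masks differ (e.g. 0x1 on the slave).
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
    {
        std::printf("CPUs in mask: %d, CPU 0 in mask: %d\n",
                    CPU_COUNT(&mask), CPU_ISSET(0, &mask) ? 1 : 0);
    }
    return 0;
}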

#3 Updated by Szilárd Páll 4 months ago

A few more tests:

$ gmx mdrun -ntmpi 2 -ntomp 1 -nb cpu -notunepme -s topol -pin off
$ gmx mdrun -ntmpi 2 -ntomp 1 -nb cpu -notunepme -s topol -pin on

Both work to some extent (the runs later crash due to an unrelated issue).

The latter actually fails to set affinities, but this may be a peculiarity of the ARM board.

This seems to support my hypothesis above that the thread-affinity check does not evaluate to the same value on all ranks.

#4 Updated by Mark Abraham 4 months ago

OK, I'll look over the code again on the flight.

#5 Updated by Mark Abraham 4 months ago

I didn't spot anything that suggested the code is a problem :-( I suggest we revert my change (on Monday!)

#6 Updated by Szilárd Páll 4 months ago

Mark Abraham wrote:

I didn't spot anything that suggested the code is a problem :-( I suggest we revert my change (on Monday!)

The issue was indeed the conditional on line 1194 that I pointed at. After the refactoring, that conditional communication became a path that was triggered with thread-MPI, and since the kernel on the Jetson TK1 seems to change the system-wide affinity mask based on load, different ranks detected different masks and therefore different values of hw_opt.thread_affinity.
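
A sketch of that failure path (the names and the detection rule here are assumptions for illustration, not the exact GROMACS code):

enum ThreadAffinity { threadaffAUTO, threadaffON, threadaffOFF };

// Each thread-MPI rank detects the affinity mask independently; on the TK1
// the kernel can change the system-wide mask between the two detections.
ThreadAffinity detectAffinitySetting(int cpusInDetectedMask, int hardwareCpus)
{
    // Hypothetical rule: an externally restricted mask disables auto-pinning.
    return (cpusInDetectedMask < hardwareCpus) ? threadaffOFF : threadaffAUTO;
}

// If rank 0 sees the full mask (threadaffAUTO) while rank 1 sees 0x1
// (threadaffOFF), a guard of the form
//     if (hw_opt.thread_affinity != threadaffOFF) { gmx_bcast_sim(...); }
// is entered only on rank 0, which then blocks forever in tMPI_Bcast().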

#7 Updated by Berk Hess 4 months ago

  • Related to Task #2992: Split hw_opt in const user options and dynamic settings added

#8 Updated by Berk Hess 4 months ago

  • Status changed from New to Resolved

#9 Updated by Mark Abraham 4 months ago

  • Status changed from Resolved to Closed
