Bug #2131

mdrun hangs upon "-nsteps " or "-maxh" trigger with more than 20 MPI processes

Added by Jan-Philipp Machtens almost 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info: GROMACS 2016.2
Affected version:
Difficulty: uncategorized

Description

Dear all,
In standard MD simulations with 20 or more MPI processes in total, mdrun (GROMACS 2016.2) hangs when either "-nsteps" or "-maxh" should trigger termination.
I tested extensively on CPU-only nodes (2x E5-2680 v3 each):
(1) mdrun using thread-MPI on a single node
(2) mdrun compiled against an up-to-date ParaStation MPI, across 1, 2, 4, or 5 nodes

Summary:
mdrun did not terminate upon a -nsteps/-maxh trigger whenever the total number of thread-MPI threads or MPI processes across all nodes was equal to or larger than 20, irrespective of the number of OpenMP threads per MPI process.
I tested the GROMACS 5.1.x versions, GROMACS 2016, and GROMACS 2016.1, and the issue appears to be specific to GROMACS 2016.2.

When GROMACS hangs, the output looks like this:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
GROMACS:      gmx mdrun, version 2016.2
Executable:   /home/XXX/XXX/bin/gromacs-2016.2-threadMPI/bin/gmx
Data prefix:  /home/XXX/XXX/bin/gromacs-2016.2-threadMPI
Working dir:  /work/XXX/XXX/test-norestraint
Command line:
  gmx mdrun -nsteps 300 -ntomp 2
Running on 1 node with total 24 cores, 48 logical cores
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  Hardware topology: Basic
Reading file topol.tpr, VERSION 5.1.4 (single precision)
Note: file tpx version 103, software tpx version 110
Overriding nsteps with value passed on the command line: 300 steps, 1.2 ps
Will use 16 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 24 MPI threads
Using 2 OpenMP threads per tMPI thread
starting mdrun 'Protein'
300 steps,      1.2 ps.
step 40 Turning on dynamic load balancing, because the performance loss due to load imbalance is 13.0 %.
Writing final coordinates.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Does anyone know a solution?
Many thanks in advance!!!

Jan-Philipp

==========================================================
Dr. Jan-Philipp Machtens
Computational Neurophysiology Group
Institute of Complex Systems - Zelluläre Biophysik (ICS-4)
Forschungszentrum Jülich, Germany
==========================================================


Related issues

Related to GROMACS - Task #2134: assess whether Jenkins is testing multi-rank runs appropriately (Closed)
Related to GROMACS - Task #1781: re-design benchmarking functionality (Accepted)
Related to GROMACS - Bug #2041: mdrun -resetstep can finish too early (Closed)

Associated revisions

Revision 66ec44e6 (diff)
Added by Szilárd Páll over 2 years ago

Fix mdrun hanging upon exit with sep PME ranks

Commit 1d2d95e introduced a check and early return to skip printing perf
stats when no valid wallcycle data was collected (due to a missed reset).
However, because the validity of the wallcycle data is not checked or
recorded on separate PME ranks, mdrun deadlocks before exit in a
collective communication that the PME ranks never enter.

This change fixes the hang by refactoring the printing code to use a
boolean rather than an early return. The normal code path is
unaffected in all cases (only the simulation master can ever write
reports), and the case where it is invalid to write a report
(premature termination) works correctly because all ranks communicate
the data for the report that is never written (efficiency is not a
concern in this case).

Fixes #2131

Change-Id: If8b0813444d0b00a1a9a4a21d30fc8655c52752a
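The deadlock pattern this commit describes can be sketched with plain Python threads standing in for MPI ranks (a hypothetical illustration; `rank_buggy`, `rank_fixed`, and the barrier are not GROMACS code, and `threading.Barrier` stands in for the collective MPI communication):

```python
import threading

N_RANKS = 4                              # stands in for the MPI rank count
barrier = threading.Barrier(N_RANKS)     # stands in for a collective MPI call
report = []

def rank_buggy(rank, have_valid_data):
    # Buggy pattern (pre-fix): an early return skips the collective,
    # so any rank that did reach the barrier blocks forever.
    if not have_valid_data:
        return
    barrier.wait()

def rank_fixed(rank, have_valid_data):
    # Fixed pattern: every rank always enters the collective, and a
    # boolean decides afterwards whether the report is written.
    barrier.wait()
    if rank == 0 and have_valid_data:
        report.append("perf report")

threads = [threading.Thread(target=rank_fixed, args=(r, True))
           for r in range(N_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()                             # all ranks complete; no deadlock
```

Running `rank_buggy` with `have_valid_data=False` on even one rank would leave the remaining threads blocked in `barrier.wait()` forever, which is the hang the reporter observed.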

History

#1 Updated by Jan-Philipp Machtens almost 3 years ago

The title should be "mdrun hangs upon "-nsteps " or "-maxh" trigger with 20 or more MPI processes"

#2 Updated by Gerrit Code Review Bot almost 3 years ago

Gerrit received a related patchset '1' for Issue #2131.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2016~If8b0813444d0b00a1a9a4a21d30fc8655c52752a
Gerrit URL: https://gerrit.gromacs.org/6499

#3 Updated by Szilárd Páll almost 3 years ago

  • Status changed from New to Accepted

Confirmed. Thanks for the report!

Unfortunately, 2016.2 has a bug that is triggered whenever separate PME ranks are used. A fix is being worked on; until then, use -npme 0 if the performance loss is acceptable, or revert to 2016.1 if you need to do parallel runs.
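Applied to the command line from the log in the report, the suggested workaround disables separate PME ranks (the `-npme` flag is a standard `gmx mdrun` option; the other flags mirror the reporter's invocation):

```
# Workaround until the fix is released: run without separate PME ranks
gmx mdrun -nsteps 300 -ntomp 2 -npme 0
```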

#4 Updated by Szilárd Páll almost 3 years ago

  • Status changed from Accepted to In Progress

#5 Updated by Mark Abraham over 2 years ago

  • Related to Task #2134: assess whether Jenkins is testing multi-rank runs appropriately added

#6 Updated by Mark Abraham over 2 years ago

  • Related to Task #1781: re-design benchmarking functionality added

#7 Updated by Mark Abraham over 2 years ago

  • Description updated (diff)

#8 Updated by Szilárd Páll over 2 years ago

  • Target version set to 2016.3

#9 Updated by Mark Abraham over 2 years ago

  • Related to Bug #2041: mdrun -resetstep can finish too early added

#10 Updated by Mark Abraham over 2 years ago

  • Status changed from In Progress to Resolved

#12 Updated by Mark Abraham over 2 years ago

  • Status changed from Resolved to Closed
