Bug #3165

task assignment silent abort

Added by Szilárd Páll over 1 year ago.

When testing the GPU PP-PME communication changes I came across an issue with task assignment silently aborting (or at least failing to emit a meaningful error message).

$ GMX_GPU_PME_PP_COMMS=1 /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/bin/gmx mdrun -ntmpi 3 -ntomp 2 -notunepme -gputasks 001 -nb gpu -pme gpu

         :-) GROMACS - gmx mdrun, 2020-beta1-dev-20191018-98a955a (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov      Paul Bauer     Herman J.C. Berendsen
    Par Bjelkmar      Christian Blau   Viacheslav Bolnykh     Kevin Boyd    
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra       Alan Gray     
  Gerrit Groenhof     Anca Hamuraru    Vincent Hindriksen  M. Eric Irrgang  
  Aleksei Iupinov   Christoph Junghans     Joe Jordan     Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson    
  Justin A. Lemkul    Viveca Lindahl    Magnus Lundborg     Erik Marklund   
    Pascal Merz     Pieter Meulenhoff    Teemu Murtola       Szilard Pall   
    Sander Pronk      Roland Schulz      Michael Shirts    Alexey Shvetsov  
   Alfons Sijbers     Peter Tieleman      Jon Vincent      Teemu Virolainen 
 Christian Wennberg    Maarten Wolf   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2018, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2020-beta1-dev-20191018-98a955a
Executable:   /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/bin/gmx
Data prefix:  /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs (source tree)
Working dir:  /mnt/workspace/Matrix_OnDemand/e50f0390/regressiontests/complex/nbnxn_pme
Command line:
  gmx mdrun -ntmpi 3 -ntomp 2 -notunepme -gputasks 001 -nb gpu -pme gpu

Back Off! I just backed up md.log to ./#md.log.6#
Compiled SIMD: None, but for this host/run AVX2_256 might be better (see log).
The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmx mdrun with the GMX_USE_RDTSCP=ON CMake option.
Reading file topol.tpr, VERSION 2020-beta1-dev-20191018-98a955a (single precision)

NOTE: This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.
Changing nstlist from 10 to 100, rlist from 0.9 to 0.999

The stack trace shows that it is clearly exiting on a fatal error, but it is not clear why the message is missing:

(gdb) bt 
#0  __GI__exit (status=1) at ../sysdeps/unix/sysv/linux/_exit.c:28
#1  0x00007ffff47d1316 in gmx_exit_on_fatal_error (exitType=ExitType_Abort, returnValue=1)
    at /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/src/gromacs/utility/fatalerror.cpp:207
#2  0x00007ffff4f58792 in gmx::GpuTaskAssignmentsBuilder::build (this=0x7fffe70b3fef, gpuIdsToUse=..., userGpuTaskAssignment=..., 
    hardwareInfo=..., cr=0x7fffd8007f40, ms=0x0, physicalNodeComm=..., nonbondedTarget=gmx::Gpu, pmeTarget=gmx::Gpu, 
    bondedTarget=gmx::Auto, updateTarget=gmx::Auto, useGpuForNonbonded=true, useGpuForPme=true, rankHasPpTask=true, rankHasPmeTask=true)
    at /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/src/gromacs/taskassignment/taskassignment.cpp:317
#3  0x00007ffff4eefc36 in gmx::Mdrunner::mdrunner (this=0x7fffe70b4c40)
    at /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/src/gromacs/mdrun/runner.cpp:1119
#4  0x00007ffff4eecf15 in gmx::mdrunner_start_fn (arg=0x7fffffffd6a0)
    at /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/src/gromacs/mdrun/runner.cpp:323
#5  0x00007ffff50f65a7 in tMPI_Thread_starter (arg=0x6fcdf0)
    at /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/src/external/thread_mpi/src/tmpi_init.cpp:399
#6  0x00007ffff50ed93b in tMPI_Thread_starter (arg=0xccbbe0)
    at /home/jenkins/workspace/Matrix_OnDemand/e50f0390/gromacs/src/external/thread_mpi/src/pthreads.cpp:235
#7  0x00007ffff39a1184 in start_thread (arg=0x7fffe70b5700) at pthread_create.c:312
#8  0x00007ffff2e3603d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Related issues

Related to GROMACS - Feature #2891: PME/PP GPU communications (In Progress)


#1 Updated by Szilárd Páll over 1 year ago
