Bug #3240

segv with GPU DD direct communication with GPU update and -dlb off

Added by Szilárd Páll 7 months ago. Updated 6 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Affected version - extra info: 2020-beta3-dev-20191212-2c760d2 + GPU update limitations removed
Affected version:
Difficulty: uncategorized

Description

$ GMX_GPU_DD_COMMS=1 GMX_USE_GPU_BUFFER_OPS=1 GMX_GPU_PME_PP_COMMS=1  $gmx mdrun -ntmpi 4 -npme 1 $opts -nb gpu -pme gpu -bonded gpu -nsteps 20000 -notunepme -update gpu -dlb no
       :-) GROMACS - gmx mdrun, 2020-beta3-dev-20191212-2c760d2-dirty (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov      Paul Bauer     Herman J.C. Berendsen
    Par Bjelkmar      Christian Blau   Viacheslav Bolnykh     Kevin Boyd    
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra       Alan Gray     
  Gerrit Groenhof     Anca Hamuraru    Vincent Hindriksen  M. Eric Irrgang  
  Aleksei Iupinov   Christoph Junghans     Joe Jordan     Dimitrios Karkoulis
    Peter Kasson        Jiri Kraus      Carsten Kutzner      Per Larsson    
  Justin A. Lemkul    Viveca Lindahl    Magnus Lundborg     Erik Marklund   
    Pascal Merz     Pieter Meulenhoff    Teemu Murtola       Szilard Pall   
    Sander Pronk      Roland Schulz      Michael Shirts    Alexey Shvetsov  
   Alfons Sijbers     Peter Tieleman      Jon Vincent      Teemu Virolainen 
 Christian Wennberg    Maarten Wolf      Artem Zhmurov   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2019, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, version 2020-beta3-dev-20191212-2c760d2-dirty
Executable:   /nethome/pszilard-projects/gromacs/gromacs-20/build_AVX2_256_gcc8_cuda10.1/bin/gmx
Data prefix:  /nethome/pszilard/projects/gromacs/gromacs-20 (source tree)
Working dir:  /nethome/pszilard-projects/gromacs/testing/ion_channel/tmp
Command line:
  gmx mdrun -ntmpi 4 -npme 1 -v -resethway -noconfout -pin on -nb gpu -pme gpu -bonded gpu -nsteps 20000 -notunepme -update gpu -dlb no

Back Off! I just backed up md.log to ./#md.log.62#
Compiled SIMD: AVX2_256, but for this host/run AVX_512 might be better (see
log).
Reading file topol.tpr, VERSION 5.1-dev-20150218-4c60631 (single precision)
Note: file tpx version 100, software tpx version 119
NOTE: This run uses the 'GPU buffer ops' feature, enabled by the GMX_USE_GPU_BUFFER_OPS environment variable.

NOTE: This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

NOTE: This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.

Overriding nsteps with value passed on the command line: 20000 steps, 40 ps
Changing nstlist from 10 to 100, rlist from 1 to 1.132

On host skylake-x-gpu01 2 GPUs selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
  PP:0,PP:0,PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PME tasks will do all aspects on the GPU
Using 4 MPI threads
Using 6 OpenMP threads per tMPI thread

WARNING: There are no atom pairs for dispersion correction

Back Off! I just backed up ener.edr to ./#ener.edr.78#
starting mdrun 'Protein'
20000 steps,     40.0 ps.
step 0imb F  5% pme/F 0.09 Segmentation fault (core dumped)

rf.tpr (4.14 MB) - Szilárd Páll, 01/15/2020 09:53 AM
md.log (25.3 KB) - Alan Gray, 01/15/2020 06:15 PM

History

#1 Updated by Alan Gray 7 months ago

I'm not able to reproduce. @Szilard, could you please post the .tpr file you are testing with, and confirm which commit you are using and on what GPUs? Thanks.

#2 Updated by Szilárd Páll 6 months ago

  • Affected version changed from 2019-beta3 to 2020

I lost track of what tpr I was using, but I've reproduced with another input, which I've uploaded now; here's the command:

GMX_GPU_DD_COMMS=1 gmx mdrun -v -resethway -noconfout -pin on -s ../rf.tpr -ntmpi 3 -gputasks 012 -nsteps 2000 -nstlist 100 -bonded cpu -ntomp 1 -nb gpu -update cpu
[...]
NOTE: This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

Back Off! I just backed up ener.edr to ./#ener.edr.23#
starting mdrun 'ethanol in water'
2000 steps,      4.0 ps.
step 0Floating point exception (core dumped)

Backtrace:

Thread 12 "gmx" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fffe5047700 (LWP 29636)]
0x0000000000d4e6ca in _mm_mul_ps(float __vector(4), float __vector(4)) (__B=..., __A=...)
    at /opt/tcbsys/gcc/8.1/lib/gcc/x86_64-linux-gnu/8.1.0/include/xmmintrin.h:198
198       return (__m128) ((__v4sf)__A * (__v4sf)__B);
(gdb) bt
#0  0x0000000000d4e6ca in _mm_mul_ps(float __vector(4), float __vector(4)) (__B=..., __A=...)
    at /opt/tcbsys/gcc/8.1/lib/gcc/x86_64-linux-gnu/8.1.0/include/xmmintrin.h:198
#1  gmx::operator* (a=..., b=...) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/simd/impl_x86_sse2/impl_x86_sse2_simd_float.h:204
#2  0x0000000000d4ef86 in gmx::rsqrtIter (lu=..., x=...) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/simd/simd_math.h:125
#3  0x0000000000d4f01c in gmx::invsqrt (x=...) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/simd/simd_math.h:149
#4  0x0000000000d56768 in (anonymous namespace)::angles<(BondedKernelFlavor)0> (nbonds=69888, forceatoms=0x7fffc0536b80,
    forceparams=0x7fffc0016730, x=0x7fffc157c000, f=0x7fffc1268b00, fshift=0x0, pbc=0x7fffe5042ba0, g=0x0, lambda=0, dvdlambda=0x7fffe50427d0,
    md=0x7fffc035d380, fcd=0x7fffc000c910, global_atom_index=0x7fffc0488100)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/listed_forces/bonded.cpp:1102
#5  0x0000000000d53cb2 in calculateSimpleBond (ftype=10, numForceatoms=69888, forceatoms=0x7fffc0536b80, forceparams=0x7fffc0016730,
    x=0x7fffc157c000, f=0x7fffc1268b00, fshift=0x0, pbc=0x7fffe5042ba0, g=0x0, lambda=0, dvdlambda=0x7fffe50427d0, md=0x7fffc035d380,
    fcd=0x7fffc000c910, global_atom_index=0x7fffc0488100, bondedKernelFlavor=BondedKernelFlavor::ForcesSimdWhenAvailable)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/listed_forces/bonded.cpp:4100
#6  0x000000000123c638 in (anonymous namespace)::calc_one_bond (thread=0, ftype=10, idef=0x7fffe5043c70, workDivision=..., x=0x7fffc157c000,
    f=0x7fffc1268b00, fshift=0x0, fr=0x7fffc000d310, pbc=0x7fffe5042ba0, g=0x0, grpp=0x7fffe50452b8, nrnb=0x7fffe50456c0, lambda=0x7fffc0364c28,
    dvdl=0x7fffe50427c0, md=0x7fffc035d380, fcd=0x7fffc000c910, stepWork=..., global_atom_index=0x7fffc0488100)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/listed_forces/listed_forces.cpp:358
#7  0x000000000123d7b0 in calcBondedForces(t_idef const*, float const (*) [3], t_forcerec const*, t_pbc const*, t_graph const*, float (*) [3], gmx_enerdata_t*, t_nrnb*, float const*, float*, t_mdatoms const*, t_fcdata*, gmx::StepWorkload const&, int*) [clone ._omp_fn.1] ()
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/listed_forces/listed_forces.cpp:440
#8  0x00007fffee7f7ecf in GOMP_parallel () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#9  0x000000000123c7d3 in calcBondedForces (idef=0x7fffe5043c70, x=0x7fffc157c000, fr=0x7fffc000d310, pbc_null=0x7fffe5042ba0, g=0x0,
    fshiftMasterBuffer=0x0, enerd=0x7fffe5045140, nrnb=0x7fffe50456c0, lambda=0x7fffc0364c28, dvdl=0x7fffe50427c0, md=0x7fffc035d380,
    fcd=0x7fffc000c910, stepWork=..., global_atom_index=0x7fffc0488100)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/listed_forces/listed_forces.cpp:401
#10 0x000000000123cb4d in calc_listed (cr=0x7fffc0008460, ms=0x0, wcycle=0x7fffc0009f30, idef=0x7fffe5043c70, x=0x7fffc157c000, hist=
    0x7fffc0364ec0, forceOutputs=0x7fffe5042fc0, fr=0x7fffc000d310, pbc=0x7fffe5042ba0, pbc_full=0x7fffe50428d0, g=0x0, enerd=0x7fffe5045140,
    nrnb=0x7fffe50456c0, lambda=0x7fffc0364c28, md=0x7fffc035d380, fcd=0x7fffc000c910, global_atom_index=0x7fffc0488100, stepWork=...)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/listed_forces/listed_forces.cpp:548
#11 0x000000000123d001 in do_force_listed (wcycle=0x7fffc0009f30, box=0x7fffc0364c44, fepvals=0x7fffc00082f0, cr=0x7fffc0008460, ms=0x0,
    idef=0x7fffe5043c70, x=0x7fffc157c000, hist=0x7fffc0364ec0, forceOutputs=0x7fffe5042fc0, fr=0x7fffc000d310, pbc=0x7fffe5042ba0, graph=0x0,
    enerd=0x7fffe5045140, nrnb=0x7fffe50456c0, lambda=0x7fffc0364c28, md=0x7fffc035d380, fcd=0x7fffc000c910, global_atom_index=0x7fffc0488100,
    stepWork=...) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/listed_forces/listed_forces.cpp:674
#12 0x00000000012611ad in do_force_lowlevel (fr=0x7fffc000d310, ir=0x7fffe5045b20, idef=0x7fffe5043c70, cr=0x7fffc0008460, ms=0x0,
    nrnb=0x7fffe50456c0, wcycle=0x7fffc0009f30, md=0x7fffc035d380, coordinates=..., hist=0x7fffc0364ec0, forceOutputs=0x7fffe5042fc0,
    enerd=0x7fffe5045140, fcd=0x7fffc000c910, box=0x7fffc0364c44, lambda=0x7fffc0364c28, graph=0x0, mu_tot=0x7fffc000d388, stepWork=...,
    ddBalanceRegionHandler=...) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdlib/force.cpp:183
#13 0x0000000001237466 in do_force (fplog=0x0, cr=0x7fffc0008460, ms=0x0, inputrec=0x7fffe5045b20, awh=0x0, enforcedRotation=0x0,
    imdSession=0x7fffc035dc30, pull_work=0x0, step=1, nrnb=0x7fffe50456c0, wcycle=0x7fffc0009f30, top=0x7fffe5043c70, box=0x7fffc0364c44, x=...,
    hist=0x7fffc0364ec0, force=..., vir_force=0x7fffe50446d0, mdatoms=0x7fffc035d380, enerd=0x7fffe5045140, fcd=0x7fffc000c910, lambda=...,
    graph=0x0, fr=0x7fffc000d310, runScheduleWork=0x7fffe5045600, vsite=0x0, mu_tot=0x7fffe5044604, t=0.002, ed=0x0, legacyFlags=209,
    ddBalanceRegionHandler=...) at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdlib/sim_util.cpp:1519
#14 0x000000000118f5a6 in gmx::LegacySimulator::do_md (this=0x7fffc0361220)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdrun/md.cpp:942
#15 0x000000000118ac37 in gmx::LegacySimulator::run (this=0x7fffc0361220)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdrun/legacysimulator.cpp:73
#16 0x0000000000db9aae in gmx::Mdrunner::mdrunner (this=0x7fffe5046720)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdrun/runner.cpp:1616
#17 0x0000000000db4ce4 in gmx::mdrunner_start_fn (arg=0x7fffffffb860)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/gromacs/mdrun/runner.cpp:374
#18 0x0000000000f7ca1e in tMPI_Thread_starter (arg=0x2f543d0)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/external/thread_mpi/src/tmpi_init.cpp:399
#19 0x0000000000f76136 in tMPI_Thread_starter (arg=0x2ee63b0)
    at /nethome/pszilard/projects/gromacs/gromacs-20/src/external/thread_mpi/src/pthreads.cpp:235
#20 0x00007ffff7bbd6db in start_thread (arg=0x7fffe5047700) at pthread_create.c:463
#21 0x00007fffedbdb88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Without direct communication the same run works fine.
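
That is, presumably the same command with the GMX_GPU_DD_COMMS override left unset:

gmx mdrun -v -resethway -noconfout -pin on -s ../rf.tpr -ntmpi 3 -gputasks 012 -nsteps 2000 -nstlist 100 -bonded cpu -ntomp 1 -nb gpu -update cpu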

#3 Updated by Alan Gray 6 months ago

Looks like the tpr has not been uploaded properly (I can't see it); can you post it again? Thanks.

#4 Updated by Szilárd Páll 6 months ago

#5 Updated by Szilárd Páll 6 months ago

Alan Gray wrote:

Looks like the tpr has not been uploaded properly (I can't see it); can you post it again? Thanks.

Sorry about that, I've uploaded it now.

#6 Updated by Alan Gray 6 months ago

I still can't reproduce; it runs fine for me using the latest release-branch commit. Can you please let me know which commit you are running, and on what hardware? From the log and backtrace it looks like the code is falling over in the step 0 CPU bonded force code, so it's hard to see why the GPU communication would be the culprit (unless we have a subtle problem in the initialization). [Also, in reference to the original post of this issue, your latest mdrun command does not include "-dlb no", so I'm not sure whether this is the same issue as before.]

#7 Updated by Szilárd Páll 6 months ago

Alan Gray wrote:

I still can't reproduce; it runs fine for me using the latest release-branch commit. Can you please let me know which commit you are running,

Latest release-2020 at the time of testing, e05cc33ecfd0c65bb11da87808485a5c83741270.

and on what hardware?

I don't think it matters much; I reproduced it on multiple of our dev machines with GV100 and GP100 cards.

From the log and backtrace it looks like the code is falling over in the step 0 CPU bonded force code, so it's hard to see why the GPU communication would be the culprit (unless we have a subtle problem in the initialization).

The bondeds run on the CPU, and because these are not split, the task always depends on the nonlocal coordinates.
The FPE is due to zero coordinates in the invsqrt() of the angles kernel, which suggests that something is not right with the coordinates.
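
For illustration, a minimal stand-alone sketch of that failure mode (illustrative only, not GROMACS code; it just mimics the approximate-rsqrt plus Newton-Raphson refinement that the backtrace points at, and assumes floating-point exceptions are trapped, as the SIGFPE shows they are here):

// fpe_sketch.cpp: a zero squared distance makes the approximate rsqrt
// return +inf; the Newton-Raphson refinement then multiplies inf by 0,
// raising FE_INVALID, which becomes SIGFPE when FP exceptions are trapped.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1   // for feenableexcept() (glibc)
#endif
#include <fenv.h>
#include <xmmintrin.h>

int main()
{
    feenableexcept(FE_INVALID | FE_DIVBYZERO); // trap invalid operations, as a debug build would

    __m128 r2 = _mm_setzero_ps();              // squared distance between identical/zeroed coordinates
    __m128 lu = _mm_rsqrt_ps(r2);              // approximate 1/sqrt(0) -> +inf, no trap yet

    // One Newton-Raphson refinement step, lu' = 0.5 * lu * (3 - lu*lu*r2):
    __m128 luluR2  = _mm_mul_ps(_mm_mul_ps(lu, lu), r2);  // inf * 0 -> NaN: FE_INVALID, SIGFPE fires here
    __m128 refined = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), lu),
                                _mm_sub_ps(_mm_set1_ps(3.0f), luluR2));
    (void)refined;
    return 0;
}

The point is only that any zero-length distance vector reaching the SIMD invsqrt path produces exactly this signature, so uninitialized or stale nonlocal coordinates would explain it.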

[Also, in reference to the original post of this issue, your latest mdrun command does not include "-dlb no", so I'm not sure whether this is the same issue as before.]

Indeed, but that was admittedly a different input. I still suspect it is the same error, though it does not have to be. I can also reproduce the issue regardless of -update cpu. I'll see if I can figure out what the other input was, but for now I suggest we focus on this issue (I can also upload a separate redmine if you prefer).

#8 Updated by Alan Gray 6 months ago

I reproduced it on multiple of our dev machines with GV100 and GP100 cards.

Weird; I'm using GV100 (on the same commit) and it runs fine. I've attached my md.log in case you can spot any obvious difference (this is from a Debug build, but I've also tried a Release build).

From the log and backtrace it looks like the code is falling over in the step 0 CPU bonded force code, so it's hard to see why the GPU communication would be the culprit (unless we have a subtle problem in the initialization).

The bondeds run on the CPU, and because these are not split, the task always depends on the nonlocal coordinates.
The FPE is due to zero coordinates in the invsqrt() of the angles kernel, which suggests that something is not right with the coordinates.

But step 0 is a search step, so the GPU halo exchange is not active there, and the coordinates (on GPU or CPU) shouldn't have been affected by it in any way until step 1.
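
For reference, a rough sketch of that scheduling argument (the helper names are hypothetical stand-ins, not the actual GROMACS code paths):

// Sketch of the step scheduling under discussion; the helper functions are
// illustrative stand-ins, not GROMACS code.
#include <cstdio>

static void exchangeNonlocalCoordinatesOnCpu() { std::puts("CPU/MPI halo exchange"); }
static void exchangeNonlocalCoordinatesOnGpu() { std::puts("direct GPU halo exchange"); }
static void computeBondedForcesOnCpu()         { std::puts("CPU bonded forces (read nonlocal x)"); }

static void doForceStep(long step, long nstlist, bool gpuHaloExchangeEnabled)
{
    const bool isSearchStep = (step % nstlist == 0); // step 0 is always a search step
    if (isSearchStep || !gpuHaloExchangeEnabled)
    {
        // On a search step the domain decomposition repartitions and the
        // nonlocal coordinates arrive through the ordinary CPU/MPI path.
        exchangeNonlocalCoordinatesOnCpu();
    }
    else
    {
        // Only on non-search steps does the direct GPU halo exchange
        // deliver the nonlocal coordinates.
        exchangeNonlocalCoordinatesOnGpu();
    }
    // The angles kernel that hit the FPE runs here and depends on the
    // nonlocal coordinates delivered above.
    computeBondedForcesOnCpu();
}

int main()
{
    doForceStep(0, 100, true); // search step: GPU halo exchange not involved yet
    doForceStep(1, 100, true); // first step where the GPU halo exchange is used
    return 0;
}

If that picture holds, a crash already at step 0 would have to come from the search-step path or from initialization rather than from the GPU exchange itself.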

for now I suggest we focus on this issue (I can also upload a separate redmine if you prefer).

Sure, no need for a separate redmine.
