Bug #2689

gpu timer state not cleaned up

Added by Mark Abraham 6 months ago. Updated 2 months ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: core library
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

In the API MPI tests, on the CUDA+OpenCL config on bs_1310, the following intermittent failure can be seen when running with multiple ranks:

00:09:11.740 [==========] Running 7 tests from 4 test cases.
00:09:11.740 [----------] Global test environment set-up.
00:09:11.740 [----------] 3 tests from ApiRunner
00:09:11.740 [ RUN      ] ApiRunner.BasicMD
00:09:11.740 Checkpoint file is from part 5, new output files will be suffixed '.part0006'.
00:09:11.740 The current CPU can measure timings more accurately than the code in
00:09:11.740 gmxapi-mpi-test was configured to use. This might affect your simulation
00:09:11.740 speed as accurate timings are needed for load-balancing.
00:09:11.740 Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.
00:09:11.740 Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)
00:09:11.740 
00:09:11.740 Overriding nsteps with value passed on the command line: 10 steps, 0.01 ps
00:09:11.740 Changing nstlist from 10 to 100, rlist from 1 to 1
00:09:11.740 
00:09:11.740 
00:09:11.740 Using 2 MPI processes
00:09:11.740 Using 4 OpenMP threads per MPI process
00:09:11.740 
00:09:11.740 On host bs-nix1310 2 GPUs auto-selected for this run.
00:09:11.740 Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
00:09:11.740   PP:0,PP:1
00:09:11.740 starting mdrun 'spc-and-methanol'
00:09:11.740 80 steps,      0.1 ps (continuing from step 70,      0.1 ps).
00:09:11.740 
00:09:11.740 Writing final coordinates.
00:09:11.740 
00:09:11.740 NOTE: 45 % of the run time was spent communicating energies,
00:09:11.740       you might want to use the -gcom option of mdrun
00:09:11.740 
00:09:11.740                Core t (s)   Wall t (s)        (%)
00:09:11.740        Time:        0.175        0.022      795.8
00:09:11.740                  (ns/day)    (hour/ns)
00:09:11.740 Performance:       43.232        0.555
00:09:11.740 [       OK ] ApiRunner.BasicMD (1319 ms)
00:09:11.740 [ RUN      ] ApiRunner.Reinitialize
00:09:11.740 Checkpoint file is from part 6, new output files will be suffixed '.part0007'.
00:09:11.740 The current CPU can measure timings more accurately than the code in
00:09:11.740 gmxapi-mpi-test was configured to use. This might affect your simulation
00:09:11.740 speed as accurate timings are needed for load-balancing.
00:09:11.740 Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.
00:09:11.740 Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)
00:09:11.740 
00:09:11.740 Overriding nsteps with value passed on the command line: 20 steps, 0.02 ps
00:09:11.740 Changing nstlist from 10 to 100, rlist from 1 to 1
00:09:11.740 
00:09:11.740 
00:09:11.740 Using 2 MPI processes
00:09:11.740 Using 4 OpenMP threads per MPI process
00:09:11.740 
00:09:11.740 On host bs-nix1310 2 GPUs auto-selected for this run.
00:09:11.740 Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
00:09:11.740   PP:0,PP:1
00:09:11.740 
00:09:11.740 Non-default thread affinity set probably by the OpenMP library,
00:09:11.740 disabling internal thread affinity
00:09:11.740 starting mdrun 'spc-and-methanol'
00:09:11.740 100 steps,      0.1 ps (continuing from step 80,      0.1 ps).
00:09:11.740 
00:09:11.740 
00:09:11.740 Received the remote INT/TERM signal, stopping within 100 steps
00:09:11.740 
00:09:11.740 
00:09:11.740 
00:09:11.740 Received the remote INT/TERM signal, stopping within 100 steps
00:09:11.740 
00:09:11.740 
00:09:11.740 NOTE: 47 % of the run time was spent communicating energies,
00:09:11.740       you might want to use the -gcom option of mdrun
00:09:11.740 
00:09:11.740                Core t (s)   Wall t (s)        (%)
00:09:11.740        Time:        0.144        0.018      796.5
00:09:11.740                  (ns/day)    (hour/ns)
00:09:11.740 Performance:        9.589        2.503
00:09:11.740 Checkpoint file is from part 7, new output files will be suffixed '.part0008'.
00:09:11.740 The current CPU can measure timings more accurately than the code in
00:09:11.740 gmxapi-mpi-test was configured to use. This might affect your simulation
00:09:11.740 speed as accurate timings are needed for load-balancing.
00:09:11.740 Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.
00:09:11.740 Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)
00:09:11.740 
00:09:11.740 Overriding nsteps with value passed on the command line: 20 steps, 0.02 ps
00:09:11.740 Changing nstlist from 10 to 100, rlist from 1 to 1
00:09:11.740 
00:09:11.740 
00:09:11.740 Using 2 MPI processes
00:09:11.740 Using 4 OpenMP threads per MPI process
00:09:11.740 
00:09:11.740 On host bs-nix1310 2 GPUs auto-selected for this run.
00:09:11.740 Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
00:09:11.740   PP:0,PP:1
00:09:11.740 
00:09:11.740 Non-default thread affinity set probably by the OpenMP library,
00:09:11.740 disabling internal thread affinity
00:09:11.740 starting mdrun 'spc-and-methanol'
00:09:11.740 101 steps,      0.1 ps (continuing from step 81,      0.1 ps).
00:09:11.740 
00:09:11.740 -------------------------------------------------------
00:09:11.740 Program:     gmxapi-mpi-test, version 2019-dev-20181010-8a92135
00:09:11.740 Source file: src/gromacs/gpu_utils/gpuregiontimer.h (line 94)
00:09:11.740 Function:    GpuRegionTimerWrapper<GpuRegionTimerImpl>::openTimingRegion(CommandStream)::<lambda()> [with GpuRegionTimerImpl = GpuRegionTimerImpl]
00:09:11.740 MPI rank:    1 (out of 2)
00:09:11.740 
00:09:11.740 Assertion failed:
00:09:11.740 Condition: debugState_ == TimerState::Idle
00:09:11.740 GPU timer should be idle, but is stopped.
00:09:11.740 
00:09:11.740 For more information and tips for troubleshooting, please check the GROMACS
00:09:11.740 website at http://www.gromacs.org/Documentation/Errors
00:09:11.740 -------------------------------------------------------
00:09:11.740 --------------------------------------------------------------------------
00:09:11.740 MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
00:09:11.740 with errorcode 1.
00:09:11.740 
00:09:11.740 NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
00:09:11.740 You may or may not see output from other processes, depending on
00:09:11.740 exactly when Open MPI kills them.
00:09:11.740 --------------------------------------------------------------------------
00:09:11.740 --------------------------------------------------------------------------
00:09:11.740 mpiexec has exited due to process rank 1 with PID 19321 on
00:09:11.740 node bs-nix1310 exiting improperly. There are two reasons this could occur:
00:09:11.740 
00:09:11.740 1. this process did not call "init" before exiting, but others in
00:09:11.740 the job did. This can cause a job to hang indefinitely while it waits
00:09:11.740 for all processes to call "init". By rule, if one process calls "init",
00:09:11.740 then ALL processes must call "init" prior to termination.
00:09:11.740 
00:09:11.740 2. this process called "init", but exited without calling "finalize".
00:09:11.740 By rule, all processes that call "init" MUST call "finalize" prior to
00:09:11.740 exiting or it will be considered an "abnormal termination" 
00:09:11.740 
00:09:11.740 This may have caused other processes in the application to be
00:09:11.740 terminated by signals sent by mpiexec (as reported here).
00:09:11.740 --------------------------------------------------------------------------

I have reproduced this with a manual build on that host. Sometimes BasicMD also fails.

This seems like a GPU infrastructure issue with multiple simulation parts within the same process, which is exposed by the API use case, but probably not caused by it. I'm not sure why our existing continuation test cases don't trigger it, but perhaps they are not run with multiple ranks per simulation.
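
For reference, here is a minimal sketch of the state machine that the failed condition implies. It assumes only what is visible in the stack trace and assertion message (openTimingRegion, debugState_, TimerState::Idle); the class body, the other method names, and main() are illustrative, not the actual GROMACS code. The point it shows: a region timer must be returned to Idle, by collecting its time or resetting it, before it can be opened again, so if the hand-over between simulation parts in the same process skips that step, the next part's first openTimingRegion() hits exactly this assertion.

// Minimal sketch of a region-timer state machine; names only mirror what is
// visible in the stack trace above, everything else is illustrative.
#include <cassert>

enum class TimerState { Idle, Recording, Stopped };

class RegionTimerSketch
{
public:
    void openTimingRegion()
    {
        // The check that fires in the log above: the timer was left Stopped
        // because its time was never collected/reset after the previous part.
        assert(debugState_ == TimerState::Idle && "GPU timer should be idle");
        debugState_ = TimerState::Recording;
    }
    void closeTimingRegion()
    {
        assert(debugState_ == TimerState::Recording);
        debugState_ = TimerState::Stopped;
    }
    double getLastRangeTime()
    {
        assert(debugState_ == TimerState::Stopped);
        debugState_ = TimerState::Idle; // collecting the time re-arms the timer
        return 0.0;                     // placeholder; real code would query GPU events
    }
    void reset() { debugState_ = TimerState::Idle; }

private:
    TimerState debugState_ = TimerState::Idle;
};

int main()
{
    RegionTimerSketch timer;
    timer.openTimingRegion();
    timer.closeTimingRegion();
    // A new simulation part starting here without getLastRangeTime() or
    // reset() trips the "should be idle, but is stopped" assertion:
    timer.openTimingRegion();
}

If that is indeed what happens here, the fix would presumably be to ensure the GPU timers are reset (or the GPU task data rebuilt) whenever a new simulation part starts within the same process.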

Associated revisions

Revision 4266d9c5 (diff)
Added by Paul Bauer 4 months ago

Document OpenCL timer issue.

Refs #2689

Change-Id: I03964cc239aa423713793f5bd0e38d8afe19b0da

History

#2 Updated by Paul Bauer 4 months ago

Another occurrence here: http://jenkins.gromacs.org/job/Matrix_PreSubmit_2019/456/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20buildfftw%20host=bs_gpu01,label=bs_gpu01/console

Program:     gmx mdrun, version 2019-rc1-dev-20181218-9ff89c8
Source file: src/gromacs/gpu_utils/gpuregiontimer.h (line 94)
Function:    GpuRegionTimerWrapper<GpuRegionTimerImpl>::openTimingRegion(CommandStream) [with GpuRegionTimerImpl = GpuRegionTimerImpl; CommandStream = _cl_command_queue*]::<lambda()>
MPI rank:    1 (out of 2)

Assertion failed:
Condition: debugState_ == TimerState::Idle
GPU timer should be idle, but is stopped.

#3 Updated by Paul Bauer 4 months ago

I had a look through the code and can't say anything more concrete, but it looks to me like this is either a race condition or a case where the different PME calculations (Coulomb and LJ) clash.
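
A compact, self-contained illustration of that second hypothesis (all names illustrative; the enum and message only mirror the assertion in the log): if two timing regions, say the Coulomb and LJ PME work, reuse one timer and the second open arrives before the first region's time has been collected or reset, the same assertion fires.

// Illustrative only: two regions reusing one timer without collecting/resetting
// in between reproduces the "idle, but is stopped" condition from the log.
#include <cassert>

int main()
{
    enum class TimerState { Idle, Recording, Stopped };
    TimerState debugState = TimerState::Idle;

    // First region (e.g. Coulomb PME): open, then close -> Stopped
    assert(debugState == TimerState::Idle);
    debugState = TimerState::Recording;
    debugState = TimerState::Stopped;

    // Second region (e.g. LJ PME) opened without collecting or resetting first:
    assert(debugState == TimerState::Idle && "GPU timer should be idle, but is stopped");
    debugState = TimerState::Recording;
}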

#4 Updated by Gerrit Code Review Bot 4 months ago

Gerrit received a related patchset '1' for Issue #2689.
Uploader: Paul Bauer ()
Change-Id: gromacs~release-2019~I03964cc239aa423713793f5bd0e38d8afe19b0da
Gerrit URL: https://gerrit.gromacs.org/8869

#5 Updated by Paul Bauer 4 months ago

  • Target version changed from 2019 to 2019.1

I moved this to the first patch release for now, but I think it should stay open until the real cause has been found.

#6 Updated by Mark Abraham 3 months ago

We're not sure whether this still reproduces.

#7 Updated by Mark Abraham 2 months ago

  • Status changed from Accepted to Rejected

I haven't seen this reproduce.
