Bug #2689
GPU timer state not cleaned up
Description
In the API MPI tests, on the CUDA+OpenCL config on bs_1310, the following intermittent problem can be seen when running with multiple ranks:
[==========] Running 7 tests from 4 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from ApiRunner
[ RUN      ] ApiRunner.BasicMD
Checkpoint file is from part 5, new output files will be suffixed '.part0006'.
The current CPU can measure timings more accurately than the code in
gmxapi-mpi-test was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.
Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)

Overriding nsteps with value passed on the command line: 10 steps, 0.01 ps
Changing nstlist from 10 to 100, rlist from 1 to 1

Using 2 MPI processes
Using 4 OpenMP threads per MPI process

On host bs-nix1310 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1
starting mdrun 'spc-and-methanol'
80 steps, 0.1 ps (continuing from step 70, 0.1 ps).

Writing final coordinates.

NOTE: 45 % of the run time was spent communicating energies,
      you might want to use the -gcom option of mdrun

               Core t (s)   Wall t (s)        (%)
       Time:        0.175        0.022      795.8
                 (ns/day)    (hour/ns)
Performance:       43.232        0.555
[       OK ] ApiRunner.BasicMD (1319 ms)
[ RUN      ] ApiRunner.Reinitialize
Checkpoint file is from part 6, new output files will be suffixed '.part0007'.
The current CPU can measure timings more accurately than the code in
gmxapi-mpi-test was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.
Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)

Overriding nsteps with value passed on the command line: 20 steps, 0.02 ps
Changing nstlist from 10 to 100, rlist from 1 to 1

Using 2 MPI processes
Using 4 OpenMP threads per MPI process

On host bs-nix1310 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1

Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
starting mdrun 'spc-and-methanol'
100 steps, 0.1 ps (continuing from step 80, 0.1 ps).

Received the remote INT/TERM signal, stopping within 100 steps

Received the remote INT/TERM signal, stopping within 100 steps

NOTE: 47 % of the run time was spent communicating energies,
      you might want to use the -gcom option of mdrun

               Core t (s)   Wall t (s)        (%)
       Time:        0.144        0.018      796.5
                 (ns/day)    (hour/ns)
Performance:        9.589        2.503
Checkpoint file is from part 7, new output files will be suffixed '.part0008'.
The current CPU can measure timings more accurately than the code in
gmxapi-mpi-test was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmxapi-mpi-test with the GMX_USE_RDTSCP=ON CMake option.
Reading file /mnt/workspace/Matrix_PreSubmit_master/87cf13c6/gromacs/src/api/cpp/tests/topol.tpr, VERSION 2019-dev-20181010-8a92135 (single precision)

Overriding nsteps with value passed on the command line: 20 steps, 0.02 ps
Changing nstlist from 10 to 100, rlist from 1 to 1

Using 2 MPI processes
Using 4 OpenMP threads per MPI process

On host bs-nix1310 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1

Non-default thread affinity set probably by the OpenMP library,
disabling internal thread affinity
starting mdrun 'spc-and-methanol'
101 steps, 0.1 ps (continuing from step 81, 0.1 ps).

-------------------------------------------------------
Program:     gmxapi-mpi-test, version 2019-dev-20181010-8a92135
Source file: src/gromacs/gpu_utils/gpuregiontimer.h (line 94)
Function:    GpuRegionTimerWrapper<GpuRegionTimerImpl>::openTimingRegion(CommandStream)::<lambda()> [with GpuRegionTimerImpl = GpuRegionTimerImpl]
MPI rank:    1 (out of 2)

Assertion failed:
Condition: debugState_ == TimerState::Idle
GPU timer should be idle, but is stopped.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1 with PID 19321 on
node bs-nix1310 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
I have reproduced this with a manual build on that host. Sometimes BasicMD also fails.
This looks like a GPU infrastructure issue with running multiple simulation parts within the same process, exposed (but probably not caused) by the API use case. I'm not sure why our existing continuation test cases don't trigger it; perhaps they are not run with multiple ranks per simulation.
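The failing assertion guards a small per-timer state machine: opening a timing region requires the timer to be Idle, and only collecting/resetting the previous measurement returns it there. Below is a minimal sketch of that pattern; the class, method names, and use of an exception in place of GMX_ASSERT are illustrative assumptions, not the actual code in gpuregiontimer.h. It shows how a timer left in the Stopped state by an earlier simulation part would trip exactly this check when the next part starts:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Illustrative model only: the real class lives in
// src/gromacs/gpu_utils/gpuregiontimer.h and uses GMX_ASSERT rather
// than exceptions; names and structure here are simplified.
enum class TimerState
{
    Idle,
    Recording,
    Stopped
};

class RegionTimerModel
{
public:
    void openTimingRegion()
    {
        // The failing check from the log: opening a region requires the
        // timer to be Idle, i.e. fully reset since its last use.
        if (debugState_ != TimerState::Idle)
        {
            throw std::runtime_error("GPU timer should be idle, but is "
                                     + stateName(debugState_));
        }
        debugState_ = TimerState::Recording;
    }

    void closeTimingRegion() { debugState_ = TimerState::Stopped; }

    // Collecting the measured time returns the timer to Idle. If this
    // step is skipped between simulation parts, the timer is still
    // Stopped when the next part opens a region, tripping the check above.
    void reset() { debugState_ = TimerState::Idle; }

private:
    static std::string stateName(TimerState s)
    {
        switch (s)
        {
            case TimerState::Idle: return "idle";
            case TimerState::Recording: return "recording";
            default: return "stopped";
        }
    }

    TimerState debugState_ = TimerState::Idle;
};
```

Under this model, the fix would be to make sure every GPU timer is reset (or torn down) when a simulation part ends, so that a reinitialized run starts from Idle state.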
Associated revisions
History
#2 Updated by Paul Bauer about 2 years ago
Another occurrence here: http://jenkins.gromacs.org/job/Matrix_PreSubmit_2019/456/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20buildfftw%20host=bs_gpu01,label=bs_gpu01/console
Program:     gmx mdrun, version 2019-rc1-dev-20181218-9ff89c8
Source file: src/gromacs/gpu_utils/gpuregiontimer.h (line 94)
Function:    GpuRegionTimerWrapper<GpuRegionTimerImpl>::openTimingRegion(CommandStream) [with GpuRegionTimerImpl = GpuRegionTimerImpl; CommandStream = _cl_command_queue*]::<lambda()>
MPI rank:    1 (out of 2)

Assertion failed:
Condition: debugState_ == TimerState::Idle
GPU timer should be idle, but is stopped.
#3 Updated by Paul Bauer about 2 years ago
I had a look through the code and can't say anything more concrete, but it looks to me like this is either a race condition or a case where the different PME calculations (Coulomb and LJ) clash.
#4 Updated by Gerrit Code Review Bot about 2 years ago
Gerrit received a related patchset '1' for Issue #2689.
Uploader: Paul Bauer (paul.bauer.q@gmail.com)
Change-Id: gromacs~release-2019~I03964cc239aa423713793f5bd0e38d8afe19b0da
Gerrit URL: https://gerrit.gromacs.org/8869
#5 Updated by Paul Bauer about 2 years ago
- Target version changed from 2019 to 2019.1
I moved this to the first patch release for now, but I think it should stay open until the real cause has been found.
#6 Updated by Mark Abraham about 2 years ago
We're not sure whether this still reproduces.
#7 Updated by Mark Abraham about 2 years ago
- Status changed from Accepted to Rejected
I haven't seen this reproduce.
Document OpenCL timer issue.
Refs #2689
Change-Id: I03964cc239aa423713793f5bd0e38d8afe19b0da