Bug #1388
Data race in PME with large prime number of threads
Description
src/gromacs/mdlib/pme.c:3745 has a data race when running
mdrun -ntmpi 2 -ntomp 11
on regressiontests/complex/nbnxn_pme.
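For context, a minimal standalone C/OpenMP sketch (not the pme.c code; the range arithmetic is made up) of the pattern ThreadSanitizer reports in comment #1 below: two threads write the same element of a shared grid without synchronization because their index ranges overlap by one point, the kind of overlap a misaligned thread partitioning produces. Built with -fopenmp -fsanitize=thread, it triggers a report similar to the one in the history.

#include <omp.h>
#include <stdlib.h>

int main(void)
{
    float *grid = calloc(64, sizeof *grid);

    /* 11 threads, with ranges deliberately overlapping by one point to
     * mimic an overlap reduction with misaligned thread boundaries. */
#pragma omp parallel num_threads(11)
    {
        int t     = omp_get_thread_num();
        int start = t * 6;
        int end   = start + 7;        /* overlaps the next thread's start */

        for (int i = start; i < end && i < 64; i++)
        {
            grid[i] += 1.0f;          /* racy write where ranges overlap */
        }
    }

    free(grid);
    return 0;
}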
Related issues
Associated revisions
Fixed two PME issues with MPI+OpenMP
Change 272736bc partially fixed #1388, but broke the more general
case of multiple MPI communication pulses in PME: it incorrectly
changed tx1 and ty1. That change has been reverted.
Change 27189bba fixed the incorrect PME grid reduction with multiple
thread grid overlap in y, but broke the much more common case where
the y-size of the PME grid is not divisible by the number of domains
in y: it incorrectly changed buf_my.
Now buf_my is set to the correct value, which solves both issues.
Fixes #1578.
Refs #1388 and #1572.
Change-Id: Id2d7d013a3b8cdc04eda1fb026567088a38ec81f
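To illustrate the divisibility pitfall described above, a hedged sketch (local_ny and its arguments are hypothetical names, not the pme.c code): when the y-size of the grid is not divisible by the number of domains in y, a size computed by plain integer division undersizes the domains that hold an extra grid line; the size has to be derived from the same start/end partitioning the reduction itself uses.

static int local_ny(int ny, int nnodes_y, int node_y)
{
    /* End minus start of this domain's slab of whole grid lines */
    int y0 = (ny *  node_y     ) / nnodes_y;
    int y1 = (ny * (node_y + 1)) / nnodes_y;

    return y1 - y0;    /* correct even when ny % nnodes_y != 0 */
    /* ny / nnodes_y would truncate and undersize some domains */
}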
History
#1 Updated by Roland Schulz over 5 years ago
- Priority changed from Low to Normal
The problem is still present with latest master (e494403ead). The TSAN output is
WARNING: ThreadSanitizer: data race (pid=19968)
  Write of size 4 at 0x7da000015880 by thread T2:
    #0 reduce_threadgrid_overlap /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:4042 (libgromacs.so.0+0x000000e775e2)
    #1 gomp_thread_start ../../../libgomp/team.c:117 (libgomp.so.1+0x00000000c359)

  Previous write of size 4 at 0x7da000015880 by thread T1:
    #0 reduce_threadgrid_overlap /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:4042 (libgromacs.so.0+0x000000e775e2)
    #1 GOMP_parallel ../../../libgomp/parallel.c:167 (libgomp.so.1+0x0000000096ae)
    #2 gmx_pme_do /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:4845 (libgromacs.so.0+0x000000e856ec)
    #3 do_force_lowlevel /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/force.c:608 (libgromacs.so.0+0x000000ea1c87)
    #4 do_force_cutsVERLET /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/sim_util.c:1345 (libgromacs.so.0+0x000000e2b9d1)
    #5 do_force /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/sim_util.c:2063 (libgromacs.so.0+0x000000e2de8c)
    #6 do_md /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/md.c:1067 (gmx+0x00000001bc19)
    #7 mdrunner /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:1737 (gmx+0x0000000170df)
    #8 mdrunner_start_fn /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:180 (gmx+0x000000017a7d)
    #9 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/tmpi_init.c:397 (libgromacs.so.0+0x000000ec2b9d)
    #10 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/pthreads.c:231 (libgromacs.so.0+0x000000eb94d0)

  Location is heap block of size 15872 at 0x7da000014000 allocated by thread T1:
    #0 calloc ../../../../libsanitizer/tsan/tsan_interceptors.cc:499 (libtsan.so.0+0x0000000499a1)
    #1 save_calloc /mnt/workspace/roland-temp/gromacs/src/gromacs/utility/smalloc.c:175 (libgromacs.so.0+0x0000001ef2cc)
    #2 init_overlap_comm /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:3223 (libgromacs.so.0+0x000000e749b1)
    #3 gmx_pme_init /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:3563 (libgromacs.so.0+0x000000e81f7b)
    #4 mdrunner /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:1677 (gmx+0x000000016bff)
    #5 mdrunner_start_fn /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:180 (gmx+0x000000017a7d)
    #6 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/tmpi_init.c:397 (libgromacs.so.0+0x000000ec2b9d)
    #7 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/pthreads.c:231 (libgromacs.so.0+0x000000eb94d0)
#2 Updated by Roland Schulz over 5 years ago
- Target version set to 5.0
#3 Updated by Erik Lindahl over 5 years ago
- Target version changed from 5.0 to 5.x
It would be good if somebody has a chance to fix this before the release, but since we're a bit stretched for manpower with Berk on parental leave, I've altered the target version to clarify that it's not a showstopper.
#4 Updated by Gerrit Code Review Bot over 5 years ago
Gerrit received a related patchset '1' for Issue #1388.
Uploader: Berk Hess (hess@kth.se)
Change-Id: I22904c7f55d2e96fc4b8cd1498af2087eaed47ac
Gerrit URL: https://gerrit.gromacs.org/3584
#5 Updated by Roland Schulz over 5 years ago
- Status changed from New to Fix uploaded
- Assignee set to Berk Hess
#6 Updated by Berk Hess over 5 years ago
- Status changed from Fix uploaded to Resolved
- % Done changed from 0 to 100
Applied in changeset 272736bc4e81828bcd9d73e04e4e6e70a1712cf1.
#7 Updated by Erik Lindahl over 5 years ago
- Status changed from Resolved to Closed
#8 Updated by Teemu Murtola over 5 years ago
- Category set to mdrun
- Target version changed from 5.x to 4.6.x
#9 Updated by Gerrit Code Review Bot over 5 years ago
Gerrit received a related DRAFT patchset '1' for Issue #1388.
Uploader: Szilárd Páll (pall.szilard@gmail.com)
Change-Id: I3649294a143bb744a2e26fd1d9dbfb87dea421ca
Gerrit URL: https://gerrit.gromacs.org/3905
#11 Updated by Roland Schulz about 5 years ago
- Related to Bug #1572: Incorrect PME energies and forces with high numbers of OpenMP threads added
Fixed PME bug with #OpenMP-threads a large prime
With hybrid MPI+OpenMP parallelization and a rank-local FFT grid
size for a dimension that is not divisible by the number of OpenMP
threads in that dimension, a very small part of the FFT grid overlap
region could be set/added twice. This would only occur at low to
medium MPI parallelization with OpenMP thread counts that have large
prime factors, which in practice means almost never. Even when it
occurred, no actual differences in PME energies or forces were
observed.
This issue was due to a leftover from when space was divided
uniformly over the grids instead of assigning whole grid lines.
Fixes #1388.
Change-Id: I22904c7f55d2e96fc4b8cd1498af2087eaed47ac
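As a hedged illustration of the whole-grid-line assignment this message refers to (a standalone sketch with made-up sizes, not the actual pme.c partitioning): giving thread t the lines [n*t/nthread, n*(t+1)/nthread) covers every line exactly once even when n is not divisible by nthread, which is the invariant whose violation let some overlap points be set/added twice.

#include <assert.h>
#include <stdio.h>

int main(void)
{
    const int n = 25, nthread = 11;   /* line count not divisible by threads */
    int       count[25] = {0};

    for (int t = 0; t < nthread; t++)
    {
        int start = (n *  t     ) / nthread;   /* whole-line assignment */
        int end   = (n * (t + 1)) / nthread;

        for (int i = start; i < end; i++)
        {
            count[i]++;
        }
    }
    for (int i = 0; i < n; i++)
    {
        assert(count[i] == 1);   /* a count of 2 here would mean a race */
    }
    printf("all %d lines covered exactly once by %d threads\n", n, nthread);
    return 0;
}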