Project

General

Profile

Bug #1388

Data race in PME with large prime number of threads

Added by Roland Schulz over 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

src/gromacs/mdlib/pme.c:3745 has a data race for a
mdrun -ntmpi 2 -ntomp 11
for regressiontests/complex/nbnxn_pme


Related issues

Related to GROMACS - Bug #1572: Incorrect PME energies and forces with high numbers of OpenMP threads Closed 08/02/2014

Associated revisions

Revision 272736bc (diff)
Added by Berk Hess about 3 years ago

Fixed PME bug with #OpenMP-threads a large prime

With hybrid MPI+OpenMP parallelization and the rank local FFT grid
size for a dimension not divisible by the number of OpenMP threads
in that dimension, a very small amount of the FFT grid overlap part
could set/added twice. This would only occur at low-medium MPI
parallelization with OpenMP thread counts with large prime factors,
which is practice means almost never. Even when it occured, no actual
differences in PME energies or forces were observed.
This issue was due to a leftover from when space was uniformly
divided over the grids iso assigning whole grid lines.

Fixes #1388.

Change-Id: I22904c7f55d2e96fc4b8cd1498af2087eaed47ac

Revision 6ba80a26 (diff)
Added by Berk Hess almost 3 years ago

Fixed two PME issues with MPI+OpenMP

Change 272736bc partially fixed #1388, but broke the more general
case of multiple MPI communication pulses in PME. Change 272736bc
incorrectly changed tx1 and ty1. This change has been reverted.

Change 27189bba fixed the incorrect PME grid reduction with multiple
thread grid overlap in y. But it broke the, much more common, case
where the y-size of the PME grid is not divisible by the domains in y.
This change, incorrectly, changed buf_my.

Now buf_my is set to the correct value, which solves both issues.

Fixes #1578.
Refs #1388 and #1572.

Change-Id: Id2d7d013a3b8cdc04eda1fb026567088a38ec81f

History

#1 Updated by Roland Schulz about 3 years ago

  • Priority changed from Low to Normal

The problem is still present with latest master (e494403ead). The TSAN output is

WARNING: ThreadSanitizer: data race (pid=19968)
  Write of size 4 at 0x7da000015880 by thread T2:
    #0 reduce_threadgrid_overlap /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:4042 (libgromacs.so.0+0x000000e775e2)
    #1 gomp_thread_start ../../../libgomp/team.c:117 (libgomp.so.1+0x00000000c359)

  Previous write of size 4 at 0x7da000015880 by thread T1:
    #0 reduce_threadgrid_overlap /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:4042 (libgromacs.so.0+0x000000e775e2)
    #1 GOMP_parallel ../../../libgomp/parallel.c:167 (libgomp.so.1+0x0000000096ae)
    #2 gmx_pme_do /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:4845 (libgromacs.so.0+0x000000e856ec)
    #3 do_force_lowlevel /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/force.c:608 (libgromacs.so.0+0x000000ea1c87)
    #4 do_force_cutsVERLET /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/sim_util.c:1345 (libgromacs.so.0+0x000000e2b9d1)
    #5 do_force /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/sim_util.c:2063 (libgromacs.so.0+0x000000e2de8c)
    #6 do_md /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/md.c:1067 (gmx+0x00000001bc19)
    #7 mdrunner /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:1737 (gmx+0x0000000170df)
    #8 mdrunner_start_fn /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:180 (gmx+0x000000017a7d)
    #9 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/tmpi_init.c:397 (libgromacs.so.0+0x000000ec2b9d)
    #10 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/pthreads.c:231 (libgromacs.so.0+0x000000eb94d0)

  Location is heap block of size 15872 at 0x7da000014000 allocated by thread T1:
    #0 calloc ../../../../libsanitizer/tsan/tsan_interceptors.cc:499 (libtsan.so.0+0x0000000499a1)
    #1 save_calloc /mnt/workspace/roland-temp/gromacs/src/gromacs/utility/smalloc.c:175 (libgromacs.so.0+0x0000001ef2cc)
    #2 init_overlap_comm /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:3223 (libgromacs.so.0+0x000000e749b1)
    #3 gmx_pme_init /mnt/workspace/roland-temp/gromacs/src/gromacs/mdlib/pme.c:3563 (libgromacs.so.0+0x000000e81f7b)
    #4 mdrunner /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:1677 (gmx+0x000000016bff)
    #5 mdrunner_start_fn /mnt/workspace/roland-temp/gromacs/src/programs/mdrun/runner.c:180 (gmx+0x000000017a7d)
    #6 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/tmpi_init.c:397 (libgromacs.so.0+0x000000ec2b9d)
    #7 tMPI_Thread_starter /mnt/workspace/roland-temp/gromacs/src/external/thread_mpi/src/pthreads.c:231 (libgromacs.so.0+0x000000eb94d0)

#2 Updated by Roland Schulz about 3 years ago

  • Target version set to 5.0

#3 Updated by Erik Lindahl about 3 years ago

  • Target version changed from 5.0 to 5.x

It would be good if somebody has a chance to fix this before the release, but since we're a bit stretched for manpower with Berk on parental leave I've altered the target version to clarify that it's not a showstopper.

#4 Updated by Gerrit Code Review Bot about 3 years ago

Gerrit received a related patchset '1' for Issue #1388.
Uploader: Berk Hess ()
Change-Id: I22904c7f55d2e96fc4b8cd1498af2087eaed47ac
Gerrit URL: https://gerrit.gromacs.org/3584

#5 Updated by Roland Schulz about 3 years ago

  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess

#6 Updated by Berk Hess about 3 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#7 Updated by Erik Lindahl about 3 years ago

  • Status changed from Resolved to Closed

#8 Updated by Teemu Murtola almost 3 years ago

  • Category set to mdrun
  • Target version changed from 5.x to 4.6.x

#9 Updated by Gerrit Code Review Bot almost 3 years ago

Gerrit received a related DRAFT patchset '1' for Issue #1388.
Uploader: Szilárd Páll ()
Change-Id: I3649294a143bb744a2e26fd1d9dbfb87dea421ca
Gerrit URL: https://gerrit.gromacs.org/3905

#10 Updated by Gerrit Code Review Bot almost 3 years ago

Gerrit received a related DRAFT patchset '1' for Issue #1388.
Uploader: Szilárd Páll ()
Change-Id: I3649294a143bb744a2e26fd1d9dbfb87dea421ca
Gerrit URL: https://gerrit.gromacs.org/3905

#11 Updated by Roland Schulz over 2 years ago

  • Related to Bug #1572: Incorrect PME energies and forces with high numbers of OpenMP threads added

Also available in: Atom PDF