Project

General

Profile

Bug #1572

Incorrect PME energies and forces with high numbers of OpenMP threads

Added by Berk Hess about 3 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

With combined MPI and very high OpenMP parallelization the PME energy and forces can be incorrect and there are data races between threads. This only happens with PME grid spreading subdivision along y over 3 or more threads with the y thread-local grid size smaller than pme_order-1. For production runs this situation will likely never occur. The only case where this happens is a relatively small PME grid with a very high OpenMP thread count containing relatively large prime factors, e.g. 20^3 grid with 3x2x1 domain decomposition and 25 OpenMP threads (which is an order of magnitude beyond the scaling limit).

pme.c (170 KB) pme.c Berk Hess, 02/05/2015 09:25 PM

Related issues

Related to GROMACS - Bug #1578: PME incorrect with MPI+OpenMP and multiple MPI communication pulsesClosed2014-08-15
Related to GROMACS - Bug #1388: Data race in PME with large prime number of threadsClosed2013-11-25

Associated revisions

Revision 27189bba (diff)
Added by Berk Hess about 3 years ago

Fixed PME bug with high OpenMP thread count

PME energies and forces could be incorrect with combined MPI+OpenMP
parallelization. This would, only, happen when
pmegrids->nthread_comm[YY] >= 2, which can only occur with high OpenMP
thread count with multiple large prime factors.
It's unlikely that this issue affected production runs.

Fixes #1572.

Change-Id: I03b38c279c8f8ab2e111dad0976edad88b3ea93b

Revision 6ba80a26 (diff)
Added by Berk Hess about 3 years ago

Fixed two PME issues with MPI+OpenMP

Change 272736bc partially fixed #1388, but broke the more general
case of multiple MPI communication pulses in PME. Change 272736bc
incorrectly changed tx1 and ty1. This change has been reverted.

Change 27189bba fixed the incorrect PME grid reduction with multiple
thread grid overlap in y. But it broke the, much more common, case
where the y-size of the PME grid is not divisible by the domains in y.
This change, incorrectly, changed buf_my.

Now buf_my is set to the correct value, which solves both issues.

Fixes #1578.
Refs #1388 and #1572.

Change-Id: Id2d7d013a3b8cdc04eda1fb026567088a38ec81f

Revision c0602644 (diff)
Added by Berk Hess over 2 years ago

Re-fixed PME bug with high OpenMP thread count

PME energies and forces could be incorrect with combined MPI+OpenMP
parallelization. This would, only, happen when
pmegrids->nthread_comm[YY] >= 2, which can only occur with high OpenMP
thread count with multiple prime factors that are large wrt the grid.
It's unlikely that this issue affected production runs.
This bug was fixed in 27189bba, but 6ba80a26 broke it again.
Fixes #1572.

Change-Id: Ic01bed4193062f8ca885fcb6bf347f2ef0de909f

History

#1 Updated by Gerrit Code Review Bot about 3 years ago

Gerrit received a related patchset '1' for Issue #1572.
Uploader: Berk Hess ()
Change-Id: I03b38c279c8f8ab2e111dad0976edad88b3ea93b
Gerrit URL: https://gerrit.gromacs.org/3847

#2 Updated by Berk Hess about 3 years ago

  • Status changed from New to Fix uploaded

#3 Updated by Roland Schulz about 3 years ago

  • Status changed from Fix uploaded to Closed

#4 Updated by Gerrit Code Review Bot about 3 years ago

Gerrit received a related patchset '1' for Issue #1572.
Uploader: Berk Hess ()
Change-Id: Id2d7d013a3b8cdc04eda1fb026567088a38ec81f
Gerrit URL: https://gerrit.gromacs.org/3896

#5 Updated by Gerrit Code Review Bot about 3 years ago

Gerrit received a related DRAFT patchset '1' for Issue #1572.
Uploader: Szilárd Páll ()
Change-Id: I3649294a143bb744a2e26fd1d9dbfb87dea421ca
Gerrit URL: https://gerrit.gromacs.org/3905

#6 Updated by Roland Schulz almost 3 years ago

  • Status changed from Closed to Accepted
  • Target version changed from 4.6.7 to 5.0.3

This wasn't fully solved. The nbnxn-free-energy test run with "-nt 8 -ntomp 16" and commit 50e7c6408e45f (merge of master and release-5-0) gives:

WARNING: ThreadSanitizer: data race (pid=12745)
  Write of size 4 at 0x7d8800067e7c by thread T26:
    #0 reduce_threadgrid_overlap ../src/gromacs/ewald/pme.c:4071 (libgromacs.so.1+0x000000270bf4)
    #1 gomp_thread_start ../../../libgomp/team.c:117 (libgomp.so.1+0x00000000cdc9)

  Previous write of size 4 at 0x7d8800067e7c by thread T83:
    #0 reduce_threadgrid_overlap ../src/gromacs/ewald/pme.c:4071 (libgromacs.so.1+0x000000270bf4)
    #1 gomp_thread_start ../../../libgomp/team.c:117 (libgomp.so.1+0x00000000cdc9)

  Location is heap block of size 5152 at 0x7d8800067800 allocated by thread T3:
    #0 calloc <null>:0 (libtsan.so.0+0x00000004c571)
    #1 save_calloc ../src/gromacs/utility/smalloc.c:179 (libgromacs.so.1+0x0000002e8b6c)
    #2 init_overlap_comm ../src/gromacs/ewald/pme.c:3234 (libgromacs.so.1+0x00000026dfb2)
    #3 gmx_pme_init ../src/gromacs/ewald/pme.c:3574 (libgromacs.so.1+0x00000027ba12)
    #4 mdrunner ../src/programs/mdrun/runner.cpp:1644 (gmx+0x000000024a27)
    #5 mdrunner_start_fn ../src/programs/mdrun/runner.cpp:187 (gmx+0x000000025e67)
    #6 tMPI_Thread_starter ../src/external/thread_mpi/src/tmpi_init.c:397 (libgromacs.so.1+0x000000f89409)
    #7 tMPI_Thread_starter ../src/external/thread_mpi/src/pthreads.c:234 (libgromacs.so.1+0x000000f7fd28)
    #8 <null> <null>:0 (libtsan.so.0+0x000000035609)

This specific version isn't necessary. A few other master versions give the same error. If I revert commit 6ba80a267eaf2 (the 2nd patch) then it is OK.

#7 Updated by Roland Schulz almost 3 years ago

  • Related to Bug #1578: PME incorrect with MPI+OpenMP and multiple MPI communication pulses added

#8 Updated by Roland Schulz almost 3 years ago

  • Related to Bug #1388: Data race in PME with large prime number of threads added

#9 Updated by Roland Schulz almost 3 years ago

PS: If I also revert 27189bba (1st patch) then it is still OK. If I revert also 272736bc then it isn't OK. So the 6ba80a267eaf2 breaks the fix in 272736bc.

#10 Updated by Mark Abraham almost 3 years ago

  • Target version changed from 5.0.3 to 5.0.4

#11 Updated by Mark Abraham almost 3 years ago

  • Target version changed from 5.0.4 to 5.0.5

Sorry haven't had time to look into this, and won't have time until January.

#12 Updated by Roland Schulz over 2 years ago

  • Priority changed from Normal to High

Increasing priority: It produces incorrect results. It doesn't cause a crash. It affects the regressiontests.

#13 Updated by Berk Hess over 2 years ago

I think both min operations in the final change should be combined for the communication buffer. The attached pme.c fixes this issue. But we should check thoroughly that it doesn't brake anything else. I'm currently to tired to think this through carefully.

#14 Updated by Berk Hess over 2 years ago

  • Status changed from Accepted to In Progress

#15 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #1572.
Uploader: Berk Hess ()
Change-Id: Ic01bed4193062f8ca885fcb6bf347f2ef0de909f
Gerrit URL: https://gerrit.gromacs.org/4440

#16 Updated by Mark Abraham over 2 years ago

  • Status changed from In Progress to Resolved

I think this is fixed and merged through to all branches now?

#17 Updated by Mark Abraham over 2 years ago

  • Status changed from Resolved to Closed

No known work to do, closing

Also available in: Atom PDF