Bug #1873

dEkin/dl parallelization issue

Added by Igor Leontyev over 4 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info: probably all versions since about GROMACS 4.0
Affected version:
Difficulty: uncategorized

Description

It looks like there is a parallelization bug in the computation of the "dEkin/dl" term in free energy simulations.
I am computing ddG with GROMACS 5.1 and a mixed topology, i.e. I need dG in the protein and in water:
ddG = dG(InProtein) - dG(InWater)
The alchemical transformation involves only a single-atom mutation, H->Cl. I expect the dEkin/dl terms InProtein and InWater to cancel each other, but this does not always happen.

In simulations with 1 or 2 thread-MPI ranks I get a <dEkin/dl> value of about 130 kJ/mol; with 4 or 8 threads <dEkin/dl> drops to ~65 kJ/mol, and with 16 threads it drops again by about a factor of two.
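As a rough sanity check on which of these values is plausible, the magnitude of <dEkin/dl> for a single perturbed mass can be estimated from equipartition. The sketch below is not taken from the report. It assumes linear mass interpolation in lambda, lambda = 0, an unconstrained atom, standard atomic masses for H and Cl, T = 298 K, and that dEkin/dl is the derivative of 1/2 m(lambda) v^2 at fixed velocities. The function name is hypothetical, and the sign convention used by mdrun is not addressed.

```python
# Rough equipartition estimate of <dEkin/dl> for a single H -> Cl mass change.
# Assumptions (not from the report): linear mass interpolation in lambda,
# lambda = 0, an unconstrained atom with 3 degrees of freedom, standard
# atomic masses, and T = 298 K (close to the temperature in the logs).

R_KJ_PER_MOL_K = 0.0083144626  # gas constant in kJ/(mol K)


def dekin_dl_estimate(mass_a, mass_b, lam, temperature):
    """Return <d/dl (1/2 m(l) v^2)> in kJ/mol using <1/2 m v^2> = 3/2 R T."""
    mass = (1.0 - lam) * mass_a + lam * mass_b  # m(lambda)
    dmass_dl = mass_b - mass_a                  # dm/dlambda
    return 1.5 * R_KJ_PER_MOL_K * temperature * dmass_dl / mass


estimate = dekin_dl_estimate(mass_a=1.008, mass_b=35.45, lam=0.0, temperature=298.0)
print(f"estimated <dEkin/dl> = {estimate:.0f} kJ/mol")
```

Under these assumptions the estimate comes out near 127 kJ/mol, close to the value of about 130 kJ/mol seen with 1 or 2 ranks, which would suggest that the lower values at higher rank counts are the incorrect ones.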

Attached are the tpr-file and log-files for the 2, 4 and 8 thread simulations. The 30 ps that I ran for these short tests is not sufficient to fully converge <dEkin/dl>, but it is enough to see the qualitative difference between the 2-thread and the 4-8 thread simulations.

079_c1c2-liq_0.tpr (1.15 MB) tpr-file Igor Leontyev, 12/10/2015 11:01 AM
log-files.tgz (12.2 KB) log-files Igor Leontyev, 12/10/2015 11:01 AM

Associated revisions

Revision 414747c9 (diff)
Added by Mark Abraham about 4 years ago

Fix dEkin/dl handling with multiple ranks

With non-vv integrators, enerd->dekindl was computed at
nstglobalcomm-1 step, but not accumulated across ranks unless bGStat
also happened to be true. Then at the next (i.e. nstglobalcomm) step,
bGStat and bEkinhOld were true, so calc_ke_part copied the values into
enerd->dekindl_old. These were then not accumulated, so sum_ekin
averaged the accumulated enerd->dekindl with the non-accumulated
enerd->dekindl_old to store in enerd->dvdl_lin[efptMASS]. So, it seems
likely that mass-perturbed free-energy calculations with multiple
ranks have been broken with (at least) non-vv integrators for a long
time (perhaps since 031a8b58f).

Fixes #1873

Change-Id: I262e3cfc97a50e1a343563134c7ba89539bba59a

History

#1 Updated by Mark Abraham about 4 years ago

  • Description updated (diff)
  • Status changed from New to Accepted

There are definitely problems here. Igor, can you please also share a tarball of grompp inputs? I would like to be able to change the integrator and cutoff scheme while diagnosing what is wrong with the code. (Side point, Igor: you are using the group scheme, which does not support OpenMP, so your threads are those of thread-MPI, creating multiple ranks.)

With your .tpr (SD + FE, group scheme with nstlist 10) and release-5-1 branch HEAD (double, MPI), running gmx_mpi_d mdrun -np [124] -nsteps 1 gives results that agree for all rank counts. With -nsteps 2, I see

ranks_01.log:           Step           Time         Lambda
ranks_01.log-              0        0.00000        0.00000
--
ranks_01.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_01.log-   -4.78635e+04    8.29309e+01   -1.31394e+05    2.88844e+04   -1.02510e+05
ranks_01.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_01.log-    2.97839e+02   -3.44277e+02   -8.55108e+02    4.70784e+01   -6.86837e+01
--
ranks_01.log:           Step           Time         Lambda
ranks_01.log-              2        0.00200        0.00000
--
ranks_01.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_01.log-   -4.78767e+04    8.14817e+01   -1.31418e+05    2.89460e+04   -1.02472e+05
ranks_01.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_01.log-    2.98474e+02   -3.44867e+02    2.47349e+02    6.80340e+01    1.01129e+02
--
ranks_02.log:           Step           Time         Lambda
ranks_02.log-              0        0.00000        0.00000
--
ranks_02.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_02.log-   -4.78635e+04    8.29309e+01   -1.31394e+05    2.88844e+04   -1.02510e+05
ranks_02.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_02.log-    2.97839e+02   -3.44277e+02   -8.55108e+02    4.70784e+01   -6.86837e+01
--
ranks_02.log:           Step           Time         Lambda
ranks_02.log-              2        0.00200        0.00000
--
ranks_02.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_02.log-   -4.78767e+04    8.14817e+01   -1.31418e+05    4.34000e+04   -8.80183e+04
ranks_02.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_02.log-    4.47516e+02   -3.44867e+02    1.73830e+03    6.80340e+01    1.01129e+02
--
ranks_04.log:           Step           Time         Lambda
ranks_04.log-              0        0.00000        0.00000
--
ranks_04.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_04.log-   -4.78635e+04    8.29309e+01   -1.31394e+05    2.88844e+04   -1.02510e+05
ranks_04.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_04.log-    2.97839e+02   -3.44277e+02   -8.55108e+02    4.70784e+01   -6.86837e+01
--
ranks_04.log:           Step           Time         Lambda
ranks_04.log-              2        0.00200        0.00000
--
ranks_04.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_04.log-   -4.78767e+04    8.14817e+01   -1.31418e+05    7.23081e+04   -5.91103e+04
ranks_04.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_04.log-    7.45598e+02   -3.44867e+02    4.72019e+03    6.80340e+01    1.01129e+02

i.e. the KE and Temperature on step 2 are wrong with multiple ranks, but dEkin/dl is OK. (Same pattern for -nsteps 9, of course.)

With -nsteps 10,

ranks_01.log:           Step           Time         Lambda
ranks_01.log-              0        0.00000        0.00000
--
ranks_01.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_01.log-   -4.78635e+04    8.29309e+01   -1.31394e+05    2.88844e+04   -1.02510e+05
ranks_01.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_01.log-    2.97839e+02   -3.44277e+02   -8.55108e+02    4.70784e+01   -6.86837e+01
--
ranks_01.log:           Step           Time         Lambda
ranks_01.log-             10        0.01000        0.00000
--
ranks_01.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_01.log-   -4.79354e+04    7.98044e+01   -1.31746e+05    2.91660e+04   -1.02580e+05
ranks_01.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_01.log-    3.00742e+02   -3.44867e+02   -7.05551e+02    2.13336e+01    1.00194e+02
--
ranks_02.log:           Step           Time         Lambda
ranks_02.log-              0        0.00000        0.00000
--
ranks_02.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_02.log-   -4.78635e+04    8.29309e+01   -1.31394e+05    2.88844e+04   -1.02510e+05
ranks_02.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_02.log-    2.97839e+02   -3.44277e+02   -8.55108e+02    4.70784e+01   -6.86837e+01
--
ranks_02.log:           Step           Time         Lambda
ranks_02.log-             10        0.01000        0.00000
--
ranks_02.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_02.log-   -4.79354e+04    7.98044e+01   -1.31746e+05    2.91660e+04   -1.02580e+05
ranks_02.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_02.log-    3.00742e+02   -3.44867e+02   -7.05551e+02    2.13336e+01    1.00194e+02
--
ranks_04.log:           Step           Time         Lambda
ranks_04.log-              0        0.00000        0.00000
--
ranks_04.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_04.log-   -4.78635e+04    8.29309e+01   -1.31394e+05    2.88844e+04   -1.02510e+05
ranks_04.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_04.log-    2.97839e+02   -3.44277e+02   -8.55108e+02    4.70784e+01   -6.86837e+01
--
ranks_04.log:           Step           Time         Lambda
ranks_04.log-             10        0.01000        0.00000
--
ranks_04.log:   Coul. recip. Position Rest.      Potential    Kinetic En.   Total Energy
ranks_04.log-   -4.79354e+04    7.98044e+01   -1.31746e+05    2.91660e+04   -1.02580e+05
ranks_04.log:    Temperature Pres. DC (bar) Pressure (bar)       dEkin/dl      dVcoul/dl
ranks_04.log-    3.00742e+02   -3.44867e+02   -7.05551e+02    1.20617e+01    1.00194e+02

i.e. the KE and Temperature are now fine, but dEkin/dl has problems once there are more than 2 ranks (confirmed with 3, 5, 6 and 8 ranks also). 3-6 ranks still use a 1D decomposition, 8 uses 2D, and all report the same value for dEkin/dl at step 10.

So, it looks like the reduction of KE is wrong for terminating non-nstlist steps, and the reduction of dEkin/dl is wrong for nstlist steps.
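The comparison made by eye in the excerpts above can also be written as a small check. This is only a sketch: the numbers are copied from the step-2 and step-10 log lines quoted in this comment, while the function name and the tolerance are hypothetical choices.

```python
# Values copied from the log excerpts above: rank count -> reported value.
kinetic_step2 = {1: 2.89460e+04, 2: 4.34000e+04, 4: 7.23081e+04}    # Kinetic En., -nsteps 2
dekin_dl_step10 = {1: 2.13336e+01, 2: 2.13336e+01, 4: 1.20617e+01}  # dEkin/dl, -nsteps 10


def ranks_disagreeing(values, reference_ranks=1, rel_tol=1e-5):
    """Return rank counts whose value differs from the single-rank reference."""
    reference = values[reference_ranks]
    return [n for n, v in sorted(values.items())
            if abs(v - reference) > rel_tol * abs(reference)]


print(ranks_disagreeing(kinetic_step2))    # -> [2, 4]
print(ranks_disagreeing(dekin_dl_step10))  # -> [4]
```

The single-rank run is used as the reference because it involves no cross-rank reduction at all.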

#2 Updated by Gerrit Code Review Bot about 4 years ago

Gerrit received a related patchset '1' for Issue #1873.
Uploader: Mark Abraham
Change-Id: I262e3cfc97a50e1a343563134c7ba89539bba59a
Gerrit URL: https://gerrit.gromacs.org/5550

#3 Updated by Mark Abraham about 4 years ago

  • Assignee set to Mark Abraham
  • Target version set to 5.0.8
  • Affected version - extra info set to probably all versions since about GROMACS 4.0

Yeah, the handling of enerd->dekindl is not right with multiple ranks. At step 9, it isn't accumulated (because !bGStat), then at step 10 it's immediately copied to enerd->dekindl_old. This issue seems to have been present for a long time.

I've uploaded a candidate fix, but I need to try some other integrators and stuff before we can feel any confidence about this.
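The failure mode described in this comment can be illustrated with a toy model. It is a sketch under stated assumptions, not the mdrun code: each rank is taken to hold a partial sum of dEkin/dl over its own atoms, the reported value is assumed to be the average of the previous and current values, and all names are hypothetical.

```python
def reported_dekindl(local_prev, local_curr, reduce_prev):
    """Averaged dEkin/dl as rank 0 would report it in this toy model.

    local_prev, local_curr: per-rank partial sums at the previous and
    current step. reduce_prev=False mimics the behaviour described above,
    where the previous-step value is copied into the 'old' slot without
    having been summed over ranks.
    """
    curr = sum(local_curr)  # current step: summed over all ranks
    old = sum(local_prev) if reduce_prev else local_prev[0]  # rank 0's local part only
    return 0.5 * (old + curr)


# One perturbed atom contributing 40.0 at both steps (illustrative number).
single_rank = reported_dekindl([40.0], [40.0], reduce_prev=False)
atom_on_rank2 = [0.0, 0.0, 40.0, 0.0]  # 4 ranks, atom not on rank 0
buggy = reported_dekindl(atom_on_rank2, atom_on_rank2, reduce_prev=False)
fixed = reported_dekindl(atom_on_rank2, atom_on_rank2, reduce_prev=True)
print(single_rank, buggy, fixed)  # 40.0 20.0 40.0
```

In this model the result is correct whenever the perturbed atom happens to sit on rank 0, and is roughly halved otherwise, so the outcome depends on the domain decomposition rather than on the physics.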

#4 Updated by Gerrit Code Review Bot about 4 years ago

Gerrit received a related patchset '1' for Issue #1873.
Uploader: Mark Abraham
Change-Id: I8d4d69de2ceabe0b13b5aaeea31edb8f88b100a3
Gerrit URL: https://gerrit.gromacs.org/5554

#5 Updated by Mark Abraham about 4 years ago

  • Status changed from Accepted to Fix uploaded

#6 Updated by Mark Abraham about 4 years ago

  • Status changed from Fix uploaded to Resolved

#7 Updated by Mark Abraham about 4 years ago

  • Status changed from Resolved to Closed
