Bug #1686

bonded force regression

Added by Szilárd Páll almost 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
core library
Target version:
Affected version - extra info:
5.1-dev-20150212-85937e6
Affected version:
Difficulty:
uncategorized

Description

The performance of the bonded force computation seems to have regressed greatly: the wall time of the bonded/listed-forces counter roughly doubles (0.927 s to 1.875 s) between the two builds below.

VERSION 5.0.5-dev-20150204-42bb5b3

Bonded F               1    8       2001       0.927         19.290  12.2

VERSION 5.1-dev-20150212-85937e6

Listed F               1    8       2001       1.875         39.024  21.9

Logs and input (54k water+ethanol system) attached.


Related issues

Related to GROMACS - Bug #1673: mis-use of simd.h after refactoring (Closed)

Associated revisions

Revision 42897afc (diff)
Added by Mark Abraham almost 5 years ago

Reenable SIMD for R-B dihedrals

Fixes #1686

Change-Id: If612a5afde4bcfd55319d2b6e38b521aa43a9973
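
For context, the compile-time trigger that selects the SIMD kernel for the Ryckaert-Bellemans dihedrals had been lost (see #1673 and the discussion below), so the scalar fallback was silently in use. A minimal sketch of that kind of guarded dispatch, with illustrative names rather than the actual GROMACS symbols:

#include <cstddef>

// Scalar fallback: always available.
static void calcRBDihedralsScalar(std::size_t nDihedrals)
{
    for (std::size_t i = 0; i < nDihedrals; ++i)
    {
        // ... compute the Ryckaert-Bellemans potential and forces for dihedral i
    }
}

#ifdef GMX_SIMD_HAVE_REAL
// SIMD kernel: processes several dihedrals per iteration.
static void calcRBDihedralsSimd(std::size_t nDihedrals)
{
    // ... SIMD implementation over packs of dihedrals
    (void)nDihedrals;
}
#endif

// Dispatch: if the guard macro is no longer tested (or its name changes during
// a refactoring), only the scalar path runs and the "Listed F" counter roughly
// doubles, as in the log excerpts in this report.
void calcRBDihedrals(std::size_t nDihedrals)
{
#ifdef GMX_SIMD_HAVE_REAL
    calcRBDihedralsSimd(nDihedrals);
#else
    calcRBDihedralsScalar(nDihedrals);
#endif
}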

Revision 63a5ca30 (diff)
Added by Mark Abraham over 4 years ago

Extend Force sub-counters

Need more data for understanding performance variation

Implemented subcounter "restart" and used it for accumulating
position-restraints time with FEP to the position-restraints
subcounter.

Noted TODOs for some future extensions not currently possible.

Also added logfile output from GMX_CYCLE_BARRIER where people
analyzing the performance will see it.

Refs #1686

Change-Id: I9d60d0a683f56549879bb739269e9466c96572c4
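
These sub-counters are what produce the "Breakdown of PP computation" tables quoted throughout this report. A minimal, self-contained sketch of the pattern (plain C++, not the GROMACS wallcycle API): each region gets a start/stop pair that accumulates wall time and a call count, and a "restart" lets a later code path - here, a hypothetical FEP pass of the position restraints - add time to an existing sub-counter without counting an extra call.

#include <chrono>
#include <cstdio>
#include <map>
#include <string>

struct SubCounter
{
    double seconds = 0.0;
    long   calls   = 0;
};

class WallCycle
{
public:
    // countCall = false is the "restart" case: accumulate more time into an
    // existing sub-counter without bumping its call count.
    void subStart(const std::string& name, bool countCall = true)
    {
        start_[name] = Clock::now();
        if (countCall)
        {
            counters_[name].calls++;
        }
    }
    void subStop(const std::string& name)
    {
        counters_[name].seconds +=
            std::chrono::duration<double>(Clock::now() - start_[name]).count();
    }
    void report() const
    {
        for (const auto& kv : counters_)
        {
            std::printf(" %-20s %8ld %10.3f s\n",
                        kv.first.c_str(), kv.second.calls, kv.second.seconds);
        }
    }
private:
    using Clock = std::chrono::steady_clock;
    std::map<std::string, SubCounter>        counters_;
    std::map<std::string, Clock::time_point> start_;
};

int main()
{
    WallCycle wc;

    wc.subStart("Listed F");
    // ... compute bonded/listed forces ...
    wc.subStop("Listed F");

    wc.subStart("Pos. Restr.");
    // ... plain position restraints ...
    wc.subStop("Pos. Restr.");

    // "Restart": the FEP pass accumulates into the same sub-counter.
    wc.subStart("Pos. Restr.", /*countCall=*/false);
    // ... position restraints at the foreign lambda ...
    wc.subStop("Pos. Restr.");

    wc.report();
    return 0;
}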

History

#1 Updated by Mark Abraham almost 5 years ago

Comparing master HEAD plus the fix for the mdrun-only build (i.e. with the SIMD fixes, but nothing else from gerrit) against recent release-5-0, on 1 OpenMP thread on BG/Q (so no MPI or OpenMP issues involved), I saw the Force counter get slower by rather more than the Listed counter did (same kind of thing if I used more OpenMP threads):

master on left, release-5-0 on right

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G                       R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 256 MPI ranks doing PP, and                                                    On 256 MPI ranks doing PP, and
on 256 MPI ranks doing PME                                                        on 256 MPI ranks doing PME

 Computing:          Num   Num      Call    Wall time         Giga-Cycles          Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %                              Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------
 Domain decomp.       256    1        501       2.857       1170.089   1.8     |   Domain decomp.       256    1        501       2.857       1170.230   2.1
 DD comm. load        256    1        501       0.131         53.456   0.1     |   DD comm. load        256    1        501       0.130         53.200   0.1
 DD comm. bounds      256    1        501       0.075         30.674   0.0     |   DD comm. bounds      256    1        501       0.076         31.269   0.1
 Send X to PME        256    1      10001       0.393        161.178   0.3     |   Send X to PME        256    1      10001       0.394        161.454   0.3
 Neighbor search      256    1        501       3.579       1465.891   2.3     |   Neighbor search      256    1        501       3.143       1287.434   2.3
 Comm. coord.         256    1       9500       1.946        797.107   1.3     |   Comm. coord.         256    1       9500       2.107        862.928   1.5
 Force                256    1      10001      56.192      23016.305  36.3     |   Force                256    1      10001      48.175      19732.699  35.0
 Wait + Comm. F       256    1      10001       2.055        841.669   1.3     |   Wait + Comm. F       256    1      10001       2.211        905.468   1.6
 PME mesh *           256    1      10001      25.995      10647.784  16.8     |   PME mesh *           256    1      10001      25.823      10577.079  18.7
 PME wait for PP *                             51.373      21042.655  33.2     |   PME wait for PP *                             43.084      17647.242  31.3
 Wait + Recv. PME F   256    1      10001       0.413        169.065   0.3     |   Wait + Recv. PME F   256    1      10001       0.390        159.895   0.3
 NB X/F buffer ops.   256    1      29001       2.321        950.829   1.5     |   NB X/F buffer ops.   256    1      29001       2.216        907.746   1.6
 Update               256    1      10001       0.747        305.790   0.5     |   Update               256    1      10001       0.702        287.636   0.5
 Constraints          256    1      10001       3.320       1359.991   2.1     |   Constraints          256    1      10001       3.207       1313.755   2.3
 Comm. energies       256    1       1001       0.275        112.717   0.2     |   Comm. energies       256    1       1001       0.273        111.843   0.2
 Rest                                           3.066       1255.897   2.0     |   Rest                                           3.025       1238.987   2.2
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------
 Total                                         77.369      63381.318 100.0     |   Total                                         68.907      56449.088 100.0
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to       (*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.            twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------
 Breakdown of PME mesh computation                                                 Breakdown of PME mesh computation
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------
 PME redist. X/F      256    1      20002       7.374       3020.341   4.8     |   PME redist. X/F      256    1      20002       7.245       2967.406   5.3
 PME spread/gather    256    1      20002      10.545       4319.241   6.8     |   PME spread/gather    256    1      20002      10.556       4323.926   7.7
 PME 3D-FFT           256    1      20002       3.637       1489.734   2.4     |   PME 3D-FFT           256    1      20002       3.605       1476.447   2.6
 PME 3D-FFT Comm.     256    1      40004       2.462       1008.551   1.6     |   PME 3D-FFT Comm.     256    1      40004       2.463       1008.661   1.8
 PME solve Elec       256    1      10001       0.615        251.889   0.4     |   PME solve Elec       256    1      10001       0.613        251.158   0.4
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------
 Breakdown of PP computation                                                       Breakdown of PP computation
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------
 DD redist.           256    1        501       0.168         68.795   0.1     |   DD redist.           256    1        501       0.190         78.002   0.1
 DD NS grid + sort    256    1        501       0.091         37.258   0.1     |   DD NS grid + sort    256    1        501       0.107         43.798   0.1
 DD setup comm.       256    1        501       0.565        231.384   0.4     |   DD setup comm.       256    1        501       0.545        223.359   0.4
 DD make top.         256    1        501       0.622        254.709   0.4     |   DD make top.         256    1        501       0.677        277.236   0.5
 DD make constr.      256    1        501       0.866        354.755   0.6     |   DD make constr.      256    1        501       0.834        341.437   0.6
 DD top. other        256    1        501       0.200         82.108   0.1     |   DD top. other        256    1        501       0.210         85.865   0.2
 NS grid non-loc.     256    1        501       0.291        119.376   0.2     |   NS grid non-loc.     256    1        501       0.289        118.363   0.2
 NS search local      256    1        501       0.175         71.708   0.1     |   NS search local      256    1        501       0.170         69.720   0.1
 NS search non-loc.   256    1        501       1.585        649.314   1.0     |   NS search non-loc.   256    1        501       1.607        658.413   1.2
 Listed F             256    1      10001       2.583       1057.879   1.7     |   Bonded F             256    1      10001       1.252        512.731   0.9
 Nonbonded F          256    1      20002      27.770      11374.736  17.9     |   Nonbonded F          256    1      20002      27.670      11333.547  20.1
 NB X buffer ops.     256    1      19000       0.823        336.997   0.5     |   NB X buffer ops.     256    1      19000       0.814        333.489   0.6
 NB F buffer ops.     256    1      10001       0.561        229.734   0.4     |   NB F buffer ops.     256    1      10001       0.558        228.469   0.4
-----------------------------------------------------------------------------     -----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)                                                Core t (s)   Wall t (s)        (%)
       Time:    39613.055       77.369    51200.0                              |         Time:    35280.406       68.907    51200.0

For my runs, in neither branch do the PP sub-counters come anywhere close to adding up to the total Force time (unlike Szilárd's runs), so it seems we are missing some noteworthy contributions for CPU-only runs...

#2 Updated by Mark Abraham almost 5 years ago

I did a bit of bisection; I can see the Bonded sub-counter take ~twice as long as in release-5-0 as far back as commit f9820250 (the merge right after 5.0). Thus, the formation of the listed-forces module is not to blame here.

The bloat in the total Force counter, however, I have observed to go away by that point, so I will look for the offending commit somewhere in between - it's the bigger effect, and may also illuminate the problem with the bondeds.

#3 Updated by Gerrit Code Review Bot almost 5 years ago

Gerrit received a related patchset '1' for Issue #1686.
Uploader: Mark Abraham ()
Change-Id: If612a5afde4bcfd55319d2b6e38b521aa43a9973
Gerrit URL: https://gerrit.gromacs.org/4457

#4 Updated by Mark Abraham almost 5 years ago

The patch I just uploaded re-enables SIMD for the R-B dihedrals, but it is not the full answer.

#5 Updated by Mark Abraham almost 5 years ago

So I think that patch fixes the observed regression, but there is an increase in Force that is not reflected in the PP accounting, so we should probably have another field or two there - and a "Rest" entry. The total for Force is nowhere near the sum of the PP sub-counters...

release-5-0 on left, fixed master on right

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G                                 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 256 MPI ranks doing PP, and                                                              On 256 MPI ranks doing PP, and
on 256 MPI ranks doing PME                                                                  on 256 MPI ranks doing PME

 Computing:          Num   Num      Call    Wall time         Giga-Cycles                    Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %                                        Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Domain decomp.       256    1        501       2.857       1170.230   2.1               |   Domain decomp.       256    1        501       2.713       1111.411   1.9
 DD comm. load        256    1        501       0.130         53.200   0.1               |   DD comm. load        256    1        501       0.133         54.477   0.1
 DD comm. bounds      256    1        501       0.076         31.269   0.1               |   DD comm. bounds      256    1        501       0.072         29.289   0.0
 Send X to PME        256    1      10001       0.394        161.454   0.3               |   Send X to PME        256    1      10001       0.393        160.924   0.3
 Neighbor search      256    1        501       3.143       1287.434   2.3               |   Neighbor search      256    1        501       3.368       1379.425   2.3
 Comm. coord.         256    1       9500       2.107        862.928   1.5               |   Comm. coord.         256    1       9500       1.857        760.743   1.3
 Force                256    1      10001      48.175      19732.699  35.0               |   Force                256    1      10001      50.177      20552.487  34.9
 Wait + Comm. F       256    1      10001       2.211        905.468   1.6               |   Wait + Comm. F       256    1      10001       1.946        797.014   1.4
 PME mesh *           256    1      10001      25.823      10577.079  18.7               |   PME mesh *           256    1      10001      25.462      10429.332  17.7
 PME wait for PP *                             43.084      17647.242  31.3               |   PME wait for PP *                             46.523      19055.971  32.3
 Wait + Recv. PME F   256    1      10001       0.390        159.895   0.3               |   Wait + Recv. PME F   256    1      10001       0.401        164.267   0.3
 NB X/F buffer ops.   256    1      29001       2.216        907.746   1.6               |   NB X/F buffer ops.   256    1      29001       2.178        891.995   1.5
 Update               256    1      10001       0.702        287.636   0.5               |   Update               256    1      10001       0.721        295.364   0.5
 Constraints          256    1      10001       3.207       1313.755   2.3               |   Constraints          256    1      10001       4.753       1946.767   3.3
 Comm. energies       256    1       1001       0.273        111.843   0.2               |   Comm. energies       256    1       1001       0.275        112.682   0.2
 Rest                                           3.025       1238.987   2.2               |   Rest                                           3.000       1228.682   2.1
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Total                                         68.907      56449.088 100.0               |   Total                                         71.986      58971.051 100.0
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to                 (*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.                      twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Breakdown of PME mesh computation                                                           Breakdown of PME mesh computation
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 PME redist. X/F      256    1      20002       7.245       2967.406   5.3               |   PME redist. X/F      256    1      20002       6.812       2790.100   4.7
 PME spread/gather    256    1      20002      10.556       4323.926   7.7               |   PME spread/gather    256    1      20002      10.553       4322.398   7.3
 PME 3D-FFT           256    1      20002       3.605       1476.447   2.6               |   PME 3D-FFT           256    1      20002       3.639       1490.357   2.5
 PME 3D-FFT Comm.     256    1      40004       2.463       1008.661   1.8               |   PME 3D-FFT Comm.     256    1      40004       2.498       1023.315   1.7
 PME solve Elec       256    1      10001       0.613        251.158   0.4               |   PME solve Elec       256    1      10001       0.612        250.830   0.4
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Breakdown of PP computation                                                                 Breakdown of PP computation
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 DD redist.           256    1        501       0.190         78.002   0.1               |   DD redist.           256    1        501       0.156         63.934   0.1
 DD NS grid + sort    256    1        501       0.107         43.798   0.1               |   DD NS grid + sort    256    1        501       0.091         37.099   0.1
 DD setup comm.       256    1        501       0.545        223.359   0.4               |   DD setup comm.       256    1        501       0.543        222.578   0.4
 DD make top.         256    1        501       0.677        277.236   0.5               |   DD make top.         256    1        501       0.627        256.696   0.4
 DD make constr.      256    1        501       0.834        341.437   0.6               |   DD make constr.      256    1        501       0.751        307.434   0.5
 DD top. other        256    1        501       0.210         85.865   0.2               |   DD top. other        256    1        501       0.204         83.443   0.1
 NS grid non-loc.     256    1        501       0.289        118.363   0.2               |   NS grid non-loc.     256    1        501       0.292        119.502   0.2
 NS search local      256    1        501       0.170         69.720   0.1               |   NS search local      256    1        501       0.178         72.788   0.1
 NS search non-loc.   256    1        501       1.607        658.413   1.2               |   NS search non-loc.   256    1        501       1.566        641.446   1.1
 Bonded F             256    1      10001       1.252        512.731   0.9               |   Listed F             256    1      10001       1.309        536.293   0.9
 Nonbonded F          256    1      20002      27.670      11333.547  20.1               |   Nonbonded F          256    1      20002      27.697      11344.805  19.2
 NB X buffer ops.     256    1      19000       0.814        333.489   0.6               |   NB X buffer ops.     256    1      19000       0.811        332.280   0.6
 NB F buffer ops.     256    1      10001       0.558        228.469   0.4               |   NB F buffer ops.     256    1      10001       0.555        227.215   0.4
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)                                                          Core t (s)   Wall t (s)        (%)
       Time:    35280.406       68.907    51200.0                                        |         Time:    36856.634       71.986    51200.0
                 (ns/day)    (hour/ns)                                                                       (ns/day)    (hour/ns)
Performance:       25.080        0.957                                                   |  Performance:       24.007        1.000

#6 Updated by Szilárd Páll almost 5 years ago

What about the difference in constraints? Is that just fluctuation in communication cost?

#7 Updated by Mark Abraham almost 5 years ago

I ran another comparison identical to comment 5, and got

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G                                 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 256 MPI ranks doing PP, and                                                              On 256 MPI ranks doing PP, and
on 256 MPI ranks doing PME                                                                  on 256 MPI ranks doing PME

 Computing:          Num   Num      Call    Wall time         Giga-Cycles                    Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %                                        Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Domain decomp.       256    1        501       2.857       1170.230   2.1               |   Domain decomp.       256    1        501       2.616       1071.504   1.9
 DD comm. load        256    1        501       0.130         53.200   0.1               |   DD comm. load        256    1        501       0.133         54.580   0.1
 DD comm. bounds      256    1        501       0.076         31.269   0.1               |   DD comm. bounds      256    1        501       0.073         29.946   0.1
 Send X to PME        256    1      10001       0.394        161.454   0.3               |   Send X to PME        256    1      10001       0.392        160.719   0.3
 Neighbor search      256    1        501       3.143       1287.434   2.3               |   Neighbor search      256    1        501       3.243       1328.184   2.4
 Comm. coord.         256    1       9500       2.107        862.928   1.5               |   Comm. coord.         256    1       9500       1.789        732.984   1.3
 Force                256    1      10001      48.175      19732.699  35.0               |   Force                256    1      10001      47.581      19489.367  34.6
 Wait + Comm. F       256    1      10001       2.211        905.468   1.6               |   Wait + Comm. F       256    1      10001       1.905        780.313   1.4
 PME mesh *           256    1      10001      25.823      10577.079  18.7               |   PME mesh *           256    1      10001      25.398      10403.141  18.5
 PME wait for PP *                             43.084      17647.242  31.3               |   PME wait for PP *                             43.404      17778.291  31.5
 Wait + Recv. PME F   256    1      10001       0.390        159.895   0.3               |   Wait + Recv. PME F   256    1      10001       0.388        158.909   0.3
 NB X/F buffer ops.   256    1      29001       2.216        907.746   1.6               |   NB X/F buffer ops.   256    1      29001       2.116        866.729   1.5
 Update               256    1      10001       0.702        287.636   0.5               |   Update               256    1      10001       0.696        285.092   0.5
 Constraints          256    1      10001       3.207       1313.755   2.3               |   Constraints          256    1      10001       4.608       1887.621   3.3
 Comm. energies       256    1       1001       0.273        111.843   0.2               |   Comm. energies       256    1       1001       0.275        112.684   0.2
 Rest                                           3.025       1238.987   2.2               |   Rest                                           2.986       1223.022   2.2
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Total                                         68.907      56449.088 100.0               |   Total                                         68.802      56363.310 100.0
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to                 (*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.                      twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Breakdown of PME mesh computation                                                           Breakdown of PME mesh computation
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 PME redist. X/F      256    1      20002       7.245       2967.406   5.3               |   PME redist. X/F      256    1      20002       6.768       2772.113   4.9
 PME spread/gather    256    1      20002      10.556       4323.926   7.7               |   PME spread/gather    256    1      20002      10.551       4321.726   7.7
 PME 3D-FFT           256    1      20002       3.605       1476.447   2.6               |   PME 3D-FFT           256    1      20002       3.632       1487.529   2.6
 PME 3D-FFT Comm.     256    1      40004       2.463       1008.661   1.8               |   PME 3D-FFT Comm.     256    1      40004       2.491       1020.329   1.8
 PME solve Elec       256    1      10001       0.613        251.158   0.4               |   PME solve Elec       256    1      10001       0.611        250.089   0.4
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Breakdown of PP computation                                                                 Breakdown of PP computation
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 DD redist.           256    1        501       0.190         78.002   0.1               |   DD redist.           256    1        501       0.148         60.789   0.1
 DD NS grid + sort    256    1        501       0.107         43.798   0.1               |   DD NS grid + sort    256    1        501       0.091         37.233   0.1
 DD setup comm.       256    1        501       0.545        223.359   0.4               |   DD setup comm.       256    1        501       0.532        217.995   0.4
 DD make top.         256    1        501       0.677        277.236   0.5               |   DD make top.         256    1        501       0.625        255.950   0.5
 DD make constr.      256    1        501       0.834        341.437   0.6               |   DD make constr.      256    1        501       0.674        275.895   0.5
 DD top. other        256    1        501       0.210         85.865   0.2               |   DD top. other        256    1        501       0.204         83.515   0.1
 NS grid non-loc.     256    1        501       0.289        118.363   0.2               |   NS grid non-loc.     256    1        501       0.293        119.988   0.2
 NS search local      256    1        501       0.170         69.720   0.1               |   NS search local      256    1        501       0.187         76.431   0.1
 NS search non-loc.   256    1        501       1.607        658.413   1.2               |   NS search non-loc.   256    1        501       1.595        653.229   1.2
 Bonded F             256    1      10001       1.252        512.731   0.9               |   Listed F             256    1      10001       1.308        535.876   1.0
 Nonbonded F          256    1      20002      27.670      11333.547  20.1               |   Nonbonded F          256    1      20002      27.757      11369.297  20.2
 NB X buffer ops.     256    1      19000       0.814        333.489   0.6               |   NB X buffer ops.     256    1      19000       0.818        335.123   0.6
 NB F buffer ops.     256    1      10001       0.558        228.469   0.4               |   NB F buffer ops.     256    1      10001       0.554        227.095   0.4
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)                                                          Core t (s)   Wall t (s)        (%)
       Time:    35280.406       68.907    51200.0                                        |         Time:    35226.796       68.802    51200.0
                 (ns/day)    (hour/ns)                                                                       (ns/day)    (hour/ns)
Performance:       25.080        0.957                                                   |  Performance:       25.118        0.955

The Constraints counter is still up, but Force got faster by much more. There's no external network effect, but I don't think there's a good way to probe what is going on. The two code versions could lead to different localized clustering of the ethanol, and thus to different constraint-imbalance properties. It's hard to speculate about what is changing in Force, since about a third of its time is not covered by the subcycle counters...

#8 Updated by Gerrit Code Review Bot almost 5 years ago

Gerrit received a related DRAFT patchset '1' for Issue #1686.
Uploader: Mark Abraham ()
Change-Id: I9d60d0a683f56549879bb739269e9466c96572c4
Gerrit URL: https://gerrit.gromacs.org/4461

#9 Updated by Mark Abraham almost 5 years ago

  • Status changed from New to Fix uploaded
  • Assignee set to Mark Abraham

False alarm. I was inadvertently running with GMX_CYCLE_BARRIER :-( I have added a higher-visibility warning in the patch that extends the cycle subcounters https://gerrit.gromacs.org/#/c/4461/ - likely only one of them is useful.
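
For reference, GMX_CYCLE_BARRIER makes mdrun call an MPI barrier before each cycle-counter start/stop, which distorts the timing breakdown. A minimal sketch of the kind of higher-visibility log note meant here (the helper name is hypothetical, not the exact code in the patch):

#include <cstdio>
#include <cstdlib>

// Write a prominent note into the log, next to where the timing tables end up,
// whenever the debugging environment variable is set.
void noteCycleBarrierInLog(std::FILE* fplog)
{
    if (fplog != nullptr && std::getenv("GMX_CYCLE_BARRIER") != nullptr)
    {
        std::fprintf(fplog,
                     "NOTE: GMX_CYCLE_BARRIER is set, so an MPI barrier is called around the\n"
                     "      cycle counting. The timings reported below are therefore not\n"
                     "      representative of normal performance.\n\n");
    }
}

int main()
{
    noteCycleBarrierInLog(stderr); // usage example; real code would pass the md.log handle
    return 0;
}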

I think the issue here is no longer apparent on BG/Q - listed forces are a bit faster, constraints a bit slower, comm. coord. a bit faster. Since the R-B patch probably fixes Szilárd's observations about slower listed forces, I think the issue is resolved.

-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Domain decomp.       256    2        501       2.141       1753.517   3.0               |   Domain decomp.       256    2        501       2.045       1675.236   2.9
 DD comm. load        256    2        501       0.060         48.932   0.1               |   DD comm. load        256    2        501       0.060         49.427   0.1
 DD comm. bounds      256    2        501       0.175        143.070   0.2               |   DD comm. bounds      256    2        501       0.176        144.460   0.2
 Send X to PME        256    2      10001       0.229        187.718   0.3               |   Send X to PME        256    2      10001       0.228        186.668   0.3
 Neighbor search      256    2        501       1.656       1356.535   2.3               |   Neighbor search      256    2        501       1.636       1339.914   2.3
 Comm. coord.         256    2       9500       2.102       1722.366   3.0               |   Comm. coord.         256    2       9500       1.909       1564.190   2.7
 Force                256    2      10001      20.283      16615.659  28.5               |   Force                256    2      10001      20.121      16482.822  28.4
 Wait + Comm. F       256    2      10001       3.004       2460.638   4.2               |   Wait + Comm. F       256    2      10001       2.819       2309.512   4.0
 PME mesh *           256    2      10001      19.068      15620.832  26.8               |   PME mesh *           256    2      10001      19.210      15736.779  27.1
 PME wait for PP *                             16.527      13538.902  23.2               |   PME wait for PP *                             16.192      13264.374  22.9
 Wait + Recv. PME F   256    2      10001       0.168        137.991   0.2               |   Wait + Recv. PME F   256    2      10001       0.163        133.243   0.2
 NB X/F buffer ops.   256    2      29001       1.580       1294.496   2.2               |   NB X/F buffer ops.   256    2      29001       1.583       1296.719   2.2
 Update               256    2      10001       0.483        395.299   0.7               |   Update               256    2      10001       0.484        396.517   0.7
 Constraints          256    2      10001       2.790       2285.780   3.9               |   Constraints          256    2      10001       3.289       2694.599   4.6
 Comm. energies       256    2       1001       0.410        335.752   0.6               |   Comm. energies       256    2       1001       0.403        330.514   0.6
 Rest                                           0.515        421.990   0.7               |   Rest                                           0.485        397.340   0.7
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Total                                         35.595      58319.484 100.0               |   Total                                         35.402      58002.321 100.0
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to                 (*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.                      twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Breakdown of PME mesh computation                                                           Breakdown of PME mesh computation
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 PME redist. X/F      256    2      20002       6.824       5589.910   9.6               |   PME redist. X/F      256    2      20002       6.920       5668.911   9.8
 PME spread/gather    256    2      20002       7.582       6210.817  10.6               |   PME spread/gather    256    2      20002       7.619       6241.870  10.8
 PME 3D-FFT           256    2      20002       1.473       1206.772   2.1               |   PME 3D-FFT           256    2      20002       1.488       1218.989   2.1
 PME 3D-FFT Comm.     256    2      40004       2.824       2313.398   4.0               |   PME 3D-FFT Comm.     256    2      40004       2.815       2306.086   4.0
 PME solve Elec       256    2      10001       0.198        162.050   0.3               |   PME solve Elec       256    2      10001       0.197        161.512   0.3
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 Breakdown of PP computation                                                                 Breakdown of PP computation
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------
 DD redist.           256    2        501       0.169        138.121   0.2               |   DD redist.           256    2        501       0.154        125.841   0.2
 DD NS grid + sort    256    2        501       0.107         87.472   0.1               |   DD NS grid + sort    256    2        501       0.090         73.340   0.1
 DD setup comm.       256    2        501       0.534        437.360   0.7               |   DD setup comm.       256    2        501       0.535        438.320   0.8
 DD make top.         256    2        501       0.560        458.926   0.8               |   DD make top.         256    2        501       0.526        430.915   0.7
 DD make constr.      256    2        501       0.461        377.363   0.6               |   DD make constr.      256    2        501       0.436        356.908   0.6
 DD top. other        256    2        501       0.181        147.930   0.3               |   DD top. other        256    2        501       0.183        149.897   0.3
 NS grid non-loc.     256    2        501       0.329        269.268   0.5               |   NS grid non-loc.     256    2        501       0.332        272.298   0.5
 NS search local      256    2        501       0.134        110.017   0.2               |   NS search local      256    2        501       0.134        110.163   0.2
 NS search non-loc.   256    2        501       1.143        936.668   1.6               |   NS search non-loc.   256    2        501       1.119        916.829   1.6
 Bonded F             256    2      10001       1.253       1026.194   1.8               |   Listed F             256    2      10001       0.872        714.535   1.2
 Nonbonded F          256    2      20002      18.886      15471.257  26.5               |   Listed buffer ops.   256    2      10001       0.250        204.899   0.4
 NB X buffer ops.     256    2      19000       0.905        741.538   1.3               |   Nonbonded F          256    2      20002      18.848      15440.225  26.6
 NB F buffer ops.     256    2      10001       0.661        541.799   0.9               |   Shift X              256    2      10001       0.001          1.010   0.0
                                                                                         >   Ewald Q correction   256    2      10001       0.004          3.497   0.0
                                                                                         >   NB X buffer ops.     256    2      19000       0.911        745.938   1.3
                                                                                         >   NB F buffer ops.     256    2      10001       0.657        537.823   0.9
                                                                                         >   NB virial buf. ops.  256    2         21       0.000          0.117   0.0
-----------------------------------------------------------------------------               -----------------------------------------------------------------------------

               Core t (s)   Wall t (s)        (%)                                                          Core t (s)   Wall t (s)        (%)
       Time:    36449.664       35.595   102400.0                                        |         Time:    36251.437       35.402   102400.0
                 (ns/day)    (hour/ns)                                                                       (ns/day)    (hour/ns)
Performance:       48.550        0.494                                                   |  Performance:       48.816        0.492

#10 Updated by Mark Abraham over 4 years ago

One lesson to learn here is that it is easy to lose code introduced in the maintenance branch if there are long-lived refactoring patches that move code around. This applies both to bug fixes and to things like the SIMD enhancements that we sneaked into release-5-0, whose triggers were lost during the refactoring, leading to an apparent performance regression.

In future, I will comment and vote to suggest that we should complete pending code motion before merging maintenance branches into master. In practice, people are generally not diligent about reviewing merge patches, which have to be dealt with in a timely fashion before the branch HEADs move on. So, that means I will be less willing to submit bug fixes to the maintenance branches until there's been review and merging of work done in master. If people want bugs fixed, then they need to help out with the whole lifetime.

#11 Updated by Szilárd Páll over 4 years ago

Mark Abraham wrote:

In future, I will comment and vote to suggest that we should complete pending code motion before merging maintenance branches into master.

Good point.

So, that means I will be less willing to submit bug fixes to the maintenance branches until there's been review and merging of work done in master. If people want bugs fixed, then they need to help out with the whole lifetime.

That's where I'm not sure you convinced me. This sounds (again) like a convenience measure that motivates early abandoning of released code. I think we should be careful about de-emphasizing maintenance releases; instead, we should be much more strict about allowing only finished and ready features into a release (otherwise simply disable the feature after forking off the release branch), and focus during maintenance on actual fixes rather than on patching up what was not implemented or tested properly before release.

#12 Updated by Mark Abraham over 4 years ago

Szilárd Páll wrote:

Mark Abraham wrote:

In future, I will comment and vote to suggest that we should complete pending code motion before merging maintenance branches into master.

Good point.

So, that means I will be less willing to submit bug fixes to the maintenance branches until there's been review and merging of work done in master. If people want bugs fixed, then they need to help out with the whole lifetime.

That's where I'm not sure you convinced me. This sounds (again) like a convenience measure that motivates early abandoning of released code.

Submitting a valid fix to the active maintenance branch, the time spent making releases from that branch, and the number of active maintenance branches are separate issues.

I'm concerned here about the actual lifetime of a bug fix, which runs from the time we notice a problem to the time the resulting fixes are merged into the master branch. Currently people seem to switch off once a bug fix is submitted from gerrit into the maintenance branch, and assume someone else will take care of the merges and of reviewing them. This does not combine well with long-lived master-branch gerrit patches (or feature branches from it), and combines particularly poorly with code moving from one place to another. Hence, I would prefer to hold off on submitting maintenance-branch changes until there is somewhere moderately stable in the master branch for that code to eventually go. This means the merge process can go more smoothly, which is important if most people are just going to leave that work to someone else. Naturally, I would much prefer people to be diligent about prioritising reviewing merge patches when they appear in gerrit, but every few weeks someone chasing up all the people knowledgeable about the content of the latest merge patch is not a fun prospect either.

I think we should be careful about de-emphasizing maintenance releases; instead, we should be much more strict about allowing only finished and ready features into a release (otherwise simply disable the feature after forking off the release branch), and focus during maintenance on actual fixes rather than on patching up what was not implemented or tested properly before release.

This discussion is probably better in person. We now do pretty much only actual fixes during maintenance, and I can recall no feature-completion patches on release-5-0 after 5.0. Where we did add some functionality during the lifetime of release-5-0, those were improved implementations of well-tested code paths. I'd generally prefer to avoid even that.

#13 Updated by Szilárd Páll over 4 years ago

Mark Abraham wrote:

Szilárd Páll wrote:

Mark Abraham wrote:

In future, I will comment and vote to suggest that we should complete pending code motion before merging maintenance branches into master.

Good point.

So, that means I will be less willing to submit bug fixes to the maintenance branches until there's been review and merging of work done in master. If people want bugs fixed, then they need to help out with the whole lifetime.

That's where I'm not sure you convinced me. This sounds (again) like a convenience measure that motivates early abandoning of released code.

Submitting a valid fix to the active maintenance branch, the time spent making releases from that branch, and the number of active maintenance branches are separate issues.

I'm concerned here about the actual lifetime of a bug fix, which runs from the time we notice a problem to the time the resulting fixes are merged into the master branch. Currently people seem to switch off once a bug fix is submitted from gerrit into the maintenance branch, and assume someone else will take care of the merges and of reviewing them. This does not combine well with long-lived master-branch gerrit patches (or feature branches from it), and combines particularly poorly with code moving from one place to another. Hence, I would prefer to hold off on submitting maintenance-branch changes until there is somewhere moderately stable in the master branch for that code to eventually go.

IIUC you are suggesting that a maintenance commit is blocked until its merge into master, including conflict resolution with pending master-branch changes, is all done, right? I agree that, in the spirit of "code contribution requires commitment to maintenance", it's fair to expect that the author of e.g. a bugfix helps with merging it upstream. However, a bugfix and its forward porting are IMO related but separate matters, while the proposed approach seems to link the two together and has the disadvantage that it prioritizes new development at the expense of fixes. Putting maintenance commits on hold will delay bugfixes (and with that, bugfix releases), generate additional work to keep the respective change up to date, and in general leave work that's ready in a pending state - without saving work in any of the branches.

This means the merge process can go more smoothly, which is important if most people are just going to leave that work to someone else.

I don't think it's simply a case of people leaving work behind...

Naturally, I would much prefer people to be diligent about prioritising reviewing merge patches when they appear in gerrit, but every few weeks someone chasing up all the people knowledgeable about the content of the latest merge patch is not a fun prospect either.

but it seems more a matter of communication and community effort. If one dev contributes a fix, another is working on related code in master, and yet another does the merge, is it not reasonable for all three to discuss the merge(s) rather than delaying the fix? Even better would be if they made sure that all three commits in question got reviewed by the other two devs.
Keeping people informed about ongoing development and about the help needed with merging (some may not even know that their fix conflicts with new code), doing early trial merges, and posting about the most important and/or urgent changes that need review should all help. If that does not, I can't imagine that putting bugfixes on hold will be beneficial in the long run.

I think we should be careful about de-emphasizing maintenance releases; instead, we should be much more strict about allowing only finished and ready features into a release (otherwise simply disable the feature after forking off the release branch), and focus during maintenance on actual fixes rather than on patching up what was not implemented or tested properly before release.

This discussion is probably better in person. We now do pretty much only actual fixes during maintenance, and I can recall no feature-completion patches on release-5-0 after 5.0. Where we did add some functionality during the lifetime of release-5-0, those were improved implementations of well-tested code paths. I'd generally prefer to avoid even that.

Sure, let's discuss it offline.

#14 Updated by Roland Schulz over 4 years ago

I think that the SIMD trigger being lost during a merge is more of a symptom than the problem. I think the problem was that losing the trigger didn't cause any warning (performance tests, preprocessor warning, ...). Even if the merge had been fine, this problem could have been prompted by some refactoring now or later. I think we should put effort into creating more warnings. This would help with both merging and refactoring, and not just address one symptom.

#15 Updated by Mark Abraham over 4 years ago

Roland Schulz wrote:

I think that the SIMD trigger being lost during a merge is more of a symptom than the problem. I think the problem was that losing the trigger didn't cause any warning (performance tests, preprocessor warning, ...). Even if the merge had been fine, this problem could have been prompted by some refactoring now or later. I think we should put effort into creating more warnings. This would help with both merging and refactoring, and not just address one symptom.

Sure, that's what https://gerrit.gromacs.org/#/c/4390/ does.
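
A minimal sketch of the kind of build-time tripwire being discussed - the macro names are illustrative, and the linked change may implement it differently:

// If SIMD is enabled for this build but a kernel that is expected to have a
// SIMD implementation would silently fall back to the scalar path, complain at
// compile time instead of just losing performance.
#if defined(GMX_SIMD) && !defined(GMX_SIMD_HAVE_REAL)
#    pragma message("No real-precision SIMD available: listed-force kernels will use the scalar fallback")
#endif

The complementary, run-time guard Roland mentions would be a performance test that tracks the "Listed F" sub-counter across builds, catching cases the preprocessor cannot see.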

#16 Updated by Mark Abraham over 4 years ago

  • Related to Bug #1673: mis-use of simd.h after refactoring added

#17 Updated by Mark Abraham over 4 years ago

  • Status changed from Fix uploaded to Resolved

#18 Updated by Erik Lindahl over 4 years ago

  • Status changed from Resolved to Closed
