Bug #2802

Force counter contains up to few % time unaccounted for

Added by Szilárd Páll 5 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Category: mdrun
Difficulty: uncategorized

Description

With all bonded, nonbonded, and PME interactions in the system offloaded, the cycle counters still record non-negligible work under "Force"; e.g. with GluCL on a 12-core + GV100 system running OpenMP-only:
gmx mdrun -ntmpi 1 -ntomp 12 -v -quiet -noconfout -npme 0 -pin on -nsteps 10000 -resetstep 8000 -pinstride 2 -nb gpu -pme gpu -tunepme -bonded gpu

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 12 OpenMP threads

 Computing:          Num   Num      Call    Wall time         Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1   12         21       0.145          5.037   3.9
 Launch GPU ops.        1   12       4002       0.260          9.068   7.1
 Force                  1   12       2001       0.038          1.335   1.0
 Wait PME GPU gather    1   12       2001       0.771         26.885  21.0
 Reduce GPU PME F       1   12       2001       0.050          1.752   1.4
 Wait GPU NB local      1   12       2001       1.532         53.383  41.8
 NB X/F buffer ops.     1   12       3981       0.283          9.867   7.7
 Update                 1   12       2001       0.172          5.997   4.7
 Constraints            1   12       2001       0.295         10.289   8.1
 Rest                                           0.118          4.111   3.2
-----------------------------------------------------------------------------
 Total                                          3.665        127.724 100.0
-----------------------------------------------------------------------------
 Breakdown of PP computation
-----------------------------------------------------------------------------
 NS grid local          1   12         21       0.026          0.918   0.7
 NS search local        1   12         21       0.101          3.529   2.8
 Launch NB GPU tasks    1   12       2001       0.044          1.531   1.2
 Launch PME GPU task    1   12       2001       0.145          5.064   4.0
 NB X buffer ops.       1   12       1980       0.135          4.698   3.7
 NB F buffer ops.       1   12       2001       0.148          5.164   4.0
-----------------------------------------------------------------------------
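The table is internally consistent with the command line, and the "%" column is each row's share of the total giga-cycles. The Python sketch below reproduces that arithmetic. It assumes the counters cover steps 8000 through 10000 inclusive, and it assumes a pair-search interval of 100 steps, which is inferred from the 21 "Neighbor search" calls and is not stated in the report. It also assumes no separate X buffer operation on search steps, which is consistent with the 1980 "NB X buffer ops." calls. The 4002 "Launch GPU ops." calls likewise match the 2001 NB launches plus the 2001 PME launches in the breakdown.

```python
# (call count, wall time in s, giga-cycles) for each row of the main table above.
ROWS = {
    "Neighbor search": (21, 0.145, 5.037),
    "Launch GPU ops.": (4002, 0.260, 9.068),
    "Force": (2001, 0.038, 1.335),
    "Wait PME GPU gather": (2001, 0.771, 26.885),
    "Reduce GPU PME F": (2001, 0.050, 1.752),
    "Wait GPU NB local": (2001, 1.532, 53.383),
    "NB X/F buffer ops.": (3981, 0.283, 9.867),
    "Update": (2001, 0.172, 5.997),
    "Constraints": (2001, 0.295, 10.289),
    "Rest": (None, 0.118, 4.111),  # "Rest" has no call count
}
TOTAL_GCYCLES = sum(gcycles for _, _, gcycles in ROWS.values())


def timed_steps(nsteps: int, resetstep: int) -> int:
    """Steps covered by the counters after the reset, assuming both ends are inclusive."""
    return nsteps - resetstep + 1


def search_steps(nsteps: int, resetstep: int, nstlist: int) -> int:
    """Pair-search steps in [resetstep, nsteps], assuming a search every nstlist steps."""
    first = -(-resetstep // nstlist) * nstlist  # first multiple of nstlist >= resetstep
    return 0 if first > nsteps else (nsteps - first) // nstlist + 1


def share_percent(name: str) -> float:
    """Row's share of all cycles, rounded to one decimal like the "%" column."""
    return round(100.0 * ROWS[name][2] / TOTAL_GCYCLES, 1)


def microseconds_per_call(name: str) -> float:
    """Average wall time per call of one counter, in microseconds."""
    count, wall_s, _ = ROWS[name]
    if not count:
        raise ValueError(f"{name!r} has no call count")
    return 1e6 * wall_s / count


steps = timed_steps(nsteps=10000, resetstep=8000)
ns = search_steps(nsteps=10000, resetstep=8000, nstlist=100)  # nstlist=100 is assumed
x_ops = steps - ns  # assumes no separate X buffer op on search steps
print(steps, ns, x_ops + steps)  # 2001 21 3981
print(f"Force: {share_percent('Force')} % of cycles, "
      f"{microseconds_per_call('Force'):.1f} us per call")
# Force: 1.0 % of cycles, 19.0 us per call
```

That is roughly 19 µs of wall time per step spent under "Force", even though no force work is assigned to the CPU.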

Associated revisions

Revision 5ada2357 (diff)
Added by Szilárd Páll 3 months ago

Clarify force buffer setup code in do_force

Refactored code and made conditionals non-nested to make it easier to
understand when a common or a separate buffer is used for the forces
when the direct virial contribution is computed.

Also add a subcounter for force buffer clearing, which also helps annotate
code that should be conditional on whether any of these buffers are used
to accumulate into or only to copy into (e.g. with everything offloaded to
a GPU).

Refs #2802

Change-Id: I3fa5a3e4e4adf5cfe0eb417f0c1c3d0ed4a96769
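The conditional clearing this commit prepares for can be sketched in Python as shown below. The idea is to zero a force buffer only when CPU work will accumulate into it, and to time that clearing under its own subcounter. This is a hedged illustration, not the C++ change itself. `SubCounter`, `prepare_force_buffer` and `cpu_accumulates` are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass
class SubCounter:
    """Hypothetical stand-in for a cycle subcounter: accumulated wall time and call count."""
    seconds: float = 0.0
    calls: int = 0


def prepare_force_buffer(forces: list, cpu_accumulates: bool, counter: SubCounter) -> bool:
    """Zero `forces` only if CPU tasks will accumulate into it; return True if cleared.

    When every force task is offloaded, the buffer is later filled by copying the
    GPU result, so clearing it first would be wasted work.
    """
    if not cpu_accumulates:
        return False
    start = time.perf_counter()
    forces[:] = [0.0] * len(forces)  # clear in place
    counter.seconds += time.perf_counter() - start
    counter.calls += 1
    return True


counter = SubCounter()
f = [1.0, 2.0, 3.0]
prepare_force_buffer(f, cpu_accumulates=False, counter=counter)  # all offloaded: skipped
print(f, counter.calls)  # [1.0, 2.0, 3.0] 0
prepare_force_buffer(f, cpu_accumulates=True, counter=counter)   # CPU work present: cleared
print(f, counter.calls)  # [0.0, 0.0, 0.0] 1
```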

Revision cb258586 (diff)
Added by Szilárd Páll 3 months ago

Encapsulate force output setup in do_force_*

Code that sets up force buffer outputs for force-only and virial
contributions, together with the buffer clearing, is moved into a
function that produces a struct containing the relevant data.

Refs #2802

Change-Id: Ie04a8c8edf703610ff8e357792d6ec22ebb718ff
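The refactoring described here decides once which buffers receive forces, returns them together, and clears them in one place. The small Python sketch below illustrates the idea. `ForceOutputs`, `setup_force_outputs` and its parameters are hypothetical names chosen for the illustration. The actual change is C++ and its types may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

Vec3 = List[float]


@dataclass
class ForceOutputs:
    """Hypothetical bundle of the buffers one force call writes into."""
    force: List[Vec3]              # forces passed on to the integrator
    force_with_virial: List[Vec3]  # forces that need the direct virial contribution


def _zero(buffer: List[Vec3]) -> None:
    for v in buffer:
        v[:] = [0.0, 0.0, 0.0]


def setup_force_outputs(force: List[Vec3],
                        virial_buffer: Optional[List[Vec3]],
                        use_separate_virial_buffer: bool,
                        clear: bool) -> ForceOutputs:
    """Decide once which buffers receive forces, and clear them in one place."""
    if use_separate_virial_buffer:
        if virial_buffer is None:
            raise ValueError("a separate virial buffer was requested but not provided")
        outputs = ForceOutputs(force, virial_buffer)
    else:
        outputs = ForceOutputs(force, force)  # one common buffer serves both roles
    if clear:
        _zero(outputs.force)
        if outputs.force_with_virial is not outputs.force:
            _zero(outputs.force_with_virial)
    return outputs


f = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fv = [[7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]
out = setup_force_outputs(f, fv, use_separate_virial_buffer=False, clear=True)
print(out.force_with_virial is out.force, f[0], fv[0])  # True [0.0, 0.0, 0.0] [7.0, 8.0, 9.0]
```

Keeping the shared-versus-separate decision and the clearing in a single function would let a later change skip the clearing (here, `clear=False`) in the fully offloaded case without touching every call site.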

History

#1 Updated by Berk Hess 5 months ago

How much?
We are not calculating anything and also not reducing AFAIK. So there should only be loops over bonded types and maybe reduction buffer parts that only check bools. These could still cost some time and we could skip them.
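The skip suggested here can be sketched by deciding once, at setup, whether any listed interaction type is assigned to the CPU. When none is, each step returns immediately instead of walking all types and checking flags. The class and its names below are hypothetical and only illustrate the idea. This is not GROMACS code.

```python
class ListedForcesSketch:
    """Hypothetical listed-force driver that filters CPU interaction types once at setup."""

    def __init__(self, runs_on_cpu: dict):
        # runs_on_cpu maps an interaction type name to True if the CPU computes it.
        self.cpu_types = [t for t, on_cpu in runs_on_cpu.items() if on_cpu]
        self.type_visits = 0  # counts per-type work, to show what gets skipped

    def calculate(self) -> list:
        if not self.cpu_types:
            return []  # everything offloaded: no per-step loop at all
        done = []
        for itype in self.cpu_types:
            self.type_visits += 1
            done.append(itype)  # placeholder for computing this interaction type
        return done


offloaded = ListedForcesSketch({"bonds": False, "angles": False, "dihedrals": False})
for _ in range(2001):
    offloaded.calculate()
print(offloaded.type_visits)  # 0
```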

#2 Updated by Szilárd Páll 5 months ago

  • Subject changed from Listed force call not skipped when all bondeds are offloaded to Force counter contains up to few % time unaccoutned for
  • Description updated (diff)

#3 Updated by Szilárd Páll 5 months ago

  • Subject changed from Force counter contains up to few % time unaccoutned for to Force counter contains up to few % time unaccounted for

#4 Updated by Paul Bauer 5 months ago

Is this the issue that got fixed by https://gerrit.gromacs.org/#/c/8822/?

#5 Updated by Szilárd Páll 5 months ago

No, it did not get fixed. The title was updated because, contrary to what I originally thought, it is likely not related to bondeds; something else in the force counter is taking time that it should not.

#6 Updated by Mark Abraham 4 months ago

  • Target version set to 2019.1

#7 Updated by Mark Abraham 4 months ago

  • Assignee set to Szilárd Páll

#8 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related DRAFT patchset '2' for Issue #2802.
Uploader: Szilárd Páll
Change-Id: gromacs~master~I3fa5a3e4e4adf5cfe0eb417f0c1c3d0ed4a96769
Gerrit URL: https://gerrit.gromacs.org/9136

#9 Updated by Szilárd Páll 3 months ago

  • Status changed from New to Resolved

It turns out that force buffer clearing can take a significant amount of time (in addition to some dangling PBC setup code), even when no forces are produced on the CPU. We won't address this in 2019, but we have made some progress on it in master.

#10 Updated by Paul Bauer 3 months ago

  • Status changed from Resolved to Closed

#11 Updated by Gerrit Code Review Bot 3 months ago

Gerrit received a related patchset '1' for Issue #2802.
Uploader: Szilárd Páll
Change-Id: gromacs~master~Ie04a8c8edf703610ff8e357792d6ec22ebb718ff
Gerrit URL: https://gerrit.gromacs.org/9246
