Feature #1188

Separate timing reports for PP and PME nodes

Added by Mikhail Plotnikov about 4 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: Mark Abraham
Category: mdrun
Target version: 5.0
Difficulty: uncategorized

Description

Please implement separate timing reports for PP and PME nodes when separate PME nodes are used (the -npme option). The current report is a mess in this case, because it mixes PP and PME timings, so the total walltime is smaller than the sum of all components. The PME thread count is also reported incorrectly, and the timing itself is probably incorrect, when the number of PME threads is taken from the OMP_NUM_THREADS variable rather than set with the -ntomp_pme option.
To reproduce, please use the attached test case.
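
For reference, a run with separate PME ranks and explicit per-rank thread counts is launched along these lines (binary name and all counts are placeholders, not the settings of the attached test case):

 mpirun -np 20 mdrun_mpi -s topol.tpr -npme 4 -ntomp 2 -ntomp_pme 4

Leaving out -ntomp/-ntomp_pme, so that the PME thread count falls back to OMP_NUM_THREADS, is the configuration flagged as suspect above.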

PME_timings_report.tbz2 (6.84 MB) Mikhail Plotnikov, 03/12/2013 01:59 PM

Associated revisions

Revision 1df1ceb2 (diff)
Added by Mark Abraham about 3 years ago

Improved wallcycle reporting

Removed reporting of MPI and thread counts on each row, in
favour of a header with that information.

With npme > 0, prints a note that the time column is not
supposed to add up.

Works correctly with a range of -npme, -ntomp and -ntomp_pme values:
PP times add up to the total, which equals the final walltime
reported; cycle count and percentage column totals are correct and
reflect the actual work done.

Partial fix for #1188

Change-Id: Ic870d981bf0375189601bf8c9bc67bc5d6226497
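
To make the described change concrete, here is a minimal, self-contained sketch in C of the reporting behaviour the commit message describes: a header carrying the rank/thread layout, percentages computed from aggregated cycle counts, and a note when npme > 0 that the per-row times are not expected to add up to the walltime. This is not the actual GROMACS wallcycle code; all names are illustrative, and the two sample rows are loosely taken from the table in comment #1.

/* Illustrative sketch only -- not the actual GROMACS wallcycle code. */
#include <stdio.h>

typedef struct
{
    const char *name;    /* counter name, e.g. "Force" or "PME mesh"    */
    double      gcycles; /* cycles summed over the ranks running it (G) */
    double      seconds; /* node-seconds spent in this counter          */
} wc_row;

static void print_report(const wc_row *rows, int nrows,
                         int npp, int ntomp, int npme, int ntomp_pme,
                         double total_seconds)
{
    double total_gcycles = 0.0;
    int    i;

    for (i = 0; i < nrows; i++)
    {
        total_gcycles += rows[i].gcycles;
    }

    /* Header with the rank/thread layout instead of per-row counts */
    printf("On %d MPI ranks doing PP work, %d OpenMP threads per rank", npp, ntomp);
    if (npme > 0)
    {
        printf(",\nand %d MPI ranks doing PME work, %d OpenMP threads per rank",
               npme, ntomp_pme);
    }
    printf("\n\n Computing:          G-Cycles    Seconds      %%\n");
    for (i = 0; i < nrows; i++)
    {
        /* Percentages come from the aggregated cycle counts */
        printf(" %-18s %10.3f %10.1f %6.1f\n",
               rows[i].name, rows[i].gcycles, rows[i].seconds,
               100.0*rows[i].gcycles/total_gcycles);
    }
    printf(" Total                          %10.1f  100.0\n", total_seconds);

    if (npme > 0)
    {
        /* PP and PME rows overlap in time, so warn about the Seconds column */
        printf("\n Note: with separate PME ranks the PP and PME rows overlap in time,\n"
               " so the Seconds column is not expected to add up to the total.\n");
    }
}

int main(void)
{
    /* Two rows loosely based on the table in comment #1 */
    wc_row rows[] = {
        { "Force",    4363.061, 6233.0 },
        { "PME mesh", 1652.623, 2360.9 },
    };

    print_report(rows, 2, 48, 1, 16, 1, 11665.5);
    return 0;
}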

History

#1 Updated by Mark Abraham almost 4 years ago

  • Status changed from New to Accepted
  • Assignee set to Mark Abraham
  • Target version changed from 4.6.2 to 4.6.3
  • Affected version set to 4.6

Reproduced, thanks.

In the pre-threading era we produced output like:

 Computing:         Nodes     Number     G-Cycles    Seconds     %
-----------------------------------------------------------------------
 Domain decomp.        48        750      154.369      220.5     1.9
 DD comm. load         48        750        1.186        1.7     0.0
 DD comm. bounds       48        751       10.257       14.7     0.1
 Send X to PME         48       7501       36.525       52.2     0.4
 Comm. coord.          48       7501      167.265      239.0     2.0
 Neighbor search       48        751      345.441      493.5     4.2
 Force                 48       7501     4363.061     6233.0    53.4
 Wait + Comm. F        48       7501      370.087      528.7     4.5
 PME mesh              16       7501     1652.623     2360.9    20.2
 Wait + Comm. X/F      16                 388.828      555.5     4.8
 Wait + Recv. PME F    48       7501      273.452      390.6     3.3
 Write traj.           48         16        2.632        3.8     0.0
 Update                48       7501       87.795      125.4     1.1
 Constraints           48       7501      112.100      160.1     1.4
 Comm. energies        48        752      126.219      180.3     1.5
 Rest                  48                  73.972      105.7     0.9
-----------------------------------------------------------------------
 Total                 64                8165.812    11665.5   100.0
-----------------------------------------------------------------------

That is, the columns added up correctly, but we did not separate PP-only from PME-only work when that split was in use. I am working on a patch to return to that functionality.
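
As a sanity check on the table above, if the Seconds column is read as node-seconds aggregated over the participating nodes (an assumption, but one the numbers support), the rows are consistent with every node being billed to exactly one counter for the whole run:

 PP rows   (48 nodes): 220.5 + 1.7 + ... + 105.7 =  8749.2 node-s  ->  8749.2 / 48 ≈ 182.3 s
 PME rows  (16 nodes): 2360.9 + 555.5            =  2916.4 node-s  ->  2916.4 / 16 ≈ 182.3 s
 All rows  (64 nodes): 8749.2 + 2916.4           = 11665.6 node-s  -> 11665.6 / 64 ≈ 182.3 s

Each group of nodes accounts for the same ~182 s of walltime, and the grand total matches the reported 11665.5 up to rounding, which is the sense in which the columns added correctly.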

#2 Updated by Mark Abraham almost 4 years ago

  • Category set to mdrun
  • Status changed from Accepted to In Progress

#3 Updated by Mark Abraham almost 4 years ago

  • Status changed from In Progress to Fix uploaded

I have not been able to reproduce any issues with OMP_NUM_THREADS such as those Mikhail reported. The reported thread counts and timings look correct to me.

https://gerrit.gromacs.org/2352

#4 Updated by Berk Hess almost 4 years ago

  • Status changed from Fix uploaded to In Progress

The timing reports are separate, but written to the same table, which can be confusing.
I have a patch in progress that writes two separate tables when separate PME ranks are used.
I guess that with OMP_NUM_THREADS Mikhail actually means having different thread counts on different ranks.

#5 Updated by Mark Abraham almost 4 years ago

Moved discussion here from https://gerrit.gromacs.org/2412

There, Berk and I were discussing the merits of starting iteration timers in sync or not, and how to interpret the numbers that result.

I don't feel synchronization is a goal we should have. MPMDism and workload overlap are going to get more complex. Already the first PP iteration starts later than the first PME iteration, because the former has more setup to do after the latter gets its first data. That only affects the timing of the first iteration, which is no big deal.

I think one of the things that has to break as we move towards more task parallelism is the notion of homogeneity in the role of a set of processing units. I think the main iteration loop should look the same on each process, and the data it has should determine what work it should do. That work can get billed to different accounts so we know where things stand, sure.

In this light, the proposed master-branch split of the wallcycle-reporting table into PP and PME sections seems like a backward step. Why shouldn't the table look the same for npme=0 and npme>0? Right now, in choosing which npme value to use, the important thing is the total time. Tweaking the settings will lead to changes in half a dozen of the table rows. Those changes provide useful feedback, sure. I do not see what is so sacred about the walltime that we should scale the raw counts but not the reported times. They're both scaled from the same value! Knowing the PME mesh time relative to the time spent by some number of PME-only nodes is no more useful than knowing the PME mesh time relative to the time spent by all nodes. And it is less useful for evaluating the question people actually face at that moment, namely whether to use MPMD, because the timings change scale when you do so.
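
For a concrete example with the numbers from the table in comment #1 (using the same node-seconds reading as in the check there), the same PME mesh work gives very different-looking percentages depending on what it is scaled against, which is why comparisons across npme settings break down:

 PME mesh relative to all 64 nodes:          2360.9 / 11665.5 ≈ 20.2 %
 PME mesh relative to the 16 PME-only nodes: 2360.9 /  2916.4 ≈ 80.9 %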

#6 Updated by Mark Abraham almost 4 years ago

  • Target version changed from 4.6.3 to 4.6.x

#7 Updated by Szilárd Páll over 3 years ago

Mark Abraham wrote:

> I don't feel synchronization is a goal we should have. MPMDism and workload overlap are going to get more complex. Already the first PP iteration starts later than the first PME iteration, because the former has more setup to do after the latter gets its first data. That only affects the timing of the first iteration, which is no big deal.

I agree, but I think we should not change the synchronous timing behaviour of GROMACS in a patch release. Hence, any such changes should go into 5.0.

> I think one of the things that has to break as we move towards more task parallelism is the notion of homogeneity in the role of a set of processing units. I think the main iteration loop should look the same on each process, and the data it has should determine what work it should do. That work can get billed to different accounts so we know where things stand, sure.

> In this light, the proposed master-branch split of the wallcycle-reporting table into PP and PME sections seems like a backward step. Why shouldn't the table look the same for npme=0 and npme>0? Right now, in choosing which npme value to use, the important thing is the total time. Tweaking the settings will lead to changes in half a dozen of the table rows. Those changes provide useful feedback, sure. I do not see what is so sacred about the walltime that we should scale the raw counts but not the reported times. They're both scaled from the same value! Knowing the PME mesh time relative to the time spent by some number of PME-only nodes is no more useful than knowing the PME mesh time relative to the time spent by all nodes. And it is less useful for evaluating the question people actually face at that moment, namely whether to use MPMD, because the timings change scale when you do so.

My impression is that a much more bottom-up redesign of the cycle timing will be required for full task parallelization. In particular, not only does MPMD need to be considered, but tasks (dispatchers?) will need to record their own timing, and the thread/task manager will have to do the accounting per task and per rank.
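
As one possible shape for that idea, here is a purely illustrative C sketch, with invented task names and structures that do not exist in the code: each task bills its own cycles to its rank's accounting, and a manager aggregates per task type and per rank.

/* Purely illustrative sketch of per-task cycle accounting; all names invented. */
#include <stdio.h>

enum { TASK_NB, TASK_PME, TASK_BONDED, TASK_NTYPES };

typedef struct
{
    unsigned long long cycles[TASK_NTYPES]; /* cycles billed per task type */
} rank_accounting;

/* Each task bills the cycles it used to its own counter on its rank. */
static void task_add_cycles(rank_accounting *acct, int task_type,
                            unsigned long long cycles)
{
    acct->cycles[task_type] += cycles;
}

/* The manager does the accounting per task type and per rank. */
static void report(const rank_accounting *ranks, int nranks)
{
    static const char *task_names[TASK_NTYPES] = { "Nonbonded", "PME", "Bonded" };
    int r, t;

    for (t = 0; t < TASK_NTYPES; t++)
    {
        unsigned long long total = 0;
        for (r = 0; r < nranks; r++)
        {
            total += ranks[r].cycles[t];
        }
        printf("%-10s total %llu cycles\n", task_names[t], total);
    }
    for (r = 0; r < nranks; r++)
    {
        unsigned long long total = 0;
        for (t = 0; t < TASK_NTYPES; t++)
        {
            total += ranks[r].cycles[t];
        }
        printf("rank %d     total %llu cycles\n", r, total);
    }
}

int main(void)
{
    rank_accounting ranks[2] = { { { 0 } }, { { 0 } } };

    task_add_cycles(&ranks[0], TASK_NB,  1000);
    task_add_cycles(&ranks[0], TASK_PME,  400);
    task_add_cycles(&ranks[1], TASK_PME, 1200);
    report(ranks, 2);
    return 0;
}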

#8 Updated by Gerrit Code Review Bot about 3 years ago

Gerrit received a related patchset '1' for Issue #1188.
Uploader: Mark Abraham
Change-Id: Ic870d981bf0375189601bf8c9bc67bc5d6226497
Gerrit URL: https://gerrit.gromacs.org/3042

#9 Updated by Erik Lindahl almost 3 years ago

  • Tracker changed from Bug to Feature

#10 Updated by Berk Hess almost 3 years ago

  • Status changed from In Progress to Closed
  • Target version changed from 4.6.x to 5.0

The incorrect thread count was already fixed in 4.6.2.
The table has not been split, since that would make it more difficult to see what percentage of the resources was spent on which part of the calculation. But in 5.0 the entries for separate PME nodes are marked with an asterisk, and a note on the total time has been added.
