Task #2674

Improve domain decomposition for bilayer systems

Added by Kenneth Goossens about 1 year ago. Updated 12 months ago.

Status:
Accepted
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
-
Difficulty:
uncategorized

Description

Hi,

During my umbrella sampling simulations on a ligand-enzyme complex in a lipid bilayer, I encountered a consistently bad domain decomposition, causing a huge drop in performance. In roughly 1 in 10 runs the domain decomposition happens smoothly when performing the simulations on 12 nodes/336 cores; in the remaining runs I get a decomposition in which the cells are very disproportionate and a lot of time is lost on PP ranks waiting for each other to finish. I have tried solving this by using the no, yes and auto settings for -dlb, to no avail. Furthermore, the problem is already present in my system at lower parallelization levels as well (<5 nodes/140 cores for 155,000 particles).
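For reference, the runs were launched roughly along these lines (the launcher, binary name and -deffnm value here are placeholders rather than the actual job script; only the -dlb setting was varied between attempts):

    mpirun -np 336 gmx_mpi mdrun -deffnm umbrella2 -dlb auto
    mpirun -np 336 gmx_mpi mdrun -deffnm umbrella2 -dlb yes
    mpirun -np 336 gmx_mpi mdrun -deffnm umbrella2 -dlb no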

An important note: this issue seems to be amplified by the harmonic restraints used in umbrella sampling. When these are used, most of the time is spent on calculating the COM pull force and constraints, according to the log files. Attached:
  • Two subsequent umbrella frames using exactly the same mdp file and mdrun options, with respective performances of 80 and 30 ns/day (umbrella2.log, umbrella3.log; I have removed some of the step logs because the files were too big)
  • The system at the start of both runs (umbrellacont2_2.gro, umbrellacont3_2.gro)
  • The mdp file (umbrella_acc.mdp)
  • The tpr files

umbrella2.log (3.47 MB) - Kenneth Goossens, 10/05/2018 03:34 PM
umbrella3.log (3.61 MB) - Kenneth Goossens, 10/05/2018 03:34 PM
umbrellacont2_2.gro (10.2 MB) - Kenneth Goossens, 10/05/2018 03:37 PM
umbrellacont3_2.gro (10.2 MB) - Kenneth Goossens, 10/05/2018 03:37 PM
umbrellacont2_3.tpr (4.83 MB) - Kenneth Goossens, 10/05/2018 03:38 PM
umbrellacont3_3.tpr (4.83 MB) - Kenneth Goossens, 10/05/2018 03:38 PM
umbrella_acc.mdp (2.81 KB) - Kenneth Goossens, 10/05/2018 03:38 PM

History

#1 Updated by Berk Hess about 1 year ago

  • Status changed from New to Rejected

This is not a bug, but rather a question for the gmx-users list.

The short answer is that you are running at the scaling limit, so performance can't get much better, especially when using COM pulling, which requires global communication at every step. Using 2 or 4 OpenMP threads and 2 or 4 times fewer MPI ranks might help performance. Using fewer nodes might also help, especially for efficiency.
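As a rough sketch of what that could look like on 12 nodes/336 cores (the launcher, binary name and input name are placeholders, not taken from the actual setup):

    # 168 MPI ranks x 2 OpenMP threads each = 336 cores
    mpirun -np 168 gmx_mpi mdrun -ntomp 2 -deffnm umbrella2
    # 84 MPI ranks x 4 OpenMP threads each = 336 cores
    mpirun -np 84 gmx_mpi mdrun -ntomp 4 -deffnm umbrella2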

#2 Updated by Kenneth Goossens about 1 year ago

Berk Hess wrote:

This is not a bug, but rather a question for the gmx-users list.

The short answer is that you are running at the scaling limit, so performance can't get much better, especially when using COM pulling, which requires global communication at every step. Using 2 or 4 OpenMP threads and 2 or 4 times fewer MPI ranks might help performance. Using fewer nodes might also help, especially for efficiency.

I'm sorry, I was asked on the mailing list to file a report here (https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2018-September/122453.html). To comment on your suggestion: I have done tests with this setup using more OpenMP threads and fewer MPI ranks, and performance correlated inversely (roughly linearly) with the number of OpenMP threads (however, this was on the system without COM pulling). Also, I thought the scaling limit was around 1000 atoms per core, and this system is running at about 1000 atoms per core; does the COM pulling have such a significant effect on this? The problem also seems to persist when I am clearly using fewer cores than the scaling limit would suggest (i.e. 3-4 nodes). If this is the wrong place to discuss this, I apologize!

#3 Updated by Berk Hess about 1 year ago

  • Tracker changed from Bug to Task
  • Subject changed from Bad domain decomposition for bilayer systems to Improve domain decomposition for bilayer systems
  • Status changed from Rejected to Accepted
  • Affected version deleted (2018.3)

Sorry, I didn't know someone had suggested filing a Redmine issue. I have changed it from a bug to a task.

With 308 PP cores you seem to be running at 500 atoms per core, which is close to the scaling limit.
The COM pulling adds global communication, so that (currently) always has a large impact on performance and scaling. We would like to make that non-blocking, which would improve things a lot.

If Szilárd thinks the DLB behaves sub-optimally, we should check that.

#4 Updated by Berk Hess about 1 year ago

PS I would guess you would get much better performance by using -dd to specify more domains along x and y and fewer along z.
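For example, something along these lines (the launcher, binary name and file names are placeholders; the -dd grid has to multiply to the number of PP ranks, here assumed to be 336 ranks with 28 separate PME ranks, i.e. 308 PP ranks):

    # 11 x 7 x 4 domain grid: more cells along x and y, fewer along z
    mpirun -np 336 gmx_mpi mdrun -npme 28 -dd 11 7 4 -deffnm umbrella2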

#5 Updated by Szilárd Páll about 1 year ago

I suggested filing a report because, having looked at the logs and the code, this seems to be the same issue I thought we had tackled in the past.

The simulation is running a bit below the usual scaling limit (with PP dominating the load given the long cut-off), and DLB fails both to balance the load and to back off and stop balancing. As a result, based on the limited data shown here, it seems that when DLB gets trapped in the "on" state, performance can be a lot lower than otherwise. For that reason, I do think there is something to improve, I'm just not sure yet what.

------------------------------------------------------------------------------------------------------------------
A few more details -- these had been stuck in my drafts for a few days:
What happens is that DLB kicks in and doesn't seem to reduce imbalance much, then it gets turned off. The second time this happens, DLB doesn't get turned off and it ends up trapped in the state we've seen before: it is limited (here in Z) but keeps scaling in the other dimensions with no benefit, going from around a 0.75 domain volume ratio to ~0.17. Later the performance degradation gets registered and DLB is switched off. However, this keeps repeating itself in both of the >40 h runs. The big difference between the two is that the former (umbrella2.log) manages to lock DLB off and spends only ~6% of the runtime in the "bad" state, whereas the other run (umbrella3.log) gets stuck multiple times, finishes with 87% of the run spent with over-scaled domains, and ends up almost 3x slower.

Having looked at the code, I suspect that when the slowdown is not detected early, it is because DLB is decreasing performance relatively slowly.
What's not clear to me is whether it's expected that the performance recovery after turning DLB off is also slow enough that we are not able to catch the anomaly. I think we might have to make the tolerances tighter. First I thought that the longer nstlist phases introduced recently might contribute, but if anything, these should just improve things by reducing noise (especially in the non-averaged post-DLB-off measurement).

#6 Updated by Szilárd Páll 12 months ago

Berk, in the light of our findings, do you still think this is not a "task" with "future" target?
