Bug #1513

Gromacs doesn't work with MPI when hostnames do not contain a numeric ID

Added by Vedran Miletic over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Category: mdrun
Difficulty: uncategorized

Description

Mark Abraham wrote at #1135:

There's also the silent assumption that gmx_hostname_num() can only work if people name their nodes "prefix.[0-9]+" and I have no idea how general the validity of that assumption is :-)

Running GROMACS on machines with generic hostnames that do not fit the pattern mentioned above results in something along the lines of:

Using 5 MPI processes
Using 1 OpenMP thread per MPI process

1 GPU detected on host akston:
#0: NVIDIA GeForce GTX 660, compute cap.: 3.0, ECC: no, stat: compatible

1 GPU user-selected for this run.
Mapping of GPU ID to the 5 PP ranks in this node: 0

-------------------------------------------------------
Program gmx mdrun, VERSION 5.1-dev-20140527-eb2cc07
Source code file:
/home/vedranm/workspace/gromacs/src/gromacs/gmxlib/gmx_detect_hardware.cpp,
line: 404

Fatal error:
Incorrect launch configuration: mismatching number of PP MPI processes
and GPUs per node.
mdrun_mpi was started with 5 PP MPI processes per node, but you provided 1 GPU.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

A hash of the hostname (instead of the numeric ID extracted by gmx_hostname_num()) could be used for splitting MPI_COMM_WORLD into groups of ranks per node.
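A minimal sketch of that idea (illustration only, not GROMACS code; the djb2-style hash merely stands in for whatever gmx_physicalnode_id_hash() actually computes, and collisions between distinct hostnames are ignored):

#include <stdio.h>
#include <unistd.h>   /* gethostname() */
#include <mpi.h>

/* Placeholder string hash (djb2); the real gmx_physicalnode_id_hash()
 * may use a different function. */
static int hostname_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
    {
        h = h*33 + (unsigned char)(*s++);
    }
    return (int)(h & 0x7fffffff); /* MPI_Comm_split needs a non-negative color */
}

int main(int argc, char **argv)
{
    int      rank, noderank;
    char     host[256];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* All ranks with the same hostname (hence the same hash) share a color
     * and end up in the same intra-node communicator; using the world rank
     * as the key keeps their relative order. This works for "akston" just
     * as well as for "node042". */
    MPI_Comm_split(MPI_COMM_WORLD, hostname_hash(host), rank, &node_comm);

    MPI_Comm_rank(node_comm, &noderank);
    printf("world rank %d is intra-node rank %d on %s\n", rank, noderank, host);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}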

Associated revisions

Revision d682f2f3 (diff)
Added by Berk Hess over 6 years ago

Replaced gmx_hostname_num by gmx_physicalnode_id_hash

gmx_hostname_num was only used to distinguish physical nodes.
Since it only worked for hostnames with numbers, we replaced it
by a hash of the hostname.

Fixes #1513

Change-Id: I8e60757707386f43269afe0bb38e8500decefcd6
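
To see why the replacement was needed, here is a simplified stand-in for the old scheme; it mimics only the spirit of gmx_hostname_num() (read a trailing numeric ID out of the hostname), not its exact code:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the old gmx_hostname_num():
 * take the trailing digits of the hostname as the node ID. */
static int hostname_num(const char *name)
{
    int len = (int)strlen(name);
    int end = len;

    while (end > 0 && isdigit((unsigned char)name[end - 1]))
    {
        end--;
    }
    /* "node042" -> 42, but "akston" has no digits, so every such node
     * gets the same bogus ID and the per-node rank counting goes wrong. */
    return (end < len) ? atoi(name + end) : 0;
}

int main(void)
{
    printf("node042 -> %d\n", hostname_num("node042")); /* 42 */
    printf("akston  -> %d\n", hostname_num("akston"));  /* 0  */
    return 0;
}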

History

#1 Updated by Vedran Miletic over 6 years ago

If someone could provide a high-level description of what should be done to fix this, I could try to do it over the weekend or next week. I have worked with MPI previously, and while this seems tricky, it doesn't seem too hard.

#2 Updated by Szilárd Páll over 6 years ago

  • Subject changed from Gromacs doesn't work with MPI when hostnames are not orderly named to Gromacs doesn't work with MPI when hostnames are not named with a
  • Status changed from New to Accepted

gmx_setup_nodecomm() and gmx_init_intranode_counters() (in src/gromacs/gmxlib/network.c) use gmx_hostname_num() to identify the individual compute-nodes. Here, the hashing implemented by gmx_physicalnode_id_hash() should be used instead.

The caveat is that (AFAICT) simply switching the node identification from ID-based to hash-based will change the MPI rank mapping. This in turn can hurt communication performance, as the implemented communication patterns implicitly rely on the node ordering. That's because, AFAIK, in typical (most?) cluster setups nodes are topologically "close" when their extracted node IDs are close.
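A toy example of that caveat (the hash below is only a stand-in, not the real gmx_physicalnode_id_hash()): for orderly named nodes the extracted numeric IDs follow the naming (and typically the topology) order, while hashes of the same names come out in no particular order, so any node ordering derived from the identifier changes:

#include <stdio.h>
#include <stdlib.h>

/* Stand-in hash; the real gmx_physicalnode_id_hash() may differ. */
static int node_hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
    {
        h = h*33 + (unsigned char)(*s++);
    }
    return (int)(h & 0x7fffffff);
}

int main(void)
{
    const char *nodes[] = { "node001", "node002", "node003", "node004" };
    int         i;

    for (i = 0; i < 4; i++)
    {
        /* atoi(nodes[i] + 4) reads the trailing digits: the IDs 1..4 mirror
         * the naming order, whereas the hash values are effectively random. */
        printf("%s: numeric ID %d, hash %d\n",
               nodes[i], atoi(nodes[i] + 4), node_hash(nodes[i]));
    }
    return 0;
}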

#3 Updated by Szilárd Páll over 6 years ago

  • Subject changed from Gromacs doesn't work with MPI when hostnames are not named with a to Gromacs doesn't work with MPI when hostnames do not contain a numeric ID

#4 Updated by Vedran Miletic over 6 years ago

Szilárd Páll wrote:

gmx_setup_nodecomm() and gmx_init_intranode_counters() (in src/gromacs/gmxlib/network.c) use gmx_hostname_num() to identify the individual compute-nodes. Here, the hashing implemented by gmx_physicalnode_id_hash() should be used instead.

A very simple attempt at this seems to run normally: https://gerrit.gromacs.org/#/c/3528/

#5 Updated by Szilárd Páll over 6 years ago

The commit message should be prefixed with WIP or RFC because that implementation will certainly not get merged. The commit text should also ref this issue.

#6 Updated by Vedran Miletic over 6 years ago

Szilárd Páll wrote:

The commit message should be prefixed with WIP or RFC because that implementation will certainly not get merged. The commit text should also ref this issue.

Done, thanks for the suggestion.

#7 Updated by Szilárd Páll over 6 years ago

Re gerrit 3528:
I do not necessarily agree that we should never be smart about rank placement. Leaving it to the user ensures that a better rank placement will almost never happen.

For instance, if you have four physical nodes and the PME load estimate gives about 25%, it wouldn't be stupid to place all PME ranks on a single node and take the probably small hit in the PP-PME communication (wrt an ideal rank placement with interleaved PP-PME). I guess such a "compact" PME rank placement may even be advantageous at higher node counts too, but that will depend on how much of a contention bottleneck the inter-node PME communication becomes.

Secondly, when using multisim with accelerators (for now only GPUs), unless replicas don't do DD or there is only a single GPU per rank, most of the time it will be more efficient to interleave PP ranks of different replicas.

#8 Updated by Mark Abraham over 6 years ago

Szilárd Páll wrote:

Re gerrit 3528:
I do not necessarily agree that we should never be smart about rank placement. Leaving it to the user ensures that a better rank placement will almost never happen.

I mean a different thing by rank re-ordering (http://www.nersc.gov/users/computational-systems/hopper/performance-and-optimization/reordering-mpi-ranks/) than you mean by rank placement. We already do rank placement (-ddorder interleave, pp_pme, cartesian). Re-ordering within mdrun would be taking the ranking the MPI system uses, and mapping that somehow, before placing our tasks on the result. We should not do re-ordering because we have placement and mpirun can do re-ordering.

For instance, if you have four physical nodes and the PME load estimate gives about 25%, it wouldn't be stupid to place all PME ranks on a single node and take the probably small hit in the PP-PME communication (wrt an ideal rank placement with interleaved PP-PME). I guess such a "compact" PME rank placement may even be advantageous at higher node counts too, but that will depend on how much of a contention bottleneck the inter-node PME communication becomes.

This is what -ddorder pp_pme does, if the mpirun ordering is the usual increasing scheme. For the record, I've never observed pp_pme to be the best choice, but the quality of the intra-node MPI implementation is probably a relevant issue.

Secondly, when using multisim with accelerators (for now only GPUs), unless replicas don't do DD or there is only a single GPU per rank, most of the time it will be more efficient to interleave PP ranks of different replicas.

Yes, there are efficiency gains from interleaving in some cases (e.g. 2 PP ranks per replica, 2 replicas per node, 2 GPUs per node), but I don't think that would ever beat using 1 PP rank per replica and thus 1 GPU (i.e. no DD). In what case is interleaving better than both non-interleaving and the alternatives?

If interleaving moves PP ranks to another node, then I could see that latency hit being worse than keeping a whole replica within a node (e.g. 2 PP ranks per replica, 1 replica per node, 2 GPUs per node). It seems like the default should be to interleave if the multi-simulation has more than one replica per node and more than one PP rank per replica, and strictly increasing otherwise. If we were to cater to that, I would implement it with communicator splitting, rather than with what I understand by "reordering."
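
Purely to illustrate the communicator-splitting idea mentioned above (hypothetical code, not how mdrun's multi-simulation setup actually assigns replicas): with mpirun's usual block rank-to-node mapping, choosing the replica as rank % nsim interleaves the PP ranks of different replicas across each node, whereas rank / ranks_per_sim keeps each replica on a contiguous block of ranks:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Hypothetical example: 2 replicas; run with a rank count that is a
     * multiple of nsim, e.g. mpirun -np 8 spread over 2 nodes. */
    const int nsim = 2;
    int       rank, nranks;
    MPI_Comm  sim_comm_block, sim_comm_interleaved;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int ranks_per_sim = nranks / nsim;

    /* Block assignment: replica 0 gets ranks 0..ranks_per_sim-1, and so on.
     * With mpirun's usual node-filling order, each replica then occupies
     * its own (set of) node(s). */
    MPI_Comm_split(MPI_COMM_WORLD, rank / ranks_per_sim, rank, &sim_comm_block);

    /* Interleaved assignment: consecutive world ranks belong to different
     * replicas, so every node hosts PP ranks of several replicas and the
     * GPUs on that node can be shared between them. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % nsim, rank, &sim_comm_interleaved);

    printf("world rank %d: replica %d (block) vs replica %d (interleaved)\n",
           rank, rank / ranks_per_sim, rank % nsim);

    MPI_Comm_free(&sim_comm_block);
    MPI_Comm_free(&sim_comm_interleaved);
    MPI_Finalize();
    return 0;
}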

#9 Updated by Berk Hess over 6 years ago

  • Status changed from Accepted to Fix uploaded

#10 Updated by Berk Hess over 6 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#11 Updated by Erik Lindahl over 6 years ago

  • Status changed from Resolved to Closed

#12 Updated by Vedran Miletic over 6 years ago

Berk Hess wrote:

Applied in changeset d682f2f3e31f5b4074b06b4393b59fd7b4bdb2b0.

Thanks!
