Task #2124

PME GPU user interface suggestions

Added by Aleksei Iupinov over 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: High
Category: mdrun
Target version:
Difficulty: uncategorized

Description

The assumptions are:

(a1) No PME GPU domain decomposition (at most 1 PME-only rank for PME on GPU); this should go away at some point.
(a2) Any rank uses at most 1 GPU (possibly for multiple tasks); it remains to be seen whether multiple GPUs per rank is sensible performance-wise, as cudaSetDevice() would have to be called before any asynchronous CUDA operation.
The current NB/PME code does not check whether its CUDA context is the current one, so it likely will not work correctly with multiple CUDA GPUs per rank.

Here are my suggestions for modifying the mdrun command line arguments:

1. "-pme" option is added, mostly mirroring the "-nb" option. Possible values are:
- "cpu" - retains the current PME CPU behaviour;
- "gpu" - triggers the new PME GPU behaviour. Will cause a program termination with a corresponding message if any of the current PME GPU limitations are encountered:
----- not a CUDA build;
----- double-precision build;
----- PME order is not 4;
----- multiple PME grids requested (LJ/LB/free energy);
----- multiple PME ranks requested explicitly ("-npme" > 1) (a1).
- "auto" (the default behaviour) - will check if running PME on GPU is possible (according to the limitations listed above) and reasonable (see point 4). If the check passes, PME runs on GPU; otherwise, PME runs on CPU (printing a note).
2. "-npme" (number of PME-only ranks) handling:
- "-pme cpu": doesn't change;
- "-pme gpu": will generate an error for values > 1 (a1), is otherwise respected;
- "-pme auto": will force PME to run on CPU for values > 1 (printing a note) (a1), is otherwise respected.
By 'respected' I mean:
- a value of 1 causes PME to run on a separate rank, preferably using a separate GPU if one is available, and sharing a GPU with a PP rank otherwise.
- a value of 0 causes one of the ranks to perform PME and NB on the same GPU. This would likely result in sub-par performance though.
- an unspecified argument/-1 (the default) is treated as a value of 1.
3. "-gpu_id" (explicit GPU IDs to use) - as long as (a2) stands, the only change is treating it not as 1-to-1 sequence of GPUs for PP ranks, but as 1-to-1 sequence of GPUs for all GPU-using ranks (PP/PME/PP+PME).
If there is a PME-only GPU rank, then the string gets 1 digit longer. The most simple for users would be to add that ID at the end of the string (encoding the current assumption that it's always the last rank doing the PME GPU work).
In the future, if there are several parallel GPU tasks (a1) and/or multiple GPUs per rank (a2), this argument should encode not just sequence of ranks, but a certain sequence of tasks, too:
[GPU for PP on rank 0, GPU for PME on rank 0, GPU for PP on rank 1, GPU for PME on rank 1, ...]
4. "-tunepme" - should ideally attain the reasonable performance. A way to define "reasonable" here might be to time PME CPU/GPU runs, possibly for all GPUs available. This could take forever though. Thoughts?


Related issues

Related to GROMACS - Feature #2054: PME on GPU (Accepted)
Related to GROMACS - Feature #2354: develop configuration file support for control of task layout (New)

Associated revisions

Revision efaf51cc (diff)
Added by Mark Abraham over 2 years ago

Permit gmxtest.pl to test PME on GPUs

Introduce some temporary functionality to check whether mdrun is
capable of running PME on GPUs, which will avoid needing to do a lot
of Jenkins cross-verification (and humans triggering them). (We can
remove the check at a later time if we think we need to.) If the code
can run PME on GPUs, that can be tested in the same way as the GPU
code.

Following the example of the way we re-run tests so that we also test
CPU non-bonded kernels on hardware with GPUs, we arrange to test PME
on both CPU and GPU when testing of GPUs is required. A better harness
could perhaps detect whether GPUs are present (or were used by default
for PME), but the current implementation works for Jenkins, at least.

The continuation tests such as tip4p_continue now support re-running.

Accordingly, extracted reused functionality into
prepare_run_with_different_task_decomposition.

Refs #2124, #1587

Change-Id: I64c45c62f96d06f057a6f886727fc5852b84b9ba

History

#1 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2124.
Uploader: Mark Abraham ()
Change-Id: regressiontests~master~I64c45c62f96d06f057a6f886727fc5852b84b9ba
Gerrit URL: https://gerrit.gromacs.org/6502

#2 Updated by Aleksei Iupinov over 2 years ago

If I remember correctly, Berk's suggestion for selecting between PME CPU and GPU in "-pme=auto -tunepme" (default) mode was to have another tuning stage which just times PME CPU and GPU runs and chooses the fastest one.

1) I think this stage should also allow for changing PME grid size. I think this is not independent of the CPU/GPU degree of freedom.
2) The difficulty also comes with multiple GPUs available. Then we would have not 2, but at least (n_pp_ranks + 1) options for running PME. Thoughts?
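
As a rough illustration of the extra tuning stage suggested here (time PME on the CPU and on each candidate GPU, then keep the fastest), consider the sketch below; PmeCandidate, timePmeSteps and pickFastest are hypothetical names, and a real implementation would have to interleave this with the existing grid/cutoff tuning.

    #include <chrono>
    #include <functional>
    #include <vector>

    struct PmeCandidate
    {
        int    deviceId;        // -1 means "run PME on the CPU"
        double secondsPerStep;  // measured cost of one PME evaluation
    };

    // Time nSteps calls of a PME evaluation and return the average cost per step.
    double timePmeSteps(const std::function<void()>& runOnePmeStep, int nSteps)
    {
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < nSteps; ++i)
        {
            runOnePmeStep();
        }
        const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        return elapsed.count() / nSteps;
    }

    // Pick the cheapest candidate; with multiple GPUs this list would have
    // at least (n_pp_ranks + 1) entries, as noted above.
    PmeCandidate pickFastest(const std::vector<PmeCandidate>& candidates)
    {
        PmeCandidate best = candidates.front();
        for (const PmeCandidate& c : candidates)
        {
            if (c.secondsPerStep < best.secondsPerStep)
            {
                best = c;
            }
        }
        return best;
    }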

#3 Updated by Aleksei Iupinov over 2 years ago

#4 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '3' for Issue #2124.
Uploader: Aleksei Iupinov ()
Change-Id: regressiontests~master~Ie6f327d139b09b61eeb969655c18d9750c2bdb1d
Gerrit URL: https://gerrit.gromacs.org/6510

#5 Updated by Szilárd Páll over 2 years ago

Aleksei Iupinov wrote:

If I remember correctly, Berk's suggestion for selecting between PME CPU and GPU in "-pme=auto -tunepme" (default) mode was to have another tuning stage which just times PME CPU and GPU runs and chooses the fastest one.

1) I think this stage should also allow for changing PME grid size. I think this is not independent of the CPU/GPU degree of freedom.
2) The difficulty also comes with multiple GPUs available. Then we would have not 2, but at least (n_pp_ranks + 1) options for running PME. Thoughts?

I would not change the current semantics of the implementation: -tunepme means grid/cutoff scanning, and adding new features should be done later, if there happens to be time after everything has been merged into master and the needed testing has been done. IMO even eliminating the duplicate x copies is a higher priority than another auto-tuning stage, don't you agree?

To your second point, I suggest that for now everything should be simple and dumb.
mdrun -nb gpu -pme auto -ntmpi 3 -npme 1
would best always throw an error and tell the user to manually specify the mapping (e.g. -gpu_id 011). Alternatively, the simplest and still safe thing the above could do is to extend the semantics of versions <= 2016 and generate a GPU mapping only in the cases, and in the way, that previous versions did:
  • 012 iff exactly three devices are available
  • maybe also 000 iff exactly one device is available
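
A minimal sketch of the conservative default mapping described above: only auto-generate a "-gpu_id" string in the unambiguous cases (one device per GPU-using rank, or a single shared device), and otherwise require the user to specify it manually. defaultGpuIdString is a hypothetical name, not existing GROMACS code.

    #include <stdexcept>
    #include <string>

    // Returns a per-rank GPU ID string (one digit per GPU-using rank), or throws if
    // the mapping would be ambiguous and must be given explicitly, e.g. "-gpu_id 011".
    std::string defaultGpuIdString(int numGpuUsingRanks, int numDevicesAvailable)
    {
        std::string mapping;
        if (numDevicesAvailable == numGpuUsingRanks)
        {
            // One device per rank: "012" when three ranks see three devices.
            for (int rank = 0; rank < numGpuUsingRanks; ++rank)
            {
                mapping += std::to_string(rank);
            }
        }
        else if (numDevicesAvailable == 1)
        {
            // Everyone shares the single device: "000".
            mapping = std::string(numGpuUsingRanks, '0');
        }
        else
        {
            throw std::runtime_error("Ambiguous GPU assignment; please specify -gpu_id manually");
        }
        return mapping;
    }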

#6 Updated by Mark Abraham almost 2 years ago

Some more work will arrive for this.

#7 Updated by Mark Abraham almost 2 years ago

  • Status changed from New to Resolved

The 2018 PME interface is feature complete, but we plan to replace -gputasks with -tasks, and then maybe with a configuration file. Perhaps we should open a new issue for that.

#8 Updated by Mark Abraham almost 2 years ago

  • Related to Feature #2354: develop configuration file support for control of task layout added

#9 Updated by Mark Abraham almost 2 years ago

  • Status changed from Resolved to Closed
