Bug #2004

parallelism selection code needs work

Added by Mark Abraham over 4 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Low
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

[tcbs23 pinning (homedir)] $ gmx mdrun -nsteps 0 -quiet -ntomp 32

Running on 1 node with total 32 cores, 32 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: AMD
    Brand:  AMD Opteron(tm) Processor 6376                 
    SIMD instructions most likely to fit this hardware: AVX_128_FMA
    SIMD instructions selected at GROMACS compile time: AVX_128_FMA

  Hardware topology: Full, with devices
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX TITAN, compute cap.: 3.5, ECC:  no, stat: compatible
    #1: NVIDIA GeForce GTX TITAN, compute cap.: 3.5, ECC:  no, stat: compatible

Reading file topol.tpr, VERSION 4.6.4-dev-20131030-4d73877 (single precision)
Note: file tpx version 83, software tpx version 110
Changing nstlist from 10 to 40, rlist from 1.018 to 1.131

Overriding nsteps with value passed on the command line: 0 steps, 0 ps

Using 8 MPI threads
Using 32 OpenMP threads per tMPI thread

WARNING: Oversubscribing the available 32 logical CPU cores with 256 threads.
         This will cause considerable performance loss!
2 compatible GPUs are present, with IDs 0,1
2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 8 PP ranks in this node: 0,0,0,0,1,1,1,1

NOTE: Your choice of number of MPI ranks and amount of resources results in using 32 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank.

The following is better, but still not great.

[tcbs23 pinning (homedir)] $ export OMP_NUM_THREADS=32
[tcbs23 pinning (homedir)] $ gmx mdrun -nsteps 0 -quiet

Running on 1 node with total 32 cores, 32 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: AMD
    Brand:  AMD Opteron(tm) Processor 6376                 
    SIMD instructions most likely to fit this hardware: AVX_128_FMA
    SIMD instructions selected at GROMACS compile time: AVX_128_FMA

  Hardware topology: Full, with devices
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX TITAN, compute cap.: 3.5, ECC:  no, stat: compatible
    #1: NVIDIA GeForce GTX TITAN, compute cap.: 3.5, ECC:  no, stat: compatible

Reading file topol.tpr, VERSION 4.6.4-dev-20131030-4d73877 (single precision)
Note: file tpx version 83, software tpx version 110
Changing nstlist from 10 to 40, rlist from 1.018 to 1.131

The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 32

Overriding nsteps with value passed on the command line: 0 steps, 0 ps

Using 8 MPI threads
Using 32 OpenMP threads per tMPI thread

WARNING: Oversubscribing the available 32 logical CPU cores with 256 threads.
         This will cause considerable performance loss!
2 compatible GPUs are present, with IDs 0,1
2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 8 PP ranks in this node: 0,0,0,0,1,1,1,1

-------------------------------------------------------
Program:     gmx mdrun, version 2016-beta2-dev-20160708-49b6a93
Source file: src/programs/mdrun/resource-division.cpp (line 555)
MPI rank:    6 (out of 8)

Fatal error:
Your choice of number of MPI ranks and amount of resources results in using 32
OpenMP threads per rank, which is most likely inefficient. The optimum is
usually between 2 and 6 threads per rank. If you want to run with this setup,
specify the -ntomp option. But we suggest to change the number of MPI ranks
(option -ntmpi).

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Doubtless the point could be made that mdrun -ntomp is not how we want users to do things, but we made it possible. The code that chooses the number of thread-MPI ranks needs either to observe the requested number of OpenMP threads and choose the number of ranks accordingly, or to refuse to choose a number of ranks at all. (mdrun -ntomp 2 again starts 8 ranks, then respects two OpenMP threads per rank, and then refuses to pin because we're undersubscribed.)
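
To make that rule concrete, here is a minimal standalone sketch (not the actual resource-division.cpp logic; all names such as chooseNumThreadMpiRanks and ResourceRequest are hypothetical) of rank selection that observes a user-fixed OpenMP thread count instead of choosing the rank count independently:

// Sketch only, with hypothetical names: derive the thread-MPI rank count from a
// user-fixed OpenMP thread count instead of choosing the ranks independently.
#include <algorithm>
#include <stdexcept>

struct ResourceRequest
{
    int numRanksRequested;   // -ntmpi (0 = automatic)
    int numOmpRequested;     // -ntomp or OMP_NUM_THREADS (0 = automatic)
    int numLogicalCores;     // from hardware detection
    int numCompatibleGpus;   // from GPU detection
};

int chooseNumThreadMpiRanks(const ResourceRequest& r)
{
    if (r.numRanksRequested > 0)
    {
        return r.numRanksRequested;   // the user decided; respect it
    }
    if (r.numOmpRequested > 0)
    {
        // The OpenMP thread count is fixed, so the rank count must follow from
        // the core count; never oversubscribe (the report above started
        // 8 ranks x 32 threads on 32 logical cores).
        const int ranks = r.numLogicalCores / r.numOmpRequested;
        if (ranks < 1)
        {
            throw std::runtime_error("requested more OpenMP threads than logical cores; "
                                     "reduce -ntomp or set -ntmpi explicitly");
        }
        return ranks;
    }
    // Nothing fixed by the user: pick ranks so that each rank gets a modest
    // number of OpenMP threads and there is at least one rank per GPU.
    const int targetOmpPerRank = 4;
    int ranks = std::max(1, r.numLogicalCores / targetOmpPerRank);
    ranks     = std::max(ranks, r.numCompatibleGpus);
    return ranks;
}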

We should also not say "Your choice of number of MPI ranks" when the user's choice was "give me default behaviour for choosing the number of ranks."


Related issues

Related to GROMACS - Bug #1338: incorrect automatic selection of MPI threads/DD with GPU (Closed)
Related to GROMACS - Bug #2067: mdrun ignores GPUs being requested if detection fails or is skipped (Closed)

History

#1 Updated by Mark Abraham over 4 years ago

  • Description updated (diff)

#2 Updated by Mark Abraham over 4 years ago

It would also be better to refer to "thread-MPI ranks" rather than "MPI ranks", but there is probably rather a lot of such wording that needs improvement with an eye to consistency.

#3 Updated by Mark Abraham over 4 years ago

  • Related to Bug #1338: incorrect automatic selection of MPI threads/DD with GPU added

#4 Updated by Mark Abraham over 4 years ago

Also, running spc216 on tcbs23:

NOTE: Parallelization is limited by the small number of atoms,
      only starting 1 thread-MPI ranks.
      You can use the -nt and/or -ntmpi option to optimize the number of threads.

Using 1 MPI thread
Using 16 OpenMP threads 

2 compatible GPUs are present, with IDs 0,1
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

NOTE: potentially sub-optimal launch configuration, gmx mdrun started with less
      PP thread-MPI thread than GPUs available.
      Each PP thread-MPI thread can use only one GPU, 1 GPU will be used.

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin thread to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).

NOTE: Thread affinity setting failed. This can cause performance degradation.
      If you think your settings are correct, ask on the gmx-users list.

starting mdrun 'Water'

The thread-affinity decision needs to be more aware of how the number of threads was chosen. Where mdrun itself has limited the number of threads (which can happen for several reasons), I think mdrun -pin auto should set thread affinities. Thus mdrun -ntmpi -1 -pin auto should always pin, but if the user set the number of ranks, GPUs or OpenMP threads by any means, then mdrun -pin auto should probably keep its current behaviour and not set affinities when not all cores are filled.

Obviously this is a corner case, but as nodes get fatter, a "small" simulation gets larger.
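
The pin-auto rule proposed above could be expressed roughly as follows; this is a minimal sketch with hypothetical names and inputs, not the actual affinity code:

// Sketch only: pin when mdrun itself chose/limited the thread count, but keep
// the current conservative behaviour when the user constrained the layout and
// not all cores are used.
struct ThreadCountDecision
{
    bool userSetRanks;        // -ntmpi / -nt given
    bool userSetOmpThreads;   // -ntomp or OMP_NUM_THREADS given
    bool userSetGpus;         // -gpu_id given
    int  totalThreadsStarted;
    int  numLogicalCores;
};

bool shouldPinWithPinAuto(const ThreadCountDecision& d)
{
    const bool allCoresFilled  = (d.totalThreadsStarted == d.numLogicalCores);
    const bool userConstrained = d.userSetRanks || d.userSetOmpThreads || d.userSetGpus;
    if (!userConstrained)
    {
        // mdrun chose (and possibly limited) the thread count itself,
        // e.g. because of a small number of atoms: safe to pin.
        return true;
    }
    // The user constrained the layout: only pin if the node is exactly filled,
    // so we do not fight other jobs sharing the node.
    return allCoresFilled;
}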

#5 Updated by Mark Abraham over 4 years ago

As another example, the rnase_dodec_vsites benchmark can't scale past 3x3x2 = 18 ranks, so starting it with thread-MPI on a node with 24 cores fails to find a suitable domain decomposition. We should extract the code that deduces the cell-size limit from the DD setup code and use that as input to the selection of the number of ranks. In this case it might choose 6 PME-only ranks (alongside the 18 PP ranks), or 12 ranks with 2 OpenMP threads each (particularly with Verlet+RF).
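
A minimal sketch of that idea, assuming a rectangular box and a hypothetical minimum-cell-size input (this is not the real DD setup code):

// Sketch only: bound the automatically chosen number of PP ranks by the DD
// cell-size limit, so thread-MPI does not start more ranks than the system
// can be decomposed into.
#include <algorithm>
#include <cmath>

struct Box
{
    double x, y, z;   // rectangular box lengths, nm
};

// Upper bound on the number of DD cells: each dimension can be cut into at
// most floor(boxLength / minCellSize) cells.
int maxDecomposableRanks(const Box& box, double minCellSize)
{
    const int nx = std::max(1, static_cast<int>(std::floor(box.x / minCellSize)));
    const int ny = std::max(1, static_cast<int>(std::floor(box.y / minCellSize)));
    const int nz = std::max(1, static_cast<int>(std::floor(box.z / minCellSize)));
    return nx * ny * nz;
}

// Cap the automatically chosen PP rank count; leftover cores can then go to
// OpenMP threads or PME-only ranks instead of causing a DD failure.
int capPpRanks(int ranksWanted, const Box& box, double minCellSize)
{
    return std::min(ranksWanted, maxDecomposableRanks(box, minCellSize));
}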

#6 Updated by Mark Abraham over 3 years ago

  • Related to Bug #2067: mdrun ignores GPUs being requested if detection fails or is skipped added

#7 Updated by Mark Abraham over 3 years ago

The intended behaviour of gmx mdrun -nt 1 -gpu_id 01 is currently unclear.

#8 Updated by Mark Abraham over 3 years ago

If the intent of -nt is to specify the total number of threads, then when -ntomp is also given the number of thread-MPI ranks has been implicitly specified, and that many ranks should be started (see the sketch after this list). However, on my desktop (2 GPUs, 4 IVB), a CUDA + thread-MPI build does the following:

  • gmx mdrun -nt 8 -ntomp 1 gives a fatal error ("The total number of threads requested (8) does not match the thread-MPI ranks (2) times the OpenMP threads (1) requested") after choosing 2 ranks (presumably based on the GPU detection); I think it should start 8 thread-MPI ranks
  • gmx mdrun -nt 4 -ntomp 1 fails similarly to the above; I think it should start 4 thread-MPI ranks
  • gmx mdrun -nt 4 -ntomp 2 runs, having chosen two ranks; this is fine, but seems fortuitous
  • gmx mdrun -nt 2 -ntomp 1 -pin on runs, having chosen two ranks; this is fine, but seems fortuitous
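
A minimal sketch of the behaviour argued for above (a hypothetical helper, not mdrun's actual option handling): when both -nt and -ntomp are given, the implied thread-MPI rank count is simply nt / ntomp, and only a non-divisible combination needs to be an error:

// Sketch only, hypothetical name: derive the implicit thread-MPI rank count
// when the user gives both -nt and -ntomp.
#include <stdexcept>

int impliedThreadMpiRanks(int ntTotal, int ntOmp)
{
    if (ntTotal <= 0 || ntOmp <= 0)
    {
        throw std::invalid_argument("both -nt and -ntomp must be set and positive here");
    }
    if (ntTotal % ntOmp != 0)
    {
        throw std::invalid_argument("-nt must be a multiple of -ntomp");
    }
    return ntTotal / ntOmp;   // e.g. -nt 8 -ntomp 1 -> 8 ranks; -nt 4 -ntomp 2 -> 2 ranks
}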

#9 Updated by Erik Lindahl almost 3 years ago

  • Status changed from New to Resolved

I think Mark's recent change to the hardware assignment code and some of our other cleanup have fixed this. I'll move it to Resolved now; feel free to add anything or reopen, but if there are no comments in a couple of days we can formally close it.

#10 Updated by Mark Abraham almost 3 years ago

I think it would be good to go through the list of issues I identified here and see whether anything remains, but I won't have time this week, and it probably only makes sense once a few more fixes (e.g. the affinity fixes) are in.

#11 Updated by Mark Abraham over 2 years ago

  • Status changed from Resolved to Closed
