Allow to force pinning
We try to be nice and don't pin if affinity is already set. In general this seems good, but sometimes it is very non-obvious why the affinity is non-default. An example is the QLogic InfiniBand OpenMPI back-end, whose developers thought it a good idea to set the affinity if no one else has set it. Thus, even if no OpenMPI affinity options are set, it still sets the affinity, and this can only be deactivated by an environment variable (IPATH_NO_CPUAFFINITY). Needless to say, it took me forever to find that. Adding an option "-pin force" (or renaming the current option to "auto" and adding a "yes"), which always sets the affinity even if it is already set, would help users in those cases.
I filed this as a bug because, even though I'm suggesting a new option, it is really about fixing a usability problem.
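A quick way to check for the kind of non-default affinity described above is to compare the CPUs the process is allowed to run on with the number of CPUs in the machine. The following is a minimal sketch, not mdrun code: the function names are hypothetical, and `os.sched_getaffinity` is only available on some platforms (notably Linux), so the call is guarded.

```python
import os


def affinity_is_restricted(allowed_cpus, online_cpu_count):
    """Return True when the process may run on fewer CPUs than are online."""
    return len(allowed_cpus) < online_cpu_count


def describe_affinity():
    """Summarize the affinity of the current process as a one-line string."""
    if not hasattr(os, "sched_getaffinity"):
        return "affinity query not supported on this platform"
    allowed = os.sched_getaffinity(0)  # 0 refers to the caller
    total = os.cpu_count() or len(allowed)
    state = "restricted" if affinity_is_restricted(allowed, total) else "default"
    # The QLogic back-end only leaves the affinity alone when this is set.
    ipath = os.environ.get("IPATH_NO_CPUAFFINITY", "<unset>")
    return f"{state}: {len(allowed)}/{total} CPUs allowed, IPATH_NO_CPUAFFINITY={ipath}"
```

Calling `print(describe_affinity())` from inside the job environment (for example under the same MPI launcher) shows whether something has already narrowed the mask before the application starts.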
Off topic: The QLogic documentation points out that ideally the affinity is set as early as possible so that the MPI library affinity is correct too. Not sure how we could improve that.
thread affinity now uses some topology information
The order of logical cores on x86 is hardware and software dependent.
The cpuid topology reports this and this information is now used.
The mdrun -pinht option is generalized for SMT to -pinstride.
The mdrun -pinoffset option is now in logical (instead of physical) cores.
Thread-MPI no longer sets affinity, it's now all done in one place.
The option -pin is now an enum, default auto: only on when using all
cores and when no external affinity has been set.
A big NOTE is printed with auto when no pinning is used.
Option -pin on can now override thread affinity set outside mdrun.
All thread affinity code has been moved from runner.c to
Updated the mdrun manual for pinning also active without OpenMP.
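The behaviour described in this commit message can be sketched as two small functions. This is an illustration under assumptions, not the mdrun implementation: the function names are hypothetical, the real code derives the core order from the cpuid topology, and the simple `offset + i * stride` mapping is only one plausible reading of how -pinoffset and -pinstride combine.

```python
def pin_targets(n_threads, stride=1, offset=0):
    """Logical core index for each thread, assuming offset + i * stride."""
    return [offset + i * stride for i in range(n_threads)]


def should_pin(mode, uses_all_cores, external_affinity_set):
    """Decide whether to pin for a -pin style enum: 'auto', 'on' or 'off'."""
    if mode == "off":
        return False
    if mode == "on":
        # 'on' may override an affinity that was set outside the program.
        return True
    if mode == "auto":
        # 'auto' pins only when the whole machine is used and nobody
        # else has already set an affinity.
        return uses_all_cores and not external_affinity_set
    raise ValueError(f"unknown pin mode: {mode!r}")
```

For example, `pin_targets(4, stride=2)` places four threads on every second logical core, and `should_pin("auto", True, True)` declines to pin because an external affinity is present.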
#1 Updated by Berk Hess about 7 years ago
- Status changed from New to In Progress
I actually wrote the same thing in a comment in my patch.
I suggested to make -pin on (try to) override any already present pinning. This might be as simple as skipping all affinity detection checks when -pin is set to on. But I didn't want to make even more changes just before the release.
#3 Updated by Roland Schulz about 7 years ago
No, it isn't fixed. In fact it is so bad that I thought it was hanging. One single call to construct_vsites takes 0.1s. So even for DHFR it doesn't get to step 0 in less than 10min. Is it possible to override processor affinity in general? Or is this something that cannot be fixed, because as soon as the QLogic driver has set the affinity, we cannot override it anymore?
#5 Updated by Szilárd Páll about 7 years ago
As far as I know it is/should always be possible to override the affinities. However, I know of one case
I think there are two reasonable solutions:
-pin on as a "soft" option that does not override the external affinity, plus a "hard" option e.g
-pin on as a "hard" option that does override externally set affinity, but issues a big warning when we are actually overriding.
The reason why I think we should either add a "force" option or a warning is that we sometimes default to pinning and other times to no pinning, but not pinning often causes considerable performance loss. Therefore, we advise people (note to self: this advice and its explanation are missing from the wiki) to pin manually*. This means that many people will (and should) probably always add
-pin on to their command line to avoid realizing the performance loss when it's too late. However, we should avoid encouraging people to always override external affinities, which will piss off sysadmins of machines where node-sharing is set up properly and the job scheduler sets the process affinities.
*Slightly off-topic: perhaps we should even add an environment variable or a cmake option that turns the default pinning on regardless of whether the full machine is used or not (it has already happened to me numerous times that I realized only after running a bunch of benchmarks that I should have set -pin on).
#11 Updated by Mark Abraham over 6 years ago
I'm in favor of the current default behaviour wrt affinities. It seems a reasonable compromise in the era of multicore-oblivious kernels.
I am happy to add
mdrun -pin force that will override external affinity settings, with a suitable warning that this is actually taking place (i.e. Szilard's option 2 from post 7).
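The "override, but warn" behaviour proposed here could look roughly like the sketch below. Everything in it is hypothetical: `force_pin` is not an mdrun function, and the getter and setter are passed in so that the logic can be exercised without touching the real affinity of the caller.

```python
import warnings


def force_pin(target_cpus, online_cpu_count, get_affinity, set_affinity):
    """Apply target_cpus; warn when that replaces an externally set mask.

    Returns True when an external affinity was overridden.
    """
    current = set(get_affinity())
    target = set(target_cpus)
    # A mask narrower than the machine is treated as "set by someone else".
    overriding = len(current) < online_cpu_count and current != target
    if overriding:
        warnings.warn(
            f"Overriding externally set affinity {sorted(current)} "
            f"with {sorted(target)}"
        )
    set_affinity(target)
    return overriding
```

On Linux the two callables could be `lambda: os.sched_getaffinity(0)` and `lambda cpus: os.sched_setaffinity(0, cpus)`. Whether a warning is enough is exactly what the thread debates, since it may go unnoticed in batch job output.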
#13 Updated by Szilárd Páll over 6 years ago
Actually, as I suspected back in January, things are not as simple as they seemed. Something does not work very well when trying to override affinities set through the OpenMP interface (GOMP_CPU_AFFINITY/KMP_AFFINITY). Initially I was testing on a small number of cores, and while there seemed to be a small (1-2%) performance difference, I thought it was just fluctuation. However, I've just done some more tests and measured huge performance degradation.
Running on 8-core Intel Sandy Bridge E, gcc 4.7.
mdrun -ntmpi 1 -ntomp 8 -pin force: 42.5 ns/day and
hwloc-ps -t
23858 PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
 23858 PU:0
 23863 PU:2
 23864 PU:4
 23865 PU:6
 23866 PU:8
 23867 PU:10
 23868 PU:12
 23869 PU:14
taskset 0x1 $mdrun -ntmpi 1 -ntomp 8 -pin force: 33.6 ns/day and
hwloc-ps -t
23793 PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
 23793 PU:0
 23798 PU:2
 23799 PU:4
 23800 PU:6
 23801 PU:8
 23802 PU:10
 23803 PU:12
 23804 PU:14
GOMP_CPU_AFFINITY=0 $mdrun -ntmpi 1 -ntomp 8 -pin force: 33.5 ns/day and
hwloc-ps -t
23962 PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
 23962 PU:0
 23966 PU:2
 23967 PU:4
 23968 PU:6
 23969 PU:8
 23970 PU:10
 23971 PU:12
 23972 PU:14
Obviously, the affinity patterns look OK (and are identical), so I can't really explain the performance difference with anything other than some memory/cache affinity issue or something else related to threads and OpenMP, because the same thing is not observable with tMPI-only (e.g.
mdrun -ntmpi 8 -ntomp 1).
I've reproduced the same behavior on AMD Piledriver and Magny-Cours as well.
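For cross-checking per-thread masks like the hwloc-ps listings above without hwloc, the kernel's own view can be read from /proc on Linux. The sketch below is only an illustration with hypothetical function names; it relies on the `Cpus_allowed_list` field of `/proc/<pid>/task/<tid>/status`, and the numbers it reports are OS CPU numbers, which need not match the PU indices printed by hwloc-ps.

```python
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def thread_affinities(pid):
    """Map each thread ID of pid to its allowed CPUs (Linux /proc only)."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        with open(f"{task_dir}/{tid}/status") as status:
            for line in status:
                if line.startswith("Cpus_allowed_list:"):
                    result[int(tid)] = parse_cpu_list(line.split(":", 1)[1])
    return result
```

Running `thread_affinities(pid)` against a live mdrun process for each of the three launch variants would confirm whether the masks really are identical at the kernel level, which is what the comparison above assumes.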
#14 Updated by Szilárd Páll over 6 years ago
Additionally, what's weird is that with
-ntmpi 4 -ntomp 2 I get bizarre affinity patterns reported by hwloc (slightly different when overriding and when not), but the affinity masks queried inside mdrun look fine.
mdrun -ntmpi 4 -ntomp 2 -pin force
Current Process 24432 rank 0 ->
 thread 0 -> 0, 1 CPU(s) in the mask
 thread 1 -> 1, 1 CPU(s) in the mask
Current Process 24432 rank 3 ->
 thread 0 -> 6, 1 CPU(s) in the mask
 thread 1 -> 7, 1 CPU(s) in the mask
Current Process 24432 rank 1 ->
 thread 0 -> 2, 1 CPU(s) in the mask
 thread 1 -> 3, 1 CPU(s) in the mask
Current Process 24432 rank 2 ->
 thread 0 -> 4, 1 CPU(s) in the mask
 thread 1 -> 5, 1 CPU(s) in the mask
hwloc-ps -t
24432 PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
 24432 PU:0
 24436 PU:4
 24437 PU:8
 24438 PU:12
 24439 PU:6
 24440 PU:2
 24441 PU:10
 24442 PU:14
GOMP_CPU_AFFINITY=0 mdrun -ntmpi 4 -ntomp 2 -pin force
Current Process 24353 rank 2 ->
 thread 0 -> 4, 1 CPU(s) in the mask
 thread 1 -> 5, 1 CPU(s) in the mask
Current Process 24353 rank 1 ->
 thread 0 -> 2, 1 CPU(s) in the mask
 thread 1 -> 3, 1 CPU(s) in the mask
Current Process 24353 rank 3 ->
 thread 0 -> 6, 1 CPU(s) in the mask
 thread 1 -> 7, 1 CPU(s) in the mask
Current Process 24353 rank 0 ->
 thread 0 -> 0, 1 CPU(s) in the mask
 thread 1 -> 1, 1 CPU(s) in the mask
$ hwloc-ps -t
24353 PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
 24353 PU:0
 24358 PU:4
 24359 PU:8
 24360 PU:12
 24361 PU:2
 24362 PU:10
 24363 PU:14
 24364 PU:6
#15 Updated by Szilárd Páll over 6 years ago
Szilárd Páll wrote:
Additionally, what's weird is that with
-ntmpi 4 -ntomp 2 I get bizarre affinity patterns reported by hwloc (slightly different when overriding and when not), but the affinity masks queried inside mdrun look fine.
It turns out that this is normal: OpenMP seems to create a thread pool but assigns the logical thread IDs in a different order.
#23 Updated by Roland Schulz over 5 years ago
Szilárd Páll wrote:
No, the above issues still persist.
Why issues (plural)? Note 15 says that the problem described in note 14 is normal. So, as I understand your notes, the only remaining issue is the one described in note 13. Is this correct?
Did you try to reproduce the issue described in note 13 with more than one compiler/OpenMP runtime? The way you describe it, it could be that some of the OpenMP internal storage is allocated on the wrong section of the NUMA memory.
#27 Updated by Erik Lindahl about 2 years ago
I had a discussion at Intel's HPCDevCon with an engineer who in his talk used Gromacs as a bad example of an application that went ahead and tried to set affinities in a situation where it shouldn't have, instead of simply obeying the affinities that have been set. Both Mark & I agreed this might have been some earlier version, a bug, or misunderstanding.
Given that many people submit things in batch scripts or other scenarios where the warnings might not be visible right away (not to mention we issue a lot of performance warnings, so they might not even take it that seriously), I think it's a bad idea to override existing affinities. They were set for a reason; that reason might have been really stupid, but it might also be something we have no idea about, such as nodes being shared.
We could note that there are affinities set, and that the user must disable those for the Gromacs auto-affinity-setting to do its job, but let's not assume existing affinities are stupid and ignore them.