Feature #1122

Allow to force pinning

Added by Roland Schulz almost 8 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
-
Difficulty:
uncategorized

Description

We try to be nice and don't pin if affinity is already set. This seems good in general, but sometimes it is very non-obvious why the affinity is non-default. An example is the QLogic InfiniBand OpenMPI back-end, whose developers thought it was a good idea to set the affinity if no one else has set it. Thus, even if no OpenMPI affinity options are set, it still sets the affinity. And it can only be deactivated by an environment variable (IPATH_NO_CPUAFFINITY). Needless to say, it took me forever to find that. Adding an option "-pin force" (or renaming the current option to "auto" and adding a "yes"), which always sets the affinity even if it is already set, would help users in those cases.
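For illustration only, the kind of check referred to by "affinity is already set" could look roughly like the following minimal Linux/glibc sketch; this is not the actual GROMACS code, just one way to detect that some external entity (MPI runtime, taskset, job scheduler, ...) has already restricted the process mask:

/* Minimal sketch, not GROMACS source: compare the current affinity mask
 * against the number of online CPUs. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <unistd.h>

static bool affinity_already_set(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
    {
        return false; /* cannot tell; assume the default (full) mask */
    }
    /* If fewer CPUs are enabled in the mask than are online, someone
     * (e.g. the QLogic/OpenMPI back-end) has already set the affinity. */
    return CPU_COUNT(&mask) < sysconf(_SC_NPROCESSORS_ONLN);
}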

I set it as a bug because, even though I'm suggesting adding an option, it is really about fixing a usability problem.

Off topic: The QLogic documentation points out that ideally the affinity should be set as early as possible, so that the MPI library's affinity is correct too. Not sure how we could improve that.


Related issues

Related to GROMACS - Bug #1633: mdrun -nsteps -1 reports silly numbers (Closed, 10/30/2014)

Associated revisions

Revision 4bebcd8f (diff)
Added by Berk Hess almost 8 years ago

thread affinity now uses some topology information

The order of logical cores on x86 is hardware and software dependent.
The cpuid topology reports this and this information is now used.
The mdrun -pinht option is generalized for SMT to -pinstride.
The mdrun -pinoffset option is now in logical (instead of physical) cores.
Thread-MPI no longer sets affinity, it's now all done in one place.
The option -pin is now an enum, default auto: only on when using all
cores and when no external affinity has been set.
A big NOTE is printed with auto when no pinning is used.
Option -pin on can now override thread affinity set outside mdrun.
Fixes #1122
All thread affinity code has been moved from runner.c to
gmx_thread_affinity.c.
Updated the mdrun manual for pinning also active without OpenMP.

Change-Id: Ibf0fe5882688de80c223640502c68e6170d4d044

History

#1 Updated by Berk Hess almost 8 years ago

  • Status changed from New to In Progress

I actually wrote the same thing in a comment in my patch.
I suggested making -pin on (try to) override any already-present pinning. This might be as simple as skipping all affinity detection checks when -pin is set to on. But I didn't want to make even more changes just before the release.
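A rough sketch of that control flow, with made-up names (the real option handling lives in runner.c / gmx_thread_affinity.c), might look like:

#include <stdbool.h>

/* Illustrative only: pin setting as an enum, with "on" bypassing detection. */
typedef enum { PIN_AUTO, PIN_ON, PIN_OFF } pin_setting_t;

static bool should_set_affinity(pin_setting_t pin,
                                bool          all_cores_used,
                                bool          external_affinity_detected)
{
    if (pin == PIN_OFF)
    {
        return false;
    }
    if (pin == PIN_ON)
    {
        /* Skip the detection checks entirely and (try to) override. */
        return true;
    }
    /* PIN_AUTO: only pin when mdrun uses the whole node and nothing else
     * has touched the affinity. */
    return all_cores_used && !external_affinity_detected;
}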

#2 Updated by Berk Hess over 7 years ago

Roland, did my commit allow you to force the affinity in your case?
We still need to check in which cases we can override OpenMP affinity.

#3 Updated by Roland Schulz over 7 years ago

No, it isn't fixed. In fact, it is so bad that I thought it was hanging. A single call to construct_vsites takes 0.1 s. So even for DHFR it doesn't get to step 0 in less than 10 min. Is it possible to override processor affinity in general? Or is this something we cannot fix, because as soon as the QLogic driver has set the affinity, we cannot override it anymore?

#4 Updated by Berk Hess over 7 years ago

  • Category set to mdrun
  • Assignee set to Szilárd Páll

This should be simple to fix, but I don't know which OpenMP affinity settings we can override.

#5 Updated by Szilárd Páll over 7 years ago

As far as I know, it is/should always be possible to override the affinities. However, I know of one case

I think there are two reasonable solutions:
  1. keep -pin on as a "soft" option and do not let it override the external affinity & add a "hard" option, e.g. -pin force;
  2. make -pin on a "hard" option that does override externally set affinity, but issues a big warning when we are actually overriding.

The reason why I think we should either add a "force" option or a warning is that we sometimes default to pinning and other times to no pinning, and not pinning often causes considerable performance loss. Therefore, we advise people to pin manually* (note to self: this advice and the explanation for it are missing from the wiki). This means that many people will (and should) probably always add -pin on to their command line to avoid realizing the performance loss when it's too late. However, we should avoid encouraging people to always override external affinities - which will piss off the sysadmins of machines where node-sharing is set up properly and the job scheduler does set the process affinities.

*Slightly off-topic: perhaps we should even add an environment variable or a CMake option that turns the default pinning on regardless of whether the full machine is used or not (it has already happened to me numerous times that I realized only after running a bunch of benchmarks that I should have set -pin on).
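A minimal sketch of the environment-variable idea; the variable name GMX_FORCE_PIN_DEFAULT is purely hypothetical and invented for this example, it does not exist in GROMACS:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static bool default_pinning_on(void)
{
    /* GMX_FORCE_PIN_DEFAULT is a made-up name for this sketch only:
     * any non-empty value other than "0" would flip the default to pinning. */
    const char *env = getenv("GMX_FORCE_PIN_DEFAULT");
    return env != NULL && env[0] != '\0' && strcmp(env, "0") != 0;
}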

#6 Updated by Mark Abraham over 7 years ago

  • Target version deleted (4.6.1)
  • Affected version set to 4.6

Is this resolved?

#7 Updated by Szilárd Páll over 7 years ago

No, it is not. None of my above suggestions has been implemented yet - mainly because I got no feedback.

#8 Updated by Mark Abraham over 7 years ago

  • Target version set to 4.6.x

#9 Updated by Szilárd Páll about 7 years ago

  • Tracker changed from Bug to Feature

#10 Updated by Szilárd Páll about 7 years ago

The more I think about this, the more it seems to be a feature that we have to approach with caution. I'm going to ask for feedback on the developers' list.

#11 Updated by Mark Abraham about 7 years ago

I'm in favor of the current default behaviour wrt affinities. It seems a reasonable compromise in the era of multicore-oblivious kernels.

I am happy to add

mdrun -pin force
that will override external affinity settings, with a suitable warning that this is actually taking place (i.e. Szilárd's option 2 from comment #5).

#12 Updated by Szilárd Páll about 7 years ago

  • Status changed from In Progress to Feedback wanted

#13 Updated by Szilárd Páll about 7 years ago

Actually, as I suspected back in January, things are not as simple as they seemed. Something does not work very well when trying to override affinities set through the OpenMP interface (GOMP_CPU_AFFINITY/KMP_AFFINITY). Initially I was testing on a small number of cores, and while there seemed to be a small (1-2%) performance difference, I thought it was just fluctuation. However, I've just done some more tests and measured huge performance degradation.

Running on 8-core Intel Sandy Bridge E, gcc 4.7.

  • mdrun -ntmpi 1 -ntomp 8 -pin force: 42.5 ns/day and
    hwloc-ps -t
    23858    PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
     23858    PU:0        
     23863    PU:2        
     23864    PU:4        
     23865    PU:6        
     23866    PU:8        
     23867    PU:10        
     23868    PU:12        
     23869    PU:14        
    
  • taskset 0x1 $mdrun -ntmpi 1 -ntomp 8 -pin force: 33.6 ns/day and
    hwloc-ps -t
    23793    PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
     23793    PU:0        
     23798    PU:2        
     23799    PU:4        
     23800    PU:6        
     23801    PU:8        
     23802    PU:10        
     23803    PU:12        
     23804    PU:14
    
  • GOMP_CPU_AFFINITY=0 $mdrun -ntmpi 1 -ntomp 8 -pin force: 33.5 ns/day and
    hwloc-ps -t
    23962    PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14 /.../mdrun
     23962    PU:0        
     23966    PU:2        
     23967    PU:4        
     23968    PU:6        
     23969    PU:8        
     23970    PU:10        
     23971    PU:12        
     23972    PU:14
    

Obviously, the affinity patterns look OK (and are identical), so I can't really explain the performance difference with anything other than some memory/cache affinity issue, or something else related to threads and OpenMP, because the same thing is not observable with tMPI only (e.g. mdrun -ntmpi 8 -ntomp 1).

I've reproduced the same behavior on AMD Piledriver and Magny-Cours as well.

#14 Updated by Szilárd Páll about 7 years ago

Additionally, what's weird is that with -ntmpi 4 -ntomp 2 I get bizarre affinity patterns reported by hwloc (slightly different when overriding and when not), but the affinity masks queried inside mdrun look fine.

mdrun -ntmpi 4 -ntomp 2 -pin force

Current Process 24432 rank  0 ->
    thread  0 -> 0,        1 CPU(s) in the mask
    thread  1 -> 1,        1 CPU(s) in the mask

Current Process 24432 rank  3 -> 
    thread  0 -> 6,        1 CPU(s) in the mask
    thread  1 -> 7,        1 CPU(s) in the mask

Current Process 24432 rank  1 -> 
    thread  0 -> 2,        1 CPU(s) in the mask
    thread  1 -> 3,        1 CPU(s) in the mask

Current Process 24432 rank  2 -> 
    thread  0 -> 4,        1 CPU(s) in the mask
    thread  1 -> 5,        1 CPU(s) in the mask
hwloc-ps -t
24432    PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14        /.../mdrun
 24432    PU:0        
 24436    PU:4        
 24437    PU:8        
 24438    PU:12        
 24439    PU:6        
 24440    PU:2        
 24441    PU:10        
 24442    PU:14

GOMP_CPU_AFFINITY=0 mdrun -ntmpi 4 -ntomp 2 -pin force

Current Process 24353 rank  2 ->
    thread  0 -> 4,        1 CPU(s) in the mask
    thread  1 -> 5,        1 CPU(s) in the mask

Current Process 24353 rank  1 -> 
    thread  0 -> 2,        1 CPU(s) in the mask
    thread  1 -> 3,        1 CPU(s) in the mask

Current Process 24353 rank  3 -> 
    thread  0 -> 6,        1 CPU(s) in the mask
    thread  1 -> 7,        1 CPU(s) in the mask

Current Process 24353 rank  0 -> 
    thread  0 -> 0,        1 CPU(s) in the mask
    thread  1 -> 1,        1 CPU(s) in the mask
$ hwloc-ps -t
24353    PU:0 PU:2 PU:4 PU:6 PU:8 PU:10 PU:12 PU:14        /.../mdrun
 24353    PU:0        
 24358    PU:4        
 24359    PU:8        
 24360    PU:12        
 24361    PU:2        
 24362    PU:10        
 24363    PU:14        
 24364    PU:6

#15 Updated by Szilárd Páll about 7 years ago

Szilárd Páll wrote:

Additionally, what's weird is that with -ntmpi 4 -ntomp 2 I get bizarre affinity patterns reported by hwloc (slightly different when overriding and when not), but the affinity masks queried inside mdrun look fine.

It turns out that this is normal; OpenMP seems to create a thread pool but assigns thread logical IDs in a different order.
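A small standalone sketch (not mdrun code) that makes this distinction visible: it prints each OpenMP thread's logical ID next to the CPU it currently runs on, which is the mapping that can differ from the OS-thread creation order that hwloc-ps shows. Compile with e.g. gcc -fopenmp:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
    {
        /* sched_getcpu() reports the hardware thread this OS thread is
         * currently running on; the OpenMP id below need not follow the
         * order in which the OS threads were created. */
        printf("OpenMP thread %2d of %2d -> CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}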

#16 Updated by Berk Hess almost 7 years ago

Could the last observation be caused by the bug reported in #1360 ?

#17 Updated by Szilárd Páll almost 7 years ago

It could. I'll rebase change 2633 and see if it makes a difference.

#18 Updated by Szilárd Páll almost 7 years ago

No, #1360 is not related, which is not surprising, as nothing indicated that multiple threads were pinned to the same hardware thread.

#19 Updated by Rossen Apostolov almost 7 years ago

Is this solved already?

#20 Updated by Szilárd Páll almost 7 years ago

No, the above issues still persist.

#21 Updated by Szilárd Páll over 6 years ago

  • Status changed from Feedback wanted to Blocked, need info

#22 Updated by Szilárd Páll over 6 years ago

  • Description updated (diff)

#23 Updated by Roland Schulz over 6 years ago

Szilárd Páll wrote:

No, the above issues still persist.

Why issues (plural)? Note 15 says that the problem described in note 14 is normal. Thus I understand from your notes that the only remaining issue is the one described in note 13. Is this correct?

Did you try the issue described in note 13 with more than one compiler/OpenMP runtime? The way you describe it, it could be that some of the OpenMP internal storage is allocated on the wrong NUMA node.
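For illustration of the first-touch placement effect this hypothesis refers to (a generic sketch, not code from GROMACS or any OpenMP runtime): on Linux, a page is placed on the NUMA node of the thread that first writes it, so buffers initialized by a single, possibly mis-pinned thread can end up far from the threads that later use them.

#include <stdlib.h>

/* Sketch: allocate and first-touch a buffer in parallel so each page lands
 * on the NUMA node of the thread that will later use it. Compile with
 * e.g. gcc -fopenmp. */
double *allocate_numa_friendly(size_t n)
{
    double *buf = malloc(n * sizeof(*buf));
    if (buf == NULL)
    {
        return NULL;
    }
    /* First-touch: initialize with the same static schedule the compute
     * loops will use. */
#pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
    {
        buf[i] = 0.0;
    }
    return buf;
}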

#24 Updated by Mark Abraham about 6 years ago

  • Target version changed from 4.6.x to 5.x

#25 Updated by Szilárd Páll almost 6 years ago

  • Related to Bug #1633: mdrun -nsteps -1 reports silly numbers added

#26 Updated by Mark Abraham over 4 years ago

  • Target version deleted (5.x)

We now have -pin on; does that solve this issue?

#27 Updated by Erik Lindahl almost 3 years ago

I had a discussion at Intel's HPCDevCon with an engineer who, in his talk, used Gromacs as a bad example of an application that went ahead and tried to set affinities in a situation where it shouldn't have, instead of simply obeying the affinities that had been set. Both Mark & I agreed this might have been some earlier version, a bug, or a misunderstanding.

Given that many people submit jobs via batch scripts or in other scenarios where the warnings might not be visible right away (not to mention that we issue a lot of performance warnings, so they might not even take it that seriously), I think it's a bad idea to override existing affinities. They were set for a reason; that reason might have been really stupid, but it might also be something we have no idea about, such as nodes being shared.

We could note that there are affinities set, and that the user must disable those for the Gromacs auto-affinity-setting to do its job, but let's not assume existing affinities are stupid and ignore them.

#28 Updated by Mark Abraham almost 2 years ago

  • Status changed from Blocked, need info to Resolved

Assuming resolved

#29 Updated by Mark Abraham almost 2 years ago

  • Status changed from Resolved to Closed
