Implement the two frequently used thread-pinning strategies
(Linux default: HT cores at the end; Windows: interleaved) and enable switching between them (through a hidden command line option?) @runner.c:set_cpu_affinity()
Added basic CPU topology information to cpuid code
We can now detect the locality of hardware threads, cores,
and packages for Intel and AMD CPUs under Linux and Windows.
In particular, this provides an array with locality order
for logical processors that can be used to optimize placement.
Refs #1086, #1101.
#1 Updated by Szilárd Páll over 6 years ago
- Category set to mdrun
A bit more detail is needed for this, I think, in case someone can pitch in with the code. With Intel HT, which we can now partially detect for the (probably) most frequent case, when nranks x nthreads = ncpus, there are two CPU layouts that we know of:
- physical cores first, HT siblings after on Linux (on a 4-core CPU 0123 0123);
- interleaved on Windows (on a 4-core CPU 00112233).
With NUMA hardware, especially when using OpenMP (which is the default with up to 16 threads/process on Intel), correct pinning is crucial. The current implementation uses interleaved HT pinning by default, which will only work on Linux with kernels configured the standard way, but will be incorrect on Windows and on Linux with a custom kernel configuration.
The suggested solution is to implement both HT pinning layouts, use the appropriate one on each platform, and allow selecting the layout through, e.g., an advanced environment variable.
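To make the two layouts concrete, here is a minimal sketch of how a (core, hardware thread) pair could be mapped to a logical CPU id under each numbering, with a string override for the layout. All names (`ht_layout_t`, `logical_cpu`, `pin_target`, `choose_layout`, and the override strings) are hypothetical and are not existing GROMACS code. Filling both hardware threads of a core before moving to the next core is only one possible placement policy, assumed here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is an illustration, not GROMACS code. */
typedef enum {
    HT_LAYOUT_SIBLINGS_LAST, /* Linux default: cores first, HT siblings after (0123 0123) */
    HT_LAYOUT_INTERLEAVED    /* Windows: HT siblings adjacent (00112233) */
} ht_layout_t;

/* Logical CPU id of hardware thread `ht` on physical core `core`,
 * or -1 if any argument is out of range. */
static int logical_cpu(ht_layout_t layout, int ncores, int nht, int core, int ht)
{
    if (ncores <= 0 || nht <= 0 || core < 0 || core >= ncores || ht < 0 || ht >= nht)
    {
        return -1;
    }
    if (layout == HT_LAYOUT_INTERLEAVED)
    {
        return core * nht + ht;   /* siblings are neighbours */
    }
    assert(layout == HT_LAYOUT_SIBLINGS_LAST);
    return ht * ncores + core;    /* siblings are ncores apart */
}

/* Assumed placement policy: consecutive threads fill all hardware threads
 * of one core before moving on to the next core. Returns -1 if `thread`
 * does not fit on the machine. */
static int pin_target(ht_layout_t layout, int ncores, int nht, int thread)
{
    if (nht <= 0 || thread < 0)
    {
        return -1;
    }
    return logical_cpu(layout, ncores, nht, thread / nht, thread % nht);
}

/* Pick the layout: an explicit override string wins (for example the value
 * of an advanced environment variable), otherwise use the platform default
 * described above. Unknown override strings fall back to the default. */
static ht_layout_t choose_layout(const char *override)
{
    if (override != NULL)
    {
        if (strcmp(override, "interleaved") == 0)
        {
            return HT_LAYOUT_INTERLEAVED;
        }
        if (strcmp(override, "siblings_last") == 0)
        {
            return HT_LAYOUT_SIBLINGS_LAST;
        }
    }
#ifdef _WIN32
    return HT_LAYOUT_INTERLEAVED;
#else
    return HT_LAYOUT_SIBLINGS_LAST;
#endif
}
```

A caller could pass the result of `getenv()` for some advanced variable to `choose_layout()` and then use `pin_target()` to obtain the logical CPU for each thread before setting its affinity. The platform default only covers the standard kernel configuration, which is why the override is needed.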
#4 Updated by Erik Lindahl over 6 years ago
Have a look at https://gerrit.gromacs.org/#/c/2000/ .
There is now a routine to get basic CPU topology information, including a locality-sorted list of processors, from the cpuid code. Currently this works for Intel and AMD processors under Linux and Windows (not Mac OS X, since we cannot do thread pinning there). The topology is only complicated when we have to detect it on the fly, so if there are other systems (BlueGene?) where we have a known static order of things, it should only be 5 minutes of work to add an #ifdef for that case.
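As an illustration of what a locality-sorted list of processors means, the sketch below sorts per-processor topology entries by package, then core, then hardware thread, and writes out the logical ids in that order. The struct and function names are hypothetical and do not reflect the actual cpuid code in the linked change. It assumes the package, core, and hardware-thread ids have already been detected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical topology record for one logical processor. */
typedef struct {
    int logical_id;  /* id the OS uses for affinity calls */
    int package_id;  /* physical package (socket) */
    int core_id;     /* core within the package */
    int hwthread_id; /* hardware thread within the core */
} cpu_entry_t;

/* Three-way integer comparison that cannot overflow. */
static int cmp_int(int a, int b)
{
    return (a > b) - (a < b);
}

/* Order by package, then core, then hardware thread; the logical id is a
 * final tie-break so the result is deterministic. */
static int cmp_locality(const void *pa, const void *pb)
{
    const cpu_entry_t *a = (const cpu_entry_t *)pa;
    const cpu_entry_t *b = (const cpu_entry_t *)pb;
    int                c;

    if ((c = cmp_int(a->package_id, b->package_id)) != 0) { return c; }
    if ((c = cmp_int(a->core_id, b->core_id)) != 0) { return c; }
    if ((c = cmp_int(a->hwthread_id, b->hwthread_id)) != 0) { return c; }
    return cmp_int(a->logical_id, b->logical_id);
}

/* Sort `cpus` in place and fill `order[0..n-1]` with logical ids such that
 * neighbouring entries are as close as possible in the hardware. */
static void build_locality_order(cpu_entry_t *cpus, size_t n, int *order)
{
    size_t i;

    assert(cpus != NULL && order != NULL);
    qsort(cpus, n, sizeof(*cpus), cmp_locality);
    for (i = 0; i < n; i++)
    {
        order[i] = cpus[i].logical_id;
    }
}
```

With such an array, thread i can simply be pinned to `order[i]` (or `order[i * stride]` to skip HT siblings), independent of how the operating system numbers its logical processors.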
#5 Updated by Mark Abraham over 6 years ago
BlueGeneQ is very probably pre-pinned, but at least for group kernels (where I understand OpenMP is not useful for PP processes) I expect BlueGeneQ will require an MPI process per thread.
With Verlet kernels we will have more scope to use OpenMP on BlueGeneQ, because both PME and PP nodes can use it.