Bug #1991

Automatic launch configuration does not work as expected if Hyper Threading is disabled in the Kernel

Added by Jiri Kraus over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

When Hyper-Threading (HT) is disabled in the kernel (not in the BIOS), GROMACS detects that only the physical cores are online, but still launches as many threads as there are logical cores. This oversubscribes the available hardware and gives suboptimal performance:

64 CPUs configured, but only 32 of them are online.
This can happen on embedded platforms (e.g. ARM) where the OS shuts some cores
off to save power, and will turn them back on later when the load increases.
However, this will likely mean GROMACS cannot pin threads to those cores. You
will likely see much better performance by forcing all cores to be online, and
making sure they run at their full clock frequency.

Number of logical cores detected (64) does not match the number reported by OpenMP (32).
Consider setting the launch configuration manually!

Running on 1 node with total 32 cores, 64 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible
    #1: NVIDIA Tesla K80, compute cap.: 3.7, ECC:  no, stat: compatible

Reading file /home-2/award/GROMACS/water-cut1.0_GMX50_bare_5.1.2/0024/topol.tpr, VERSION 5.1.2 (single precision)
Changing nstlist from 10 to 20, rlist from 1.067 to 1.172

Non-default thread affinity set, disabling internal thread affinity

Overriding nsteps with value passed on the command line: 2000 steps, 4 ps

Using 16 MPI threads
Using 4 OpenMP threads per tMPI thread

Furthermore, manually configuring the launch does not get rid of the warning.
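
The oversubscription in the log above can be checked with simple arithmetic: 16 thread-MPI ranks times 4 OpenMP threads per rank gives 64 threads, while only 32 CPUs are online. The following is a minimal sketch of that calculation; the function names are hypothetical and not part of GROMACS, and the 8 x 4 split mentioned in the comments is only an illustrative manual configuration, not one taken from this report.

```c
#include <stdio.h>

/* Hypothetical helper: total threads started by a run,
 * i.e. thread-MPI ranks times OpenMP threads per rank. */
long total_threads(long n_ranks, long n_omp_per_rank)
{
    return n_ranks * n_omp_per_rank;
}

/* Hypothetical helper: ratio of started threads to online hardware
 * threads. A value above 1.0 means the hardware is oversubscribed.
 * Returns -1.0 if the online count is not positive. */
double oversubscription(long n_ranks, long n_omp_per_rank, long n_online)
{
    if (n_online < 1)
    {
        return -1.0;
    }
    return (double)total_threads(n_ranks, n_omp_per_rank) / (double)n_online;
}

/* Print a one-line summary for a launch configuration. */
void print_launch_summary(long n_ranks, long n_omp_per_rank, long n_online)
{
    printf("%ld ranks x %ld OpenMP threads = %ld threads on %ld online CPUs (factor %.2f)\n",
           n_ranks, n_omp_per_rank,
           total_threads(n_ranks, n_omp_per_rank), n_online,
           oversubscription(n_ranks, n_omp_per_rank, n_online));
}
```

Calling `print_launch_summary(16, 4, 32)` reproduces the situation from the log (factor 2), whereas an illustrative manual split such as `print_launch_summary(8, 4, 32)` would match the 32 online CPUs exactly.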

run_gromacs_512_hsw214_k80_875MHz_ECC_off_output_cpu_gpu_variations_pin_off.txt (5.64 KB): Console output of a full run showing the above error message. Jiri Kraus, 06/15/2016 03:26 PM

Associated revisions

Revision f16daabd (diff)
Added by Berk Hess over 3 years ago

Use sysconf(_SC_NPROCESSORS_ONLN)

If we're not on ARM and sysconf(_SC_NPROCESSORS_ONLN) doesn't match
sysconf(_SC_NPROCESSORS_CONF), we should use the former, as that is
the correct count on x86 with hyperthreading disabled in the kernel.

Added some comments on assumptions and future possible problems.

Fixes #1991.

Change-Id: Id851b8acfbd6b9a2837e8c0e4340b2267a35a20a
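
The rule described in this commit message can be sketched roughly as follows. This is not the actual GROMACS patch: the function names are hypothetical, and keeping the configured count on ARM is an assumption drawn from the wording "if we're not on ARM". `sysconf(_SC_NPROCESSORS_CONF)` and `sysconf(_SC_NPROCESSORS_ONLN)` are widely available (glibc, BSD, macOS) but are not required by POSIX.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not GROMACS code): choose a hardware-thread count
 * from the configured and online processor counts. */
long pick_hw_thread_count(long n_conf, long n_onln, int is_arm)
{
    if (n_onln < 1)
    {
        /* The online count is unavailable; fall back to the configured
         * count, or to 1 if that is unavailable too. */
        return (n_conf >= 1) ? n_conf : 1;
    }
    if (n_conf < 1)
    {
        return n_onln;
    }
    if (is_arm && n_onln < n_conf)
    {
        /* Assumption: on ARM the missing cores may only be powered down
         * temporarily, so the configured count is kept. */
        return n_conf;
    }
    /* Elsewhere (e.g. x86 with HT disabled in the kernel) the online
     * count is the one to trust. */
    return n_onln;
}

/* Compile-time architecture check using common compiler macros. */
int built_for_arm(void)
{
#if defined(__arm__) || defined(__aarch64__)
    return 1;
#else
    return 0;
#endif
}

/* Query the OS and apply the rule above. */
long detect_hw_thread_count(void)
{
    long n_conf = sysconf(_SC_NPROCESSORS_CONF);
    long n_onln = sysconf(_SC_NPROCESSORS_ONLN);

    printf("configured: %ld, online: %ld\n", n_conf, n_onln);
    return pick_hw_thread_count(n_conf, n_onln, built_for_arm());
}
```

With the values from this report (64 configured, 32 online, x86), `pick_hw_thread_count(64, 32, 0)` yields 32 instead of 64.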

History

#1 Updated by Szilárd Páll over 3 years ago

So I see at least two issues here:
  • ignoring what the OS tells us and incorrectly interpreting the hardware support reporting the number of hardware threads;
  • interpreting the mismatch as cores being offline (even without checking for the arch)
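
As an illustration of the second point, a check that takes the architecture into account before explaining a mismatch might look like the sketch below. The enum, function names, and hint wording are all hypothetical. Treating a non-ARM mismatch as "SMT disabled" is an assumption based on the case in this report; CPUs taken offline by other means would produce the same counts.

```c
#include <stdio.h>

/* Hypothetical classification of a configured/online mismatch. */
typedef enum
{
    HINT_NONE,          /* counts agree or are unavailable */
    HINT_CORES_OFFLINE, /* ARM: cores may be powered down to save energy */
    HINT_SMT_DISABLED   /* other architectures: likely HT/SMT disabled */
} mismatch_hint;

/* Decide which explanation, if any, fits the observed counts. */
mismatch_hint classify_mismatch(long n_conf, long n_onln, int is_arm)
{
    if (n_conf < 1 || n_onln < 1 || n_onln >= n_conf)
    {
        return HINT_NONE;
    }
    return is_arm ? HINT_CORES_OFFLINE : HINT_SMT_DISABLED;
}

/* Illustrative wording only; not the text GROMACS prints. */
const char *hint_text(mismatch_hint hint)
{
    switch (hint)
    {
        case HINT_CORES_OFFLINE:
            return "Some cores appear to be offline; bringing them online may help.";
        case HINT_SMT_DISABLED:
            return "Fewer hardware threads are online than configured; "
                   "SMT may be disabled in the OS.";
        default:
            return "";
    }
}

/* Print the hint for a given pair of counts, if there is one. */
void report_mismatch(long n_conf, long n_onln, int is_arm)
{
    mismatch_hint hint = classify_mismatch(n_conf, n_onln, is_arm);

    if (hint != HINT_NONE)
    {
        printf("%ld CPUs configured, %ld online. %s\n",
               n_conf, n_onln, hint_text(hint));
    }
}
```

For the counts in this report, `classify_mismatch(64, 32, 0)` selects the SMT hint rather than the "cores offline" explanation that was printed.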

#2 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1991.
Uploader: Berk Hess
Change-Id: Id536e517419bb33294693d91b6f010d0d5342352
Gerrit URL: https://gerrit.gromacs.org/5960

#3 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1991.
Uploader: Berk Hess
Change-Id: Id851b8acfbd6b9a2837e8c0e4340b2267a35a20a
Gerrit URL: https://gerrit.gromacs.org/5961

#4 Updated by Berk Hess over 3 years ago

  • Status changed from New to Fix uploaded
  • Target version set to 5.1.3

I uploaded a fix, but left most of the warning message in place. It is still a useful hint that you can get better performance on x86 with HT turned on. I can change or remove the warning message for x86, if we think that is useful.

#5 Updated by Berk Hess over 3 years ago

Some googling shows that hwloc handles this case correctly.
So for release-2016 we still might want to check how our cpuinfo code behaves.

#6 Updated by Mark Abraham over 3 years ago

  • Status changed from Fix uploaded to Resolved

#7 Updated by Mark Abraham over 3 years ago

  • Status changed from Resolved to Closed

#8 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1991.
Uploader: Szilárd Páll
Change-Id: I98c4a1b4e6ff7adfd195561d161175bb1dbd7b51
Gerrit URL: https://gerrit.gromacs.org/6003
