Bug #2388

inconsistent pinning behavior due to missing SMT info on AMD Zen

Added by Szilárd Páll about 1 month ago. Updated 15 days ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The mdrun native hardware detection recognizes the hardware thread order on AMD Zen uarch processors, but does not detect SMT and therefore cannot correctly assign the hardware threads to cores. As a result (besides the reporting being incorrect), at most half of the cores are used when a run is launched with #threads<#hwthreads/2. On Intel with HT, in such cases the default stride is switched to 2 to spread threads across cores when the total thread count is <=#cores.

  • Native detection
      Hardware topology: Basic                                                                                          
        Sockets, cores, and logical processors:                                                                         
          Socket  0: [   0] [  16] [   1] [  17] [   2] [  18] [   3] [  19] [   4] [  20] [   5] [  21] [   6] [  22] [   7] [  23] [   8] [  24] [   9] [  25] [  10] [  26] [  11] [  27] [  12] [  28] [  13] [  29] [  14] [  30] [  15] [  31]                                                                                                            
    
  • Detection with hwloc:
    Hardware topology: Full, with devices                                                                              
      Sockets, cores, and logical processors:                                                                          
        Socket  0: [   0  16] [   1  17] [   2  18] [   3  19] [   4  20] [   5  21] [   6  22] [   7  23] [   8  24] [   9  25] [  10  26] [  11  27] [  12  28] [  13  29] [  14  30] [  15  31]
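
To make the effect of the stride concrete, here is a minimal sketch of the behavior described above. It is a simplified model, not mdrun's actual pinning code: the function names are hypothetical, and the logical-processor-to-core mapping (logical processors p and p+16 share core p) is taken from the hwloc output.

```python
# Simplified model of thread pinning with a stride; NOT mdrun's real code.
# All function names are hypothetical, chosen for illustration only.

def detected_order(n_cores, threads_per_core=2):
    """Logical processors in the order the logs list them: 0, 16, 1, 17, ..."""
    return [core + t * n_cores
            for core in range(n_cores)
            for t in range(threads_per_core)]

def pinned_processors(order, n_threads, stride):
    """Thread i is pinned to the (i * stride)-th entry of the detected order."""
    return [order[i * stride] for i in range(n_threads)]

def cores_used(processors, n_cores):
    """Physical cores touched, assuming p and p + n_cores are SMT siblings."""
    return {p % n_cores for p in processors}

order = detected_order(16)
for stride in (1, 2):
    procs = pinned_processors(order, n_threads=16, stride=stride)
    print(f"stride {stride}: processors {procs}, "
          f"{len(cores_used(procs, 16))} cores used")
```

In this model, 16 threads with stride 1 land on both hardware threads of cores 0-7, while stride 2 places one thread on each of the 16 cores.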
    
_test-native_1x16_thrp01.log (22.7 KB) native detection, Szilárd Páll, 01/19/2018 05:11 PM
_test-hwloc_1x16_thrp01.log (23.5 KB) hwloc detection, Szilárd Páll, 01/19/2018 05:11 PM
_test-native_1x16_thrp01_gmx16.log (22.1 KB) native detection with r2016, Szilárd Páll, 01/20/2018 02:08 AM

History

#1 Updated by Szilárd Páll about 1 month ago

Having briefly looked at CpuInfo::detect(), I'm not sure whether this is a technical limitation of cpuid on AMD, but if it is impossible or hard to correct, I suggest we assume that if the logical processor indexing suggests that SMT is on, we switch to stride 2 as we do on Intel. IIUC, this should be safe: as long as the sibling indexing indicates a #cores stride, the kernel would have to be configured in a very strange manner for the assumption to be incorrect.
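
The suggested assumption could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed patch: the function names are hypothetical, and the sibling lists are assumed to be available from some source (on Linux, for example, /sys/devices/system/cpu/cpuN/topology/thread_siblings_list), which may not match what the native detection can actually see.

```python
# Hypothetical sketch of the "assume SMT from the indexing" heuristic.
# siblings maps each logical processor to the sorted list of logical
# processors sharing its core (itself included); where this information
# comes from is an assumption of this example.

def looks_like_smt(siblings, n_logical):
    """True if every logical processor p is paired with p +/- n_logical/2."""
    if n_logical % 2 != 0:
        return False
    n_cores = n_logical // 2
    return all(
        siblings.get(cpu) == [cpu % n_cores, cpu % n_cores + n_cores]
        for cpu in range(n_logical)
    )

def default_stride(siblings, n_logical, n_threads):
    """Use stride 2 only when SMT is inferred and the threads fit on the cores."""
    if looks_like_smt(siblings, n_logical) and n_threads <= n_logical // 2:
        return 2
    return 1

# Layout from this report: 32 logical processors, siblings 16 apart.
zen = {cpu: [cpu % 16, cpu % 16 + 16] for cpu in range(32)}
print(default_stride(zen, n_logical=32, n_threads=16))
```

Any layout that does not match the #cores-stride pattern exactly falls back to stride 1, so the heuristic only changes behavior in the case described here.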

#2 Updated by Mark Abraham about 1 month ago

Does 2016 have an issue?

#3 Updated by Szilárd Páll about 1 month ago

  • Description updated (diff)

Mark Abraham wrote:

Does 2016 have an issue?

I don't think the code has changed, so it should. IIUC there is an assumption made that when there is no detailed topology info, stride should always be 1.

#5 Updated by Szilárd Páll about 1 month ago

PS: behavior confirmed with '16.

#6 Updated by Szilárd Páll about 1 month ago

  • Affected version - extra info set to 2016.4

#7 Updated by Erik Lindahl 20 days ago

  • Tracker changed from Bug to Feature
  • Affected version - extra info deleted (2016.4)
  • Affected version deleted (2018)

Well, it's not technically a bug, since the hardware info module properly detects that we can't see it and correctly specifies that only basic topology information is available. It would be nice to have it for this type of hardware too, but that's a new feature, and I'm not sure how easy it is to implement (it depends on whether the information can be extracted from cpuid).

Overall, isn't the best solution to simply recommend that people use hwloc? I'm skeptical about assuming we have SMT in cases where it has not been properly detected, because such assumptions have historically come back and bitten us in hard ways.

#8 Updated by Szilárd Páll 15 days ago

  • Tracker changed from Feature to Bug
  • Subject changed from SMT info AMD Zen lacking with native hardware detection to inconsistent pinning behavior due to missing SMT info on AMD Zen
  • Affected version set to 2018

This is not a feature request. The described issue leads to inconsistent behavior -- both between the hwloc and no-hwloc build on the same machine and between different x86 platforms (in that in this single case out of four use-cases a different set of threads will be used by default). There is no practical difference between Intel HT and AMD SMT, so we should not implement entirely different default behavior unless something makes it impossible to do the same detection we do on Intel -- that's why I asked whether the Intel-specific cpuid can be extended.

Suggesting that people use hwloc won't solve the inconsistency either (though removing the Intel-only topology detection and related assumptions would perhaps improve things a bit).
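
On the question of whether cpuid can provide this on AMD: as far as I understand, processors with the topology extensions report a threads-per-core count in CPUID leaf 0x8000001E (EBX bits 15:8 hold the count minus one on family 17h, and bits 7:0 hold the core ID), which is also how the Linux kernel derives the sibling count. The sketch below only decodes an already-read EBX value; the function names and the sample value are hypothetical, and whether CpuInfo::detect() can use this leaf is an open question, not something established in this thread.

```python
# Hypothetical decoding of CPUID Fn8000_001E EBX on AMD family 17h (Zen).
# The bit layout is stated from my understanding of AMD's documentation;
# treat it as an assumption to verify before relying on it.

def threads_per_core_from_ebx(ebx):
    """EBX[15:8] holds (threads per core - 1)."""
    return ((ebx >> 8) & 0xFF) + 1

def core_id_from_ebx(ebx):
    """EBX[7:0] holds the core ID of the executing logical processor."""
    return ebx & 0xFF

sample_ebx = 0x0103  # made-up value: SMT on (2 threads per core), core ID 3
print(threads_per_core_from_ebx(sample_ebx), core_id_from_ebx(sample_ebx))
```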
