Bug #2880

2019.1 Multiple errors with AVX512 on tests

Added by Alexandre Strube 9 months ago. Updated about 2 months ago.

Status: Closed
Priority: Low
Assignee:
Category: testing
Target version: -
Affected version - extra info: 2019.1
Affected version:
Difficulty: uncategorized

Description

Intel Compiler 2019.0.117
ParaStationMPI 5.2.1-1
hwloc 1.11.11
mkl 2019.0.117
numactl 2.0.12
CMake 3.13.0

Cpu: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

When compilation is forced to a 256-bit AVX SIMD level, everything works, but letting CMake autodetect the SIMD level selects AVX512 (which is present on this CPU). This leads to the following error in the tests:

Program received signal SIGSEGV, Segmentation fault.
0x000000000060fe74 in __intel_skx_avx512_memcpy ()

The following tests FAILED:
      1 - TestUtilsUnitTests (SEGFAULT)
      3 - MdlibUnitTest (SEGFAULT)
      4 - AppliedForcesUnitTest (SEGFAULT)
      5 - ListedForcesTest (SEGFAULT)
      6 - CommandLineUnitTests (SEGFAULT)
      7 - DomDecTests (SEGFAULT)
      8 - EwaldUnitTests (SEGFAULT)
     10 - GpuUtilsUnitTests (SEGFAULT)
     11 - HardwareUnitTests (SEGFAULT)
     12 - MathUnitTests (SEGFAULT)
     13 - MdrunUtilityUnitTests (SEGFAULT)
     15 - OnlineHelpUnitTests (SEGFAULT)
     16 - OptionsUnitTests (SEGFAULT)
     17 - RandomUnitTests (SEGFAULT)
     19 - TableUnitTests (SEGFAULT)
     20 - TaskAssignmentUnitTests (SEGFAULT)
     21 - UtilityUnitTests (SEGFAULT)
     26 - SimdUnitTests (SEGFAULT)
     27 - CompatibilityHelpersTests (SEGFAULT)
     28 - GmxAnaTest (SEGFAULT)
     29 - GmxPreprocessTests (SEGFAULT)
     30 - Pdb2gmxTest (SEGFAULT)
     31 - CorrelationsTest (SEGFAULT)
     32 - AnalysisDataUnitTests (SEGFAULT)
     33 - SelectionUnitTests (SEGFAULT)
     34 - TrajectoryAnalysisUnitTests (SEGFAULT)
     35 - EnergyAnalysisUnitTests (SEGFAULT)
     36 - ToolUnitTests (SEGFAULT)
     37 - MdrunTests (SEGFAULT)
     40 - MdrunMpiTests (SEGFAULT)
Errors while running CTest
make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
make[1]: *** [CMakeFiles/check.dir/rule] Error 2
make: *** [check] Error 2
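
For reference, a minimal sketch of the two configurations (assuming the standard GROMACS GMX_SIMD CMake option; exact value names may differ between releases):

# works on this node: force the 256-bit SIMD path explicitly
cmake .. -DGMX_SIMD=AVX2_256
# fails: the 512-bit path, which is what autodetection picks on this Xeon Gold 6148
cmake .. -DGMX_SIMD=AVX_512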


Related issues

Related to GROMACS - Bug #2921: hwloc test makes invalid assumptions (Closed)

History

#1 Updated by Alexandre Strube 9 months ago

The same errors happen in 2019

#2 Updated by Mark Abraham 9 months ago

Thanks for the report. We test this version of icc 19 in Jenkins, but not at this SIMD level. We have noticed icc code-generation issues, particularly with the 201x.0.abc versions.

Does a recent gcc pass the tests in AVX512 mode on this node? If so, that would suggest a problem in icc. (A gcc build is likely to run about as fast as the icc one, too.)
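
A possible way to run that check (just a sketch; the module name is a placeholder for whatever recent gcc JUWELS provides):

module load GCC                            # placeholder module name
CC=gcc CXX=g++ cmake .. -DGMX_SIMD=AVX_512
make -j && make check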

#3 Updated by Roland Schulz 9 months ago

Could you test the 2019 Update 2 version of ICC?

#4 Updated by Roland Schulz 8 months ago

  • Status changed from New to Feedback wanted

#5 Updated by Mark Abraham 8 months ago

Have you had a chance to try anything else Alexandre?

#6 Updated by Alexandre Strube 8 months ago

Mark Abraham wrote:

Have you had a chance to try anything else Alexandre?

Yes. I have other errors with GCC, which I need to investigate further.

I haven't had the chance to test with the newer Intel compiler yet.

#7 Updated by Alexandre Strube 8 months ago

Mark Abraham wrote:

Have you had a chance to try anything else Alexandre?

Yes and no.

I managed to compile with 2019.3.199, but because of library dependencies on our supercomputers, I was only able to run with 2019.0.117. This makes the runtime loader resolve the old libraries:

For example, the output of ldd utility-mpi-test after the binary was compiled with 2019.3 but with the environment switched back to 2019.0:

linux-vdso.so.1 =>  (0x00007ffd0efc8000)
libgromacs_mpi.so.4 => /p/project/ccstao/cstao05/gromacs/gromacs_build/000001/000004_build/work/gromacs-2019.1/build/lib/libgromacs_mpi.so.4 (0x00002b2517f4b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b2518cc3000)
libmpi.so.12 => /gpfs/software/juwels/stages/2018b/software/psmpi/5.2.1-1-iccifort-2019.0.117-GCC-7.3.0/lib/libmpi.so.12 (0x00002b2518edf000)
libm.so.6 => /lib64/libm.so.6 (0x00002b251918a000)
libstdc++.so.6 => /gpfs/software/juwels/stages/2018b/software/GCCcore/7.3.0/lib64/libstdc++.so.6 (0x00002b2517d58000)
libiomp5.so => /gpfs/software/juwels/stages/2018b/software/imkl/2019.0.117-ipsmpi-2018b/lib/intel64/libiomp5.so (0x00002b251948c000)
libgcc_s.so.1 => /gpfs/software/juwels/stages/2018b/software/GCCcore/7.3.0/lib64/libgcc_s.so.1 (0x00002b2517ef2000)
libc.so.6 => /lib64/libc.so.6 (0x00002b251986f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b2519c3c000)
libhwloc.so.5 => /gpfs/software/juwels/stages/2018b/software/hwloc/1.11.11-GCCcore-7.3.0/lib/libhwloc.so.5 (0x00002b2519e40000)
librt.so.1 => /lib64/librt.so.1 (0x00002b2519e80000)
libfftw3f.so.3 => /gpfs/software/juwels/stages/2018b/software/FFTW/3.3.8-ipsmpi-2018b/lib/libfftw3f.so.3 (0x00002b251a088000)
libimf.so => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libimf.so (0x00002b251a3af000)
libsvml.so => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libsvml.so (0x00002b251a94f000)
libirng.so => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libirng.so (0x00002b251c2f2000)
libintlc.so.5 => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libintlc.so.5 (0x00002b251c664000)
/lib64/ld-linux-x86-64.so.2 (0x00002b2517d27000)
libpscom.so.2 => /gpfs/software/juwels/stages/2018b/software/pscom/Default/lib/libpscom.so.2 (0x00002b251c8d6000)
libifport.so.5 => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libifport.so.5 (0x00002b251d2ff000)
libifcoremt.so.5 => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libifcoremt.so.5 (0x00002b251d52d000)
libnuma.so.1 => /gpfs/software/juwels/stages/2018b/software/numactl/2.0.12-GCCcore-7.3.0/lib/libnuma.so.1 (0x00002b2517f0f000)
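
One quick way to spot this kind of mismatch (a sketch; the binary path is illustrative) is to filter the loader output for the Intel runtime libraries and compare the version in the resolved paths against the compiler used for the build:

ldd bin/utility-mpi-test | grep -E 'libimf|libsvml|libirng|libintlc'
# paths under .../compilers_and_libraries_2019.0.117/... show the 2019.0 runtime
# being resolved even though the binary was compiled with 2019.3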

#8 Updated by Mark Abraham 8 months ago

  • Description updated (diff)

#9 Updated by Mark Abraham 8 months ago

On JUWELS, I get all tests passing with 19.0.0.117 Build 20180804, apart from hardware-tests. (I think that may be a bug in hwloc-1.11.11: its hwloc-info only finds 78 logical processors. With support for that hwloc compiled in, our hardware detection misidentifies everything about the last two logical processors in the second 20-core socket. TCBLAB dev-purley-02 is a very similar node, and there both GROMACS built against hwloc-1.11.9 and hwloc-info find all 80 logical processors, so perhaps Jülich have done something creative.)
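
A quick cross-check of the kernel's view versus hwloc's (a sketch, assuming lscpu and the hwloc command-line tools are available):

lscpu | grep '^CPU(s):'      # logical processors as seen by the kernel (80 expected here)
hwloc-info | grep ' PU '     # processing units as seen by hwloc (78 with hwloc-1.11.11)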

On dev-purley-02 I get all tests passing with 19.0.3.199 Build 20190206.

I notice that your problems occur with a ParaStation MPI build, which we don't test. I note that JUWELS has open issues in that area, e.g. https://apps.fz-juelich.de/jsc/hps/juwels/known-issues.html#collectives-in-intel-mpi-2019-can-lead-to-hanging-processes-or-segmentation-faults. Something broken about MPI dynamic linking could produce your symptoms. Does IntelMPI/2018.4.274 work? Does a non-MPI build work?
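
A minimal sketch of those two checks (module names are placeholders; each should use a fresh build directory):

# non-MPI build
cmake .. -DGMX_MPI=OFF && make -j && make check
# MPI build against Intel MPI instead of ParaStation MPI
module swap ParaStationMPI IntelMPI        # placeholder module names
cmake .. -DGMX_MPI=ON && make -j && make check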

#10 Updated by Alexandre Strube 8 months ago

Hi Mark,

I saw your ticket and Sebastian's response (he's the best one on the support team here).

With which MPI do you usually test?

The thing is, with the ICC 2019 Update 2, I only have hwloc 2.0.3, not hwloc 1.

I get to this:

ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_get'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_topology_set_io_types_filter'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_obj_type_is_dcache'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_release'

#11 Updated by Mark Abraham 7 months ago

  • Related to Bug #2921: hwloc test makes invalid assumptions added

#12 Updated by Mark Abraham 7 months ago

Alexandre Strube wrote:

Hi Mark,

I saw your ticket and Sebastian's response (he's the best one at the support team here).

With which MPI do you usually test?

OpenMPI. Anything that is correctly set up to link should work. My guess is that something about the infrastructure on JUWELS is not set up well enough, or your CMake cache was generated partly at different times and so contains mutually incompatible libraries that were detected at different points (or similar; see below).

The thing is, with the ICC 2019 Update 2, I only have hwloc 2.0.3, not hwloc 1.

I get to this:

ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_get'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_topology_set_io_types_filter'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_obj_type_is_dcache'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_release'

hwloc_distances_get is found in hwloc 2.x but not in 1.x, so the CMake cache is inconsistent.

The simplest way to get this right is to make a fresh login, load the right modules, then run cmake in a fresh build directory :-) Get a non-MPI build working and passing tests first.
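
A sketch of both steps (library paths and module names are placeholders):

# check which hwloc the existing cache actually recorded
grep -i hwloc CMakeCache.txt
# check whether that library really exports the hwloc 2.x symbols the link expects
nm -D /path/to/libhwloc.so | grep hwloc_distances_get
# then reconfigure from scratch with a consistent environment
module purge && module load <compiler> <mpi> <hwloc> <cmake>    # placeholders
mkdir fresh-build && cd fresh-build && cmake .. && make -j && make check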

#13 Updated by Mark Abraham 7 months ago

  • Assignee set to Mark Abraham
  • Priority changed from Normal to Low
  • Target version changed from 2019.2 to 2019.3

On the available information, this is not a GROMACS problem; if we do find one, we will consider fixing it in 2019.3.

#14 Updated by Paul Bauer 5 months ago

Is this still considered something that needs fixing, or can this be closed now?

#15 Updated by Alexandre Strube 5 months ago

Hi Paul,

Keep it open for now; I need to check this again this week when I'm back at work.

Thanks!

#16 Updated by Paul Bauer 5 months ago

  • Target version changed from 2019.3 to 2019.4

bumped to next patch release then

#17 Updated by Paul Bauer about 2 months ago

  • Target version changed from 2019.4 to 2019.5

and another bump

#18 Updated by Mark Abraham about 2 months ago

  • Status changed from Feedback wanted to Closed
  • Target version deleted (2019.5)

Let us know if further issues are identified!

#19 Updated by Szilárd Páll 16 days ago

  • Related to Task #3195: assess nightly master failures added

#20 Updated by Szilárd Páll 16 days ago

  • Related to deleted (Task #3195: assess nightly master failures)
