
Bug #2880

2019.1 Multiple errors with AVX512 on tests

Added by Alexandre Strube about 2 months ago. Updated 14 days ago.

Status: Feedback wanted
Priority: Low
Assignee:
Category: testing
Target version:
Affected version - extra info: 2019.1
Affected version:
Difficulty: uncategorized

Description

Intel Compiler 2019.0.117
ParaStationMPI 5.2.1-1
hwloc 1.11.11
mkl 2019.0.117
numactl 2.0.12
CMake 3.13.0

CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

When compilation is forced to 256-bit AVX, everything works, but letting CMake auto-detect the SIMD level selects AVX-512 (which is present on this CPU). This leads to the following error in the tests:

Program received signal SIGSEGV, Segmentation fault.
0x000000000060fe74 in __intel_skx_avx512_memcpy ()

The following tests FAILED:
      1 - TestUtilsUnitTests (SEGFAULT)
      3 - MdlibUnitTest (SEGFAULT)
      4 - AppliedForcesUnitTest (SEGFAULT)
      5 - ListedForcesTest (SEGFAULT)
      6 - CommandLineUnitTests (SEGFAULT)
      7 - DomDecTests (SEGFAULT)
      8 - EwaldUnitTests (SEGFAULT)
     10 - GpuUtilsUnitTests (SEGFAULT)
     11 - HardwareUnitTests (SEGFAULT)
     12 - MathUnitTests (SEGFAULT)
     13 - MdrunUtilityUnitTests (SEGFAULT)
     15 - OnlineHelpUnitTests (SEGFAULT)
     16 - OptionsUnitTests (SEGFAULT)
     17 - RandomUnitTests (SEGFAULT)
     19 - TableUnitTests (SEGFAULT)
     20 - TaskAssignmentUnitTests (SEGFAULT)
     21 - UtilityUnitTests (SEGFAULT)
     26 - SimdUnitTests (SEGFAULT)
     27 - CompatibilityHelpersTests (SEGFAULT)
     28 - GmxAnaTest (SEGFAULT)
     29 - GmxPreprocessTests (SEGFAULT)
     30 - Pdb2gmxTest (SEGFAULT)
     31 - CorrelationsTest (SEGFAULT)
     32 - AnalysisDataUnitTests (SEGFAULT)
     33 - SelectionUnitTests (SEGFAULT)
     34 - TrajectoryAnalysisUnitTests (SEGFAULT)
     35 - EnergyAnalysisUnitTests (SEGFAULT)
     36 - ToolUnitTests (SEGFAULT)
     37 - MdrunTests (SEGFAULT)
     40 - MdrunMpiTests (SEGFAULT)
Errors while running CTest
make[3]: *** [CMakeFiles/run-ctest-nophys] Error 8
make[2]: *** [CMakeFiles/run-ctest-nophys.dir/all] Error 2
make[1]: *** [CMakeFiles/check.dir/rule] Error 2
make: *** [check] Error 2
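
For reference, a minimal sketch of the failing vs. working configurations, assuming GROMACS's GMX_SIMD CMake option is used to force the SIMD level (compiler selection and other options are abbreviated):

    # Auto-detection on this Xeon Gold 6148 selects AVX-512 and the tests segfault:
    CC=icc CXX=icpc cmake ..
    make -j 16 && make check

    # Forcing 256-bit AVX works around the crash:
    CC=icc CXX=icpc cmake .. -DGMX_SIMD=AVX2_256
    make -j 16 && make check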


Related issues

Related to GROMACS - Bug #2921: hwloc test makes invalid assumptions (Closed)

History

#1 Updated by Alexandre Strube about 2 months ago

The same errors also happen with GROMACS 2019.

#2 Updated by Mark Abraham about 2 months ago

Thanks for the report. We test this version of icc 19 in Jenkins, but not at this SIMD level. We have noticed icc code-generation issues before, particularly with the 201x.0.abc versions.

Does a recent gcc pass the tests in AVX-512 mode on this node? If so, that would suggest a problem in icc. (A gcc build is also likely to run about as fast as the icc build.)
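
A sketch of such a cross-check, assuming gcc 7 or newer is available in the environment and using GROMACS's GMX_SIMD option to force AVX-512 explicitly (build directory and job launch details are omitted):

    # Same source tree, gcc instead of icc, AVX-512 forced rather than auto-detected
    mkdir build-gcc-avx512 && cd build-gcc-avx512
    CC=gcc CXX=g++ cmake .. -DGMX_SIMD=AVX_512
    make -j 16 && make check    # runs the same unit tests that segfault above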

#3 Updated by Roland Schulz about 2 months ago

Could you test the 2019 Update 2 version of ICC?

#4 Updated by Roland Schulz about 2 months ago

  • Status changed from New to Feedback wanted

#5 Updated by Mark Abraham 22 days ago

Have you had a chance to try anything else, Alexandre?

#6 Updated by Alexandre Strube 21 days ago

Mark Abraham wrote:

Have you had a chance to try anything else, Alexandre?

Yes. I have other errors with GCC, which I need to investigate further.

I haven't had the chance to test with the newer Intel compiler yet.

#7 Updated by Alexandre Strube 21 days ago

Mark Abraham wrote:

Have you had a chance to try anything else, Alexandre?

Yes and no.

I managed to compile with 2019.3.199, but because of library dependencies on our supercomputers I was only able to run with 2019.0.117. This makes the runtime loader pick up the old libraries.

For example, here is ldd utility-mpi-test after compiling with 2019.3 but with the environment switched back to 2019.0:

linux-vdso.so.1 =>  (0x00007ffd0efc8000)
libgromacs_mpi.so.4 => /p/project/ccstao/cstao05/gromacs/gromacs_build/000001/000004_build/work/gromacs-2019.1/build/lib/libgromacs_mpi.so.4 (0x00002b2517f4b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b2518cc3000)
libmpi.so.12 => /gpfs/software/juwels/stages/2018b/software/psmpi/5.2.1-1-iccifort-2019.0.117-GCC-7.3.0/lib/libmpi.so.12 (0x00002b2518edf000)
libm.so.6 => /lib64/libm.so.6 (0x00002b251918a000)
libstdc++.so.6 => /gpfs/software/juwels/stages/2018b/software/GCCcore/7.3.0/lib64/libstdc++.so.6 (0x00002b2517d58000)
libiomp5.so => /gpfs/software/juwels/stages/2018b/software/imkl/2019.0.117-ipsmpi-2018b/lib/intel64/libiomp5.so (0x00002b251948c000)
libgcc_s.so.1 => /gpfs/software/juwels/stages/2018b/software/GCCcore/7.3.0/lib64/libgcc_s.so.1 (0x00002b2517ef2000)
libc.so.6 => /lib64/libc.so.6 (0x00002b251986f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b2519c3c000)
libhwloc.so.5 => /gpfs/software/juwels/stages/2018b/software/hwloc/1.11.11-GCCcore-7.3.0/lib/libhwloc.so.5 (0x00002b2519e40000)
librt.so.1 => /lib64/librt.so.1 (0x00002b2519e80000)
libfftw3f.so.3 => /gpfs/software/juwels/stages/2018b/software/FFTW/3.3.8-ipsmpi-2018b/lib/libfftw3f.so.3 (0x00002b251a088000)
libimf.so => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libimf.so (0x00002b251a3af000)
libsvml.so => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libsvml.so (0x00002b251a94f000)
libirng.so => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libirng.so (0x00002b251c2f2000)
libintlc.so.5 => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libintlc.so.5 (0x00002b251c664000)
/lib64/ld-linux-x86-64.so.2 (0x00002b2517d27000)
libpscom.so.2 => /gpfs/software/juwels/stages/2018b/software/pscom/Default/lib/libpscom.so.2 (0x00002b251c8d6000)
libifport.so.5 => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libifport.so.5 (0x00002b251d2ff000)
libifcoremt.so.5 => /gpfs/software/juwels/stages/2018b/software/ifort/2019.0.117-GCC-7.3.0/compilers_and_libraries_2019.0.117/linux/compiler/lib/intel64/libifcoremt.so.5 (0x00002b251d52d000)
libnuma.so.1 => /gpfs/software/juwels/stages/2018b/software/numactl/2.0.12-GCCcore-7.3.0/lib/libnuma.so.1 (0x00002b2517f0f000)
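
A quick way to confirm which compiler runtime a test binary resolves at run time is to filter the ldd output for the Intel runtime libraries (a sketch; the binary name and version strings are taken from the listing above):

    # Show which compilers_and_libraries_<version> directory each Intel runtime lib comes from
    ldd utility-mpi-test | grep -E 'libimf|libsvml|libirng|libintlc'
    # Paths containing compilers_and_libraries_2019.0.117 mean the 2019.0 runtime is loaded,
    # even though the binary itself was built with 2019.3.199.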

#8 Updated by Mark Abraham 17 days ago

  • Description updated (diff)

#9 Updated by Mark Abraham 17 days ago

On JUWELS, I get all tests passing with 19.0.0.117 Build 20180804, apart from hardware-tests. (I think that may be a bug in hwloc 1.11.11: its hwloc-info only finds 78 logical processors. With support for that hwloc compiled in, our hardware detection misidentifies everything about the last two logical processors in the second 20-core socket. TCBLAB dev-purley-02 is a very similar node, and with hwloc 1.11.9 both GROMACS and hwloc-info find all 80 logical processors there, so perhaps Juelich have done something creative.)

On dev-purley-02 I get all tests passing with 19.0.3.199 Build 20190206.

I notice that your problems occur with a ParaStation MPI build, which we don't test with. I also note that JUWELS has open issues in that area, e.g. https://apps.fz-juelich.de/jsc/hps/juwels/known-issues.html#collectives-in-intel-mpi-2019-can-lead-to-hanging-processes-or-segmentation-faults. Something broken about MPI dynamic linking could produce your symptoms. Does the IntelMPI/2018.4.274 build work? Does a non-MPI build work?
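
Two quick checks along those lines, sketched under the assumption that suitable modules exist on JUWELS (the IntelMPI name comes from the suggestion above; the Intel compiler module name is guessed from the paths in the ldd listing, and GMX_MPI is the GROMACS CMake switch for an MPI build):

    # Both sketches start from the top of the gromacs-2019.1 source tree.

    # 1) Rebuild against Intel MPI instead of ParaStation MPI (module names are assumptions)
    module purge
    module load Intel/2019.0.117-GCC-7.3.0 IntelMPI/2018.4.274
    mkdir build-intelmpi && cd build-intelmpi
    cmake .. -DGMX_MPI=ON
    make -j 16 && make check

    # 2) Rule MPI out entirely with a non-MPI (thread-MPI) build
    cd .. && mkdir build-nompi && cd build-nompi
    cmake .. -DGMX_MPI=OFF
    make -j 16 && make check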

#10 Updated by Alexandre Strube 17 days ago

Hi Mark,

I saw your ticket and Sebastian's response (he's the best person on the support team here).

With which MPI do you usually test?

The thing is, with the ICC 2019 Update 2, I only have hwloc 2.0.3, not hwloc 1.

I get to this:

ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_get'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_topology_set_io_types_filter'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_obj_type_is_dcache'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_release'

#11 Updated by Mark Abraham 14 days ago

  • Related to Bug #2921: hwloc test makes invalid assumptions added

#12 Updated by Mark Abraham 14 days ago

Alexandre Strube wrote:

Hi Mark,

I saw your ticket and Sebastian's response (he's the best person on the support team here).

With which MPI do you usually test?

OpenMPI. Anything that is correctly set up for linking should work. My guess is that something about the infrastructure on JUWELS is not set up well enough, or that your CMake cache was generated in stages at different times and so contains mutually incompatible libraries detected at different points (or similar; see below).

The thing is, with the ICC 2019 Update 2, I only have hwloc 2.0.3, not hwloc 1.

I get to this:

ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_get'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_topology_set_io_types_filter'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_obj_type_is_dcache'
ld: ../../lib/libgromacs_mpi.so.4.0.0: undefined reference to `hwloc_distances_release'

hwloc_distances_get exists in hwloc 2.x but not in 1.x, so the CMake cache is inconsistent.
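
A quick way to check which hwloc installation the existing cache picked up (the exact cache variable names vary, so the grep is deliberately broad):

    # Run in the old build directory: list every cached entry that mentions hwloc
    grep -i hwloc CMakeCache.txt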

The simplest way to get this right is to start from a fresh login, load the right modules, and then run cmake in a fresh build directory :-) Get a non-MPI build working and passing its tests first.
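
A sketch of that workflow for the ICC 2019 Update 2 + hwloc 2.0.3 combination; the module names are placeholders for whatever the JUWELS stage provides, and only GMX_MPI is an actual GROMACS CMake option here:

    # Hypothetical fresh rebuild after a clean login (module names are assumptions)
    module purge
    module load Intel/2019.3.199 hwloc/2.0.3 CMake/3.13.0
    cd gromacs-2019.1
    rm -rf build && mkdir build && cd build   # fresh build dir, so no stale CMakeCache.txt
    cmake .. -DGMX_MPI=OFF                    # non-MPI build first, as suggested above
    make -j 16 && make check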

#13 Updated by Mark Abraham 14 days ago

  • Assignee set to Mark Abraham
  • Priority changed from Normal to Low
  • Target version changed from 2019.2 to 2019.3

Based on the available information this is not a GROMACS problem, but if we do find one, we will consider fixing it in 2019.3.
