Bug #2352

Unit test failures on Knights landing with Intel compiler

Added by Paul Bauer almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Compiling Gromacs 2018-beta2 on the Knights Landing nodes on the Kebnekaise cluster (https://www.hpc2n.umu.se/resources/hardware/kebnekaise) leads to a number of failures in the unit tests, listed below and attached in the log files.

The most frequent failures are in TableUnitTest (which should already be addressed) and in the tests "SimdFloatingpointUtilTest.loadUNDuplicate4" and "HardwareTopologyTest.ProcessorSelfconsistency".

The former fails with
/home/p/pabau/pfs/gromacs/beta2018_2/src/gromacs/simd/tests/simd_floatingpoint_util.cpp:939: Failure
Failing SIMD comparison between v0 and v1
Ref. values: { 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4 }
Test values: { 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4 }
in single precision and
[ RUN ] SimdFloatingpointUtilTest.loadUNDuplicate4
/home/p/pabau/pfs/gromacs/beta2018_2/src/gromacs/simd/tests/simd_floatingpoint_util.cpp:939: Failure
Failing SIMD comparison between v0 and v1
Ref. values: { 1, 2, 1, 2, 1, 2, 1, 2 }
Test values: { 1, 1, 1, 1, 2, 2, 2, 2 }
in double precision.

The latter fails in HardwareUnitTests with
/home/p/pabau/pfs/gromacs/beta2018_2/src/gromacs/hardware/tests/hardwaretopology.cpp:121: Failure
Value of: s
Actual: 0
Expected: hwTop.machine().logicalProcessors[idx].socketRankInMachine
Which is: 48
/home/p/pabau/pfs/gromacs/beta2018_2/src/gromacs/hardware/tests/hardwaretopology.cpp:122: Failure
Value of: c
Actual: 2
Expected: hwTop.machine().logicalProcessors[idx].coreRankInSocket
Which is: 0
logical:-1
/home/p/pabau/pfs/gromacs/beta2018_2/src/gromacs/hardware/tests/hardwaretopology.cpp:123: Failure
Value of: t
Actual: 0
and so on ...
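
For readers unfamiliar with what a "self-consistency" check of this kind verifies, here is a minimal sketch. It assumes the topology exposes both a nested socket/core/thread view and a flat list of logical processors, and that the test walks the nested view and checks that each entry maps back to the same position. The struct layout, the name hwThreadRankInCore and the function countInconsistentSlots are illustrative assumptions, not the GROMACS types; only socketRankInMachine and coreRankInSocket appear in the log above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the topology data (not the GROMACS types).
struct LogicalProcessor
{
    int socketRankInMachine;
    int coreRankInSocket;
    int hwThreadRankInCore; // assumed field name; not shown in the log above
};

struct Machine
{
    // Flat view: one entry per logical processor id.
    std::vector<LogicalProcessor> logicalProcessors;
    // Nested view: sockets[s][c][t] holds a logical processor id.
    std::vector<std::vector<std::vector<int>>> sockets;
};

// Count the (socket, core, thread) slots whose logical processor id is out of
// range or does not map back to the same (socket, core, thread) triple.
int countInconsistentSlots(const Machine& m)
{
    int bad = 0;
    for (std::size_t s = 0; s < m.sockets.size(); ++s)
    {
        for (std::size_t c = 0; c < m.sockets[s].size(); ++c)
        {
            for (std::size_t t = 0; t < m.sockets[s][c].size(); ++t)
            {
                const int idx = m.sockets[s][c][t];
                if (idx < 0 || static_cast<std::size_t>(idx) >= m.logicalProcessors.size())
                {
                    ++bad; // e.g. an id of -1
                    continue;
                }
                const LogicalProcessor& lp = m.logicalProcessors[idx];
                if (lp.socketRankInMachine != static_cast<int>(s)
                    || lp.coreRankInSocket != static_cast<int>(c)
                    || lp.hwThreadRankInCore != static_cast<int>(t))
                {
                    ++bad;
                }
            }
        }
    }
    return bad;
}
```

Under these assumptions, a report such as "Actual: 0, Expected ... Which is: 48" would mean that the two views disagree about which socket a logical processor belongs to.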

The build files are archived and I can upload them later.

overview_knl.txt (3.48 KB), Short overview over compilers and failures, Paul Bauer, 12/14/2017 01:32 PM
knl_bugs_log.tar.xz (42 KB), Archive with folders and log files from build process, Paul Bauer, 12/14/2017 01:33 PM
slurm-3238678.out (297 KB), Log file from new build using intel 2018.1, Paul Bauer, 12/17/2017 12:10 PM
2018beta3_knl_logs.tar.xz (64.1 KB), Archive with build log files for intel 17 and 18, Paul Bauer, 12/19/2017 03:52 PM

History

#1 Updated by Erik Lindahl almost 2 years ago

To the best of my knowledge, the test values seem to be correct and the reference incorrect for the load operation.

There is already a workaround for bad icc-18 vectorization on line 932 in tests/simd_floatingpoint_util.cpp. Could you try removing the icc version check and see what happens if we apply it to all versions of icc?
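
To make the two layouts in the failing comparison concrete, here is a small scalar sketch. It assumes, based on the comment above and the test name, that the intended result repeats each of the first width/4 input values four times in a row, while the miscomputed reference tiles the whole block of inputs. The function names are illustrative and not part of the GROMACS SIMD API.

```cpp
#include <cstddef>
#include <vector>

// Layout seen in the "Test values": each input element repeated 4 times in a row.
// Requires mem.size() >= width / 4.
std::vector<double> duplicateEach4(const std::vector<double>& mem, int width)
{
    std::vector<double> out(static_cast<std::size_t>(width));
    for (int i = 0; i < width; ++i)
    {
        out[static_cast<std::size_t>(i)] = mem[static_cast<std::size_t>(i / 4)];
    }
    return out;
}

// Layout seen in the "Ref. values": the block of width/4 inputs tiled 4 times.
// Requires mem.size() >= width / 4.
std::vector<double> tileBlock4(const std::vector<double>& mem, int width)
{
    const int n = width / 4;
    std::vector<double> out(static_cast<std::size_t>(width));
    for (int i = 0; i < width; ++i)
    {
        out[static_cast<std::size_t>(i)] = mem[static_cast<std::size_t>(i % n)];
    }
    return out;
}
```

With width 16 and inputs 1, 2, 3, 4, the first function reproduces the single-precision test values and the second reproduces the reference values; with width 8 and inputs 1, 2, the same holds for the double-precision case.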

#2 Updated by Roland Schulz almost 2 years ago

No need to remove the version check. According to the summary it only appears in ICC 18, which matches my testing and is the reason for the work-around. The work-around isn't in 2018beta2. If you want, you could retest with the latest release-2018 git branch.

I haven't seen the HardwareUnitTests failure. Erik, do you have access to that cluster and can check why this test fails?

#3 Updated by Erik Lindahl almost 2 years ago

Yes, I think so at least. Will look into it tomorrow - sorry, a bit too tired tonight! (Never drink and derive :-)

#4 Updated by Paul Bauer almost 2 years ago

I'll check the latest release now then, to see whether the other issues are also fixed there.
Cheers!

#5 Updated by Paul Bauer almost 2 years ago

I checked now with 2018-beta2-dev-20171215-7927f23 and still see the same errors. Are the changes mentioned already merged into release-2018?

#6 Updated by Roland Schulz almost 2 years ago

The table fix is not yet submitted; it can be downloaded from https://gerrit.gromacs.org/c/7302/. No one has fixed the HardwareUnitTests yet. But the fix for simd_floatingpoint_util is merged and part of the version you tested (7927f23). Do you still see that error? If so, can you paste here the "gmx -version" output for one of the configurations that show it?

#7 Updated by Paul Bauer almost 2 years ago

So, the output I get from gmx -version for the git hash is 7927f23d8eb2f9a1c30d31ce3dac98973803b09e, and the test failures are exactly the same as before. I attached an example log file from one build.

Cheers
Paul

#8 Updated by Roland Schulz almost 2 years ago

It indeed wasn't fixed. There was a typo in the work-around. Sorry. If you want you can download the corrected fix here: https://gerrit.gromacs.org/c/7359/

#9 Updated by Paul Bauer almost 2 years ago

As mentioned on Gerrit, the new commit fixes the unit test failures on Kebnekaise

#10 Updated by Mark Abraham almost 2 years ago

  • Target version set to 2018

table test update now submitted

#11 Updated by Paul Bauer almost 2 years ago

  • Status changed from New to Fix uploaded

I guess this can be marked as resolved then as soon as the final release candidate is out. I also no longer see the HardwareUnitTests failures, but I have to test more configurations to make sure.

#12 Updated by Mark Abraham almost 2 years ago

Why not say it'll be resolved when we're sure we've tried such a build and test?

#13 Updated by Paul Bauer almost 2 years ago

I did one build with the affected version of the compiler that gave me all of the observed failures in the unit tests and did not observe them when building the version Roland had on Gerrit. I have resubmitted the automated builds now to make sure that nothing new has come up.

#14 Updated by Mark Abraham almost 2 years ago

There's still the issue with hardware unit tests, am I right?

#15 Updated by Paul Bauer almost 2 years ago

Yes, but I thought this should be kept separate from this issue. I attached the newest build logs from Kebnekaise. The hardware test failures seem to be a bit random, but I don't understand the code being tested there well enough to say anything more.
All other failures have been resolved now.

#16 Updated by Erik Lindahl almost 2 years ago

This appears to be a problem in hwloc rather than Gromacs. Åke Sandgren even spontaneously said it's important to use 1.11.8. I'll see if I can detect it, and require at least 1.11.8 when we are on KNL.

#17 Updated by Mark Abraham almost 2 years ago

Erik Lindahl wrote:

This appears to be a problem in hwloc rather than Gromacs. Åke Sandgren even spontaneously said it's important to use 1.11.8. I'll see if I can detect it, and require at least 1.11.8 when we are on KNL.

Since hwloc is still optional, I suggest we make it default off for AVX_512_whatever_it_is if the hwloc version (which we can already compare with) is too old.
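
As a rough sketch of the comparison logic such a default would need, the snippet below parses a dotted version string and checks it against a minimum. The function names, the targetIsKnl flag and the policy itself are hypothetical. The 1.11.8 threshold is only the value suggested in the quoted comment, and a real check would presumably live in the build system rather than in C++.

```cpp
#include <array>
#include <cstdio>
#include <string>

// Parse "major.minor.patch" into three integers; fields that are missing or
// cannot be parsed stay 0.
std::array<int, 3> parseVersion(const std::string& text)
{
    std::array<int, 3> v = { { 0, 0, 0 } };
    std::sscanf(text.c_str(), "%d.%d.%d", &v[0], &v[1], &v[2]);
    return v;
}

// True if 'found' is the same as or newer than 'required'.
bool versionAtLeast(const std::string& found, const std::string& required)
{
    // std::array compares element by element, i.e. lexicographically.
    return parseVersion(found) >= parseVersion(required);
}

// Hypothetical policy: keep hwloc on by default unless we target KNL
// and the detected hwloc is older than the suggested minimum.
bool enableHwlocByDefault(bool targetIsKnl, const std::string& hwlocVersion)
{
    return !targetIsKnl || versionAtLeast(hwlocVersion, "1.11.8");
}
```

Comparing the parsed integers rather than the raw strings matters here: a plain string comparison would rank "1.11.10" below "1.11.8".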

#18 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Fix uploaded to Resolved

Even hwloc 1.11.5 works great for me when testing on kebnekaise-knl.hpc2n.umu.se with either icc 17 or 18, so it might rather have been something transient or kernel-related on the node where Paul tested.

In any case, all unit tests work fine on KNL now with latest release-2018, so I think we can close this issue.

#19 Updated by Erik Lindahl almost 2 years ago

PS: There might of course still be an issue with some KNL nodes or kernels, but that's not a bug in Gromacs.

#20 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Resolved to Closed
