Bug #2734

regressiontests/kernel core dumps on ppc64le

Added by Christoph Junghans 12 months ago. Updated 4 months ago.

Status: In Progress
Priority: Normal
Assignee:
Category: testing
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

GROMACS:      gmx mdrun, version 2018.3 (double precision)
Executable:   /builddir/build/BUILD/gromacs-2018.3/serial_d/bin/gmx_d
Data prefix:  /builddir/build/BUILD/gromacs-2018.3 (source tree)
Working dir:  /builddir/build/BUILD/gromacs-2018.3/serial_d/tests
Command line:
  gmx_d mdrun -h
Thanx for Using GROMACS - Have a Nice Day
sh: line 1:   833 Aborted                 (core dumped) gmx_d mdrun -nb cpu -notunepme > mdrun.out 2>&1
Abnormal return value for ' gmx_d mdrun    -nb cpu   -notunepme >mdrun.out 2>&1' was -1
FAILED. Check mdrun.out, md.log file(s) in nb_kernel_ElecEwSh_VdwLJSh_GeomW4P1 for nb_kernel_ElecEwSh_VdwLJSh_GeomW4P1
1 out of 142 kernel tests FAILED

Details here: https://koji.fedoraproject.org/koji/taskinfo?taskID=30691834
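
Since the package build is non-interactive, the failing test has to be recovered from the captured log. The following is a minimal sketch, assuming the log keeps the exact "FAILED. Check ..." and "N out of M ... tests FAILED" wording quoted above; the function name `summarize_failures` and the regular expressions are illustrative and not part of GROMACS or the regressiontests.

```python
import re

# Failure line as quoted in this report:
# "FAILED. Check mdrun.out, md.log file(s) in <dir> for <test>"
FAILED_RE = re.compile(r"^FAILED\. Check .* in (\S+) for (\S+)\s*$")
# Summary line as quoted in this report:
# "1 out of 142 kernel tests FAILED"
SUMMARY_RE = re.compile(r"^(\d+) out of (\d+) (\w+) tests FAILED\s*$")


def summarize_failures(log_text):
    """Return (failed_test_names, summaries) found in regressiontest output.

    Each summary is a tuple (test_group, failed_count, total_count).
    """
    failed, summaries = [], []
    for line in log_text.splitlines():
        line = line.strip()
        match = FAILED_RE.match(line)
        if match:
            failed.append(match.group(2))
            continue
        match = SUMMARY_RE.match(line)
        if match:
            summaries.append(
                (match.group(3), int(match.group(1)), int(match.group(2)))
            )
    return failed, summaries


# Sample text copied from the description of this issue.
SAMPLE = """\
Abnormal return value for ' gmx_d mdrun    -nb cpu   -notunepme >mdrun.out 2>&1' was -1
FAILED. Check mdrun.out, md.log file(s) in nb_kernel_ElecEwSh_VdwLJSh_GeomW4P1 for nb_kernel_ElecEwSh_VdwLJSh_GeomW4P1
1 out of 142 kernel tests FAILED
"""

if __name__ == "__main__":
    names, totals = summarize_failures(SAMPLE)
    print(names)
    print(totals)
```

To scan a real build, the content of the attached build.log can be read into a string and passed in place of `SAMPLE`.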

build.log (6.6 MB): build log on ppc64le. Christoph Junghans, 06/15/2019 04:10 PM

Related issues

Related to GROMACS - Bug #2746: regressiontests/freeenergy coulandvdwsequential_vdw failing on Power8 (Closed)
Related to GROMACS - Bug #2747: nb_kernel_ElecEwSw_VdwBhamSw_GeomW4W4 regressiontest failing on Power8 (Closed)
Related to GROMACS - Task #3057: re-enable fusion on Power8/9 (New)
Related to GROMACS - Bug #3116: regressiontests/freeenergy core dumps on ppc64le (New)

Associated revisions

Revision 4a7281ef (diff)
Added by Szilárd Páll 8 months ago

Disable instruction fusion on Power8

The -mpower8-fusion flag seems to be the source of incorrect code; not
confirmed, but likely a codegen issue that also affects Power9 with the
similar flag used.

Fixes #2747 #2746 #2734

Change-Id: I56f50e54db47f4fe30c42488f4c4f79ac474518a

Revision 1ce795fe (diff)
Added by Szilárd Páll 8 months ago

Disable instruction fusion on Power8

The -mpower8-fusion flag seems to be the source of incorrect code; not
confirmed, but likely a codegen issue that also affects Power9 with the
similar flag used.

Fixes #2747 #2746 #2734

Change-Id: I56f50e54db47f4fe30c42488f4c4f79ac474518a
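
One way to check whether a given package build still compiles with instruction fusion is to scan the build log for the flag. The following is a minimal sketch under stated assumptions: `-mpower8-fusion` is the flag named in the commit message, `-mpower9-fusion` is only a guess at "the similar flag" for Power9, and the function name `lines_with_fusion` and the sample compile lines are hypothetical.

```python
# Flags to look for. -mpower8-fusion comes from the commit message;
# -mpower9-fusion is an assumption about the Power9 counterpart.
FUSION_FLAGS = ("-mpower8-fusion", "-mpower9-fusion")


def lines_with_fusion(log_text, flags=FUSION_FLAGS):
    """Return (line_number, flag) pairs for log lines that pass a fusion flag.

    Tokens are compared exactly, so a negated form such as
    -mno-power8-fusion is not reported as a hit.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        tokens = line.split()
        for flag in flags:
            if flag in tokens:
                hits.append((number, flag))
    return hits


# Hypothetical compile lines, made up for illustration only.
SAMPLE_LOG = """\
gcc -O3 -mcpu=power8 -mpower8-fusion -c a.c
gcc -O3 -mcpu=power8 -mno-power8-fusion -c b.c
gcc -O3 -c c.c
"""

if __name__ == "__main__":
    for number, flag in lines_with_fusion(SAMPLE_LOG):
        print(f"line {number}: {flag}")
```

An empty result for a real build log would suggest that the fusion flag is no longer passed, in which case a remaining crash would need another explanation.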

History

#1 Updated by Paul Bauer 12 months ago

I added some information for the build, could you try to run the failing test on its own to see where it crashes? Thanks!

Compiler: gcc-8.2.1
BLAS: openblas
LAPACK: openblas
SIMD: None
Double: ON
fftw: 3.3.8

#2 Updated by Christoph Junghans 12 months ago

Interestingly, it works with GMX_SIMD=IBM_VSX on ppc64le.

As this is inside a non-interactive rpm build, what exactly do I need to run?

#3 Updated by Paul Bauer 12 months ago

Very interesting, indeed.
The best option would be to run only the failing test in a memory checker, but I don't think this is possible if you can't physically access the machine. I will check with people here whether someone can try to reproduce this.

#4 Updated by Christoph Junghans 12 months ago

Yeah, no interactive mode, sorry!

#5 Updated by Dominik Mierzejewski 12 months ago

Paul Bauer wrote:

Very interesting, indeed.
The best option would be to run only the failing test in a memory checker, but I don't think this is possible if you can't physically access the machine. I will check with people here whether someone can try to reproduce this.

I have a ppc64le VM if anyone wants to debug this hands-on. Just send me your public ssh key.

#6 Updated by Paul Bauer 12 months ago

I tried reproducing this on the VM that Dominik helpfully provided, with the current head of release-2018, using the same cmake instructions.
Running in valgrind shows some invalid reads when running the code, but it didn't crash for me so far.

#7 Updated by Paul Bauer 12 months ago

Ok, I tried more things but can't get the test to crash. The invalid reads were because I didn't load the correct libgromacs for each build, and they don't show up when done correctly. This was again with the current head of release-2018.

#8 Updated by Paul Bauer 12 months ago

  • Status changed from New to Feedback wanted
  • Target version changed from 2018.4 to 2019

I'll retarget this to 2019, because I was unable to reproduce the issue on a similar VM with the build configuration used during the package build.

#9 Updated by Szilárd Páll 12 months ago

I can't repro crashes on Power8 either, but I did produce a bunch of failing regressiontests, see #2746, #2747. There may be something here that sometimes causes only wrong results and occasionally crashes too.

@Christoph: have the runs been repeated on the Fedora system? Do you see incorrect results in some cases?

#10 Updated by Christoph Junghans 12 months ago

I was just trying to build the rpm package and this issue came up in the `%check` block. Maybe Dominik has another idea.

#11 Updated by Paul Bauer 10 months ago

  • Target version changed from 2019 to 2020

This is very likely to be postponed, because it is not clear what the actual issue is.

#12 Updated by Mark Abraham 10 months ago

  • Related to Bug #2746: regressiontests/freeenergy coulandvdwsequential_vdw failing on Power8 added

#13 Updated by Mark Abraham 10 months ago

  • Related to Bug #2747: nb_kernel_ElecEwSw_VdwBhamSw_GeomW4W4 regressiontest failing on Power8 added

#14 Updated by Gerrit Code Review Bot 9 months ago

Gerrit received a related patchset '1' for Issue #2734.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2019~I56f50e54db47f4fe30c42488f4c4f79ac474518a
Gerrit URL: https://gerrit.gromacs.org/9104

#15 Updated by Gerrit Code Review Bot 9 months ago

Gerrit received a related patchset '1' for Issue #2734.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2018~I56f50e54db47f4fe30c42488f4c4f79ac474518a
Gerrit URL: https://gerrit.gromacs.org/9105

#16 Updated by Mark Abraham 9 months ago

  • Status changed from Feedback wanted to Fix uploaded
  • Target version changed from 2020 to 2019.1

#17 Updated by Szilárd Páll 8 months ago

  • Status changed from Fix uploaded to Feedback wanted

@Christoph: can you check if the change uploaded fixes the failing tests?

#18 Updated by Christoph Junghans 8 months ago

Can I test this in 2019.1? The rpm already has too many patches in it.

#19 Updated by Mark Abraham 8 months ago

Christoph Junghans wrote:

Can I test this in 2019.1? The rpm already has too many patches in it.

Sure, that sounds great.

#20 Updated by Szilárd Páll 8 months ago

  • Status changed from Feedback wanted to Resolved

#21 Updated by Paul Bauer 8 months ago

  • Status changed from Resolved to Closed

#22 Updated by Szilárd Páll 8 months ago

Not really ready to close until we get feedback whether the issue is solved, but I guess leaving it on "Feedback wanted" will mean it remains a release blocker?

#23 Updated by Mark Abraham 8 months ago

  • Status changed from Closed to Feedback wanted
  • Target version changed from 2019.1 to 2019.2

Good idea. Postponed

#24 Updated by Szilárd Páll 8 months ago

  • Status changed from Feedback wanted to Resolved

#25 Updated by Paul Bauer 7 months ago

  • Status changed from Resolved to Closed

#26 Updated by Christoph Junghans 4 months ago

  • File build.log added
  • Status changed from Closed to In Progress
  • Target version changed from 2019.2 to future
  • Affected version changed from 2018.3 to 2019.3

It is back in 2019.3:

GROMACS:      gmx mdrun, version 2019.3
Executable:   /builddir/build/BUILD/gromacs-2019.3/serial/bin/gmx
Data prefix:  /builddir/build/BUILD/gromacs-2019.3 (source tree)
Working dir:  /builddir/build/BUILD/gromacs-2019.3/serial/tests
Command line:
  gmx mdrun -h
Thanx for Using GROMACS - Have a Nice Day
sh: line 1: 16588 Aborted                 (core dumped) gmx mdrun -nb cpu -notunepme > mdrun.out 2>&1
Abnormal return value for ' gmx mdrun    -nb cpu   -notunepme >mdrun.out 2>&1' was -1
FAILED. Check mdrun.out, md.log file(s) in nb_kernel_ElecEwSw_VdwBhamSw_GeomW4W4 for nb_kernel_ElecEwSw_VdwBhamSw_GeomW4W4
1 out of 142 kernel tests FAILED

See https://koji.fedoraproject.org/koji/taskinfo?taskID=35545387 and attached build.log

#27 Updated by Szilárd Páll 2 months ago

  • Related to Task #3057: re-enable fusion on Power8/9 added

#28 Updated by Christoph Junghans 19 days ago

  • Related to Bug #3116: regressiontests/freeenergy core dumps on ppc64le added
