Bug #2746

regressiontests/freeenergy coulandvdwsequential_vdw failing on Power8

Added by Szilárd Páll 9 months ago. Updated 5 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version:
Affected version - extra info: 2019-beta3-dev-20181108-d536de3
Affected version:
Difficulty: uncategorized

Description

FAILED. Check checkforce.out (2 errors) file(s) in coulandvdwsequential_vdw for coulandvdwsequential_vdw
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.
$ bin/gmx -version
             :-) GROMACS - gmx, 2019-beta3-dev-20181108-d536de3 (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov      Paul Bauer     Herman J.C. Berendsen
    Par Bjelkmar      Christian Blau   Viacheslav Bolnykh     Kevin Boyd    
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra    Gerrit Groenhof  
   Anca Hamuraru    Vincent Hindriksen  M. Eric Irrgang    Aleksei Iupinov  
 Christoph Junghans     Joe Jordan     Dimitrios Karkoulis    Peter Kasson   
     Jiri Kraus      Carsten Kutzner      Per Larsson      Justin A. Lemkul 
   Viveca Lindahl    Magnus Lundborg     Erik Marklund       Pascal Merz    
 Pieter Meulenhoff    Teemu Murtola       Szilard Pall       Sander Pronk   
   Roland Schulz      Michael Shirts    Alexey Shvetsov     Alfons Sijbers  
   Peter Tieleman    Teemu Virolainen  Christian Wennberg    Maarten Wolf   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2018, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx, version 2019-beta3-dev-20181108-d536de3
Executable:   /home/pszilard/gromacs-19/build_p8_gcc7_fftw337/bin/gmx
Data prefix:  /home/pszilard/gromacs-19 (source tree)
Working dir:  /home/pszilard/gromacs-19/build_p8_gcc7_fftw337
Command line:
  gmx -version

GROMACS version:    2019-beta3-dev-20181108-d536de3
GIT SHA1 hash:      d536de3b5125b79d4222768e356c4914e0758d5a
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  IBM_VSX
FFT library:        fftw-3.3.8
RDTSCP usage:       disabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.8
Tracing support:    disabled
C compiler:         /home/pszilard/programs/gcc/7.3/bin/gcc GNU 7.3.0
C compiler flags:   -mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move  -mvsx    -Werror=format-overflow -Wundef -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds 
C++ compiler:       /home/pszilard/programs/gcc/7.3/bin/g++ GNU 7.3.0
C++ compiler flags: -mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move  -mvsx    -std=c++11  -Wformat-overflow -Wundef -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds 
Attachment: checkforce.out (390 KB), added by Szilárd Páll, 11/08/2018 03:49 PM

Related issues

Related to GROMACS - Bug #2734: regressiontests/kernel core dumps on ppc64le (In Progress)
Related to GROMACS - Bug #2747: nb_kernel_ElecEwSw_VdwBhamSw_GeomW4W4 regressiontest failing on Power8 (Closed)

Associated revisions

Revision 4a7281ef (diff)
Added by Szilárd Páll 5 months ago

Disable instruction fusion on Power8

The -mpower8-fusion flag seems to be the source of incorrect code; not
confirmed, but likely a codegen issue that also affects Power9 with the
similar flag used.

Fixes #2747 #2746 #2734

Change-Id: I56f50e54db47f4fe30c42488f4c4f79ac474518a

Revision 1ce795fe (diff)
Added by Szilárd Páll 5 months ago

Disable instruction fusion on Power8

The -mpower8-fusion flag seems to be the source of incorrect code; not
confirmed, but likely a codegen issue that also affects Power9 with the
similar flag used.

Fixes #2747 #2746 #2734

Change-Id: I56f50e54db47f4fe30c42488f4c4f79ac474518a
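
As an illustration of what the commit message describes, here is a minimal sketch that drops `-mpower8-fusion` from a compiler flag string. It is not the actual patch (the build-system change itself is not shown in this report); the helper name `drop_flag` is hypothetical, and the flag list is an abbreviated copy of the one in the `gmx -version` output above.

```python
import shlex

# Abbreviated copy of the C compiler flags reported by `gmx -version` above
# (warning flags omitted for brevity).
REPORTED_FLAGS = (
    "-mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move  -mvsx "
    "-O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast"
)


def drop_flag(flags: str, unwanted: str = "-mpower8-fusion") -> str:
    """Return `flags` without any token equal to `unwanted`.

    Hypothetical helper for illustration only; it is not part of GROMACS.
    """
    return " ".join(tok for tok in shlex.split(flags) if tok != unwanted)


cleaned = drop_flag(REPORTED_FLAGS)
print(cleaned)
```

Comparing regression test results from builds with and without the flag is one way to check whether it is the source of the mismatches reported here.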

History

#1 Updated by Szilárd Páll 9 months ago

Possibly related to this or #2747, with gcc 8:

FAILED. Check checkpot.out (2 errors), checkforce.out (1869 errors) file(s) in coulandvdwsequential_coul for coulandvdwsequential_coul
FAILED. Check checkpot.out (11 errors), checkforce.out (404 errors) file(s) in expanded for expanded
$ cat  tests/regressiontests-release-2019-51d7202/freeenergy/coulandvdwsequential_coul/checkpot.out 
comparing energy file ./reference_s.edr and ener.edr

There are 51 and 52 terms in the energy files

enm[12] (- - Conserved En.)
There are 10 terms to compare in the energy files

Coul. recip.     step  40:      -3352.61,  step  40:       163097
Potential        step  40:      -30117.5,  step  40:       136332

Files read successfully
$ cat  tests/regressiontests-release-2019-51d7202/freeenergy/expanded/checkpot.out 
comparing energy file ./reference_s.edr and ener.edr

There are 40 terms in the energy files

There are 11 terms to compare in the energy files

Coul. recip.     step   0:       321.338,  step   0:      336.118
Coul. recip.     step   1:       325.129,  step   1:      325.502
Coul. recip.     step   2:       325.259,  step   2:      325.845
Coul. recip.     step   3:       322.506,  step   3:      323.083
Coul. recip.     step   4:       318.435,  step   4:      318.805
Coul. recip.     step   6:        313.25,  step   6:      312.932
Coul. recip.     step   7:       314.874,  step   7:      314.292
Coul. recip.     step   8:       320.291,  step   8:      319.605
Coul. recip.     step   9:        329.26,  step   9:      328.631
Coul. recip.     step  10:       340.449,  step  10:      339.993
Coul. recip.     step  68:       321.619,  step  68:      321.962

Files read successfully

#2 Updated by Szilárd Páll 9 months ago

Update: several free energy tests also fail intermittently with gcc 8, even with GMX_SIMD=None and GMX_FFT_LIBRARY=fftpack, but this time some bonded energy terms do not match:

 $ cat tests/regressiontests-release-2019-51d7202/freeenergy/coulandvdwtogether/checkpot.out
comparing energy file ./reference_s.edr and ener.edr

There are 49 terms in the energy files

There are 10 terms to compare in the energy files

Angle            step  16:       8.28424,  step  16:      8.33762
Angle            step  17:        7.8723,  step  17:      7.92783
Angle            step  18:       7.31895,  step  18:      7.37155

Files read successfully

$ less tests/regressiontests-release-2019-51d7202/freeenergy/coulandvdwsequential_coul/checkpot.out 
comparing energy file ./reference_s.edr and ener.edr

There are 51 and 52 terms in the energy files

enm[12] (- - Conserved En.)
There are 10 terms to compare in the energy files

Coul. recip.     step  12:       -3356.1,  step  12:      46807.7
Potential        step  12:      -30089.6,  step  12:      20074.2
Coul. recip.     step  20:      -3350.21,  step  20:      -3345.8
Coul. recip.     step  21:      -3349.77,  step  21:     -3344.29
Coul. recip.     step  22:      -3349.32,  step  22:     -3343.03
Potential        step  22:      -30046.6,  step  22:     -30014.8
Coul. recip.     step  23:      -3348.88,  step  23:     -3342.08
Potential        step  23:      -30034.9,  step  23:     -30001.2
Coul. recip.     step  24:      -3348.49,  step  24:     -3341.54
Potential        step  24:      -30025.1,  step  24:     -29991.2
Coul. recip.     step  25:      -3348.19,  step  25:     -3341.42
Potential        step  25:      -30018.2,  step  25:     -29985.7
Coul. recip.     step  26:      -3348.01,  step  26:     -3341.73
Coul. recip.     step  27:         -3348,  step  27:     -3342.43
Coul. recip.     step  28:      -3348.15,  step  28:     -3343.46
Coul. recip.     step  29:      -3348.47,  step  29:     -3344.65
Coul. recip.     step  37:      -3351.75,  step  37:     -3347.99
Coul. recip.     step  38:      -3351.97,  step  38:     -3347.57
Ryckaert-Bell.   step  39:       4.96147,  step  39:      4.91132
Coul. recip.     step  39:      -3352.24,  step  39:     -3347.31
Ryckaert-Bell.   step  40:       5.04869,  step  40:      4.99534
Coul. recip.     step  40:      -3352.61,  step  40:     -3347.35
Potential        step  40:      -30117.5,  step  40:     -30086.2

Files read successfully
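
The mismatches above mix small drifts (e.g. step 20) with gross errors (e.g. step 12). The following is a minimal sketch for telling the two apart by parsing mismatch lines from a checkpot.out file and computing a relative difference for each. The function name `parse_mismatches`, the regular expression, and the 0.5 threshold are assumptions made for illustration; they are not part of the regression test tooling.

```python
import re

# Matches mismatch lines of the form seen in checkpot.out, e.g.
#   "Coul. recip.     step  12:       -3356.1,  step  12:      46807.7"
LINE_RE = re.compile(
    r"^(?P<term>.+?)\s+step\s+(?P<step>\d+):\s+(?P<ref>[-+.\deE]+),"
    r"\s+step\s+\d+:\s+(?P<new>[-+.\deE]+)\s*$"
)

# Three lines copied from the output quoted in this comment.
SAMPLE = """\
Coul. recip.     step  12:       -3356.1,  step  12:      46807.7
Coul. recip.     step  20:      -3350.21,  step  20:      -3345.8
Angle            step  16:       8.28424,  step  16:      8.33762
"""


def parse_mismatches(text):
    """Return (term, step, reference, value, relative_difference) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # header and summary lines do not match the pattern
        ref = float(m["ref"])
        new = float(m["new"])
        scale = max(abs(ref), abs(new))
        rel = abs(new - ref) / scale if scale > 0 else 0.0
        rows.append((m["term"].strip(), int(m["step"]), ref, new, rel))
    return rows


GROSS_THRESHOLD = 0.5  # arbitrary cut-off chosen for this illustration

rows = parse_mismatches(SAMPLE)
for term, step, ref, new, rel in rows:
    label = "gross" if rel > GROSS_THRESHOLD else "small"
    print(f"{term:15s} step {step:3d}  rel. diff {rel:.4f}  ({label})")
```

With this sample, only the step 12 Coul. recip. value is flagged as gross, while the other two differ from the reference by less than one percent.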

#3 Updated by Mark Abraham 7 months ago

  • Target version set to 2019.1

I suggest we stop supporting power 8. There's essentially zero HPC usage, so it just isn't a priority.

Setting a target so that we make a decision about the support offered.

#4 Updated by Mark Abraham 7 months ago

  • Related to Bug #2734: regressiontests/kernel core dumps on ppc64le added

#5 Updated by Mark Abraham 7 months ago

  • Related to Bug #2747: nb_kernel_ElecEwSw_VdwBhamSw_GeomW4W4 regressiontest failing on Power8 added

#6 Updated by Szilárd Páll 7 months ago

Mark Abraham wrote:

I suggest we stop supporting power 8. There's essentially zero HPC usage, so it just isn't a priority.

Setting a target so that we make a decision about the support offered.

None of this is an effort to support Power8, just as our testing on other non-mainstream platforms isn't support for those. By the same reasoning, neither ARMv7, nor even ARMv8, nor anything 32-bit, nor any Intel CPU older than Sandy Bridge (or Ivy Bridge), let alone Windows, should be a priority.

Portability does not mean the code can in theory be ported (and if it happens to not work we claim it is unsupported), but that it actually does work across different platforms that meet the common requirements for the codebase to compile and function correctly. We're using vanilla GNU toolchain on a vanilla ppc64 Linux distribution, so nothing custom or vendor-specific is involved that would point to an effort beyond ensuring portability, hence explicit Power8 platform support is not a concern here, I think.

Of course, if such observations do not reproduce in other cases, we can flag this as a "known issue" and consider it solved.

PS: ORNL and other US labs do use Power8 GPU clusters for some testing; e.g. AFAIK, without a live project affiliation, only Summitdev (Power S822LC, that is Power8 + P100) is available, even to ORNL employees.

#7 Updated by Gerrit Code Review Bot 5 months ago

Gerrit received a related patchset '1' for Issue #2746.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2019~I56f50e54db47f4fe30c42488f4c4f79ac474518a
Gerrit URL: https://gerrit.gromacs.org/9104

#8 Updated by Gerrit Code Review Bot 5 months ago

Gerrit received a related patchset '1' for Issue #2746.
Uploader: Szilárd Páll
Change-Id: gromacs~release-2018~I56f50e54db47f4fe30c42488f4c4f79ac474518a
Gerrit URL: https://gerrit.gromacs.org/9105

#9 Updated by Mark Abraham 5 months ago

Szilárd Páll wrote:

Mark Abraham wrote:

I suggest we stop supporting power 8. There's essentially zero HPC usage, so it just isn't a priority.

Setting a target so that we make a decision about the support offered.

None of this is an effort to support Power8, just as our testing on other non-mainstream platforms isn't support for those. By the same reasoning, neither ARMv7, nor even ARMv8, nor anything 32-bit, nor any Intel CPU older than Sandy Bridge (or Ivy Bridge), let alone Windows, should be a priority.

Indeed, none of them are priorities. ARMv7 I would drop in a heartbeat, except that we happen to have one already. It was agreed years ago that 32-bit is not supported. Older Intel happens to be even easier to test, so it is unlikely to be dropped. The Windows port is much more about ensuring that our code compiles with a different compiler and a different C++ and C standard library, and has fewer hidden assumptions about POSIX systems, which is sustainable only if it is easy to test.

Portability does not mean the code can in theory be ported

Portable means "able to be ported," not "has been tested and does not require porting." GROMACS is a portable code because we have designed it that way, and also because we have taken care to support a range of platforms.

(and if it happens to not work we claim it is unsupported),

Was that a constructive thing to say?

but that it actually does work across different platforms

For us to claim that it "actually does work" requires that we've spent the time to test it, which is one of the prerequisites for saying something is "supported." How often we prioritise the time to test it depends on how much effort that will cost us compared to the benefit our users derive (directly or indirectly).

that meet the common requirements for the codebase to compile and function correctly.

We're using vanilla GNU toolchain on a vanilla ppc64 Linux distribution, so nothing custom or vendor-specific is involved that would point to an effort beyond ensuring portability, hence explicit Power8 platform support is not a concern here, I think.

Of course, if such observations do not reproduce in other cases, we can flag this as a "known issue" and consider it solved.

PS: ORNL and other US labs do use Power8 GPU clusters for some testing; e.g. AFAIK, without a live project affiliation, only Summitdev (Power S822LC, that is Power8 + P100) is available, even to ORNL employees.

Good to know. Hopefully the comment about "Initial access to the summitdev system will be limited to the OLCF CAAR teams" is now out of date.

#10 Updated by Mark Abraham 5 months ago

  • Status changed from New to Fix uploaded

#11 Updated by Mark Abraham 5 months ago

  • Status changed from Fix uploaded to Resolved

#12 Updated by Mark Abraham 5 months ago

  • Status changed from Resolved to Closed
