Bug #2746
regressiontests/freeenergy coulandvdwsequential_vdw failing on Power8
Description
FAILED. Check checkforce.out (2 errors) file(s) in coulandvdwsequential_vdw for coulandvdwsequential_vdw
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.
$ bin/gmx -version
:-) GROMACS - gmx, 2019-beta3-dev-20181108-d536de3 (-:
GROMACS is written by:
Emile Apol Rossen Apostolov Paul Bauer Herman J.C. Berendsen Par Bjelkmar Christian Blau Viacheslav Bolnykh Kevin Boyd Aldert van Buuren Rudi van Drunen Anton Feenstra Gerrit Groenhof Anca Hamuraru Vincent Hindriksen M. Eric Irrgang Aleksei Iupinov Christoph Junghans Joe Jordan Dimitrios Karkoulis Peter Kasson Jiri Kraus Carsten Kutzner Per Larsson Justin A. Lemkul Viveca Lindahl Magnus Lundborg Erik Marklund Pascal Merz Pieter Meulenhoff Teemu Murtola Szilard Pall Sander Pronk Roland Schulz Michael Shirts Alexey Shvetsov Alfons Sijbers Peter Tieleman Teemu Virolainen Christian Wennberg Maarten Wolf
and the project leaders:
Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2018, The GROMACS development team at Uppsala University, Stockholm University and the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
GROMACS is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
GROMACS: gmx, version 2019-beta3-dev-20181108-d536de3
Executable: /home/pszilard/gromacs-19/build_p8_gcc7_fftw337/bin/gmx
Data prefix: /home/pszilard/gromacs-19 (source tree)
Working dir: /home/pszilard/gromacs-19/build_p8_gcc7_fftw337
Command line: gmx -version
GROMACS version: 2019-beta3-dev-20181108-d536de3
GIT SHA1 hash: d536de3b5125b79d4222768e356c4914e0758d5a
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: disabled
SIMD instructions: IBM_VSX
FFT library: fftw-3.3.8
RDTSCP usage: disabled
TNG support: enabled
Hwloc support: hwloc-1.11.8
Tracing support: disabled
C compiler: /home/pszilard/programs/gcc/7.3/bin/gcc GNU 7.3.0
C compiler flags: -mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move -mvsx -Werror=format-overflow -Wundef -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler: /home/pszilard/programs/gcc/7.3/bin/g++ GNU 7.3.0
C++ compiler flags: -mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move -mvsx -std=c++11 -Wformat-overflow -Wundef -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Related issues
Associated revisions
History
#1 Updated by Szilárd Páll about 2 years ago
Possibly related to this or #2747, with gcc 8:
FAILED. Check checkpot.out (2 errors), checkforce.out (1869 errors) file(s) in coulandvdwsequential_coul for coulandvdwsequential_coul
FAILED. Check checkpot.out (11 errors), checkforce.out (404 errors) file(s) in expanded for expanded
$ cat tests/regressiontests-release-2019-51d7202/freeenergy/coulandvdwsequential_coul/checkpot.out
comparing energy file ./reference_s.edr and ener.edr
There are 51 and 52 terms in the energy files
enm[12] (- - Conserved En.)
There are 10 terms to compare in the energy files
Coul. recip. step 40: -3352.61, step 40: 163097
Potential step 40: -30117.5, step 40: 136332
Files read successfully
$ cat tests/regressiontests-release-2019-51d7202/freeenergy/expanded/checkpot.out
comparing energy file ./reference_s.edr and ener.edr
There are 40 terms in the energy files
There are 11 terms to compare in the energy files
Coul. recip. step 0: 321.338, step 0: 336.118
Coul. recip. step 1: 325.129, step 1: 325.502
Coul. recip. step 2: 325.259, step 2: 325.845
Coul. recip. step 3: 322.506, step 3: 323.083
Coul. recip. step 4: 318.435, step 4: 318.805
Coul. recip. step 6: 313.25, step 6: 312.932
Coul. recip. step 7: 314.874, step 7: 314.292
Coul. recip. step 8: 320.291, step 8: 319.605
Coul. recip. step 9: 329.26, step 9: 328.631
Coul. recip. step 10: 340.449, step 10: 339.993
Coul. recip. step 68: 321.619, step 68: 321.962
Files read successfully
#2 Updated by Szilárd Páll about 2 years ago
Update: several free energy tests fail intermittently with gcc 8 too, also with GMX_SIMD=None and GMX_FFT_LIBRARY=fftpack, but this time some bonded energy terms do not match:
$ cat tests/regressiontests-release-2019-51d7202/freeenergy/coulandvdwtogether/checkpot.out
comparing energy file ./reference_s.edr and ener.edr
There are 49 terms in the energy files
There are 10 terms to compare in the energy files
Angle step 16: 8.28424, step 16: 8.33762
Angle step 17: 7.8723, step 17: 7.92783
Angle step 18: 7.31895, step 18: 7.37155
Files read successfully
$ less tests/regressiontests-release-2019-51d7202/freeenergy/coulandvdwsequential_coul/checkpot.out
comparing energy file ./reference_s.edr and ener.edr
There are 51 and 52 terms in the energy files
enm[12] (- - Conserved En.)
There are 10 terms to compare in the energy files
Coul. recip. step 12: -3356.1, step 12: 46807.7
Potential step 12: -30089.6, step 12: 20074.2
Coul. recip. step 20: -3350.21, step 20: -3345.8
Coul. recip. step 21: -3349.77, step 21: -3344.29
Coul. recip. step 22: -3349.32, step 22: -3343.03
Potential step 22: -30046.6, step 22: -30014.8
Coul. recip. step 23: -3348.88, step 23: -3342.08
Potential step 23: -30034.9, step 23: -30001.2
Coul. recip. step 24: -3348.49, step 24: -3341.54
Potential step 24: -30025.1, step 24: -29991.2
Coul. recip. step 25: -3348.19, step 25: -3341.42
Potential step 25: -30018.2, step 25: -29985.7
Coul. recip. step 26: -3348.01, step 26: -3341.73
Coul. recip. step 27: -3348, step 27: -3342.43
Coul. recip. step 28: -3348.15, step 28: -3343.46
Coul. recip. step 29: -3348.47, step 29: -3344.65
Coul. recip. step 37: -3351.75, step 37: -3347.99
Coul. recip. step 38: -3351.97, step 38: -3347.57
Ryckaert-Bell. step 39: 4.96147, step 39: 4.91132
Coul. recip. step 39: -3352.24, step 39: -3347.31
Ryckaert-Bell. step 40: 5.04869, step 40: 4.99534
Coul. recip. step 40: -3352.61, step 40: -3347.35
Potential step 40: -30117.5, step 40: -30086.2
Files read successfully
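To triage checkpot.out listings like the ones above, it can help to rank the mismatches by relative deviation so that blow-ups stand out from small drifts. The following is a minimal Python sketch, not part of GROMACS or the regressiontests harness. It assumes each mismatch sits on its own line in the form "<term> step <n>: <reference>, step <n>: <test>" (inferred from the excerpts in this issue), and the helper names parse_mismatches and worst are hypothetical.

```python
import re

# Assumed line shape, inferred from the checkpot.out excerpts in this issue:
#   "<term> step <n>: <reference value>, step <n>: <test value>"
MISMATCH = re.compile(
    r"^\s*(?P<term>.+?)\s+step\s+(?P<step>\d+):\s+(?P<ref>\S+),"
    r"\s+step\s+\d+:\s+(?P<test>\S+)\s*$"
)


def parse_mismatches(text):
    """Return (term, step, reference, test, relative_deviation) tuples."""
    rows = []
    for line in text.splitlines():
        m = MISMATCH.match(line)
        if m is None:
            continue  # header and footer lines carry no "step" pair
        ref = float(m.group("ref"))
        test = float(m.group("test"))
        # Guard against division by zero for reference values of exactly 0.
        rel = abs(test - ref) / max(abs(ref), 1e-12)
        rows.append((m.group("term"), int(m.group("step")), ref, test, rel))
    return rows


def worst(rows, n=3):
    """Return the n rows with the largest relative deviation."""
    return sorted(rows, key=lambda r: r[4], reverse=True)[:n]


# A few lines copied from the coulandvdwsequential_coul output above.
sample = """\
There are 10 terms to compare in the energy files
Coul. recip. step 12: -3356.1, step 12: 46807.7
Potential step 12: -30089.6, step 12: 20074.2
Coul. recip. step 20: -3350.21, step 20: -3345.8
Ryckaert-Bell. step 39: 4.96147, step 39: 4.91132
Files read successfully
"""

for term, step, ref, test, rel in worst(parse_mismatches(sample)):
    print(f"{term:15s} step {step:3d}  ref={ref:10.2f}  test={test:10.2f}  rel={rel:.3g}")
```

On this sample the largest relative deviation is the Coul. recip. entry at step 12. The absolute shift in Potential at that step (50163.8) equals the shift in Coul. recip., which would be consistent with the potential mismatch coming from the reciprocal-space term, though that is only an inference from the numbers.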
#3 Updated by Mark Abraham about 2 years ago
- Target version set to 2019.1
I suggest we stop supporting power 8. There's essentially zero HPC usage, so it just isn't a priority.
Setting a target so that we make a decision about the support offered.
#4 Updated by Mark Abraham about 2 years ago
- Related to Bug #2734: regressiontests/kernel core dumps on ppc64le added
#5 Updated by Mark Abraham about 2 years ago
- Related to Bug #2747: nb_kernel_ElecEwSw_VdwBhamSw_GeomW4W4 regressiontest failing on Power8 added
#6 Updated by Szilárd Páll about 2 years ago
Mark Abraham wrote:
I suggest we stop supporting power 8. There's essentially zero HPC usage, so it just isn't a priority.
Setting a target so that we make a decision about the support offered.
None of this is an effort to support Power8, just as our testing on other non-mainstream platforms isn't support for those. By the same reasoning, neither ARMv7, nor even ARMv8, nor anything 32-bit, nor anything Intel older than Sandy Bridge (or Ivy), let alone Windows, should be a priority.
Portability does not mean the code can in theory be ported (and if it happens to not work we claim it is unsupported), but that it actually does work across different platforms that meet the common requirements for the codebase to compile and function correctly. We're using a vanilla GNU toolchain on a vanilla ppc64 Linux distribution, so nothing custom or vendor-specific is involved that would point to an effort beyond ensuring portability; hence explicit Power8 platform support is not a concern here, I think.
Of course, if such observations do not reproduce in other cases, we can flag this as a "known issue" and consider it solved.
PS: ORNL and other US labs do use Power8 GPU clusters for some testing; e.g. AFAIK, without a live project affiliation, only Summitdev (Power S822LC, that is Power8 + P100) is available, even to ORNL employees.
#7 Updated by Gerrit Code Review Bot almost 2 years ago
Gerrit received a related patchset '1' for Issue #2746.
Uploader: Szilárd Páll (pall.szilard@gmail.com)
Change-Id: gromacs~release-2019~I56f50e54db47f4fe30c42488f4c4f79ac474518a
Gerrit URL: https://gerrit.gromacs.org/9104
#8 Updated by Gerrit Code Review Bot almost 2 years ago
Gerrit received a related patchset '1' for Issue #2746.
Uploader: Szilárd Páll (pall.szilard@gmail.com)
Change-Id: gromacs~release-2018~I56f50e54db47f4fe30c42488f4c4f79ac474518a
Gerrit URL: https://gerrit.gromacs.org/9105
#9 Updated by Mark Abraham almost 2 years ago
Szilárd Páll wrote:
Mark Abraham wrote:
I suggest we stop supporting power 8. There's essentially zero HPC usage, so it just isn't a priority.
Setting a target so that we make a decision about the support offered.
None of this is an effort to support Power8, just as our testing on other non-mainstream platforms isn't support for those. By the same reasoning, neither ARMv7, nor even ARMv8, nor anything 32-bit, nor anything Intel older than Sandy Bridge (or Ivy), let alone Windows, should be a priority.
Indeed, none of them are priorities. ARMv7 I would drop in a heartbeat except that we happen to have one already. It was agreed years ago that 32-bit is not supported. Older Intel happens to be even easier to test, so it is unlikely to be dropped. The Windows port is much more about ensuring that our code compiles with a different compiler and a different C++ and C standard library, and has fewer hidden assumptions about POSIX systems, which is sustainable only if it is easy to test.
Portability does not mean the code can in theory be ported
Portable means "able to be ported," not "has been tested and does not require porting." GROMACS is a portable code because we have designed it that way, and also because we have taken care to support a range of platforms.
(and if it happens to not work we claim it is unsupported),
Was that a constructive thing to say?
but that it actually does work across different platforms
For us to claim that it "actually does work" requires that we've spent the time to test it, which is one of the prerequisites for saying something is "supported." How often we prioritise the time to test it depends on how much effort that will cost us compared to the benefit our users derive (directly or indirectly).
that meet the common requirements for the codebase to compile and function correctly.
We're using a vanilla GNU toolchain on a vanilla ppc64 Linux distribution, so nothing custom or vendor-specific is involved that would point to an effort beyond ensuring portability; hence explicit Power8 platform support is not a concern here, I think.
Of course, if such observations do not reproduce in other cases, we can flag this as a "known issue" and consider it solved.
PS: ORNL and other US labs do use Power8 GPU clusters for some testing; e.g. AFAIK, without a live project affiliation, only Summitdev (Power S822LC, that is Power8 + P100) is available, even to ORNL employees.
Good to know. Hopefully the comment about "Initial access to the summitdev system will be limited to the OLCF CAAR teams" is now out of date.
#10 Updated by Mark Abraham almost 2 years ago
- Status changed from New to Fix uploaded
#11 Updated by Mark Abraham almost 2 years ago
- Status changed from Fix uploaded to Resolved
#12 Updated by Mark Abraham almost 2 years ago
- Status changed from Resolved to Closed
#13 Updated by Szilárd Páll over 1 year ago
- Related to Task #3057: re-enable fusion on Power8/9 added
Disable instruction fusion on Power8
The -mpower8-fusion flag seems to be the source of incorrect code; not
confirmed, but likely a codegen issue that also affects Power9 with the
similar flag used.
Fixes #2747 #2746 #2734
Change-Id: I56f50e54db47f4fe30c42488f4c4f79ac474518a
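
To test the fusion hypothesis on an affected build, one option is to rebuild with the same compiler flags minus -mpower8-fusion. The Python sketch below only illustrates filtering that flag out of a flag string such as the one reported by gmx -version in this issue. The helper name drop_flags is hypothetical, and this is not how the GROMACS build system applies the fix.

```python
import shlex


def drop_flags(flag_string, unwanted):
    """Return flag_string with every flag listed in `unwanted` removed."""
    kept = [flag for flag in shlex.split(flag_string) if flag not in unwanted]
    return " ".join(kept)


# Leading part of the C compiler flags reported by `gmx -version` in this issue.
reported = "-mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move -mvsx"

print(drop_flags(reported, {"-mpower8-fusion"}))
# prints: -mcpu=power8 -mpower8-vector -mdirect-move -mvsx
```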