Bug #2897

rotation/flex2 can still fail on cpu-only run on OpenCL build

Added by Mark Abraham 8 months ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The NVIDIA OpenCL config can still fail intermittently in master. It did so for me on rotation/flex2, though it is probably unrelated to the rotation module.

The console log had

11:49:59 Abnormal return value for 'mpirun -np 2 -wdir /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/regressiontests/rotation/flex2 gmx mdrun    -nb cpu -ntomp 2  -notunepme >mdrun.out 2>&1' was -1
11:49:59 FAILED. Check mdrun.out, md.log file(s) in flex2 for flex2
11:50:24 1 out of 12 rotation tests FAILED
11:50:24 All 0 extra tests PASSED

stdout/stderr was

GROMACS:      gmx mdrun, version 2020-dev-20190318-0bcba28
Executable:   /home/jenkins/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/gmx
Data prefix:  /home/jenkins/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs (source tree)
Working dir:  /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/regressiontests/rotation/flex2
Command line:
  gmx mdrun -nb cpu -ntomp 2 -notunepme

The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmx mdrun with the GMX_USE_RDTSCP=ON CMake option.
Reading file topol.tpr, VERSION 2020-dev-20190318-0bcba28 (single precision)
Can not increase nstlist because an NVE ensemble is used

Using 2 MPI processes
Using 2 OpenMP threads per MPI process

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).
starting mdrun 'Good gRace! Old Maple Actually Chews Slate'
25 steps,      0.1 ps.
[bs-nix1310:18331] *** Process received signal ***
[bs-nix1310:18331] Signal: Floating point exception (8)
[bs-nix1310:18331] Signal code:  (7)
[bs-nix1310:18331] Failing at address: 0x7fb101d00150
[bs-nix1310:18331] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fb100153330]
[bs-nix1310:18331] [ 1] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_Z19dd_partition_systemP8_IO_FILERKN3gmx8MDLoggerElPK9t_commrecbiP7t_stateRK10gmx_mtop_tPK10t_inputrecS9_PNS1_12PaddedVectorINS1_11BasicVectorIfEENS1_9AllocatorISI_NS1_23AlignedAllocationPolicyEEEEEPNS1_7MDAtomsEP14gmx_localtop_tP10t_forcerecP11gmx_vsite_tPNS1_11ConstraintsEP6t_nrnbP13gmx_wallcycleb+0x459) [0x7fb101d00150]
[bs-nix1310:18331] [ 2] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx10Integrator5do_mdEv+0x2bfa) [0x7fb1022ce22a]
[bs-nix1310:18331] [ 3] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx10Integrator3runEjb+0x1b3) [0x7fb1022c97d1]
[bs-nix1310:18331] [ 4] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx8Mdrunner8mdrunnerEv+0x385f) [0x7fb1022efccf]
[bs-nix1310:18331] [ 5] gmx() [0x40d337]
[bs-nix1310:18331] [ 6] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(+0xf51315) [0x7fb101c9c315]
[bs-nix1310:18331] [ 7] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x39d) [0x7fb101c9de5b]
[bs-nix1310:18331] [ 8] gmx() [0x40ad3c]
[bs-nix1310:18331] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fb0ff4fff45]
[bs-nix1310:18331] [10] gmx() [0x40abd9]
[bs-nix1310:18331] *** End of error message ***

which perhaps suggests that a garbage force produced a garbage update that then broke the attempt to repartition.

The md.log ended with

Enforced rotation: group 0 type 'flex2'

Linking all bonded interactions to atoms

Intra-simulation communication will occur every 1 steps.
There are: 4 Atoms
Atom distribution over 2 domains: av 2 stddev 0 min 2 max 2
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest
Initial temperature: 0 K

Started mdrun on rank 0 Mon Mar 18 11:49:53 2019

           Step           Time
              0        0.00200

   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   COM Pull En.      Potential
    0.00000e+00    8.39657e-02    0.00000e+00    1.70650e+02    1.70734e+02
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
    5.25263e-02    1.70786e+02    1.40388e+00   -1.14549e-05    1.12425e-03
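
If a non-finite force or coordinate really is what breaks the repartitioning, one way to check would be to capture a core dump from the failing rank and inspect the faulting frame in dd_partition_system. A minimal sketch, not taken from the report, assuming the same command line as the failing run and that core files are written to the working directory (the core file name and location depend on the system configuration):

  ulimit -c unlimited    # allow core dumps in this shell
  mpirun -np 2 gmx mdrun -nb cpu -ntomp 2 -notunepme >mdrun.out 2>&1
  gdb /path/to/gmx core  # on a failing run, inspect the SIGFPE frame (bt, info locals)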

History

#1 Updated by Mark Abraham 8 months ago

  • Related to Bug #2737: AMD OpenCl failes release build in complex tests added

#2 Updated by Mark Abraham 8 months ago

  • Related to Bug #2702: PME gather reduction race in OpenCL (and CUDA) added

#3 Updated by Szilárd Páll 8 months ago

This is a CPU run (see the -nb cpu on the command line above), so it is not related to any GPU stuff (besides, it also uses DD, so PME can't be running on the GPU).

Perhaps we should stress-test the rotation case by running it repeatedly to see whether we can observe the failure?
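
A minimal sketch of such a stress test, assuming the regression-test working directory and the command line quoted above (the iteration count is arbitrary):

  cd regressiontests/rotation/flex2
  for i in $(seq 1 100); do
      # same invocation as the failing Jenkins run, CPU-only, 2 ranks
      mpirun -np 2 gmx mdrun -nb cpu -ntomp 2 -notunepme >mdrun.out 2>&1 || { echo "run $i failed"; break; }
  done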

#4 Updated by Szilárd Páll 8 months ago

  • Related to deleted (Bug #2737: AMD OpenCl failes release build in complex tests)

#5 Updated by Szilárd Páll 8 months ago

  • Related to deleted (Bug #2702: PME gather reduction race in OpenCL (and CUDA))

#6 Updated by Mark Abraham 8 months ago

  • Subject changed from rotation/flex2 can still fail on OpenCL to rotation/flex2 can still fail on cpu-only run on OpenCL build

#7 Updated by Szilárd Páll 8 months ago

Do we know anything about whether this only reproduces in a GPU/OpenCL build?
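
One way to answer that would be to rebuild without any GPU support and rerun the stress loop above. A hedged sketch, assuming an out-of-source build directory and the CMake options of that era (GMX_USE_RDTSCP appears in the log above; GMX_GPU=OFF is assumed to disable OpenCL):

  cmake .. -DGMX_GPU=OFF -DGMX_USE_RDTSCP=ON
  make -j gmx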
