rotation/flex2 can still fail in a CPU-only run on an OpenCL build
The NVIDIA OpenCL config can still fail intermittently in master. It did so for me on rotation/flex2, though it is probably unrelated to the rotation module.
The console log had
11:49:59 Abnormal return value for 'mpirun -np 2 -wdir /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/regressiontests/rotation/flex2 gmx mdrun -nb cpu -ntomp 2 -notunepme >mdrun.out 2>&1' was -1
11:49:59 FAILED. Check mdrun.out, md.log file(s) in flex2 for flex2
11:50:24 1 out of 12 rotation tests FAILED
11:50:24 All 0 extra tests PASSED
GROMACS: gmx mdrun, version 2020-dev-20190318-0bcba28
Executable: /home/jenkins/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/gmx
Data prefix: /home/jenkins/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs (source tree)
Working dir: /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/regressiontests/rotation/flex2
Command line:
  gmx mdrun -nb cpu -ntomp 2 -notunepme
The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmx mdrun with the GMX_USE_RDTSCP=ON CMake option.
Reading file topol.tpr, VERSION 2020-dev-20190318-0bcba28 (single precision)
Can not increase nstlist because an NVE ensemble is used
Using 2 MPI processes
Using 2 OpenMP threads per MPI process
NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).
starting mdrun 'Good gRace! Old Maple Actually Chews Slate'
25 steps, 0.1 ps.
[bs-nix1310:18331] *** Process received signal ***
[bs-nix1310:18331] Signal: Floating point exception (8)
[bs-nix1310:18331] Signal code: (7)
[bs-nix1310:18331] Failing at address: 0x7fb101d00150
[bs-nix1310:18331] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fb100153330]
[bs-nix1310:18331] [ 1] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_Z19dd_partition_systemP8_IO_FILERKN3gmx8MDLoggerElPK9t_commrecbiP7t_stateRK10gmx_mtop_tPK10t_inputrecS9_PNS1_12PaddedVectorINS1_11BasicVectorIfEENS1_9AllocatorISI_NS1_23AlignedAllocationPolicyEEEEEPNS1_7MDAtomsEP14gmx_localtop_tP10t_forcerecP11gmx_vsite_tPNS1_11ConstraintsEP6t_nrnbP13gmx_wallcycleb+0x459) [0x7fb101d00150]
[bs-nix1310:18331] [ 2] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx10Integrator5do_mdEv+0x2bfa) [0x7fb1022ce22a]
[bs-nix1310:18331] [ 3] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx10Integrator3runEjb+0x1b3) [0x7fb1022c97d1]
[bs-nix1310:18331] [ 4] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx8Mdrunner8mdrunnerEv+0x385f) [0x7fb1022efccf]
[bs-nix1310:18331] [ 5] gmx() [0x40d337]
[bs-nix1310:18331] [ 6] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(+0xf51315) [0x7fb101c9c315]
[bs-nix1310:18331] [ 7] /mnt/workspace/Matrix_PreSubmit_master/8c9ee12f/gromacs/bin/../lib/libgromacs.so.5(_ZN3gmx24CommandLineModuleManager3runEiPPc+0x39d) [0x7fb101c9de5b]
[bs-nix1310:18331] [ 8] gmx() [0x40ad3c]
[bs-nix1310:18331] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fb0ff4fff45]
[bs-nix1310:18331]  gmx() [0x40abd9]
[bs-nix1310:18331] *** End of error message ***
which could perhaps suggest that a garbage force produced a garbage update which broke the attempt to repartition.
The md.log ended with
Enforced rotation: group 0 type 'flex2'
Linking all bonded interactions to atoms
Intra-simulation communication will occur every 1 steps.
There are: 4 Atoms
Atom distribution over 2 domains: av 2 stddev 0 min 2 max 2
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest
Initial temperature: 0 K
Started mdrun on rank 0 Mon Mar 18 11:49:53 2019
           Step           Time
              0        0.00200
   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   COM Pull En.      Potential
    0.00000e+00    8.39657e-02    0.00000e+00    1.70650e+02    1.70734e+02
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
    5.25263e-02    1.70786e+02    1.40388e+00   -1.14549e-05    1.12425e-03
#3 Updated by Szilárd Páll 6 months ago
This is a CPU run (see the -nb cpu on the command line above), so it is not related to any GPU stuff (besides, it also uses DD, so PME can't be running on the GPU).
Perhaps we should stress-test the rotation case by running repeatedly and see if we can observe the failure?
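One way to do that would be a small repeat-run harness. The sketch below is only an illustration: the command is copied from the console log above, while the helper name stress_test and the run count are made up here and are not part of the regressiontests scripts. On POSIX, a process killed by a signal shows up as a negative return code (for example -8 for SIGFPE), so such crashes are recorded as failures too.

```python
import subprocess

# Command taken from the console log above (without -wdir and the redirect).
MDRUN_CMD = ["mpirun", "-np", "2", "gmx", "mdrun",
             "-nb", "cpu", "-ntomp", "2", "-notunepme"]


def stress_test(cmd, runs, cwd=None):
    """Run cmd `runs` times and return (iteration, returncode) per failure.

    Hypothetical helper for illustration; any non-zero return code,
    including a negative one from a signal, counts as a failure.
    """
    failures = []
    for i in range(runs):
        # Merge stderr into stdout and capture it so the terminal stays quiet.
        result = subprocess.run(cmd, cwd=cwd,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures
```

Calling stress_test(MDRUN_CMD, runs=200, cwd="regressiontests/rotation/flex2") on the affected build slave would then give a rough failure rate; the path and the count of 200 are placeholders to adjust.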