Bug #1431

rotation/flex-t regressiontest failing on BG/Q

Added by Mark Abraham over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Category: mdrun
Target version:
Affected version - extra info: and master
Affected version:
Difficulty: uncategorized

Description

This test fails on BlueGene/Q in single precision on both the release-4-6 and master branches, despite all other tests passing. It passes in double precision. The failure is at step 0, and pretty hard! :)

Enforced rotation: group 0 type 'flex-t'
Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest
There are: 4 Atoms
Max number of connections per atom is 0
Total number of connections is 0
Initial temperature: 0 K

Started mdrun on node 0 Thu Jan  1 01:00:00 1970
           Step           Time         Lambda
              0        0.00200        0.00000

Grid: 20 x 20 x 20 cells
   Energies (kJ/mol)
        LJ (SR)  Disper. corr.   Coulomb (SR)   COM Pull En.      Potential
    0.00000e+00   -1.76597e-04    0.00000e+00            nan            nan
    Kinetic En.   Total Energy    Temperature Pres. DC (bar) Pressure (bar)
            nan            nan            nan   -1.14549e-05            nan

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6.6-dev-20140129-049b857-local
Source code file: ../src/mdlib/pull_rotation.c, line: 2501

Fatal error:
Enforced rotation: No reference data for first slab (n=-2147483648), unable to proceed.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

I imagine the error message is a symptom, not the cause. Attached is a tarball of the regressiontests stdout and all the rotation test results, in case they are useful. Without more understanding of the rotation code, I have no idea what could be wrong.

flex-t-fail.tgz (506 KB), Mark Abraham, 02/04/2014 01:19 AM

Associated revisions

Revision b2076858 (diff)
Added by Carsten Kutzner over 3 years ago

Avoid cross product with zero vector in rotational pulling.

Fixes #1431 (rotation/flex-t regression test failing on BG/Q)

In do_flex_lowlevel() we checked (by mistake!) for xj-xcn being
zero, although we need to check for yj0-ycn being zero, since
we use yj0-ycn in a cross product in the following lines of code.
I now also replaced the direct check (0 == norm(...)) by checking
what gmx_numzero(norm(...)) returns. The latter replacement
was also applied in the do_flex2_lowlevel() routine. Note that there
the check for small xj-xcn was and is actually correct.

Change-Id: I972b6d67a81e30f297db286cd2224f66753a20aa

History

#1 Updated by Carsten Kutzner over 3 years ago

  • Assignee changed from Berk Hess to Carsten Kutzner

#2 Updated by Carsten Kutzner over 3 years ago

Actually for me all the tests pass on BG. But I compiled with different settings.
Maybe that can be narrowed down to the compiler, or FFT, or MPI library ...

Mine:
Log file opened on Fri Feb 21 16:03:50 2014
Host: juqueen3.zam.kfa-juelich.de pid: 2282 nodeid: 0 nnodes: 1
Gromacs version: VERSION 4.6.6-dev-20140211-dd2f13a
GIT SHA1 hash: dd2f13a0570ba46b9126ae14a7088ffb7da6990e
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled
GPU support: disabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: NONE
FFT library: fftpack (built-in)
Large file support: enabled
RDTSCP usage: disabled
Built on: Fri Feb 21 15:51:47 CET 2014
Built by: [CMAKE]
Build OS/arch: Linux 2.6.32-431.3.1.el6.ppc64 ppc64
Build CPU vendor: IBM
Build CPU brand: POWER7 (architected), altivec supported
Build CPU family: 0 Model: 0 Stepping: 0
Build CPU features: CannotDetect
C compiler: /usr/bin/cc GNU cc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
C compiler flags: -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wall -Wno-unused -Wunused-value -fno-inline -g

Yours:
Log file opened on Tue Feb 4 01:02:22 2014
Host: R63-ID-J01.zam.kfa-juelich.de pid: 1 nodeid: 0 nnodes: 1
Gromacs version: VERSION 4.6.6-dev-20140129-049b857-local
GIT SHA1 hash: 049b8574b450eb7f2bc7131d7dabe98af64594dc
Branched from: unknown
Precision: single
Memory model: 64 bit
MPI library: MPI
OpenMP support: enabled
GPU support: disabled
invsqrt routine: (1.0/sqrtf(x))
CPU acceleration: IBM_QPX
FFT library: fftw-3.3.3-fma
Large file support: enabled
RDTSCP usage: disabled
Built on: Mon Feb 3 20:20:13 CET 2014
Built by: [CMAKE]
Build OS/arch: Linux 2.6.32-431.el6.ppc64 ppc64
Build CPU vendor: Unknown, cross-compiled
Build CPU brand: Unknown, cross-compiled
Build CPU family: 0 Model: 0 Stepping: 0
Build CPU features:
C compiler: /bgsys/drivers/ppcfloor/comm/xl.ndebug/bin/mpixlc_r XL IBM XL C/C++ for Blue Gene, V12.1
C compiler flags: -qlanglvl=extc99 -qarch=auto -qtune=auto -qthreaded -qalias=noansi -qhalt=e -O3 -DNDEBUG -qsuppress=1500-036

#3 Updated by Mark Abraham over 3 years ago

OK, that's something. But one would never want to compile with gcc for BG/Q unless desperate. I can probably use the XLC platform file to try fftpack and software_invsqrt.

Is there any use of SIMD or 1/sqrt that is particular to flex-t and no other code?

#4 Updated by Carsten Kutzner over 3 years ago

No, there isn't. Note that flex and flex-t even use the same compute kernel.
Maybe something goes wrong with the constant conversion in line 4064?

#5 Updated by Mark Abraham over 3 years ago

Of which file / branch? :-) there's a repo link above, if possible

#6 Updated by Carsten Kutzner over 3 years ago

file pull_rotation.c, branch 4.6, I mean the svmul().

Could you provide your compile/install script? Then I could use that to trigger the bug.
I am having trouble compiling v. 4.6 with mpixlc_r ...

#7 Updated by Carsten Kutzner over 3 years ago

To further narrow it down a bit, here is another version that passes all rotation tests. This time with the bgxlc compiler, but still without CPU acceleration and with fftpack.

Log file opened on Fri Feb 21 18:27:50 2014
Host: juqueen3.zam.kfa-juelich.de pid: 22335 nodeid: 0 nnodes: 1
Gromacs version: VERSION 4.6.6-dev-20140211-dd2f13a
GIT SHA1 hash: dd2f13a0570ba46b9126ae14a7088ffb7da6990e
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: disabled
GPU support: disabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: NONE
FFT library: fftpack (built-in)
Large file support: enabled
RDTSCP usage: disabled
Built on: Fri Feb 21 18:23:23 CET 2014
Built by: [CMAKE]
Build OS/arch: Linux 2.6.32-431.3.1.el6.ppc64 ppc64
Build CPU vendor: IBM
Build CPU brand: POWER7 (architected), altivec supported
Build CPU family: 0 Model: 0 Stepping: 0
Build CPU features: CannotDetect
C compiler: /usr/bin/bgxlc XL IBM XL C/C++ for Blue Gene, V12.1
C compiler flags: -qlanglvl=extc99 -qarch=auto -qtune=auto -qthreaded -qalias=noansi -qhalt=e -O -DNDEBUG

#8 Updated by Mark Abraham over 3 years ago

Carsten Kutzner wrote:

file pull_rotation.c, branch 4.6, I mean the svmul().

Could you provide your compile/install script? Then I could use that to trigger the bug.
I am having trouble compiling v. 4.6 with mpixlc_r ...

There's a platform file in the repo, and the install guide covers it last I checked. No batteries required! :-)

#9 Updated by Mark Abraham over 3 years ago

As a guess, I'd be concerned that stuff related to http://redmine.gromacs.org/projects/gromacs/repository/revisions/release-4-6/entry/src/mdlib/pull_rotation.c#L2241 might be unstable. There's probably a 1/rsqrt in the norm. Can any of the quantities lead unluckily to a zero vector in single and not in double?

#10 Updated by Carsten Kutzner over 3 years ago

Maybe it is something like that.
But the very block of code you point to should not be the problem, since bCalcPotFit evaluates to FALSE for the regression tests (because fit-type != potential in the .mdp files).

I still have not been able to reproduce the bug on BG/Q. Now I compiled with QPX, but still all tests pass.

#11 Updated by Carsten Kutzner over 3 years ago

Hi Mark,
For some reason I cannot get the 4.6 version to compile using the platform script.

cmake .. \
-DGMX_MPI=ON \
-DGMX_PREFER_STATIC_LIBS=ON \
-DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneQ-static-XL-CXX.cmake

It always wants to link to .so versions of MPI libs etc, although the .a versions are there.

With master, it works, although I had no luck ;) in getting the flex test to fail. Not sure how to proceed now.
Is there somewhere I can look up the exact compilation settings Jenkins used for 4.6 / master, where the flex test fails?

Thanks,
Carsten

#12 Updated by Mark Abraham over 3 years ago

Carsten Kutzner wrote:

Hi Mark,
For some reason I cannot get the 4.6 version to compile using the platform script.

cmake .. \
-DGMX_MPI=ON \
-DGMX_PREFER_STATIC_LIBS=ON \
-DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneQ-static-XL-CXX.cmake

Should be OK, unless the CXX file is broken and the C one works? But I doubt that.

It always wants to link to .so versions of MPI libs etc, although the .a versions are there.

That's always just worked for me, without needing GMX_PREFER_STATIC_LIBS

With master, it works, although I had no luck ;) in getting the flex test to fail. Not sure how to proceed now.
Is there somewhere I can look up the exact compilation settings Jenkins used for 4.6 / master, where the flex test fails?

The tarball here will have some of that stuff in the top of the .log file, which might indicate where a relevant difference could be.

#13 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1431.
Uploader: Carsten Kutzner
Change-Id: I972b6d67a81e30f297db286cd2224f66753a20aa
Gerrit URL: https://gerrit.gromacs.org/3261

#14 Updated by Carsten Kutzner over 3 years ago

  • Status changed from New to Fix uploaded

#15 Updated by Carsten Kutzner over 3 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#16 Updated by Carsten Kutzner over 3 years ago

  • Status changed from Resolved to Closed
