Bug #1856

division-by-zero error on SPARC64

Added by Yu Yamamori about 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info: 5.0.6 is also affected
Affected version:
Difficulty: uncategorized

Description

When running "mdrun" on a SPARC64 machine (the RIKEN "K" computer, Japan), we encountered a possible bug in GROMACS 5.0.6 and 5.1.
The program crashed with a division-by-zero error.

  • Steps to reproduce:

1) Compile GROMACS (5.1) on K as follows.

cd gromacs-5.1/

mkdir build_mpi_d

cd build_mpi_d

cmake .. \
     -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/Toolchain-Fujitsu-Sparc64-mpi.cmake \
     -DCMAKE_INSTALL_PREFIX=/my/gromacs/install/directry \
     -DCMAKE_PREFIX_PATH=/home/apps/fftw/3.3.3 \
     -DGMX_MPI=ON \
     -DGMX_BUILD_MDRUN_ONLY=ON

make && make install

2) Submit a job to run "mdrun" using the attached file (job_nvt.sh).

pjsub job.sh

3) "mdrun" crashes with the following error messages (see the attached file for the complete output).

[f04-036:15780] *** Process received signal ***
[f04-036:15780] Signal: Floating point exception (8)
[f04-036:15780] Signal code: Floating point divide-by-zero (3)
[f04-036:15780] Failing at address: 0x806c60
[f04-036:15780] [ 0] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x806c60]
[f04-036:15780] [ 1] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x11482c]
[f04-036:15780] [ 2] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x1274fc]
[f04-036:15780] [ 3] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x11c4b0]


We also tried variations of the compile options, including "-DGMX_DOUBLE=ON" and "-DGMX_OPENMP=ON", but neither of them worked.
  • A possible fix:

Patching the following line in src/gromacs/mdlib/tgroup.c:108 seems to fix the problem, but we are not sure whether this is a correct solution.

from
ekind->bNEMD=(opts->ngacc > 1 || norm(opts->acc[0]) > 0);
to
ekind->bNEMD=(opts->ngacc > 1 && norm(opts->acc[0]) > 0);
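
To make the difference between the two conditions concrete, here is a minimal sketch that evaluates both forms. The function names and the plain int/double parameters are hypothetical stand-ins for opts->ngacc and norm(opts->acc[0]); none of this is GROMACS code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: ngacc mirrors opts->ngacc, acc_norm mirrors
 * norm(opts->acc[0]). Neither function exists in GROMACS. */

/* Original condition: several groups OR a non-zero acceleration on group 0. */
bool nemd_original(int ngacc, double acc_norm)
{
    return ngacc > 1 || acc_norm > 0;
}

/* Proposed patch: several groups AND a non-zero acceleration on group 0. */
bool nemd_patched(int ngacc, double acc_norm)
{
    return ngacc > 1 && acc_norm > 0;
}
```

The two forms disagree in two cases: a single group with a non-zero acceleration (ngacc = 1), and several groups where group 0 has zero acceleration. In both, the original returns true and the patched form returns false. In the real expression, && also skips the norm() call whenever ngacc is 0 or 1, which is consistent with the crash disappearing, but the patch changes the result for the cases above as well.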


Regards,
job.sh (1.1 KB), Yu Yamamori, 11/18/2015 01:54 AM
nvt.e4506631 (13.6 KB): error message, Yu Yamamori, 11/18/2015 01:54 AM
nvt.mdp (841 Bytes), Yu Yamamori, 11/19/2015 02:03 AM
ubqwTIP3P_ff99SB-ildn_c_nvt.log (23.5 KB): log file (in the case of fixing as above), Yu Yamamori, 11/19/2015 03:15 AM
ubqwTIP3P_ff99SB-ildn_c_min.gro (1.16 MB): initial coordinate file, Yu Yamamori, 11/19/2015 03:15 AM
ubqwTIP3P_ff99SB-ildn_c_nvt.mdp (11.4 KB), Yu Yamamori, 11/19/2015 03:15 AM
ubqwTIP3P_ff99SB-ildn.top (351 KB): topology file (ubiquitin in water, Amber99SB-ildn, TIP3P), Yu Yamamori, 11/19/2015 03:15 AM
ubqwTIP3P_ff99SB-ildn.itp (18.5 KB), Yu Yamamori, 11/19/2015 03:15 AM
ubqwTIP3P_ff99SB-ildn_c_nvt.tpr (1.66 MB), Yu Yamamori, 11/19/2015 03:15 AM
ubqwTIP3P_ff99SB-ildn_c_nvt.log (23.5 KB): log file (in the case of "norm2" fixing), Yu Yamamori, 12/14/2015 04:12 AM

Associated revisions

Revision 7416722c (diff)
Added by Carsten Kutzner almost 4 years ago

Changed norm(...) > 0 condition to norm2(...) > 0 in init_ekindata()

Fixes issue #1856 on K computer

Change-Id: Ib4f301f17124077bb6cc2aa6b955d01ccdcaec1b

History

#1 Updated by Berk Hess about 4 years ago

  • Status changed from New to Accepted

Do you have acc-grps and/or accelerate set in your mdp file?

#2 Updated by Yu Yamamori about 4 years ago

I did not use either the acc-grps or the accelerate option in my mdp file.

I have attached the actual mdp file (since this is a test run, the number of steps is 100).

#4 Updated by Shun Sakuraba about 4 years ago

I helped Yamamori debug this problem on the K computer, so let me explain a bit about it.

The problem (in my understanding) is that the expression

ekind->bNEMD=(opts->ngacc > 1 || norm(opts->acc[0]) > 0);

is an out-of-bounds or uninitialized access, because the number of accelerated groups is 0 (though I am not sure how it passes all the GROMACS tests if this assumption is true). That is why we changed this to &&.
Usually (i.e. on x86) this does not cause any problem, as the pointer itself may be valid and norm() can be computed. It returns some value, and the program continues to run.
My hypothesis is that with the Fujitsu compiler, the built-in sqrt routine in norm() uses division (I'm not sure why), leading to a division-by-zero error when it is given strange input data.
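
If that understanding is right, one way to avoid touching acc[0] when there are no groups is to check the group count first. The sketch below is only an illustration under that assumption: demo_opts_t and demo_is_nemd are hypothetical names, and the struct is a heavily simplified stand-in for the real options type.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the relevant option fields;
 * the real GROMACS struct has many more members. */
typedef struct
{
    int     ngacc;     /* number of acceleration groups, assumed to allow 0 */
    double (*acc)[3];  /* ngacc acceleration vectors; may be NULL if ngacc == 0 */
} demo_opts_t;

/* Reads acc[0] only when exactly one group exists. */
bool demo_is_nemd(const demo_opts_t *opts)
{
    if (opts->ngacc > 1)
    {
        return true;   /* several groups: no need to inspect acc[0] */
    }
    if (opts->ngacc == 1)
    {
        const double *a = opts->acc[0];
        return a[0] * a[0] + a[1] * a[1] + a[2] * a[2] > 0;
    }
    return false;      /* ngacc == 0: there is no acc[0] to read */
}
```

Unlike the && patch, this keeps the original result for a single group with a non-zero acceleration, and it never reads acc[0] when the group count is 0.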

#5 Updated by Erik Lindahl about 4 years ago

Hi,

I would rather change it from using norm() to use norm2() instead. This avoids the entire square root calculation, which is anyway what we should do when we don't absolutely need the distance.

Can you check if that works?
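
As a rough illustration of this suggestion: a "greater than zero" test needs only the squared length, so the square root can be skipped entirely. The helpers below are hypothetical stand-ins written for this example, not the GROMACS norm2() implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a squared vector norm: no sqrt involved. */
double demo_norm2(const double v[3])
{
    return v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
}

/* A vector has non-zero length exactly when its squared length is positive,
 * so the sign test can be done without taking the square root. */
bool demo_is_nonzero(const double v[3])
{
    return demo_norm2(v) > 0;
}
```

Because the square root of a non-negative number is positive exactly when the number itself is positive, comparing the squared norm against 0 gives the same answer as comparing the norm against 0.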

#6 Updated by Yu Yamamori about 4 years ago

Dear all,

I have checked the fix (changing norm() to norm2()) on K, and it works well.

src/gromacs/mdlib/tgroup.c:108

from
ekind->bNEMD=(opts->ngacc > 1 || norm(opts->acc[0]) > 0);
to
ekind->bNEMD=(opts->ngacc > 1 || norm2(opts->acc[0]) > 0);

The compile options and input files/mdp options of the test run are the same as the previous ones.

I have uploaded a log file of the new run.

#7 Updated by Gerrit Code Review Bot almost 4 years ago

Gerrit received a related patchset '1' for Issue #1856.
Uploader: Carsten Kutzner
Change-Id: Ib4f301f17124077bb6cc2aa6b955d01ccdcaec1b
Gerrit URL: https://gerrit.gromacs.org/5643

#8 Updated by Carsten Kutzner almost 4 years ago

Bug fixed in 5.0; the fix should propagate to 5.1 soon.

#9 Updated by Carsten Kutzner almost 4 years ago

  • Status changed from Accepted to Closed

#10 Updated by Mark Abraham almost 4 years ago

  • Target version set to 5.0.8

#11 Updated by Mark Abraham almost 4 years ago

I have merged the fix into the release-5-1 branch, so it will appear in 5.1.3.
