Division-by-zero error on SPARC64
When we ran "mdrun" on a SPARC64 machine (the RIKEN "K" computer, Japan), we encountered a possible bug in GROMACS 5.0.6 and 5.1.
The program crashed with a division-by-zero error.
- Steps to reproduce:
1) Compile Gromacs (5.1) on K as follows.
cd gromacs-5.1/
mkdir build_mpi_d
cd build_mpi_d
cmake .. \
  -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/Toolchain-Fujitsu-Sparc64-mpi.cmake \
  -DCMAKE_INSTALL_PREFIX=/my/gromacs/install/directry \
  -DCMAKE_PREFIX_PATH=/home/apps/fftw/3.3.3 \
  -DGMX_MPI=ON \
  -DGMX_BUILD_MDRUN_ONLY=ON
make && make install
2) Submit a job that runs "mdrun", using the attached script (job_nvt.sh).
3) "mdrun" crashes with the following error messages (see the attached file for the complete output).
[f04-036:15780] *** Process received signal ***
[f04-036:15780] Signal: Floating point exception (8)
[f04-036:15780] Signal code: Floating point divide-by-zero (3)
[f04-036:15780] Failing at address: 0x806c60
[f04-036:15780] [ 0] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x806c60]
[f04-036:15780] [ 1] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x11482c]
[f04-036:15780] [ 2] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x1274fc]
[f04-036:15780] [ 3] gmx-5.1_mpi_d/bin/mdrun_mpi_d[0x11c4b0]
We also tried some variations of the compile options, including "-DGMX_DOUBLE=ON" and "-DGMX_OPENMP=ON", but none of them fixed the crash.
- A possible fix:
Patching the following line in src/gromacs/mdlib/tgroup.c:108 seems to fix the problem, but we are not sure whether this is a correct solution.
from
ekind->bNEMD=(opts->ngacc > 1 || norm(opts->acc) > 0);
to
ekind->bNEMD=(opts->ngacc > 1 && norm(opts->acc) > 0);
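A minimal sketch of why swapping "||" for "&&" changes whether the length function runs at all. The types and function names here (rvec, length_stub, nemd_with_or, nemd_with_and) are hypothetical stand-ins, not the GROMACS definitions, and the stub returns the squared length only so that the example needs no math library.

```c
#include <assert.h>

/* Hypothetical stand-ins; these are NOT the GROMACS definitions. */
typedef double rvec[3];

static int length_calls = 0; /* how many times the length function ran */

/* Stand-in for norm(): returns the squared length so that no math library
 * is needed. Only the "> 0" test and the call count matter here. */
static double length_stub(const rvec v)
{
    length_calls++;
    return v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
}

/* Shape of the original condition: the right operand is evaluated
 * whenever ngacc <= 1, including ngacc == 0. */
static int nemd_with_or(int ngacc, rvec *acc)
{
    return (ngacc > 1 || length_stub(acc[0]) > 0);
}

/* Shape of the patched condition: the right operand is evaluated
 * only when ngacc > 1. */
static int nemd_with_and(int ngacc, rvec *acc)
{
    return (ngacc > 1 && length_stub(acc[0]) > 0);
}

/* Small self-check: with ngacc == 0, "||" calls the length function
 * once and "&&" never calls it. */
static void check_short_circuit(void)
{
    rvec acc[1] = {{0.0, 0.0, 0.0}};

    length_calls = 0;
    (void) nemd_with_or(0, acc);
    assert(length_calls == 1);

    length_calls = 0;
    (void) nemd_with_and(0, acc);
    assert(length_calls == 0);
}
```

Note that the two forms are not equivalent: with a single group (ngacc == 1) and a non-zero acceleration, the "||" form yields true and the "&&" form yields false. That is consistent with the doubt expressed above about whether the patch is a correct solution.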
#3 Updated by Yu Yamamori almost 5 years ago
- File ubqwTIP3P_ff99SB-ildn_c_nvt.log added
- File ubqwTIP3P_ff99SB-ildn_c_min.gro added
- File ubqwTIP3P_ff99SB-ildn_c_nvt.tpr added
- File ubqwTIP3P_ff99SB-ildn_c_nvt.mdp added
- File ubqwTIP3P_ff99SB-ildn.top added
- File ubqwTIP3P_ff99SB-ildn.itp added
I have also uploaded the set of input files and log files.
#4 Updated by Shun Sakuraba almost 5 years ago
I helped Yamamori debug this problem on the K computer, so let me explain a little more about it.
The problem (in my understanding) is that the expression
ekind->bNEMD=(opts->ngacc > 1 || norm(opts->acc) > 0);
is an out-of-bounds or uninitialized access, because the number of accelerated groups is 0 (though I am not sure how it passes all the GROMACS tests if this assumption is true). That is why we changed "||" to "&&", as in the patch proposed in the original report.
Usually (i.e. on x86) this does not cause any problem, because the pointer itself may be valid and norm() can be computed. It returns some value, and the program continues to run.
My hypothesis is that with the Fujitsu compiler, the built-in sqrt routine in norm() uses division (I'm not sure why), leading to a division-by-zero error if strange input data is used.
#6 Updated by Yu Yamamori almost 5 years ago
I have tested the fix ("from norm() to norm2()") on K, and it works well there.
from
ekind->bNEMD=(opts->ngacc > 1 || norm(opts->acc) > 0);
to
ekind->bNEMD=(opts->ngacc > 1 || norm2(opts->acc) > 0);
The compile options and input files/mdp options of the test run are the same as the previous ones.
I have uploaded a log file from the new run.
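The norm2() variant can be sketched as follows. A vector has positive length exactly when its squared length is positive, so the "> 0" test can be done without taking a square root at all. This is an illustration under stated assumptions: rvec, norm2 and nemd_with_norm2 below are hypothetical stand-ins modelled on the names used in this report, not the GROMACS source.

```c
/* Hypothetical stand-ins modelled on the names in the report;
 * these are NOT the GROMACS definitions. */
typedef double rvec[3];

/* Squared Euclidean length: multiplications and additions only,
 * so no sqrt routine (and no division) is involved. */
static double norm2(const rvec v)
{
    return v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
}

/* Same decision as the norm()-based test, because |v| > 0 holds
 * exactly when |v|^2 > 0. */
static int nemd_with_norm2(int ngacc, rvec *acc)
{
    return (ngacc > 1 || norm2(acc[0]) > 0);
}
```

Unlike the "&&" patch, this form keeps the original logic: a single group with a non-zero acceleration is still detected.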