Project

General

Profile

Bug #1866

gmx tools dump core for fatal errors

Added by Viveca Lindahl about 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
core library
Target version:
-
Affected version - extra info:
2016
Affected version:
Difficulty:
uncategorized
Close

Description

Recently I've been getting a core dump when a gmx tool exits with a fatal error, e.g. if I put a typo in my mdp file and run gmx grompp. I've gotten it also e.g. for make_ndx. This is mostly an annoyance... but I guess it's not what is supposed to happen. Here is an example backtrace:

#0 0x00007f654bffdcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f654bffdcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f654c0010d8 in __GI_abort () at abort.c:89
#2 0x0000000000701dcf in gmx_exit_on_fatal_error (exitType=ExitType_Abort, returnValue=1) at /data/viveca/gromacs/src/gromacs/utility/fatalerror.cpp:251
#3 0x0000000000701ea5 in gmx_fatal_mpi_va(int, const char *, int, gmx_bool, gmx_bool, const char *, typedef __va_list_tag __va_list_tag *) (file=0x1ccf698 "/data/viveca/gromacs/src/gromacs/gmxpreprocess/readir.cpp", line=2640, bMaster=1, bFinalize=0,
fmt=0x1cd5fd0 "Group %s referenced in the .mdp file was not found in the index file.\nGroup names must match either [moleculetype] names or custom index group\nnames, in which case you must supply an index file to the"..., ap=0x7ffd5262d428) at /data/viveca/gromacs/src/gromacs/utility/fatalerror.cpp:270
#4 0x0000000000701f63 in gmx_fatal (f_errno=0, file=0x1ccf698 "/data/viveca/gromacs/src/gromacs/gmxpreprocess/readir.cpp", line=2640,
fmt=0x1cd5fd0 "Group %s referenced in the .mdp file was not found in the index file.\nGroup names must match either [moleculetype] names or custom index group\nnames, in which case you must supply an index file to the"...) at /data/viveca/gromacs/src/gromacs/utility/fatalerror.cpp:277

gmx-core-dump-test.tgz (456 Bytes) gmx-core-dump-test.tgz Viveca Lindahl, 12/03/2015 10:37 AM

Associated revisions

Revision 1c83c6c0 (diff)
Added by Roland Schulz over 3 years ago

Don't use abort for fatal errors

Fixes #1866

Change-Id: I44330ef3769c684fbd0c8ffd7e8987d733006a67

History

#1 Updated by Mark Abraham about 4 years ago

I suspect this is a build tree out of sync with the source code leading to linking failure, because your make command hasn't triggered a call to cmake $srcdir when it would have been useful (I forget why). Try that, and/or nuke the build tree and run cmake fresh.

#2 Updated by Viveca Lindahl about 4 years ago

I tried starting from a clean (empty) build tree but still get a core dump.

#3 Updated by Mark Abraham about 4 years ago

OK. Can you please attach some files and an actual command, so we can try to reproduce it?

#4 Updated by Viveca Lindahl about 4 years ago

gmx grompp -f empty.mdp -c conf.gro -p simple.top

should do it (I used master commit 86ede59).

#5 Updated by Szilárd Páll about 4 years ago

I've also noticed similar stuff recently during my experiments on a cluster (DD-induced fatal error termination was spewing core files), but have not had time to look into it. I did assume that something was messed up with my environment or build setup, but perhaps it was not.

#6 Updated by Alexey Shvetsov about 4 years ago

I also see similar behavior (on any gmx_fatal call) tools will dump core.

#7 Updated by Teemu Murtola almost 4 years ago

  • Category set to core library

gmx_fatal() calls abort() in most cases, which may trigger a core dump. Or is this about something else?

#8 Updated by Szilárd Páll almost 4 years ago

Teemu Murtola wrote:

gmx_fatal() calls abort() in most cases, which may trigger a core dump. Or is this about something else?

In the cases where I observed it was gmx_fatal that terminated the run. Is the core dump on gmx_fatal an intended behavior or the side-effect of the change in the gmx_fatal's implementation?

#9 Updated by Teemu Murtola almost 4 years ago

I would probably classify it as a side effect; I don't care that much whether there is a core dump or not, but the earlier implementation that called std::exit() wasn't thread-safe, so we can't go back there. There are probably some C99/C++11 alternatives that could work better.

#10 Updated by Roland Schulz over 3 years ago

std::quick_exit seems to be intended as a solution for this? Anyone knows whether this would work?

#11 Updated by Teemu Murtola over 3 years ago

It should work, if it works on the compilers/standard libraries we want.

#12 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1866.
Uploader: Roland Schulz ()
Change-Id: I44330ef3769c684fbd0c8ffd7e8987d733006a67
Gerrit URL: https://gerrit.gromacs.org/5861

#13 Updated by Roland Schulz over 3 years ago

  • Status changed from New to Fix uploaded

Ubuntu 12.04 has libc 2.15 but quick_exit was only added in 2.16. So I added a cmake check and it falls back to abort if quick_exit is missing. Note that the user can use "ulimit -c 0" if they have a old libc and want to disable core dump.

#14 Updated by Szilárd Páll over 3 years ago

Roland Schulz wrote:

Ubuntu 12.04 has libc 2.15 but quick_exit was only added in 2.16. So I added a cmake check and it falls back to abort if quick_exit is missing. Note that the user can use "ulimit -c 0" if they have a old libc and want to disable core dump.

Sounds good. Although note that it is not a matter of wanting to disable core dump, I think. IMO there is simply no reason to dump core upon fatal error, nobody is interested in debugging why mdrun refused to start with incorrect options.

#15 Updated by Teemu Murtola over 3 years ago

Szilárd Páll wrote:

Although note that it is not a matter of wanting to disable core dump, I think. IMO there is simply no reason to dump core upon fatal error, nobody is interested in debugging why mdrun refused to start with incorrect options.

I think you've missed something here. In standard C/C++ before C99/C++11, there are only two ways to exit the process in mid-execution: exit() and abort(). exit() is not thread-safe (in particular, not for C++ code). abort() can produce core dumps. Is it also your opinion that there is no reason to avoid crashes, deadlocks or other undefined behavior that can occur if we don't use abort()?

#16 Updated by Szilárd Páll over 3 years ago

  • Affected version - extra info set to 2016

Teemu Murtola wrote:

I think you've missed something here. In standard C/C++ before C99/C++11, there are only two ways to exit the process in mid-execution: exit() and abort(). exit() is not thread-safe (in particular, not for C++ code). abort() can produce core dumps. Is it also your opinion that there is no reason to avoid crashes, deadlocks or other undefined behavior that can occur if we don't use abort()?

I did not miss that technical point. However, the behavior of a GROMACS tool upon encountering an error and the details of the implementation of this event are, although related, separate issues. Even if the implementation is technically sound, if it ends up being annoying in practice use, it can become a usability concern or just a minor, constant annoyance. That's why I noted that the question should IMO not be whether a user wants to disable core dumps, but rather how/whether the unnecessary side-effect of the current implementation can be fixed.

#17 Updated by Teemu Murtola over 3 years ago

Szilárd Páll wrote:

I did not miss that technical point. However, the behavior of a GROMACS tool upon encountering an error and the details of the implementation of this event are, although related, separate issues. Even if the implementation is technically sound, if it ends up being annoying in practice use, it can become a usability concern or just a minor, constant annoyance. That's why I noted that the question should IMO not be whether a user wants to disable core dumps, but rather how/whether the unnecessary side-effect of the current implementation can be fixed.

And the answer is that without C99/C++11, there is no way to fix this without compromising other behavior. So the choice was really three-fold: exit(), abort(), or dependency on previously-unused C99/C++11 features. When I mentioned the last alternative, the discussion just died several months ago. And based on previous experience, introducing additional C++11 dependencies is, honestly, potentially so much hassle down the road that I will rather spend my time on something else than drive adoption of this stuff.

#18 Updated by Roland Schulz over 3 years ago

  • Status changed from Fix uploaded to Resolved

#19 Updated by Erik Lindahl over 3 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF