Project

General

Profile

Bug #1918

unexpected behavior of termination upon signal/maxh

Added by Szilárd Páll almost 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

I can't seem to get correct termination of mdrun to happen with either SIGTERM/SIGINT signal received nor with -maxh. It seems that at the first signal received (and "stopping at the next NS step" is printed), the termination instead happens at the second NS step. The same behavior seems to be present with -maxh.

When two signals are received, instead of stopping at the next step, termination happens at the next NS step.

Attached stderr outputs of the three aforementioned cases.

test_sigterm_stderr.out (7.17 KB) test_sigterm_stderr.out Szilárd Páll, 03/13/2016 01:44 AM
test_sigterm_2x_stderr.out (3.67 KB) test_sigterm_2x_stderr.out Szilárd Páll, 03/13/2016 01:44 AM
test_maxh0.001_stderr.out (7.21 KB) test_maxh0.001_stderr.out Szilárd Páll, 03/13/2016 01:44 AM

Associated revisions

Revision 54733837 (diff)
Added by Berk Hess over 3 years ago

Update mdrun termination info

Both the mdrun help and stderr/log messages reported that mdrun
would stop at the next step or next nstlist step, which has not been
correct for some time.

Fixes #1918.

Change-Id: Ia890ffb3ccfdbdbed3d003b149c3cbb55e5c1818

History

#1 Updated by Mark Abraham almost 4 years ago

One signal is supposed to trigger an orderly shutdown, like -maxh and -nsteps. Two signals doesn't do an orderly shutdown, and IIRC the main feature there is that no checkpoint is written. You've observed what I understand to be the intended behaviour. I think that the message string should be less precise (perhaps it dates from an earlier era).

Coordinating all ranks to receive the internal version of that signal (via gs in md.cpp) only works at an NS step, and takes place after writing trajectory-like output for that MD step. Actually writing a checkpoint can only happen after global communication, which happens only at the NS step after that. We could write more to code to
  • permit us to retroactively write the checkpoint we would have written before the velocity update, or
  • to terminate by writing a checkpoint at nstlist+1 steps, or
  • to add another global synchronization point just to do this check,

but none of that seems valuable.

#2 Updated by Szilárd Páll almost 4 years ago

The problem is that neither of the termination behaviors match what the messages claim:
  • "stopping at the next step" (two signals) simply does not happen, mdrun terminates at the next NS step.
  • "stopping at the next NS step" after one signal is also a false claim mdrun terminates at the next-next NS step (same for -maxh).

#3 Updated by Berk Hess almost 4 years ago

I noticed this change in behavior as well.
I think it's fine to stop at the next NS step (but the documentation should match).
But I don't see a reason why we should stop at the next-next NS step. We should be able to stop at the next NS step, checkpointing is done after signal communication.

#4 Updated by Szilárd Páll almost 4 years ago

Berk Hess wrote:

I noticed this change in behavior as well.
I think it's fine to stop at the next NS step (but the documentation should match).

Do you mean that the double signal case should simply be merged with the single one? That would mean "swallowing" a TERM/INT. The slight inconvenience of that is that often you end up doing a third "Ctrl+C" which then kills mdrun without writing a full log.

But I don't see a reason why we should stop at the next-next NS step. We should be able to stop at the next NS step, checkpointing is done after signal communication.

I thought so.

#5 Updated by Mark Abraham almost 4 years ago

Szilárd Páll wrote:

The problem is that neither of the termination behaviors match what the messages claim:
  • "stopping at the next step" (two signals) simply does not happen, mdrun terminates at the next NS step.
  • "stopping at the next NS step" after one signal is also a false claim mdrun terminates at the next-next NS step (same for -maxh).

Yes, I already agreed the message should be different. :-)

#6 Updated by Mark Abraham almost 4 years ago

Berk Hess wrote:

I noticed this change in behavior as well.
I think it's fine to stop at the next NS step (but the documentation should match).
But I don't see a reason why we should stop at the next-next NS step. We should be able to stop at the next NS step, checkpointing is done after signal communication.

No, any writing has already been done for that step, and will not happen until an NS step (by construction of bCPT). For example in master branch, the compute_globals call that reduces gs is on line 1534. do_md_trajectory_writing handles the writing of the checkpoint, and it was already called on line 1282.

#7 Updated by Szilárd Páll almost 4 years ago

Mark Abraham wrote:

Yes, I already agreed the message should be different. :-)

Yeah, but I'm arguing that an existing long-time feature that used to work is broken (those console messages are not just an accident, I think) and we should at least consider fixing (some of) them rather than alter the code printout/documentation to the new "broken" reality. :)

#8 Updated by Mark Abraham almost 4 years ago

Szilárd Páll wrote:

Mark Abraham wrote:

Yes, I already agreed the message should be different. :-)

Yeah, but I'm arguing that an existing long-time feature that used to work is broken (those console messages are not just an accident, I think)

Yeah, but they probably date from when we only had PD and maybe back then global synchronization every step was a "feature" ;-)

and we should at least consider fixing (some of) them rather than alter the code printout/documentation to the new "broken" reality. :)

As I already said in the first comment, I do not think we should introduce global synchronization (which must happen every step in order that a SIGTERM received at any rank is handled on all ranks at the next step), rather than handle life at the next NS step(s).

#9 Updated by Szilárd Páll almost 4 years ago

Mark Abraham wrote:

Szilárd Páll wrote:

Mark Abraham wrote:

Yes, I already agreed the message should be different. :-)

Yeah, but I'm arguing that an existing long-time feature that used to work is broken (those console messages are not just an accident, I think)

Yeah, but they probably date from when we only had PD and maybe back then global synchronization every step was a "feature" ;-)

That could could be the case.

As I already said in the first comment, I do not think we should introduce global synchronization (which must happen every step in order that a SIGTERM received at any rank is handled on all ranks at the next step), rather than handle life at the next NS step(s).

Sorry, I skipped through your reply to Berk.

It seems that it should be feasible to implement MPI_put()-based signalling (without unnecessary barriers), but admittedly I do not see enough benefit that would warrant it.

#10 Updated by Gerrit Code Review Bot almost 4 years ago

Gerrit received a related patchset '1' for Issue #1918.
Uploader: Berk Hess ()
Change-Id: Ia890ffb3ccfdbdbed3d003b149c3cbb55e5c1818
Gerrit URL: https://gerrit.gromacs.org/5723

#11 Updated by Erik Lindahl over 3 years ago

  • Status changed from New to Resolved

#12 Updated by Erik Lindahl over 3 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF