Task #2375

Clarify execution phases for MD simulation

Added by Eric Irrgang over 1 year ago. Updated 5 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Difficulty: uncategorized

Description

In support of a roadmap to an API layer between the user interface and the MD simulation machinery, one early set of tasks is to tighten up various changes of execution state and clarify the seams for client-library interaction.

Basic features of API-driven simulation are to

  • allow calling code (client) to provide or modify initial state
  • provide or modify simulation parameters and runtime parameters
  • directly access trajectory data during or after a simulation without expensive filesystem I/O
    • API abstraction for trajectory output and/or checkpointing
    • API-provided MDModule or call-back access to system snapshots
    • abstract or at least convertible representation of initial and final simulation state before and after performing the specified MD integration.

The above can be extracted to a parent issue at some point, but the current issue is intended to address the first bullet: define a roadmap to allow API client code to provide and/or clearly understand the initial state of a simulation. It is also an excuse to start clarifying phases of program execution such that non-user-interface aspects of mdrun can be compartmentalized into the library, allowing consistent semantics between CLI and other API-driven work. This probably involves work to encapsulate or reconsider the ownership relationships of things like modules, command-line options, and initialization defaults.

To start discussion, I would propose the following sequence of changes to submit.

0. Minor updates when state is loaded by `read_tpx` to more informatively indicate it is preliminary, pending checkpoint loads.
1. Remove or rework dependent code that requires late checkpoint loads. Potentially includes changes to what is included in checkpoint, such as parallelism runtime details.
2. Move checkpoint load earlier and clearly establish initial state.
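
One way to picture the end point of steps 0 to 2 is to give the preliminary state and the established initial state different types, so that downstream setup cannot accidentally consume state that a checkpoint might still replace. The following is only a sketch under that assumption, written for C++17. `StateData`, `PreliminaryState`, `InitialState` and `establishInitialState` are hypothetical names for illustration and are not existing GROMACS types or functions.

```cpp
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the simulation state; not a GROMACS type.
struct StateData
{
    long long          step = 0;
    std::vector<float> positions;
};

// State as read from the run input file. It is only preliminary,
// because a checkpoint, if present, supersedes it.
class PreliminaryState
{
public:
    explicit PreliminaryState(StateData fromRunInput) : data_(std::move(fromRunInput)) {}
    const StateData& data() const { return data_; }

private:
    StateData data_;
};

// State that later setup code is allowed to consume. It can only be
// created by establishInitialState(), after the checkpoint question is settled.
class InitialState
{
public:
    const StateData& data() const { return data_; }

private:
    friend InitialState establishInitialState(PreliminaryState, std::optional<StateData>);
    explicit InitialState(StateData data) : data_(std::move(data)) {}
    StateData data_;
};

// Resolve the checkpoint once, early, so that nothing downstream has to
// cope with a late checkpoint load.
InitialState establishInitialState(PreliminaryState preliminary, std::optional<StateData> checkpoint)
{
    if (checkpoint.has_value())
    {
        // Illustrative consistency check only: the particle counts must agree.
        if (checkpoint->positions.size() != preliminary.data().positions.size())
        {
            throw std::invalid_argument("Checkpoint does not match the run input");
        }
        return InitialState(std::move(*checkpoint));
    }
    return InitialState(preliminary.data());
}
```

With this shape, code that needs coordinates (for example, to set up parallelism) would accept only an `InitialState`, which makes the ordering requirement of step 2 visible in function signatures rather than in comments.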

Other changes to be addressed in separate Redmine issues include modernizing and extracting the command-line arguments (a whole other can of worms) behind something like the MdpOptionsProvider interface, though, as above with the checkpoint data, these may require discussion of what is a simulation parameter versus an execution parameter.


Related issues

Related to GROMACS - Task #3040: Refactor Restraint module (New)
Blocked by GROMACS - Feature #2605: Library access to MD runner (Closed)

Associated revisions

Revision 88c7ed2d (diff)
Added by Mark Abraham 4 months ago

Introduce tri-state enums for restarts

Both the user choice for appending and the decision about how to
implement a restart are good to express as a three-way enumeration of
mutually-exclusive possibilities, rather than booleans.

Checkpoint restarts also need to consider whether KE quantities
need to be recomputed, which is now stored in t_ekinstate alongside
the data from which it was computed.

Together, these eliminate the ContinuationOptions struct.

Several booleans in implementation objects were renamed to be
consistent with the StartingBehavior enumeration values, so that the
code is easier to understand.

Moved the call to handleRestart out of updateFromCommandLine now that
it no longer needed to be there.

Used namespaces for handlerestart.cpp

Refs #2804, #2375

Change-Id: I1128b94e947c6ef355a1b137b8978faa227ab1a0

Revision 916efac1 (diff)
Added by Mark Abraham 4 months ago

Rewrite starting behavior

Several necessary checks were deferred until the time the checkpoint
was read, which made this feature hard to implement and
understand. Having moved the code, if it will not be possible to
append, we can tell the user immediately.

Fixed a bug where mdrun -append would start from the .tpr
configuration when the checkpoint file was missing.

Opening the logfile in the non-appending cases is now
closely associated with the logic for how the restart
works.

Refs #2804, #2375

Change-Id: I83a846958619e72ddc9a5e9bae49a9b71221ad24
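
The two revisions above can be illustrated with a small sketch: a three-way user choice about appending is combined with what is found on disk to pick one of three mutually exclusive starting behaviours, and an impossible request is reported immediately instead of falling back to the .tpr configuration. Only the name `StartingBehavior` appears in the commit messages. `AppendingBehavior`, the enumerator names, `FilesFound`, `chooseStartingBehavior` and the exact policy for the automatic case are assumptions made for illustration, not the GROMACS implementation.

```cpp
#include <stdexcept>

// The user's request about appending (illustrative enumerators).
enum class AppendingBehavior
{
    Auto,       // append if possible
    Appending,  // appending was explicitly requested
    NoAppending // appending was explicitly refused
};

// How the run will actually start (illustrative enumerators).
enum class StartingBehavior
{
    NewSimulation,
    RestartWithAppending,
    RestartWithoutAppending
};

// Hypothetical summary of what was found on disk before the run starts.
struct FilesFound
{
    bool checkpoint;        // a checkpoint file exists
    bool allPreviousOutput; // every earlier output file is present
};

StartingBehavior chooseStartingBehavior(AppendingBehavior request, const FilesFound& found)
{
    if (!found.checkpoint)
    {
        // Assumed reading of the bug fix: an explicit append request
        // without a checkpoint is an error, not a silent fresh start.
        if (request == AppendingBehavior::Appending)
        {
            throw std::runtime_error("Appending was requested, but no checkpoint file was found");
        }
        return StartingBehavior::NewSimulation;
    }
    switch (request)
    {
        case AppendingBehavior::NoAppending: return StartingBehavior::RestartWithoutAppending;
        case AppendingBehavior::Appending:
            if (!found.allPreviousOutput)
            {
                throw std::runtime_error("Appending was requested, but earlier output files are missing");
            }
            return StartingBehavior::RestartWithAppending;
        case AppendingBehavior::Auto:
            // Assumed policy for this sketch: fall back to not appending.
            return found.allPreviousOutput ? StartingBehavior::RestartWithAppending
                                           : StartingBehavior::RestartWithoutAppending;
    }
    throw std::logic_error("Unhandled AppendingBehavior value");
}
```

Because the decision is made in one place and before any checkpoint contents are needed, opening the log file and other output files can follow directly from the returned value.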

Revision b7b078e2 (diff)
Added by Mark Abraham 12 days ago

Fix multi-sim restart handling in corner cases

If different simulations would have different starting behaviour,
e.g. some checkpoint files are found and some are not, then we should
not allow a restart, and do so with a useful error message.

Refs #2375

Change-Id: I8845784e8310ab6ca81db189e4a42754add03def
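
The multi-simulation check described in this revision amounts to requiring that every simulation reached the same starting decision. A minimal sketch follows, assuming the per-simulation decisions have already been gathered into one container (in practice that would need communication between ranks, which is omitted here). `requireConsistentStart` and the enumerator names are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative enumerators; only the enumeration name comes from the commit messages.
enum class StartingBehavior
{
    NewSimulation,
    RestartWithAppending,
    RestartWithoutAppending
};

// behaviors[i] is the behaviour that simulation i chose on its own.
// Returns the common behaviour, or throws naming the first simulation that differs.
StartingBehavior requireConsistentStart(const std::vector<StartingBehavior>& behaviors)
{
    if (behaviors.empty())
    {
        throw std::invalid_argument("No simulations were provided");
    }
    const StartingBehavior first = behaviors.front();
    const auto mismatch = std::find_if(behaviors.begin(), behaviors.end(),
                                       [first](StartingBehavior b) { return b != first; });
    if (mismatch != behaviors.end())
    {
        const auto index = std::distance(behaviors.begin(), mismatch);
        throw std::runtime_error("Simulation " + std::to_string(index)
                                 + " would start differently from simulation 0, e.g. because "
                                   "only some checkpoint files were found");
    }
    return first;
}
```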

History

#1 Updated by Erik Lindahl over 1 year ago

For the first step, I think it will be much simpler to move/modify existing code so that all reading of input files and possible modifications of the initial state happens before we call a main entry point.

There is also initial load balancing and other things affecting parameters, but there I think we should separate things properly into:

a) Things that we need to allow to change during any part in a simulation (e.g. redoing load balancing). For these there have to be proper re-initialization routines.

b) Things that we only want to alter as part of some initial testing/balancing. Here I think we should rather use special calls and make sure we can set up completely new simulations very fast.

This way we will be able to move to a cleaner "start simulation" API, and start by having a simple call where we simply obey the settings provided by the user, rather than hoping to redesign everything right away.

#2 Updated by Eric Irrgang over 1 year ago

Erik Lindahl wrote:

> For the first step, I think it will be much simpler to move/modify existing code so that all reading of input files and possible modifications of the initial state happens before we call a main entry point.

I would like that, but that's a very big step that I am trying to break down into smaller steps.

> b) Things that we only want to alter as part of some initial testing/balancing. Here I think we should rather use special calls and make sure we can set up completely new simulations very fast.

I am trying to tackle something like "b" first. I am basically separating the tasks of managing parameters and managing data. Managing data seems simpler and more immediately useful, but will require some shuffling of where and how some parameters are managed. I will start one or more separate Redmine issues regarding managing options / parameters, etc.

#3 Updated by Mark Abraham over 1 year ago

I have a patch series in preparation that should lead to being able to load the checkpoint immediately after the .tpr is read. This is more than mere beautification or nice organization - the checkpoint coordinates are needed for DD to do a good job.

As a side effect, we will be able to handle appending restarts better, in particular for handling opening output files more simply.

#4 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '3' for Issue #2375.
Uploader: M. Eric Irrgang
Change-Id: gromacs~master~I1eb8ea75fdaea77e0ce03f2d312c44db8df16f28
Gerrit URL: https://gerrit.gromacs.org/8141

#5 Updated by Eric Irrgang about 1 year ago

#6 Updated by Eric Irrgang about 1 year ago

  • Related to deleted (Feature #2605: Library access to MD runner)

#7 Updated by Eric Irrgang about 1 year ago

#8 Updated by Mark Abraham 5 months ago

  • Description updated (diff)

Layout fix for the multilevel bullets in the description

#9 Updated by Eric Irrgang 2 months ago

  • Related to Task #3040: Refactor Restraint module added
