Feature #3242

Please do not remove the -nsteps flag

Added by Victor Rusu 2 months ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Difficulty: uncategorized

Description

Dear GROMACS team,

As part of a support team at an HPC center, one of my duties is to help users run jobs on our systems.
In my daily work, one of the flags I use most is -nsteps. It lets me easily test users' input files on different systems and architectures with a reasonable simulation time, without having to modify the "original" user input for each system.

Could you please keep the -nsteps flag on the command line?

The related warning from GROMACS is this:

The -nsteps functionality is deprecated, and may be removed in a future version. Consider using gmx convert-tpr -nsteps or changing the appropriate .mdp file field.

Thanks.


Related issues

Related to GROMACS - Task #3256: Remove -nsteps option from mdrun (New)
Related to GROMACS - Task #2569: announce deprecations in GROMACS 2019 (Closed)
Related to GROMACS - Task #3289: Distinguish identifying and non-identifying inputs to API operations. (New)

History

#1 Updated by Eric Irrgang 2 months ago

Can you clarify your use case? Part of the problem is that, historically, `nsteps` did too many things and had overly complicated semantics that were hard to determine and document correctly.

For instance, would it be sufficient for you to have a command-line flag that explicitly set the maximum number of steps that the command should run, similar to -maxh?

Or would it be sufficient to have a script that produced a new input file with an altered nsteps from the user-provided input? Or is it important that the altered simulation input be possible without creating a new filesystem artifact?

#2 Updated by Mark Abraham 2 months ago

Victor Rusu wrote:

Dear GROMACS team,

As part of a support team at an HPC center, one of my duties is to help users run jobs on our systems.
In my daily work, one of the flags I use most is -nsteps. It lets me easily test users' input files on different systems and architectures with a reasonable simulation time, without having to modify the "original" user input for each system.

If you want to limit run time, that is best done with mdrun -maxh. Does that suit your use case?

#3 Updated by Victor Rusu 2 months ago

Dear Mark and Eric,

Thanks for the replies.
We do use both -maxh and -nsteps. We use these flags because we look not only at the intermediate information in the log file but also at the statistics at the end of the file, without interfering with the user input.
When there are claims that GROMACS performance and/or sanity is fluctuating on a computing system, we take the user input and run it on several nodes and/or clusters (using an automated system).

So, we test the input files running for the same amount of time (maxh) and for the same number of steps (nsteps). And we check the output for all cases.

Depending on the system size, some inputs need to run for minutes, while others run in seconds. So -nsteps is a great way to say "run for X steps, however long it takes", so that I can check the energy at that step.
The automated system compares the energies and performance. To get consistent energies we rely heavily on -nsteps.
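The comparison described above might look roughly like the following sketch. It assumes every node ran the same input for the same number of steps and that the final energy has already been extracted from each log. The node names, energy values and tolerance are made up for illustration and are not taken from any real run.

```python
import math

def inconsistent_nodes(energies, rel_tol=1e-5):
    """Return the nodes whose final energy deviates from the median value.

    `energies` maps a node name to the energy reported at the common final
    step. `rel_tol` is an assumed tolerance; a real check would have to pick
    one that suits the system and the precision of the build.
    """
    values = sorted(energies.values())
    reference = values[len(values) // 2]  # median (upper median for even counts)
    return sorted(
        node
        for node, energy in energies.items()
        if not math.isclose(energy, reference, rel_tol=rel_tol)
    )

# Hypothetical energies (kJ/mol) after the same number of steps on three nodes.
final_energies = {"node01": -512345.2, "node02": -512345.3, "node03": -498000.0}
print(inconsistent_nodes(final_energies))
```

Using the median as the reference keeps a single faulty node from shifting the baseline, which matters when only a handful of nodes are compared.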

-maxh is mostly used for performance checks, and we tend to use it when we perform checks manually, because history has shown that it is difficult for us to find an optimal -maxh value that fits all the user input files and systems we have.
By contrast, 5'000, 10'000 and 50'000 steps have proven to be enough to screen for problems in the binaries and hardware.

I think for our use cases a flag like -maxsteps would behave exactly the same as -nsteps.

Please let me know if I was able to properly explain the use cases.

#4 Updated by Eric Irrgang 2 months ago

The workaround is to create a new TPR file with convert-tpr, as noted in the deprecation warning.
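As a rough sketch of that workaround, the helper below builds the two command lines: one `gmx convert-tpr -nsteps` call that writes a shortened copy of the run input, and one `gmx mdrun -s` call that runs it. The helper name and the file names are hypothetical, and nothing is executed here; the commands are only printed.

```python
import shlex

def shortened_run_commands(tpr, nsteps, out_tpr=None):
    """Return two argv lists: rewrite the step count, then run the new TPR."""
    if out_tpr is None:
        # Hypothetical naming scheme: topol.tpr -> topol_5000_steps.tpr
        out_tpr = f"{tpr.rsplit('.', 1)[0]}_{nsteps}_steps.tpr"
    convert = ["gmx", "convert-tpr", "-s", tpr, "-nsteps", str(nsteps), "-o", out_tpr]
    run = ["gmx", "mdrun", "-s", out_tpr]
    return convert, run

convert, run = shortened_run_commands("topol.tpr", 5000)
print(shlex.join(convert))
print(shlex.join(run))
```

The user's original TPR file is left untouched, at the cost of one extra file per step count.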

In the next few months, we need to implement API functionality that allows simulation input preparation in-memory so that a new file is not necessary, but the sensible implementation of a -maxsteps command line flag would be implemented more like -maxh, I believe, and not alter the simulation input record (which is part of the problem with the -nsteps implementation).

Such an option should probably get some buy-in from other developers before anyone spends time implementing it (though it should be pretty straight-forward), so it could help to update the issue title and description, and to bring it up on the developer mailing list or at the biweekly developer teleconferences.

#5 Updated by Victor Rusu 2 months ago

I understand the reason why you guys want to drop the nsteps. And I also know about the workaround.
In practice, what I am really asking is this: instead of dropping -nsteps and perhaps implementing the new command-line option in a near (but not clear or committed) future, please do not drop -nsteps until you have a real solution that replaces it.

#6 Updated by Eric Irrgang 2 months ago

Thank you for helping to clarify use cases that were not sufficiently addressed before the deprecation was announced. I don't see a task on the roadmap yet to remove -nsteps by any particular release, presumably because the deprecation warning is still helping to solicit feedback.

I think it will be constructive to recast this issue as a feature request for -maxsteps that is required before -nsteps can be removed. A title such as "-maxsteps option to handle an -nsteps use case" might be effective. I didn't want to change the issue title or description without your agreement, though.

#7 Updated by Eric Irrgang 2 months ago

Update: I created a task #3256 to track the deprecated -nsteps option. You can "watch" that task for updates. The current issue tracking system allows us to mark that task as "blocked" by, say, a feature request for "-maxsteps", but I don't know if that is available in the GitLab issue tracking system that we will be migrating to soon. I'm marking these two issues as "related" for now.

#8 Updated by Eric Irrgang 2 months ago

  • Related to Task #3256: Remove -nsteps option from mdrun added

#9 Updated by Mark Abraham about 1 month ago

  • Related to Task #2569: announce deprecations in GROMACS 2019 added

#10 Updated by Mark Abraham about 1 month ago

If we replace it by something that does all the same things, then I think we should not replace the name. And if we keep something like it, then we need to not have its implementation make the implementation of mdrun unduly hard to understand and test.

A collection of tpr files, e.g. created with gmx convert-tpr with different numbers of steps, suits the benchmarking / performance testing use cases just as well as such command flags, e.g.

gmx mdrun -s topol_for_1000_steps.tpr

vs

gmx mdrun -nsteps 1000

and the proper solution is to use the API functionality to change the mdp setting and/or write tpr files as necessary behind the scenes.

#11 Updated by Eric Irrgang about 1 month ago

My proposal for a maxsteps option would not be to do the same things as nsteps. It would not affect the final simulation step, not allow infinite steps, and not override the input record. The idea would be to mimic maxh as closely as possible, which would facilitate testing for such things as checkpointing and restoration, in part by being divorced from the simulation API. Alternatively, it could be implemented in terms of the API using the evolving stop condition hook, but we need to be clear about whether the simulation would be considered "done" or not. (My sense of "maxsteps" is that it has no interaction with whether a simulation is considered "done", whereas API exposure of the stop condition hook does imply that the simulation work is complete. However, the stop signal should probably always indicate premature termination, and should be separated from the reconciliation of simulation "done"ness under adaptive work.)

Creating / modifying simulation inputs that define a run of a certain number of steps is a separate scenario. It does seem like that scenario effectively addresses this use case, but I am open to a maxsteps facility. If it could be implemented as a minor extension to maxh, then, if not for this use case, it could be a useful seam for prematurely terminating a simulation in progress without a bunch of extra rigging. Otherwise, I will look at further evolution and wrapping of the stop signal handler.

I've gone a bit beyond the scope of this issue. See also, then, #3289.

#12 Updated by Eric Irrgang about 1 month ago

  • Related to Task #3289: Distinguish identifying and non-identifying inputs to API operations. added

#13 Updated by Szilárd Páll about 1 month ago

Mark Abraham wrote:

A collection of tpr files, e.g. created with gmx convert-tpr with different numbers of steps, suits the benchmarking / performance testing use cases just as well as such command flags, e.g.

gmx mdrun -s topol_for_1000_steps.tpr

vs

gmx mdrun -nsteps 1000

FWIW for my own use-cases that is not true. Requiring a separate command to generate and use a new file every time I might want to increase or decrease the number of steps of a test run during development (something I may do dozens of times on some days) is not equivalent with changing a number (typically a digit of a number) on the command line.

Eric Irrgang wrote:

My proposal for a maxsteps option would not be to do the same things as nsteps. It would not affect the final simulation step, not allow infinite steps, and not override the input record. The idea would be to mimic maxh as closely as possible, which would facilitate testing for such things as checkpointing and restoration, in part by being divorced from the simulation API.

-nsteps -1 is AFAIK typically used in conjunction with -maxh, or even without it, just relying on having mdrun terminated by a signal. Are you proposing a new -maxh replacement that implements those use-cases with a single command line option?

#14 Updated by Eric Irrgang about 1 month ago

-nsteps -1 is AFAIK typically used in conjunction with -maxh, or even without it, just relying on having mdrun terminated by a signal. Are you proposing a new -maxh replacement that implements those use-cases with a single command line option?

I am specifically trying to avoid the subject of infinitely long simulations for the moment. (I'm more comfortable with either defining an overly long simulation task or an API-driven loop of simulation segments.)

My proposal for the -nsteps -1 -maxh XX use case would be to define a long simulation in a single TPR and just use -maxh.

The main use case I have had for maxh has been to help HPC jobs shut down cleanly, well before the wall time limit is reached, either for testing or production runs.

My expectation is that -maxsteps would probably be used without -maxh, using a TPR specifying nsteps = large, and setting -maxsteps in [1, nsteps), such as 1 or 2 steps just to make sure a simulation can launch properly, or just run to the next interval of interest, like a neighborlist rebuild, global communication, checkpoint interval, output interval, logging interval, etc. I can only envision use cases related to profiling and testing (particularly to constrain the progress of a simulation within its prescribed work).
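To illustrate the "run to the next interval of interest" idea, here is a small sketch that works out how many steps a run would need to reach the next multiple of each interval. Note that `-maxsteps` is only a proposed flag in this discussion; the function names and the interval values are hypothetical, and only the .mdp parameter names (`nstlist`, `nstlog`, `nstxout`) are real.

```python
def steps_to_next_boundary(current_step, interval):
    """Steps needed to reach the next multiple of `interval` after current_step."""
    if interval < 1:
        raise ValueError("interval must be a positive number of steps")
    return interval - current_step % interval

def maxsteps_candidates(current_step, nsteps, intervals):
    """Map each interval name to a step budget that stays within [1, remaining)."""
    remaining = nsteps - current_step
    candidates = {}
    for name, interval in intervals.items():
        steps = steps_to_next_boundary(current_step, interval)
        if 1 <= steps < remaining:
            candidates[name] = steps
    return candidates

# Assumed interval settings, chosen only for illustration.
intervals = {"nstlist": 10, "nstlog": 1000, "nstxout": 5000}
print(maxsteps_candidates(current_step=2500, nsteps=1_000_000, intervals=intervals))
```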

#15 Updated by Szilárd Páll about 1 month ago

Eric Irrgang wrote:

-nsteps -1 is AFAIK typically used in conjunction with -maxh, or even without it, just relying on having mdrun terminated by a signal. Are you proposing a new -maxh replacement that implements those use-cases with a single command line option?

I am specifically trying to avoid the subject of infinitely long simulations for the moment. (I'm more comfortable with either defining an overly long simulation task or an API-driven loop of simulation segments.)

My proposal for the -nsteps -1 -maxh XX use case would be to define a long simulation in a single TPR and just use -maxh.

The main use case I have had for maxh has been to help HPC jobs shut down cleanly, well before the wall time limit is reached, either for testing or production runs.

I think that is what most people use it for. I am not certain the user interface should not cater at all for the case where the user may not know or care about the number of steps, but only wants to submit jobs of N hours in length (often back-to-back with continuation).
If I understand correctly, in the proposed setup users might often have to modify their tpr and come up with an nsteps = large to avoid ending up with a sequence of max-walltime jobs quitting because they reached the original not-large-enough nsteps. Wouldn't this be encouraging the use of magic numbers that simply substitute nsteps = -1 (e.g. "1000000000000000000000000")? Or wrapper scripts that check before every re-submission whether large is still large enough?

My expectation is that -maxsteps would probably be used without -maxh, using a TPR specifying nsteps = large, and setting -maxsteps in [1, nsteps), such as 1 or 2 steps just to make sure a simulation can launch properly, or just run to the next interval of interest, like a neighborlist rebuild, global communication, checkpoint interval, output interval, logging interval, etc. I can only envision use cases related to profiling and testing (particularly to constrain the progress of a simulation within its prescribed work).

I agree, I don't see a need for combining a -maxsteps feature with -maxh. However, the above seems to imply that -maxsteps would perform the same as -nsteps N except with N rounded up to a multiple of e.g. nstlist? If so, perhaps there is not even a need to rename it as the slight change in behavior can perhaps be communicated well enough in documentation and a console override note.

Side-note: it would however be somewhat strange (and potentially annoying in some use-cases perhaps) if the same mdrun invocation with the same inputs ran for a different number of steps in different launches (e.g. depending on the hardware).
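A quick arithmetic sketch of that concern, assuming the requested step count were rounded up to a multiple of nstlist and that different launches end up with different nstlist values. The nstlist values below are made up for illustration.

```python
def round_up_to_multiple(n, interval):
    """Round n up to the nearest multiple of interval (integer ceiling division)."""
    if interval < 1:
        raise ValueError("interval must be a positive number of steps")
    return -(-n // interval) * interval

requested = 1005
# Hypothetical nstlist values that three different launches might use.
for nstlist in (10, 80, 100):
    print(nstlist, round_up_to_multiple(requested, nstlist))
```

Under these assumptions the same request of 1005 steps would run 1010, 1040 or 1100 steps depending on nstlist, which is the launch-to-launch variation the side-note describes.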

#16 Updated by Eric Irrgang about 1 month ago

Szilárd Páll wrote:

I think that is what most people use it for. I am not certain the user interface should not cater at all for the case where the user may not know or care about the number of steps, but only wants to submit jobs of N hours in length (often back-to-back with continuation).

It seems reasonable to me that a user might choose a finite but large value, like 1 millisecond, and have to take additional action to expressly affirm that was not enough, but I think that the infinite-trajectory use case is reasonably reformulated as an (infinite) loop of shorter, well-defined (finite) segments.

For users who want simple CLI ways to access API facilities like looping, it could be appropriate to introduce an additional command line tool for defining and launching such work so that mdrun can continue to evolve towards a simpler single-purpose responsibility of launching the MD simulation work expressed in the simulation input, but I honestly haven't given that any thought.

If I understand correctly, in the proposed setup users might often have to modify their tpr and come up with an nsteps = large to avoid ending up with a sequence of max-walltime jobs quitting because they reached the original not-large-enough nsteps. Wouldn't this be encouraging the use of magic numbers that simply substitute nsteps = -1 (e.g. "1000000000000000000000000")?

I think it would be preferable to choose an arbitrarily large number than to have a magic "-1" number. I am okay with a simulation that can never reasonably be expected to complete. But I think it is useful to have a strong definition of "completion."

Or wrapper scripts that check before every re-submission whether large is still large enough?

Not to dodge the question, but it might not be a bad thing to encourage manual user intervention before extending a trajectory that has run for unexpectedly many steps.

I think we should continue to move in the direction of stronger definitions for chunks of simulation work. An arbitrarily long trajectory can be described in terms of a sequence of well-defined trajectory segments, and one set of core API use cases is managing a long trajectory in terms of shorter trajectories. Unfortunately, the API handling in this regard is not where we would like it to be, and I (at least) had not adequately considered how such an evolution needs to be addressed for command line users.

At the very least, instead of forcing GROMACS 2021 users to write extra wrapper scripts, I think it would be fine for GROMACS to provide extra utility functions (in this case, to do the equivalent of while true: append simulation segment). Users would have to alter the syntax of a command line in a job script, getting a reminder of the conceptual GROMACS changes and new ways to manage simulations, but hopefully without excessive torment.
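One way to picture "while true: append simulation segment" with today's command line tools is the sketch below. It is not an existing GROMACS utility: the generator name and the segment file names are hypothetical, and the commands are only generated, never executed. It relies on `gmx convert-tpr -extend` (time in ps) to lengthen the run input and on `gmx mdrun -cpi` to continue from the checkpoint.

```python
import itertools
import shlex

def segment_commands(tpr, extend_ps, checkpoint="state.cpt"):
    """Yield argv lists, without end, for a run followed by appended segments."""
    # Run the simulation as originally defined.
    yield ["gmx", "mdrun", "-s", tpr, "-cpi", checkpoint]
    for i in itertools.count(1):
        next_tpr = f"segment_{i}.tpr"  # hypothetical naming scheme
        # Extend the previous run input by extend_ps picoseconds ...
        yield ["gmx", "convert-tpr", "-s", tpr, "-extend", str(extend_ps), "-o", next_tpr]
        # ... and continue from the checkpoint with the extended input.
        yield ["gmx", "mdrun", "-s", next_tpr, "-cpi", checkpoint]
        tpr = next_tpr

# Print only the first few commands of the otherwise endless sequence.
for argv in itertools.islice(segment_commands("topol.tpr", 1000), 5):
    print(shlex.join(argv))
```

Each segment is a finite, well-defined piece of work, and the endless part lives in the driver loop rather than in the simulation input.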

I agree, I don't see a need for combining a -maxsteps feature with -maxh. However, the above seems to imply that -maxsteps would perform the same as -nsteps N except with N rounded up to a multiple of e.g. nstlist? If so, perhaps there is not even a need to rename it as the slight change in behavior can perhaps be communicated well enough in documentation and a console override note.

I think that it could be possible to simplify and clarify -nsteps enough, but since it has the same name as an MDP field, I think that the only acceptable semantics for -nsteps would be to produce a record of an API operation to replace the nsteps value in the simulation input (see also #3147). This is different, though, from setting -maxsteps 1e5 to require 10 invocations to complete a 1e6 step trajectory.
In addition to improved consistency with the name of maxh, I think there are a few cases where historical nsteps behavior is different enough that a new name might be warranted. Namely, maxsteps would never run a simulation past the nsteps in the simulation input.
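The arithmetic behind that distinction can be sketched with a toy model of the proposed semantics: each invocation advances by at most maxsteps and never runs past the nsteps stored in the simulation input. This models the proposal as described here, not any existing mdrun behavior, and the function name is hypothetical.

```python
def invocations_to_complete(nsteps, maxsteps):
    """Count invocations needed to finish nsteps with a per-invocation cap."""
    if maxsteps < 1:
        raise ValueError("maxsteps must be at least 1")
    current = 0
    count = 0
    while current < nsteps:
        # Never advance past the nsteps recorded in the simulation input.
        current += min(maxsteps, nsteps - current)
        count += 1
    return count, current

print(invocations_to_complete(1_000_000, 100_000))
print(invocations_to_complete(1_000_000, 300_000))
```

With a cap of 1e5 steps, a 1e6-step trajectory takes 10 invocations. With a cap of 3e5 it takes 4, and the last invocation stops at step 1e6 rather than running on to 1.2e6.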

Side-note: it would however be somewhat strange (and potentially annoying in some use-cases perhaps) if the same mdrun invocation with the same inputs ran for a different number of steps in different launches (e.g. depending on the hardware).

There might be cases that warrant levels of warnings and/or overrides, and I think the documentation should probably encourage (if not require) that -maxsteps use a multiple of the checkpoint interval.
