-deffnm with enforced rotation code writes files that break checkpoint resume
mdrun called with -deffnm appears to write two xvg files with the same name. The first one is backed up to #foo.xvg.1# but is saved in the checkpoint file so that an appending checkpoint resume fails (tries to read # bytes corresponding to the first xvg file).
Proposed solution: change default file naming for pull xvg files to include a unique suffix.
Append pullx and _pullf to pull files when -deffnm used.
Changes -deffnm behavior for pulling so that the pullx and pullf
files don't collide. Previously, this resulted in one being
backed up and checkpoint restarts failing when -deffnm was used.
(Technically this applies to anything where -px and -pf are identical
and not explicitly set, but that only happens with -deffnm.)
Additionally return fatal error if -px or -pf set and output files
Fix is now localized to the pull code.
Fixes #942 except for log file collision with pull-rotation.
Improve the "files not present" error message
It's possible to use -deffnm in restarts even if it wasn't used in
the initial simulation. This can lead to absurd situations such as:
Expected output files not present or named differently:
where pullx.xvg and pullf.xvg are present and named exactly as listed,
but GROMACS expects them to be named as -deffnm requested.
The improved error message suggest to the user to check for that
Refs #942 (partial workaround)
#6 Updated by Berk Hess over 7 years ago
Changing extensions to be unique is a bad idea. The extension should correspond to the file type (or should be completely irrelevant).
This issue is difficult to solve in an elegant fashion.
But in practice you can in a separate directory.
It seems that only xvg output and input (user table) files will be affected by this issue.
So another possibility is to prefix the complete xvg file name with <deffnm>_
#7 Updated by Roland Schulz over 7 years ago
It would also effect *.log in the case of pull-rotation. But we probably don't want to change it also for all *.log because it would be unexpected to have <deffnm>_md.log. Should we change it for all *.xvg and *.log which are optional file arguments (thus all but md.log)?
#8 Updated by Roland Schulz over 7 years ago
Ping. We could also simply not allow pull together with deffnm and print an error if the user tries it. Do you prefer that solution or the one I suggested in comment 7? We could do the error for 4.6 and then a nicer solution like the one described in 7 for 5.0.
#17 Updated by Roland Schulz over 5 years ago
- Status changed from Blocked, need info to Accepted
We still haven't decided what behavior we want. Specific suggestions were comments 7, 8, 9:
7) Change the behavior for .xvg and .log (other than md.log)
8) Print an error if deffnm is used with non-unique extensions
9) Change the behavior for the 2nd filename with the same extenions.
- the first name of non-unique extensions.
7: <deffnm>_pullx.xvg, <deffnm>_pullf.xvg #all having deffnm
9: pullx.xvg, <deffnm>_pullf.xvg #first is missing it
- for 7 the file-extensions with the changed behavior is hard-coded and thus needs to be updated if any other file-extension ever becomes non-unique.
- for ndx it is very unlikely/impossible to have both
- for cpt it makes sense to use <deffnm>.cpt instead of <deffnm>_state.cpt because the default is anyhow that input and output is the same.
#22 Updated by Peter Kasson over 4 years ago
I'm making my way through the new option-parsing code. Teemu (or others familiar), would you suggest modifying FileNameOptionManager or putting a hack in the pull code to check if -deffnm is set and modify accordingly (or a different approach)? That would be a nasty kludge, but...
If in FileNameOptionManager, the best I see is to add a hack to test if the option is pullx or pullf and append _pullx or _pullf as necessary to realPrefix before
About to upload draft patch using this method.
#27 Updated by Mark Abraham over 3 years ago
- Subject changed from -deffnm with pull code writes files that break checkpoint resume to -deffnm with enforced rotation code writes files that break checkpoint resume
Renamed issue to describe current status.
Carsten, does the rotation code need a fix like https://gerrit.gromacs.org/#/c/4768/? Can you look into that please?
#28 Updated by Carsten Kutzner over 3 years ago
As discussed earlier, this is a general problem which is not directly caused by the pull or the rotation module.
I looked at https://gerrit.gromacs.org/#/c/4768/ but I don't think it is wise to duplicate this code block for every module that has output files with non-unique extensions. Even if we did that, it would not solve the problem. If you e.g. use the swap code in combination with essential dynamics, you again run into the problem because both modules have .xvg output files.
Frankly I think it is the user's problem if he first sets a default for all file names and then complains that this results in duplicate names. One can use -deffnm if one manually sets one of the duplicates explicitly to something else, thereby solving all issues.
I would propose one of the following:
a) WONTFIX and CLOSE because the quick hack would just solve the problem for one other special case while for a real fix a lot of code would have to be written - for a convenience feature. If at some point new output files are added (or users try combinations of several modules), the issue will come up again.
b) An easy fix for all time (!) would be to remove the -deffnm option, which would additionally make the code simpler and more maintainable in many places
c) Make -deffnm hidden now and remove it later
#29 Updated by Peter Kasson over 3 years ago
I would strongly oppose removing -deffnm. It's an option we use all the time. For our usage I don't care if rotation is broken (code-wise it would be better to fix), but if -deffnm went away we'd scream (or rather revert the patch in all our local versions). Not having -deffnm means that unless the user explicitly sets all the output files, they all go to generic names, which is worse than -deffnm, since it causes name collisions if you ever run twice in the same directory.
#30 Updated by Erik Lindahl over 3 years ago
I too am a bit too fond of -deffnm to let it go right away, but it is a concern that some of those very-high-level workflows cause lots of problems on lower level, and the latter class of problems are virtually impossible to fix (since the low-level module cannot know about everything else).
This is not specific to -deffnm, but it also goes for handy features such as -multidir, -nsteps, -maxh, all the various options for extending tpr files, etc.
This, since this issue is now four years old, it's time for those of us who need deffnm to also help design a better implementation that removes the overwriting issues; otherwise Carsten's argument (b) will be quite strong for the 2017 release :-)
#31 Updated by Mark Abraham over 3 years ago
d) Require that all modules write their output to TNG, thus also moving the problem of filename collisions into the code, and making both checkpointing and -deffnm easy to implement and test.
Frankly, workflows that read or write twice in the same directory are broken unless they are doing something trivial. Tabulated interactions that differ between simulations are broken with -multi for similar reasons. (I have proposed removing -multi, accordingly, but I can't find it on Redmine.)
UNIX has a standard way to organize a group of related files, and it's called a directory :-) We should use it, unless someone wants to write a bunch of actual + test code that shows we can implement -deffnm soundly across the combination space of mdrun features that access files, including checkpointing and -multi. To me, that sounds like more work than editing some workflow scripts to use mkdir and cd rather than a variable and -deffnm, even given that multiple people have to do that latter.
#39 Updated by Mark Abraham over 1 year ago
Note that we deprecated and then removed -multi for 2019, so one aspect of the problem is simpler.
We can keep -deffnm if we stop writing such text files (option d above). Writing such files is anyway a poor idea on big iron parallel file systems, and not terribly usable either.
I've even seen that a pullx.xvg didn't get a backup generated when other output files did get a backup generated, which might be because such modules use the wrong text-file opening function. That's a separate problem, but it does illustrate that we need to stop permitting modules to behave like they are one-shot command-line tools for single simulations running a single simulation part.