Project

General

Profile

Bug #942

-deffnm with enforced rotation code writes files that break checkpoint resume

Added by Peter Kasson over 7 years ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

mdrun called with -deffnm appears to write two xvg files with the same name. The first one is backed up to #foo.xvg.1# but is saved in the checkpoint file so that an appending checkpoint resume fails (tries to read # bytes corresponding to the first xvg file).

Proposed solution: change default file naming for pull xvg files to include a unique suffix.


Related issues

Related to GROMACS - Bug #2233: replica exchange and -append bugged?Accepted
Has duplicate GROMACS - Bug #1657: pull x/f outputs overwrite each other with -deffnmRejected
Has duplicate GROMACS - Bug #2154: mdrun restart doesn't recognize pull-code files given default but unspecified namesClosed

Associated revisions

Revision aea28fdc (diff)
Added by Peter Kasson over 4 years ago

Append pullx and _pullf to pull files when -deffnm used.

Changes -deffnm behavior for pulling so that the pullx and pullf
files don't collide. Previously, this resulted in one being
backed up and checkpoint restarts failing when -deffnm was used.
(Technically this applies to anything where -px and -pf are identical
and not explicitly set, but that only happens with -deffnm.)

Additionally return fatal error if -px or -pf set and output files
collide.

Fix is now localized to the pull code.

Fixes #942 except for log file collision with pull-rotation.

Change-Id: I27b8b4ced0f307905e2c2ea4fb260376dd25dc32

Revision c33d8d64 (diff)
Added by Vedran Miletic over 2 years ago

Improve the "files not present" error message

It's possible to use -deffnm in restarts even if it wasn't used in
the initial simulation. This can lead to absurd situations such as:

Expected output files not present or named differently:
pullx.xvg
pullf.xvg

where pullx.xvg and pullf.xvg are present and named exactly as listed,
but GROMACS expects them to be named as -deffnm requested.

The improved error message suggest to the user to check for that
possibility.

Refs #942 (partial workaround)

Change-Id: I983a7a2be791a634b877b0cbadb34e56a1ee2f82

History

#1 Updated by Berk Hess over 7 years ago

I propose to rename pullx.xvg and pullf.xvg to pull_x.xvg and pull_f.xvg and have -deffnm only rename the part of the file name up to an underscore.

#2 Updated by Erik Lindahl over 7 years ago

Good idea, but I'd take it one step further to e.g. out_pullx.xvg, so that the file after a deffnm would be called e.g. ionchannel_pullx.xvg - that way it's more obvious what the file actually contains!

#3 Updated by Berk Hess over 7 years ago

I just thought of that as well. Bit are you suggesting that for all file options, or only those with non unique extensions?

#4 Updated by Erik Lindahl over 7 years ago

I was just thinking about unique extensions, but it might look good to have it for them all. One caveat is that it could lead to strange effects if we rewrite input file names, i.e. the user says "-deffnm run", but needs to provide run_topol.tpr?

#5 Updated by Peter Kasson over 7 years ago

Yes, for that reason it might be better to do non-unique extensions, perhaps? The caveat would be that if an extension becomes non-unique, we'd change the naming behavior e.g. if we for some reason introduce a second .xtc output file.

#6 Updated by Berk Hess over 7 years ago

Changing extensions to be unique is a bad idea. The extension should correspond to the file type (or should be completely irrelevant).
This issue is difficult to solve in an elegant fashion.
But in practice you can in a separate directory.

It seems that only xvg output and input (user table) files will be affected by this issue.
So another possibility is to prefix the complete xvg file name with <deffnm>_

#7 Updated by Roland Schulz over 7 years ago

It would also effect *.log in the case of pull-rotation. But we probably don't want to change it also for all *.log because it would be unexpected to have <deffnm>_md.log. Should we change it for all *.xvg and *.log which are optional file arguments (thus all but md.log)?

#8 Updated by Roland Schulz about 7 years ago

Ping. We could also simply not allow pull together with deffnm and print an error if the user tries it. Do you prefer that solution or the one I suggested in comment 7? We could do the error for 4.6 and then a nicer solution like the one described in 7 for 5.0.

#9 Updated by Peter Kasson about 7 years ago

I think the _foo append is a good solution. One way to designate it might be -deffnm has just <deffnm>.extension for the first file of that type and <deffnm>_specifier.extension for any subsequent file.

#10 Updated by Roland Schulz about 7 years ago

I don't think it is a good idea to append it only for subsequent files. Seems inconsistent to append for pullf but not for pullx.

#11 Updated by Roland Schulz about 7 years ago

Ping. What do we do about this?

#12 Updated by Mark Abraham almost 7 years ago

Is this a priority for 4.6? If it breaks restarts with the pull code, then I'd say it is...

#13 Updated by Erik Lindahl almost 7 years ago

  • Target version changed from 4.6 to 4.6.1

#14 Updated by Mark Abraham over 6 years ago

  • Status changed from New to Blocked, need info
  • Target version deleted (4.6.1)
  • Affected version set to N/A

Is stuff still broken?

#15 Updated by Rossen Apostolov over 5 years ago

  • Assignee set to Carsten Kutzner
  • Priority changed from Normal to High

If the issue is still present would be good to fix it before 5.0

#16 Updated by Rossen Apostolov over 5 years ago

  • Target version set to 5.x

#17 Updated by Roland Schulz over 5 years ago

  • Status changed from Blocked, need info to Accepted

We still haven't decided what behavior we want. Specific suggestions were comments 7, 8, 9:
7) Change the behavior for .xvg and .log (other than md.log)
8) Print an error if deffnm is used with non-unique extensions
9) Change the behavior for the 2nd filename with the same extenions.

The main difference between 7 and 9 would be:
  • the first name of non-unique extensions.
    7: <deffnm>_pullx.xvg, <deffnm>_pullf.xvg #all having deffnm
    9: pullx.xvg, <deffnm>_pullf.xvg #first is missing it
  • for 7 the file-extensions with the changed behavior is hard-coded and thus needs to be updated if any other file-extension ever becomes non-unique.
Besides xvg and log, also ndx and cpt is strictly speaking non-unique.
  • for ndx it is very unlikely/impossible to have both -dn (dipole) and -mn (membed)
  • for cpt it makes sense to use <deffnm>.cpt instead of <deffnm>_state.cpt because the default is anyhow that input and output is the same.

#18 Updated by Teemu Murtola over 5 years ago

cpt and ndx only have a single or no output option, so they are not affected by this issue.

#19 Updated by Teemu Murtola over 5 years ago

8 is not possible to implement generally without significant changes, since there is no place in the current code that would know about all the optional files whether they will be used or not.

#20 Updated by Teemu Murtola over 5 years ago

An option that combines behavior of 7 and generality of 9 would be to add a suffix to each output option whose type is not unique, but leaving the suffix out for required options if there is only one required option for a type.

#21 Updated by Justin Lemkul about 5 years ago

  • Has duplicate Bug #1657: pull x/f outputs overwrite each other with -deffnm added

#22 Updated by Peter Kasson over 4 years ago

I'm making my way through the new option-parsing code. Teemu (or others familiar), would you suggest modifying FileNameOptionManager or putting a hack in the pull code to check if -deffnm is set and modify accordingly (or a different approach)? That would be a nasty kludge, but...

If in FileNameOptionManager, the best I see is to add a hack to test if the option is pullx or pullf and append _pullx or _pullf as necessary to realPrefix before
findExistingExtension.

About to upload draft patch using this method.

#23 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #942.
Uploader: Kasson ()
Change-Id: I27b8b4ced0f307905e2c2ea4fb260376dd25dc32
Gerrit URL: https://gerrit.gromacs.org/4768

#24 Updated by Peter Kasson over 4 years ago

  • Status changed from Accepted to In Progress

Fixed by 4768 with the exception of pull-rotation.

#25 Updated by Peter Kasson over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#26 Updated by Rossen Apostolov over 4 years ago

  • Status changed from Resolved to In Progress

re-opening as per Peter's comment

#27 Updated by Mark Abraham over 3 years ago

  • Subject changed from -deffnm with pull code writes files that break checkpoint resume to -deffnm with enforced rotation code writes files that break checkpoint resume

Renamed issue to describe current status.

Carsten, does the rotation code need a fix like https://gerrit.gromacs.org/#/c/4768/? Can you look into that please?

#28 Updated by Carsten Kutzner over 3 years ago

As discussed earlier, this is a general problem which is not directly caused by the pull or the rotation module.

I looked at https://gerrit.gromacs.org/#/c/4768/ but I don't think it is wise to duplicate this code block for every module that has output files with non-unique extensions. Even if we did that, it would not solve the problem. If you e.g. use the swap code in combination with essential dynamics, you again run into the problem because both modules have .xvg output files.

Frankly I think it is the user's problem if he first sets a default for all file names and then complains that this results in duplicate names. One can use -deffnm if one manually sets one of the duplicates explicitly to something else, thereby solving all issues.

I would propose one of the following:

a) WONTFIX and CLOSE because the quick hack would just solve the problem for one other special case while for a real fix a lot of code would have to be written - for a convenience feature. If at some point new output files are added (or users try combinations of several modules), the issue will come up again.

b) An easy fix for all time (!) would be to remove the -deffnm option, which would additionally make the code simpler and more maintainable in many places

c) Make -deffnm hidden now and remove it later

#29 Updated by Peter Kasson over 3 years ago

I would strongly oppose removing -deffnm. It's an option we use all the time. For our usage I don't care if rotation is broken (code-wise it would be better to fix), but if -deffnm went away we'd scream (or rather revert the patch in all our local versions). Not having -deffnm means that unless the user explicitly sets all the output files, they all go to generic names, which is worse than -deffnm, since it causes name collisions if you ever run twice in the same directory.

#30 Updated by Erik Lindahl over 3 years ago

I too am a bit too fond of -deffnm to let it go right away, but it is a concern that some of those very-high-level workflows cause lots of problems on lower level, and the latter class of problems are virtually impossible to fix (since the low-level module cannot know about everything else).

This is not specific to -deffnm, but it also goes for handy features such as -multidir, -nsteps, -maxh, all the various options for extending tpr files, etc.

This, since this issue is now four years old, it's time for those of us who need deffnm to also help design a better implementation that removes the overwriting issues; otherwise Carsten's argument (b) will be quite strong for the 2017 release :-)

#31 Updated by Mark Abraham over 3 years ago

d) Require that all modules write their output to TNG, thus also moving the problem of filename collisions into the code, and making both checkpointing and -deffnm easy to implement and test.

Frankly, workflows that read or write twice in the same directory are broken unless they are doing something trivial. Tabulated interactions that differ between simulations are broken with -multi for similar reasons. (I have proposed removing -multi, accordingly, but I can't find it on Redmine.)

UNIX has a standard way to organize a group of related files, and it's called a directory :-) We should use it, unless someone wants to write a bunch of actual + test code that shows we can implement -deffnm soundly across the combination space of mdrun features that access files, including checkpointing and -multi. To me, that sounds like more work than editing some workflow scripts to use mkdir and cd rather than a variable and -deffnm, even given that multiple people have to do that latter.

#32 Updated by Mark Abraham over 3 years ago

  • Priority changed from High to Normal
  • Target version deleted (5.x)

#33 Updated by Carsten Kutzner about 3 years ago

  • Assignee deleted (Carsten Kutzner)

#34 Updated by Berk Hess over 2 years ago

  • Has duplicate Bug #2154: mdrun restart doesn't recognize pull-code files given default but unspecified names added

#35 Updated by Mark Abraham over 2 years ago

Note that the combinations of multi, cpi, append and deffnm continue to cause problems, e.g. #2233

#36 Updated by Mark Abraham over 2 years ago

  • Related to Bug #2233: replica exchange and -append bugged? added

#37 Updated by Erik Lindahl almost 2 years ago

There's no point in keeping this open anymore, since nobody is stepping up to find a general solution.

However, that also means the long-term solution will be to get rid of -deffnm.

#38 Updated by Erik Lindahl almost 2 years ago

  • Status changed from In Progress to Closed

#39 Updated by Mark Abraham about 1 year ago

Note that we deprecated and then removed -multi for 2019, so one aspect of the problem is simpler.

We can keep -deffnm if we stop writing such text files (option d above). Writing such files is anyway a poor idea on big iron parallel file systems, and not terribly usable either.

I've even seen that a pullx.xvg didn't get a backup generated when other output files did get a backup generated, which might be because such modules use the wrong text-file opening function. That's a separate problem, but it does illustrate that we need to stop permitting modules to behave like they are one-shot command-line tools for single simulations running a single simulation part.

Also available in: Atom PDF