Project

General

Profile

Bug #538

FAH Checkpointing Broken Aug. 16

Added by Kyle Beauchamp about 9 years ago. Updated about 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Erik Lindahl
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

As of Aug. 16 (6b4a52fc3e1098602dcdd97675716759f9f96459), FAH checkpointing does not work properly.

In that revision, Berk rearranged some of the md.log IO code. You guys obviously aren't responsible for what happens inside FAH, but the issues I'm having are as follows:

1. Recent FAHCore builds are having trouble deleting wudata_XX.log_tmp (note that in FAH, wudata_XX.log = md.log) upon checkpoint resumption. This file is not being deleted because it is still open somewhere.
2. After checkpoint resumption, recent FAHCore builds are writing to wudata_XX.log_tmp instead of wudata_XX.log
3. FAH tries to delete wudata_XX.log_tmp, but cannot (unlink returns -1). However, there is no error checking on the delete operation so things continue regardless of the status of this delete.
4. I think the place where wudata_XX.log_tmp is stuck open is actually within Gromacs. When Gromacs tries to open wudata_XX.log, the open call is passed through FAH. FAH then appends "_tmp" to the filename to be opened.

Thus, the breakdown seems to be that FAH assumes certain behaviors from Gromacs, which are no longer guaranteed.

Based on this data, my proposed patch (still needs more testing!) is to change gmx_log_open() in main.c as shown below.

One of my concerns is that my patch reverts (only for FAH) the "else" statement to an "else if" statement. Thus, I'm undoing some of the Aug. 16 patch. There may be a more appropriate fix that I'm missing due to my incomplete understanding (of both Gromacs and FAH).

FYI, one FAH-related goal is for us to (eventually) use more of the "standard" gromacs features and implement less on our own. Today's gromacs has a lot of tools that make our FAH code "redundant."

****************************************
Present: ****************************************
if (!bMasterOnly) {
/* Since log always ends with '.log' let's use this info */
par_fn(tmpnm,efLOG,cr,FALSE,!bMasterOnly,buf,255);
fp = gmx_fio_fopen(buf, bAppend ? "a+" : "w+" );
}
else {
fp = gmx_fio_fopen(tmpnm, bAppend ? "a+" : "w+" );
} ****************************************
Proposed: ****************************************
if (!bMasterOnly) {
/* Since log always ends with '.log' let's use this info */
par_fn(tmpnm,efLOG,cr,FALSE,!bMasterOnly,buf,255);
fp = gmx_fio_fopen(buf, bAppend ? "a+" : "w+" );
}
#ifdef GMX_FAHCORE
else if (!bAppend)
#else
else
#endif {
fp = gmx_fio_fopen(tmpnm, bAppend ? "a+" : "w+" );
}

History

#1 Updated by Sander Pronk about 9 years ago

On the face of it, that would run the risk of not opening a log file at all. Is there a legitimate case where bMasterOnly is not true with FAH? Perhaps it would be more appropriate to add an entirely separate case for FAH:

#ifndef GMX_FAHCORE
if (!bMasterOnly) {
/* Since log always ends with '.log' let's use this info */
par_fn(tmpnm,efLOG,cr,FALSE,!bMasterOnly,buf,255);
fp = gmx_fio_fopen(buf, bAppend ? "a+" : "w+" );
}
else {
fp = gmx_fio_fopen(tmpnm, bAppend ? "a+" : "w+" );
}
#else
fp = gmx_fio_fopen(tmpnm, bAppend ? "a+" : "w+" );
#endif

which would be easier to understand.

#2 Updated by Kyle Beauchamp about 9 years ago

"On the face of it, that would run the risk of not opening a log file at all."

That's true. Before Aug. 16, some of the log opening code was located in checkpoint.c. Thus, there were cases in gmx_log_open that would not open the log file at all.

I suspect the recent changes moved all log opening to gmx_log_open (makes sense!). However, our FAH code seems to be based on the previous model where the checkpoint code sometimes is in charge of opening the log.

Thus, I think I'm going to stick with the patch that I provided. (I tried Sander's, but it didn't fix the problem.)

I've done some testing of this patch, and it seems to work well in the several cases I've tried.

#3 Updated by Kyle Beauchamp about 9 years ago

FYI: I added my proposed patch to the GIT, so I would be content to call this one resolved.

#4 Updated by Sander Pronk about 9 years ago

OK it's closed.

Also available in: Atom PDF