Bug #1448

multiple successive crashes during REMD can lead to .log files that do not represent the actual replica exchanges to match the .xtc files (complicating demultiplexing)

Added by Chris Neale over 5 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee:
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I have a large REMD run. If the cluster goes down (e.g. a power outage), or if the run crashes for some other reason, the .log files contain exchange information that extends in time beyond the last checkpoint. When the simulation is restarted, this information seems to get overwritten in the .log files, so that later demultiplexing works out fine. However, from a casual look at the .log files during this procedure, the .log file appears to be overwritten in place, with the trailing, now incorrect, exchanges persisting until they are overwritten (i.e., tail my0.log returns the same text until the restarted run passes the time at which the original crash occurred). This is not problematic in itself.

However, if the run crashes again before reaching the timestep of the first crash (this happened to me when the cluster power was unstable during a storm), the resulting .log file never recovers: it cannot be used as-is for demultiplexing, and it is suspect because it contains stale information that never gets overwritten properly.

I have been able to fix my .log file by identifying time gaps or backward time steps in its exchange list and then constructing a corrected .log file with the stale exchange information removed.
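
The check for time gaps and backward time steps can be sketched roughly as follows. This is not the attached script.sh. It is a minimal Python illustration that assumes each exchange record in the .log file begins with a line of the form "Replica exchange at step N time T", and that exchanges are attempted every 500 steps (the -replex value in the mdrun command further down). The function names are hypothetical. The sketch only reports discontinuities; deciding which records are stale, and rebuilding the .log file, is left out.

```python
import re
from typing import Iterable, List, Tuple

# ASSUMPTION: exchange records start with a header line of this form.
EXCHANGE_RE = re.compile(r"^Replica exchange at step\s+(\d+)\s+time\s+(\S+)")


def exchange_steps(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, step) for every exchange header found."""
    found = []
    for lineno, line in enumerate(lines, start=1):
        match = EXCHANGE_RE.match(line)
        if match:
            found.append((lineno, int(match.group(1))))
    return found


def find_discontinuities(
    records: List[Tuple[int, int]], replex: int
) -> List[Tuple[int, str, int, int]]:
    """Compare consecutive exchange steps against the expected interval.

    Returns (line_number, kind, previous_step, step) tuples, where kind is
    "backward" if the step did not increase, or "gap" if it increased by
    anything other than `replex`.
    """
    problems = []
    for (_, prev_step), (lineno, step) in zip(records, records[1:]):
        delta = step - prev_step
        if delta <= 0:
            problems.append((lineno, "backward", prev_step, step))
        elif delta != replex:
            problems.append((lineno, "gap", prev_step, step))
    return problems
```

A possible use would be to open one replica's log (for example MD_0.log), pass the file object to exchange_steps, and then call find_discontinuities with replex=500. Steps are compared as integers rather than times to avoid floating-point comparisons.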

I have uploaded a .log file obtained during such an event (from mdrun with file appending), along with the script that I use to identify problems in the .log file and to extract the stale information. After this reworking of the .log file, demultiplexing works properly (as gauged by the relative stability of the demuxed trajectories).

In case it matters, I will highlight that I exchanged every 1 ps, but only saved .xtc frames every 10 ps.
Please also note that I used my own modified version of GROMACS 4.6.1, but the changes were only in pull.c (to create a flat-bottomed harmonic restraining potential).

The .mdp file is here (the 'TEMPERATURE' value was replaced by a script prior to actual use by grompp):

define = -DPOSRES_ICL1 -DPOSRES_ICL2 -DPOSRES_ICL3 -DPOSRES_Cterm
refcoord_scaling = com
constraints = all-bonds
lincs-iter =  1
lincs-order =  6
constraint_algorithm =  lincs
integrator = sd
dt = 0.002
tinit = 0
nsteps = 5000000000
nstcomm = 1
nstxout = 0
nstvout = 0
nstfout = 0
nstxtcout = 5000
nstenergy = 5000
nstlist = 10
nstlog=0 ; reduce log file size
ns_type = grid
rlist = 1.0
rvdw = 1.0
rcoulomb = 1.0
coulombtype = PME
ewald-rtol = 1e-5
optimize_fft = yes
fourierspacing = 0.12
fourier_nx = 0
fourier_ny = 0
fourier_nz = 0
pme_order = 4
tc_grps             =  System
tau_t               =  1.0
ld_seed             =  -1
ref_t = TEMPERATURE
gen_temp = TEMPERATURE
gen_vel = yes
unconstrained_start = no
gen_seed = -1
Pcoupl = no
dispcorr = EnerPres

pull                     = umbrella
pull_geometry            = position
pull_dim                 = Y Y Y
pull_start               = no
pull_nstxout             = 250
pull_nstfout             = 250
pull_ngroups             = 2
; Don't use a pull reference group so that it goes in absolute coordinates

pull_group1              = r_112-122_&_C-alpha
pull_pbcatom1            = 1992
pull_init1               = 0 0 0
pull_rate1               = 0
pull_k1                  = 1000.0
pull_vec1                = 0 0 0

pull_group2              = r_112-122_&_C-alpha
pull_pbcatom2            = 1992
pull_init2               = 0 0 0
pull_rate2               = 0
pull_k2                  = 1000.0
pull_vec2                = 0 0 0

The mdrun command was:

mpirun -np 568 /project/p/pomes/cneale/GPC/exe/intel/gromacs-4.6.1_halfFlat1.0/exec2/bin/mdrun_mpi -notunepme -deffnm MD_ -dlb yes -npme -1 -cpt 60 -cpi MD_.cpt -px MD_coord.xvg -pf MD_force.xvg -multi 71 -s MD_ -replex 500 -maxh 47.5 
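
For reference, the 1 ps exchange interval and 10 ps .xtc frame interval mentioned in the description follow from dt, nstxtcout and -replex in the settings above. The short Python sketch below just reproduces that arithmetic; the function name is hypothetical.

```python
def remd_intervals(dt_ps: float, replex_steps: int, nstxtcout: int):
    """Return (exchange interval in ps, frame interval in ps,
    number of exchange attempts per saved .xtc frame)."""
    exchange_interval_ps = replex_steps * dt_ps
    frame_interval_ps = nstxtcout * dt_ps
    exchanges_per_frame = nstxtcout // replex_steps
    return exchange_interval_ps, frame_interval_ps, exchanges_per_frame


# Values taken from the .mdp file and the mdrun command in this report:
# dt = 0.002, -replex 500, nstxtcout = 5000
exchange_ps, frame_ps, per_frame = remd_intervals(0.002, 500, 5000)
```

With these settings, only every tenth exchange attempt coincides with a saved .xtc frame, which is why the full exchange history in the .log file matters for demultiplexing.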

Remaining files are way too large to upload.

I have not tested any version later than 4.6.1.

script.sh (857 Bytes): script to check for errors and fix log file. Chris Neale, 03/01/2014 08:37 PM
MD_0.log.bz2 (8.23 MB): log file with errors that complicate demux.pl. Chris Neale, 03/01/2014 08:38 PM

Related issues

Related to GROMACS - Feature #1864: write tng files with energies (New)

History

#1 Updated by Mark Abraham over 5 years ago

  • Category set to mdrun
  • Assignee set to Mark Abraham
  • Target version set to 5.x

Unfortunately, the REMD implementation pre-dates the checkpointing mechanism, and has never been made to behave properly with things like file appending (which is associated with checkpointing). This is a regular problem for GROMACS - support for A and B does not in practice imply that they work well together, particularly when A and B were written by two different people (as was the case here) :-(

As a work-around, I can only recommend not using file appending, and note that this is not perfect either. I am planning an overhaul of the REMD implementation early in the post-5.0 phase, and better support for this would be something I'd be able to do as part of that.

#2 Updated by Mark Abraham over 3 years ago

#3 Updated by Mark Abraham over 3 years ago

  • Target version deleted (5.x)

I don't think this can be fixed well until we write all replica-exchange data to (e.g.) a TNG file, where we can properly support truncation of a log file when doing a continuation with appending, because the checkpoint was written when the contents of the output file were known. I would like to do this for the 2017 release, time permitting.
