Multiple successive crashes during REMD can leave .log files whose replica-exchange records no longer match the .xtc files (complicating demultiplexing)
I have a large REMD run. If the cluster goes down (e.g. power outage), or if the run crashes for some other reason, then the .log files contain exchange information that extends in time beyond the last checkpoint. When the simulation is restarted, this information seems to get overwritten in the .log files so that, upon later demuxing, everything works out OK. However, from a casual look at the .log files during this procedure, it seems that the .log file is overwritten in place, with the trailing, now incorrect, exchanges remaining until they are overwritten (i.e., tail my0.log returns the same text until the restarted run passes the time at which the original crash occurred). This is not problematic in itself.
However, if the run now crashes again before reaching the timestep of the first crash (this happened to me when the cluster power was unstable due to a storm), then the resulting .log file never recovers: it cannot be used as-is for demultiplexing, because it contains stale exchange information that never gets overwritten properly.
I have been able to fix my .log file by identifying time-gaps or backward time-steps in my .log file exchange list and then constructing a fixed .log file that is missing the stale exchange information.
I have uploaded a .log file obtained during such an event (from mdrun with file appending), and also the script file that I can use to identify problems in the .log file and also to extract the stale information. Upon this reworking of the .log file, the demultiplexing works properly (as gauged by relative stability of the demuxed trajectories).
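For readers without access to the uploaded script, the kind of check described above can be sketched as follows. This is not the uploaded script, only a minimal illustration. It assumes that every exchange record in the .log file starts with a line of the form "Replica exchange at step N time T", and that a record written later in the file supersedes any earlier record with the same or a later step. Both assumptions should be checked against the actual log before the output is trusted. The function names are hypothetical.

```python
import re

# Assumed header of an exchange record in the .log file (verify against your log).
HEADER = re.compile(r"^Replica exchange at step (\d+)")


def exchange_steps(lines):
    """Return (line_index, step) for every exchange-record header found."""
    found = []
    for i, line in enumerate(lines):
        m = HEADER.match(line)
        if m:
            found.append((i, int(m.group(1))))
    return found


def find_discontinuities(steps, interval):
    """Report records whose step is not exactly `interval` after the previous one."""
    problems = []
    for (_, prev), (idx, cur) in zip(steps, steps[1:]):
        if cur <= prev:
            problems.append((idx, "backward", prev, cur))
        elif cur - prev != interval:
            problems.append((idx, "gap", prev, cur))
    return problems


def drop_superseded(steps):
    """Keep a strictly increasing sequence of records.

    Assumption: when the step goes backward, the later record in the file
    is the valid one, so earlier records at the same or a later step are stale.
    """
    kept = []
    for idx, step in steps:
        while kept and kept[-1][1] >= step:
            kept.pop()
        kept.append((idx, step))
    return kept
```

With the run described here (-replex 500), the expected interval between records is 500 steps. Gaps are only reported, not repaired, because a gap alone does not show which side of it is stale.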
In case it matters, I will highlight that I exchanged every 1 ps, but only saved .xtc frames every 10 ps.
Please also note that I used my own modified version of GROMACS 4.6.1, but the changes were only in pull.c (to create a flat-bottomed harmonic restraining potential).
The .mdp file is here (the 'TEMPERATURE' value was replaced by a script prior to actual use by grompp):
define = -DPOSRES_ICL1 -DPOSRES_ICL2 -DPOSRES_ICL3 -DPOSRES_Cterm
refcoord_scaling = com
constraints = all-bonds
lincs-iter = 1
lincs-order = 6
constraint_algorithm = lincs
integrator = sd
dt = 0.002
tinit = 0
nsteps = 5000000000
nstcomm = 1
nstxout = 0
nstvout = 0
nstfout = 0
nstxtcout = 5000
nstenergy = 5000
nstlist = 10
nstlog=0 ; reduce log file size
ns_type = grid
rlist = 1.0
rvdw = 1.0
rcoulomb = 1.0
coulombtype = PME
ewald-rtol = 1e-5
optimize_fft = yes
fourierspacing = 0.12
fourier_nx = 0
fourier_ny = 0
fourier_nz = 0
pme_order = 4
tc_grps = System
tau_t = 1.0
ld_seed = -1
ref_t = TEMPERATURE
gen_temp = TEMPERATURE
gen_vel = yes
unconstrained_start = no
gen_seed = -1
Pcoupl = no
dispcorr = EnerPres
pull = umbrella
pull_geometry = position
pull_dim = Y Y Y
pull_start = no
pull_nstxout = 250
pull_nstfout = 250
pull_ngroups = 2
; Don't use a pull reference group so that it goes in absolute coordinates
pull_group1 = r_112-122_&_C-alpha
pull_pbcatom1 = 1992
pull_init1 = 0 0 0
pull_rate1 = 0
pull_k1 = 1000.0
pull_vec1 = 0 0 0
pull_group2 = r_112-122_&_C-alpha
pull_pbcatom2 = 1992
pull_init2 = 0 0 0
pull_rate2 = 0
pull_k2 = 1000.0
pull_vec2 = 0 0 0
mdrun was invoked like this:
mpirun -np 568 /project/p/pomes/cneale/GPC/exe/intel/gromacs-4.6.1_halfFlat1.0/exec2/bin/mdrun_mpi -notunepme -deffnm MD_ -dlb yes -npme -1 -cpt 60 -cpi MD_.cpt -px MD_coord.xvg -pf MD_force.xvg -multi 71 -s MD_ -replex 500 -maxh 47.5
The remaining files are far too large to upload.
I didn't test any version beyond 4.6.1.
#1 Updated by Mark Abraham almost 7 years ago
- Category set to mdrun
- Assignee set to Mark Abraham
- Target version set to 5.x
Unfortunately, the REMD implementation pre-dates the checkpointing mechanism, and has never been made to behave properly with things like file appending (which is associated with checkpointing). This is a regular problem for GROMACS - support for A and B does not in practice imply that they work well together, particularly when A and B were written by two different people (as was the case here) :-(
As a work-around, I can only recommend not using file appending, and note that this is not perfect either. I am planning an overhaul of the REMD implementation early in the post-5.0 phase, and better support for this would be something I'd be able to do as part of that.
#3 Updated by Mark Abraham over 4 years ago
- Target version deleted
I don't think this can be fixed well until we write all replica-exchange data to (e.g.) a TNG file, where we can properly support truncation of a log file when doing a continuation with appending, because the checkpoint was written when the contents of the output file were known. I would like to do this for the 2017 release, time permitting.