Project

General

Profile

Bug #536

mdrun segfaults if -cpi ensemble differs from .tpr ensemble

Added by Mark Abraham about 9 years ago. Updated about 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

I was attempting an NVT-then-NPT equilibration with the following scheme on git head release-4-5-patches

grompp -f equilV.mdp -c ../em/confout.gro -p ../preparation/topol -o equilV -po equilV_po -v
mdrun -deffnm equilV

grompp -f equilP.mdp -c equilV -p ../preparation/topol -o equilP -po equilP_po
mdrun -deffnm equilP -cpi equilV

I thought this would work, with the .tpr state being superseded by the .cpt. The .log file notes that the .cpt lacks various relevant terms, viz

State entry mismatch between the simulation and the checkpoint file
Entries which are not present in the checkpoint file will not be updated
simulation checkpoint
box present present
box-rel present not present
pres_prev present not present
x present present
v present present

The run goes fine until a segfault in write_traj after the last step. Memory debugging showed something go astray in update_energyhistory, which sounds like a plausible thing to be buggy given the input state mismatch.

If I do

grompp -f equilP.mdp -c equilV -p ../preparation/topol -o equilP -po equilP_po -t equilV.cpt
mdrun -deffnm equilP

then the run works.

If I do

grompp -f equilP.mdp -c equilV -p ../preparation/topol -o equilP -po equilP_po -t equilV.cpt
mdrun -deffnm equilP -cpi equilV

the same segfault occurs as above.

bugzilla.tgz (22.7 MB) bugzilla.tgz tarball of my .cpts, .tprs, .mdps and a .log Mark Abraham, 09/01/2010 09:51 AM
equilP.log (20.5 KB) equilP.log .log file under discussion Mark Abraham, 09/01/2010 11:04 AM

History

#1 Updated by Mark Abraham about 9 years ago

Created an attachment (id=528)
tarball of my .cpts, .tprs, .mdps and a .log

I've attached a tarball of my .cpts, .tprs, .mdps and a .log, but the system is 190K atoms, so that might not be suitable for debugging.

#2 Updated by Berk Hess about 9 years ago

We should look into this and put in fatal errors for all cases where a missing term would cause incorrect results.
Are the missing terms you posted the only ones, or is that all?
Especially your were talking about the energy history, but there is no mention of energy history mismatch in your report.

Berk

#3 Updated by Mark Abraham about 9 years ago

Created an attachment (id=529)
.log file under discussion

I'm not sure I understand you. The only complaints in the .log file are those I reported. Full .log attached separately now (though it was in the tarball).

There was no complaint about energy history mismatch in the .log. I don't think there could be such a mismatch to observe and complain about, in this case. It was a 1000-step NVT to generate the .cpt, then that .cpt was read in to provide the state for an NPT continuation with a new .tpr.

I am guessing that the checkpoint-reading code is observing the state mismatch, and not setting up the history-recording data structure properly, so that when it goes to sort things out at the end of the run when the checkpoint file is first written, there's a memory error.

Do you want me to make small system that can demonstrate the memory error?

#4 Updated by Berk Hess about 9 years ago

We should simply disallow ensemble (NVT->NPT etc) switching between cpt and run.
grompp can read a checkpoint file and your run is anyhow not continuous,
since you extend the state of the system.

I can add checks when I have time for this, hopefully tonight.

Berk

#5 Updated by Mark Abraham about 9 years ago

Sure, that's good. Only permitting grompp to handle ensemble switching works fine and is more maintainable.

#6 Updated by Mark Abraham about 9 years ago

(In reply to comment #4)

We should simply disallow ensemble (NVT->NPT etc) switching between cpt and
run.
grompp can read a checkpoint file and your run is anyhow not continuous,
since you extend the state of the system.

I can add checks when I have time for this, hopefully tonight.

I tried the "erroneous" ensemble switch again with git head release-4-5-patches, viz

~/progs/bin/grompp_master -f equilP.mdp -c equilV -p ../preparation/topol -o equilP -po equilP_po
mpirun -np 8 ~/progs/bin/mdrun_mpi_master -deffnm equilP -cpi equilV.cpt

and saw that we now give a warning to stdout or stderr cueing the user to consult the .log file. That .log file presents the information about the state mismatch (which it always did), but doesn't make clear that this is not correct procedure, viz

State entry mismatch between the simulation and the checkpoint file
Entries which are not present in the checkpoint file will not be updated
simulation checkpoint
box present present
box-rel present not present
pres_prev present not present
x present present
v present present

Then the simulation proceeds to completion - no segfault any more. The output .cpt is properly written, and a gmxdump reports the following, which looks like box-rel and pres_prev ARE being updated, unlike the statement in .log

GROMACS version = 4.5.1
GROMACS build time = Thu Sep 9 07:22:15 EST 2010
GROMACS build user = mxa224@vayu2
GROMACS build machine = Linux 2.6.27.49-1.8.3a x86_64
generating program = /home/224/mxa224/progs/bin/mdrun_mpi_master
generation time = Fri Sep 10 11:10:47 2010

checkpoint file version = 12
generating host = vayu1
#atoms = 195537
#T-coupling groups = 1
#Nose-Hoover T-chains = 0
#Nose-Hoover T-chains for barostat = 0
integrator = 0
simulation part # = -1
step = 2000
t = 4.000000
#PP-nodes = 8
dd_nc[x] = 8
dd_nc[y] = 1
dd_nc[z] = 1
#PME-only nodes = 0
state flags = 406
ekin data flags = 0
energy history flags = 255
box (3x3):
box[ 0]={ 1.79762e+01, 0.00000e+00, 0.00000e+00}
box[ 1]={ 5.99207e+00, 1.69481e+01, 0.00000e+00}
box[ 2]={-5.99207e+00, 8.47406e+00, 1.46775e+01}
box-rel (3x3):
box-rel[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
box-rel[ 1]={ 3.33334e-01, 9.42809e-01, 0.00000e+00}
box-rel[ 2]={-3.33334e-01, 4.71405e-01, 8.16497e-01}
pres_prev (3x3):
pres_prev[ 0]={ 9.17330e+01, 2.95182e+01, -4.28630e+01}
pres_prev[ 1]={ 3.33031e+01, 1.83585e+02, 4.64465e+01}
pres_prev[ 2]={-4.31308e+01, 4.85320e+01, 1.45836e+02}
x (195537x3):
x[ 0]={ 3.86220e+00, 1.33144e+01, 6.96436e+00}
x[ 1]={ 3.81131e+00, 1.33993e+01, 6.98445e+00}

It seemed to me in the previous discussion on this topic that there ought to be situations in which mdrun issues a fatal error, not simply a warning.

Secondarily, the .log file output contradicts what actually happens to the .cpt.

#7 Updated by Berk Hess about 9 years ago

I forgot about this bug and did not put in a check.
I don't know why it happened to work for you this time.
I'll add a check today.

Berk

#8 Updated by Berk Hess about 9 years ago

I saw now that when reading a checkpoint file the energy history is allocated
in read_checkpoint and used later without checking the size of allocated arrays.
I made sure now that the history data structure is completely reinitialized when
not appending output. (With appending there was already a fatal error on state mismatch)
But I did not get valgrind warnings on your input.
Could you test if the issue is really fixed now?

Berk

#9 Updated by Mark Abraham about 9 years ago

I think the point is that a simulation with an NPT .tpr and an NVT .cpt should not be allowed to run at all. A user wanting to switch ensemble in a vaguely continuous way needs to be forced to use grompp -t, per earlier discussion here.

(In reply to comment #8)

I saw now that when reading a checkpoint file the energy history is allocated
in read_checkpoint and used later without checking the size of allocated
arrays.
I made sure now that the history data structure is completely reinitialized
when
not appending output. (With appending there was already a fatal error on state
mismatch)

Sure, that probably should have fixed my original energy-history memory-allocation-derived segfault.

But I did not get valgrind warnings on your input.
Could you test if the issue is really fixed now?

Simulation under NPT from the NVT checkpoint runs fine under current git HEAD and writes the terminal checkpoint without segfaulting. But it should have died during setup :-)

#10 Updated by Berk Hess about 9 years ago

Well, sometimes you might want this.
I works now and should give the same results as passing the cpt to grompp and then doing mdrun. File output appending, which is now default, will generate an error with such a mismatch, so it can only be used for starting a simulation from a certain frame.

The question is if we want to explicitly forbid this or just issue a warning.

Berk

#11 Updated by Mark Abraham about 9 years ago

It seems you've changed your mind about whether ensemble switching should be allowed in mdrun. Your comment #4 is quite different from your #10. I've been echoing the sentiment of #4 since you made that comment, but originally I was more of a #10 mind :-). I don't really care either way, so long as both ways to switch ensembles achieve the same result.

I'll do a comparative test of both on Monday.

#12 Updated by Berk Hess about 9 years ago

Sorry about that.
On the one hand I would like to be strict on such things.
On the other hand it can often be convenient to be able to quickly run something from a cpt file without going through grompp.
I see two options:
1) Allowing it with a warning as it is now in git
2) Giving a fatal error with a hint to an env var that the user can set to override the fatal error

Berk

#13 Updated by Mark Abraham about 9 years ago

(In reply to comment #12)

Sorry about that.
On the one hand I would like to be strict on such things.
On the other hand it can often be convenient to be able to quickly run
something from a cpt file without going through grompp.
I see two options:
1) Allowing it with a warning as it is now in git
2) Giving a fatal error with a hint to an env var that the user can set to
override the fatal error

Sure. I like #2. Switching ensemble at mdrun-time is a useful thing to be able to do, but I don't think a user should be able to mistakenly supply a mismatched .cpt, receive only a stderr warning and a .log file statement of the mismatch, and the simulation run to completion. Someone who knows what they want to do can learn to invoke the environment variable, and if they do it often, they'll put that in their job scripts.

#14 Updated by Mark Abraham about 9 years ago

(In reply to comment #11)

I'll do a comparative test of both on Monday.

I tried a comparative test with git head (82659cc) of grompp-based ensemble switching and mdrun-based ensemble switching. After an NVT equilibration, I switched in the two different ways from NVT to NPT .cpt, viz

grompp -f equilV.mdp -c ../em/confout.gro -p ../preparation/topol -o equilV -po equilV_po
mpirun mdrun_mpi -deffnm equilV

grompp -f equilP.mdp -c equilV -p ../preparation/topol -o equilP2 -po equilP2_po -t equilV
mpirun mdrun_mpi -reprod -deffnm equilP2

grompp -f equilP.mdp -c equilV -p ../preparation/topol -o equilP -po equilP_po
mpirun mdrun_mpi -reprod -deffnm equilP -cpi equilV

Unfortunately, the grompp-based switch replaced all the velocities with zero, so the step one temperature was wrong in one simulation. I deduce the code has regressed, somehow, since this always used to work!

The mdrun-based switch looked fine - continuous temperature, normal completion, correct final checkpoint.

#15 Updated by Berk Hess about 9 years ago

Just to be sure, you didn't have gen_vel turned on with 0 K temperature?

Berk

#16 Updated by Mark Abraham about 9 years ago

(In reply to comment #15)

Just to be sure, you didn't have gen_vel turned on with 0 K temperature?

gen_vel = no in equilP.mdp, and the same .mdp was used for both grompp invocations. I had already checked this, but forgot to note it, sorry.

#17 Updated by Berk Hess about 9 years ago

I just tested this for a water box and it works fine for me.
Is this with exactly the same files you attached for this bugzilla?

Berk

#18 Updated by Berk Hess about 9 years ago

You did not specify a file extension for grompp -t.
I guess that grompp read equilV.trr iso .cpt.
Judging from you mdp options, there are no velocities in the trr file
and thus the output tpr will have zero velocities.
grompp asks for a frame with velocities, but if none are present
you do not get a warning. I will add one.

Berk

#19 Updated by Mark Abraham about 9 years ago

(In reply to comment #18)

You did not specify a file extension for grompp -t.
I guess that grompp read equilV.trr iso .cpt.

Indeed, it did.

Judging from you mdp options, there are no velocities in the trr file
and thus the output tpr will have zero velocities.
grompp asks for a frame with velocities, but if none are present
you do not get a warning. I will add one.

OK. Is there a case for "grompp -t" requiring a non-checkpoint trajectory file and instead "grompp -cpi" (or whatever) be required for the present case? It seems to me that a warning won't save us from the case where "grompp -t equilV" picks up equilV.trr, finds some early frame with velocities, and uses it, when the user intended to pick up equilV.cpt. Ideally, the user should know that this is an exception, but I can see lots of people liking the idea that GROMACS is smart enough to "do what they mean" when they omit filename suffixes, and never noticing the problem. The downside is that people are already using grompp -t since GROMACS 4.0. Should grompp print a warning/error if both files exist?

Having hardcoded -t equilV.cpt, I reran the comparison of the two NPT stages, and they were identical for 2000 MD steps using mdrun -reprod. So I think that part of the issue is resolved.

The design choice in comment #12 and comment #13 is still to be resolved.

#20 Updated by Berk Hess about 9 years ago

I added the no velocities warning to grompp.
I added an env.var. for cpt/tpr mismatches.

I understand your problem with the grompp -t option not picking up the file you intended. But I would rather not have two file options that do exactly the same thing, only on a different input file format. The user should now what he/she is doing when not supplying extensions and also this will only happen with -deffnm.

Berk

#21 Updated by Mark Abraham about 9 years ago

(In reply to comment #20)

I added the no velocities warning to grompp.
I added an env.var. for cpt/tpr mismatches.

Looks good. Worked for me as expected.

I understand your problem with the grompp -t option not picking up the file you
intended. But I would rather not have two file options that do exactly the same
thing, only on a different input file format. The user should now what he/she
is doing when not supplying extensions and also this will only happen with
-deffnm.

Fair enough.

Have closed the bug.

Also available in: Atom PDF