Bug #354

switching from Berendsen to V-rescale doesn't work if -nosum is used

Added by Mark Abraham almost 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

While troubleshooting bug #350, I tried playing around with -nosum. The runs I'm doing continue a 3.3.x NPT simulation that used Berendsen for both couplings. I think it ought to be possible to make a fairly smooth transition to V-rescale. That is what I observe when I give grompp the old .trr and .edr with a V-rescale .mdp and the subsequent 64-processor mdrun uses -sum. If the subsequent mdrun uses -nosum instead, it blows up. If I give grompp the old .trr and .edr and an .mdp file that retains Berendsen, the simulation runs fine regardless of -sum/-nosum.

I had a quick poke in the code, and my guess is that the ekinh data structures are not being initialized suitably.

The .tpr files in the tarball from bug #350 should be able to reproduce the failing case - they satisfy its conditions.

debugging_354.tgz (1.17 MB): tarball to reproduce v-rescale issue with -sum/-nosum. Mark Abraham, 10/12/2009 01:42 PM

History

#1 Updated by Berk Hess almost 10 years ago

You do not mention your tau_t, dt and nstlist values.
The 4.0 V-rescale code does not work perfectly when tau_t
is close to the time step for the thermostatting, which is
nstlist*dt with -nosum.
I don't know if this is the case for you.
But Bussi provided better code (as mentioned in the paper)
for git master. You can try git master to check if that
is the issue.

Berk

#2 Updated by Berk Hess almost 10 years ago

I just realized that my previous explanation is not correct.
In 4.0 with -nosum the "actual" temperature for coupling
is fixed for nstlist steps; this might lead to instabilities
when tau_t is short compared to nstlist*dt. I have also
fixed this in git master, but the fix only really helps together
with Bussi's improved code.
So if this is the issue, running git master should still work,
as should running 4.0 with a longer tau_t.

Berk

#3 Updated by Mark Abraham almost 10 years ago

All my original cases had
  • dt=0.002
  • nstlist=10
  • tau_t=0.5 for protein & non-protein
  • with tcoupl = berendsen or v-rescale
  • with mdrun -sum or mdrun -nosum
  • on 4.0.5.
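
For reference, the ratio between tau_t and the -nosum coupling interval nstlist*dt, which is the quantity the earlier comments turn on, can be worked out from the values above. A minimal C sketch follows; the function name coupling_ratio is made up for illustration and is not part of GROMACS.

```c
#include <assert.h>

/* Hypothetical helper (not a GROMACS function): ratio of the thermostat
 * time constant tau_t (ps) to the interval nstlist*dt (ps) over which the
 * coupling temperature stays fixed under -nosum, as described in comment #2. */
double coupling_ratio(double tau_t, int nstlist, double dt)
{
    assert(nstlist > 0 && dt > 0.0);
    return tau_t / (nstlist * dt);
}
```

With dt=0.002 and nstlist=10 the interval is 0.02 ps, so tau_t=0.5 gives a ratio of 25 and tau_t=10 gives a ratio of 500.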

With v-rescale and -nosum, after changing both tau_t values to 2, all 10 runs blew up. Changing both to tau_t=10 gave the same result. Have I varied it far enough?

On the tau_t=10 run I set nstlog=nstlist=10, and saw in one run that by step 10 the KE had gone from 1.7e5 to 1.8e5 kJ/mol, with temperature up by 23 K and pressure up from 161 to 371 bar.

I reran the above tau_t=10 run with -sum. It ran fine, and in the same run described above, step 10 had KE move down from 1.71e5 to 1.70e5 kJ/mol (total energy changed only by +44 kJ/mol), with temperature dropping about 2 K and pressure increasing by 10 bar.

I'll try git master tomorrow. Obviously one can work around this for 4.0 by re-equilibrating.

#4 Updated by Berk Hess almost 10 years ago

Ah, those values should not really be a problem.
So it must be something else.
Could it be that you start at a very low temperature
or have a very unequilibrated system?

The fixes in git master might help there too.

If you want, you can attach a tpr so I can have a look at it.

Berk

#5 Updated by Mark Abraham almost 10 years ago

Created an attachment (id=395)
tarball to reproduce v-rescale issue with -sum/-nosum

> Ah, those values should not really be a problem.
> So it must be something else.
> Could it be that you start at a very low temperature
> or have a very unequilibrated system?

I don't think those ideas will help. It's spent the last 10 nanoseconds apparently happily at 400K under 3.3.1. It's moved to 4.0.5, changing to DD, a new PME parameter set and the T-coupling algorithm. If there was a problem with the initial conditions or other parameters we might expect some symptoms when I used -sum or when I kept the Berendsen T-coupling. The 16ps with -sum looks fine to me, though (sample .log attached). Hence my theory about the initial conditions for v-rescale not being set up at the correct point of the code.

> The fixes in git master might help there too.
>
> If you want, you can attach a tpr so I can have a look at it.

Sure. I did comment that I expected the .tpr files in bug #350 would demonstrate this. Regardless, here's a fresh v-rescale one in the attached tarball, together with .mdp and the output .log with -sum/-nosum. You can have a Berendsen one if you want it.

Sample mdrun line: mpirun mdrun_mpi_405 -nosum -npme 16 -dlb yes -deffnm 1oei_timing_04_30_30_24

#6 Updated by Berk Hess almost 10 years ago

I found the problem.
This is a rather annoying bug.
In vrescale_tcoupl in src/mdlib/coupling.c,
    Ek = trace(ekind->tcstat[i].ekinh);
should be replaced by
    Ek = 0.5*ekind->tcstat[i].Th*BOLTZ*opts->nrdf[i];
since ekinh is updated with -nosum, but not summed, while Th is not updated.
But this is anyhow a dirty way of doing things.
In git master this is handled much more cleanly by only coupling every nstlist steps.
So I am wondering whether we should fix this for 4.0.6.
This would affect binary reproducibility which I would rather not have.
On the other hand, v-rescale is clearly the preferred thermostat with -nosum.

Berk
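
The relation behind the replacement line can be sketched in C as follows. This is only an illustration, not the GROMACS source: the function names are hypothetical, a plain double[3][3] stands in for the GROMACS matrix type, and BOLTZ_APPROX is an approximate value assumed to play the role of BOLTZ in kJ/(mol K).

```c
#include <assert.h>

/* Approximate Boltzmann constant in kJ/(mol K); assumed here to stand in
 * for the BOLTZ constant used in the replacement line. */
#define BOLTZ_APPROX 0.0083145

/* Sum of the diagonal of a 3x3 kinetic energy tensor (kJ/mol). */
double trace3(double m[3][3])
{
    return m[0][0] + m[1][1] + m[2][2];
}

/* Kinetic energy implied by temperature T (K) for nrdf degrees of freedom:
 * Ek = 0.5 * T * k_B * nrdf, the same form as the replacement line. */
double ekin_from_temperature(double T, double nrdf)
{
    assert(nrdf > 0.0);
    return 0.5 * T * BOLTZ_APPROX * nrdf;
}

/* Inverse relation: temperature implied by a kinetic energy. */
double temperature_from_ekin(double ekin, double nrdf)
{
    assert(nrdf > 0.0);
    return 2.0 * ekin / (nrdf * BOLTZ_APPROX);
}
```

When the tensor holds a globally summed kinetic energy, trace3 and ekin_from_temperature describe the same quantity. When the tensor has only been updated locally, as described above for -nosum, the trace no longer matches the kinetic energy implied by the last summed temperature, which is the mismatch the replacement line sidesteps.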

#7 Updated by Mark Abraham almost 10 years ago

> I found the problem.
> This is a rather annoying bug.
> In vrescale_tcoupl in src/mdlib/coupling.c
> Ek = trace(ekind->tcstat[i].ekinh);
> should be replaced by
> Ek = 0.5*ekind->tcstat[i].Th*BOLTZ*opts->nrdf[i];
> since ekinh is updated with -nosum, but not summed, while Th is not updated.
> But this is anyhow a dirty way of doing things.

That figures. The former line was the one that led me to form my guess :-) It would be an easy bug to miss in testing, since one might expect minor differences with -nosum.

The above change passed my set of test cases with v-rescale and -nosum.

> In git master this is handled much more cleanly by only coupling every nstlist steps.

Sounds better.

> So I am wondering whether we should fix this for 4.0.6.
> This would affect binary reproducibility which I would rather not have.
> On the other hand, v-rescale is clearly the preferred thermostat with -nosum.

If I understand correctly, the 4.0.5 v-rescale is (slightly?) erroneous at non-neighboursearch steps under -nosum. This is because it is based on an ekinh matrix that was accurate last neighboursearch, but which has only had local updates since then. So not only is that wrong, but it's differently wrong on each PP processor?

Did my peculiar initial conditions provoke a severe failure because there was no accurate ekinh matrix to start with? If so, should grompp try to do better?

If so, under normal conditions it's apparently not wrong by very much, and perhaps smaller than the stochastic term. On the other hand there's a lot less merit in claiming the implementation of a T-coupling algorithm that preserves the ensemble if it doesn't strictly do that when there's enough parallelism to warrant -nosum. Binary reproducibility is certainly convenient for users and developers, but there's a point when the improvement in accuracy is high enough. Tough one!

At the very least, a note in 4.0.6 mdrun -h, perhaps manual, and known bugs list might be in order. The bug might be insignificant under normal use, but nobody's demonstrated that yet, and a user might like to make that judgement themselves.

#8 Updated by Berk Hess almost 10 years ago

No, the problem is far worse.
In parallel with -nosum, ekinh is the kinetic energy of the local atoms only
at steps without a global sum. This is of course far too small
and causes the thermostat to scale all velocities by a large factor.
So we should fix it somehow.

Berk

#9 Updated by Berk Hess almost 10 years ago

I fixed this for 4.0.6.
Unfortunately 4.0.6 will not give binary identical results to 4.0.5,
but most users won't care about this.

Berk

#10 Updated by Mark Abraham almost 10 years ago

OK, great. I note that current git release-4-0-patches does not yet have the fix.

Mark

#11 Updated by Berk Hess almost 10 years ago

I forgot to do git push, did it now.
Thanks for checking.

Berk
