## Bug #377

### Clarify meaning of -beginfit and -endfit in g_msd man page

Status:
Closed
Priority:
Normal
Assignee:
Erik Lindahl
Category:
analysis tools
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

Hi,

in addition to the standard -b and -e arguments, g_msd also supports -beginfit and -endfit, both of which are given in units of ps. The meaning of the latter arguments is not completely clear, especially since the default values are 10% and 90% of the runtime.

The output of the command is also not clear enough: looking at the log created by

\$ echo 0 | g_msd -b 2500 -beginfit 0 -endfit 10000 &> msd.log
(from a 10 ns simulation)

one sees, that only the trajectory from 2.5 ns to 10 ns is read in. g_msd correctly reports
Used 751 restart points spaced 10 ps over 7500 ps
but still, at the end it states
"Fitting from 0 to 10000 ps"

What's the reason behind the use of -beginfit and -endfit, don't -b and -e do exactly the same thing..?

### History

#### #1 Updated by Berk Hessalmost 10 years ago

Two weeks ago there has been a long discussion on this topic on the mailing list.
MSD is NOT a function of the simulation time, but of time differences.
Therefore these options do completely different things.
I don't know how to clarify this much more.
I think the problem arises from a lack of understanding of the concept of MSD.
Please read the discussion on the mailing list and give us tips of what
can be improved in the description of g_msd.

Looking at the description I think the first line is misleading:
"g_msd computes the mean square displacement (MSD) of atoms from their initial positions."
We should probably rephrase this.

Berk

#### #2 Updated by Jussi Lehtolaalmost 10 years ago

Two weeks ago there has been a long discussion on this topic on the mailing
list.
MSD is NOT a function of the simulation time, but of time differences.
Therefore these options do completely different things.
I don't know how to clarify this much more.
I think the problem arises from a lack of understanding of the concept of MSD.
Please read the discussion on the mailing list and give us tips of what
can be improved in the description of g_msd.

To me the concept is very clear: the diffusion constant is obtained by fitting a straight line to the rms displacement:

D = \lim_{t \to 0} [ < (x(t)-x(0))^2 > / 6 t ]

The only thing which is not clear is the man page and output of g_msd.

Looking at the description I think the first line is misleading:
"g_msd computes the mean square displacement (MSD) of atoms from their initial
positions."
We should probably rephrase this.

Yes. Changing "initial positions" to "reference configuration" would already help.

**

The thing that is most unclear is
"The diffusion constant is calculated by least squares fitting a straight line through the MSD from -beginfit to -endfit."

This, per se, is of course incorrect. The Einstein relation is only valid for (approximately) random walkers. When the time difference is too great, the MSD no longer is a linear function of time, and thus a linear fit gives nonsense.

If I have understood correctly, what you are doing is placing points between -trestart picoseconds and making the fit on each of these time intervals. So what the man page should actually say is

"The diffusion constant is calculated by least squares fitting straight lines through the MSD calculated between time intervals defined by -trestart. "

**

Another thing I don't understand from the man page is what
"An error estimate given, which is the difference of the diffusion coefficients obtained from fits over the two halfs of the fit interval."
actually means - are you referring to -beginfit and -endfit, or the intervals spaced -trestart?

**

A third thing: I still don't see the reason why you have options -beginfit and -endfit. What they make me think is that by default we scrap the first 10% and the last 10% of the analyzed runtime. But since there are already the options -b and -e to skip some of the trajectory, I don't see any reason for having those options.

**

As stated on the mailing list, it would be nice if
fprintf(stdout,"Fitting from %g to %g %s\n\n",beginfit,endfit,time_unit());
actually displayed the time in the same time system as -b and -e, since otherwise it's quite unclear what the routine actually does.

#### #3 Updated by Berk Hessalmost 10 years ago

I understand that the manual is unclear.
I will try to improve this.

But you still did not get the concept.
The MSD is calculated as:

MSD = < (x(t+t0) - x(t0))^2 >

Where the average <> is over many t0's (-trestart).
The linear fit is done on MSD.
In MSD the initial part will be ballistic and will
change to linear (unless you have a non-equilibrium system).
So you need to ignore some time at the beginning (-beginfit).
The last part of MSD might be inaccurate, because it uses
only on or a few restart points, so you might need to ignore
some part at the end for the fit (-endfit).
The linear fit does not go through zero exactly, because of
the initial ballistic part, so you need a linear fit with the constant,
but this effect is very small.

So again, the t in MSD is not simulation time and -beginfit
and -endfit have nothing to do with -b and -e.

Berk

#### #4 Updated by Jussi Lehtolaalmost 10 years ago

I understand that the manual is unclear.
I will try to improve this.

But you still did not get the concept.
The MSD is calculated as:

MSD = < (x(t+t0) - x(t0))^2 >

Where the average <> is over many t0's (-trestart).
The linear fit is done on MSD.

OK. I thought the average was taken just over the particles, not the restart times!

Of course, taking the average also over the times increases your sampling, since the physics of diffusion should be invariant wrt translations in time.

What are the sum limits in the average? If you're summing over all the points between -b and -e, then I see that the sampling decays with increasing t.

I'll probably have to run a few calculations on my own...

#### #5 Updated by Berk Hessalmost 10 years ago

Ah, I was still not 100% clear.
The averaging is over both particles and t0.
g_msd uses all starting times between -b and -e with a spacing of -trestart.
So the number of starting points decays linearly with time. But the actual
sampling decays even more, since for long times, all intervals for different
t0 will partially overlap. Where short intervals, especially those shorter
than t0, are (more) independent.
But if you have enough molecules, even long times might be converged.

Berk

#### #6 Updated by Jussi Lehtolaalmost 10 years ago

OK, still another question: why does the man page say

"Also when beginfit=-1,fitting starts at 0 and
when -endfit=-1, fitting goes to the end. Using this option one also
gets an accurate error estimate based on the statistics between indi

vidual molecules. Note that this diffusion coefficient and error esti-
mate are only accurate when the MSD is completely linear between
-beginfit and -endfit."

for '-mw' (the default value), but actually g_msd uses the default values of 10% and 90% for the fit?

#### #7 Updated by Berk Hessalmost 10 years ago

That part of the manual is completely incorrect.
The default (with -1) is always 10% - 90%.

Berk

#### #8 Updated by Berk Hessalmost 10 years ago

Hi,

I corrected the g_msd description.

Berk

g_msd computes the mean square displacement (MSD) of atoms from a set of
initial positions. This provides an easy way to compute the diffusion
constant using the Einstein relation. The time between the reference points
for the MSD calculation is set with -trestart. The diffusion constant is
calculated by least squares fitting a straight line (D*t + c) through the
MSD from -beginfit to -endfit (note that t is time from the reference
positions, not simulation time). An error estimate given, which is the
difference of the diffusion coefficients obtained from fits over the two
halves of the fit interval.

There are three, mutually exclusive, options to determine different types of
mean square displacement: -type, -lateral and -ten. Option -ten writes the
full MSD tensor for each group, the order in the output is: trace xx yy zz yx
zx zy.

Option -mol plots the MSD for molecules, this implies With option -rmcomm
center of mass motion can be removed. For trajectories produced with GROMACS
this is usually not necessary as mdrun usually already removes the center of
mass motion. When you use this option be sure that the whole system is stored
in the trajectory file.

-mw, i.e. for each individual molecule an diffusion constant is computed for
its center of mass. The chosen index group will be split into molecules. The
diffusion coefficient is determined by linear regression of the MSD, where,
unlike for the normal output of D, the times are weighted according to the
number of reference points, i.e. short times have a higher weight. Also when
-beginfit=-1,fitting starts at 10% and when -endfit=-1, fitting goes to 90%.
Using this option one also gets an accurate error estimate based on the
statistics between individual molecules. Note that this diffusion coefficient
and error estimate are only accurate when the MSD is completely linear
between -beginfit and -endfit.

Option -pdb writes a pdb file with the coordinates of the frame at time -tpdb
with in the B-factor field the square root of the diffusion coefficient of
the molecule. This option implies option -mol.

Also available in: Atom PDF