## Bug #377

### Clarify meaning of -beginfit and -endfit in g_msd man page

**Description**

Hi,

in addition to the standard -b and -e arguments, g_msd also supports -beginfit and -endfit, both of which are given in units of ps. The meaning of the latter arguments is not completely clear, especially since the default values are 10% and 90% of the runtime.

The output of the command is also not clear enough: looking at the log created by

$ echo 0 | g_msd -b 2500 -beginfit 0 -endfit 10000 &> msd.log

(from a 10 ns simulation)

one sees, that only the trajectory from 2.5 ns to 10 ns is read in. g_msd correctly reports

Used 751 restart points spaced 10 ps over 7500 ps

but still, at the end it states

"Fitting from 0 to 10000 ps"

What's the reason behind the use of -beginfit and -endfit, don't -b and -e do exactly the same thing..?

### History

#### #1 Updated by Berk Hess over 10 years ago

Two weeks ago there has been a long discussion on this topic on the mailing list.

MSD is NOT a function of the simulation time, but of time differences.

Therefore these options do completely different things.

I don't know how to clarify this much more.

I think the problem arises from a lack of understanding of the concept of MSD.

Please read the discussion on the mailing list and give us tips of what

can be improved in the description of g_msd.

Looking at the description I think the first line is misleading:

"g_msd computes the mean square displacement (MSD) of atoms from their initial positions."

We should probably rephrase this.

Berk

#### #2 Updated by Jussi Lehtola over 10 years ago

(In reply to comment #1)

Two weeks ago there has been a long discussion on this topic on the mailing

list.

MSD is NOT a function of the simulation time, but of time differences.

Therefore these options do completely different things.

I don't know how to clarify this much more.

I think the problem arises from a lack of understanding of the concept of MSD.

Please read the discussion on the mailing list and give us tips of what

can be improved in the description of g_msd.

To me the concept is very clear: the diffusion constant is obtained by fitting a straight line to the rms displacement:

`D = \lim_{t \to 0} [ < (x(t)-x(0))^2 > / 6 t ]`

The only thing which is not clear is the man page and output of g_msd.

Looking at the description I think the first line is misleading:

"g_msd computes the mean square displacement (MSD) of atoms from their initial

positions."

We should probably rephrase this.

Yes. Changing "initial positions" to "reference configuration" would already help.

**

The thing that is most unclear is

"The diffusion constant is calculated by least squares fitting a straight line through the MSD from -beginfit to -endfit."

This, per se, is of course incorrect. The Einstein relation is only valid for (approximately) random walkers. When the time difference is too great, the MSD no longer is a linear function of time, and thus a linear fit gives nonsense.

If I have understood correctly, what you are doing is placing points between -trestart picoseconds and making the fit on each of these time intervals. So what the man page should actually say is

`"The diffusion constant is calculated by least squares fitting straight lines through the MSD calculated between time intervals defined by -trestart. "`

**

Another thing I don't understand from the man page is what

"An error estimate given, which is the difference of the diffusion coefficients obtained from fits over the two halfs of the fit interval."

actually means - are you referring to -beginfit and -endfit, or the intervals spaced -trestart?

**

A third thing: I still don't see the reason why you have options -beginfit and -endfit. What they make me think is that by default we scrap the first 10% and the last 10% of the analyzed runtime. But since there are already the options -b and -e to skip some of the trajectory, I don't see any reason for having those options.

**

As stated on the mailing list, it would be nice if

fprintf(stdout,"Fitting from %g to %g %s\n\n",beginfit,endfit,time_unit());

actually displayed the time in the same time system as -b and -e, since otherwise it's quite unclear what the routine actually does.

#### #3 Updated by Berk Hess over 10 years ago

I understand that the manual is unclear.

I will try to improve this.

But you still did not get the concept.

The MSD is calculated as:

MSD = < (x(t+t0) - x(t0))^2 >

Where the average <> is over many t0's (-trestart).

The linear fit is done on MSD.

In MSD the initial part will be ballistic and will

change to linear (unless you have a non-equilibrium system).

So you need to ignore some time at the beginning (-beginfit).

The last part of MSD might be inaccurate, because it uses

only on or a few restart points, so you might need to ignore

some part at the end for the fit (-endfit).

The linear fit does not go through zero exactly, because of

the initial ballistic part, so you need a linear fit with the constant,

but this effect is very small.

So again, the t in MSD is not simulation time and -beginfit

and -endfit have nothing to do with -b and -e.

Berk

#### #4 Updated by Jussi Lehtola over 10 years ago

(In reply to comment #3)

I understand that the manual is unclear.

I will try to improve this.But you still did not get the concept.

The MSD is calculated as:MSD = < (x(t+t0) - x(t0))^2 >

Where the average <> is over many t0's (-trestart).

The linear fit is done on MSD.

OK. I thought the average was taken just over the particles, not the restart times!

Of course, taking the average also over the times increases your sampling, since the physics of diffusion should be invariant wrt translations in time.

What are the sum limits in the average? If you're summing over all the points between -b and -e, then I see that the sampling decays with increasing t.

I'll probably have to run a few calculations on my own...

#### #5 Updated by Berk Hess over 10 years ago

Ah, I was still not 100% clear.

The averaging is over both particles and t0.

g_msd uses all starting times between -b and -e with a spacing of -trestart.

So the number of starting points decays linearly with time. But the actual

sampling decays even more, since for long times, all intervals for different

t0 will partially overlap. Where short intervals, especially those shorter

than t0, are (more) independent.

But if you have enough molecules, even long times might be converged.

Berk

#### #6 Updated by Jussi Lehtola over 10 years ago

OK, still another question: why does the man page say

"Also when ~~beginfit=-1,fitting starts at 0 and~~

when -endfit=-1, fitting goes to the end. Using this option one also

gets an accurate error estimate based on the statistics between indi

vidual molecules. Note that this diffusion coefficient and error esti-

mate are only accurate when the MSD is completely linear between

-beginfit and -endfit."

for '-mw' (the default value), but actually g_msd uses the default values of 10% and 90% for the fit?

#### #7 Updated by Berk Hess over 10 years ago

That part of the manual is completely incorrect.

The default (with -1) is always 10% - 90%.

Berk

#### #8 Updated by Berk Hess over 10 years ago

Hi,

I corrected the g_msd description.

Berk

g_msd computes the mean square displacement (MSD) of atoms from a set of

initial positions. This provides an easy way to compute the diffusion

constant using the Einstein relation. The time between the reference points

for the MSD calculation is set with -trestart. The diffusion constant is

calculated by least squares fitting a straight line (D*t + c) through the

MSD from -beginfit to -endfit (note that t is time from the reference

positions, not simulation time). An error estimate given, which is the

difference of the diffusion coefficients obtained from fits over the two

halves of the fit interval.

There are three, mutually exclusive, options to determine different types of

mean square displacement: -type, -lateral and -ten. Option -ten writes the

full MSD tensor for each group, the order in the output is: trace xx yy zz yx

zx zy.

Option -mol plots the MSD for molecules, this implies With option -rmcomm

center of mass motion can be removed. For trajectories produced with GROMACS

this is usually not necessary as mdrun usually already removes the center of

mass motion. When you use this option be sure that the whole system is stored

in the trajectory file.

-mw, i.e. for each individual molecule an diffusion constant is computed for

its center of mass. The chosen index group will be split into molecules. The

diffusion coefficient is determined by linear regression of the MSD, where,

unlike for the normal output of D, the times are weighted according to the

number of reference points, i.e. short times have a higher weight. Also when

-beginfit=-1,fitting starts at 10% and when -endfit=-1, fitting goes to 90%.

Using this option one also gets an accurate error estimate based on the

statistics between individual molecules. Note that this diffusion coefficient

and error estimate are only accurate when the MSD is completely linear

between -beginfit and -endfit.

Option -pdb writes a pdb file with the coordinates of the frame at time -tpdb

with in the B-factor field the square root of the diffusion coefficient of

the molecule. This option implies option -mol.