Bug #408

unwanted ensemble averaging of distance restraints with multisim / replica exchange

Added by Floris Buelens over 9 years ago. Updated over 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Erik Lindahl
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

With the option disre=simple, the documented behaviour is to perform ensemble averaging over simulations when running with -multi. The statement

if (ms)
{
    gmx_sum_sim(2*dd->nres,Rt_6,ms);
}

in disre.c was apparently causing intermittent errors like this one:

[node027:4676] * An error occurred in MPI_Allreduce
[node027:4676] on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[node027:4676] MPI_ERR_TRUNCATE: message truncated
[node027:4676] * MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

I think it's some kind of race condition. It only seems to occur with a large number of multi-simulations.

This is happening for me in my custom code based on 4.0.5, where I'm using -multi for replica exchange between free energy lambda values, so the actual error above could be my fault. However, it seems to me that averaging over simulations whenever if (ms) evaluates true doesn't discriminate between the different reasons for running multisim. As far as I can see, there shouldn't be any distance restraint sum_sim going on for my purposes.

Removing distance restraints from my topology results in no more of the above Allreduce errors.


Related issues

Related to GROMACS - Bug #1117: ensemble-averaged distance restraints is probably broken (Closed)

History

#1 Updated by David van der Spoel over 9 years ago

Let me recap: the multisim option is used for both distance restraints and replica exchange, and when you run replex in a topology with restraints you get ensemble averaging rather than something else?

The test in g_disre should therefore be on the option set in the mdp file, right?

Or should it be that one cannot combine replex and ensemble averaging? I guess that would require multiple layers of parallelism, and that is not implemented. It is not possible to catch this in grompp, because the replex is only done in mdrun. Hence mdrun should have a check against ensemble averaging whenever replex is turned on.

#2 Updated by Floris Buelens over 9 years ago

(In reply to comment #1)

> Let me recap: the multisim option is used for both distance restraints and
> replica exchange, and when you run replex in a topology with restraints you get
> ensemble averaging rather than something else?

Replica exchange appears to run fine although I wouldn't rule out unwanted interactions that might not be immediately apparent.

> The test in g_disre should therefore be on the option set in the mdp file,
> right?
>
> Or, should it be like that one can not combine replex and ensemble averaging? I
> guess that would require multiple layers of parallelism, and that is not
> implemented. It is not possible to catch this in grompp, because the replex is
> only done in mdrun. Hence mdrun should have a check against ensemble averaging
> whenever replex is turned on.

I would suggest the mdp option disre=simple should mean no averaging of any kind. A new option disre=multi could give the current behaviour when mdrun is run with -multi. mdrun could then check if disre=multi in the mdp, and either warn or exit if both -multi and -replex are used.
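The proposal above could be sketched roughly as follows. This is a minimal illustration and not GROMACS code: the `disre_mode_t` enum, the `DISRE_MULTI` value, and both function names are hypothetical. Since the comment leaves open whether mdrun should warn or exit, the check only reports the conflict and lets the caller decide.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical restraint modes. DISRE_MULTI is the new option proposed
 * in this comment; it is an assumption, not an existing mdp value. */
typedef enum { DISRE_NO, DISRE_SIMPLE, DISRE_MULTI } disre_mode_t;

typedef enum { CHECK_OK, CHECK_CONFLICT } check_result_t;

/* Report a conflict when averaging over simulations (disre=multi with
 * -multi) is requested together with replica exchange (-replex). */
check_result_t check_disre_options(disre_mode_t mode, int use_multi, int use_replex)
{
    assert(mode == DISRE_NO || mode == DISRE_SIMPLE || mode == DISRE_MULTI);

    if (mode == DISRE_MULTI && use_multi && use_replex)
    {
        fprintf(stderr, "disre=multi cannot be combined with -multi and -replex\n");
        return CHECK_CONFLICT;
    }
    return CHECK_OK;
}

/* Under the proposal, restraint data is summed over simulations only on
 * explicit request, never merely because -multi is in use. */
int should_average_over_sims(disre_mode_t mode, int use_multi)
{
    return mode == DISRE_MULTI && use_multi;
}
```

With this split, a replica exchange run using disre=simple would pass the check and skip the sum over simulations entirely.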

#3 Updated by David van der Spoel over 9 years ago

But is then my conclusion correct that you specify disre = simple in the mdp file, and as a result of running replex get ensemble averaging?

#4 Updated by Floris Buelens over 9 years ago

(In reply to comment #3)

> But is then my conclusion correct that you specify disre = simple in the mdp
> file, and as a result of running replex get ensemble averaging?

That's right: with disre = simple in the mdp file and mdrun -multi -replex, this conditional in disre.c

if (ms)
{
    gmx_sum_sim(2*dd->nres,Rt_6,ms);
}

evaluates as true and Rt_6 is communicated across the replicas.
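To illustrate what communicating Rt_6 across the replicas implies, here is a small in-process stand-in for a sum over simulations. It assumes that the call behaves like an element-wise all-reduce sum, which matches the MPI_Allreduce in the error message but is not taken from the GROMACS source. The function name and the flat buffer layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-process stand-in for a sum over simulations (an assumption, not
 * the GROMACS implementation). "buffers" holds nreplicas consecutive
 * arrays of "count" values each. After the call, every replica's array
 * holds the element-wise sum over all replicas. */
void sum_over_replicas(size_t nreplicas, size_t count, double *buffers)
{
    assert(buffers != NULL || nreplicas * count == 0);

    for (size_t i = 0; i < count; i++)
    {
        double total = 0.0;

        /* Accumulate element i over all replicas. */
        for (size_t r = 0; r < nreplicas; r++)
        {
            total += buffers[r * count + i];
        }
        /* Write the sum back to every replica. */
        for (size_t r = 0; r < nreplicas; r++)
        {
            buffers[r * count + i] = total;
        }
    }
}
```

After such a sum, each replica no longer sees only its own restraint data. That is the intended effect for ensemble averaging, but it is unwanted when the replicas are, for example, different lambda values in a replica exchange run.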

#5 Updated by Berk Hess over 9 years ago

I fixed this by requiring the environment variable to be set
for ensemble averaging over simulations.
Note that the mdp option is for ensemble averaging within a simulation.

Berk
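The kind of gate described in this comment might look roughly like the sketch below. The comment does not say which environment variable is used, so the name `EXAMPLE_DISRE_ENSEMBLE` is a placeholder, and the function names and the `nsim` parameter are likewise hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder name: the comment does not say which environment
 * variable the fix reads, so this one is hypothetical. */
#define EXAMPLE_ENSEMBLE_ENV "EXAMPLE_DISRE_ENSEMBLE"

/* env_value is the result of getenv() (NULL when unset); nsim is the
 * number of simulations (1 when not running with -multi).
 * Averaging over simulations is enabled only when both hold. */
int ensemble_averaging_enabled(const char *env_value, int nsim)
{
    assert(nsim >= 1);
    return nsim > 1 && env_value != NULL;
}

/* Convenience wrapper that reads the placeholder variable. */
int ensemble_averaging_enabled_from_env(int nsim)
{
    return ensemble_averaging_enabled(getenv(EXAMPLE_ENSEMBLE_ENV), nsim);
}
```

With a gate like this, running with -multi alone (for instance for replica exchange) no longer triggers the sum over simulations; the user has to opt in explicitly.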

#6 Updated by Mark Abraham over 3 years ago

  • Related to Bug #1117: ensemble-averaged distance restraints is probably broken added
