Bug #271

mdrun -rerun broken in CVS

Added by Mark Abraham almost 11 years ago. Updated almost 11 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The rerun facility with domain decomposition (DD) is broken in CVS. Line 853 of src/kernel/md.c checks that the system size in the run input file matches the rerun trajectory by comparing mdatoms->nr with rerun_fr.natoms. Under DD, however, mdatoms->nr is only the local size, so something like state_global->natoms should be used instead.
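The difference between the two comparisons can be sketched as follows. This is a minimal illustration, not the code in md.c: the `*_sketch` types and both function names are hypothetical stand-ins, and only the field names `nr` and `natoms` come from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the structures named in the report;
 * the real GROMACS types carry many more fields. */
typedef struct { int nr; }     mdatoms_sketch; /* atoms held by this rank (local under DD) */
typedef struct { int natoms; } state_sketch;   /* atoms in the whole system */
typedef struct { int natoms; } frame_sketch;   /* atoms in one trajectory frame */

/* Check as described in the report: compares a per-rank count with the frame. */
bool frame_matches_local(const mdatoms_sketch *mdatoms, const frame_sketch *rerun_fr)
{
    assert(mdatoms != NULL && rerun_fr != NULL);
    return mdatoms->nr == rerun_fr->natoms;
}

/* Suggested alternative: compares the global count with the frame. */
bool frame_matches_global(const state_sketch *state_global, const frame_sketch *rerun_fr)
{
    assert(state_global != NULL && rerun_fr != NULL);
    return state_global->natoms == rerun_fr->natoms;
}
```

With a 1000-atom system split over two ranks, a rank holding 520 atoms would fail the local comparison against a 1000-atom frame even though the trajectory is compatible, while the global comparison passes.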

Even with this trivial fix applied, the problem persists. The same run input file and trajectory work normally on two nodes with mdrun -rerun -pd and give correct results, but with plain mdrun -rerun (DD) I see:

...
Reading file 1oei.tpr, VERSION 4.0.99 (single precision)
NNODES=2, MYRANK=1, HOSTNAME=ac
NODEID=1 argc=8
Making 1D domain decomposition 2 x 1 x 1
starting md rerun 'Protein in water', reading coordinates from input trajectory '../1oei_00001.trr'

trn version: GMX_trn_file (single precision)
trn version: GMX_trn_file (single precision)
Reading frame 0 time 0.000
WARNING: Some frames do not contain velocities.
Ekin, temperature and pressure are incorrect,
the virial will be incorrect when constraints are present.

Reading frame 0 time 0.000
WARNING: Some frames do not contain velocities.
Ekin, temperature and pressure are incorrect,
the virial will be incorrect when constraints are present.

Warning: 1-4 interaction between 200 and 207 at distance 3.365 which is larger than the 1-4 table size 2.000 nm
These are ignored for the rest of the simulation
This usually means your system is exploding,
if not, you should increase table-extension in your mdp file
or with user tables increase the table size
Warning: 1-4 interaction between 32 and 38 at distance 3.882 which is larger than the 1-4 table size 2.000 nm
These are ignored for the rest of the simulation
This usually means your system is exploding,
if not, you should increase table-extension in your mdp file
or with user tables increase the table size
Reading frame 1 time 1.000
A list of missing interactions:
Bond of 323 missing 15
Angle of 402 missing 46
U-B of 168 missing 27
Proper Dih. of 6 missing 4
Ryckaert-Bell. of 598 missing 113
Improper Dih. of 76 missing 11
LJ-14 of 787 missing 99
exclusions of 52740 missing 176

Molecule type 'Protein_A'
the first 10 missing interactions, except for exclusions:
Bond atoms 1 5 global 1 5
Angle atoms 1 5 6 global 1 5 6
Angle atoms 1 5 7 global 1 5 7
Ryckaert-Bell. atoms 1 5 7 8 global 1 5 7 8
Ryckaert-Bell. atoms 1 5 7 9 global 1 5 7 9
LJ-14 atoms 1 8 global 1 8
LJ-14 atoms 1 9 global 1 9
U-B atoms 2 1 5 global 2 1 5
LJ-14 atoms 2 6 global 2 6
LJ-14 atoms 2 7 global 2 7

-------------------------------------------------------
Program mdrun_mpi_c_cvs4, VERSION 4.0.99
Source code file: ../../../src/mdlib/domdec_top.c, line: 349

Fatal error:
491 of the 72120 bonded interactions could not be calculated because some atoms involved moved further apart than the multi-body cut-off distance (1 nm) or the two-body cut-off distance (1 nm), see option rdd, for pairs and tabulated bonds also see option -ddcheck
------------------------------------------------------

The end of the .log file was:

...
Started mdrun on node 0 Tue Dec 16 15:30:49 2008

           Step           Time         Lambda
              0        0.00000        0.00000

Long Range LJ corr.: <C6> 2.8094e-04
Long Range LJ corr.: Epot -2661.82, Pres: -151.532, Vir: 2661.82
   Energies (kJ/mol)
           Bond          Angle            U-B    Proper Dih. Ryckaert-Bell.
    4.50980e+08    9.65542e+04    2.73146e+07    1.23539e+00    4.05791e+03
  Improper Dih.          LJ-14     Coulomb-14        LJ (SR)  Disper. corr.
    3.36201e+04    4.71055e+06   -7.29533e+02    8.75633e+07   -2.66182e+03
   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.   Total Energy
            nan   -2.27584e+06            nan    0.00000e+00            nan
    Temperature Pressure (bar)
    0.00000e+00            nan

DD step 499 load imb.: force 33.9%

           Step           Time         Lambda
            500        1.00000        0.00000

Not all bonded interactions have been properly assigned to the domain decomposition cells

A list of missing interactions:
Bond of 323 missing 15
Angle of 402 missing 46
U-B of 168 missing 27
Proper Dih. of 6 missing 4
Ryckaert-Bell. of 598 missing 113
Improper Dih. of 76 missing 11
LJ-14 of 787 missing 99
exclusions of 52740 missing 176

Molecule type 'Protein_A'
the first 10 missing interactions, except for exclusions:
Bond atoms 1 5 global 1 5
Angle atoms 1 5 6 global 1 5 6
Angle atoms 1 5 7 global 1 5 7
Ryckaert-Bell. atoms 1 5 7 8 global 1 5 7 8
Ryckaert-Bell. atoms 1 5 7 9 global 1 5 7 9
LJ-14 atoms 1 8 global 1 8
LJ-14 atoms 1 9 global 1 9
U-B atoms 2 1 5 global 2 1 5
LJ-14 atoms 2 6 global 2 6
LJ-14 atoms 2 7 global 2 7

-------------------------------------------------------
Program mdrun_mpi_c_cvs4, VERSION 4.0.99
Source code file: ../../../src/mdlib/domdec_top.c, line: 349

Fatal error:
491 of the 72120 bonded interactions could not be calculated because some atoms involved moved further apart than the multi-body cut-off distance (1 nm) or the two-body cut-off distance (1 nm), see option rdd, for pairs and tabulated bonds also see option -ddcheck
------------------------------------------------------

Applying -ddcheck has no effect on the output that I can see.

I would be happy to supply the .tpr, but I can't see how this could be a .tpr-specific issue at this stage.

History

#1 Updated by Berk Hess almost 11 years ago

How did you generate the trajectory?

My guess is that DD goes wrong because it expects
all charge groups to be (nearly) in the box, which will not
be the case when you generated the rerun trajectory
on a single processor, or with PD, or in certain cases with DD
with a triclinic box and more or fewer nodes.

Berk
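One way to picture why coordinates outside the box could upset a spatial decomposition is the toy 1D sketch below. It is an assumption-laden illustration, not the GROMACS DD code: `wrap_into_box` and `cell_index_1d` are hypothetical helpers, the box is rectangular, and single coordinates stand in for charge groups.

```c
#include <assert.h>

/* Toy 1D picture, not GROMACS code: the box spans [0, box_len) and is cut
 * into ncells equal slabs, one per rank. */

/* Shift x by whole box lengths until it lies in [0, box_len). */
double wrap_into_box(double x, double box_len)
{
    assert(box_len > 0.0);
    while (x < 0.0)      { x += box_len; }
    while (x >= box_len) { x -= box_len; }
    return x;
}

/* Slab index for x, or -1 when x lies outside the box and so has no owner. */
int cell_index_1d(double x, double box_len, int ncells)
{
    int cell;

    assert(box_len > 0.0 && ncells > 0);
    if (x < 0.0 || x >= box_len) { return -1; }
    cell = (int)(x / box_len * ncells);
    /* Guard against rounding at the upper edge of the box. */
    if (cell >= ncells) { cell = ncells - 1; }
    return cell;
}
```

In this picture a coordinate written without periodic wrapping, for example 3.5 in a box of length 3.0, has no slab until it is shifted back to 0.5.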

#2 Updated by Mark Abraham almost 11 years ago

The .tpr was generated with 3.3.1, hence PD. I'll test the theory tomorrow.

#3 Updated by Berk Hess almost 11 years ago

The tpr is completely irrelevant for this.
But maybe you consistently avoid using old tpr files
with 4.0, as you suggest on gmx-users,
even though this works without problems
(you only lose a bit of performance for large systems).

I found at least one error.
In md.c it says:
bMasterState = FALSE;
but this should be on for rerun, so could you try changing it to:
bMasterState = bRerunMD;

Note that the scaling with -rerun will be bad, since the state
needs to be broadcast for every frame from the master node
to the other nodes. I would expect good scaling only to 2 or 4
nodes.

Berk
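The role of the flag can be pictured with a toy model in which two ranks split a 1D box at its midpoint. This is a sketch under stated assumptions, not the md.c or domdec code: `repartition_sketch`, `count_misassigned` and `owner_from_position` are hypothetical, and the parameter `from_master_state` only mirrors the idea behind `bMasterState = bRerunMD`. Keeping the old ownership when the flag is off stands in for a repartition that relies on state the ranks already hold, which is stale once a new rerun frame has been read on the master.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not GROMACS code: two ranks split a 1D box at its midpoint. */
static int owner_from_position(double x, double box_len)
{
    return (x < 0.5 * box_len) ? 0 : 1;
}

/* When from_master_state is true, ownership is rebuilt from the full
 * coordinate set held by the master. Otherwise the previous ownership is
 * kept, standing in for a repartition that only looks at local state. */
void repartition_sketch(bool from_master_state, const double *x_global,
                        int natoms, double box_len, int *owner)
{
    int i;

    assert(x_global != NULL && owner != NULL && box_len > 0.0);
    if (!from_master_state) { return; }
    for (i = 0; i < natoms; i++)
    {
        owner[i] = owner_from_position(x_global[i], box_len);
    }
}

/* Number of atoms whose recorded owner disagrees with their position. */
int count_misassigned(const double *x_global, int natoms, double box_len,
                      const int *owner)
{
    int i, n = 0;

    assert(x_global != NULL && owner != NULL && box_len > 0.0);
    for (i = 0; i < natoms; i++)
    {
        if (owner[i] != owner_from_position(x_global[i], box_len)) { n++; }
    }
    return n;
}
```

Rebuilding ownership from the master for every frame is also what makes the full state travel from the master to the other ranks each time, which fits the scaling caveat above.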

#4 Updated by Berk Hess almost 11 years ago

The fix required making a few more lines conditional.
I committed fixes for the head and 4.0 release branch.

Berk

#5 Updated by Mark Abraham almost 11 years ago

"The .tpr was generated with 3.3.1, hence PD. I'll test the theory tomorrow."

should, of course, read

"The .trr was generated with 3.3.1, hence PD. I'll test the theory tomorrow."

My .tpr was always a 4.0.2 one. In any case, the fix in CVS seems to have solved the issue I was seeing. Thanks.
