Bug #168

Parallel runs crash when molecules split over processors

Added by Manolis Doxastakis about 12 years ago. Updated about 12 years ago.

Status: Closed
Priority: Normal
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Hello,

I have a simulation of a protein with lipids that crashes when submitting parallel runs with version 3.3.2 and using anything more than 3 processors.
(double precision, no constraints)

Messages like "water can not be settled" appear, but this is caused by high forces and a temperature increase.

This does not happen when I use
constraints = all-bonds
but the run crashes immediately when I do not constrain the bonds.

After some searching, I believe the problem arises from the addition of the following code in ewald_util.c (ewald_LRcorrection function):

if (bFirst && (cr) && PAR(cr)) {
    gmx_sumi(1,&bSumForces,cr);
    bFirst = FALSE;
}
if (bSumForces) {
    /* This is necessary if molecules are split over processors. Should
       be optimized! */
    gmx_sum(nsb->natoms*DIM,fr->f_el_recip[0],cr);
}
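
For reference, as I understand it gmx_sum(n,r,cr) sums the n reals in r over all processors in place, so that afterwards every processor holds the total (gmx_sumi does the same for ints). A standalone MPI sketch of that operation, purely to illustrate what the call above amounts to (not GROMACS code; the actual implementation may differ):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    double f[3];
    int    rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes only the part of the force array it computed;
       the rest stays zero. */
    for (i = 0; i < 3; i++)
    {
        f[i] = (rank == 0) ? 213.0 : 0.0;
    }

    /* What the sum routine amounts to: add the whole array over all ranks,
       leaving the result on every rank. */
    MPI_Allreduce(MPI_IN_PLACE, f, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: f[0] = %g\n", rank, f[0]);
    MPI_Finalize();
    return 0;
}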

In practice, my simulation on 4 processors reports, for example, the following values for f_el_recip[0][XX] BEFORE the above code:

First step: Processor(0) 213... (this is where the atom belongs)
Processor(1) 0
Processor(2) 0
Processor(3) 0

AFTER the code:
First step: Processor(0) 213... (this is where the atom belongs)
Processor(1) 213
Processor(2) 213
Processor(3) 213

due to the gmx_sum etc.

In the second step, BEFORE the code:
Second step: Processor(0) 209... (this is where the atom belongs)
Processor(1) 213
Processor(2) 213
Processor(3) 213

AFTER the code:
Second step: Processor(0) 848 (this is where the atom belongs)
Processor(1) 848
Processor(2) 848
Processor(3) 848

So, as you see, f_el_recip[0][XX] is not recalculated (reset) on the rest of the processors (apart from 0), and the values accumulate multiple times: in the second step the sum becomes 209 + 3*213 = 848.
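
To make the accumulation concrete, here is a small standalone toy (not GROMACS code; the processor count and values are just taken from the numbers above) that mimics what I believe is happening: only the owning processor recomputes its contribution each step, the others keep the previous step's summed value, and the sum therefore grows exactly as reported.

#include <stdio.h>

#define NPROCS 4

/* Toy model of the f_el_recip accumulation: each "processor" holds a copy of
   one force component.  Only processor 0 (the owner) recomputes it each step;
   the other copies are never reset, so the sum keeps adding stale values. */
int main(void)
{
    double f[NPROCS] = {0.0};
    double recomputed[2] = {213.0, 209.0};   /* values from the report */
    int    step, p;

    for (step = 0; step < 2; step++)
    {
        double sum = 0.0;

        f[0] = recomputed[step];             /* owner recomputes its force   */
                                             /* f[1..3] keep last step's sum */
        for (p = 0; p < NPROCS; p++)         /* sum over all processors...   */
        {
            sum += f[p];
        }
        for (p = 0; p < NPROCS; p++)         /* ...result on every processor */
        {
            f[p] = sum;
        }
        printf("step %d: %g\n", step + 1, sum);   /* prints 213, then 848 */
    }
    return 0;
}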

I have found that I could solve it simply by setting f_el_recip to 0 for all atoms in force.c (force function), but that does not look like an elegant solution to me. Let me know your opinion.
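
For what it is worth, clearing the array on every processor at the start of each step makes the toy above behave correctly (213 in the first step, 209 in the second). This only illustrates the idea behind the workaround; it is not the actual force.c change:

#include <stdio.h>
#include <string.h>

#define NPROCS 4

/* Same toy as above, but every processor clears its copy before the owner
   recomputes and the sum is taken, mimicking "set f_el_recip to 0 for all
   atoms" at the start of the force routine. */
int main(void)
{
    double f[NPROCS];
    double recomputed[2] = {213.0, 209.0};
    int    step, p;

    for (step = 0; step < 2; step++)
    {
        double sum = 0.0;

        memset(f, 0, sizeof(f));             /* reset all stale copies     */
        f[0] = recomputed[step];             /* owner recomputes its force */
        for (p = 0; p < NPROCS; p++)
        {
            sum += f[p];
        }
        printf("step %d: %g\n", step + 1, sum);   /* prints 213, then 209 */
    }
    return 0;
}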

Manolis

topol_2.tpr (7.73 MB) - Run on two processors (same grid) for 50 steps, runs fine - Manolis Doxastakis, 10/09/2007 04:19 AM
topol_4.tpr (7.73 MB) - Run on four processors (same grid) for 50 steps, crashes after a few steps - Manolis Doxastakis, 10/09/2007 04:22 AM

History

#1 Updated by Manolis Doxastakis about 12 years ago

I apologize for the typo; the last two blocks in the example correspond to the second step.

In the second step, BEFORE the code:
SECOND STEP: Processor(0) 209... (this is where the atom belongs)
Processor(1) 213
Processor(2) 213
Processor(3) 213

AFTER the code:
SECOND STEP: Processor(0) 848 (this is where the atom belongs)
Processor(1) 848
Processor(2) 848
Processor(3) 848

#2 Updated by David van der Spoel about 12 years ago

Please upload a tpr file that reproduces the problem.

#3 Updated by Manolis Doxastakis about 12 years ago

Created an attachment (id=247)
Run on two processors (same grid) for 50 steps, runs fine

This is a run on two processors that finishes fine (50 steps)

#4 Updated by Manolis Doxastakis about 12 years ago

Created an attachment (id=248)
Run on four processors (same grid) for 50 steps, that crashes after a few steps

This should crash very fast (4-5 steps)

#5 Updated by Manolis Doxastakis about 12 years ago

The 4-processor run does not have any water on the first processor. Still, this does not cause the bug, since I tried re-ordering the system, placing a water molecule at the beginning, and I still got the problem. I am using Open MPI and icc, but we have also reproduced it with LAM and gcc.

#6 Updated by David van der Spoel about 12 years ago

FYI: I have reproduced the bug and will look into it.

#7 Updated by David van der Spoel about 12 years ago

This has been fixed in CVS for release 3.3. Thanks for the detailed report and the suggested solution.
