Bug #229

Crashes with dynamic load balancing

Added by Justin Lemkul about 12 years ago. Updated almost 12 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Under version 4.0 RC2 and the official version 4.0, I have seen sporadic crashes with the messages I reported here:

http://www.gromacs.org/pipermail/gmx-users/2008-October/037116.html

Explicitly using "-dlb no" with mdrun seems to alleviate the problem. The attached tarball contains a .tpr file that consistently crashes soon after the run starts. Log files from "-debug 1" are also included, as well as the .e and .o files from our cluster's queuing system; the .o file contains the exact error message for this specific case.

The cluster uses dual-core PowerPC G5 nodes running Mac OS X 10.3.9. The gcc/mpicc version is 3.3.

The mdrun command I have been issuing is:

mdrun_mpi -s md_50_60.tpr -np 128 -npme 32 (-dlb auto/no)
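
The "(-dlb auto/no)" shorthand above presumably stands for two separate runs that differ only in the -dlb setting. A minimal sketch of the two command lines follows, assuming that reading. The helper name build_cmd is hypothetical and only prints each command rather than running it; all flags and the .tpr file name are copied from the command above, and any MPI launcher is left out, as in the original.

```shell
#!/bin/sh
# build_cmd is a hypothetical helper: it prints (does not run) the mdrun
# command line from this report for a given -dlb setting.
build_cmd() {
    dlb="$1"   # "auto" or "no", the two values named in the report
    echo "mdrun_mpi -s md_50_60.tpr -np 128 -npme 32 -dlb $dlb"
}

build_cmd auto   # dynamic load balancing left on automatic
build_cmd no     # dynamic load balancing disabled (the reported workaround)
```

Running the same .tpr file with both variants is one way to check whether a crash is tied to dynamic load balancing.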

crash_2.tar.gz (12.7 MB): Input .tpr file, log files from -debug. Justin Lemkul, 10/16/2008 02:03 PM

History

#1 Updated by Justin Lemkul about 12 years ago

Created an attachment (id=304)
Input .tpr file, log files from -debug

#2 Updated by Berk Hess about 12 years ago

The bug in June could be a completely different one.
Since then I have fixed one or more bugs that could cause
such an error message.

The pressure coupling tau_p could influence the probability
that you observe this issue, especially since for the official
4.0 release I corrected a factor of 16.6 in the use of tau_p,
which makes the coupling, and therefore the box fluctuations,
much faster.

But tweaking parameters does not solve the underlying problem
or bug.
To track that down, I'd rather tweak the parameters such that
the error occurs immediately.

Berk

#3 Updated by Berk Hess about 12 years ago

Oops, I answered a gmx-users question on bugzilla.
Please ignore my previous post.

Berk

#4 Updated by Berk Hess almost 12 years ago

I fixed the bug.
It only occurs with DLB (dynamic load balancing) and multiple
communication pulses.
In extreme cases such as yours, the ci error can then show up.
I think this would always happen before you actually start
to miss interactions.

It took a long time, because while fixing this I also noticed
that the atom communication for triclinic cells was not optimal
and in rare cases a few interactions could be missing.
Fixing this reduces the atom communication for triclinic cells
by about 10%.

Berk
