Crashes with dynamic load balancing
Under version 4.0 RC2 and the official version 4.0, I have seen sporadic crashes with the messages I reported here:
Explicitly using "-dlb no" with mdrun seems to alleviate the problem. In the attached tarball is a .tpr file that consistently results in a crash soon after the run starts. Log files from "-debug 1" are also included, as well as the .e and .o files from our cluster's queuing system; the .o file contains the exact error message for this specific case.
The cluster uses dual-core PowerPC G5 nodes running Mac OS X 10.3.9. The gcc/mpicc version is 3.3.
The mdrun command I have been issuing is:
mdrun_mpi -s md_50_60.tpr -np 128 -npme 32 (-dlb auto/no)
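To spell out the shorthand in the command above: a minimal sketch, assuming "(-dlb auto/no)" stands for two separate runs that differ only in the value passed to -dlb. The helper name dlb_commands is made up for illustration; it only prints the two command lines and launches nothing. All other options are copied unchanged from the command above.

```shell
#!/bin/sh
# Print the two mdrun command lines being compared.
# Assumption: the runs differ only in the -dlb value (auto vs. no).
dlb_commands() {
    tpr="md_50_60.tpr"   # the .tpr file from the attached tarball
    for dlb in auto no; do
        # Options other than -dlb are reproduced as given in the report.
        echo "mdrun_mpi -s $tpr -np 128 -npme 32 -dlb $dlb"
    done
}

dlb_commands
```

In the report, the "-dlb auto" variant is the one associated with the crashes, and "-dlb no" is the workaround.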
#2 Updated by Berk Hess about 12 years ago
The bug in June could be a completely different one.
Since then I have fixed one or more bugs that could cause
such an error message.
The pressure coupling tau_p could influence the probability
that you observe this issue. Especially since for the official
4.0 release I corrected a factor 16.6 in the use of tau_p,
which makes the coupling and therefore the box fluctuations
But tweaking some parameters only changes how often the error
shows up; it does not fix the bug itself.
To track the bug down, I'd rather tweak the parameters such that
the error occurs immediately.
#4 Updated by Berk Hess almost 12 years ago
I fixed the bug.
It only occurs with DLB and multiple communication pulses.
In extreme cases such as yours, the ci error could then show up.
I think this would always happen before you actually start
to miss interactions.
It took a long time, because while fixing this I also noticed
that the atom communication for triclinic cells was not optimal
and in rare cases a few interactions could be missing.
Fixing this reduces the atom communication for triclinic cells
by about 10%.