Bug #1474

Possible races in FEP code

Added by Alexey Shvetsov over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Our PhD student seems to have found conditions under which GROMACS crashes in an FEP simulation once the number of MPI processes reaches some threshold.
A tpr file for an affected system is attached. It runs fine on a two-node setup (each node with 2 quad-core Xeon 54xx CPUs), but it crashes on 3 or more nodes.

The crash occurs just before the first time step.

PS: I can run quite large simulations without any problems using the same GROMACS build (e.g. a nucleosome with ~1M atoms on 512-1024 cores), so it seems that only the FEP setup is affected.

mpi-288108.err (14.4 KB) - crashed 3 node setup - Alexey Shvetsov, 04/03/2014 06:49 PM
mpi-288109.err (6.62 KB) - working 2 node setup - Alexey Shvetsov, 04/03/2014 06:49 PM
DNA.eq_npt_0.9.tpr (1.75 MB) - problematic tpr - Alexey Shvetsov, 04/03/2014 06:50 PM
#DNA.eq_npt_0.9.log.1# (17.2 KB) - md.log from segfaulted run - Alexey Shvetsov, 04/03/2014 09:37 PM
DNA.pr_0.55.cpt (1.59 MB) - cpt from previous stage - Marina Polyakova, 04/04/2014 09:34 AM
DNA.eq_npt_0.55.tpr (1.75 MB) - problematic tpr - Marina Polyakova, 04/04/2014 09:34 AM
DNA.top (385 KB) - initial top - Marina Polyakova, 04/04/2014 09:34 AM
eq_npt_0.55.mdp (11.2 KB) - mdp for problematic task - Marina Polyakova, 04/04/2014 09:34 AM
posre.itp (26.1 KB) - initial itp - Marina Polyakova, 04/04/2014 09:34 AM
DNA_cluster.gro (3.05 MB) - initial gro - Marina Polyakova, 04/04/2014 09:34 AM

Associated revisions

Revision 4b78099a (diff)
Added by Berk Hess over 3 years ago

Fixed nbnxn FE list allocation issue

With free-energy calculations, the nbnxn search code could write
beyond the list bound, which could cause a segv.
Fixes #1474

Change-Id: I4c202fc14b04980f05ad1b3ea001732fdfaa9f00

History

#1 Updated by Roland Schulz over 3 years ago

When compiling with asserts it gives:
gmx: ../src/gromacs/mdlib/nbnxn_search.c:3261: fep_list_new_nri_copy: Assertion `nlist->nri < nlist->maxnri' failed.

This is even when just using 2 MPI ranks.

#2 Updated by Alexey Shvetsov over 3 years ago

But it doesn't fail with 2 ranks. However, some earlier revisions fail even on start, at least rev 23d79229d99e4ff53bf9356c22f57727450f01af.

#3 Updated by Roland Schulz over 3 years ago

By "doesn't fail", do you mean that they don't crash with a segfault? If so, it is normal for segfaults not to occur every time. What do you mean by "on start"?

#4 Updated by Alexey Shvetsov over 3 years ago

Yep. I mean that it doesn't crash with a segfault.

By "on start" I mean that mdrun segfaults even before it initializes the first step.

#5 Updated by Roland Schulz over 3 years ago

It should be trivial to bisect the problem, but that requires the grompp input files. I'm not sure whether it would help Berk, or whether he already knows what the problem is.

#6 Updated by Marina Polyakova over 3 years ago

I would like to add that some tasks for other lambdas (from 0 to 0.45) have finished without any problems.

#7 Updated by Alexey Shvetsov over 3 years ago

Seems like this bug should be a blocker for the 5.0 release. Mark?

#8 Updated by Mark Abraham over 3 years ago

There was a recent fix in #1462 that might be relevant. It seems like that fix was not in the version you ran; if not, can you please try 5.0-rc1 (or master HEAD), which does have it?

Otherwise, yeah, something feral like that would block a major release ;-)

#9 Updated by Alexey Shvetsov over 3 years ago

Let me recheck with latest build.

#10 Updated by Alexey Shvetsov over 3 years ago

The latest build from git master has the same issue.

#11 Updated by Mark Abraham over 3 years ago

OK. Not sure if I'll get time for this one this week (also Erik's writing grants and Berk's away), but thanks for the report!

#12 Updated by Alexey Shvetsov over 3 years ago

OK. I'll try to debug it. The main problem is that I cannot reproduce it on my workstation (it seems to depend on the number of MPI threads). Debugging on the cluster is a bit problematic because of its setup.

#13 Updated by Roland Schulz over 3 years ago

Did you try my suggestion of compiling with asserts (the default if you use a Debug build)? In that case, for me the failure is independent of the number of MPI threads. BTW: for debugging you can also simply oversubscribe your workstation.

#14 Updated by Alexey Shvetsov over 3 years ago

Doesn't RelWithDebInfo also build with asserts? If not, I'll try a pure Debug build.

#15 Updated by Roland Schulz over 3 years ago

No. Asserts are disabled by -DNDEBUG. And this is by default also in the flags for RelWithDebInfo. Only Debug, Reference, and RelWithAssert don't have that flag by default. Of course you can also simply change CMAKE_C*_FLAGS_* to remove the flag.
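
To make the effect of the flag concrete, here is a minimal C sketch, not GROMACS code. It relies only on the standard behaviour of `<assert.h>`: when `NDEBUG` is defined, `assert(expr)` expands to a no-op and `expr` is never evaluated. The helper names `ndebug_defined` and `assert_is_active` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: reports whether NDEBUG was defined when this
 * translation unit was compiled (e.g. via -DNDEBUG in the C flags). */
bool ndebug_defined(void)
{
#ifdef NDEBUG
    return true;
#else
    return false;
#endif
}

/* Hypothetical helper: returns true only if assert() evaluates its
 * argument. With NDEBUG defined the assert line expands to ((void)0),
 * so 'evaluated' keeps its initial value. */
bool assert_is_active(void)
{
    bool evaluated = false;
    assert((evaluated = true));
    return evaluated;
}
```

Compiling the same file with and without `-DNDEBUG` flips both results. This is consistent with the behaviour reported in this thread: a build without asserts can run past the failing check and only sometimes segfault.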

#16 Updated by Roland Schulz over 3 years ago

  • Status changed from New to Blocked, need info

The coordinates in the tpr file don't agree with those either in the gro or cpt file. If I recreate the tpr based on either the coordinates in the gro or cpt file then I don't get an error. A newly generated tpr with the coordinates in the tpr produces the same error as with the tpr you provided. Could it be that something went wrong generating the tpr and the tpr simply has bad coordinates?

#17 Updated by Alexey Shvetsov over 3 years ago

It's possible. I'll ask Marina to upload all the needed files. However, the *.gro file seems to have been the starting point for minimization. Which files did you check? The right pair should be DNA.pr_0.55.cpt (for velocities and starting coordinates) and DNA.eq_npt_0.55.tpr.

#18 Updated by Roland Schulz over 3 years ago

Yes DNA.pr_0.55.cpt and DNA.eq_npt_0.55.tpr match. I was using DNA.eq_npt_0.9.tpr. And I only get the assert error with DNA.eq_npt_0.9.tpr on step 0 for 2 MPI ranks. With the 0.55 tpr I don't get the assert for step 0 for 2 or 16 ranks, but I do get it for 24 ranks.

#19 Updated by Alexey Shvetsov over 3 years ago

Yep. Same here. 16 MPI ranks work, while 24 fail with a trace.

#20 Updated by Alexey Shvetsov over 3 years ago

Looks like some kind of off-by-one error: nlist->nri became equal to nlist->maxnri.
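
To illustrate the kind of error suspected here, the following is a minimal C sketch. It is neither the actual nbnxn search code nor the eventual fix. It assumes a hypothetical list type `toy_list_t` whose `nri` and `maxnri` fields merely mirror the names in the failed assertion. The invariant `nri < maxnri` has to hold before each write, so the list is grown first whenever it is full.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a neighbor list; only the field names
 * nri and maxnri are taken from the assertion message. */
typedef struct {
    int  nri;    /* number of entries currently in use */
    int  maxnri; /* number of entries allocated        */
    int *iinr;   /* entry storage                      */
} toy_list_t;

/* Appends one entry. Returns 0 on success, -1 if allocation fails.
 * Growing when nri == maxnri (not only when nri > maxnri) is what
 * prevents a write one element past the end of the array. */
int toy_list_append(toy_list_t *list, int value)
{
    if (list->nri >= list->maxnri) {
        int  newmax = (list->maxnri > 0) ? 2 * list->maxnri : 4;
        int *p      = realloc(list->iinr, (size_t)newmax * sizeof *p);
        if (p == NULL) {
            return -1;
        }
        list->iinr   = p;
        list->maxnri = newmax;
    }
    /* Same shape as the check that fails in this report. */
    assert(list->nri < list->maxnri);
    list->iinr[list->nri++] = value;
    return 0;
}

/* Releases the storage and resets the list to its empty state. */
void toy_list_free(toy_list_t *list)
{
    free(list->iinr);
    list->iinr   = NULL;
    list->nri    = 0;
    list->maxnri = 0;
}
```

A list initialized as `toy_list_t list = {0, 0, NULL};` can be filled with repeated `toy_list_append` calls. If the growth test were written with `>` instead of `>=`, the write at `nri == maxnri` would land outside the allocation, which in a build without asserts may or may not crash.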

#21 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1474.
Uploader: Alexey Shvetsov
Change-Id: I3ef7000e5beb37fc0390efb8b755d226202a2414
Gerrit URL: https://gerrit.gromacs.org/3356

#22 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1474.
Uploader: Berk Hess
Change-Id: I4c202fc14b04980f05ad1b3ea001732fdfaa9f00
Gerrit URL: https://gerrit.gromacs.org/3373

#23 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1474.
Uploader: Berk Hess
Change-Id: I4c202fc14b04980f05ad1b3ea001732fdfaa9f00
Gerrit URL: https://gerrit.gromacs.org/3374

#24 Updated by Berk Hess over 3 years ago

  • Status changed from Blocked, need info to Resolved
  • % Done changed from 0 to 100

#25 Updated by Roland Schulz over 3 years ago

  • Status changed from Resolved to Closed
