Possible races in FEP code
Our PhD student seems to have found conditions under which GROMACS crashes in an FEP simulation once the number of MPI processes exceeds some threshold.
A tpr file for a system affected by this bug is attached. It runs fine on a two-node setup (each node with 2 quad-core Xeon 54xx CPUs), but it crashes when run on 3 or more nodes.
The crash occurs just before the first time step.
PS: I can run quite large simulations without any problems using the same GROMACS build (e.g. a nucleosome, ~1M atoms, on 512-1024 cores), so it seems that only the FEP setup is affected.
#6 Updated by Marina Polyakova almost 3 years ago
- File DNA.eq_npt_0.55.tpr added
- File DNA.pr_0.55.cpt added
- File DNA.top added
- File DNA_cluster.gro added
- File eq_npt_0.55.mdp added
- File posre.itp added
I would like to add that some tasks for other lambdas (from 0 to 0.45) finished without any problems.
#8 Updated by Mark Abraham almost 3 years ago
There was a recent fix in #1462 that might be relevant. It seems like that fix was not in the version you ran; if not, can you please try 5.0-rc1 (or master HEAD), which does have it?
Otherwise, yeah, something feral like that would block a major release ;-)
#16 Updated by Roland Schulz almost 3 years ago
- Status changed from New to Blocked, need info
The coordinates in the tpr file don't agree with those in either the gro or the cpt file. If I recreate the tpr from the coordinates in either the gro or the cpt file, I don't get an error. A newly generated tpr that uses the coordinates from the provided tpr produces the same error as the tpr you provided. Could it be that something went wrong when generating the tpr, and the tpr simply has bad coordinates?
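For anyone who wants to repeat this kind of coordinate check, here is a minimal sketch in Python. It assumes both coordinate sets have already been exported to .gro text at the default precision (3 decimals, positions in columns 21-44); the function names `read_gro_coords` and `max_deviation` are made up for illustration and are not part of GROMACS.

```python
def read_gro_coords(text):
    """Parse atom positions (nm) from the contents of a .gro file.

    Assumes the default fixed-width layout: title line, atom-count line,
    then one line per atom with x, y, z in columns 21-28, 29-36, 37-44.
    """
    lines = text.splitlines()
    natoms = int(lines[1])
    coords = []
    for line in lines[2:2 + natoms]:
        coords.append((float(line[20:28]),
                       float(line[28:36]),
                       float(line[36:44])))
    return coords


def max_deviation(a, b):
    """Largest absolute per-component difference between two coordinate sets."""
    if len(a) != len(b):
        raise ValueError("atom counts differ: %d vs %d" % (len(a), len(b)))
    return max((abs(p - q) for ca, cb in zip(a, b) for p, q in zip(ca, cb)),
               default=0.0)
```

A deviation well above the .gro rounding precision (0.001 nm) would suggest that the two files do not describe the same configuration. Note that this compares raw values only and does not account for periodic wrapping of molecules.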
#17 Updated by Alexey Shvetsov almost 3 years ago
It's possible. I'll ask Marina to upload all the needed files. However, it seems the *.gro file was the starting point for minimization. Which files did you check? The right pair should be DNA.pr_0.55.cpt (for velocities and starting coordinates) and DNA.eq_npt_0.55.tpr.
#18 Updated by Roland Schulz almost 3 years ago
Yes, DNA.pr_0.55.cpt and DNA.eq_npt_0.55.tpr match. I was using DNA.eq_npt_0.9.tpr. And I only get the assert error with DNA.eq_npt_0.9.tpr on step 0 for 2 MPI ranks. With the 0.55 tpr I don't get the assert at step 0 with 2 or 16 ranks, but I do get it with 24 ranks.