Bug #3413

COMM Removal Failure in GROMACS 2020.1

Added by Daniel Kozuch 3 months ago. Updated 3 months ago.

Status: Feedback wanted
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

This is the same issue reported in the following thread: https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2020-March/128660.html

Summary: GROMACS 2020.1 fails to remove center-of-mass motion for a protein-membrane system. In a very short simulation (1 ns), the membrane translates several nm in the xy plane (the plane of the membrane), as measured by gmx traj and confirmed by visual inspection.

Files are attached in a compressed archive. The system is somewhat large, but since the same error was not observed in a pure membrane system (no protein), I found it necessary to test with this one. The error is observable after 1 ns, so long simulations are not required. The system was built with CHARMM-GUI using the CHARMM36 force field.

NOTE: I also tried the system with only two comm groups (i.e. I combined the protein and membrane groups into a single comm group), but this did not resolve the issue.
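
For anyone wanting to reproduce the drift measurement from the summary, the following is a minimal sketch of how the in-plane displacement could be read from a COM trace such as com_prot.xvg. It assumes the file has the column layout time, x, y, z (in nm) with `#` comment and `@` directive lines, and the function names `read_xvg` and `xy_drift` are hypothetical helpers, not part of GROMACS. The sample values are made up for illustration and are not taken from the attached files.

```python
import math


def read_xvg(lines):
    """Parse XVG text: skip '#' comments and '@' directives, return rows of floats."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        rows.append([float(value) for value in line.split()])
    return rows


def xy_drift(rows):
    """Return the xy-plane displacement between the first and last frame.

    Assumes each row is (time, x, y, z); z is ignored because the reported
    drift is in the plane of the membrane.
    """
    _, x0, y0, *_ = rows[0]
    _, x1, y1, *_ = rows[-1]
    return math.hypot(x1 - x0, y1 - y0)


# Illustrative data only (not from the attached com_prot.xvg).
sample = """# hypothetical COM trace
@    title "Center of mass"
0.0    5.0 5.0 4.0
1000.0 8.0 9.0 4.0
""".splitlines()

print(xy_drift(read_xvg(sample)))
```

With a real file, `read_xvg(open("com_prot.xvg"))` would supply the rows. Note that this simple first-to-last difference does not account for periodic wrapping, so it is only meaningful if the COM trace has not jumped across the box.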

Description of files:

(slurm file for compiling tpr and running simulation)
slurm_eq.cmd

(End point from previous restrained npt equilibration)
mem_aqp_npt_restraint.gro
mem_aqp_npt_restraint.cpt

(topology files and index file)
topol.top
toppar/
index.ndx

(mdp file)
npt.mdp

(tpr file and simulation output)
mem_aqp_npt.tpr
mem_aqp_npt.edr
mem_aqp_npt.cpt
mem_aqp_npt.log
(xtc file every 100 ps to avoid uploading a huge file)
mem_aqp_npt_dt100ps.xtc

(logging files)
grompp_npt.txt
mdrun_npt.txt

(com tracking for protein embedded in membrane)
com_prot.xvg

files_for_redmine.tar.gz (32 MB) Daniel Kozuch, 03/09/2020 01:00 AM

History

#1 Updated by Berk Hess 3 months ago

  • Category set to mdrun
  • Status changed from New to Feedback wanted

A run on a single GPU without the GMX_GPU_DD_COMMS environment variable does not show this issue.
My guess is that this is caused by the experimental GMX_GPU_DD_COMMS feature. Could you try without this environment variable to check if that causes the issue?

#2 Updated by Alan Gray 3 months ago

Daniel, thanks very much for reporting this. To help us isolate the issue, could you also please try exactly the same settings as your original run, but without GMX_FORCE_UPDATE_DEFAULT_GPU set? This will still trigger the new GPU communication features, but not the new GPU update feature.

Alan

#3 Updated by Daniel Kozuch 3 months ago

Thanks for the replies. It looks like the COMM issue is resolved if I don't use the GMX_FORCE_UPDATE_DEFAULT_GPU feature, although there is a small performance hit (maybe 10%).
If I run without the GMX_GPU_DD_COMMS feature (while using GMX_GPU_PME_PP_COMMS and GMX_FORCE_UPDATE_DEFAULT_GPU on 4 GPUs), I get an immediate segmentation fault.

Dan

#4 Updated by Alan Gray 3 months ago

Thanks Daniel - just to be sure, do I understand correctly that:

the code works as expected with:

  • no experimental features set
  • the GMX_GPU_DD_COMMS and GMX_GPU_PME_PP_COMMS variables set

it fails to remove center of mass motion with:

  • the GMX_GPU_DD_COMMS, GMX_GPU_PME_PP_COMMS and GMX_FORCE_UPDATE_DEFAULT_GPU variables set

it crashes with a seg fault with:

  • the GMX_GPU_PME_PP_COMMS and GMX_FORCE_UPDATE_DEFAULT_GPU variables set

#5 Updated by Daniel Kozuch 3 months ago

That is correct.

Dan
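
The combinations confirmed in this exchange can be written down as a small test matrix. The sketch below is one possible way to build a clean environment for each case before launching a run; the variable names and outcomes are those reported in this thread, while `build_env`, the labels, and the value "1" are assumptions made for illustration (it assumes only the presence of each variable matters, not its value).

```python
import os

# Experimental variables discussed in this report.
ALL_FLAGS = (
    "GMX_GPU_DD_COMMS",
    "GMX_GPU_PME_PP_COMMS",
    "GMX_FORCE_UPDATE_DEFAULT_GPU",
)

# Outcomes as reported in this thread for GROMACS 2020.1.
CONFIGS = {
    "no experimental features": ((), "works as expected"),
    "DD + PME-PP comms": (
        ("GMX_GPU_DD_COMMS", "GMX_GPU_PME_PP_COMMS"),
        "works as expected",
    ),
    "DD + PME-PP comms + GPU update": (
        ALL_FLAGS,
        "COM motion not removed",
    ),
    "PME-PP comms + GPU update": (
        ("GMX_GPU_PME_PP_COMMS", "GMX_FORCE_UPDATE_DEFAULT_GPU"),
        "segmentation fault",
    ),
}


def build_env(flags, base=None):
    """Return a copy of the environment with only the given experimental flags set."""
    env = dict(os.environ if base is None else base)
    for name in ALL_FLAGS:
        env.pop(name, None)  # clear anything inherited from the shell
    for name in flags:
        env[name] = "1"  # assumption: presence matters, the value does not
    return env


for label, (flags, outcome) in CONFIGS.items():
    env = build_env(flags, base={})
    print(f"{label}: {sorted(env)} -> {outcome}")
    # A run could then be launched with this environment, for example:
    # subprocess.run(["gmx", "mdrun", "-deffnm", "mem_aqp_npt"], env=env)
    # (hypothetical command line; the actual one is in slurm_eq.cmd)
```

Clearing all three variables before setting the requested ones avoids accidentally inheriting a flag from the submitting shell, which would make two cases indistinguishable.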

#6 Updated by Alan Gray 3 months ago

OK, thanks. So it looks like the (originally reported) issue is a bug in the GPU update code (adding Artem). The seg fault might be a separate issue - I'll look into that.

#7 Updated by Alan Gray 3 months ago

"The seg fault might be a separate issue - I'll look into that."

This also seems to be a problem with the GPU update: it fails with GMX_GPU_PME_PP_COMMS and GMX_FORCE_UPDATE_DEFAULT_GPU set, and runs fine with only GMX_GPU_PME_PP_COMMS set.
