Task #2909

consider implementing mechanisms to ensure pair lists are not used past their max lifetime

Added by Szilárd Páll 6 months ago. Updated 6 months ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Difficulty: uncategorized

Description

Both the outer and, especially, the inner pair list can end up being used beyond the intended lifetime for which the original cut-off distance buffer was calculated. This would cause increased drift that would be particularly hard to notice.

The mechanisms needed to ensure that the outer and inner lists are re-generated before they expire are different, in particular with chunked rolling pruning on GPUs.


The discussion originally came up here: https://gerrit.gromacs.org/#/c/9327/9/src/gromacs/nbnxm/pairlistsets.h@97
where Berk expressed the opinion that it is not possible to track the "age" of a list because the list itself does not have one (only the coordinates do), while Szilárd thought that the list, as a proximity relationship structure, does have an age, and that the coordinates being stored separately is in a way just an implementation detail.
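
To make the failure mode concrete, here is a toy sketch (not GROMACS code; the numbers and the per-step displacement bound are made-up assumptions): a buffer sized for a planned lifetime of 25 steps no longer covers the possible mutual approach of a particle pair once the list is kept in use for longer, and the only symptom is extra energy drift.

```cpp
// Toy illustration (not GROMACS code; all numbers are made up): the pair-list
// radius is the interaction cut-off plus a buffer sized to cover particle
// displacements over a planned list lifetime. If the list is kept in use
// longer, displacements can exceed the buffer and interactions inside the
// cut-off can be silently missed.
#include <cstdio>

int main()
{
    const double buffer          = 0.1;   // pair-list buffer (nm), sized for plannedLifetime steps
    const double maxDisplPerStep = 0.002; // assumed per-particle displacement bound (nm/step)
    const int    plannedLifetime = 25;    // steps the buffer was calculated for

    const int stepsToCheck[] = { 10, 25, 40 };
    for (int stepsSinceSearch : stepsToCheck)
    {
        // Two particles can approach each other by up to twice the single-particle displacement.
        const double maxApproach = 2 * maxDisplPerStep * stepsSinceSearch;
        const bool   covered     = (maxApproach <= buffer);
        std::printf("list age %2d steps (planned max %d): approach up to %.3f nm vs buffer %.3f nm -> %s\n",
                    stepsSinceSearch, plannedLifetime, maxApproach, buffer,
                    covered ? "covered" : "NOT covered: possible missed interactions / drift");
    }
    return 0;
}
```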

History

#1 Updated by Berk Hess 6 months ago

The list does not have an intrinsic age. You need the combination of the list and the current integration step. So to add a check, the step number needs to be passed to the force and dynamic prune kernel dispatch.
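
A minimal sketch of what passing the step into the dispatch could look like (hypothetical names, not the actual nbnxm interface): the list records the step it was built at, and the age check only becomes possible once the caller also supplies the current integration step.

```cpp
// Sketch only: hypothetical names, not the actual GROMACS nbnxm dispatch code.
#include <cassert>
#include <cstdint>

struct PairListMeta
{
    int64_t builtAtStep;      // step at which the (outer) list was constructed
    int     maxOuterLifetime; // number of steps the outer list may be used for
};

void dispatchForceKernel(const PairListMeta& list, int64_t currentStep)
{
    // The list alone has no age; the check needs the current step from the caller.
    assert(currentStep - list.builtAtStep < list.maxOuterLifetime
           && "pair list used past the lifetime its buffer was calculated for");
    // ... launch the force kernel using this list ...
}

void dispatchDynamicPruneKernel(const PairListMeta& list, int64_t currentStep)
{
    assert(currentStep - list.builtAtStep < list.maxOuterLifetime
           && "rolling prune scheduled on an expired pair list");
    // ... launch the rolling-prune kernel for the next chunk ...
}
```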

#2 Updated by Szilárd Páll 6 months ago

  • Description updated (diff)

#3 Updated by Szilárd Páll 6 months ago

When MD steps overlap, the step counter goes out of sync with the "age" / lifespan of the list; this is especially tricky when scheduling is asynchronous, so I don't think relying on external conditionals in the schedule alone will be feasible anyway. One could always pass the step counter, but it is the task, before execution, that would have to check consistency, not the schedule.
It seems reasonable to try to implement an internal counter in the pair list class, even if it is incremented on an external signal, but independently of the step counter, e.g. at the end of a step when all force tasks have completed (which we can keep track of). That way, i) the schedule code could check that currentStep % maxOuterListLifespan == pairList.outerListLifespan, and ii) before launching a force kernel we could check that pairList.innerListLifespan <= maxInnerListLifespan, and do both independently of the scheduling code itself. pairList.innerListLifespan would of course have to be reset with a CAS when concurrency is possible (like on a GPU), and that way we would have a consistency check that would allow spotting e.g. a missing GPU task dependency. This could of course be restricted to debug mode to avoid overheads.
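
A rough sketch of this internal-counter idea (all names are hypothetical; this is not a proposed patch): the pair list owns its lifespan counters, they are advanced on the external "all force tasks completed" signal rather than read from the step counter, the resets go through a CAS because prune and force tasks may run concurrently on a GPU, and the checks are assertions that could be enabled only in debug builds.

```cpp
// Sketch of the internal-counter idea; hypothetical names, not a proposed patch.
#include <atomic>
#include <cassert>
#include <cstdint>

class PairListLifespanTracker
{
public:
    // External signal: all force tasks of the current step have completed.
    void onAllForceTasksCompleted()
    {
        outerListLifespan_++;
        innerListLifespan_++;
    }

    // Called right after the outer list has been regenerated (pair search).
    void resetOuterListLifespan() { casReset(outerListLifespan_); }

    // Called right after the inner list has been re-pruned (rolling-prune chunk).
    void resetInnerListLifespan() { casReset(innerListLifespan_); }

    // Schedule-side consistency check: on a correctly scheduled run the outer
    // counter should match the position of the current step within the
    // regeneration interval.
    void checkScheduleConsistency(int64_t currentStep, int maxOuterListLifespan) const
    {
        assert(outerListLifespan_.load() == currentStep % maxOuterListLifespan
               && "outer pair list not regenerated when expected");
    }

    // Pre-launch check, catching e.g. a missing GPU task dependency where a
    // force kernel would consume an inner list that should have been pruned.
    void checkBeforeForceKernelLaunch(int maxInnerListLifespan) const
    {
        assert(innerListLifespan_.load() <= maxInnerListLifespan
               && "inner pair list used past its max lifespan");
    }

private:
    static void casReset(std::atomic<int>& counter)
    {
        int expected = counter.load();
        // Retry until the reset succeeds; on failure 'expected' is refreshed
        // with the value written by a concurrent increment.
        while (!counter.compare_exchange_weak(expected, 0))
        {
        }
    }

    std::atomic<int> outerListLifespan_{0};
    std::atomic<int> innerListLifespan_{0};
};
```

The design point of the sketch is only that the checks live next to the list and stay independent of the scheduling code, so an inconsistency (such as a missing GPU task dependency) would be caught at the point of use.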
