Bug #3354

release-2020 nightly gpuupdate matrix failing

Added by Szilárd Páll about 1 month ago. Updated 30 days ago.

Status: New
Priority: Normal
Assignee: -
Category: testing
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Runs fail with the following assertion:

Assertion failed:
Condition: useGpuForPme || (useGpuForNonbonded && simulationWork.useGpuBufferOps)
Either PME or short-ranged non-bonded interaction tasks must run on the GPU to
use GPU update.

http://jenkins.gromacs.org/job/Gromacs_Nightly_2020_gpuupdate/58/
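
The asserted condition can be read as a small predicate. The sketch below is illustrative only: `TaskFlags` and `gpuUpdatePreconditionHolds` are hypothetical names, not GROMACS code, and only the three boolean names are taken from the assertion text.

```cpp
// Hypothetical container for the three flags named in the assertion message.
// (In the assertion, useGpuBufferOps is a member of simulationWork.)
struct TaskFlags
{
    bool useGpuForPme;
    bool useGpuForNonbonded;
    bool useGpuBufferOps;
};

// Returns true when the condition checked by the assertion holds:
// PME runs on the GPU, or non-bondeds run on the GPU together with GPU buffer ops.
bool gpuUpdatePreconditionHolds(const TaskFlags& flags)
{
    return flags.useGpuForPme || (flags.useGpuForNonbonded && flags.useGpuBufferOps);
}
```

Under this reading, a run with neither PME nor non-bonded work offloaded fails the check, and so does non-bonded offload without GPU buffer ops.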

Associated revisions

Revision 73e2b7dc (diff)
Added by Szilárd Páll 13 days ago

Avoid dev flags in triggering gpuupdate nightly matrix

The GPU update release feature should be tested independently from the
experimental features which were all enabled for the "gpuupdate" nightly
job. This change removes the GMX_GPU_DD_COMMS and GMX_GPU_PME_PP_COMMS
as well as the unnecessary buffer ops env var.

Refs #3354

Change-Id: I777f6996ca5b1ae1b3e7f787c18d82f605035e47

History

#1 Updated by Szilárd Páll about 1 month ago

Also noticed that the gpuupdate matrix enables a bunch of unnecessary dev features:


This run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable. GPU update with domain decomposition lacks substantial testing and should be used with caution.

GMX_GPU_DD_COMMS environment variable detected, but the 'GPU halo exchange' feature will not be enabled as nonbonded interactions are not offloaded.

GMX_GPU_PME_PP_COMMS environment variable detected, but the 'GPU PME-PP communications' feature was not enabled as PME is not offloaded to the GPU.
Changing nstlist from 10 to 100, rlist from 5.016 to 5.211

We are aiming to test the non-default GPU update release feature, so the dev flags should not be defined.
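
One way to guard against this in a test setup is to check for the dev variables before launching. The following is a rough sketch, not part of GROMACS or the Jenkins configuration: `definedDevVariables` and `isSetInEnvironment` are hypothetical helpers, and only the two variable names are taken from the log above (the buffer ops variable is left out because its name is not given here).

```cpp
#include <cstdlib>
#include <functional>
#include <string>
#include <vector>

// The two development variables named in the log output above.
const std::vector<std::string> c_devVariables = { "GMX_GPU_DD_COMMS", "GMX_GPU_PME_PP_COMMS" };

// Returns the development variables for which isSet(name) is true.
// The lookup is passed in so the check can be exercised without touching
// the real process environment.
std::vector<std::string> definedDevVariables(const std::function<bool(const std::string&)>& isSet)
{
    std::vector<std::string> found;
    for (const auto& name : c_devVariables)
    {
        if (isSet(name))
        {
            found.push_back(name);
        }
    }
    return found;
}

// Lookup against the real process environment.
bool isSetInEnvironment(const std::string& name)
{
    return std::getenv(name.c_str()) != nullptr;
}
```

Calling `definedDevVariables(isSetInEnvironment)` returns an empty list when neither variable is defined, which is the state the gpuupdate matrix should run in.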

#2 Updated by Artem Zhmurov about 1 month ago

Assertions in md.cpp went out of sync with the task assignment. Should be an easy fix.

We had a solid reason to enable these environment variables for GPU update. I think the idea was to test more - these don't do anything in case of a single rank. But in case of multiple ranks, the GPU update is used first, followed by the CPU update. More tests - more bugs captured.

#3 Updated by Szilárd Páll 30 days ago

The gpuupdate matrix for 2020 is meant to test the GPU update release feature. To test this we should not enable all the extra unstable/dev features, because then we (may) test something else.

#4 Updated by Artem Zhmurov 30 days ago

I think we need to pass devFlags to decideGpuUsage and disable GPU update if the GPU comms are not enabled.
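
A minimal sketch of what that proposal could look like follows. Everything in it is hypothetical: `DevFlags`, its single `enableGpuComms` field (which collapses the GPU communication dev features into one flag), and the `decideGpuUpdate` signature are not the actual GROMACS interfaces. It also assumes, based on comment #2, that the comms features only matter with more than one rank.

```cpp
// Hypothetical stand-in for the development flags; the real structure
// is not shown on this page.
struct DevFlags
{
    bool enableGpuComms; // assumed: GPU communication dev features are enabled
};

// Sketch of the proposed decision: keep GPU update only when it can actually be used.
bool decideGpuUpdate(bool gpuUpdateRequested, int numRanks, const DevFlags& devFlags)
{
    if (!gpuUpdateRequested)
    {
        return false;
    }
    if (numRanks == 1)
    {
        // Assumption: the comms flags do nothing for a single rank,
        // so they do not restrict GPU update here.
        return true;
    }
    // With multiple ranks, fall back to CPU update unless GPU comms are enabled.
    return devFlags.enableGpuComms;
}
```

With such a check in the task assignment, the assertion in md.cpp would no longer be reached with a GPU update setting that the run cannot honor.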
