Bug #3354

release-2020 nightly gpuupdate matrix failing

Added by Szilárd Páll 7 months ago. Updated 6 months ago.

Status: Closed
Priority: Normal
Category: testing
Target version: 2020.1
Difficulty: uncategorized

Description

Runs fail with the following assertion:

Assertion failed:
Condition: useGpuForPme || (useGpuForNonbonded && simulationWork.useGpuBufferOps)
Either PME or short-ranged non-bonded interaction tasks must run on the GPU to
use GPU update.

http://jenkins.gromacs.org/job/Gromacs_Nightly_2020_gpuupdate/58/
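
For readers unfamiliar with the condition, here is a minimal C++ sketch of what the assertion checks. It is not the GROMACS source: the TaskFlags struct and both functions are hypothetical, and only the three flag names come from the assertion message above. It enumerates all flag combinations and reports which ones would trip the assertion.

```cpp
#include <cstdio>

// Hypothetical stand-in for the flags named in the assertion message.
struct TaskFlags
{
    bool useGpuForPme;
    bool useGpuForNonbonded;
    bool useGpuBufferOps; // simulationWork.useGpuBufferOps in the message
};

// Mirrors the quoted condition: GPU update needs PME on the GPU, or
// nonbonded on the GPU together with GPU buffer ops.
constexpr bool gpuUpdatePreconditionHolds(const TaskFlags& f)
{
    return f.useGpuForPme || (f.useGpuForNonbonded && f.useGpuBufferOps);
}

// Prints every combination of the three flags and returns how many of
// them violate the condition.
inline int countFailingCombinations()
{
    int failing = 0;
    for (int mask = 0; mask < 8; ++mask)
    {
        const TaskFlags f{ (mask & 4) != 0, (mask & 2) != 0, (mask & 1) != 0 };
        const bool      ok = gpuUpdatePreconditionHolds(f);
        std::printf("pme=%d nonbonded=%d bufferOps=%d -> %s\n",
                    static_cast<int>(f.useGpuForPme),
                    static_cast<int>(f.useGpuForNonbonded),
                    static_cast<int>(f.useGpuBufferOps),
                    ok ? "ok" : "assertion would fail");
        if (!ok)
        {
            ++failing;
        }
    }
    return failing;
}
```

With PME on the CPU, the condition holds only when both nonbonded offload and GPU buffer ops are active, so any run that requests GPU update without that combination hits the assertion.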

Associated revisions

Revision 73e2b7dc (diff)
Added by Szilárd Páll 6 months ago

Avoid dev flags in triggering gpuupdate nightly matrix

The GPU update release feature should be tested independently from the
experimental features which were all enabled for the "gpuupdate" nightly
job. This change removes the GMX_GPU_DD_COMMS and GMX_GPU_PME_PP_COMMS
environment variables as well as the unnecessary buffer ops env var.

Refs #3354

Change-Id: I777f6996ca5b1ae1b3e7f787c18d82f605035e47

Revision 1a7301b8 (diff)
Added by Szilárd Páll 6 months ago

Improve GPU update tasks assignment consistency

GPU update task assignment was not consistent with the assumptions and
supported features of the 2020 release, and it did not implement the
correct checks and fallback for cases where GPU update turned out not to
be supported. Specifically, this change makes sure that when separate PME
ranks are used without direct GPU communication for PP-PME, GPU update
falls back to the CPU.

Fixes #3354

Change-Id: I7c9dd67cd8cf61f0201b626b8b7674917e3365a5
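
The fallback described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual change: UpdateTarget, DevFlags, RunSetup and decideUpdateTarget are hypothetical names, and the sketch models only the two conditions mentioned in this issue (the assertion's precondition and the separate-PME-rank case).

```cpp
// Illustrative sketch only; all names below are hypothetical.
enum class UpdateTarget
{
    Cpu,
    Gpu
};

// Stand-in for the development flags discussed in this issue.
struct DevFlags
{
    bool gpuPmePpComms; // direct GPU communication between PP and PME ranks
};

struct RunSetup
{
    bool gpuUpdateRequested;
    bool useGpuForPme;
    bool useGpuForNonbonded;
    bool useGpuBufferOps;
    bool separatePmeRanks;
};

// Decide where the update runs, falling back to the CPU instead of
// letting an unsupported combination reach the assertion.
inline UpdateTarget decideUpdateTarget(const RunSetup& s, const DevFlags& d)
{
    if (!s.gpuUpdateRequested)
    {
        return UpdateTarget::Cpu;
    }
    // Same precondition as in the failing assertion.
    if (!(s.useGpuForPme || (s.useGpuForNonbonded && s.useGpuBufferOps)))
    {
        return UpdateTarget::Cpu;
    }
    // Separate PME ranks without direct GPU PP-PME communication:
    // fall back to the CPU update.
    if (s.separatePmeRanks && !d.gpuPmePpComms)
    {
        return UpdateTarget::Cpu;
    }
    return UpdateTarget::Gpu;
}
```

The point of the design is that the decision and the assertion share one set of conditions, so the task assignment can no longer request a GPU update that the later check rejects.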

History

#1 Updated by Szilárd Páll 7 months ago

Also noticed that the gpuupdate matrix enables a bunch of unnecessary dev features:


This run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable. GPU update with domain decomposition lacks substantial testing and should be used with caution.

GMX_GPU_DD_COMMS environment variable detected, but the 'GPU halo exchange' feature will not be enabled as nonbonded interactions are not offloaded.

GMX_GPU_PME_PP_COMMS environment variable detected, but the 'GPU PME-PP communications' feature was not enabled as PME is not offloaded to the GPU.
Changing nstlist from 10 to 100, rlist from 5.016 to 5.211

We are aiming to test the non-default GPU update release feature, so the dev flags should not be defined.
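
One way a test job could guard against this is to check up front which dev flags are defined. The C++ sketch below is hypothetical and not part of GROMACS or the Jenkins setup: the two variable names are taken from the log above, the helper names are made up, and it assumes a variable counts as "defined" whenever it is present in the environment, whatever its value. The buffer ops variable is left out because its name is not given in this thread.

```cpp
#include <cstdlib>
#include <functional>
#include <string>
#include <vector>

// Lookup callback with the same shape as std::getenv, so the check can be
// exercised without touching the real environment.
using EnvLookup = std::function<const char*(const char*)>;

// Returns the names of the dev-flag variables that the lookup reports as
// defined (assumption: presence alone counts, regardless of value).
inline std::vector<std::string> definedDevFlags(const EnvLookup& lookup)
{
    static const char* const devFlagNames[] = { "GMX_GPU_DD_COMMS", "GMX_GPU_PME_PP_COMMS" };

    std::vector<std::string> found;
    for (const char* name : devFlagNames)
    {
        if (lookup(name) != nullptr)
        {
            found.emplace_back(name);
        }
    }
    return found;
}

// Convenience wrapper that queries the process environment.
inline std::vector<std::string> definedDevFlagsInEnvironment()
{
    return definedDevFlags([](const char* name) -> const char* { return std::getenv(name); });
}
```

A release-feature job would then expect definedDevFlagsInEnvironment() to return an empty list before launching the run.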

#2 Updated by Artem Zhmurov 7 months ago

Assertions in md.cpp went out of sync with the task assignment. Should be an easy fix.

We had a solid reason to enable these environment variables for GPU update. I think the idea was to test more - these don't do anything in case of a single rank. But in case of multiple ranks, the GPU update is used first, followed by the CPU update. More tests - more bugs captured.

#3 Updated by Szilárd Páll 7 months ago

The gpuupdate matrix for 2020 is meant to test the GPU update release feature. To test this we should not enable all the extra unstable/dev features, because then we (may) test something else.

#4 Updated by Artem Zhmurov 7 months ago

I think we need to pass devFlags to decideGpuUsage and disable GPU update if the GPU comms are not enabled.

#5 Updated by Szilárd Páll 6 months ago

  • Status changed from New to Resolved

#6 Updated by Szilárd Páll 6 months ago

  • Status changed from Resolved to Closed
  • Assignee set to Szilárd Páll
  • Target version set to 2020.1
