Task #1804

investigate OpenCL + MPI

Added by Szilárd Páll over 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
Difficulty:
uncategorized

Description

Domain decomposition (DD) with OpenCL is broken: only MPI with a single device per physical node works. We need to investigate how to fix this.

Associated revisions

Revision a512a937 (diff)
Added by Szilárd Páll over 3 years ago

Fix multiple tMPI ranks per OpenCL device

The OpenCL context and program objects were stored in the gpu_info
struct, which was assumed to be constant per compute host and therefore
shared across the tMPI ranks. Hence, gpu_info was initialized once
and a single pointer to that data was used by all ranks.
As a result, the OpenCL context and program objects of different ranks
sharing a single device were overwritten/corrupted by one another.

Notes:
- MPI still segfaults in clCreateContext() with multiple ranks per node
both with and without GPU sharing, so no changes on that front.
- The AMD OpenCL runtime overhead with all hardware threads in use is
quite significant; as a short-term solution we should consider avoiding
HT by launching fewer threads (and/or warning the user).

Refs #1804

Change-Id: I7c6c53a3e6a049ce727ae65ddf0978f436c04579
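
The failure mode described in revision a512a937 can be illustrated with a minimal, self-contained C++ sketch. It does not use the real GROMACS types or the OpenCL API: `FakeContext`, `SharedGpuInfo`, `DeviceInfo`, `RankRuntime` and both helper functions are hypothetical names introduced only for illustration. The first helper models runtime objects stored in a struct shared by all ranks; the second models keeping them per rank, which is one plausible way to avoid the overwrite.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an OpenCL context; records which rank created it.
struct FakeContext
{
    int ownerRank;
};

// Problematic layout (assumed for illustration): per-rank runtime state
// lives next to static hardware data in a struct shared by all ranks.
struct SharedGpuInfo
{
    std::string deviceName; // static description, safe to share
    FakeContext context;    // runtime state, not safe to share
};

// Alternative layout: static data is shared, runtime state is per rank.
struct DeviceInfo
{
    std::string deviceName;
};

struct RankRuntime
{
    FakeContext context;
};

// Every rank writes its context into the one shared struct, so only the
// last writer's context survives. Returns the rank that "won".
int sharedContextOwner(int numRanks)
{
    assert(numRanks > 0);
    SharedGpuInfo info{ "device0", FakeContext{ -1 } };
    for (int rank = 0; rank < numRanks; ++rank)
    {
        info.context = FakeContext{ rank }; // overwrites the previous rank's context
    }
    return info.context.ownerRank;
}

// Every rank keeps its own runtime state; the shared device data is read-only.
std::vector<int> perRankContextOwners(int numRanks)
{
    assert(numRanks > 0);
    const DeviceInfo         info{ "device0" };
    std::vector<RankRuntime> runtimes(numRanks, RankRuntime{ FakeContext{ -1 } });
    for (int rank = 0; rank < numRanks; ++rank)
    {
        runtimes[rank].context = FakeContext{ rank };
    }
    std::vector<int> owners;
    for (const RankRuntime& runtime : runtimes)
    {
        owners.push_back(runtime.context.ownerRank);
    }
    (void)info; // only read by ranks in a real setup
    return owners;
}
```

With two ranks, `sharedContextOwner(2)` reports that only rank 1's context is left in the shared struct, whereas `perRankContextOwners(2)` keeps one context per rank.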

Revision 8a8904ad (diff)
Added by Szilárd Páll over 3 years ago

Fix multiple MPI ranks per node with OpenCL

Similarly to the thread-MPI case, the source of the issue was
the hardware detection broadcasting the outcome of GPU detection
within a node. With MPI, the platform and device IDs, which are OpenCL
internal entities, differ across processes even if both platform and
device(s) are shared. This caused corruption at context creation on all
ranks other than the first rank in the node (which did the detection).

This change disables the GPU data broadcasting for OpenCL with MPI.

Fixes #1804

Change-Id: I90defdcb3515796c46ba89efb0ed1e3c8b1b35f9
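
Why broadcasting detection results fails in the MPI case (revision 8a8904ad) can be sketched without MPI or OpenCL. In the OpenCL headers, `cl_platform_id` and `cl_device_id` are opaque pointer types, so their values are only meaningful inside the process that obtained them. The model below is a simplified assumption, not GROMACS code: `FakeProcessRuntime` and the two helper functions are hypothetical, and handles are simulated as integers offset by a per-process base.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical model of one process's OpenCL runtime. Handles are opaque
// values that this process handed out itself; no other value is valid here.
struct FakeProcessRuntime
{
    std::uintptr_t           base;         // differs per process, like an address space
    std::set<std::uintptr_t> validHandles; // handles this process has detected

    explicit FakeProcessRuntime(std::uintptr_t processBase) : base(processBase) {}

    // Simulated device detection: returns a process-local handle.
    std::uintptr_t detectDevice(unsigned index)
    {
        // Keep the simulated handle ranges of different processes disjoint.
        assert(index < 0x1000);
        const std::uintptr_t handle = base + index;
        validHandles.insert(handle);
        return handle;
    }

    // Simulated context creation: succeeds only for handles owned by this process.
    bool canCreateContext(std::uintptr_t handle) const
    {
        return validHandles.count(handle) != 0;
    }
};

// Rank 0 detects the device and "broadcasts" its raw handle to rank 1.
bool broadcastHandleIsUsable()
{
    FakeProcessRuntime   rank0(0x1000);
    FakeProcessRuntime   rank1(0x2000);
    const std::uintptr_t handleFromRank0 = rank0.detectDevice(0);
    rank1.detectDevice(0); // same physical device, different handle value
    return rank1.canCreateContext(handleFromRank0);
}

// Each rank runs detection itself and uses only its own handle.
bool locallyDetectedHandleIsUsable()
{
    FakeProcessRuntime   rank1(0x2000);
    const std::uintptr_t ownHandle = rank1.detectDevice(0);
    return rank1.canCreateContext(ownHandle);
}
```

In this model the broadcast handle is rejected on the receiving rank while the locally detected one is accepted, which mirrors the approach of disabling the GPU data broadcast so that each process performs its own detection.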

History

#1 Updated by Szilárd Páll over 3 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Target version changed from 5.x to 2016

Fix for the tMPI case in review; the MPI case is still unresolved.

#2 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '11' for Issue #1804.
Uploader: Szilárd Páll
Change-Id: I7c6c53a3e6a049ce727ae65ddf0978f436c04579
Gerrit URL: https://gerrit.gromacs.org/5729

#3 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1804.
Uploader: Szilárd Páll
Change-Id: I90defdcb3515796c46ba89efb0ed1e3c8b1b35f9
Gerrit URL: https://gerrit.gromacs.org/5780

#4 Updated by Szilárd Páll over 3 years ago

  • Status changed from In Progress to Fix uploaded

#5 Updated by Szilárd Páll over 3 years ago

  • Status changed from Fix uploaded to Resolved

#6 Updated by Szilárd Páll over 3 years ago

  • Status changed from Resolved to Closed
