Project

General

Profile

Bug #1784

OpenCL queues need clFlush for correct concurrency

Added by Szilárd Páll over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

The current OpenCL implementation does not use clFlush(). However, as pointed out by AMD devs in recent email exchnge, to ensure CPU-GPU concurrent execution, clFlush needs to be called in order to kick off work on the GPU; from the specs:

Issues all previously queued OpenCL commands in a command-queue to the device associated with the command-queue.

Update:
I have just realized that this is not just needed to ensure CPU-GPU overlap, but based on the specs it is mandatory in inter-stream synchronization scenarios (like our non-local stream waiting for an event recorded in the local one).
This omission could be the cause of the broken DD/multi-GPU with OpenCL.
[Ref Sec 5.13 "Flush and Finish" of the v1.2 spec doc]

Associated revisions

Revision 73e5825a (diff)
Added by Szilárd Páll over 4 years ago

add clFlush to kick of OpenCL work

The OpenCL standard specifies clFlush as a necessary step which issues
all enqueued commands ensuring CPU-GPU concurrency which otherwise is
not guaranteed. For this reason three flushes are added to dispatch work
in the queue.

Additionally the specs (v1.2 sec 5.13) state that a flush is required in
the inter-stream synchronization case.

In total four flushes are added, their overhead seems to be small.

Fixes #1784

Change-Id: Ia287998c2716e21708979d6e8d261f853e39d4ef

History

#1 Updated by Szilárd Páll over 4 years ago

  • Description updated (diff)
  • Assignee set to Szilárd Páll

#2 Updated by Szilárd Páll over 4 years ago

  • Description updated (diff)

#3 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related DRAFT patchset '1' for Issue #1784.
Uploader: Szilárd Páll ()
Change-Id: Ia287998c2716e21708979d6e8d261f853e39d4ef
Gerrit URL: https://gerrit.gromacs.org/4917

#4 Updated by Szilárd Páll over 4 years ago

  • Description updated (diff)

#5 Updated by Szilárd Páll over 4 years ago

  • Status changed from New to Fix uploaded

#6 Updated by Szilárd Páll over 4 years ago

Update: based on further analysis, it seems that following the standard two more flush calls should be added, one right after the force clearing to ensure concurrency of this task with the constraints/update and another right after the pair list init which ensures overlap of the type copy with the search.

Queue flushes are harmless, they could however add overhead. Doing some measurements to check.

#7 Updated by Szilárd Páll over 4 years ago

Szilárd Páll wrote:

Queue flushes are harmless, they could however add overhead. Doing some measurements to check.

There seems to be a slight, but measurable overhead in the launch time, but also quite inconsistent and fluctuating. The total performance is barely affected, so I'll push up the change.

#8 Updated by Szilárd Páll over 4 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#9 Updated by Mark Abraham over 4 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF