Task #2030

make the OpenCL nonbonded kernels work on Intel iGPU

Added by Szilárd Páll about 4 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Target version:
-
Difficulty:
hard

Description

The current OpenCL kernels have been developed for and tested mostly on NVIDIA and AMD GPUs. Intel iGPUs could be an interesting target, especially starting with Gen9/Skylake, which provides up to 1.1 TFLOPS of single-precision throughput. Additionally, it would be beneficial to i) provide a kernel flavor with no/less SIMD width dependence and ii) investigate the different NBNXN algorithm flavors (varying super-cluster/cluster sizes) and their effect on the Intel hardware.

Two runtimes and OpenCL stacks are available: beignet (OSS) and the Intel proprietary OpenCL. The former is an easy install (but experimental); the latter still requires a custom kernel or manual kernel module compilation.

Status:

Note: There is a fix missing from master but included in the beignet release branch that solves the same include issue that we need to work around on Apple: https://cgit.freedesktop.org/beignet/commit/?h=Release_v1.1&id=8e9ef20


Related issues

Related to GROMACS - Task #2454: OpenCL infrastructure improvements (Closed)
Related to GROMACS - Task #2516: Support PME OpenCL execution width < 16 (New)

Associated revisions

Revision 3c373f91 (diff)
Added by Szilárd Páll over 2 years ago

Refactor cj preload in the nonbonded OpenCL kernels

This change hides the implementation detail of preload/load operation in
inline functions configured based on the source flavor.
One of the main benefits is that this corrects the "nowarp" force and
pruning kernels removing a warp assumption (that was violated on
hardware with <32 wide wavefronts).

A step towards correctness on alternative GPU hardware, so
Refs #2030

Change-Id: If88d11df6613eadc9b59e5f03c34bac52df3514b

Revision d775a10f (diff)
Added by Roland Schulz almost 2 years ago

Recognize Intel GPU as compatible

Note new hardware support in the user guide
Also enable I_PREFETCH by default

Fixes #2030

Change-Id: I9c3efe70df4d273e463bf49f5ae4f0959a590ebe

History

#1 Updated by Szilárd Páll about 4 years ago

  • Description updated (diff)

#2 Updated by Szilárd Páll over 2 years ago

  • Related to Task #2454: OpenCL infrastructure improvements added

#3 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '15' for Issue #2030.
Uploader: Szilárd Páll
Change-Id: gromacs~master~If88d11df6613eadc9b59e5f03c34bac52df3514b
Gerrit URL: https://gerrit.gromacs.org/5752

#4 Updated by Roland Schulz over 2 years ago

The runtime which is now available and will be the main one going forward is: https://01.org/compute-runtime . It is fully Open-Source and the basis of all future development. There aren't any binaries for it yet but it is fully available on github and fully works.

#5 Updated by Roland Schulz over 2 years ago

Is it known what other issues need to be resolved to get correct results for iGPU?

#6 Updated by Szilárd Páll over 2 years ago

Roland Schulz wrote:

The runtime which is now available and will be the main one going forward is: https://01.org/compute-runtime . It is fully Open-Source and the basis of all future development. There aren't any binaries for it yet but it is fully available on github and fully works.

Thanks for the update, I have recently read about it in the tech-news and was quite excited to hear, especially given the cross-platform aspect.

Is it known what other issues need to be resolved to get correct results for iGPU?

The above linked change should correct the pruning kernels (and one long-ago identified issue in the force kernels). The remaining issue is most likely in the reduction.

#7 Updated by Szilárd Páll over 2 years ago

Szilárd Páll wrote:

The above linked change should correct the pruning kernels (and one long-ago identified issue in the force kernels). The remaining issue is most likely in the reduction.

This should be verifiable by removing the workgroup-local reduction stage (which may violate the SIMD-width assumption) and doing a straight atomic_add to global memory.
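
To illustrate why such a check could isolate the problem, here is a minimal CPU-side sketch in Python. It is not the GROMACS kernel code: the function names, the 32-lane buffer, and the scheduling model (lanes run in lockstep only within a hardware SIMD group, and groups run one after another because the code has no barriers) are all assumptions made for illustration.

```python
def reduce_without_barriers(values, hw_width):
    """Toy model of a local-memory tree reduction that omits barriers.

    Assumption (hypothetical schedule): lanes execute in lockstep only within
    groups of hw_width, and each group runs the whole loop before the next
    group starts. Returns the value that ends up in slot 0.
    """
    n = len(values)
    buf = list(values)
    for start in range(0, n, hw_width):  # one hardware SIMD group at a time
        stride = n // 2
        while stride >= 1:
            # Lanes of this group that are active at this reduction step.
            lanes = [i for i in range(start, min(start + hw_width, n)) if i < stride]
            # Lockstep within the group: all reads happen before any write.
            reads = [buf[i + stride] for i in lanes]
            for i, r in zip(lanes, reads):
                buf[i] += r
            stride //= 2
    return buf[0]


def reduce_with_atomics(values):
    """Stand-in for every lane doing an atomic add straight to global memory."""
    total = 0.0
    for v in values:
        total += v  # the order does not affect which terms are included
    return total


if __name__ == "__main__":
    forces = [1.0] * 32  # one partial contribution per lane
    print(reduce_without_barriers(forces, 32))  # width matches the assumption
    print(reduce_without_barriers(forces, 8))   # narrower hardware drops terms
    print(reduce_with_atomics(forces))          # independent of SIMD width
```

In this model the barrier-free reduction is complete when the hardware width matches the assumed 32 lanes, but loses contributions at width 8, while the atomic path includes every term regardless of width. If results become correct once the local reduction is replaced by atomics, the reduction stage is the likely culprit.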

#8 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2030.
Uploader: Roland Schulz
Change-Id: gromacs~master~I3a6fa9c7b154a7d0284546bc4c3f3a1956147346
Gerrit URL: https://gerrit.gromacs.org/7738

#9 Updated by Mark Abraham over 2 years ago

@Roland We're a bit short on design and implementation goals here - can we have ~20 lines on what runtime/driver/compiler stacks and/or hardware are targeted in this effort? We need to participate in the decision on what would make for reasonable testing coverage, what cluster sizes make sense, whether we need to think about making the search code better able to do run-time configuration based on which OpenCL-supporting hardware will run the kernels, etc.

#10 Updated by Roland Schulz over 2 years ago

My target is Neo runtime for HW newer than SKL (maybe eventually also BDW when Neo driver is stable). I'm not sure what else you would like to know.

#11 Updated by Mark Abraham over 2 years ago

Roland Schulz wrote:

My target is Neo runtime for HW newer than SKL (maybe eventually also BDW when Neo driver is stable). I'm not sure what else you would like to know.

OK good.

It would have been nicer to have a Redmine (new or old) that clarified that before we considered whether and when it was feasible to implement CI now, or in a little while when stacks matured.

Similarly, what the relevant HW characteristics might be for current vs possible future devices.

Also, we should have been able to establish that it's OK to get some changes to e.g. kernels and search in and working, even though it is currently unclear how e.g. user-space decisions about which OpenCL devices to use will be made.

#12 Updated by Aleksei Iupinov over 2 years ago

  • Related to Task #2516: Support PME OpenCL execution width < 16 added

#13 Updated by Szilárd Páll over 2 years ago

  • Subject changed from make OpenCL kernels work on Intel iGPU to make the OpenCL nonbonded kernels work on Intel iGPU

#14 Updated by Szilárd Páll over 2 years ago

I've done a limited amount of testing; here are the conclusions:
- beignet 1.3.1 on HSW seems to work except one regressiontest failure (that may not actually be Intel related);
- with CL_SIZE=4 the position-restraints test fails on all three platforms;
- TODO: the kernel selection report is incorrect with CL_SIZE=4 (possibly the buffer estimation too), currently we report 4x2, but I'm not sure pruning will split j's.

Other than the regressiontest issue (instability?), it seems that beignet might work for current needs, e.g. post-submit verification.

#15 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #2030.
Uploader: Roland Schulz
Change-Id: regressiontests~master~Iebf46b085aa4fea36ac9eddc4ac3fa72846f88f7
Gerrit URL: https://gerrit.gromacs.org/7953

#16 Updated by Mark Abraham over 2 years ago

Should we do e.g. nightly testing on NVIDIA or AMD OpenCL using the cluster size that Intel uses? We don't want that combination for performance, but it's probably useful to have that flexibility in the code.

#17 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '13' for Issue #2030.
Uploader: Szilárd Páll
Change-Id: gromacs~master~I9c3efe70df4d273e463bf49f5ae4f0959a590ebe
Gerrit URL: https://gerrit.gromacs.org/7812

#18 Updated by Roland Schulz almost 2 years ago

  • Status changed from In Progress to Resolved

#19 Updated by Szilárd Páll over 1 year ago

  • Status changed from Resolved to Closed
