Task #2515

Feature #2054: PME on GPU

Task #2453: PME OpenCL porting effort

clFFT ROCm compatibility problem

Added by Aleksei Iupinov over 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: High
Category: core library
Target version:
Difficulty: uncategorized

Description

PME with OpenCL uses 3D FFT as implemented in the external clFFT library.
As of May 2018, the current ROCm stack (1.8.0) is known to have problems compiling clFFT: https://github.com/clMathLibraries/clFFT/issues/218
The PME OpenCL kernels do work with ROCm (the Ewald unit tests pass), but PmeTest in the mdrun tests fails, because some of its iterations use -pmefft gpu/auto.
(Additionally, clFFT kernel compilation with ROCm emits a few warnings at run time.)
clFFT is known to work at least with the recent AMDGPU-PRO drivers 17.50 and 18.10, with all supported CUDA OpenCL versions, and with whatever runtime the "old" AMD build config has (fglrx 15.something?).
With this in mind, if the situation does not improve before release 2019 (through improvements in the ROCm OpenCL runtime or in clFFT, or even via the unlikely alternative of using the rocFFT library: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/issues/53),
an OpenCL compiler version check should be implemented in our device sanity checks.
I assume one can detect the ROCm compiler, whether in host or device code, and disallow PmeRunMode::GPU accordingly (still allowing PmeRunMode::Mixed).
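
A minimal host-side sketch of the fallback proposed above, not the GROMACS implementation. It assumes that the ROCm OpenCL runtime can be recognized from the CL_DRIVER_VERSION string by the markers "HSA" and ",LC)" (for example "2679.0 (HSA1.1,LC)"); that marker convention, the helper names looksLikeRocmDriver and restrictPmeRunMode, and the local PmeRunMode enum are all assumptions made for illustration. The sketch only detects the runtime; it does not implement the compiler version comparison that a real check would need.

```cpp
#include <string>

// Local stand-in for the run modes named in the description (hypothetical, not the GROMACS enum).
enum class PmeRunMode
{
    CPU,
    GPU,
    Mixed
};

// Assumption: ROCm driver strings look like "2679.0 (HSA1.1,LC)",
// while other AMD runtimes report something else, e.g. "2527.3 (PAL,HSAIL)".
bool looksLikeRocmDriver(const std::string& driverVersion)
{
    return driverVersion.find("HSA") != std::string::npos
           && driverVersion.find(",LC)") != std::string::npos;
}

// Downgrades a full-GPU PME request to Mixed (FFT on the CPU) when the
// driver string matches the assumed ROCm pattern; other requests pass through.
PmeRunMode restrictPmeRunMode(PmeRunMode requested, const std::string& driverVersion)
{
    if (requested == PmeRunMode::GPU && looksLikeRocmDriver(driverVersion))
    {
        return PmeRunMode::Mixed;
    }
    return requested;
}
```

In real code the driver string would come from clGetDeviceInfo with CL_DRIVER_VERSION; passing it in as a plain string keeps the decision logic testable without an OpenCL device.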


Related issues

Related to GROMACS - Task #2500: detect and allow linking external clFFT, or no clFFT (Closed)
Related to GROMACS - Bug #2420: OpenCL implementation not doing device sanity checks (Closed)

Associated revisions

Revision a41344a0
Added by Aleksei Iupinov over 1 year ago

Added the bundled clFFT into OpenCL builds

Used an object library, since we have no need of a real library, to
have or to install, whether shared or static. Checked for the
availability of dynamic loading, and made it available portably to
libgromacs.

Clfft initialization class is added and used in mdrunner to
initialize/tear down clFFT library resources in a thread-safe
manner, and only on ranks that require such setup. Noted TODOs
for future work.

Noted a useful style for explicit listing of source files.

Refs #2500
Refs #2515
Refs #2535

Change-Id: I62d7d66f65e147bde17929ccc30abad36e2373c6
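
The thread-safe initialization and teardown mentioned in the commit message above could look roughly like the following RAII sketch. This is not the class from the revision: the name FftLibraryGuard, the reference-counting scheme, and the librarySetup/libraryTeardown stand-ins are assumptions for illustration. In a real build the stand-ins would presumably wrap clFFT's clfftSetup() and clfftTeardown().

```cpp
#include <mutex>

namespace sketch
{

// Stand-ins for library-wide setup/teardown; counters make the behavior observable.
int g_setupCalls    = 0;
int g_teardownCalls = 0;
void librarySetup() { ++g_setupCalls; }
void libraryTeardown() { ++g_teardownCalls; }

// Hypothetical guard: the first live instance sets the library up,
// the last one to be destroyed tears it down. A mutex protects the count.
class FftLibraryGuard
{
public:
    FftLibraryGuard()
    {
        std::lock_guard<std::mutex> lock(mutex());
        if (count()++ == 0)
        {
            librarySetup();
        }
    }
    ~FftLibraryGuard()
    {
        std::lock_guard<std::mutex> lock(mutex());
        if (--count() == 0)
        {
            libraryTeardown();
        }
    }
    FftLibraryGuard(const FftLibraryGuard&) = delete;
    FftLibraryGuard& operator=(const FftLibraryGuard&) = delete;

private:
    static std::mutex& mutex()
    {
        static std::mutex m;
        return m;
    }
    static int& count()
    {
        static int c = 0;
        return c;
    }
};

} // namespace sketch
```

Only code paths that actually need the GPU FFT would create a guard, which is one way to get the "only on ranks that require such setup" behavior the commit message describes.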

History

#1 Updated by Aleksei Iupinov over 1 year ago

  • Related to Task #2500: detect and allow linking external clFFT, or no clFFT added

#2 Updated by Aleksei Iupinov over 1 year ago

  • Related to Bug #2420: OpenCL implementation not doing device sanity checks added

#3 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '9' for Issue #2515.
Uploader: Aleksei Iupinov
Change-Id: gromacs~master~I62d7d66f65e147bde17929ccc30abad36e2373c6
Gerrit URL: https://gerrit.gromacs.org/7837

#4 Updated by Szilárd Páll about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to Szilárd Páll

Update: our reports and pushing have paid off. With ROCm 1.9, clFFT compiles and all the clFFT unit tests that we care about pass (only four DP tests fail, but we have no DP GPU support, so that is not an issue). The GROMACS unit tests and regression tests pass too.

I'm in the process of verifying this on Fiji/Baffin and will update this Redmine issue accordingly.

#5 Updated by Szilárd Páll about 1 year ago

Looks like only double precision FFT kernel tests fail on Fiji too, so I'm cautiously going to state: ROCm + clFFT is likely safe to use on these AMD architectures.

Issue filed against ROCm-OpenCL-Driver on GitHub:
https://github.com/RadeonOpenCompute/ROCm-OpenCL-Driver/issues/72

Will keep testing future ROCm releases; for reference, this is the procedure:

git clone https://github.com/clMathLibraries/clFFT.git
mkdir build_clFFT && cd build_clFFT
# set up fftw for cmake detection
cmake ../clFFT/src -DBUILD_TEST=ON && make &&\
  LD_LIBRARY_PATH=$PWD/library staging/Test 2>&1 | tee test-${HOSTNAME}.log

#6 Updated by Szilárd Páll about 1 year ago

  • Status changed from In Progress to Resolved

#7 Updated by Mark Abraham about 1 year ago

  • Status changed from Resolved to Closed
