Task #1938

work around broken NVIDIA JIT caching

Added by Szilárd Páll almost 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Low
Category: mdrun
Target version:
Difficulty: uncategorized

Description

The NVIDIA JIT compiler's binary caching is simply broken. It manifests as kernels not getting recompiled even though the source has changed. As a result, stale kernel binaries can be used when recompilation is required, and this can often result in weird errors or incorrect results that are hard to explain or debug.

This mostly affects devs, but users can be affected too if e.g. they pull a bugfix and the kernels don't get recompiled; possibly even mismatching GROMACS binary/kernel combinations can happen too.

No better idea than forcing the JIT cache off by setting the CUDA_CACHE_DISABLE env. var. Ugly, but seems warranted.


Related issues

Related to GROMACS - Feature #1720: Manage GROMACS JIT caching (Rejected)

Associated revisions

Revision 6556d8f4 (diff)
Added by Szilárd Páll over 3 years ago

Disable NVIDIA JIT cache with OpenCL

The NVIDIA JIT caching is known to be broken with OpenCL compilation in
the case when the kernel source changes but the path does not change
(e.g. kernels get overwritten). Therefore we disable the JIT caching on
NVIDIA.

Fixes #1938

Change-Id: I68749ea695a891ab8f14f07fc830ce632299b0c8

History

#1 Updated by Mark Abraham over 3 years ago

Is this still needing action? All we could do is document, and perhaps export it from GMXRC

#2 Updated by Szilárd Páll over 3 years ago

Mark Abraham wrote:

Is this still needing action?

Not sure, I have not tested NVIDIA OpenCL much lately, but I doubt much has changed. I'll try to see if I can still reproduce.

All we could do is document, and perhaps export it from GMXRC

We could use setenv() too.

#3 Updated by Szilárd Páll over 3 years ago

  • Target version changed from 5.1.3 to 2016

Just tested and if the kernel file changes I can reproduce the incorrect cache reuse with a 364.19 driver. However, it seems that if the path changes, e.g. if a new installation is used, the cache is not reused. Hence, AFAICT it will affect only devs, so documenting this and avoiding other hacks might be enough. So perhaps it's not even worth targeting 5.1.

#4 Updated by Mark Abraham over 3 years ago

  • Priority changed from Normal to Low

So, if the build is not from a tarball / is from a git repo, then we disable CUDA caching?

#5 Updated by Mark Abraham over 3 years ago

  • Target version deleted (2016)

Doesn't affect users, so not a priority for a release branch. Retarget if there's progress.

#6 Updated by Szilárd Páll over 3 years ago

Sorry for the late feedback. There is a slight danger that if one installs into the same location, e.g. the default one in /usr/local or without patch version /opt/gromacs-16, things can go wrong.

Not highly important, but I have a fix; if wanted, it can be merged into either rel-2016 or master.

#7 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related DRAFT patchset '2' for Issue #1938.
Uploader: Szilárd Páll
Change-Id: I68749ea695a891ab8f14f07fc830ce632299b0c8
Gerrit URL: https://gerrit.gromacs.org/5992

#8 Updated by Mark Abraham over 3 years ago

  • Status changed from New to Resolved
  • Assignee set to Szilárd Páll
  • Target version set to 2016

#10 Updated by Szilárd Páll over 3 years ago

  • Status changed from Resolved to Closed

#11 Updated by Mark Abraham over 3 years ago
