Bug #3125

OpenCL on Volta and Turing borken

Added by Magnus Lundborg about 1 month ago. Updated 16 days ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info: Also affects git master (2020-beta1)
Affected version:
Difficulty: uncategorized

Description

The attached system (testosterone in a stacked lipid bilayer system with little water) crashes on step 0 when using OpenCL, but works with CUDA or -pme cpu -nb cpu. I'm attaching tpr and log files.

unperturbed_opencl.log (72.6 KB) Magnus Lundborg, 10/09/2019 10:49 AM
unperturbed_cpu.log (81.6 KB) Magnus Lundborg, 10/09/2019 10:49 AM
testosterone_unperturbed.tpr (1.1 MB) Magnus Lundborg, 10/09/2019 10:49 AM

History

#1 Updated by Berk Hess about 1 month ago

It runs fine for me with OpenCL on Nvidia.

#2 Updated by Magnus Lundborg about 1 month ago

It works on my computer at home as well (I cannot get cmake to identify the correct opencl versions on my machine at work regardless of cuda version and proper cuda libraries), but I got the error on the threadripper-gpu01 machine. It's probably some hardware/software problem. But the fact that it compiles and then crashes is a bit worrying.

#3 Updated by Szilárd Páll about 1 month ago

Magnus Lundborg wrote:

It works on my computer at home as well (I cannot get cmake to identify the correct opencl versions on my machine at work regardless of cuda version and proper cuda libraries), but I got the error on the threadripper-gpu01 machine. It's probably some hardware/software problem. But the fact that it compiles and then crashes is a bit worrying.

That's likely due to the broken NVIDIA Volta (possibly Turing) support. Please try threadripper-gpu02 (has Pascal GPUs) just to confirm.

#4 Updated by Magnus Lundborg about 1 month ago

I didn't know Volta was broken. Is it only OpenCL that's the problem? I'll try threadripper-gpu02 and report (possibly tomorrow). But related to that (I'm trying to understand why FEP PME on GPU fails in OpenCL), do you know if there is a problem with Kepler as well?

#5 Updated by Magnus Lundborg about 1 month ago

It worked on threadripper-gpu02. So, should this be closed then?

#6 Updated by Szilárd Páll about 1 month ago

  • Subject changed from Too many LINCS warnings step 0 when using OpenCL to OpenCL on Volta borken

Magnus Lundborg wrote:

I didn't know Volta was broken. Is it only OpenCL that's the problem? I'll try threadripper-gpu02 and report (possibly tomorrow). But related to that (I'm trying to understand why FEP PME on GPU fails in OpenCL), do you know if there is a problem with Kepler as well?

I know of no issues on Kepler.

Magnus Lundborg wrote:

It worked on threadripper-gpu02. So, should this be closed then?

Perhaps we should give the issue a more descriptive title and keep a record of it. It would be good if NVIDIA looked into it.

#7 Updated by Berk Hess about 1 month ago

Perhaps we should give the issue a more descriptive title and keep a record of it. It would be good if NVIDIA looked into it.

I like the "borken" though!

Can we detect the Nvidia architecture in OpenCL? If so, we should disable it, since this is dangerous for the user and sucks up time of developers in testing.

#8 Updated by Magnus Lundborg about 1 month ago

Szilárd Páll wrote:

Magnus Lundborg wrote:

I didn't know Volta was broken. Is it only OpenCL that's the problem? I'll try threadripper-gpu02 and report (possibly tomorrow). But related to that (I'm trying to understand why FEP PME on GPU fails in OpenCL), do you know if there is a problem with Kepler as well?

I know of no issues on Kepler.

The FEP with PME on GPU did not work on Kepler (with CUDA 9.2): forces, energies and the virial were wrong, because the contents of grid[0].d_fractShiftsTable and grid[0].d_gridlineIndicesTable were lost when using two grids. I'll see if I can get CUDA 10 to work and check whether that was the problem. It works on threadripper-gpu02 and seemingly in the Jenkins tests.

So, I agree with Berk's suggestions that we should, if possible, disable OpenCL on architectures that are not proven to work or at least on architectures proven not to work.

#9 Updated by Roland Schulz about 1 month ago

Comparing the clinfo output from Volta and Pascal should quickly show whether there is any info you can use.

#10 Updated by Szilárd Páll about 1 month ago

Berk Hess wrote:

Perhaps we should give the issue a more descriptive title and keep a record of it. It would be good if NVIDIA looked into it.

I like the "borken" though!

Can we detect the Nvidia architecture in OpenCL?

clinfo does print NVIDIA specific information IIRC, so we can likely detect it.

If so, we should disable it, since this is dangerous for the user and sucks up time of developers in testing.

I agree.

Sorry, I ran into this issue before (and IIRC raised it with some NVIDIA people), but never filed a redmine issue or tried to look deeper into it. In hindsight, I should have at least filed an issue and added a note in the user guide's "known issues/limitations" section.

I will certainly not have time before beta2, but at some later point I can try to look into it.

#11 Updated by Szilárd Páll about 1 month ago

Here's the relevant clinfo code that we should adopt:
https://github.com/Oblomov/clinfo/blob/master/src/clinfo.c#L1502

CC 7.0 is what we'd blacklist.
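
A minimal sketch of the detection step, assuming the compute capability would be read through NVIDIA's cl_nv_device_attribute_query extension (CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV and CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV passed to clGetDeviceInfo). Before making that query we would probably want to check that the device advertises the extension at all. The helper names below are hypothetical and not existing GROMACS code; the sketch only covers the string check so that it does not depend on the OpenCL headers.

```cpp
#include <sstream>
#include <string>

// Returns true if the space-separated extension list (the string that
// clGetDeviceInfo reports for CL_DEVICE_EXTENSIONS) contains the exact
// token `name`. Hypothetical helper, not part of GROMACS.
bool hasExtension(const std::string& extensionList, const std::string& name)
{
    std::istringstream tokens(extensionList);
    std::string        token;
    while (tokens >> token)
    {
        // Compare whole tokens so that a longer extension name sharing
        // the same prefix does not count as a match.
        if (token == name)
        {
            return true;
        }
    }
    return false;
}

// True if the device advertises the NVIDIA extension that exposes the
// compute capability; only then is the CC query meaningful.
bool canQueryNvidiaComputeCapability(const std::string& extensionList)
{
    return hasExtension(extensionList, "cl_nv_device_attribute_query");
}
```

On devices that do not list the extension (e.g. AMD or Intel), the compute capability query would simply be skipped and no blacklist applied.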

#12 Updated by Szilárd Páll about 1 month ago

  • Subject changed from OpenCL on Volta borken to OpenCL on Volta and Turing borken

Szilárd Páll wrote:

Here's the relevant clinfo code that we should adopt:
https://github.com/Oblomov/clinfo/blob/master/src/clinfo.c#L1502

CC 7.0 is what we'd blacklist.

also CC 7.5

#13 Updated by Magnus Lundborg about 1 month ago

Szilárd Páll wrote:

Szilárd Páll wrote:

Here's the relevant clinfo code that we should adopt:
https://github.com/Oblomov/clinfo/blob/master/src/clinfo.c#L1502

CC 7.0 is what we'd blacklist.

also CC 7.5

CC 3.0 is the version that lost the contents of d_fractShiftsTable and d_gridlineIndicesTable when using more than one grid.

#14 Updated by Paul Bauer 19 days ago

  • Target version changed from 2020-beta2 to 2020-beta3

bump

#15 Updated by Magnus Lundborg 16 days ago

With the latest version of FEP PME on GPU ( https://gerrit.gromacs.org/c/gromacs/+/13382 ) CC 3.0 works since there is only one copy of d_fractShiftsTable and d_gridlineIndicesTable even if there are two grids. So, it's only CC 7.0 and 7.5 that we need to care about now.
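
For the record, the resulting blacklist could look roughly like the sketch below. It assumes the major and minor compute capability have already been obtained from the device; the struct, the table and the function name are hypothetical, not existing GROMACS code. The table holds only the two versions named in this issue (CC 7.0 for Volta, CC 7.5 for Turing).

```cpp
#include <algorithm>
#include <array>

// Compute capability as reported by the NVIDIA driver (hypothetical type).
struct ComputeCapability
{
    unsigned int ccMajor;
    unsigned int ccMinor;
};

// Compute capabilities on which OpenCL is known to give wrong results
// according to this issue: 7.0 (Volta) and 7.5 (Turing).
static const std::array<ComputeCapability, 2> c_brokenOpenclComputeCapabilities = {
    { { 7, 0 }, { 7, 5 } }
};

// Returns true if OpenCL should be disabled for a device with the given
// compute capability. Matches exact (major, minor) pairs only.
bool isOpenclKnownBroken(const ComputeCapability& cc)
{
    return std::any_of(c_brokenOpenclComputeCapabilities.begin(),
                       c_brokenOpenclComputeCapabilities.end(),
                       [&cc](const ComputeCapability& broken) {
                           return broken.ccMajor == cc.ccMajor
                                  && broken.ccMinor == cc.ccMinor;
                       });
}
```

Matching exact pairs rather than "everything from 7.0 up" is a deliberate choice in this sketch: we only know about the two listed versions, so anything else is left enabled until proven broken.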
