
Task #2408

device busy error with CC 2.0 and 6.1 in the same run

Added by Szilárd Páll almost 2 years ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee: Mark Abraham
Category: mdrun
Target version: 2019
Difficulty: uncategorized

Description

On a machine with:

#0: NVIDIA GeForce GTX 1050, compute cap.: 6.1, ECC:  no, stat: compatible
#1: NVIDIA GeForce GTX 580, compute cap.: 2.0, ECC:  no, stat: compatible

gmx mdrun -nsteps 0 -gpu_id 01 -ntmpi 2

produces the following error at the first API call:

cudaFuncGetAttributes failed: all CUDA-capable devices are busy or unavailable

The error is not reproducible with either two CC 2.0 cards or with a CC 6.1 card combined with another CC 6.1 or a CC 5.2 card, but it does reproduce with the 375.66, 375.82, and 387.34 drivers.

This means either that there is some subtle bug in mdrun, or that a driver bug/feature prevents this combination of ancient and recent hardware from working within the same run.
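
For reference, a minimal standalone probe (not GROMACS code; the kernel and program structure here are illustrative) that makes the same kind of first API call on every device and reports the result, which helps separate an mdrun bug from a driver-level restriction:

// probe.cu -- sketch independent of GROMACS: query a trivial kernel's
// attributes on each device and report which devices return an error.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int id = 0; id < count; ++id)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, id);
        cudaSetDevice(id);
        cudaFuncAttributes attr;
        // Mirrors the first API call that fails in mdrun.
        cudaError_t stat = cudaFuncGetAttributes(&attr, dummyKernel);
        printf("#%d: %s, CC %d.%d -> %s\n", id, prop.name, prop.major, prop.minor,
               stat == cudaSuccess ? "OK" : cudaGetErrorString(stat));
        cudaDeviceReset(); // drop the context before probing the next device
    }
    return 0;
}

To exercise both cards this would likely need to be built with device code for both architectures, e.g. -gencode for sm_20 and sm_61 under CUDA 8, since sm_20 support was dropped from later toolkits.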

Associated revisions

Revision 7df8e1f0 (diff)
Added by Mark Abraham over 1 year ago

Update use of gpu_id in Jenkins

We plan to remove one of the GPUs from the slave that had three, so we
need to update the testing configurations. The new configuration might
balance load a little better.

Refs #2408, #2410

Change-Id: Ia63cb701049a02ad86fdd9f9c03388cd11a14bcc

Revision 64f14c66 (diff)
Added by Mark Abraham over 1 year ago

Update use of gpu_id in Jenkins

We plan to remove one of the GPUs from the slave that had three, so we
need to update the testing configurations. The new configuration might
balance load a little better.

Refs #2408, #2410

Change-Id: Ie253ef07998cf45c5fc2e118a5394f0817e66f75

Revision 09e3626a (diff)
Added by Mark Abraham over 1 year ago

Update use of gpu_id in Jenkins

We plan to remove one of the GPUs from the slave that had three, so we
need to update the testing configurations. The new configuration might
balance load a little better.

Refs #2408, #2410

Change-Id: I3b5d9c8702c021f0676ea3ca12f9e7f1d3317ed1

Revision 354ebb3e (diff)
Added by Mark Abraham about 1 year ago

Removed support for NVIDIA CC 2.x devices (codename Fermi)

These are no longer tested or supported, but it is possible that the
OpenCL version of GROMACS will still run on such old devices.

Various code for configuration, the use of texture objects, the use of
shared memory, and the kernel dispatch are now simpler.

Fixes #2408
Fixes #2410
Fixes #2665

Change-Id: Ia7a00e5d6a97f93cd2768beb7ad56b2cce628a6f

History

#1 Updated by Mark Abraham almost 2 years ago

  • Subject changed from device buys error with CC 2.0 and 6.1 in the same run to device bus error with CC 2.0 and 6.1 in the same run

#2 Updated by Mark Abraham almost 2 years ago

  • Subject changed from device bus error with CC 2.0 and 6.1 in the same run to device busy error with CC 2.0 and 6.1 in the same run

#3 Updated by Mark Abraham almost 2 years ago

Does it work with release-2016?

#5 Updated by Aleksei Iupinov almost 2 years ago

Here, some cryptocurrency miners are hitting the same issue with Fermi + non-Fermi cards:
https://github.com/fireice-uk/xmr-stak/issues/319
A wild guess would be that the no-longer-supported Fermi devices run in some kind of exclusive mode in recent drivers.

#6 Updated by Szilárd Páll almost 2 years ago

Mark Abraham wrote:

Does it work with release-2016?

yes.

@Aleksei: I had a late-night déjà vu feeling, but somehow I failed to remember that we've seen this feature before.

I don't think we need to check for this, do we?

#7 Updated by Aleksei Iupinov almost 2 years ago

Well, we can check the CCs and the driver version and print a message like "You are likely suffering from recent NVIDIA drivers disallowing the combination of CC < 3.0 and >= 3.0 devices". That would be helpful to a confused user, but do such users exist at all?

#8 Updated by Szilárd Páll almost 2 years ago

Aleksei Iupinov wrote:

Well, we can check the CCs and the driver version and print a message like "You are likely suffering from recent NVIDIA drivers disallowing the combination of CC < 3.0 and >= 3.0 devices". That would be helpful to a confused user, but do such users exist at all?

Where would you check that? It would either need to be a check at task-assignment time, with some assumptions about driver versions, or a special interpretation of a device-busy error, which poses a minor risk that the device is actually in exclusive mode.
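
A small sketch of how the "actually in exclusive mode" case could be ruled out before interpreting a device-busy error this way (standalone illustration; computeMode comes straight from cudaDeviceProp, the helper name is hypothetical):

#include <cuda_runtime.h>

// Returns true if the device reports an exclusive or prohibited compute mode,
// i.e. a device-busy error from it would be expected and is not necessarily
// the Fermi-mixing problem discussed here.
static bool deviceIsRestricted(int deviceId)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, deviceId) != cudaSuccess)
    {
        return true; // be conservative if the device cannot even be queried
    }
    return prop.computeMode == cudaComputeModeExclusive
           || prop.computeMode == cudaComputeModeExclusiveProcess
           || prop.computeMode == cudaComputeModeProhibited;
}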

#9 Updated by Mark Abraham almost 2 years ago

If such a combination of devices is found after detection, we could try to launch a test kernel (or maybe the buffer-clearing ones) and provide the helpful message if we hit this particular failure.
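
A sketch of that idea with a dedicated no-op kernel rather than the actual buffer-clearing ones (names are illustrative, not existing GROMACS functions):

__global__ void sanityKernel() {}

// Launch a no-op kernel on the given device and report whether it ran.
// A cudaErrorDevicesUnavailable here, on a host that mixes CC < 3.0 and
// CC >= 3.0 devices, is the signature of the failure discussed above.
static cudaError_t tryTestKernel(int deviceId)
{
    cudaError_t stat = cudaSetDevice(deviceId);
    if (stat != cudaSuccess)
    {
        return stat;
    }
    sanityKernel<<<1, 1>>>();
    stat = cudaGetLastError();
    if (stat == cudaSuccess)
    {
        stat = cudaDeviceSynchronize();
    }
    return stat;
}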

#10 Updated by Mark Abraham almost 2 years ago

  • Affected version - extra info set to no problem apparent in 2016.x

#11 Updated by Mark Abraham almost 2 years ago

There has been some refactoring of the handling of the detection and compatibility data structures since 2016, but it's hard to imagine how that could produce these symptoms...

#12 Updated by Szilárd Páll almost 2 years ago

Mark Abraham wrote:

There has been some refactoring of the handling of the detection and compatibility data structures since 2016, but it's hard to imagine how that could produce these symptoms...

I'm sorry, I answered "yes" to a different question (whether 2016 is affected); to yours (whether it works in release-2016) the answer is "no".

This is not related to a code change, and as Aleksei pointed out, we have already seen this issue back in November, when the upgrade from the 375.2 driver broke the multi-GPU tests on bs_nix1310 (which is what has prevented us from testing CUDA 9).

Yes, we could test for this in multiple ways, but I'm not sure it's a very high-priority issue; there won't be many people who run hardware as ancient as Fermi in the same box with new stuff -- and for those cases we could e.g. add a simple check plus an additional message at the place where they currently run into the issue (in checkCompiledTargetCompatibility).
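
A sketch of what such an additional message might look like (the function name checkCompiledTargetCompatibility comes from the comment above; the helper, its signature, and the wording here are illustrative, not the actual GROMACS code):

#include <string>
#include <utility>
#include <vector>

// Hypothetical helper to call next to checkCompiledTargetCompatibility when a
// device-busy error is reported: if the run uses both a CC < 3.0 and a
// CC >= 3.0 device, append a hint about the driver restriction.
static std::string maybeFermiMixingHint(const std::vector<std::pair<int, int>>& computeCapabilities)
{
    bool haveFermi = false, haveNewer = false;
    for (const auto& cc : computeCapabilities)
    {
        if (cc.first < 3) { haveFermi = true; } else { haveNewer = true; }
    }
    if (haveFermi && haveNewer)
    {
        return "\nNote: recent NVIDIA drivers are known to refuse combining CC < 3.0 (Fermi) "
               "and CC >= 3.0 devices in the same process, which can show up as a "
               "'devices are busy or unavailable' error.";
    }
    return std::string();
}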

#14 Updated by Mark Abraham almost 2 years ago

  • Tracker changed from Bug to Task
  • Affected version - extra info deleted (no problem apparent in 2016.x)
  • Affected version deleted (2018)

There's no known code bug here, so no affected version

#15 Updated by Mark Abraham almost 2 years ago

Szilárd Páll wrote:

Mark Abraham wrote:

There has been some refactoring of the handling of the detection and compatibility data structures since 2016, but it's hard to imagine how that could produce these symptoms...

I'm sorry, I answered "yes" to a different question (whether 2016 is affected); to yours (whether it works in release-2016) the answer is "no".

OK

This is not related to a code change, and as Aleksei pointed out, we have already seen this issue back in November, when the upgrade from the 375.2 driver broke the multi-GPU tests on bs_nix1310 (which is what has prevented us from testing CUDA 9).

Yes, we could test for this in multiple ways, but I'm not sure it's a very high-priority issue; there won't be many people who run hardware as ancient as Fermi in the same box with new stuff -- and for those cases we could e.g. add a simple check plus an additional message at the place where they currently run into the issue (in checkCompiledTargetCompatibility).

The simplest thing to do is to stop supporting Fermi at all, which means we can just stop testing it. I don't think it's a good use of resources to maintain a build slave if all it will do is host a Fermi card.

#16 Updated by Mark Abraham almost 2 years ago

Mark Abraham wrote:

The simplest thing to do is to stop supporting Fermi at all, which means we can just stop testing it. I don't think it's a good use of resources to maintain a build slave if all it will do is host a Fermi card.

Though bs_mic already has a GeForce 210, if that helps. We could make it a "legacy stuff" slave, as it already has the KNC Phi card, in case we ever do something with that.

#17 Updated by Szilárd Páll almost 2 years ago

GPU #0 in bs_nix1310 is a GTX 480, a Fermi card with enough memory; the GT 210 has so little memory that it's not worth the risk of getting intermittent failures. As discussed a few months ago, the reason we're stuck with an old driver and can't use CUDA 9.0 on that slave is that we mix Fermi with newer cards, so I suggested moving that card into a "legacy" build machine.

Otherwise, we can't just stop testing Fermi even if we want to stop supporting it in the next release; at least until the EOL of 2018 it would be good to have at least some nightly builds that run on Fermi, to avoid regressions in release-2018. The role of legacy GPU test slave can be merged with other roles (e.g. some older SIMD flavor or something). If bs_mic can be used for that, that's great.

#18 Updated by Mark Abraham almost 2 years ago

Szilárd Páll wrote:

GPU #0 in bs_nix1310 is a GTX 480, a Fermi card with enough memory; the GT 210 has so little memory that it's not worth the risk of getting intermittent failures. As discussed a few months ago, the reason we're stuck with an old driver and can't use CUDA 9.0 on that slave is that we mix Fermi with newer cards, so I suggested moving that card into a "legacy" build machine.

Otherwise, we can't just stop testing Fermi even if we want to stop supporting it in the next release; at least until the EOL of 2018 it would be good to have at least some nightly builds that run on Fermi, to avoid regressions in release-2018. The role of legacy GPU test slave can be merged with other roles (e.g. some older SIMD flavor or something). If bs_mic can be used for that, that's great.

It'd be nice to do that, but someone has to want to set such a slave up and manage the transition period (bs_mic does several other tasks, and some would have to move elsewhere). Any volunteers?

We also "support" CUDA and OpenCL on Mac and Windows too, but make no attempt to test it in Jenkins, even though we have a Windows slave that could perhaps have a card put in it. We do that because we don't think it's important enough to do the work to maintain it. IMO for the remaining lifetime of 2016 and 2018 release, that's a reasonable decision for Fermi also. That we used to test Fermi on Linux in Jenkins is an accident of history, rather than a binding promise. I vote we take the Fermi card out of the build slave, and give it to a museum.

Note that running Fermi verification only in nightly builds doesn't change the need to do the up-front work to set things up.

#19 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2408.
Uploader: Mark Abraham ()
Change-Id: gromacs~master~Ia63cb701049a02ad86fdd9f9c03388cd11a14bcc
Gerrit URL: https://gerrit.gromacs.org/7689

#20 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2408.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2018~I3b5d9c8702c021f0676ea3ca12f9e7f1d3317ed1
Gerrit URL: https://gerrit.gromacs.org/7699

#21 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2408.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2016~Ie253ef07998cf45c5fc2e118a5394f0817e66f75
Gerrit URL: https://gerrit.gromacs.org/7700

#22 Updated by Mark Abraham about 1 year ago

  • Assignee set to Mark Abraham
  • Target version set to 2019

This is being resolved because CC 2.0 is no longer supported

#23 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2408.
Uploader: Mark Abraham ()
Change-Id: gromacs~master~Ia7a00e5d6a97f93cd2768beb7ad56b2cce628a6f
Gerrit URL: https://gerrit.gromacs.org/8547

#24 Updated by Mark Abraham about 1 year ago

  • Status changed from New to Resolved

#25 Updated by Erik Lindahl about 1 year ago

  • Status changed from Resolved to Closed
