Bug #1970

GPU clock resetting at cleanup can interfere with other processes

Added by Szilárd Páll almost 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Low
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The NVML-based clock adjustment implementation resets the clocks at cleanup. This can interfere with other jobs that share a GPU with mdrun. It should not be very common that multiple processes share a node and one or more GPUs while the admins have also set up the NVIDIA driver to allow user-adjustable application clocks. Still, it can be irritating if a user who manages the machine themselves sets the application clocks manually while mdrun is running, e.g. for another process that is triggered after mdrun finishes, only to have the clocks reset by mdrun before it exits.

This is categorized under the "bug" tracker given that this is a possibly undesirable side-effect of the application-clock support. However, ATM I don't see much that we can do about it other than (possibly) improving user-facing notes and documentation.


Related issues

Related to GROMACS - Bug #1846: NVML management resets GPU application clocks rather than restoring (Closed)

History

#1 Updated by Szilárd Páll almost 4 years ago

  • Related to Bug #1846: NVML management resets GPU application clocks rather than restoring added

#2 Updated by Erik Lindahl almost 4 years ago

Is there really any reason to reset it? The card should still enter some power-saving mode if we don't use it, right?
Then I would prefer to simply leave it in the high performance mode, since users who get high-end cards for running simulations are unlikely to complain about the card being too fast but drawing a bit more power...

#3 Updated by Erik Lindahl almost 4 years ago

PS: I would suggest that we limit the "bug" tracker to things that are actually incorrect in our code. If we start to expand it to cover all effects and side-effects of running GROMACS it will cover the entire "feature" category too.

#4 Updated by Szilárd Páll almost 4 years ago

Erik Lindahl wrote:

Is there really any reason to reset it? The card should still enter some power-saving mode if we don't use it, right?
Then I would prefer to simply leave it in the high performance mode, since users who get high-end cards for running simulations are unlikely to complain about the card being too fast but drawing a bit more power...

The highest application clock is optimal for some codes, but not desirable for others. With the application clocks set to the maximum, when running a code that pushes the card to its power or temperature limit (e.g. BLAS3-based code), the GPU will throttle, clocking up and down, which will cause performance jitter (and e.g. play dirty with a load balancer). When idle, the card will of course still clock down.

As I see it, the issue is that mdrun permanently alters the maximum clock of the card to a value different from the one at startup, which may be perceived as rogue behavior. Hence, I still feel that the best solution is to revert the application clock to the value detected at mdrun startup and to communicate this to the user; not reverting (nor resetting) at all is also consistent, although it seems like somewhat rogue behavior. In any case, resetting seems less desirable as, to the best of my knowledge (but I should double-check), it will set the clocks back to the base values even if that was not the state the card was in at mdrun startup.
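To make the three cleanup options discussed above concrete, here is a minimal sketch in Python. It is a toy model, not NVML code and not the mdrun implementation: `FakeGpu`, `run_mdrun`, and the clock values are hypothetical names and illustrative numbers. In real code the three operations would, as far as I know, correspond to the NVML calls `nvmlDeviceGetApplicationsClock`, `nvmlDeviceSetApplicationsClocks`, and `nvmlDeviceResetApplicationsClocks`.

```python
from dataclasses import dataclass


@dataclass
class FakeGpu:
    """Toy stand-in for a GPU's application-clock state (hypothetical, not NVML)."""
    default_clocks: tuple  # (memory MHz, graphics MHz) base values of the card
    clocks: tuple          # currently configured application clocks

    def set_clocks(self, clocks):
        self.clocks = clocks

    def reset_clocks(self):
        # A reset goes back to the base values, not to what was set before.
        self.clocks = self.default_clocks


def run_mdrun(gpu, boosted, cleanup):
    """Model one mdrun run with the given cleanup policy; return the final clocks."""
    at_startup = gpu.clocks      # state the card was found in
    gpu.set_clocks(boosted)      # mdrun raises the application clocks
    # ... the simulation would run here ...
    if cleanup == "reset":
        gpu.reset_clocks()
    elif cleanup == "restore":
        gpu.set_clocks(at_startup)
    elif cleanup != "leave":
        raise ValueError(f"unknown cleanup policy: {cleanup}")
    return gpu.clocks


# Illustrative numbers only: the user had set 810 MHz before mdrun started.
for policy in ("reset", "restore", "leave"):
    gpu = FakeGpu(default_clocks=(3004, 745), clocks=(3004, 810))
    print(policy, run_mdrun(gpu, boosted=(3004, 875), cleanup=policy))
```

Under these assumptions, only "restore" leaves the card at the user's 810 MHz setting; "reset" drops it to the base 745 MHz and "leave" keeps mdrun's 875 MHz.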

#5 Updated by Szilárd Páll almost 4 years ago

Erik Lindahl wrote:

PS: I would suggest that we limit the "bug" tracker to things that are actually incorrect in our code. If we start to expand it to cover all effects and side-effects of running GROMACS it will cover the entire "feature" category too.

Here's a good summary of what software errors, faults/defects (often called bugs), and failures are, following the IEEE definitions:
https://programmers.stackexchange.com/questions/37029/difference-between-a-defect-and-a-bug-in-testing/37044#37044

Most importantly, the code in question neither restores the hardware/driver configuration to its original state nor warns the user about this, so running mdrun has an unintended side-effect. Hence, this clearly falls in the "defect" category.

As noted by many, the term "defect" is preferred to "bug" (a term which is thought to trivialize the impact faults have on software quality), as it acknowledges a mistake, oversight, or lack of understanding during implementation.

#6 Updated by Szilárd Páll almost 4 years ago

I thought through the usability aspects a bit more. If the user has permission to alter application clocks, and a note informing them that the application clocks will be left altered is considered enough, we can simply remove the reset call. That is a smaller change than the few tens of lines of code needed to restore, which is itself an imperfect solution (though somewhat nicer behavior).

#7 Updated by Mark Abraham almost 4 years ago

I'm happy with that solution. The better solution might be easier to implement once we update some data structures, but I think it has extremely low priority.

#8 Updated by Erik Lindahl almost 4 years ago

Let's move any remaining discussion to #1846 where we actually talk about how we handle it inside GROMACS.

The topic of this particular bug report can never be solved by GROMACS since we can never control (or even talk to) other processes. I'll close it in a bit.

#9 Updated by Erik Lindahl almost 4 years ago

  • Status changed from New to Closed

#10 Updated by Szilárd Páll almost 4 years ago

Mark Abraham wrote:

I'm happy with that solution. The better solution might be easier to implement once we update some data structures, but I think it has extremely low priority.

I'm not sure it's hard to implement; it's just two pairs of values queried and a few simple conditions. But I'm fine with offloading the burden to the user. I am, however, pretty sure that most users will miss the message. Whether the end result is a problem for them or not is a separate question, though.
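As a rough illustration of what "two pairs of values and a few simple conditions" could look like, here is a sketch in Python. The function name `clocks_to_restore` and the exact conditions are assumptions for illustration, not the logic actually used in GROMACS: each pair is (memory MHz, graphics MHz), queried once at startup and once at cleanup, and compared against what mdrun itself set.

```python
def clocks_to_restore(at_startup, set_by_mdrun, at_cleanup):
    """Return the clock pair to restore at cleanup, or None to leave the GPU alone.

    Hypothetical decision logic; each argument is a (memory MHz, graphics MHz)
    pair, and set_by_mdrun is None if mdrun never changed the clocks.
    """
    if set_by_mdrun is None:
        return None  # mdrun did not touch the clocks, nothing to undo
    if at_cleanup != set_by_mdrun:
        return None  # someone else changed the clocks while mdrun was running
    if at_startup == set_by_mdrun:
        return None  # the clocks were already at the value mdrun wanted
    return at_startup  # safe to put back what was found at startup
```

With these conditions, the case from the description (a user changing the clocks manually while mdrun is running) hits the second check, so mdrun would leave the user's setting untouched.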

#11 Updated by Szilárd Páll almost 4 years ago

Erik Lindahl wrote:

The topic of this particular bug report can never be solved by GROMACS since we can never control (or even talk to) other processes. I'll close it in a bit.

I disagree. From the title: "resetting at cleanup can interfere with other processes". mdrun is resetting, which can interfere with the outside world. Claiming there is nothing to do about it is like saying that if we create a $HOME/mytempfile file simply because it sounded like a good idea during coding, that's not a bug even if it overwrites some other program's file, and that we can't do much about it because the user/that other program was stupid to put such a file there.

I'm OK with moving to #1846, but note that that issue is about trying to intelligently restore rather than the issue of resetting. So IMO this should be closed when the issue at hand is addressed (or rejected).
