Bug #1018

cmake and gpu acceleration

Added by Mark Abraham over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Category: build system
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Currently CMake assumes GPUs are available and raises a fatal error unless -DGMX_GPU=OFF is used:

[vayu3 r46 (release-4-6)] $ (cd build-cmake; cmake ..)
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "3.2")
CMake Error at CMakeLists.txt:536 (message):

mdrun supports native GPU acceleration on NVIDIA hardware with compute
capability >=2.0. This requires the NVIDIA CUDA library, which was not
found; the location can be hinted by setting CUDA_TOOLKIT_ROOT_DIR.
CPU or GPU acceleration can be selected at runtime, but if you are
sure you can not make use of GPU acceleration, disable it by setting
the CMake variable GMX_GPU=OFF.

-- Configuring incomplete, errors occurred!

I don't think this is acceptable. I grumbled about it on gmx-core back in May or so, and got the impression then that it was a temporary convenience while developing the GPU code. If so, I think that's inferior practice, because now that convenience is a feature that inconveniences what I think are mainstream installations, and I have to make a case to change it :-) IMO scripting CMake to set -DGMX_GPU=ON would have been better.

Anyway, that case is that existing GROMACS practice has been to detect attributes and enable features accordingly (e.g. threading, kernels), or to allow the user to choose features to enable to suit the environment (e.g. real MPI). AFAIK it will be routine for both HPC and desktop installations for some years to come to be unable to make effective use of GPUs, even if they are present and CUDA is present. Thus, I think it is important that a default installation does not require the user to specify that they have a normal non-GPU execution environment.

Either we need to not enable GMX_GPU by default, or provide better auto-detection, or gracefully revert to non-GPU installation. Ideally, we can provide enough auto-detection to flag the user that maybe they want to use GPU code if they have not required it.

Associated revisions

Revision ef1127f1 (diff)
Added by Szilárd Páll over 6 years ago

automation for setting GMX_GPU & cmake GPU detection

Implemented detection of NVIDIA GPUs in CMake using:
- output of nvidia-smi (if available Linux/Mac/Win);
- presence and content of /proc/driver/nvidia/gpus/*/information
(Linux)
- output of lspci (Linux)
Although the current implementation is not able to decide whether a GPU
is compatible with GROMACS, the build system is now able to hint the
user that there are potentially useful GPU compute resources.
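The three detection methods listed above could be sketched in CMake roughly as follows. This is a minimal illustration, not the code from the commit: GMX_DETECTED_GPUS and the other variable names are hypothetical, the "Model" line in the /proc file and the lspci match pattern are assumptions about typical driver and lspci output, and the sketch only hints that NVIDIA hardware may be present without checking compatibility.

```cmake
# Sketch only; GMX_DETECTED_GPUS and the underscore-prefixed names are hypothetical.
set(GMX_DETECTED_GPUS "")

# 1. nvidia-smi -L prints one line per GPU when the driver tools are installed.
find_program(NVIDIA_SMI_EXECUTABLE nvidia-smi)
mark_as_advanced(NVIDIA_SMI_EXECUTABLE)
if(NVIDIA_SMI_EXECUTABLE)
    execute_process(COMMAND ${NVIDIA_SMI_EXECUTABLE} -L
                    RESULT_VARIABLE _smi_status
                    OUTPUT_VARIABLE _smi_output
                    ERROR_QUIET
                    OUTPUT_STRIP_TRAILING_WHITESPACE)
    if(_smi_status EQUAL 0)
        set(GMX_DETECTED_GPUS "${_smi_output}")
    endif()
endif()

# 2. Linux only: the NVIDIA kernel driver exposes one information file per GPU.
#    Assumption: the file contains a line starting with "Model".
if(NOT GMX_DETECTED_GPUS)
    file(GLOB _proc_files "/proc/driver/nvidia/gpus/*/information")
    foreach(_file ${_proc_files})
        file(STRINGS "${_file}" _model REGEX "^Model")
        set(GMX_DETECTED_GPUS "${GMX_DETECTED_GPUS}${_model}\n")
    endforeach()
endif()

# 3. Linux only: fall back to scanning lspci for NVIDIA display devices.
if(NOT GMX_DETECTED_GPUS)
    find_program(LSPCI_EXECUTABLE lspci)
    mark_as_advanced(LSPCI_EXECUTABLE)
    if(LSPCI_EXECUTABLE)
        execute_process(COMMAND ${LSPCI_EXECUTABLE}
                        OUTPUT_VARIABLE _lspci_output
                        ERROR_QUIET)
        string(TOLOWER "${_lspci_output}" _lspci_lower)
        # Keep whole lines that mention a VGA or 3D controller and NVIDIA.
        string(REGEX MATCHALL "[^\n]*(vga|3d)[^\n]*nvidia[^\n]*"
               _lspci_gpus "${_lspci_lower}")
        foreach(_line ${_lspci_gpus})
            set(GMX_DETECTED_GPUS "${GMX_DETECTED_GPUS}${_line}\n")
        endforeach()
    endif()
endif()

if(GMX_DETECTED_GPUS)
    message(STATUS "Possible NVIDIA GPU(s) detected:\n${GMX_DETECTED_GPUS}")
else()
    message(STATUS "No NVIDIA GPU detected")
endif()
```

The methods are tried in order of decreasing reliability, and each later one runs only if the earlier ones found nothing.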

Additionally, if GMX_GPU is not set explicitly by the user, its value
is considered to be an implicit "auto" with OFF as default. In this
case CUDA detection will be attempted and if successful, GMX_GPU is
set to "ON", otherwise kept "OFF".
If CUDA is not found and the user requested GPU acceleration, an
immediate fatal error is issued. If the user did not set GMX_GPU, a
non-fatal warning is issued at the end of the configuration with
additional information if GPUs were detected.
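The behaviour described in the two paragraphs above could be sketched in CMake roughly as follows. This is a minimal illustration, not the code from the commit: GMX_GPU_AUTO and GMX_GPU_DEFERRED_WARNING are hypothetical names, the message texts are invented, and the sketch assumes the FindCUDA module (deprecated in newer CMake releases) that produced the log in the description.

```cmake
# Sketch only; GMX_GPU_AUTO and GMX_GPU_DEFERRED_WARNING are hypothetical names.

# On the first configure, remember whether the user set GMX_GPU explicitly.
if(NOT DEFINED GMX_GPU_AUTO)
    if(DEFINED GMX_GPU)
        set(GMX_GPU_AUTO FALSE CACHE INTERNAL "GMX_GPU was set by the user")
    else()
        set(GMX_GPU_AUTO TRUE CACHE INTERNAL "GMX_GPU was left unset (implicit auto)")
    endif()
endif()
option(GMX_GPU "Enable GPU acceleration" OFF)

# Look for CUDA if it was requested or may be enabled automatically.
# QUIET, because the outcome is reported explicitly below.
if(GMX_GPU OR GMX_GPU_AUTO)
    find_package(CUDA 3.2 QUIET)
endif()

if(GMX_GPU_AUTO)
    if(CUDA_FOUND)
        # Implicit "auto": CUDA is available, so switch GPU support on.
        set(GMX_GPU ON CACHE BOOL "Enable GPU acceleration" FORCE)
    else()
        # Implicit "auto": stay OFF and save a note for the end of the run.
        set(GMX_GPU OFF CACHE BOOL "Enable GPU acceleration" FORCE)
        set(GMX_GPU_DEFERRED_WARNING
            "CUDA was not found, so GPU acceleration is disabled. Set CUDA_TOOLKIT_ROOT_DIR and GMX_GPU=ON to enable it.")
    endif()
elseif(GMX_GPU AND NOT CUDA_FOUND)
    # The user asked for GPU support explicitly: fail immediately.
    message(FATAL_ERROR
        "GMX_GPU=ON was requested, but CUDA was not found. Set CUDA_TOOLKIT_ROOT_DIR or use -DGMX_GPU=OFF.")
endif()

# ... the rest of the configuration would go here ...

# Placed at the very end so the note is not buried in earlier output.
if(DEFINED GMX_GPU_DEFERRED_WARNING)
    message(WARNING "${GMX_GPU_DEFERRED_WARNING}")
endif()
```

With this sketch, a plain "cmake .." on a machine without CUDA would configure with GMX_GPU=OFF and end with a warning, while "cmake .. -DGMX_GPU=ON" would stop with a fatal error. One limitation is that the auto state is recorded only on the first configure, so the sketch does not notice a user changing GMX_GPU later in an existing build tree.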

Fixes #1018

Change-Id: Iffa9ed343bed4278cffba5e2eb9f8b81f590b31d

History

#1 Updated by Mark Abraham over 6 years ago

Further, when the user does set -DGMX_GPU=OFF, using interactive ccmake then leads to

 BUILD_SHARED_LIBS                ON
 CMAKE_BUILD_TYPE                 Release
 CMAKE_INSTALL_PREFIX             /home/224/mxa224/progs
 CUDA_BUILD_CUBIN                 OFF
 CUDA_BUILD_EMULATION             OFF
 CUDA_SDK_ROOT_DIR                CUDA_SDK_ROOT_DIR-NOTFOUND
 CUDA_TOOLKIT_ROOT_DIR            CUDA_TOOLKIT_ROOT_DIR-NOTFOUND
 CUDA_VERBOSE_BUILD               OFF
 GMX_ACCELERATION                 SSE4.1
 GMX_DEFAULT_SUFFIX               ON
 GMX_DOUBLE                       OFF
 GMX_FFT_LIBRARY                  mkl
 GMX_GPU                          OFF
 GMX_GSL                          OFF
 GMX_MPI                          ON
 GMX_OPENMM                       OFF
 GMX_OPENMP                       ON
 GMX_QMMM_PROGRAM                 none
 GMX_THREAD_MPI                   OFF
 GMX_X11                          OFF
 GMX_XML                          ON

The CUDA variables are present in the non-advanced namespace. They should be advanced.

#2 Updated by Szilárd Páll over 6 years ago

I admit that it's annoying to have the CUDA toolkit as a required dependency unless GMX_GPU is turned off manually. However, we want to accelerate the adoption of GPU support, and IMO that won't happen without us:
  • either proactively telling (i.e. bugging) the user about the benefit of GPU acceleration (which is 2-4x so it's not negligible) or
  • reliably detecting the presence of a GPU without the CUDA toolkit (both for build- and run-time detection).

Implementing the latter is possible by using the CUDA driver API and dynamically loading libcuda.so (cuda.dll) at runtime to do the checks. In principle this is not a very difficult task, but it does require quite some code and a lot of testing to make sure that tons of OS + driver API version combinations are covered. I just simply have not had the time to do this.

If somebody could help out with such code, even if it only works on Linux, I'd be happy to not require the CUDA toolkit by default. Otherwise, I prefer to annoy the users a bit (but at least force them to read the message) -- and bear with the fscks from some developers.

#3 Updated by Szilárd Páll over 6 years ago

PS: We could discuss this topic on the core or dev list again, and if I get severely down-voted, then we'll just have to make the message 50 lines long with big ASCII-art borders so that people actually read it.

PS+: I've just checked and among all Apple machines only the MacBookPro9,1 and MacBookPro10,1 have supported GPUs, so we can set GMX_GPU=OFF on Mac OS.

#4 Updated by Szilárd Páll over 6 years ago

Mark Abraham wrote:

Further, when the user does set -DGMX_GPU=OFF, using interactive ccmake then leads to

[...]

The CUDA variables are present in the non-advanced namespace. They should be advanced.

The problem is that FindCUDA.cmake leaves behind those variables after failing to find CUDA and if you then re-run with GMX_GPU=OFF (without cleaning your build tree) you end up with those non-advanced variables in your cache.

Strictly speaking, only configure-s in a clean build tree are fully supported. I could add a few lines of code that mark these variables advanced, but I'm not entirely sure that it should be our responsibility to clean up after a half-baked/failed configure.
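The "few lines of code" mentioned above might look roughly like the following. This is a sketch, not code from the GROMACS build system; the variable list is simply the set of CUDA entries visible in the ccmake listing from comment #1, and other FindCUDA versions may leave different entries behind.

```cmake
# Sketch only: hide the cache entries that FindCUDA left behind
# when GPU support is switched off.
if(NOT GMX_GPU)
    foreach(_cuda_var CUDA_BUILD_CUBIN CUDA_BUILD_EMULATION CUDA_SDK_ROOT_DIR
                      CUDA_TOOLKIT_ROOT_DIR CUDA_VERBOSE_BUILD)
        # Only touch entries that actually exist.
        if(DEFINED ${_cuda_var})
            mark_as_advanced(FORCE ${_cuda_var})
        endif()
    endforeach()
endif()
```

The FORCE keyword marks the entries as advanced even if they were previously non-advanced, so they would disappear from the default ccmake view without the build tree being cleaned.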

#5 Updated by Szilárd Páll over 6 years ago

  • Status changed from New to Feedback wanted

I'd love to reject this as IMO it's not a bug, but that would be quite uncool, so I'll set the status to Feedback.

#6 Updated by Roland Schulz over 6 years ago

How about automatically disabling GMX_GPU if CUDA is not found and no NVIDIA GPU is present (e.g. on Linux, if lspci | grep "3D controller: nVidia" doesn't find anything)? That might be a good compromise.

#7 Updated by Teemu Murtola over 6 years ago

Why not just move the warning to the end of the cmake run such that it is more easily found (see my message on gmx-core)?

Also, I think that many users who build Gromacs themselves for use on their personal machines are not interested in peak performance, but instead in getting easy access to analysis tools and such. I think that the ability to build Gromacs easily would be more important for them than getting fatal errors on possibly degraded mdrun performance. I also think there is a big difference between a fatal build error and a well-phrased non-fatal warning when it comes to perceived quality. This particular error has a clear suggestion for a solution, though.

#8 Updated by Szilárd Páll over 6 years ago

Roland Schulz wrote:

How about automatically disabling GMX_GPU if CUDA is not found and no NVIDIA GPU is present (e.g. on Linux, if lspci | grep "3D controller: nVidia" doesn't find anything)? That might be a good compromise.

I've thought about this -- and also considered other options like checking (on Linux) if /dev/nvidia* dev entries are present or trying to find and use the nvidia-smi tool that comes with the driver, but I think none of them is robust enough and could potentially create issues (e.g. security).

Although the proposed solution would issue a valid warning in some cases, it has two drawbacks:
  • it would only work at build time, as we should probably not execute processes at run time due to the security implications;
  • it is not able to detect the type of the GPU (without having a list of PCI IDs hard-coded in mdrun) and as we support only compute capability >=2.0, we would be issuing false positive warnings/errors.

So I would still prefer a solution that checks and loads the CUDA driver to determine whether there is a supported GPU in the system. However, if some lspci/libpci-based solution could be implemented in a secure and somewhat robust fashion, we could consider using it as an intermediate solution (and also as a fallback when we also implement the CUDA driver-based check).

#9 Updated by Szilárd Páll over 6 years ago

Also, I think that many users who build Gromacs themselves for use on their personal machines are not interested in peak performance, but instead in getting easy access to analysis tools and such. I think that the ability to build Gromacs easily would be more important for them than getting fatal errors on possibly degraded mdrun performance. I also think there is a big difference between a fatal build error and a well-phrased non-fatal warning when it comes to perceived quality. This particular error has a clear suggestion for a solution, though.

There is an important difference between other performance-related settings like OpenMP or AVX, which provide a minor performance improvement, and the potential 2-4x speedup possible with simple desktop GPUs. Such an improvement can make the difference between running an equilibration on a cluster and just doing it on a desktop machine over a lunch break. This is the very reason why I am a bit reluctant to remove the message that requires action from the user.

#10 Updated by Roland Schulz over 6 years ago

Szilárd Páll wrote:

Roland Schulz wrote:

How about automatically disabling GMX_GPU if CUDA is not found and no NVIDIA GPU is present (e.g. on Linux, if lspci | grep "3D controller: nVidia" doesn't find anything)? That might be a good compromise.

I've thought about this -- and also considered other options like checking (on Linux) if /dev/nvidia* dev entries are present or trying to find and use the nvidia-smi tool that comes with the driver, but I think none of them is robust enough and could potentially create issues (e.g. security).

Although the proposed solution would issue a valid warning in some cases, it has two drawbacks:
  • it would only work at build time, as we should probably not execute processes at run time due to the security implications;

Sure. It isn't perfect. But I doubt we'll find a perfect solution with reasonable effort.

  • it is not able to detect the type of the GPU (without having a list of PCI IDs hard-coded in mdrun) and as we support only compute capability >=2.0, we would be issuing false positive warnings/errors.

Yes. But MUCH less than now. That's why I'm saying it is a compromise. And if we are concerned about false-positive errors, the only option is to get rid of the error and always make it a warning. Which I'm fine with. But that seems to be what you don't want to do.

From ML:

...and neither the lack of OpenMP nor the lack of AVX support will cause
anywhere near the performance loss as the lack of CUDA (and anywhere near
the potentially low adoption rate of the new feature).

But people use notebooks, or RF, or only do analysis, or have AMD GPUs, or have no GPUs. All these people don't benefit. All these people get an error that is not an error (all false positives). And given that you want more than is typical (compared to AVX/OpenMP), it doesn't seem fair that you want someone else to contribute the code and otherwise make something a fatal error which simply isn't an error.

#11 Updated by Szilárd Páll over 6 years ago

  • Status changed from Feedback wanted to In Progress
  • Assignee changed from Berk Hess to Szilárd Páll

#12 Updated by Teemu Murtola over 6 years ago

  • Status changed from In Progress to Closed

Looks like the linked commit does more or less what is asked for.
