Bug #2624

GPU build system not robust enough

Added by Mark Abraham about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

When re-using an old CMake cache on a newly installed repo on the same machine, I hit a situation that could also arise from, e.g., failing to load a module: GPU detection has already been run, and CUDA can no longer be found. The over-use of GMX_GPU_DETECTION_DONE exposes a bug in FindNVML, but the former is the real problem. To reproduce on a normal machine:

$ cmake .. -DCUDA_TOOLKIT_ROOT_DIR=garbage -DGMX_GPU_DETECTION_DONE=on -DGMX_GPU=on
-- The C compiler identification is GNU 7.3.0
-- The CXX compiler identification is GNU 7.3.0
-- Check for working C compiler: /etc/alternatives/cc
-- Check for working C compiler: /etc/alternatives/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /etc/alternatives/c++
-- Check for working CXX compiler: /etc/alternatives/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Setting up ccache wrapper for GNU C compiler /etc/alternatives/cc
-- Setting up ccache wrapper for GNU CXX compiler /etc/alternatives/c++
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Performing Test CXXFLAG_STD_CXX0X
-- Performing Test CXXFLAG_STD_CXX0X - Success
-- Performing Test CXX11_SUPPORTED
-- Performing Test CXX11_SUPPORTED - Success
-- Performing Test CXX11_STDLIB_PRESENT
-- Performing Test CXX11_STDLIB_PRESENT - Success
-- Could NOT find CUDA (missing: CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "7.0")
CMake Error at cmake/FindNVML.cmake:103 (if):
  if given arguments:

    "VERSION_LESS" "8.0" 

  Unknown arguments specified
Call Stack (most recent call first):
  cmake/gmxManageGPU.cmake:147 (find_package)
  CMakeLists.txt:227 (include)

-- Configuring incomplete, errors occurred!

Because GMX_GPU_DETECTION_DONE is true, the logic does not respond properly when CUDA cannot be found. FindNVML assumes that CUDA has been found; it should fail more gracefully when that is not the case.

History

#1 Updated by Mark Abraham about 1 year ago

Leaving GMX_GPU unset is OK; the error does not occur in that case.

#2 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related DRAFT patchset '1' for Issue #2624.
Uploader: Mark Abraham
Change-Id: gromacs~master~Ia6310510fbd325d8f9cc6a4fc5c4ba08c203fe52
Gerrit URL: https://gerrit.gromacs.org/8221

#3 Updated by Szilárd Páll about 1 year ago

Mark Abraham wrote:

When re-using an old cache on a newly installed repo on the same machine, I encountered a situation that you could also get by e.g. failing to load a module

It would be good to detect when the CUDA install vanishes, but note that the example you give does not apply. Modules don't disappear, and once CUDA is detected, the installation remains the same unless the path changes. Moreover, in my experience this can't happen either: I never load modules other than when running cmake the first time, I reuse build trees most of the time, and I never have such issues.

#4 Updated by Mark Abraham about 1 year ago

Szilárd Páll wrote:

Mark Abraham wrote:

When re-using an old cache on a newly installed repo on the same machine, I encountered a situation that you could also get by e.g. failing to load a module

It would be good to detect when the CUDA install vanishes, but note that the example you give does not apply. Modules don't disappear, and once CUDA is detected, the installation remains the same unless the path changes. Moreover, in my experience this can't happen either: I never load modules other than when running cmake the first time, I reuse build trees most of the time, and I never have such issues.

OK, but the point is that when a CMake cache points at a CUDA installation that no longer exists, the build system shouldn't report that CUDA isn't found and then continue anyway as if it had been found.
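
A cache in this state can be spotted before reconfiguring. The following is a minimal shell sketch, not part of the GROMACS build system: it reads the CUDA_TOOLKIT_ROOT_DIR entry from a CMakeCache.txt and checks whether that directory still exists. The helper names cuda_root_from_cache and check_cuda_cache are hypothetical, and the sketch assumes the cache stores the entry in the usual NAME:TYPE=VALUE form.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of GROMACS or CMake.

# Print the value recorded for CUDA_TOOLKIT_ROOT_DIR in the given cache file.
# Cache entries have the form NAME:TYPE=VALUE; the type is not inspected.
cuda_root_from_cache() {
    sed -n 's/^CUDA_TOOLKIT_ROOT_DIR:[A-Z]*=//p' "$1" | head -n 1
}

# Return 0 if the cached toolkit directory exists or no entry is cached,
# and 1 if the cache names a directory that is not present.
check_cuda_cache() {
    cache_file=$1
    root=$(cuda_root_from_cache "$cache_file")
    if [ -z "$root" ]; then
        echo "no CUDA_TOOLKIT_ROOT_DIR entry in $cache_file"
        return 0
    fi
    if [ -d "$root" ]; then
        echo "CUDA toolkit directory exists: $root"
        return 0
    fi
    echo "stale cache: CUDA toolkit directory missing: $root"
    return 1
}
```

For example, after sourcing these functions, running check_cuda_cache CMakeCache.txt in the build directory returns a non-zero status when the cached path is gone, which suggests correcting the path or regenerating the cache before re-running cmake.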
