Bug #2311

NVML compilation issues

Added by Erik Lindahl about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
simple

Description

On a few of our hosts the NVML libraries are not found at link time. It seems we need a better CMake test before we enable it.

CMakeCache.txt (68.1 KB) - Jochen Hub, 12/01/2017 12:36 PM

Related issues

Related to GROMACS - Bug #2061: fix FindNVML for CUDA 8.0 (Closed)

Associated revisions

Revision 969fb1d0 (diff)
Added by Szilárd Páll about 2 years ago

Disable default-on NVML support in CMake

Due to the problems related to NVML builds failing in link stage when
linking against stub libs, we disable NVML by default to protect users
from a hard to diagnose bug.

Refs #2311

Change-Id: Id083254bc4344fbb3a91e7dd645a5f814163d043

Revision 3c71c117 (diff)
Added by Mark Abraham about 2 years ago

Leave NVML use off by default

Even if NVML is found, leave the default off because the
linking is unreliable for reasons that are currently unclear,
and only in some cases is linking with NVML advantageous.

Fixes #2311

Change-Id: I03e833964995f88350bf6cb70c06f1e3f67bb865

History

#1 Updated by Berk Hess about 2 years ago

Here is Jochen's report:

[100%] Linking CXX executable ../../bin/template
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetPciInfo_v3'

The CUDA/NVIDIA-related cmake output is:

-- Looking for NVIDIA GPUs present in the system
-- Number of NVIDIA GPUs detected: 2
-- Found CUDA: /cm/shared/apps/cuda90/toolkit/9.0.176 (found suitable version "9.0", minimum required is "6.5")
-- Found NVML: /cm/shared/apps/cuda90/toolkit/9.0.176/lib64/stubs/libnvidia-ml.so
-- Enabling single compilation unit for the CUDA non-bonded module. Multiple compilation units are not compatible with CC 2.x devices, to enable the feature specify only CC >=3.0 target architectures in GMX_CUDA_TARGET_SM/GMX_CUDA_TARGET_COMPUTE.

#2 Updated by Stefan Fleischmann about 2 years ago

I think you can reproduce that on any machine that doesn't have the NVIDIA driver installed, or possibly on a machine with an installed driver version older than the minimum requirement for CUDA 9.0.

In any case, cmake picks up lib64/stubs/libnvidia-ml.so from the CUDA installation directory. But when linking, the library from /usr/ (installed by the NVIDIA driver package) is used if present. I always thought the point of lib64/stubs/libnvidia-ml.so in the CUDA directory was to make it possible to compile on machines that don't have a driver installed. But if that doesn't work, what's the point of having that "stub" library?

#3 Updated by Jochen Hub about 2 years ago

Gromacs 2018 beta1,

on Scientific Linux, 12-core Xeon, GTX 1080, with different gcc compilers, linking stops with:

[100%] Linking CXX executable ../../bin/gmx
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetPciInfo_v3'
collect2: error: ld returned 1 exit status

The cmake call was:

cmake /usr/users/cmb/shared/src/gmx/gromacs-2018-beta1 -DCMAKE_PREFIX_PATH=/usr/users/cmb/shared/opt/fftw/fftw335/broadwell/build-gcc-4.85-avx -DCMAKE_INSTALL_PREFIX=/usr/users/cmb/shared/opt/gromacs/broadwell/gromacs-2018-beta1 -DGMX_X11=OFF -DCMAKE_C_COMPILER=$CC -DCMAKE_CXX_COMPILER=$CXX -DGMX_GPU=on -Wno-dev -DGMX_MPI=OFF -DGMX_DOUBLE=OFF

I have attached the CMakeCache.txt.

#4 Updated by Aleksei Iupinov about 2 years ago

Thanks Stefan, I guess that's why I couldn't reproduce it on my machine.

Here is someone struggling with the same issue: https://gitlab.kitware.com/cmake/cmake/issues/17175
Apparently NVML is special that way.

Jochen, what is your NVIDIA driver version? If our suspicion is correct, it might simply mean that your driver is too old to support CUDA 9.
That doesn't spare us from trying to work around this build system annoyance, though, at the very least by displaying a warning or turning NVML off.

#5 Updated by Jochen Hub about 2 years ago

Update:

With -DGMX_USE_NVML=OFF, Gromacs 2018-beta1 compiles fine with Cuda 9 and gcc 6.3, but mdrun does not find the GPUs any more:

NOTE: Error occurred during GPU detection:
CUDA driver version is insufficient for CUDA runtime version
Can not use GPU acceleration, will fall back to CPU kernels.

Is this because libnvidia-ml.so.1 is not used any more? ldd on gmx lists only

libcufft.so.9.0 => /cm/shared/apps/cuda90/toolkit/9.0.176/lib64/libcufft.so.9.0 (0x00002aaaaec4b000)

With a working version (Cuda 8, gcc 4.85, -DGMX_USE_NVML=ON), gmx uses:

libcufft.so.8.0 => /cm/shared/apps/cuda80/toolkit/8.0.61/lib64/libcufft.so.8.0 (0x00002aaaaecb9000)
libnvidia-ml.so.1 => /cm/local/apps/cuda/libs/current/lib64/libnvidia-ml.so.1 (0x00002aaab7b08000)

#6 Updated by Jochen Hub about 2 years ago

Hi,

nvidia-smi says:

NVIDIA-SMI 375.66 Driver Version: 375.66

Is this insufficient for Cuda 9?

Jochen

#7 Updated by Erik Lindahl about 2 years ago

NVIDIA unfortunately has a habit of frequently requiring updated drivers for newer CUDA versions, so there isn't much we can do about that.

#8 Updated by Aleksei Iupinov about 2 years ago

Yes, Jochen. I'm not sure what the best source for this info is, but this page says you need 384.81: https://github.com/NVIDIA/nvidia-docker/wiki/CUDA#requirements

#9 Updated by Aleksei Iupinov about 2 years ago

So the issue is happening not just between CUDA 8 and 9, but between nothing and any CUDA :-)

Trying a default build with CUDA 8 toolkit loaded and with no display driver:

../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetPciInfo_v2'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlErrorString'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetMaxClockInfo'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlShutdown'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlInit_v2'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetAPIRestriction'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetHandleByIndex_v2'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetApplicationsClock'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceGetCount_v2'
../../lib/libgromacs.so.3.0.0: undefined reference to `nvmlDeviceSetApplicationsClocks'

I think that during NVML detection we could try to compile a simple program calling the NVML API.
If that fails, we can assume the NVML library is supposed to correspond to the driver's version, terminate right there with an "old driver" message to save people time if they want to run on this machine, or alternatively suggest passing GMX_USE_NVML=OFF if they know what they're doing.
Does this line of thought make sense?
I think it's better than disabling NVML by default and not warning about old drivers early in the build.
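
As a rough illustration of this proposal, a minimal sketch of such a detection test, assuming NVML_INCLUDE_DIR and NVML_LIBRARY were already filled in by FindNVML (variable names follow the attached CMakeCache.txt; this is not the actual GROMACS code):

include(CheckCXXSourceCompiles)
set(CMAKE_REQUIRED_INCLUDES  "${NVML_INCLUDE_DIR}")
set(CMAKE_REQUIRED_LIBRARIES "${NVML_LIBRARY}")
# check_cxx_source_compiles() compiles *and links* a full executable,
# so unresolved NVML symbols would make this check fail.
check_cxx_source_compiles("
#include <nvml.h>
int main() { return nvmlInit() == NVML_SUCCESS ? 0 : 1; }
" NVML_LINKS_OK)
if(NOT NVML_LINKS_OK)
    message(STATUS "NVML found but a test program failed to link; disabling NVML support")
    set(GMX_USE_NVML OFF)
endif()

Note, though, that comments #12 and #14 below suggest such a sample program links fine against the stub library, so this check alone would not have caught the failure reported here.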

#10 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '2' for Issue #2311.
Uploader: Aleksei Iupinov ()
Change-Id: gromacs~release-2018~If8cf15f68da7af5f9d02d901b956ee2e48703577
Gerrit URL: https://gerrit.gromacs.org/7264

#11 Updated by Aleksei Iupinov about 2 years ago

So I just pushed the work-in-progress change, which doesn't actually fix the issue on a machine with no NVIDIA driver.
Feel free to take over and improve.

#12 Updated by Aleksei Iupinov about 2 years ago

Could it be that we're simply not linking our binaries correctly, since the sample NVML code compiles and links against the NVML stub from CUDA 8.0 on the machine with no NVIDIA driver/hardware?
In that case we should fix that instead, and keep relying on runtime initialization checks.

#13 Updated by Jochen Hub about 2 years ago

OK, thank you. I will ask our administrator to upgrade the drivers.

Jochen

#14 Updated by Stefan Fleischmann about 2 years ago

Aleksei Iupinov wrote:

Could it be that we're simply not linking our binaries correctly, since the sample NVML code compiles and links against the NVML stub from CUDA 8.0 on the machine with no NVIDIA driver/hardware?
In that case we should fix that instead, and keep relying on runtime initialization checks.

Based on your test I'd say it looks that way.

#15 Updated by Aleksei Iupinov about 2 years ago

P.S. I think someone more knowledgeable in the Gromacs build system should look into the gmx/template NVML linking issue.
The easy way out for now would be to disable NVML by default.

#16 Updated by Szilárd Páll about 2 years ago

Aleksei Iupinov wrote:

Could it be that we're simply not linking our binaries correctly

"Correctly" jumps to an assumption before understanding where the issue is. The stub library provided by the CUDA runtime is clearly insufficient for the what the GROMACS link stage does, but I'm not convinced the issue is in the GROMACS build system.

What I noticed, but have not followed up on yet, but wanted to record here is that I got a link error for "libnvidia-ml.so.1", but there's no soversioned stub library, so not sure where that came from (perhaps cmake resolves the real, non-stub so's name somehow from the stub's headers?).

#17 Updated by Mark Abraham about 2 years ago

Szilárd Páll wrote:

Aleksei Iupinov wrote:

Could it be that we're simply not linking our binaries correctly

"Correctly" jumps to an assumption before understanding where the issue is. The stub library provided by the CUDA runtime is clearly insufficient for the what the GROMACS link stage does, but I'm not convinced the issue is in the GROMACS build system.

What I noticed, but have not followed up on yet, but wanted to record here is that I got a link error for "libnvidia-ml.so.1", but there's no soversioned stub library, so not sure where that came from (perhaps cmake resolves the real, non-stub so's name somehow from the stub's headers?).

There's usually one to be found, e.g. find /opt/tcbsys -name libnvidia-ml.so

#18 Updated by Aleksei Iupinov about 2 years ago

  • Related to Bug #2061: fix FindNVML for CUDA 8.0 added

#19 Updated by Aleksei Iupinov about 2 years ago

So from reading https://gitlab.kitware.com/cmake/cmake/issues/17175 and https://codeyarns.com/2017/11/13/stub-library-warning-on-libnvidia-ml-so/ and people's suggestions,
it's easy to see that the whole purpose of the stub library is to have something to link against, so we should always be linking binaries against it - and then we just have to be careful not to let the stub library path get into the RPATH.
The first link has a workaround for Linux only. I might try it.
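
A hedged sketch of that approach (illustrative names only; a real fix would filter just the stubs entry out of the RPATH rather than suppressing it wholesale):

# Always link against the toolkit's stub, but keep the stubs directory out
# of the RPATH so the loader picks up the driver's real libnvidia-ml.so.1
# at run time. CUDA_TOOLKIT_ROOT_DIR comes from FindCUDA; NVML_STUB_LIBRARY
# is an illustrative variable name.
find_library(NVML_STUB_LIBRARY nvidia-ml
             PATHS "${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs" NO_DEFAULT_PATH)
target_link_libraries(libgromacs PRIVATE "${NVML_STUB_LIBRARY}")
# Blunt way to keep linked-library directories (including .../stubs) out of
# the binary's RPATH:
set_target_properties(libgromacs PROPERTIES
                      SKIP_BUILD_RPATH TRUE
                      INSTALL_RPATH_USE_LINK_PATH FALSE)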

#20 Updated by Szilárd Páll about 2 years ago

This is the link error I get:

[100%] Generating command-line completions for programs
/nethome/pszilard/projects/gromacs/gromacs-master/build_test/bin/gmx: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
Failed to generate shell completions, will build GROMACS without. Set GMX_BUILD_HELP=OFF if you want to skip this notification and warnings during installation.
[100%] Built target completion

What's weird is that .so.1 is nowhere in the cache, so we should be linking against the .so:

pszilard@login1:gromacs-master/build_test $ grep nvidia-ml CMakeCache.txt
NVML_LIBRARY:FILEPATH=/opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so
gpu_utilstest_cuda_LIB_DEPENDS:STATIC=general;/opt/tcbsys/cuda/9.0/lib64/libcudart_static.a;general;-lpthread;general;/usr/lib/x86_64-linux-gnu/librt.so;general;/usr/lib/x86_64-linux-gnu/libdl.so;general;/opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so;general;rt;
libgromacs_LIB_DEPENDS:STATIC=general;/opt/tcbsys/cuda/9.0/lib64/libcudart_static.a;general;-lpthread;general;/usr/lib/x86_64-linux-gnu/librt.so;general;/usr/lib/x86_64-linux-gnu/libdl.so;general;/opt/tcbsys/cuda/9.0/lib64/libcufft.so;general;/opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so;general;rt;general;/opt/tcbsys/fftw/3.3.6-pl1-sse2-avx-avx2-avx128fma-avx512/lib/libfftw3f.a;general;-lpthread;general;-fopenmp;general;m;
FIND_PACKAGE_MESSAGE_DETAILS_NVML:INTERNAL=[/opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so][/opt/tcbsys/cuda/9.0/include][v()]

$ ls -l //opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so
-rwxr-xr-x 1 root root 28K Nov 22 13:36 //opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so
[ 97%] Linking CXX executable ../../bin/gmx
cd /nethome/pszilard/projects/gromacs/gromacs-master/build_test/src/programs && /opt/tcbsys/cmake/3.9.4/bin/cmake -E cmake_link_script CMakeFiles/gmx.dir/link.txt --verbose=1
/opt/tcbsys/gcc/5.4/bin/g++-5   -march=core-avx2    -std=c++11  -Wundef -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds      CMakeFiles/gmx.dir/gmx.cpp.o CMakeFiles/gmx.dir/legacymodules.cpp.o CMakeFiles/mdrun_objlib.dir/mdrun/md.cpp.o CMakeFiles/mdrun_objlib.dir/mdrun/mdrun.cpp.o CMakeFiles/mdrun_objlib.dir/mdrun/membed.cpp.o CMakeFiles/mdrun_objlib.dir/mdrun/repl_ex.cpp.o CMakeFiles/mdrun_objlib.dir/mdrun/runner.cpp.o CMakeFiles/view_objlib.dir/view/view.cpp.o  -o ../../bin/gmx -Wl,-rpath,"\$ORIGIN/../lib:/opt/tcbsys/cuda/9.0/lib64:/opt/tcbsys/cuda/9.0/lib64/stubs" ../../lib/libgromacs.a -fopenmp /opt/tcbsys/cuda/9.0/lib64/libcudart_static.a -lpthread /usr/lib/x86_64-linux-gnu/librt.so /usr/lib/x86_64-linux-gnu/libdl.so /opt/tcbsys/cuda/9.0/lib64/libcufft.so /opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so -lrt /opt/tcbsys/fftw/3.3.6-pl1-sse2-avx-avx2-avx128fma-avx512/lib/libfftw3f.a -lpthread /usr/lib/x86_64-linux-gnu/librt.so /usr/lib/x86_64-linux-gnu/libdl.so /opt/tcbsys/cuda/9.0/lib64/libcufft.so /opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so -lrt /opt/tcbsys/fftw/3.3.6-pl1-sse2-avx-avx2-avx128fma-avx512/lib/libfftw3f.a -lm
make[2]: Leaving directory '/nethome/pszilard/projects/gromacs/gromacs-master/build_test'
[ 98%] Built target gmx

#21 Updated by Christoph Junghans about 2 years ago

Can you do a

ldd /opt/tcbsys/cuda/9.0/lib64/libcufft.so

#22 Updated by Szilárd Páll about 2 years ago

Christoph Junghans wrote:

Can you do a
[...]


$ ldd /opt/tcbsys/cuda/9.0/lib64/libcufft.so
    linux-vdso.so.1 =>  (0x00007fff1cffa000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007faec4986000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007faec467d000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007faec4460000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007faec4258000)
    libstdc++.so.6 => /opt/tcbsys/gcc/5.4/lib64/libstdc++.so.6 (0x00007faec3edd000)
    libgcc_s.so.1 => /opt/tcbsys/gcc/5.4/lib64/libgcc_s.so.1 (0x00007faec3cc7000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007faec38fd000)
    /lib64/ld-linux-x86-64.so.2 (0x00007faeccc2b000)

#23 Updated by Szilárd Páll about 2 years ago

Szilárd Páll wrote:

Christoph Junghans wrote:

Can you do a
[...]

[...]

I do not think NVML is a dependency of any other CUDA libraries; given that it's a management library, that would not really make sense anyway, would it?

#24 Updated by Christoph Junghans about 2 years ago

Yeah, it is not logical, but try:

ldd /opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so

#25 Updated by Szilárd Páll about 2 years ago

$ ldd /opt/tcbsys/cuda/9.0/lib64/stubs/libnvidia-ml.so
    linux-vdso.so.1 =>  (0x00007ffcad4c2000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f93891c5000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9388fc1000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9388bf7000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f93895e9000)

Doesn't look like the .so.1 comes from here.

#26 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2311.
Uploader: Szilárd Páll ()
Change-Id: gromacs~release-2018~Id083254bc4344fbb3a91e7dd645a5f814163d043
Gerrit URL: https://gerrit.gromacs.org/7313

#27 Updated by Erik Lindahl about 2 years ago

  • Status changed from New to Fix uploaded

#28 Updated by Erik Lindahl about 2 years ago

  • Status changed from Fix uploaded to Resolved

#29 Updated by Erik Lindahl about 2 years ago

  • Status changed from Resolved to Closed

#30 Updated by Erik Lindahl about 2 years ago

  • Status changed from Closed to Accepted

Reverted in https://gerrit.gromacs.org/#/c/7318/.

My bad for committing before testing. It won't work to just define the option inside an if-block that checks whether it has already been set - then it will never show up in the CMake GUI.

Second, if we don't even try to compile with NVML by default, we should not produce large warnings on stderr telling the user to compile Gromacs with it when we detect Tesla cards.
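
For reference, the usual CMake pattern avoids the guard entirely, because option() already leaves a pre-existing cache value alone; an illustrative sketch (the description string is made up, and this is not the reverted change itself):

# Declare the option unconditionally so it always shows up in the CMake GUI;
# option() does not overwrite a value the user already has in the cache.
option(GMX_USE_NVML "Use NVML for GPU application clock handling" OFF)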

#31 Updated by Gerrit Code Review Bot about 2 years ago

Gerrit received a related patchset '1' for Issue #2311.
Uploader: Mark Abraham ()
Change-Id: gromacs~release-2018~I03e833964995f88350bf6cb70c06f1e3f67bb865
Gerrit URL: https://gerrit.gromacs.org/7322

#32 Updated by Jonathan Vincent about 2 years ago

Just to make sure I understand this correctly.

  • libnvidia-ml.so is included when libgromacs is created, but is not on the main link line for gmx etc.
  • libgromacs can be either a static or a dynamic library, depending on how GMX_PREFER_STATIC_LIBS is set (and if you build it statically you will no longer see libnvidia-ml.so in ldd).
  • When linking, the library from /usr/ (installed by the NVIDIA driver package) is used if present; otherwise the stubs library is used.

We are seeing:

A failure when trying to build with CUDA 9 when the driver package is installed but too old for CUDA 9. Given that the driver's library is used if it is there, even if it is too old, this is not surprising. I confirmed this one: I built on a machine with a too-old driver, and libgromacs was built with an explicit /usr/lib64/libnvidia-ml.so on the link line.

I was not sure what happens if the driver package is not present at all. From the above, it should link with the stubs, which seems like a problem if libgromacs is built statically, but should be OK if it is built dynamically. I am in the process of double-checking that, but if anyone has tried it, let me know.

Is there a reason libnvidia-ml.so is linked into libgromacs? It might be simpler if it were linked into gmx dynamically; then you could always use the stubs library and should not have compile problems. There would be potential runtime problems, but hopefully those would give a helpful error message.

#33 Updated by Erik Lindahl about 2 years ago

  • Status changed from Accepted to Resolved

"Resolved" by disabling NVML by default for now.

NVIDIA is moving away from allowing users to set application clocks anyway, so it's less important than it was before.

#35 Updated by Jonathan Vincent about 2 years ago

OK, I have had a bit more of a look at this.

I built on a machine with just the toolkit and without the driver.

For static linking, the stubs library works fine for me, and you get the expected behaviour: if you then run on a machine with a current driver, it works; if you try to run on a machine where the driver is too old, you get the error message:

NOTE: Error occurred during GPU detection:
CUDA driver version is insufficient for CUDA runtime version
Can not use GPU acceleration, will fall back to CPU kernels.

For dynamic linking with the stubs library, I needed to add $(ROOT)/lib64/stubs/libnvidia-ml.so to the link lines for template and gmx; then it worked fine as well. I just did that manually for testing, taking the verbose=1 output of cmake and doing a manual compile.

So I think a fix would be:

1/ Don't use /usr/lib64/libnvidia-ml.so even if it is there.
2/ For GMX_PREFER_STATIC_LIBS=OFF, also add $(ROOT)/lib64/stubs/libnvidia-ml.so to the link lines for template and gmx; a sketch of what that could look like follows below.

The second part is already done for the statically linked version, which is why that currently works. At that point it would probably be better not to link libgromacs against libnvidia-ml.so as well.
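
In CMake terms, step 2/ could look roughly like this (a sketch only: NVML_STUB_LIBRARY is an assumed variable resolving to the stub's full path, as in the sketch under #19, and gmx and template are the executable targets):

# With a dynamic libgromacs, also put the stub on the executables' own link
# lines so their NVML references resolve at link time.
if(NOT GMX_PREFER_STATIC_LIBS)
    target_link_libraries(gmx      PRIVATE "${NVML_STUB_LIBRARY}")
    target_link_libraries(template PRIVATE "${NVML_STUB_LIBRARY}")
endif()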

#36 Updated by Erik Lindahl about 2 years ago

  • Status changed from Resolved to Closed
