CUDA CC 2.0 issue
Using either master HEAD, release-2016 HEAD, or tag v2016:
If I target compute and sm 20 (with
-DGMX_CUDA_TARGET_SM=20 -DGMX_CUDA_TARGET_COMPUTE=20), then by default we get a single CUDA compilation unit (since that's the only thing that can work). The regressiontests pass, but we have an issue, e.g.
$ bin/mdrun-test --gtest_filter=\*Swap\*
...
Running on 1 node with total 4 cores, 8 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
    SIMD instructions most likely to fit this hardware: AVX_256
    SIMD instructions selected at GROMACS compile time: AVX_256
  Hardware topology: Full, with devices
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 960, compute cap.: 5.2, ECC: no, stat: compatible
    #1: NVIDIA GeForce GTX 660 Ti, compute cap.: 3.0, ECC: no, stat: compatible

Reading file /home/marklocal/git/r2016/build-cmake-gcc-gpu-cc20-debug/src/programs/mdrun/tests/Testing/Temporary/CompelTest_SwapCanRun.tpr, VERSION 2016.5-dev-20170923-d36730ca3 (single precision)
Using 1 MPI thread
Using 1 OpenMP thread

1 GPU user-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

NOTE: Thread affinity setting failed. This can cause performance degradation.
      If you think your settings are correct, ask on the gmx-users list.

SWAP: Determining initial numbers of ions per compartment.
SWAP: Setting pointers for checkpoint writing
SWAP: Channel 0 flux history for ion type NA+ (charge 1): 0 molecules
SWAP: Channel 1 flux history for ion type NA+ (charge 1): 0 molecules
SWAP: Channel 0 flux history for ion type CL- (charge -1): 0 molecules
SWAP: Channel 1 flux history for ion type CL- (charge -1): 0 molecules
starting mdrun 'Channel_coco in octane membrane'
2 steps, 0.0 ps.

-------------------------------------------------------
Program:     mdrun-test, version 2016.5-dev-20170923-d36730ca3
Source file: src/gromacs/mdlib/nbnxn_cuda/nbnxn_cuda.cu (line 633)

Fatal error:
cudaStreamSynchronize failed in cu_blockwait_nb: an illegal memory access was
encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
If I target compute and sm 30 then, by default, I get multiple CUDA compilation units and there is no issue.
If I target compute and sm 30 and set -DGMX_CUDA_NB_SINGLE_COMPILATION_UNIT=on, then there is also no issue.
So it looks to me like something in the CC 2.0 support is broken, or at least not properly used by the mdrun-test code. I'll try to bisect a bit more and see what I learn.
The absence of a reported bug suggests that there is not much use of release-2016 on CC 2.0 hardware, so we should consider removing CC 2.0 support for GROMACS 2017. That would simplify our texture and CMake code, and remove the question of whether someone should try to cover this case in Jenkins; clearly nobody has prioritized doing or automating testing on this old setup. Note that NVIDIA has already deprecated these compilation targets in nvcc (and we suppress the warning). If we go this path, then I suggest we don't bother trying to fix release-2016, and if someone later reports an issue, we suggest they use an even earlier version.
Check CUDA available/compiled code compatibility
Added an early check to detect when the gmx binary neither embeds code
compatible with the GPU device it tries to use, nor contains PTX that
could have been JIT-compiled for it.
Additionally, if the user manually sets GMX_CUDA_TARGET_COMPUTE=20 with
no later SM or COMPUTE target, but runs on >2.0 hardware, we'd be
executing JIT-compiled Fermi kernels with incorrect host-side
assumptions (e.g. the amount of shared memory allocated, or the texture
type). This change also prevents such cases.
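The check described in this commit message can be sketched as host-side logic. This is a minimal illustration, not GROMACS's actual implementation; the struct and function names are hypothetical, and in reality the target lists would come from the values baked in at CMake configure time:

```cpp
#include <cassert>
#include <vector>

// Hypothetical representation of the code targets embedded in the binary.
struct CompiledTargets {
    std::vector<int> sm;      // SASS targets, e.g. {30, 35, 52}
    std::vector<int> compute; // PTX (virtual arch) targets, e.g. {30}
};

// A device of capability deviceCc (e.g. 52 for CC 5.2) can run the binary if
// an exactly matching SASS target is embedded (simplified: real cubin
// compatibility also covers newer minor revisions within the same major
// arch), or if there is PTX the driver could JIT for it. Fermi PTX on newer
// devices is rejected, since that kernel flavor makes different host-side
// assumptions (shared memory size, texture type).
bool binaryCompatibleWithDevice(const CompiledTargets& t, int deviceCc)
{
    for (int sm : t.sm)
    {
        if (sm == deviceCc)
        {
            return true;
        }
    }
    for (int virtArch : t.compute)
    {
        if (virtArch <= deviceCc && !(virtArch < 30 && deviceCc >= 30))
        {
            return true;
        }
    }
    return false;
}
```

With this logic, a compute_20-only build is accepted on a CC 2.0 device but rejected on CC 5.2, and a build with no usable SASS or PTX at all is likewise rejected early.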
#1 Updated by Szilárd Páll about 3 years ago
The problem in practice is that you're requesting only sm_20 and compute_20 (i.e. SASS for the 2.0 arch and PTX for the 2.0 virtual arch) to be embedded in the binary, but this binary is then run on a CC 5.2 device. JIT compilation is in this case considered valid from the 2.0 virtual arch to sm_52. However, in practice it is not valid, as the two kernel flavors have different resource needs (see
calc_shmem_required_nonbonded()). Moreover, since JIT silently accepts forward-compatible compilation (of already pre-processed code, otherwise the #error in nbnxn_cuda_kernel_fermi.cuh would get triggered), at runtime we have no way to detect that the wrong kernel is getting dispatched to a newer GPU.
This is very much a corner case that I do not think is worth catering for; the only way such an error can reasonably be triggered is if a user compiled a binary many years ago, for some reason only for Fermi, and then upgraded their GPU.
I'm not even sure a workaround is possible, as we can't query at runtime which kernels were compiled in without string-parsing command-line options (which seems like overkill).
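The resource mismatch this comment describes can be illustrated as follows. The byte counts below are made up for illustration only (the real values computed by calc_shmem_required_nonbonded() depend on kernel parameters); the point is that the host sizes the allocation from the device's CC while the JIT-ed kernel is still the Fermi flavor:

```cpp
#include <cassert>

// Hypothetical per-arch dynamic shared-memory requirement; the real sizing
// logic lives in calc_shmem_required_nonbonded() and uses different numbers.
int shmemRequiredForArch(int cc)
{
    return cc >= 30 ? 4096   // post-Fermi flavor: smaller footprint
                    : 16384; // Fermi flavor expects extra staging buffers
}

// What happens when compute_20 PTX is JIT-ed onto a CC 5.2 device:
int hostAllocation()    { return shmemRequiredForArch(52); } // host sees CC 5.2
int kernelExpectation() { return shmemRequiredForArch(20); } // kernel is Fermi flavor
```

The host under-allocates relative to what the Fermi-flavor kernel assumes, which is consistent with the illegal memory access observed at stream synchronization.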
#3 Updated by Aleksei Iupinov about 3 years ago
Just ran the same 2.0/3.0 procedures on ewald-test with CUDA 6.5.
In the compute_20 build, only the PME spread tests fail, with exactly the same error (an illegal memory access was encountered during stream synchronization).
The tests don't fail if textures are disabled.
Spread is the only PME kernel that uses textures, and they are its only arch-dependent feature.
Clearly, the problem lies with texture references: either we misuse them, or they are not supposed to be used on CC >= 3.0 hardware at all.
So a general solution is needed.
#4 Updated by Szilárd Páll almost 3 years ago
The source of the issue is still the same, I think: the host code only considers the CC of the device in use at runtime, and does not account for the kernel possibly being compiled from PTX targeting a virtual arch that does not support a certain feature (be it texture objects or shuffle reduction).
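One way to make the host-side decisions consistent with this observation, as a sketch only (not an implemented GROMACS fix), is to drive feature choices by the minimum of the device CC and the virtual arch the running kernel was compiled from, so the host never assumes a feature the JIT-ed PTX cannot use:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: the "effective" CC to use for host-side decisions
// such as texture type or shared-memory sizing.
int effectiveCc(int deviceCc, int compiledVirtArch)
{
    return std::min(deviceCc, compiledVirtArch);
}
```

For compute_20 PTX running on a CC 5.2 device, this yields 20, so the host would set up the Fermi-compatible paths the kernel actually expects. The open question from comment #1 remains how the host learns which virtual arch the dispatched kernel came from.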
#5 Updated by Mark Abraham almost 3 years ago
So the host code needs more flexible behaviour, so that the PTX will always work. Further, it sounds like we need that so that compilation for multiple specific archs is always compatible with the host-side behaviour.
For 2016, either we need a build mode supporting CC 2.0 that always works (and doesn't break other support), or we need to remove 2.0 support. Similarly for 2017, though I would further suggest we don't implement 2.0 support for new functionality.