concurrency-related bug with thread-MPI
- hardware detection report not printed to log/console when ntmpi>1;
- GPU oversubscription check not working: e.g. -ntmpi 2 -gpu_id 00 starts to run, but due to concurrency issues with GPU context sharing between tMPI ranks, it throws an error or segfaults before exiting (which is the reason why GPU oversubscription has not been allowed with tMPI).
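To illustrate what an oversubscription check has to catch, here is a minimal sketch in C. It assumes that the -gpu_id string carries one decimal digit per rank, so that "00" maps both ranks to GPU 0. The function name gpu_ids_oversubscribed is hypothetical and is not taken from the GROMACS sources.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a GROMACS function).
 * Assumption: each character of gpu_id_str is the GPU id of one rank.
 * Returns 1 if any GPU id is assigned to more than one rank, else 0. */
int gpu_ids_oversubscribed(const char *gpu_id_str)
{
    int    seen[10] = {0};
    size_t i, n;

    assert(gpu_id_str != NULL);
    n = strlen(gpu_id_str);
    for (i = 0; i < n; i++)
    {
        int id = gpu_id_str[i] - '0';
        if (id < 0 || id > 9)
        {
            continue; /* this sketch simply skips non-digit characters */
        }
        if (seen[id])
        {
            return 1; /* a second rank wants the same GPU */
        }
        seen[id] = 1;
    }
    return 0;
}
```

With this sketch, "00" is flagged as oversubscribed and "01" is not; a run with thread-MPI would be expected to stop at this point rather than start.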
The issue has also been reported on the users' list:
reorganized GPU detection and selection
The GPU selection has been separated from the GPU detection
and now happens after the thread-MPI threads are started.
The GPU user/auto-selected options have been removed from
gmx_hw_info_t, such that it only contains hardware info
and can be passed around as const.
As both the CPU and GPU options structs are now tMPI rank local,
tMPI thread concurrency issues are avoided.
Fixes #1334 #1359
The GPU detection is now skipped with mdrun -nb cpu
CPU acceleration binary/hardware mismatch is now only printed once
to stderr (instead of #MPI-rank times to stdout).
Removed the master_inf_t struct.
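The separation described in the commit message above can be sketched roughly as follows. This is an illustration only: the type and function names (hw_info_sketch_t, gpu_opt_sketch_t, detect_hardware_sketch, select_gpu_sketch) are hypothetical, the GPU count is injected instead of being queried from a driver, and the round-robin default is an assumption made for the example, not the actual selection policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hardware info only: filled once, then shared read-only (hypothetical type). */
typedef struct
{
    int ngpu_detected;
} hw_info_sketch_t;

/* Selection options: one instance per rank, never shared (hypothetical type). */
typedef struct
{
    int gpu_id;   /* GPU this rank will use */
    int user_set; /* nonzero if gpu_id was given by the user */
} gpu_opt_sketch_t;

/* Detection: run once, before any threads are started.
 * The count is passed in here; real code would query the GPU runtime. */
hw_info_sketch_t detect_hardware_sketch(int ngpu)
{
    hw_info_sketch_t hwinfo;
    hwinfo.ngpu_detected = ngpu;
    return hwinfo;
}

/* Selection: run on every rank after the threads exist.
 * It only reads hwinfo (const) and only writes this rank's own opt,
 * so no locking is needed. Returns 0 on success, -1 if no GPU is available. */
int select_gpu_sketch(const hw_info_sketch_t *hwinfo, int rank, gpu_opt_sketch_t *opt)
{
    assert(hwinfo != NULL && opt != NULL && rank >= 0);
    if (hwinfo->ngpu_detected <= 0)
    {
        return -1;
    }
    if (!opt->user_set)
    {
        /* assumed default for this sketch: round-robin over detected GPUs */
        opt->gpu_id = rank % hwinfo->ngpu_detected;
    }
    return 0;
}
```

The point of the design is that the only shared object is const after detection, while everything mutable lives in rank-local storage.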
#3 Updated by Szilárd Páll about 6 years ago
Here's what happens: because of the mutex-based implementation, the first thread to arrive now grabs the mutex and does the consistency checks, but all messages and warnings are printed using md_print_info(), which only prints on rank 0. Hence, if rank 0 is not the first to arrive and grab the mutex, the warnings/errors as well as the detection information will not be issued.
I don't have a suggestion for a good solution. Making sure that tMPI rank 0 executes the critical region would defeat the purpose of the elegant "proper" threading-style mutex-based implementation. To me it still seems that this issue represents yet another reason for not treating the thread-MPI parallelization on the top level as a "native" multi-threading implementation, but more like an MPI implementation which in some cases requires special measures to ensure thread safety.
For the full discussion see my comments on gerrit #2433 PS5.
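The failure mode described in comment #3 can be replayed deterministically with a small C sketch. Instead of real threads and a mutex, it takes the order in which ranks arrive at the critical region as input. All names here (print_on_rank0, check_once, messages_printed) are hypothetical stand-ins and not GROMACS functions; print_on_rank0 only mimics the rank-0-only behavior attributed to md_print_info() above.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for a rank-0-only print routine: returns 1 if it printed. */
static int print_on_rank0(int rank, const char *msg)
{
    if (rank != 0)
    {
        return 0; /* message is silently dropped on all other ranks */
    }
    printf("%s\n", msg);
    return 1;
}

/* Stand-in for the mutex-protected region: only the first caller does
 * the checks (and the printing); later callers skip it. */
static int check_once(int rank, int *already_done)
{
    int printed = 0;
    if (!*already_done)
    {
        *already_done = 1;
        printed = print_on_rank0(rank, "hardware detection report / warnings");
    }
    return printed;
}

/* Replays a given arrival order and returns how many messages came out. */
int messages_printed(const int *arrival_order, int nranks)
{
    int done = 0, printed = 0, i;

    assert(arrival_order != NULL && nranks > 0);
    for (i = 0; i < nranks; i++)
    {
        printed += check_once(arrival_order[i], &done);
    }
    return printed;
}
```

If rank 0 arrives first, the report is printed once; if any other rank arrives first, nothing is printed at all, which matches the missing detection report with ntmpi>1.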
#4 Updated by Szilárd Páll about 6 years ago
- Target version set to 4.6.4
Here is a possible workaround: in the beginning of
gmx_check_hw_runconf_consistency() do the following:
t_commrec *cr_hack;
#ifdef GMX_THREAD_MPI
cr_hack = NULL;
#else
cr_hack = cr;
#endif
This is a rather hackish workaround, but to me it seems the least invasive solution - unless we want to strip the mutex-based implementation (which I'd be in favor of).
PS: I set the target version because the problem has been well defined and there is a suggestion for the solution.
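A rough sketch of why the cr_hack workaround in comment #4 would help is given below. It rests on an assumption that is not stated in this thread: that the print routine treats a NULL communication record as "print unconditionally" and otherwise prints only on rank 0. The names commrec_sketch_t, would_print and pick_print_cr are hypothetical, and a runtime flag replaces the GMX_THREAD_MPI preprocessor switch so that both branches can be exercised in one build.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a communication record (hypothetical type). */
typedef struct
{
    int rank;
} commrec_sketch_t;

/* Assumed print rule: no record means "always print",
 * otherwise only rank 0 prints. */
int would_print(const commrec_sketch_t *cr)
{
    return cr == NULL || cr->rank == 0;
}

/* Mirrors the cr_hack selection: with thread-MPI, drop the record so that
 * whichever rank holds the mutex is allowed to print. */
const commrec_sketch_t *pick_print_cr(const commrec_sketch_t *cr, int thread_mpi_build)
{
    assert(cr != NULL);
    return thread_mpi_build ? NULL : cr;
}
```

Under that assumption, a thread on rank 1 that enters the critical region first would still emit the messages in a thread-MPI build, while the behavior with real MPI stays rank-0-only.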
#5 Updated by Mark Abraham about 6 years ago
- Status changed from New to Fix uploaded