segfault with multi-node runs and GPU sharing
When running on more than one physical node with GPUs, a segmentation fault occurs under certain circumstances (quite often in practice); e.g. on a Cray XK7 with two nodes and 8 ranks per node:
OMP_NUM_THREADS=2 aprun -cc none -n 16 -N 8 -d 2 mdrun_mpi -npme 0 -pin on -nb gpu -gpu_id 00000000
Fix DD load balancing bug with GPU sharing
The recent DD load balancing fix (ba8232e9), which solved the issue of
an incorrect imbalance measure with GPU sharing, addressed GPUs with
incorrect indexing. This caused out-of-bounds indexing in the GPU ID
query function. The query function also had a bug in its error checking,
which allowed the incorrect index to pass unnoticed.
mdrun -nb cpu -gpu_id ... is now also allowed; previously it gave
a fatal error.
This commit addresses both issues; fixes #1385
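To illustrate the class of bug described here, the following is a minimal sketch in C and not the actual GROMACS code. The function names gpu_id_for_local_rank and node_local_index are hypothetical, and the sketch assumes that the -gpu_id string lists one single-digit GPU ID per rank on a node and that ranks are placed on nodes in contiguous blocks. Under those assumptions, indexing the string with a global rank instead of a node-local index runs past its end as soon as a second node is used, unless the lookup checks its bounds:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a GROMACS function): map a global rank to its
 * index within its node, assuming ranks fill nodes in contiguous blocks.
 * Returns -1 on invalid input. */
int node_local_index(int global_rank, int ranks_per_node)
{
    if (global_rank < 0 || ranks_per_node <= 0)
    {
        return -1;
    }
    return global_rank % ranks_per_node;
}

/* Hypothetical helper (not a GROMACS function): look up the GPU ID for a
 * node-local rank index in a -gpu_id style string of single digits,
 * e.g. "00000000". Returns -1 instead of reading out of bounds. */
int gpu_id_for_local_rank(const char *gpu_id_str, int local_rank)
{
    size_t n;
    char   c;

    if (gpu_id_str == NULL || local_rank < 0)
    {
        return -1;
    }
    n = strlen(gpu_id_str);
    if ((size_t)local_rank >= n)
    {
        /* The index is past the end of the ID list: report an error. */
        return -1;
    }
    c = gpu_id_str[local_rank];
    if (c < '0' || c > '9')
    {
        return -1;
    }
    return c - '0';
}
```

With the command from the report (8 ranks per node, -gpu_id 00000000), global rank 8 on the second node maps to node-local index 0 and therefore to GPU 0 in this sketch, whereas passing 8 directly to the lookup is rejected rather than read out of bounds.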
#1 Updated by Szilárd Páll almost 6 years ago
- Status changed from New to In Progress
- Assignee changed from Berk Hess to Szilárd Páll
The bug is triggered by incorrect GPU ID indexing with Nranks > 2 and GPU sharing among ranks: http://redmine.gromacs.org/projects/gromacs/repository/revisions/ba8232e965652669cc0b558a273f81a4d9733d25/entry/src/mdlib/domdec.c#L5700
A fix will be uploaded shortly.