Bug #1385
segfault with multi-node runs and GPU sharing
Description
When running on more than one physical node with GPUs, a segmentation fault occurs under certain circumstances (but quite often); e.g. on a Cray XK7 with two nodes and 8 ranks per node:
OMP_NUM_THREADS=2 aprun -cc none -n 16 -N 8 -d 2 mdrun_mpi -npme 0 -pin on -nb gpu -gpu_id 00000000
Associated revisions
History
#1 Updated by Szilárd Páll about 6 years ago
- Status changed from New to In Progress
- Assignee changed from Berk Hess to Szilárd Páll
The bug is triggered by incorrect GPU ID indexing with Nranks > 2 and GPU sharing among ranks: http://redmine.gromacs.org/projects/gromacs/repository/revisions/ba8232e965652669cc0b558a273f81a4d9733d25/entry/src/mdlib/domdec.c#L5700
A fix will be uploaded shortly.
#2 Updated by Szilárd Páll about 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset 904d4645ca712bd58e0a22fcdebead0291ec19d3.
#3 Updated by Mark Abraham about 6 years ago
- Status changed from Resolved to Closed
Fix DD load balancing bug with GPU sharing
The recent DD load balancing fix, which solved the issue of an incorrect
imbalance measure with GPU sharing (ba8232e9), addressed GPUs with
incorrect indexing. This caused out-of-bounds indexing in the GPU ID
query function. The query function also had a bug in its error checking
that allowed the incorrect indexing.
mdrun -nb cpu -gpu_id ... is now also allowed; previously it gave
a fatal error.
This commit addresses both issues; fixes #1385
Change-Id: I2800f610b873da92afe78bbfd869258f378ba2d7
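
For readers unfamiliar with this class of bug, the following is a minimal sketch of a bounds-checked GPU ID lookup. It is not the GROMACS code or the actual fix. The function names gpu_id_for_local_rank and local_rank_index are hypothetical, and the sketch assumes that the -gpu_id string holds one digit per rank on a node and that ranks are placed on nodes in contiguous blocks. It illustrates how indexing such a string with a global rank number (0..15 in the run above) rather than a node-local one (0..7) would read past the end of an 8-character string unless the check rejects it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not GROMACS code): node-local index of a rank,
 * assuming ranks are assigned to nodes in contiguous blocks. */
int local_rank_index(int global_rank, int ranks_per_node)
{
    if (global_rank < 0 || ranks_per_node <= 0)
    {
        return -1;
    }
    return global_rank % ranks_per_node;
}

/* Hypothetical helper (not GROMACS code): GPU ID for the rank with the
 * given node-local index, read from a -gpu_id style string such as
 * "00000000" (assumed: one digit per rank on the node).
 * Returns -1 instead of reading out of bounds. */
int gpu_id_for_local_rank(const char *gpu_id_str, int local_rank)
{
    size_t n;
    char   c;

    if (gpu_id_str == NULL || local_rank < 0)
    {
        return -1;
    }
    n = strlen(gpu_id_str);
    /* The comparison must be >=: index n is already past the last digit. */
    if ((size_t)local_rank >= n)
    {
        return -1;
    }
    c = gpu_id_str[local_rank];
    if (c < '0' || c > '9')
    {
        return -1;
    }
    return c - '0';
}
```

With the string 00000000 from the report, a lookup with the node-local index of global rank 15 (which is 7) returns GPU 0, while a lookup with index 8 or higher is rejected rather than read from beyond the string.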