Bug #1385

segfault with multi-node runs and GPU sharing

Added by Szilárd Páll about 4 years ago. Updated about 4 years ago.

Status: Closed
Priority: High
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

When running on more than one physical node with GPUs, a segmentation fault occurs under certain circumstances (but quite often); e.g. on a Cray XK7 with two nodes and 8 ranks per node:

OMP_NUM_THREADS=2 aprun -cc none -n 16 -N 8 -d 2 mdrun_mpi -npme 0 -pin on -nb gpu -gpu_id 00000000
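For readers unfamiliar with the option: in the command above, -n 16 -N 8 places 8 ranks on each of the two nodes, and the -gpu_id string carries one digit per rank on a node, so 00000000 assigns all eight ranks of a node to GPU 0. The sketch below illustrates that mapping only. The function name parse_gpu_id_string and its return convention are hypothetical and are not taken from the mdrun sources.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not GROMACS code.
 * Interprets a "-gpu_id"-style digit string: character i is taken as the
 * GPU ID for the i-th rank on a node.
 * Returns the number of IDs parsed, or -1 if the string contains a
 * non-digit character or holds more than max_ids entries. */
int parse_gpu_id_string(const char *s, int *ids, int max_ids)
{
    int n = 0;

    if (s == NULL || ids == NULL)
    {
        return -1;
    }
    for (; s[n] != '\0'; n++)
    {
        if (n >= max_ids || !isdigit((unsigned char)s[n]))
        {
            return -1;
        }
        ids[n] = s[n] - '0';
    }
    return n;
}
```

With the string from the report, parse_gpu_id_string("00000000", ids, 8) fills eight entries, all equal to 0, which is the GPU sharing setup that triggers the crash.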

Associated revisions

Revision 904d4645 (diff)
Added by Szilárd Páll about 4 years ago

Fix DD load balancing bug with GPU sharing

The recent DD load balancing fix which solved the issue of incorrect
imbalance measure with GPU sharing (ba8232e9) addressed GPUs with
incorrect indexing. This caused out of bounds indexing in the GPU ID
query function. The query function also had a bug in the error checking
which allowed the incorrect indexing.
Now also mdrun -nb cpu -gpu_id ... is allowed, which before would give
a fatal error.

This commit addresses both issues; fixes #1385

Change-Id: I2800f610b873da92afe78bbfd869258f378ba2d7
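
As a rough illustration of the class of bug described in the commit message, the sketch below shows a GPU ID query with a complete range check. All names here (gpu_selection_t, get_gpu_device_id, node_local_index) are hypothetical and do not mirror the actual GROMACS structures. The idea that the bad index came from using a global rank where a node-local one was expected is an assumption made for the example, as is the block-wise placement of ranks on nodes.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-node list of GPUs in use. */
typedef struct
{
    int        n_dev_use; /* number of entries in dev_use */
    const int *dev_use;   /* GPU ID for each rank on this node */
} gpu_selection_t;

/* Returns the GPU ID stored at position idx, or -1 if idx is out of range.
 * Both the lower and the upper bound are checked before the array access. */
int get_gpu_device_id(const gpu_selection_t *sel, int idx)
{
    if (sel == NULL || sel->dev_use == NULL || idx < 0 || idx >= sel->n_dev_use)
    {
        return -1;
    }
    return sel->dev_use[idx];
}

/* Assumption: ranks are placed on nodes in consecutive blocks of
 * ranks_per_node, so the node-local index is the remainder. */
int node_local_index(int global_rank, int ranks_per_node)
{
    return global_rank % ranks_per_node;
}
```

Under these assumptions, with 16 ranks at 8 per node, a lookup with global rank 8 on an 8-entry list is rejected with -1 instead of reading past the end of the array, while node_local_index(8, 8) gives 0, which is a valid index.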

History

#1 Updated by Szilárd Páll about 4 years ago

  • Status changed from New to In Progress
  • Assignee changed from Berk Hess to Szilárd Páll

The bug is triggered by incorrect GPU ID indexing with Nranks > 2 and GPU sharing among ranks: http://redmine.gromacs.org/projects/gromacs/repository/revisions/ba8232e965652669cc0b558a273f81a4d9733d25/entry/src/mdlib/domdec.c#L5700

A fix will be uploaded shortly.

#2 Updated by Szilárd Páll about 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#3 Updated by Mark Abraham about 4 years ago

  • Status changed from Resolved to Closed
