Task #3077

Feature #2816: GPU offload / optimization for update&constraints, buffer ops and multi-gpu communication

Feature #2891: PME/PP GPU communications

PME/PP GPU Comms unique pointer deletion causes seg fault when CUDA calls exist in destructor

Added by Alan Gray 3 months ago. Updated about 1 month ago.

Status:
Feedback wanted
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
uncategorized

Description

When the unique pointers used for the PME-PP GPU communication objects are automatically deleted, the code sometimes seg-faults. I originally thought this happened only when CUDA calls exist in the destructor, but I have now also seen it happen with default destructors. I have reverted to regular pointers for now. This should be investigated further, and the unique pointers reinstated.


Related issues

Related to GROMACS - Feature #3115: Device stream manager (New)

Associated revisions

Revision c2e5f578
Added by Alan Gray about 1 month ago

Explicitly destroy PME-PP GPU communication object

Add code to destroy the object when it is no longer required. Even
though the object is managed by a unique pointer, this needs to be done
while the GPU context still exists; otherwise a seg fault can occur
when it is automatically destroyed later.

Addresses #3077

Change-Id: I9d6f798d79a73e2ce366c9fb85a0ff9339fc9f88
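
The ordering problem described in this commit message can be illustrated with a small, GPU-free sketch. Everything here is hypothetical: `MockGpuContext`, `MockPmePpComm` and `runTeardown` are stand-ins invented for illustration and are not GROMACS or CUDA names. The sketch only records whether the destructor ran while the mock context was still alive; in real code, releasing GPU resources after the context is gone is the step that can crash.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a GPU context; not a GROMACS or CUDA type.
struct MockGpuContext
{
    bool alive = true;
    void shutdown() { alive = false; }
};

// Hypothetical stand-in for the PME-PP GPU communication object.
class MockPmePpComm
{
public:
    MockPmePpComm(const MockGpuContext& context, std::vector<std::string>* log) :
        context_(context), log_(log)
    {
    }
    ~MockPmePpComm()
    {
        // A real destructor would release GPU resources here, which is
        // only valid while the context still exists.
        log_->push_back(context_.alive ? "released with live context"
                                       : "released after context teardown");
    }

private:
    const MockGpuContext&     context_;
    std::vector<std::string>* log_;
};

// Runs one teardown sequence and returns what the destructor observed.
std::vector<std::string> runTeardown(bool resetBeforeShutdown)
{
    std::vector<std::string> log;
    MockGpuContext           context;
    {
        auto comm = std::make_unique<MockPmePpComm>(context, &log);
        if (resetBeforeShutdown)
        {
            comm.reset(); // explicit destruction while the context is alive
        }
        context.shutdown();
        // If comm still owns the object, it is destroyed at the end of
        // this scope, after the context has already been shut down.
    }
    return log;
}
```

Calling `runTeardown(true)` models the explicit destruction added by this revision, while `runTeardown(false)` models relying on the unique pointer alone, in which case the destructor runs only after the mock context has been shut down.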

History

#1 Updated by Alan Gray 3 months ago

  • Description updated

#2 Updated by Szilárd Páll 3 months ago

Is this still an issue?

#3 Updated by Alan Gray 3 months ago

Yes, it's still an issue - I've not had time to properly investigate/fix it yet.

#4 Updated by Mark Abraham 3 months ago

I've not seen any issues with such patches

#5 Updated by Alan Gray about 1 month ago

  • Status changed from New to Closed

#6 Updated by Szilárd Páll about 1 month ago

  • Status changed from Closed to Feedback wanted

We do not have the same issue with gpuHaloExchange, I assume, only because we are not doing cudaStreamCreate there?

Also, while looking into this I realized that:
- c2e5f578 added the freeing quite early; I suggest moving it closer to the place where related freeing happens, in runner.cpp, around where gmx_pme_destroy() is called.
- we do not have a cudaStreamDestroy for pmePpCommStream_; I suggest adding the missing call to the destructor.

As noted on #3021, we need docs on these lifetime management concerns.
Side-note: we could side-step such issues if we had the code for #3115, as that would make the lifetime dependencies clearer.
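
As a rough illustration of the two points above (pairing stream creation with destruction, and making lifetime dependencies explicit), here is a GPU-free sketch. All names (`MockContext`, `MockStream`, `MockRunner`, `runLifetimeDemo`) are hypothetical and do not reflect the design of #3115 or any existing GROMACS class. The `MockStream` constructor and destructor mark where calls such as cudaStreamCreate and cudaStreamDestroy would go. The sketch relies on the C++ rule that class members are destroyed in reverse order of declaration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event log shared by the mock types below.
using EventLog = std::vector<std::string>;

// Mock context: stands in for the GPU context whose lifetime must
// enclose that of every stream created in it.
class MockContext
{
public:
    explicit MockContext(EventLog* log) : log_(log) { log_->push_back("context created"); }
    ~MockContext() { log_->push_back("context destroyed"); }
    MockContext(const MockContext&) = delete;
    MockContext& operator=(const MockContext&) = delete;
    EventLog* log() const { return log_; }

private:
    EventLog* log_;
};

// RAII wrapper: the constructor is where a stream would be created and
// the destructor is where it would be destroyed, so the two stay paired.
class MockStream
{
public:
    explicit MockStream(const MockContext& context) : log_(context.log())
    {
        log_->push_back("stream created");
    }
    ~MockStream() { log_->push_back("stream destroyed"); }
    MockStream(const MockStream&) = delete;
    MockStream& operator=(const MockStream&) = delete;

private:
    EventLog* log_;
};

// Hypothetical owner: members are destroyed in reverse declaration
// order, so the stream is always released before the context.
class MockRunner
{
public:
    explicit MockRunner(EventLog* log) : context_(log), commStream_(context_) {}

private:
    MockContext context_;    // declared first, destroyed last
    MockStream  commStream_; // declared second, destroyed first
};

// Creates and destroys a runner, returning the recorded event order.
EventLog runLifetimeDemo()
{
    EventLog log;
    {
        MockRunner runner(&log);
    }
    return log;
}
```

With a single owner holding both objects, the teardown order follows from the declaration order and no longer depends on where an explicit free is placed.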

#7 Updated by Szilárd Páll about 1 month ago
