Bug #2065

thread-MPI internal errors

Added by Szilárd Páll almost 4 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Affected version - extra info: 5.1.4, 2019-beta2
Affected version:
Difficulty: uncategorized

Description

During recent benchmark run on Power8 I came across the following two tMPI internal errors:

step   80: timed with pme grid 72 72 72, coulomb cutoff 0.900: 45.7 M-cycles
step  160: timed with pme grid 60 60 60, coulomb cutoff 1.017: 39.7 M-cycles
step  240: timed with pme grid 52 52 52, coulomb cutoff 1.174: 39.5 M-cycles
step  320: timed with pme grid 48 48 48, coulomb cutoff 1.272: 40.3 M-cycles
step  400: timed with pme grid 44 44 44, coulomb cutoff 1.387: 39.2 M-cycles
step  400: the domain decomposition limits the PME load balancing to a coulomb cut-off of 1.387
step  480: timed with pme grid 44 44 44, coulomb cutoff 1.387: 39.1 M-cycles
step  560: timed with pme grid 48 48 48, coulomb cutoff 1.272: 35.9 M-cycles
step  640: timed with pme grid 52 52 52, coulomb cutoff 1.174: 37.1 M-cycles
step  720: timed with pme grid 56 56 56, coulomb cutoff 1.090: 36.7 M-cycles
step  800: timed with pme grid 60 60 60, coulomb cutoff 1.017: 37.7 M-cycles
tMPI error: Receive buffer size too small for transmission (in valid comm)
step   80: timed with pme grid 144 144 72, coulomb cutoff 0.900: 109.3 M-cycles
step  160: timed with pme grid 120 120 60, coulomb cutoff 1.017: 92.5 M-cycles
step  240: timed with pme grid 108 108 56, coulomb cutoff 1.130: 85.5 M-cycles
step  320: timed with pme grid 100 100 52, coulomb cutoff 1.221: 81.8 M-cycles
step  400: timed with pme grid 84 84 42, coulomb cutoff 1.453: 79.9 M-cycles
step  480: timed with pme grid 72 72 36, coulomb cutoff 1.695: 81.0 M-cycles
step  560: timed with pme grid 64 64 32, coulomb cutoff 1.907: 88.3 M-cycles
step  640: timed with pme grid 56 56 28, coulomb cutoff 2.180: 81.5 M-cycles
tMPI error: Multicast operation mismatch (multicast not collective across comm) (in valid comm)

The build did not have debug flags, so the backtrace I could obtain from the core files is incomplete, but still informative; this is from the first crash (the receive buffer size error):

[Current thread is 1 (Thread 0x3bffdcfff100 (LWP 22980))]
(gdb) bt
#0  0x00003fff86c8f27c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00003fff86c918f4 in __GI_abort () at abort.c:89
#2  0x00000000108595ac in tmpi_errors_are_fatal_fn ()
#3  0x000000001085962c in tMPI_Error ()
#4  0x00000000105bbf7c in tMPI_Mult_recv ()
#5  0x0000000010859b2c in tMPI_Alltoall ()
#6  0x000000001064f478 in fft5d_execute ()
#7  0x00000000101420d0 in gmx_parallel_3dfft_execute ()
#8  0x000000001013d48c in gmx_pme_do(gmx_pme_t*, int, int, float (*) [3], float (*) [3], float*, float*, float*, float*, float*, float*, float (*) [3], t_commrec*, int, int, t_nrnb*, gmx_wallcycle*, float (*) [3], float, float (*) [3], float, float*, float*, float, float, float*, float*, int) [clone ._omp_fn.0] ()
#9  0x00003fff86e6e8a4 in GOMP_parallel () from /usr/lib/powerpc64le-linux-gnu/libgomp.so.1
#10 0x000000001013eea4 in gmx_pme_do(gmx_pme_t*, int, int, float (*) [3], float (*) [3], float*, float*, float*, float*, float*, float*, float (*) [3], t_commrec*, int, int, t_nrnb*, gmx_wallcycle*, float (*) [3], float, float (*) [3], float, float*, float*, float, float, float*, float*, int) ()
#11 0x000000001055927c in do_force_lowlevel ()
#12 0x0000000010590cb8 in do_force_cutsVERLET(_IO_FILE*, t_commrec*, t_inputrec*, long, t_nrnb*, gmx_wallcycle*, gmx_localtop_t*, gmx_groups_t*, float (*) [3], float (*) [3], history_t*, float (*) [3], float (*) [3], t_mdatoms*, gmx_enerdata_t*, t_fcdata*, float*, t_graph*, t_forcerec*, interaction_const_t*, gmx_vsite_t*, float*, double, _IO_FILE*, gmx_edsam*, int, int) ()
#13 0x00000000105942d0 in do_force ()
#14 0x00000000100b1d28 in do_md ()
#15 0x00000000100c5e10 in mdrunner ()
#16 0x00000000100c66d8 in mdrunner_start_fn(void*) ()
#17 0x00000000105c240c in tMPI_Thread_starter ()
#18 0x00000000105b5360 in tMPI_Thread_starter ()
#19 0x00003fff872184a0 in start_thread (arg=0x3bffdcfff100) at pthread_create.c:335
#20 0x00003fff86d77e74 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:96

In total there have been three occurrences of the crash, some with v5.1 and others with v2016, all during the reverse-scanning phase of the PP-PME load balancing.

History

#1 Updated by Mark Abraham almost 4 years ago

  • Private changed from Yes to No

#2 Updated by Mark Abraham almost 4 years ago

Both of those errors can be raised from tMPI_Mult_recv, but the cause could be anywhere. I hope ThreadSanitizer works on POWER8 (but not holding my breath).

#3 Updated by Daniel Black over 3 years ago

Mark Abraham wrote:

I hope ThreadSanitizer works on POWER8 (but not holding my breath).

surprise!

$ clang -fsanitize=thread -g -O1 tiny_race.c
$ ./a.out
==================
WARNING: ThreadSanitizer: data race (pid=103724)
  Write of size 4 at 0x000010f30570 by main thread:
    #0 main tiny_race.c:11 (a.out+0x0000100d3e9c)

  Previous write of size 4 at 0x000010f30570 by thread T1:
    #0 Thread1 tiny_race.c:5 (a.out+0x0000100d3df4)

  Location is global '<null>' at 0x000000000000 (a.out+0x000010f30570)

  Thread T1 (tid=103726, finished) created by main thread at:
    #0 pthread_create <null> (a.out+0x000010025d80)
    #1 main tiny_race.c:10 (a.out+0x0000100d3e88)

SUMMARY: ThreadSanitizer: data race tiny_race.c:11 in main
==================
ThreadSanitizer: reported 1 warnings

$ clang --version
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
Target: powerpc64le-unknown-linux-gnu
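
For reference, tiny_race.c above is the standard ThreadSanitizer data-race demo from the LLVM/compiler-rt documentation; a minimal sketch of it follows (the exact source line numbers in the report above may correspond to a slightly different revision of the file):

/* tiny_race.c -- minimal ThreadSanitizer data-race demo (sketch; not
   necessarily the exact file used above, so line numbers may differ). */
#include <pthread.h>

int Global;

void *Thread1(void *x) {
  Global = 42;   /* unsynchronized write from the child thread */
  return x;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, Thread1, NULL);
  Global = 43;   /* races with the write in Thread1 */
  pthread_join(t, NULL);
  return Global;
}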

#4 Updated by Mark Abraham over 3 years ago

Nice! Good to know, thanks!

#5 Updated by Szilárd Páll over 3 years ago

Good to know. Unfortunately, the machine I currently have access to has an ancient SW stack, so I'll have to compile everything by hand. Daniel, would you be able to test a GROMACS TSAN build?

On a side note: I'm not sure that running with TSAN will still trigger the issue, given that it will alter the load balancing during which the errors were triggered.
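
For reference, a TSAN configuration of GROMACS could look roughly like the following; this is only a sketch and assumes that the source tree provides the TSAN CMake build type and that a ThreadSanitizer-capable clang is on the path:

$ # sketch: assumes the TSAN build type is available in this tree's CMake setup
$ cmake .. -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=TSAN
$ make -j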

#6 Updated by Szilárd Páll almost 2 years ago

  • Affected version - extra info changed from 5.1.4 to 5.1.4, 2019-beta2

Just reproduced this again on Power8 with 2019 beta2:

GROMACS version:    2019-beta3-dev-20181108-d536de3
GIT SHA1 hash:      d536de3b5125b79d4222768e356c4914e0758d5a
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        disabled
SIMD instructions:  IBM_VSX
FFT library:        fftw-3.3.8
RDTSCP usage:       disabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.8
Tracing support:    disabled
C compiler:         /home/pszilard/programs/gcc/7.3/bin/gcc GNU 7.3.0
C compiler flags:   -mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move  -mvsx    -Werror=format-overflow -Wundef -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
C++ compiler:       /home/pszilard/programs/gcc/7.3/bin/g++ GNU 7.3.0
C++ compiler flags: -mcpu=power8 -mpower8-vector -mpower8-fusion -mdirect-move  -mvsx    -std=c++11  -Wformat-overflow -Wundef -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-declarations -Wall  -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds

I'll try to get a sanitizer build set up, but probably not in the coming days.
