Bug #257

-debug 2 segfaulted in p_graph

Added by Mark Abraham about 11 years ago. Updated almost 11 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

While testing some other stuff on BlueGene, I had reason to want to use -debug 2 to try to dump forces to the debug log files. -debug 1 had worked fine, but -debug 2 segfaulted inside p_graph. Sorry I can't be more specific than that. Commenting out the call to p_graph around lines 311-312 of src/gmxlib/mshift.c allowed mdrun to continue with no further problems.

1oei_00001.tpr (1.84 MB) - tpr that reproduces bug with mdrun -debug 2 - Mark Abraham, 11/14/2008 02:59 AM

History

#1 Updated by Berk Hess about 11 years ago

I cannot reproduce this bug.
Can you attach a tpr file and provide a command line?

Berk

#2 Updated by Mark Abraham about 11 years ago

Created an attachment (id=327)
tpr that reproduces bug with mdrun -debug 2

It's a 312-atom CHARMM peptide in about 50K atoms of CHARMM TIP3P water. The bug arises in system setup, so system size will not make debugging expensive. I got the bug using either 2 or 32 processors, but not with 1. The attached .tpr is for 2 processors.

#3 Updated by Mark Abraham about 11 years ago

The (text) core file stack trace is

p_graph
../../../src/gmxlib/mshift.c:156
mk_graph_ilist
../../../src/gmxlib/mshift.c:312
dd_bonded_cg_distance
../../../src/mdlib/domdec_top.c:1907
init_domain_decomposition
../../../src/mdlib/domdec.c:5859
mdrunner
../../../src/kernel/md.c:245
main
../../../src/kernel/mdrun.c:488
_start_blrts
../sysdeps/blrts/start.c:107

so the statement causing the segfault is on line 156 of mshift.c

Command line was

mpirun -label -exp_env NOASSEMBLYLOOPS -exp_env LOG_BUFS -cwd /hpc/bluefern/mja163/electrostatics/zahn/trioctapeptide/gromacs/large/low_temperature/simulation_c_4.0.1_debug_debug -connect TORUS -mode VN -np 2 /hpc/home/mja163/progs/bin/mdrun_mpi_c_fftw3_cvs4 -np 2 -deffnm 1oei_00001 -g 1oei_00001_ -debug 2

It's an up-to-date CVS GROMACS4 version.

#4 Updated by Berk Hess about 11 years ago

I cannot reproduce this crash on x86_64 with gcc.
I have run valgrind and it only complains about
accessing uninitialized memory on that line,
which should not lead to a crash.
I have looked through the code and I think that
even the uninitialized memory warnings are a bug
in valgrind, which often does not detect that memory
has been initialized by memset.
My p_graph mdrun0.log output also looks fine.

Could you try to locate the problem yourself,
simply by taking out arguments from the fprintf
until it does not crash?
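
For example (an illustrative fragment only, not tested here; it uses the same g-> field names and XX/YY/ZZ indices as the existing fprintf in mshift.c, printing one candidate argument per call so the crashing dereference stands out):

/* Hedged sketch: split the combined fprintf into one print per field. */
fprintf(log, "node  %d\n", g->start+i+1);
fprintf(log, "shift %d %d %d\n",
        g->ishift[i][XX], g->ishift[i][YY], g->ishift[i][ZZ]);
fprintf(log, "negc  %d\n", g->negc);
fprintf(log, "egc   %d\n", (g->negc > 0) ? g->egc[i] : -1);
fprintf(log, "nedge %d\n", g->nedge[i]);
fflush(log);  /* flush each time so the last line before a crash is visible */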

Berk

#5 Updated by Mark Abraham almost 11 years ago

OK, I've isolated the symptom. If I edit p_graph in mshift.c to include

fprintf(log,"graph:  %s\n",title);
fprintf(log,"nnodes: %d\n",g->nnodes);
fprintf(log,"nbound: %d\n",g->nbound);
fprintf(log,"start: %d\n",g->start);
fprintf(log,"end: %d\n",g->end);
fprintf(log," atom shiftx shifty shiftz C nedg e1 e2 etc.\n");
for(i=0; (i<g->nnodes); i++)
if (g->nedge[i] > 0) {
fprintf(log,"%5d%7d%7d%7d %7d%7d_%1s%5d",
g->start+i+1,
g->ishift[i][XX],g->ishift[i][YY],
g->ishift[i][ZZ],
g->negc,
(g->negc > 0) ? g->egc[i]: 11111,
// (g->negc > 0) ? cc[g->egc[i]] : " ",
" ",
g->nedge[i]);
for(j=0; (j<g->nedge[i]); j++)
fprintf(log," %5u",g->edge[i][j]+1);
fprintf(log,"\n");
}
fflush(log);

So when g->negc is non-zero, it should hold the length of the g->egc array, and the entries of g->egc should be valid indices.

I get as output in mdrun0.log:

graph: graph
nnodes: 312
nbound: 312
start: 0
end: 311
atom shiftx shifty shiftz C nedg e1 e2 etc.
1 0 0 0 268393520-2147418076_ 4 2 5 4 3
2 0 0 0 268393520-2084503528_ 1 1
3 0 0 0 268393520-2082406372_ 1 1
4 0 0 0 268393520_2080900006 1 1
5 0 0 0 268393520_941686816 3 7 6 1
6 0 0 0 268393520_1317011488 1 5
7 0 0 0 268393520-2147418076_ 3 9 8 5
8 0 0 0 268393520-2082406372_ 1 7
9 0 0 0 268393520_2080900006 4 22 11 10 7
10 0 0 0 268393520-2084503528_ 1 9
11 0 0 0 268393520_941686816 4 16 13 12 9
12 0 0 0 268393520_1275065336 1 11
13 0 0 0 268393520_2080899750 1 11
14 0 0 0 268393520-1809711200_ 3 17 16 15
15 0 0 0 268393520_2105540646 1 14
16 0 0 0 268393520-1820262320_ 3 20 14 11
17 0 0 0 268393520-1847525348_ 3 19 18 14
18 0 0 0 268393520_2088508280 1 17
19 0 0 0 268393520-1878982556_ 2 20 17
20 0 0 0 268393520_939589599 3 21 19 16
21 0 0 0 268393520_2139357248 1 20
22 0 0 0 268393520-1845428192_ 3 24 23 9
23 0 0 0 268393520-1843331036_ 1 22
24 0 0 0 268393520-1841233880_ 3 26 25 22
25 0 0 0 268393520-1839136724_ 1 24
26 0 0 0 268393520-1837039568_ 4 29 28 27 24
27 0 0 0 268393520-1834942412_ 1 26
(etc. for the full 312 lines of graph)

Clearly g->negc and its array are not being properly initialized.
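
That would also explain the original crash: the line I commented out, (g->negc > 0) ? cc[g->egc[i]] : " ", uses that garbage value as an index into cc[]. A tiny self-contained sketch of the mechanism (cc here is a hypothetical stand-in; the real table in mshift.c may be typed differently):

#include <stdio.h>

int main(void)
{
    /* Hypothetical stand-in for the cc[] table used in p_graph. */
    static const char *cc[] = { "A", "B", "C" };
    int negc  = 268393520;    /* garbage, as in the log above         */
    int egc_i = -2147418076;  /* garbage "index", as in the log above */

    if (negc > 0) {
        /* Same pattern as the commented-out line: a wild index into
         * cc[] reads far outside the array and typically segfaults. */
        fprintf(stderr, "%1s\n", cc[egc_i]);
    }
    return 0;
}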

I'm afraid I haven't a clue about the graph code, so I'm hesitant to spend more time on it at this stage. However, if the above doesn't help you, I'll take a closer look.

#6 Updated by Berk Hess almost 11 years ago

Ah, stupid me, I had looked at the code several times,
but did not notice that negc was initialized just
after, instead of before, the p_graph call.
I committed a fix.
Could you check that it works properly now?
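
For concreteness, a minimal sketch of that kind of ordering problem and its fix, using a simplified stand-in struct rather than the real t_graph/mk_graph_ilist code (so not the committed change itself):

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for the negc/egc part of t_graph. */
typedef struct {
    int  nnodes;
    int  negc;   /* how many meaningful entries egc[] holds */
    int *egc;
} mini_graph;

static void p_mini(FILE *log, const mini_graph *g)
{
    int i;

    for (i = 0; i < g->nnodes; i++) {
        /* Only safe once negc/egc have been initialized. */
        fprintf(log, "%5d %7d\n", i+1, (g->negc > 0) ? g->egc[i] : -1);
    }
}

int main(void)
{
    mini_graph g;

    g.nnodes = 4;

    /* Buggy order: a debug print here, before negc/egc are set, reads
     * whatever garbage happens to be in the struct, as in the log above.
     * (Left commented out because it is undefined behaviour.) */
    /* p_mini(stderr, &g); */

    /* Fixed order: initialize first, then do the debug print. */
    g.negc = g.nnodes;
    g.egc  = calloc(g.nnodes, sizeof(*g.egc));
    p_mini(stderr, &g);

    free(g.egc);

    return 0;
}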

Berk

#7 Updated by Berk Hess almost 11 years ago

I assume my change fixed this problem.

Berk

#8 Updated by Mark Abraham almost 11 years ago

Yes, seems to work now.
