
Bug #350

Intermittent bug with pme_order == 6

Added by Mark Abraham almost 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Created an attachment (id=392)
Had trouble uploading a file directly while away from my normal facilities, hopefully the URL works OK.

Hi,

I was comparing normal mdruns of freshly compiled, unmodified 4.0.5 on 64 processors of the XE machine from bug #346, over a range of reciprocal space parameters, and found intermittent problems with pme_order = 6. All runs had identical inputs to grompp apart from the reciprocal space parameters. With pme_order = 4 and fourier_n[xyz] around 45, I had 20/20 successful runs. With pme_order = 6 (notation below is pme_order/fourier_nx/ny/nz), 6/30/32/32 and 6/32/32/32 succeeded, while 6/30/30/24 and 6/30/32/30 both failed. All four of the pme_order == 6 runs used the same processor set in the same queue job, so I am confident it is not a hardware issue.

Later I watched 6/30/30/24 under TotalView and saw either segfaults in pme_calc_pidx or fatal range-searching errors, but these occurred unpredictably after several hundred time steps, i.e. they were not at all reproducible. On 6/30/30/24 I also tried constraining -npme 16 and -npme 19, and both failed (with no constraint, mdrun chose 16 PME processors). I didn't try mdrun -reprod. A single 6/30/30/24 run on 8 processors succeeded. I haven't tried my input files on another machine.

Attached are some representative .tpr/.log/.mdp combinations.
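For reference, the reciprocal-space block that varies between those .mdp files looks like this for the 6/30/30/24 case (a minimal sketch of only the varied parameters; everything else is assumed unchanged from the attached files):

pme_order   = 6
fourier_nx  = 30
fourier_ny  = 30
fourier_nz  = 24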

Sample grompp line:
grompp_405 -f 1oei_timing_06_30_30_24_temp.mdp -e ../../../../../1oei_timing -c ../../../../../1oei_timing -t ../../../../../1oei_timing -p ../../../../../1oei_timing -o 1oei_timing_06_30_30_24 -po 1oei_timing_06_30_30_24_po -pp 1oei_timing_06_30_30_24_pp

Sample mdrun line:
mpirun mdrun_mpi_405 -npme 16 -dlb yes -deffnm 1oei_timing_06_30_30_24

(mdrun without "-dlb yes" also crashed; dynamic load balancing turned itself on in those runs)

Mark

Attachment: (51 Bytes), added by Mark Abraham, 10/02/2009 01:50 AM

History

#1 Updated by Berk Hess almost 10 years ago

I have not been able to reproduce this (yet) on my workstation using 64 nodes. Crashes that depend on dlb can be difficult to track down. Could you try running with the environment variable GMX_DLB_FLOP=4? This balances the load based on flop counts, so the runs will be reproducible (as long as you do not have FFTW timings turned on). But I might need to add some scaling to the flop balancing: the real load imbalance is usually about twice as high as the flop imbalance, due mainly to the fast water loops.
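
For reference, a minimal sketch of one way to do that, assuming a bash-like shell and OpenMPI (whose -x option forwards an environment variable to all ranks); the mdrun arguments are simply the ones from the description:

mpirun -x GMX_DLB_FLOP=4 mdrun_mpi_405 -npme 16 -dlb yes -deffnm 1oei_timing_06_30_30_24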

Berk

#2 Updated by Berk Hess almost 10 years ago

Another thing: looking at the code, I get the impression that this problem is probably not triggered by pme_order itself, but by the ratio of PME grid points per DD cell. Could you run with exactly the same settings (same PME grid, -npme 16), but with pme_order=4 instead of 6, and see what happens?

Thanks,

Berk

#3 Updated by Mark Abraham almost 10 years ago

For direct comparison of pme_order = 4 vs 6, I ran the following set:
-rw-r--r-- 1 mxa224 x04  31325 Oct 2 22:58 1oei_timing_04_30_30_24.log
-rw-r--r-- 1 mxa224 x04 123973 Oct 2 23:00 1oei_timing_04_30_32_30.log
-rw-r--r-- 1 mxa224 x04 123973 Oct 2 23:07 1oei_timing_04_30_32_32.log
-rw-r--r-- 1 mxa224 x04 123969 Oct 2 23:09 1oei_timing_04_32_32_32.log
-rw-r--r-- 1 mxa224 x04 124187 Oct 2 22:57 1oei_timing_04_45_45_45.log
-rw-r--r-- 1 mxa224 x04  22181 Oct 2 22:49 1oei_timing_06_30_30_24.log
-rw-r--r-- 1 mxa224 x04  13733 Oct 2 22:49 1oei_timing_06_30_32_30.log
-rw-r--r-- 1 mxa224 x04 124189 Oct 2 22:51 1oei_timing_06_30_32_32.log
-rw-r--r-- 1 mxa224 x04 124186 Oct 2 22:56 1oei_timing_06_32_32_32.log
-rw-r--r-- 1 mxa224 x04 124190 Oct 2 23:11 1oei_timing_06_45_45_45.log
There was a segfault in gmx_pme_do for all three of the "short" runs.

I used "mpirun mdrun_mpi_405 -npme 16 -dlb yes -deffnm 1oei_timing_06_30_30_24" with no special environment variables set.

Next I used the same set with GMX_DLB_FLOP=4 and turned off optimize_fft. I got

-rw-r--r-- 1 mxa224 x04 65304 Oct 2 23:58 1oei_timing_06_30_30_24.log
-rw-r--r-- 1 mxa224 x04 15771 Oct 2 23:58 1oei_timing_06_30_32_30.log

both crashing rapidly, and 1oei_timing_06_30_32_32 running horribly slowly:

R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:            Nodes     Number     G-Cycles    Seconds      %
-----------------------------------------------------------------------
 Domain decomp.           48        800      196.947       65.9     0.1
 Send X to PME            48       8001       29.290        9.8     0.0
 Comm. coord.             48       8001     1766.731      591.1     1.2
 Neighbor search          48        801      150.552       50.4     0.1
 Force                    48       8001     1571.020      525.6     1.1
 Wait + Comm. F           48       8001     3232.788     1081.6     2.3
 PME mesh                 16       8001     4131.099     1382.2     2.9
 Wait + Comm. X/F         16       8001    31686.328    10601.5    22.1
 Wait + Recv. PME F       48       8001     5938.833     1987.0     4.1
 Write traj.              48       8001     2587.573      865.7     1.8
 Update                   48       8001       75.736       25.3     0.1
 Constraints              48       8001    15353.668     5137.0    10.7
 Comm. energies           48       8001    76379.267    25554.7    53.3
 Rest                     48                 173.715       58.1     0.1
-----------------------------------------------------------------------
 Total                    64              143273.545    47936.0   100.0
-----------------------------------------------------------------------

I've just thought of a plausible explanation - this machine just had openmpi/1.3.3 installed and I switched to try it out in the last few days. Previously 1.2.8 was giving no apparent problems. I'll test that theory, and also on two other machines I have, but not before next week.

Thanks for the rapid attention - hopefully I've tumbled to the problem!

Mark

#4 Updated by Mark Abraham almost 10 years ago

I've now also tried a different machine; all of the runs below used GMX_DLB_FLOP=4.

On the machine for which I previously reported problems, now using openmpi/1.3.3

-rw-r--r-- 1 mxa224 x04  24101 Oct 5 16:05 1oei_timing_04_30_30_24.log
-rw-r--r-- 1 mxa224 x04  44972 Oct 5 16:06 1oei_timing_04_30_32_30.log
-rw-r--r-- 1 mxa224 x04 124005 Oct 5 16:08 1oei_timing_04_30_32_32.log
-rw-r--r-- 1 mxa224 x04 124017 Oct 5 16:11 1oei_timing_04_32_32_32.log
-rw-r--r-- 1 mxa224 x04 124017 Oct 5 16:05 1oei_timing_04_45_45_45.log
-rw-r--r-- 1 mxa224 x04  93868 Oct 5 15:57 1oei_timing_06_30_30_24.log
-rw-r--r-- 1 mxa224 x04  18897 Oct 5 15:57 1oei_timing_06_30_32_30.log
-rw-r--r-- 1 mxa224 x04 124229 Oct 5 16:00 1oei_timing_06_30_32_32.log
-rw-r--r-- 1 mxa224 x04 124222 Oct 5 16:02 1oei_timing_06_32_32_32.log
-rw-r--r-- 1 mxa224 x04 124231 Oct 5 16:13 1oei_timing_06_45_45_45.log

while with openmpi/1.2.8

-rw-r--r-- 1 mxa224 x04  63873 Oct 5 16:26 1oei_timing_04_30_30_24.log
-rw-r--r-- 1 mxa224 x04 124020 Oct 5 16:28 1oei_timing_04_30_32_30.log
-rw-r--r-- 1 mxa224 x04 124005 Oct 5 16:30 1oei_timing_04_30_32_32.log
-rw-r--r-- 1 mxa224 x04 124010 Oct 5 16:33 1oei_timing_04_32_32_32.log
-rw-r--r-- 1 mxa224 x04 124019 Oct 5 16:24 1oei_timing_04_45_45_45.log
-rw-r--r-- 1 mxa224 x04  16290 Oct 5 16:17 1oei_timing_06_30_30_24.log
-rw-r--r-- 1 mxa224 x04  24760 Oct 5 16:17 1oei_timing_06_30_32_30.log
-rw-r--r-- 1 mxa224 x04 124224 Oct 5 16:20 1oei_timing_06_30_32_32.log
-rw-r--r-- 1 mxa224 x04 124227 Oct 5 16:22 1oei_timing_06_32_32_32.log
-rw-r--r-- 1 mxa224 x04 124229 Oct 5 16:36 1oei_timing_06_45_45_45.log

On a different, brand-new Intel machine (http://nf.nci.org.au/facilities/vayu/hardware.php) with freshly compiled 4.0.5 (coincidentally also using openmpi/1.3.3):

-rw-r--r-- 1 mxa224 x04  37917 Oct 6 17:39 1oei_timing_04_30_30_24.log
-rw-r--r-- 1 mxa224 x04  37922 Oct 6 17:40 1oei_timing_04_30_32_30.log
-rw-r--r-- 1 mxa224 x04  82255 Oct 6 17:41 1oei_timing_04_30_32_32.log
-rw-r--r-- 1 mxa224 x04 124054 Oct 6 17:39 1oei_timing_04_45_45_45.log
-rw-r--r-- 1 mxa224 x04  11777 Oct 6 17:29 1oei_timing_06_30_30_24.log
-rw-r--r-- 1 mxa224 x04 124273 Oct 6 17:31 1oei_timing_06_30_32_30.log
-rw-r--r-- 1 mxa224 x04 124275 Oct 6 17:34 1oei_timing_06_30_32_32.log
-rw-r--r-- 1 mxa224 x04 124267 Oct 6 17:36 1oei_timing_06_32_32_32.log

So I have a reproducible but non-deterministic problem that appears only for some reciprocal-space parameters under some conditions. Any ideas? I will test on BlueGene/L shortly.

#5 Updated by Berk Hess almost 10 years ago

Unfortunately I have not been able to reproduce this on my workstation. The lazy solution would be to try valgrind. This will make the runs much slower, but if they crash quickly it might tell us where the problem is. For valgrind you should compile with -g -O0 and use:

mpirun -np ... valgrind mdrun ...
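
For reference, a minimal sketch of what that could look like here, assuming the 4.0.5 autotools build and the 64-process jobs from this report (the configure flags and process count are assumptions based on the report, not prescriptions):

# rebuild with debug info and no optimisation
CFLAGS="-g -O0" ./configure --enable-mpi
make

# run every rank under valgrind; a crash should then point at the faulting source line
mpirun -np 64 valgrind mdrun_mpi_405 -npme 16 -dlb yes -deffnm 1oei_timing_06_30_30_24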

Berk

#6 Updated by Mark Abraham almost 10 years ago

I was compiling the above with intel-cc/10.1.018 on the first machine (xe) and intel-cc/11.1.046 on vayu.

I ran some vayu -O0 tests with fftpack overnight and they all ran correctly, so it looks like a compiler issue. I'll ramp up the optimization, try gcc, see what I find, and maybe try valgrind.
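
For reference, such a test build might be configured roughly as follows (a sketch only; it assumes the 4.0-era autotools build, where --with-fft selects the FFT library and fftpack is the built-in fallback):

CC=icc CFLAGS="-O0 -g" ./configure --enable-mpi --with-fft=fftpack
make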

Thanks for the input!

#7 Updated by Mark Abraham almost 10 years ago

I ran lots of variations today and found that if I added all the right MPI libraries to the link, and refrained from using the machines' installed versions of FFTW 3.2, then things were fine. Things were also fine on BlueGene. I've asked the admins what (if anything) they did differently between the installations of FFTW 3.1.2 and 3.2, since the former has worked fine so far.
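
For reference, one way to pin such a build to a specific FFTW installation and keep the MPI libraries on the link line is sketched below; the prefix path is hypothetical and the flags assume the 4.0.5 autotools build:

# hypothetical install prefix for the older FFTW; adjust to the actual module path
FFTW_HOME=/apps/fftw/3.1.2
# an MPI compiler wrapper (mpicc) is one way to get the right MPI libraries linked in
CC=mpicc CPPFLAGS="-I$FFTW_HOME/include" LDFLAGS="-L$FFTW_HOME/lib" \
    ./configure --enable-mpi --with-fft=fftw3
make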

I didn't break out valgrind at all.

At this stage there's no evidence of a GROMACS problem - thanks for the "works for me" response, that was useful :-)
