## Bug #551

### npme defaults on large processor sets

**Description**

Created an attachment (id=535)

.log files from simulations under discussion

Hi,

I was running 100K steps of MD on a 190K atom AMBER03 protein-complex system on 4.5.1 recently, and noticed a disturbing timing result on 256 CPUs with npme=-1. Here follows the output of

`grep -h 'Mnbf\|Perfor\|^Initializing D\|^PME dom\|^Domain decomposition grid' sim_256/*log`

Initializing Domain Decomposition on 256 nodes

Domain decomposition grid 6 x 5 x 5, separate PME nodes 106

PME domain decomposition: 106 x 1 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 2435.287 237.794 10.917 2.198

Initializing Domain Decomposition on 256 nodes

Domain decomposition grid 8 x 6 x 3, separate PME nodes 112

PME domain decomposition: 8 x 14 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 4953.415 483.932 22.199 1.081

Initializing Domain Decomposition on 256 nodes

Domain decomposition grid 8 x 4 x 5, separate PME nodes 96

PME domain decomposition: 8 x 12 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 4025.569 393.135 18.032 1.331

The first chunk is from a .log file with npme=-1 (the default), and the others are with npme set to nearby multiples of 8 (since this machine has 8 cores per Infiniband connection). No other parameters changed between runs.

The performance of the default simulation was around half of what could be achieved with a more plausible guess. It seems to me that 106 is far too prime to be right on any practical system, despite giving a nicely balanced DD. Any or all of npme/pp ratios of 112/144, 108/148, 104/152, 100/156 and 96/160 seem likely to be better on real-world machines - and the two extremal ratios were much better. Note that the same went for a 128-processor simulation. npme=-1 chose 53 PME processors (prime!); here, the performance relative to 48 or 56 was not so bad (11 ns/day vs 13-14).
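To make the "too prime" point concrete, here is a minimal sketch (illustrative only, not GROMACS code) that computes the greatest common factor of an npme value and the total node count. The helper name `gcd` is an assumption for illustration; the node counts and npme values are the ones quoted above.

```c
/* Greatest common divisor by Euclid's algorithm.
 * Illustrative helper only; not taken from the GROMACS source. */
int gcd(int a, int b)
{
    while (b != 0)
    {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```

With this helper, `gcd(106, 256)` is 2, while `gcd(112, 256)` is 16 and `gcd(96, 256)` is 32. For the 128-processor case, `gcd(53, 128)` is 1. Of the three 256-node choices, only 112 and 96 are also multiples of 8, the number of cores per Infiniband connection.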

I realise that tools like g_tune_pme are supposed to help with these choices, but I feel that the default shouldn't be "wrong" by a factor of two... Is the algorithm that chooses npme giving too much weight to the PME load estimate, and not enough to ensuring large common factors for the leading decomposition dimensions? Does the algorithm need to vary with ncpus?

Might it be worthwhile adding a run-time (or compile-time?) option that will bias npme=-1 into choosing a multiple of the given number? Obviously, it could only take effect if ncpus is also such a multiple, and shouldn't stray too far from the sound PME load split at low cpus. There are certainly machine types where such a bias is clearly correct, but at the moment we require the user to know about it. Particularly at large supercomputing facilities, allowing the admin in charge of installation to guide the GROMACS default choice might make for much better throughput in the hands of a new user.

[I know that at ANU's supercomputer facility, the guy in charge of MD programs did some quick comparisons on large systems back in GROMACS 4.0 days and concluded that GROMACS scaling was hopeless on large problems. The 5DFFT should improve things outright, but I can understand his conclusion if based on such flawed default npme choices!]

By the way, the overall scaling was fairly pleasing - after merely guessing some sensible npme, I observed:

Initializing Domain Decomposition on 16 nodes

Domain decomposition grid 3 x 3 x 1, separate PME nodes 7

PME domain decomposition: 7 x 1 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 483.252 47.405 2.176 11.030

Initializing Domain Decomposition on 32 nodes

Domain decomposition grid 3 x 3 x 2, separate PME nodes 14

PME domain decomposition: 14 x 1 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 932.363 91.461 4.196 5.719

Initializing Domain Decomposition on 64 nodes

Domain decomposition grid 5 x 4 x 2, separate PME nodes 24

PME domain decomposition: 24 x 1 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 1773.699 173.881 7.984 3.006

Initializing Domain Decomposition on 96 nodes

Domain decomposition grid 7 x 4 x 2, separate PME nodes 40

PME domain decomposition: 40 x 1 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 2615.679 256.510 11.776 2.038

Initializing Domain Decomposition on 128 nodes

Domain decomposition grid 8 x 2 x 5, separate PME nodes 48

PME domain decomposition: 8 x 6 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 3112.842 305.174 14.012 1.713

Initializing Domain Decomposition on 256 nodes

Domain decomposition grid 8 x 6 x 3, separate PME nodes 112

PME domain decomposition: 8 x 14 x 1

(Mnbf/s) (GFlops) (ns/day) (hour/ns)

Performance: 4953.415 483.932 22.199 1.081

Doubtless, I could do better. If they're of interest, the .log files for these and similar runs are in the attached tarball.
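For reference, the scaling in these runs can be expressed as parallel efficiency relative to the 16-node run. The sketch below is plain arithmetic on the ns/day figures quoted above; the function name `parallel_efficiency` is hypothetical.

```c
/* Parallel efficiency of a run on n nodes relative to a reference run
 * on n_ref nodes: (speedup in ns/day) / (increase in node count).
 * Hypothetical helper for illustration only. */
double parallel_efficiency(int n_ref, double perf_ref, int n, double perf)
{
    double speedup    = perf / perf_ref;
    double node_ratio = (double)n / (double)n_ref;
    return speedup / node_ratio;
}
```

Using the figures above, `parallel_efficiency(16, 2.176, 128, 14.012)` comes out at about 0.80 and `parallel_efficiency(16, 2.176, 256, 22.199)` at about 0.64.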

### History

#### #1 Updated by Mark Abraham over 9 years ago

Rephrasing, the following test in domdec.c is far too likely to fail at high ncpus under the current default method for choosing npme, because comm->npmenodes is sometimes too prime w.r.t. dd->nc[] elements.

    if (dd->ndim >= 2 && dd->dim[0] == XX && dd->dim[1] == YY &&
        comm->npmenodes > dd->nc[XX] && comm->npmenodes % dd->nc[XX] == 0 &&
        getenv("GMX_PMEONEDD") == NULL)
    {
        comm->npmedecompdim = 2;
        comm->npmenodes_x   = dd->nc[XX];
        comm->npmenodes_y   = comm->npmenodes/comm->npmenodes_x;
    }
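As a sketch of what this test decides, the function below reproduces only its arithmetic part. It assumes that the DD grid has at least two dimensions with X and Y leading and that GMX_PMEONEDD is unset. The name `pme_grid_from_npme` and its signature are hypothetical and do not exist in domdec.c; the 1D fallback is inferred from the log output in the description.

```c
/* Hypothetical helper: split npme PME nodes into an x-by-y PME grid
 * using the same divisibility rule as the test quoted above.
 * nc_x is the number of DD cells along X (dd->nc[XX] in the source).
 * Returns the number of PME decomposition dimensions, 1 or 2. */
int pme_grid_from_npme(int npme, int nc_x, int *npme_x, int *npme_y)
{
    if (npme > nc_x && npme % nc_x == 0)
    {
        /* 2D PME decomposition: x matches the leading DD dimension. */
        *npme_x = nc_x;
        *npme_y = npme / nc_x;
        return 2;
    }
    /* Assumed fallback: all PME nodes along x, as seen in the logs. */
    *npme_x = npme;
    *npme_y = 1;
    return 1;
}
```

For the three 256-node runs this gives 106 x 1 (since 106 % 6 is 4), 8 x 14 and 8 x 12, which matches the "PME domain decomposition" lines in the logs.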

Either

A) optimize_ncells needs to consider a few values for npme, rather than simply the first one that satisfied fits_pp_pme_perf in guess_npme, or

B) guess_npme needs to first look for npme values that divide ncpus, then for values of npme that have a usefully large common factor with ncpus, and only **then** any value.

With B), in my case, I suspect that 8 is the largest such useful divisor. One could factorize ncpus, and try factors downwards from ncpus/3 or something.

Note that in my case, ncpus/3 = 85 and that this is too restrictive to find the optimum when my system has a 0.40 PME load. I could refactor the load, of course. However, it looks to me like we're hard-coding the optimizer to work when the load is in the desirable 0.25-0.33 range, and to fail silently above that. grompp reported a 0.40 load, but didn't write a note about it - perhaps it should. Perhaps the maximum for these scans for npme values should be min(nnodes/2, nnodes * ratio * 1.1) or so.
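A small sketch of that suggested upper bound, assuming truncation to an integer and a hypothetical function name `npme_scan_max` (neither is taken from the GROMACS source):

```c
/* Suggested upper limit for the npme scan:
 * min(nnodes/2, nnodes * ratio * 1.1), truncated to an integer.
 * pme_ratio is the estimated PME load fraction (0.40 in this report).
 * Hypothetical helper for illustration only. */
int npme_scan_max(int nnodes, double pme_ratio)
{
    int half    = nnodes / 2;
    int by_load = (int)(nnodes * pme_ratio * 1.1);
    return half < by_load ? half : by_load;
}
```

For nnodes = 256 and a PME load of 0.40 this gives min(128, 112) = 112, compared with 85 for the ncpus/3 bound mentioned above.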

I think best in my case would be if the optimizer would start from (nnodes+15)/16=16, and try all factors of 256 up to 256/2-1 (i.e. 16, 32, 64), then start again and try all new multiples of 64 (i.e. none), then all new multiples of 32 (32, 96), then all new multiples of 16 (16, 48, 80, 112). fits_pp_pme_perf should probably be happy with 112 in my case. If not, go again with new multiples of 8, then 4, then 2, then (as in the current code) try multiples of 1 :-)

This way, when a user chooses nnodes to be a multiple of the magic machine number (i.e. usually the number of cores per network connection), npme values that are multiples of that number are always considered ahead of npme values that are not such multiples.
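The search order described above can be sketched as follows. This is only an illustration of the proposal, not GROMACS code: the function names, the caller-supplied buffer, and the generalization from powers of two to "divisors of nnodes, largest first" are all assumptions. The acceptance check that GROMACS applies to each candidate (fits_pp_pme_perf) is not modelled here.

```c
/* Returns 1 if value is already among the first n entries of list. */
int contains(const int *list, int n, int value)
{
    for (int i = 0; i < n; i++)
    {
        if (list[i] == value)
        {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical sketch: fill cand[] with npme candidates in the proposed
 * order and return how many were written (at most max_cand). */
int npme_candidates(int nnodes, int *cand, int max_cand)
{
    int lo = (nnodes + 15) / 16; /* smallest npme considered */
    int hi = nnodes / 2 - 1;     /* largest npme considered  */
    int n  = 0;

    /* Pass 1: exact divisors of nnodes, in ascending order. */
    for (int v = lo; v <= hi && n < max_cand; v++)
    {
        if (nnodes % v == 0)
        {
            cand[n++] = v;
        }
    }

    /* Pass 2: for each divisor s of nnodes, largest first, add the
     * multiples of s that have not been listed yet. s == 1 ends the
     * scan by admitting every remaining value. */
    for (int s = hi; s >= 1; s--)
    {
        if (nnodes % s != 0)
        {
            continue;
        }
        for (int v = ((lo + s - 1) / s) * s; v <= hi && n < max_cand; v += s)
        {
            if (!contains(cand, n, v))
            {
                cand[n++] = v;
            }
        }
    }
    return n;
}
```

For nnodes = 256 the first candidates come out as 16, 32, 64, 96, 48, 80, 112, followed by the new multiples of 8 starting at 24.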

#### #2 Updated by Berk Hess over 9 years ago

This looks very strange indeed.

106 is a nasty number and not a multiple of 106.

This should not have happened.

Could you attach a tpr file?

Thanks,

Berk

#### #3 Updated by Mark Abraham over 9 years ago

Created an attachment (id=536)

.tpr for these sims

I think the nastiness comes from the fact that my 0.40 PME load is higher than 1/3, per my comment #1.

.tpr attached.

#### #4 Updated by Berk Hess over 9 years ago

The setup does try to use npme with a large factor that can be common with the pp division, but obviously something goes wrong here. I'll look into it this morning.

Berk

#### #5 Updated by Berk Hess over 9 years ago

I committed a fix.

I think this should solve the problem for all numbers of nodes.

If you want to have a look at the code, you can do that before we close the bug.

I have also added a check for the total number of nodes being a prime.

Thanks for reporting this serious performance issue.

(Somehow I never noticed this before, probably because the PME load is below 0.33 in most cases.)

Berk

#### #6 Updated by Mark Abraham over 9 years ago

Code looks good, and works in my case.

It seems to me that in practice the 2D-PME requirement for the leading DD and PME grid dimensions to be equal will lead to good utilization of machine-specific qualities, like ratio of cores to connections. Users will be running jobs with nnodes that are multiples of that number, and things should work OK.

Resolved bug as FIXED