Bug #1607

Bug in DD code with small number of DD cells

Added by Alexey Shvetsov almost 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Hi all!

It seems like there is a bug in the DD code.
It is reproducible with a small number of DD cells.

This run uses 4 nodes, each running 8 OpenMP threads. However, the bug is also reproducible with one node and a GPU.

A tpr file for one of the systems showing this bug is attached.

GROMACS: mdrun_mpi, VERSION 5.1-dev-20140922-20c00a9-dirty-unknown
Executable: /s/ls2/home/users/alexxy/prefix/usr/bin/mdrun_mpi
Library dir: /s/ls2/home/users/alexxy/prefix/usr/share/gromacs/top
Command line:
mdrun_mpi -deffnm psa_pep_ctrl.md_npt -maxh 168 -cpi -append

Reading file psa_pep_ctrl.md_npt.tpr, VERSION 5.1-dev-20140922-20c00a9-dirty-unknown (single precision)
The number of OpenMP threads was set by environment variable OMP_NUM_THREADS to 8

Reading checkpoint file psa_pep_ctrl.md_npt.cpt generated: Sat Sep 27 23:48:46 2014

Using 4 MPI processes
Using 8 OpenMP threads per MPI process

WARNING: This run will generate roughly 2924 Mb of data

starting mdrun '2ZCH_3 in water'
50000000 steps, 100000.0 ps (continuing from step 13459500, 26919.0 ps).

NOTE: Turning on dynamic load balancing

Step 13472500:
The charge group starting at atom 5999 moved more than the distance allowed by the domain decomposition (0.834000) in direction X
distance out of cell -0.907460
Old coordinates: 1.316 4.995 1.009
New coordinates: 1.316 4.995 1.009
Old cell boundaries in direction X: 1.488 2.962
New cell boundaries in direction X: 1.510 2.997

-------------------------------------------------------
Program mdrun_mpi, VERSION 5.1-dev-20140922-20c00a9-dirty-unknown
Source code file: /var/tmp/alexxy/portage/sci-chemistry/gromacs-9999/work/gromacs-9999/src/gromacs/mdlib/domdec.cpp, line: 4388

Fatal error:
A charge group moved too far between two domain decomposition steps
This usually means that your system is not well equilibrated
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Halting parallel program mdrun_mpi on rank 1 out of 4
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

psa_pep_ctrl.md_npt.tpr (416 KB) psa_pep_ctrl.md_npt.tpr Alexey Shvetsov, 09/28/2014 10:07 PM

Associated revisions

Revision 971191d8 (diff)
Added by Berk Hess almost 3 years ago

Domain decomposition now checks the rlist buffer

When a large pair-list buffer is used, which happens with large nstlist,
atoms are allowed to displace by up to the buffer size, i.e. a lot, within
nstlist steps. The limit this puts on the DD cell size is now checked.
Also updated cg_move_error, which now no longer prints the old atom
coordinates with the Verlet scheme, where the "old" coordinates are
actually the new ones.
Fixes #1607.

Change-Id: I784afa5ee620b51f555f4d1107f38cbbae2c55d1
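
To illustrate the kind of limit described in this commit message, here is a minimal C++ sketch; the function name, its signature and the simplified criterion (an atom may move at most one cell between DD steps) are assumptions for illustration, not the actual code in domdec.cpp:

// Simplified sketch of the check described above; all names and the exact
// criterion are assumptions, not the actual GROMACS domdec.cpp code.
#include <stdexcept>
#include <string>

// Between two DD repartitioning steps (every nstlist steps with the Verlet
// scheme) atoms may displace by up to the pair-list buffer, rlist - rcut.
// The DD code only tolerates moves into a neighbouring cell, so each cell
// must be able to absorb that displacement.
void checkCellSizeAgainstBuffer(double cellSize, double rcut, double rlist)
{
    const double buffer      = rlist - rcut; // possible displacement (nm)
    const double allowedMove = cellSize;     // simplification: one cell (nm)
    if (buffer > allowedMove)
    {
        throw std::runtime_error(
                "pair-list buffer " + std::to_string(buffer)
                + " nm exceeds the allowed DD displacement of "
                + std::to_string(allowedMove)
                + " nm; decrease nstlist (and thus rlist) or use fewer DD cells");
    }
}

With the 0.834 nm allowed distance reported in the error above, a buffer on the order of the cut-off (Berk notes below that nstlist=500 makes rlist twice the cut-off) would plausibly trip such a check.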

Revision b9e2e841 (diff)
Added by Berk Hess almost 3 years ago

Fix DD bonded interaction range print

A recent fix caused the DD setup output in the log file to print the
maximum of the bonded distance and the list buffer instead of the maximum
bonded distance. This was a printing issue only.
Refs #1607.

Change-Id: I685e2e5e07f2f1a0a39c5eef4264a77ddfcecb31

History

#1 Updated by Berk Hess almost 3 years ago

  • Status changed from New to In Progress

This is not a bug: you have nstlist set to 500, which causes this. In 500 steps atoms can move a lot.
It is difficult to add checks for (too) large nstlist values, so we don't warn the user about this.
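
As a rough illustration of how far atoms can travel in 500 steps, here is a back-of-the-envelope C++ sketch; the temperature, masses and time step are assumed typical values, not taken from the attached tpr, and the ballistic estimate ignores collisions and constraints, so it is an upper bound:

// Back-of-the-envelope estimate only; T, masses and dt are assumptions,
// not values read from psa_pep_ctrl.md_npt.tpr.
#include <cmath>
#include <cstdio>

int main()
{
    const double kB      = 1.380649e-23; // Boltzmann constant, J/K
    const double T       = 300.0;        // assumed temperature, K
    const double amu     = 1.66054e-27;  // atomic mass unit, kg
    const double dt      = 0.002;        // assumed time step, ps
    const int    nstlist = 500;          // value from this report

    const double masses[] = { 1.0, 18.0 }; // H atom, water molecule (amu)
    for (double m : masses)
    {
        // 1D thermal velocity, converted from m/s to nm/ps
        const double v = std::sqrt(kB * T / (m * amu)) * 1e-3;
        // Free-flight displacement between two neighbour-search/DD steps
        const double d = v * nstlist * dt;
        std::printf("m = %4.1f amu: v ~ %.2f nm/ps, displacement over %d steps ~ %.2f nm\n",
                    m, v, nstlist, d);
    }
    return 0;
}

For these assumed values the estimate gives displacements from a few tenths of a nanometre up to more than a nanometre, i.e. on the scale of the 0.834 nm limit reported in the error message.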

#2 Updated by Berk Hess almost 3 years ago

I will remove the old (which are actually the new) coordinates from the DD error message.

#3 Updated by Alexey Shvetsov almost 3 years ago

Ahh, that's my fault. Am I right that a large nstlist value should work just fine for large systems with large cells?

#4 Updated by Gerrit Code Review Bot almost 3 years ago

Gerrit received a related patchset '1' for Issue #1607.
Uploader: Berk Hess ()
Change-Id: I784afa5ee620b51f555f4d1107f38cbbae2c55d1
Gerrit URL: https://gerrit.gromacs.org/4107

#5 Updated by Berk Hess almost 3 years ago

  • Status changed from In Progress to Fix uploaded

I now actually "fixed" the issue, see: https://gerrit.gromacs.org/#/c/4107/
The limit due to the list buffer means that you can no longer run your current tpr file with domain decomposition. You need to decrease nstlist, and thereby rlist, to make it run.
Do you really get the best performance with nstlist=500, which makes rlist twice the cut-off and increases the non-bonded cost by a factor of 2?

With large cells the issue is less likely to occur. But if you run a large system massively in parallel such that the cells get small, the same issue can occur.

#6 Updated by Mark Abraham almost 3 years ago

@Alexey We're going to make a 5.0.2 release ASAP to fix the nasty #1603, so if you're able to check that Berk's fix works for you, that'd be great.

#7 Updated by Alexey Shvetsov almost 3 years ago

I'll check today whether it still fails with the same tpr.

#8 Updated by Alexey Shvetsov almost 3 years ago

@Berk, what value of nstlist do you recommend?

#9 Updated by Berk Hess almost 3 years ago

Whatever turns out to be the fastest.
With a GPU, 20 or 40 is usually optimal (depending on the system and hardware). Try both and see what performance you get.
mdrun usually increases nstlist automatically to 20 or 40 based on the input system, but it does not take the hardware into account, so this does not always produce the optimal nstlist.

#10 Updated by Alexey Shvetsov almost 3 years ago

What about regular nodes? Is it a good way to tune performance there as well?

#11 Updated by Mark Abraham almost 3 years ago

  • Status changed from Fix uploaded to Closed
  • Target version deleted (5.1)

With only CPUs, the payoff for a larger nstlist is less clear: searching is always on the CPU, and a longer nstlist requires a larger buffer, and thus more cycles computing zeros. Zeros are cheaper on the GPU because the GPU is idle during the search anyway. I don't recall the default behavior, though.
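
The "cycles computing zeros" point can be made concrete with a naive estimate; the sketch below assumes a plain atom-pair list in a homogeneous system and a 1.0 nm cut-off (an assumed value), so it overstates the overhead of GROMACS's cluster-based pair lists and kernels, but it shows the trend:

// Naive estimate only: assumes a plain atom-pair list in a homogeneous
// system; GROMACS's cluster pair lists have considerably less overhead.
#include <cmath>
#include <cstdio>

int main()
{
    const double rcut     = 1.0;               // assumed cut-off, nm
    const double rlists[] = { 1.1, 1.3, 2.0 }; // larger nstlist -> larger rlist
    for (double rlist : rlists)
    {
        // Pair counts scale with the enclosed volume (~r^3), so the fraction
        // of listed pairs beyond the cut-off ("zeros") is:
        const double zeroFraction = 1.0 - std::pow(rcut / rlist, 3.0);
        std::printf("rlist = %.1f nm: ~%.0f%% of listed pairs lie beyond the cut-off\n",
                    rlist, 100.0 * zeroFraction);
    }
    return 0;
}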

#12 Updated by Mark Abraham almost 3 years ago

  • Target version set to 5.0.2

#13 Updated by Gerrit Code Review Bot almost 3 years ago

Gerrit received a related patchset '1' for Issue #1607.
Uploader: Berk Hess ()
Change-Id: I685e2e5e07f2f1a0a39c5eef4264a77ddfcecb31
Gerrit URL: https://gerrit.gromacs.org/4159
