Bug #2360

Updated by Aleksei Iupinov over 2 years ago

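Counter resetting fails when a separate PME rank uses GPU offload. It seems that the internal state of the PME load balancer is not updated correctly, so the balancing never reaches a stopped state. In the log below, the tuner reports an optimal grid at step 2880, yet the counter reset at step 5000 still aborts because PME tuning is considered active.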

<pre>
mdrun -ntmpi 4 -ntomp 8 -noconfout -pin on -npme 1 -nsteps 7000 -resetstep 5000 -nb gpu -pme gpu -v
[...]
Using 4 MPI threads
Using 8 OpenMP threads per tMPI thread

On host threadripper-gpu01 1 GPU auto-selected for this run.
Mapping of GPU IDs to the 4 GPU tasks in the 4 ranks on this node:
PP:0,PP:0,PP:0,PME:0

NOTE: DLB will not turn on during the first phase of PME tuning
starting mdrun 'Water'
7000 steps, 14.0 ps.
step 320: timed with pme grid 112 112 112, coulomb cutoff 0.900: 711.5 M-cycles
step 480: timed with pme grid 100 100 100, coulomb cutoff 0.997: 748.6 M-cycles
step 640: timed with pme grid 84 84 84, coulomb cutoff 1.187: 853.4 M-cycles
step 800: timed with pme grid 96 96 96, coulomb cutoff 1.038: 740.4 M-cycles
step 960: timed with pme grid 100 100 100, coulomb cutoff 0.997: 728.9 M-cycles
step 1120: timed with pme grid 104 104 104, coulomb cutoff 0.958: 805.1 M-cycles
step 1280: timed with pme grid 108 108 108, coulomb cutoff 0.923: 710.2 M-cycles
step 1440: timed with pme grid 112 112 112, coulomb cutoff 0.900: 701.8 M-cycles
step 1600: timed with pme grid 96 96 96, coulomb cutoff 1.038: 723.0 M-cycles
step 1760: timed with pme grid 100 100 100, coulomb cutoff 0.997: 775.2 M-cycles
step 1920: timed with pme grid 104 104 104, coulomb cutoff 0.958: 818.0 M-cycles
step 2080: timed with pme grid 108 108 108, coulomb cutoff 0.923: 704.7 M-cycles
step 2240: timed with pme grid 112 112 112, coulomb cutoff 0.900: 711.5 M-cycles
step 2400: timed with pme grid 96 96 96, coulomb cutoff 1.038: 722.0 M-cycles
step 2560: timed with pme grid 100 100 100, coulomb cutoff 0.997: 725.4 M-cycles
step 2720: timed with pme grid 108 108 108, coulomb cutoff 0.923: 713.2 M-cycles
step 2880: timed with pme grid 112 112 112, coulomb cutoff 0.900: 695.5 M-cycles
optimal pme grid 112 112 112, coulomb cutoff 0.900

NOTE: DLB can now turn on, when beneficial
step 4900, remaining wall clock time: 9 s imb F 0% pme/F 1.14
-------------------------------------------------------
Program: gmx mdrun, version 2018-beta2-dev-20171211-2e91fcf-dirty
Source file: src/programs/mdrun/md.cpp (line 1930)
MPI rank: 2 (out of 4)

Fatal error:
PME tuning was still active when attempting to reset mdrun counters at step
5000. Try resetting counters later in the run, e.g. with gmx mdrun -resetstep.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
</pre>
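To compare the tuning timings from a log like the one above, a small script can group the "timed with pme grid" lines by grid size. This is only a sketch for inspecting the output: the function names `summarize_pme_tuning` and `fastest_grid` are hypothetical helpers, not part of GROMACS, and the regular expression assumes the exact line format shown in this report.

```python
import re
from collections import defaultdict

# Matches lines of the form seen in this report, e.g.
# "step 320: timed with pme grid 112 112 112, coulomb cutoff 0.900: 711.5 M-cycles"
TIMING_RE = re.compile(
    r"step\s+(\d+): timed with pme grid (\d+) (\d+) (\d+), "
    r"coulomb cutoff ([\d.]+): ([\d.]+) M-cycles"
)


def summarize_pme_tuning(log_text):
    """Return {grid: [M-cycles, ...]} for every timed PME grid in log_text."""
    timings = defaultdict(list)
    for match in TIMING_RE.finditer(log_text):
        grid = tuple(int(match.group(i)) for i in (2, 3, 4))
        timings[grid].append(float(match.group(6)))
    return dict(timings)


def fastest_grid(timings):
    """Return the grid whose best (lowest) timing is the smallest."""
    return min(timings, key=lambda grid: min(timings[grid]))


# A few lines copied from the log in this report.
sample = """\
step 320: timed with pme grid 112 112 112, coulomb cutoff 0.900: 711.5 M-cycles
step 480: timed with pme grid 100 100 100, coulomb cutoff 0.997: 748.6 M-cycles
step 1440: timed with pme grid 112 112 112, coulomb cutoff 0.900: 701.8 M-cycles
step 2880: timed with pme grid 112 112 112, coulomb cutoff 0.900: 695.5 M-cycles
"""

timings = summarize_pme_tuning(sample)
for grid, cycles in sorted(timings.items()):
    print(grid, "timed", len(cycles), "times, best", min(cycles), "M-cycles")
print("fastest grid:", fastest_grid(timings))
```

Run on the full log, this makes it easy to see that the same grids are timed repeatedly and that the 112 112 112 grid reported as optimal is also the one with the lowest measured timing.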
