Incorrect forces with LJ potential-switch, GPU and PME tuning
We are struggling to get a TI (thermodynamic integration) run working on our machine. The specifications are listed below. As you can see, it is a two-socket machine with two graphics cards, so the plan is to run two simulations in parallel. But we cannot get even a single one to run.
Running on 1 node with total 20 cores, 20 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: GenuineIntel
    Brand: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX2_256
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
    #1: NVIDIA GeForce GTX 1080, compute cap.: 6.1, ECC: no, stat: compatible
The simulation system in question is a protein-ligand complex in TIP3P water, with Amber ff99SB as the force field.
Now let's get into the messy details. We tried different mdrun command-line argument combinations, for example:
gmx mdrun -s md.tpr -pin on -ntomp 2 -ntmpi 5 -gpu_id 00000 -deffnm md (does not work)
gmx mdrun -s md.tpr -pin on -ntomp 5 -ntmpi 2 -gpu_id 00 -deffnm md
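As an aside, once single runs work, one common layout for two independent simulations on hardware like this is one run per GPU, each pinned to half the cores. A sketch, assuming GROMACS 2016 flag syntax; the .tpr/.deffnm names are placeholders:

```shell
# Hypothetical layout: one simulation per GPU, 10 cores each,
# with pin offsets so the two runs do not share cores.
gmx mdrun -s md1.tpr -deffnm md1 -ntmpi 1 -ntomp 10 -gpu_id 0 -pin on -pinoffset 0  &
gmx mdrun -s md2.tpr -deffnm md2 -ntmpi 1 -ntomp 10 -gpu_id 1 -pin on -pinoffset 10 &
wait
```

This does not address the crash below; it only sketches how the two-GPU plan could be laid out once the underlying problem is resolved.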
#1 Updated by Mark Abraham about 3 years ago
- Status changed from New to Accepted
- Affected version - extra info set to 2016-rc1
Thanks for the report. I have reproduced the tpr blowing up on tcbs23 (2 GPU, 32 cores) within a few hundred MD steps when run with mdrun defaults (5.1, 5.1.2 and 2016-rc1).
With mdrun -nb cpu it runs for > 1500 steps.
With mdrun -notunepme it runs for > 17000 steps even with GPUs, so we can probably exclude problems with CUDA 8 or latest-model GPUs. This probably gives Yannic a way to move forward, even though throughput will be inferior (because the auto-tuning is disabled), until we work out what is going wrong.
With mdrun -dlb no -tunepme it also crashes.
Yannic, can you please also upload a tarball of the inputs you used with grompp, so we can make a .tpr that runs with version 5, to see if the issue is present there also? (Or make one yourself if you prefer to do that, but the tarball will give us more extensive options.)
#4 Updated by Berk Hess about 3 years ago
- Status changed from Accepted to In Progress
- Assignee set to Berk Hess
The issue is also present in release 2016.
If I use -notunepme the issue does not occur.
I tried calculating and printing energies to see where the issue arises, but setting both intervals to 1 or 2 steps makes the issue go away. This seems to indicate it is some kind of timing issue.
Adding -pforce 10000 shows large forces at step 162, just after the second PME grid setting:
step 80: timed with pme grid 108 108 108, coulomb cutoff 1.200: 11485.6 M-cycles
step 160: timed with pme grid 100 100 100, coulomb cutoff 1.251: 8075.2 M-cycles
The large forces don't seem to come from the free-energy kernels, so I guess the issue is on the GPU, but I'm not sure about that.
PS: rlist=1.8 gives a buffer of 0.6 nm and an enormous increase in computational cost. For production I would suggest using a low verlet-buffer-tolerance value instead.
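That suggestion as an .mdp fragment (the tolerance value here is an assumption chosen to illustrate the idea, not a validated recommendation for this system):

```ini
; Let mdrun derive rlist from the drift estimate instead of forcing a
; large fixed buffer; tighten the tolerance for production accuracy.
verlet-buffer-tolerance = 0.0001   ; kJ/mol/ps per atom (default: 0.005)
```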
#5 Updated by Berk Hess about 3 years ago
- Subject changed from Problems with TI on GPUs to Sudden incorrect forces with CUDA and PME tuning
- Priority changed from Normal to High
I have now reproduced this issue with v2016 without free-energy and with -ntmpi 1.
After the second PME grid change at step 160, things go wrong. At steps 160 and 161 the force RMSD between CPU and GPU is 4 (which indicates fully correct forces); at step 162 it is 300. So the strange thing is that it only seems to go wrong at the third step after the second grid change. It doesn't always go wrong, but when it does, it always happens at step 162.
#6 Updated by Berk Hess about 3 years ago
- Subject changed from Sudden incorrect forces with CUDA and PME tuning to Incorrect forces with LJ potential-switch, GPU and PME tuning
In the GPU kernels, CUDA and OpenCL, the LJ cut-off masking is done before calculating the LJ forces. This means the forces will be incorrect when a twin-range cut-off is used, as happens with PME tuning.