Bug #3056

Miscalculated LJ(SR) when running with GPU?

Added by Michael Shirts about 1 month ago. Updated 19 days ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info: Reported similar behavior with 2019-beta1
Affected version:
Difficulty: uncategorized

Description

Collaborators in the Mobley Lab found an issue where LJ appears to be miscalculated when running on a GPU. I'm not sure whether it occurs in the most recent code (it's a bit harder for me to test on GPU), but they reported that similar issues were found in 2019-beta.

I've attached the input files for both GPU and CPU; as you can see by looking at the mdout.mdp they are processed the same.

At the initial time step, if you look at the energy.xvg files, all of the entries are roughly the same (presumably within what one would expect from single-precision arithmetic) ... except for LJ.

I'm not an expert at the GPU code, so I did not try to investigate.

Entry              CPU               GPU
Bond               511.556519        511.556427
Harmonic Pot.      0.224793          0.224793
Angle              1768.662231       1768.662842
Proper Dih.        9718.273438       9718.266602
Improper Dih.      0.405944          0.405945
Improper Dih.      75.552689         75.552696
LJ-14              2799.869141       2799.871338
Coulomb-14         39090.589844      39090.554688
LJ (SR)            99445.546875      197122.046875   <--- ????
Disper. corr.      -3431.618896      -3431.618896
Coulomb (SR)       -901030.000000    -901163.062500
Coul. recip.       1618.536377       1618.538086
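The comparison in the table above can be automated with a short script. The following is a minimal sketch, not part of the original report: the function name `flag_mismatches` and the relative tolerance of 1e-3 are assumptions chosen for illustration, and the numbers are copied from the table rather than parsed from energy.xvg.

```python
# Energy terms at the initial step, copied from the table above: (name, CPU, GPU).
# A list of tuples is used because "Improper Dih." appears twice.
ENERGIES = [
    ("Bond", 511.556519, 511.556427),
    ("Harmonic Pot.", 0.224793, 0.224793),
    ("Angle", 1768.662231, 1768.662842),
    ("Proper Dih.", 9718.273438, 9718.266602),
    ("Improper Dih.", 0.405944, 0.405945),
    ("Improper Dih.", 75.552689, 75.552696),
    ("LJ-14", 2799.869141, 2799.871338),
    ("Coulomb-14", 39090.589844, 39090.554688),
    ("LJ (SR)", 99445.546875, 197122.046875),
    ("Disper. corr.", -3431.618896, -3431.618896),
    ("Coulomb (SR)", -901030.000000, -901163.062500),
    ("Coul. recip.", 1618.536377, 1618.538086),
]


def flag_mismatches(rows, rel_tol=1e-3):
    """Return the names of terms whose CPU and GPU values differ by more
    than rel_tol, relative to the larger magnitude of the two.

    rel_tol is a hypothetical threshold, not a GROMACS-defined tolerance.
    """
    flagged = []
    for name, cpu, gpu in rows:
        scale = max(abs(cpu), abs(gpu))
        if scale > 0 and abs(cpu - gpu) / scale > rel_tol:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    print(flag_mismatches(ENERGIES))
```

With the assumed tolerance of 1e-3, only LJ (SR) is flagged. Tightening it to 1e-5 also flags Coulomb (SR), whose CPU and GPU values differ by roughly 1.5e-4 in relative terms.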

Notes from the student:

The GPUs are Nvidia TitanX GPUs.
We have a Gromacs 2018-3 version and a 2019-beta version compiled for that partition.
I ran the previous test with 2018-3. I also tried 2019-beta earlier, and if I remember correctly it gave me the same errors/issues.
I didn’t compile them; one of the students who did sent me the instructions he used (for 2019-beta):

cmake3 .. -DGMX_GPU=on -DGMX_SIMD=AVX2_256 \
-DCMAKE_INSTALL_PREFIX=$TARGET \
-DGMX_BUILD_OWN_FFTW=ON \
-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 \
-DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ \
-DREGRESSIONTEST_DOWNLOAD=OFF

inputs.zip (5.29 MB) - Inputs for both CPU and GPU - Michael Shirts, 08/11/2019 06:37 AM
outputs.zip (8.83 MB) - Outputs for both CPU and GPU - Michael Shirts, 08/11/2019 06:37 AM

Associated revisions

Revision a5409af7 (diff)
Added by Berk Hess 21 days ago

Fix incorrect rvdw on GPU with rvdw<rcoulomb

When rvdw < rcoulomb was set in the mdp file, rvdw would initially
be set to rcoulomb on the GPU. With default mdrun settings,
the correct rvdw would be set after 2*nstlist steps by PME tuning.

TODO: Add an mdrun test case with rvdw<rcoulomb, refs #3062

Fixes #3056

Change-Id: I7243f27e75e46adedd668822dcd6b9045ef98a3f
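The condition named in the commit message, rvdw < rcoulomb in the mdp file, can be checked before running on an affected version. The following is a minimal sketch under stated assumptions: `parse_mdp` and `rvdw_below_rcoulomb` are hypothetical helpers, not GROMACS tools, the parser handles only simple `key = value` lines with `;` comments, and the cutoff values in the example are made up rather than taken from the attached inputs.

```python
def parse_mdp(text):
    """Parse simple 'key = value' lines from mdp text into a dict.

    Everything after ';' is treated as a comment. Dashes in keys are
    normalized to underscores. This is a simplified reader for
    illustration, not a full mdp parser.
    """
    params = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip().lower().replace("-", "_")] = value.strip()
    return params


def rvdw_below_rcoulomb(text):
    """Return True if both cutoffs are set explicitly and rvdw < rcoulomb.

    If either key is missing, this sketch returns False instead of
    guessing at default values.
    """
    params = parse_mdp(text)
    if "rvdw" not in params or "rcoulomb" not in params:
        return False
    return float(params["rvdw"]) < float(params["rcoulomb"])


if __name__ == "__main__":
    # Hypothetical cutoff values, for illustration only.
    example = "rcoulomb = 1.0\nrvdw = 0.9 ; shorter LJ cutoff\n"
    print(rvdw_below_rcoulomb(example))
```

A check like this only identifies inputs that match the condition described in the fix; it does not say whether a given build contains the fix.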

History

#1 Updated by Mark Abraham about 1 month ago

  • Affected version changed from 2018.4 to 2018.3

Thanks for the report, the attached log files report 2018.3, so I'm changing the affected version

#2 Updated by Michael Shirts about 1 month ago

Mark Abraham wrote:

Thanks for the report, the attached log files report 2018.3, so I'm changing the affected version

Ah, thanks, I didn't scroll down far enough with my UI - I didn't see anything earlier than 2018.4.

#3 Updated by Mark Abraham about 1 month ago

I replicated the difference in LJ-(SR) between GPU and CPU with this input (approximately double the magnitude...) with 2016.6, 2018.7 and 2019.3. I'm not sure yet what the issue is - hopefully just a glitch with sd not being tested much.

#4 Updated by Vytautas Gapsys about 1 month ago

The difference between CPU and GPU LJ-SR disappears when rcoulomb and rvdw are set to the same value (tested with 2018.7)

#5 Updated by Berk Hess about 1 month ago

  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess
  • Priority changed from Normal to High

Initially rvdw is set to rcoulomb for the GPU. With MD this gets fixed at the first PME tuning after 2*nstlist steps.

I uploaded a fix to release-2019. We should backport to release-2018.

#6 Updated by Berk Hess 21 days ago

  • Status changed from Fix uploaded to Resolved

#7 Updated by Paul Bauer 19 days ago

  • Status changed from Resolved to Closed
