Miscalculated LJ(SR) when running with GPU?
Collaborators in the Mobley Lab found what appears to be a miscalculation of the LJ energy when running on GPU. I'm not sure whether it is still present in the most recent code (it is a bit harder for me to test on GPU); they reported that similar issues were seen in 2019-beta.
I've attached the input files for both GPU and CPU; as you can see by looking at the mdout.mdp they are processed the same.
At the initial time step, if you look at the energy.xvg files, all of the entries are roughly the same (presumably within what one would expect from single-precision arithmetic) . . . except for LJ.
I'm not an expert in the GPU code, so I did not try to investigate.
Entry                   CPU               GPU
s0  "Bond"              511.556519        511.556427
s1  "Harmonic Pot."     0.224793          0.224793
s2  "Angle"             1768.662231       1768.662842
s3  "Proper Dih."       9718.273438       9718.266602
s4  "Improper Dih."     0.405944          0.405945
s5  "Improper Dih."     75.552689         75.552696
s6  "LJ-14"             2799.869141       2799.871338
s7  "Coulomb-14"        39090.589844      39090.554688
s8  "LJ (SR)"           99445.546875      197122.046875   <--- ????
s9  "Disper. corr."     -3431.618896      -3431.618896
s10 "Coulomb (SR)"      -901030.000000    -901163.062500
s11 "Coul. recip."      1618.536377       1618.538086
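To make the comparison less dependent on eyeballing, the CPU and GPU values can be compared by relative difference. The following is only a minimal sketch: the numbers are copied from the table in this report, while the function names and the 1e-3 tolerance are assumptions chosen for illustration, not anything GROMACS defines.

```python
# Energy terms at the initial step, copied from the report: (label, CPU, GPU).
ENERGIES = [
    ("Bond", 511.556519, 511.556427),
    ("Harmonic Pot.", 0.224793, 0.224793),
    ("Angle", 1768.662231, 1768.662842),
    ("Proper Dih.", 9718.273438, 9718.266602),
    ("Improper Dih.", 0.405944, 0.405945),
    ("Improper Dih.", 75.552689, 75.552696),
    ("LJ-14", 2799.869141, 2799.871338),
    ("Coulomb-14", 39090.589844, 39090.554688),
    ("LJ (SR)", 99445.546875, 197122.046875),
    ("Disper. corr.", -3431.618896, -3431.618896),
    ("Coulomb (SR)", -901030.000000, -901163.062500),
    ("Coul. recip.", 1618.536377, 1618.538086),
]

# Assumed threshold: far looser than single-precision round-off,
# so only gross mismatches are reported.
TOLERANCE = 1e-3


def relative_difference(cpu, gpu):
    """Return |cpu - gpu| scaled by the larger magnitude (0.0 if both are zero)."""
    scale = max(abs(cpu), abs(gpu))
    return 0.0 if scale == 0.0 else abs(cpu - gpu) / scale


def flag_outliers(terms, tol=TOLERANCE):
    """Return the labels whose CPU and GPU values differ by more than tol."""
    return [label for label, cpu, gpu in terms
            if relative_difference(cpu, gpu) > tol]


if __name__ == "__main__":
    for label, cpu, gpu in ENERGIES:
        print(f"{label:15s} {relative_difference(cpu, gpu):.3e}")
    print("Flagged:", flag_outliers(ENERGIES))
```

With this tolerance only LJ (SR) is flagged, with a relative difference of roughly 0.5. A tighter tolerance such as 1e-5 would also flag Coulomb (SR).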
Notes from the student:
The GPUs are Nvidia TitanX GPUs.
We have a Gromacs 2018-3 version and a 2019-beta version compiled for that partition.
The previous test was run with 2018-3. I also tried 2019-beta earlier, but if I remember correctly it gave me the same errors/issues.
I didn't compile them; one of the students who did sent me the instructions he used (for 2019-beta):
cmake3 .. -DGMX_GPU=on -DGMX_SIMD=AVX2_256 \
-DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ \
Fix incorrect rvdw on GPU with rvdw<rcoulomb
When rvdw < rcoulomb was set in the mdp file, rvdw would initially
be set to rcoulomb on the GPU. With default mdrun settings,
the correct rvdw would be set after 2*nstlist steps by PME tuning.
TODO: Add an mdrun test case with rvdw<rcoulomb, refs #3062
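Whether a given run is exposed to this depends on the condition named in the commit message, rvdw < rcoulomb. The following is a rough sketch of a check on the text of an mdp file, not part of GROMACS: the function names and the sample values are hypothetical, and the parser only handles simple `key = value` lines with `;` comments. It does not apply GROMACS defaults, so it reports False when either key is absent.

```python
def parse_mdp(text):
    """Parse simple 'key = value' lines from mdp text into a dict.

    Anything after ';' is treated as a comment. Dashes in keys are
    normalized to underscores so both spellings map to the same key.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip().replace("-", "_")] = value.strip()
    return params


def rvdw_below_rcoulomb(text):
    """Return True if both cutoffs are given explicitly and rvdw < rcoulomb."""
    params = parse_mdp(text)
    if "rvdw" not in params or "rcoulomb" not in params:
        return False  # defaults are not modelled in this sketch
    return float(params["rvdw"]) < float(params["rcoulomb"])


# Hypothetical input, for illustration only.
SAMPLE_MDP = """
; cutoffs
rcoulomb = 1.2
rvdw     = 0.9   ; shorter than rcoulomb
"""

if __name__ == "__main__":
    print(rvdw_below_rcoulomb(SAMPLE_MDP))
```

The same check can be run on the processed mdout.mdp, which lists the parameters as mdrun will actually see them.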
- Status changed from New to Fix uploaded
- Assignee set to Berk Hess
- Priority changed from Normal to High
Initially rvdw is set to rcoulomb on the GPU. With MD this gets corrected at the first PME tuning, after 2*nstlist steps, so only the energies before that point are affected.
I uploaded a fix to release-2019. We should backport to release-2018.