Version 4.6.5 for GPU gives incorrect results
We have a serious problem with GROMACS 4.6.5 for GPU, which produces incorrect output. Compared to the CPU version, the calculated energies are completely different (see the attached files from the official GROMACS GPU benchmark), which results in a rapid collapse to a random-coil structure. The system we are running on is 2 × 8-core 2.5 GHz Intel Xeon E5-2450 CPUs + 2 × NVIDIA Tesla K40m GPUs.
Could you help us with that?
Computational Chemistry Group
#1 Updated by Justin Lemkul almost 3 years ago
The CPU run reports version 4.6.4, while the GPU version reports 4.6.5; at the very least, do a comparison with a consistent version. Many bugs get fixed between minor releases. Better yet, try again with 4.6.6 or 5.0; it is much more effective to troubleshoot the current release than an old version whose issues may already have been resolved.
#2 Updated by Marcin Nowosielski almost 3 years ago
In addition to 4.6.5, we have tested 4.5.7 and the latest 5.0. All of them give wrong results.
Focusing on the 5.0, there are two major issues:
1) Incorrect GB Polarization and Nonpolar Sol. energies (wrong by orders of magnitude)
2) A non-zero pressure (either positive or negative) even with pcoupl = no (see the illustrative .mdp fragment below)
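For context, these are the standard .mdp switches involved; a minimal illustrative fragment (the values here are illustrative, not copied from the benchmark input):

; implicit (GB) solvent settings
implicit_solvent = GBSA
gb_algorithm     = OBC     ; Still, HCT, or OBC
rgbradii         = 2.0     ; must match rlist
pbc              = no
; no pressure coupling requested, yet mdrun reports a non-zero pressure
pcoupl           = no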
To be honest, this is disturbing, to say the least.
Don't get me wrong, I appreciate your work a lot and use GROMACS all the time.
Nevertheless, how can it be that all three versions produce such huge errors (i.e., they do not work)?
Are we just unlucky with our system configuration?
#3 Updated by Justin Lemkul almost 3 years ago
- Status changed from New to Rejected
Well, to be fair, we advertise that implicit + GPU does not work:
The previous use of implicit solvent on GPU was totally reliant upon the OpenMM interface, which is no longer supported. It would be nice if the implicit code could be made more robust and work with GPU, but at present there are no developers with time to do it. There have been a number of discussions about this, but time is limited and implicit + GPU is a very low priority. We should probably just issue a fatal error in mdrun if anyone tries to do this.
#6 Updated by Thomas Geenen almost 3 years ago
The GROMACS runs that gave the wrong results were performed on our system.
From the release notes of version 4.6, I understand that version 4.5 using OpenMM should be able to compute correct results.
Marcin also ran a model with this version of GROMACS and found that those results are incorrect as well.
We built version 4.5 of GROMACS with these settings:
cmake -DCMAKE_INSTALL_PREFIX=$installdir \
-DGMX_OPENMM=ON -DGMX_THREADS=OFF \
-DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/hpc/sw/cuda/5.5/ \
../$dir | tee ../log.cmake
Should we expect correct results from such a build for implicit-solvent calculations like the GPU benchmark examples (impl 1 nm and 2 nm)?
#7 Updated by Justin Lemkul almost 3 years ago
The benchmarks were created using very old versions of everything, CUDA 3.0 and OpenMM 2.0, IIRC. Any results you get will almost certainly be sensitive to changes. The combination of GROMACS 4.5 + OpenMM 2.0 had lots of bugs and missing features (on both the GROMACS and OpenMM sides), so I would expect that only the exact same combination of software versions might reproduce the same results. How "wrong" are your results from 4.5? Side-by-side comparisons are needed here.
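For a term-by-term comparison, the 4.x g_energy tool can dump matching terms from the two energy files; a minimal sketch (the .edr and .xvg file names are placeholders):

# extract the potential-energy term from each run
echo "Potential" | g_energy -f cpu_run.edr -o cpu_potential.xvg
echo "Potential" | g_energy -f gpu_run.edr -o gpu_potential.xvg
# the resulting .xvg files can then be compared term by term or plotted together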
#8 Updated by Marcin Nowosielski almost 3 years ago
Together with Thomas, I have tried a whole bunch of things.
The setup was taken from the official benchmark: dhfr-impl-2nm.bench (with the amber96 force field); the comparison runs were produced along the lines sketched below.
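Roughly, the runs were generated like this (a sketch with placeholder file names; mdrun-gpu is the binary that a 4.5 OpenMM build installs):

# preprocess the benchmark input
grompp -f bench.mdp -c conf.gro -p topol.top -o bench.tpr
# CPU run
mdrun -s bench.tpr -deffnm cpu_run
# GPU run through the OpenMM interface
mdrun-gpu -s bench.tpr -deffnm gpu_run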
4.5 on CPU:

           Step           Time         Lambda
              0        0.00000        0.00000

   Energies (kJ/mol)
          Angle    Proper Dih.  Improper Dih. GB Polarization  Nonpolar Sol.
    1.74631e+03    3.15965e+03    4.17787e+01   -1.27414e+04    1.48139e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
    2.46211e+03    2.90393e+04   -5.17308e+03   -4.22534e+04   -2.35706e+04
    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
    8.98330e+01   -2.34807e+04    4.37601e+00    0.00000e+00    8.09423e-06
4.5 for GPU:

   Energies (kJ/mol)
      Potential    Kinetic En.   Total Energy    Temperature   Constr. rmsd
   -1.86574e+04    6.36140e+03   -1.22960e+04    3.09693e+02    2.10775e-06
Error (potential) ca. 20% (i.e. |-1.86574e+04 - (-2.35706e+04)| / 2.35706e+04 ≈ 0.21)
4.6 for GPU:

   Energies (kJ/mol)
          Angle    Proper Dih.  Improper Dih. GB Polarization  Nonpolar Sol.
    1.74632e+03    3.15965e+03    4.17922e+01   -6.47831e+04    8.76577e+03
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
    2.46211e+03    2.90393e+04   -5.15137e+03   -3.66485e+04   -6.13680e+04
    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
    1.08079e+02   -6.12599e+04    5.26483e+00    2.99131e+01    7.35318e-06
Error (potential) ca. 160%
5.0 for GPU:

   Energies (kJ/mol)
          Angle    Proper Dih.  Improper Dih. GB Polarization  Nonpolar Sol.
    5.51520e+03    4.10883e+03    5.02242e+02   -6.74749e+04    8.43644e+03
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)      Potential
    3.03242e+03    3.23932e+04   -4.38033e+03   -5.49163e+04   -7.27832e+04
    Kinetic En.   Total Energy    Temperature Pressure (bar)   Constr. rmsd
    6.13076e+03   -6.66524e+04    2.98646e+02   -1.23727e+01    2.36228e-05
Error (potential) ca. 208%
Taking into account that a folded state is only slightly more stable than an unfolded one (< 100 kJ/mol), none of the versions actually works (confirmed by analyzing the trajectories).
Now, on the positive side: with infinite cut-offs, version 4.5 works well (see the .mdp fragment below):
CPU Potential: -2.43293e+04
GPU Potential: -2.43574e+04
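(In GROMACS, infinite cut-offs are requested by switching periodicity off and setting the cut-offs to zero, which selects the all-vs-all kernels; a minimal illustrative .mdp fragment:)

pbc      = no
rlist    = 0     ; 0 = no cut-off (all-vs-all)
rcoulomb = 0
rvdw     = 0
rgbradii = 0     ; must match rlist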
There is only one small note in the 4.6 release notes suggesting that implicit-solvent calculations may not work correctly, and none for 5.0. For 4.5, all configurations were supposed to work.
Reading the manuals (one small table on page 21 for version 5.0) and looking at the benchmarks, one gets the impression that it works, especially since in the explicit-solvent case the speedup is marginal and may not justify the cost of moving to GPUs at all.
Please be more specific in the documentation; it is great software, after all.
Getting back to bug hunting in my own code.
#10 Updated by Berk Hess almost 3 years ago
- Category set to preprocessing (pdb2gmx,grompp)
- Status changed from Accepted to Resolved
- Priority changed from High to Normal
- Target version set to 5.0.2
The incorrect GB results with GPU were due to a missing check for GB with the Verlet scheme in grompp. A grompp check has been added for 5.0.2.
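In essence, the check is a conditional fatal error during input processing; a minimal sketch in the style of the C sources (abbreviated, not the literal committed code; ir holds the parsed .mdp settings):

/* reject GB implicit solvent when the Verlet cut-off scheme is selected */
if (ir->implicit_solvent != eisNO && ir->cutoff_scheme == ecutsVERLET)
{
    gmx_fatal(FARGS,
              "Implicit solvent is not supported with the Verlet cut-off "
              "scheme; use cutoff-scheme = group instead.");
}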
We would like to have GB supported with the new SIMD and GPU kernels, but we currently don't have time to do this (we do need it for 6.0, though, where the group cut-off scheme will no longer be present).
Were there more issues here?