Bug #568

all the numbers produced by mdrun-gpu are 0.

Added by Jie empty about 9 years ago. Updated almost 8 years ago.

Status: Closed
Priority: Normal
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Hi,

We tried to run Gromacs on our Tesla S2050 GPU with CUDA toolkit 3.1. However, all the numbers it produced are 0. We have tested the 4.5 b4 and 4.5.1 versions.

Log files have been attached in a tar archive. The input file is a bit large for an attachment. Please let me know if you do need some other files to investigate this problem.

Kind Regards,
Jie

log.tar (420 KB): md.log files. Jie empty, 09/21/2010 06:59 AM
mdebin.c (38.7 KB): fix for not printing zero energies. Berk Hess, 09/24/2010 11:39 AM
md_openmm.c (20.1 KB): fix for turning off energy file summation with openmm (no effect on output). Berk Hess, 09/24/2010 11:40 AM
md.log (20.8 KB): md.log for patched gromacs. Jie empty, 10/05/2010 04:05 AM

History

#1 Updated by Jie empty about 9 years ago

Created an attachment (id=540)
md.log files

#2 Updated by Berk Hess about 9 years ago

Created an attachment (id=542)
fix for not printing zero energies

#3 Updated by Berk Hess about 9 years ago

Created an attachment (id=543)
fix for turning off energy file summation with openmm (no effect on output)

#4 Updated by Berk Hess about 9 years ago

I attached files which I think should fix the problem,
but somebody who can build and run an mdrun-gpu binary should check that it works.

Berk

#5 Updated by Rossen Apostolov about 9 years ago

fixed in commit 049a485802b1782133bdeab171bf766ac55d0377.

Thanks Berk!

#6 Updated by Jie empty about 9 years ago

Thanks Berk for your effort in fixing this bug.

I have looked into the mdebin.c file you uploaded. The function init_energyhistory() has been removed from this file. However, it is still used in init.c and declared in include/mdebin.h.

Therefore, when I tried to compile the fixed Gromacs 4.5.1 (MPI version), the link step failed with an error about a missing reference to 'init_energyhistory'. To work around this, I temporarily put an empty init_energyhistory() function in Berk's mdebin.c file.

However, I don't know whether this is the proper way.

Please clarify this point a little further. Is this fix only for mdrun-gpu, and not for the other mdrun builds?

Regards,
Jie

#7 Updated by Jie empty about 9 years ago

A follow-up comment on using the fixed mdebin.c and md_openmm.c.
Instead of 0 entries, I now get nan values (this is with an empty init_energyhistory() function in mdebin.c).

Initial temperature: 0 K

Started mdrun on node 0 Tue Sep 28 11:34:22 2010

Step           Time         Lambda
100000 200.00000 0.00000
Energies (kJ/mol)
Potential Kinetic En. Total Energy Temperature Constr. rmsd
-2.50934e+06 4.93725e+05 -2.01562e+06 2.73070e+02 0.00000e+00
<======  ###############  ==>
<====  A V E R A G E S  ====>
<==  ###############  ======>
Statistics over 1 steps using 0 frames
Energies (kJ/mol)
Potential Kinetic En. Total Energy Temperature Constr. rmsd
nan nan nan nan nan
Box-X          Box-Y          Box-Z
nan nan nan
Total Virial (kJ/mol)
nan nan nan
nan nan nan
nan nan nan
Pressure (bar)
nan nan nan
nan nan nan
nan nan nan
Total Dipole (D)
nan nan nan

Post-simulation full memtest in progress...
Memory test completed without errors.
No MEGA Flopsen this time

I think this may be due to the missing init_energyhistory() in the mdebin.c file.
Please clarify the corresponding changes to other files as well.

Regards,
Jie

#8 Updated by Rossen Apostolov about 9 years ago

The outputs are correct, please try with the latest git in release-4-5-patches.

#9 Updated by Jie empty about 9 years ago

(In reply to comment #8)

The outputs are correct, please try with the latest git in release-4-5-patches.

I have downloaded release-4.5-patches tar file from http://repo.or.cz/w/gromacs.git/snapshot/afd66e48c4e608aa744ec3ec5dd457a221e70def.tar.gz

and also applied Berk's patches.

However, the results produced by mdrun-gpu are again all zeros. Please see the attached md.log file.

Please note that I downloaded release-4.5-patches directly from the web instead of using the git repo (I am not familiar with git, and not sure how to check out a branch instead of working on the master copy). The web page says that this archive file was generated on 30 Sept 2010, so I think it should be up to date.

I used the following compile procedure in the source directory:

mkdir build
cd build
cmake ../ -DGMX_OPENMM=ON -DGMX_THREADS=OFF
make mdrun
make install-mdrun

The run command was:

mdrun-gpu -device "OpenMM:platform=Cuda,memtest=15,deviceid=0,force-device=no" -s topol.tpr -cpi state.cpt -append

#10 Updated by Jie empty about 9 years ago

Created an attachment (id=548)
md.log for patched gromacs

#11 Updated by Rossen Apostolov about 9 years ago

You're probably fetching from the master branch. Try this:

$ git clone git://git.gromacs.org/gromacs.git
$ cd gromacs
$ git checkout -t origin/release-4-5-patches

Now you will have the latest release-4-5-patches branch checked out. Later, you can run in the gromacs/ directory

$ git pull

to keep it up to date.

#12 Updated by Jie empty about 9 years ago

(In reply to comment #11)

You're probably fetching from the master branch. Try that:

$ git clone git://git.gromacs.org/gromacs.git
$ cd gromacs
$ git checkout -t origin/release-4-5-patches

Now you will have the latest release-4-5-patches branch checked out. Later, you
can run in the gromacs/ directory

$ git pull

to keep it up to date.

I checked out release-4-5-patches as you described above. It is actually identical to what I downloaded from the git web interface.

I still end up with zero entries.

Here are some warning messages:

----------------
Precision mismatch for state entry box, code precision is double, file precision is float
Precision mismatch for state entry box-rel, code precision is double, file precision is float
Precision mismatch for state entry pres_prev, code precision is double, file precision is float
Precision mismatch for state entry x, code precision is double, file precision is float
Precision mismatch for state entry v, code precision is double, file precision is float
Reading file topol.tpr, VERSION 4.5 (single precision)

Reading checkpoint file state.cpt generated: Fri Sep 17 18:29:11 2010

Gromacs binary or parallel settings not identical to previous run.
Continuation is exact, but is not guaranteed to be binary identical,
see the log file for details.

Precision mismatch for state entry box, code precision is double, file precision is float
Precision mismatch for state entry box-rel, code precision is double, file precision is float
Precision mismatch for state entry pres_prev, code precision is double, file precision is float
Precision mismatch for state entry x, code precision is double, file precision is float
Precision mismatch for state entry v, code precision is double, file precision is float

WARNING: OpenMM does not support leap-frog, will use velocity-verlet integrator.

WARNING: OpenMM supports only Andersen thermostat with the md/md-vv/md-vv-avek integrators.

WARNING: OpenMM supports only Monte Carlo barostat for pressure coupling.

WARNING: OpenMM provides contraints as a combination of SHAKE, SETTLE and CCMA. Accuracy is based on the SHAKE tolerance set by the "shake_tol" option.
---------------

Do these warnings affect the result?

#13 Updated by Berk Hess about 9 years ago

The warnings shouldn't matter.

Did you really compile in double precision?
That does not make sense with OpenMM, but I don't know if this
would cause problems.

Berk
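
The run output itself can help answer this: each "Precision mismatch" line quoted in comment #12 names the precision of the running binary ("code precision") and of the checkpoint file ("file precision"). Below is a minimal sketch that pulls the code precision out of a saved copy of that output. The function name code_precision and the file name used in the usage line are hypothetical, and the pattern assumes the message wording shown in comment #12.

```shell
# Print the distinct "code precision" values reported in Precision mismatch lines.
# Usage: code_precision FILE   (FILE is a saved copy of the mdrun output)
code_precision() {
    # Keep only the word after "code precision is" on matching lines,
    # then collapse repeated values.
    sed -n 's/^Precision mismatch for state entry .*, code precision is \([a-z]*\),.*/\1/p' "$1" | sort -u
}
```

For the messages in comment #12, "code_precision mdrun.out" would print "double", which suggests the binary that read the checkpoint was built in double precision.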

#14 Updated by Jie empty about 9 years ago

(In reply to comment #13)

The warnings shouldn't matter.

Did you really compile in double precision?
That does not make sense with OpenMM, but I don't know if this
would cause problems.

Berk

I compiled both OpenMM and Gromacs following the installation guide. Nothing was changed for those.

Do you mean OpenMM does not support double precision? The GPU I use is a Tesla S2050 (Fermi), which supports double precision. Will this be a problem?

#15 Updated by Berk Hess about 9 years ago

Your hardware is not a problem.
But I think you do have a problem when you compile Gromacs in double precision
with OpenMM, since I assume OpenMM expects single precision input/output arrays.

Could you check if you compiled in double precision by doing:
grep GMX_DOUBLE CMakeCache.txt

I get:
GMX_DOUBLE:BOOL=OFF
...

Berk

#16 Updated by Szilárd Páll about 9 years ago

(In reply to comment #14)

(In reply to comment #13)

The warnings shouldn't matter.

Did you really compile in double precision?
That does not make sense with OpenMM, but I don't know if this
would cause problems.

Berk

I compiled both OpenMM and Gromacs as the installation guide. Nothing changed
for those.

Do you mean OpenMM does not support double precision? The GPU I used is Tesla
S2050 (Fermi), which supports double precision. Will this be a problem?

Gromacs-GPU does not support double precision.

#17 Updated by Jie empty about 9 years ago

(In reply to comment #15)

Your hardware is not a problem.
But I think you do have a problem when you compile Gromacs in double precision
with OpenMM, since I assume OpenMM expect single precision input/output arrays.

Could you check if you compiled in double precision by doing:
grep GMX_DOUBLE CMakeCache.txt

I get:
GMX_DOUBLE:BOOL=OFF
...

Berk

I have GMX_DOUBLE switched off; however, GMX_DOUBLE-ADVANCED seems to be on. Can these two flags conflict with each other?

$ grep GMX_DOUBLE CMakeCache.txt
GMX_DOUBLE:BOOL=OFF
//ADVANCED property for variable: GMX_DOUBLE
GMX_DOUBLE-ADVANCED:INTERNAL=1
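
A note on reading this output: the GMX_DOUBLE-ADVANCED:INTERNAL=1 entry is CMake's own bookkeeping for a variable that has been marked as advanced (it only affects whether the option is shown in the basic view of the CMake GUIs). It does not switch anything on; only the GMX_DOUBLE:BOOL=... line holds the setting. Below is a minimal sketch for extracting just that value. The helper name cmake_cache_value is hypothetical and not part of Gromacs or CMake.

```shell
# Print the value of a CMake cache variable, skipping the NAME-ADVANCED entry.
# Usage: cmake_cache_value NAME [CACHE_FILE]   (CACHE_FILE defaults to CMakeCache.txt)
cmake_cache_value() {
    # Value lines have the form NAME:TYPE=VALUE. Anchoring on "NAME:" means
    # "NAME-ADVANCED:INTERNAL=1" and the "//ADVANCED ..." comment do not match.
    sed -n "s/^$1:[A-Z]*=//p" "${2:-CMakeCache.txt}"
}
```

Run in the build directory, "cmake_cache_value GMX_DOUBLE" prints OFF for the cache contents shown above.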

#18 Updated by Ben Hall almost 8 years ago

This bug (or another one with the same symptoms) is still present in 4.5.5. Running the benchmark suite system (with edits to write data) leads to trajectories where the first frame is correct but all other frames have Cartesian coordinates set to zero. I'd assume this is a bug specific to the writing rather than the calculations, as the simulations don't crash. Furthermore, the benchmark speeds match those reported on the GPU webpage. CPU gromacs works without issue, but requires >60 processors to get the speed of the GPU version.
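
The symptom can be checked mechanically. Below is a minimal sketch, assuming the trajectory has first been converted to the plain-text multi-frame .gro format (for example with trjconv), in which each frame is a title line, an atom-count line, one fixed-width line per atom with x, y, z in characters 21-44, and a box line. The function name zero_frames is hypothetical.

```shell
# Print the 1-based index of every frame whose coordinates are all zero.
# Usage: zero_frames FILE.gro
zero_frames() {
    awk '
    state == 0 { state = 1; next }                       # title line
    state == 1 {                                         # atom-count line
        natoms = $1 + 0; seen = 0; nonzero = 0; frame++
        state = (natoms > 0) ? 2 : 3
        next
    }
    state == 2 {                                         # atom lines
        # x, y and z are three 8-character fields starting at column 21.
        x = substr($0, 21, 8) + 0
        y = substr($0, 29, 8) + 0
        z = substr($0, 37, 8) + 0
        if (x != 0 || y != 0 || z != 0) nonzero = 1
        if (++seen == natoms) state = 3
        next
    }
    state == 3 {                                         # box line ends the frame
        if (natoms > 0 && !nonzero) print frame
        state = 0
        next
    }
    ' "$1"
}
```

If the behaviour described here is present, the output would list every frame except the first.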

I don't get the double precision error messages, though the log indicates that the initial temperature is set to 0, as reported in http://redmine.gromacs.org/issues/757

Different software versions:
openmm 3.1.1
cuda 4.0
and all compiled using intel compiler version 11.1/072
