Bug #3234

Signal: Floating point exception Signal code: Floating point divide-by-zero

Added by Dave M 10 months ago. Updated 10 months ago.

Status: Accepted
Priority: Normal
Assignee:
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Original post here (with error): https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2019-December/127603.html

No error is printed in the log file; the error appears only in the terminal.
tpr file here: https://drive.google.com/open?id=1kZQGgJDgNrsKU16Wy1Qx-b5ROCNDOeC8
cpt file here: https://drive.google.com/open?id=13-zXhOCok7EHVZHn8K9i7FHhd2RI38Y3
To reproduce, run the following using 1 GPU and 8 CPUs; the error should appear within a few ns:
    gmx_mpi mdrun -s my.tpr -cpi my.cpt -rdd 2.0

History

#1 Updated by Dave M 10 months ago

Log file here: https://drive.google.com/open?id=1uex3NOEK21hBejYyULDRYzKSxD1psdJo

#2 Updated by Paul Bauer 10 months ago

Thanks!
If you are fine with it, I'll attach the files directly here so we have them for future reference.

#3 Updated by Dave M 10 months ago

Hi Paul,

I understand it might take a while to fix this, if it is a bug. I am now getting this error quite often for my other systems using the same version, GROMACS 2019.4.
I can't restart from the cpt files, as the run gets killed again at the same time step. Although it is not ideal, I think I can restart the job from a gro file containing velocity information and write a new tpr file. I understand that precision will be lost, but that may be fine because it is a coarse-grained MARTINI system. That's the best I think I can do. If you have any suggestions, please let me know. I need to stick to the same GROMACS version as I am in the middle of many simulations.

#4 Updated by Dave M 10 months ago

Ah! Just to add: I could restart the job, and it ran without any issues, by generating a new.tpr file from the same original.cpt file, for example:
grompp -f original.mdp -c original.tpr -o new -t original.cpt -p original.top -n index.ndx
mdrun -deffnm new

Surprisingly, as I said before, the job gets killed at the same time step as before if I restart using: -s original.tpr -cpi original.cpt

#5 Updated by Paul Bauer 10 months ago

Hello Dave,

I hope to find more time to look into this on Monday.
To make debugging quicker, can you upload a cpt file from just before the crash, so I don't need to run a long simulation to trigger the bug?
It is actually good that this is reproducible for you, because it will hopefully make it easier for me to find the bug as well.
If you have time for one more test, can you build a debug version of GROMACS and try to trigger the bug there?
If not, I'll do it, but it might help us find the issue faster.

Cheers

Paul

#6 Updated by Dave M 10 months ago

Hi Paul,
Thanks for your time.
Sure, I can run a test, but please let me know what you mean by building a debug version of GROMACS. My installation was already built with -DCMAKE_BUILD_TYPE=Debug. Did you mean something else? I have sent you the cpt file by email.

#7 Updated by Berk Hess 10 months ago

Can you please upload the files directly to redmine instead of on a server where we need to request access?

#8 Updated by Dave M 10 months ago

The files should be available now (for a while). Paul downloaded them and then I deactivated the link. I asked him not to make the files public for a few months.

#9 Updated by Berk Hess 10 months ago

  • Status changed from New to Accepted
  • Assignee set to Erik Lindahl

The division by zero issue is in the gamma distribution code, as your error log also shows.
In particular, the uniformly distributed variable v is 0, which leads to z=0, which means we call log(0).
We should handle the case z=0, but I don't know how. Erik, who wrote this code, should fix this.

I also see that we would get a division by zero, or very large numbers, when x is 0 or close to 0. But that can simply be fixed by multiplying all terms in the comparison by x.

            while (true)
            {
                const result_type u = uniformDist(g);
                const result_type v = uniformDist(g);
                const result_type w = u * (1 - u);

                if (w != 0)
                {
                    const result_type y = std::sqrt(c / w) * (u - result_type(0.5));
                    x                   = b + y;

                    if (x >= 0)
                    {
                        const result_type z = 64 * w * w * w * v * v;

                        if (z <= 1.0 - 2.0 * y * y / x)
                        {
                            break;
                        }
                        if (std::log(z) <= 2.0 * (b * std::log(x / b) - y))
                        {
                            break;
                        }
                    }
                }
            }
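
For illustration only, the two points above could be handled roughly as in the sketch below. This is not the fix that went into GROMACS. The names acceptCandidate and sampleGammaSketch are hypothetical, the definitions of b and c are not shown in the snippet above and are assumed here to follow Best's rejection algorithm for shape parameters greater than 1 (b = alpha - 1, c = 3*alpha - 0.75), and treating z=0 as "accept" is an assumption based on log(z) going to minus infinity as z goes to 0. Rejecting x=0 outright is a further assumption that keeps log(x/b) away from log(0).

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical helper (not a GROMACS function): acceptance test for one
// candidate, written so that it never divides by x and never calls log(0).
template<typename Real>
bool acceptCandidate(Real x, Real y, Real z, Real b)
{
    if (x <= 0)
    {
        // x == 0 would need 2*y*y/x and log(x/b) = log(0); reject and redraw.
        return false;
    }
    // Original test z <= 1 - 2*y*y/x, with all terms multiplied by x (> 0 here).
    if (z * x <= x - 2 * y * y)
    {
        return true;
    }
    if (z == 0)
    {
        // Assumption: log(z) -> -infinity as z -> 0, so the logarithmic test
        // below holds in the limit; accept without evaluating log(0).
        return true;
    }
    return std::log(z) <= 2 * (b * std::log(x / b) - y);
}

// Hypothetical sampler for a gamma distribution with shape alpha > 1 and
// unit scale, using the same loop structure as the snippet above.
template<typename Real, typename Generator>
Real sampleGammaSketch(Real alpha, Generator& g)
{
    assert(alpha > 1);
    std::uniform_real_distribution<Real> uniformDist(0, 1);
    const Real b = alpha - 1;              // assumed definition, see lead-in
    const Real c = 3 * alpha - Real(0.75); // assumed definition, see lead-in
    while (true)
    {
        const Real u = uniformDist(g);
        const Real v = uniformDist(g);
        const Real w = u * (1 - u);
        if (w != 0)
        {
            const Real y = std::sqrt(c / w) * (u - Real(0.5));
            const Real x = b + y;
            const Real z = 64 * w * w * w * v * v;
            if (acceptCandidate(x, y, z, b))
            {
                return x;
            }
        }
    }
}
```

With a standard generator such as std::mt19937, a call like sampleGammaSketch(3.0, gen) should then return a positive, finite sample without triggering a floating point divide-by-zero, even when v happens to be exactly 0.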
