Segmentation fault with free energy changes and multiple GPUs
I'm running some free energy transformations on a 20-CPU, 2-GPU node. Every once in a while a simulation crashes with "Segmentation Fault" (mpirun notifies me that the process "exited on signal 11 (Segmentation Fault)"). There is no other output that I can find that indicates the error.
This can happen several hours into the simulation. Typically a restart from the last checkpoint is all that is needed for the simulations to continue. In some simulations in the series (i.e., same system with different lambda values, or near identical systems at same lambda values) the segfault doesn't happen at all.
I don't experience these segfaults on the same nodes when running normal (non-free energy) simulations. I have the issue with both 5.0.5 and 5.1-rc1.
#2 Updated by James Barnett about 4 years ago
I just tested it with a single GPU, and the segfault does happen when a single GPU is specified with -gpu_id (although I am using the single GPU on multiple ranks).
Attached are two tpr files. One is a solvation free energy, and the other uses topology A/B form.
Additionally I normally run with -multidir so I have five of these different systems going at once, but the segfault occurs even with just one system in a simulation.
#3 Updated by James Barnett about 4 years ago
I've ended up doing the first few lambda states of these free energy sims (about 50 individual sims so far) with "-nb cpu" and haven't experienced a segfault yet (I had a couple of them going with GPUs, and one of them crashed with the segfault in the same time period).
So it just seems to be the combo of free energy and a GPU for me that causes the segfault.
#4 Updated by James Barnett about 4 years ago
It seems that once I get to a certain lambda value (depending on the system), some of the systems segfault even with just CPUs. Interestingly, starting a simulation over from the beginning will usually result in a segfault at the same point where it appeared before. I tried 5.0.4, and it results in the same situation. I'll do some more troubleshooting on my setup.
#5 Updated by James Barnett about 4 years ago
The issue was that I was using the 1-1-48 soft core path for these systems (with sc-r-power = 48, sc-alpha = 0.0025). Switching back to the GROMACS default path (with sc-alpha = 0.5, sc-r-power = 6) solved the issue with no segfaults. The segfaults started occurring after the charges had been turned off linearly and the vdw interactions started to change via vdw-lambdas.
So, this issue can probably be closed. I can't reproduce the segfault on small sample systems (methane in water, two methanes in water, etc.), so I'm not sure whether this is a bug or whether 1-1-48 is just not appropriate for this type of large system.
I had forgotten I had changed the sc path (thought I had only changed from one GPU to two and upgraded from 5.0.4 to 5.0.5).
#6 Updated by Erik Lindahl about 4 years ago
Based on my experience, the problem with sc-r-power==48 appears to be that it can result in very large floating-point numbers in corner cases. For the actual free energy code (which never runs on the GPU) I've tried to work around that by always using double precision for the directly affected variables. However, this might be a case of the large numbers spilling over onto normal forces (you could try to check them). In principle the CPU & GPU kernels are both single precision (unless you're using double for the CPU code), but since the algorithms to evaluate forces need to be slightly different it might only cause problems for the GPU kernels.
To be able to debug it we need a set of input files that reproduces the problem within a few steps, even if it's only possible to see with a large system. Try checkpointing frequently and restart from those.
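A minimal sketch of the precision concern with a 48th power, assuming IEEE-754 single precision. This is not the GROMACS kernel code: the helpers `to_float32` and `pow_single` are made up for illustration, and the base values 10.0 and 0.1 are arbitrary. It only shows that a 48th power leaves the single-precision range for quite modest inputs, while the same values stay representable in double precision.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value.

    struct raises OverflowError when the value does not fit in a 32-bit
    float, so that case is mapped to a signed infinity here.
    """
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def pow_single(base, n):
    """Compute base**n by repeated multiplication, rounding to single each step."""
    b = to_float32(base)
    result = to_float32(1.0)
    for _ in range(n):
        result = to_float32(result * b)
    return result


if __name__ == "__main__":
    for base in (10.0, 0.1):  # illustrative values only
        for power in (6, 48):
            print(base, power, "single:", pow_single(base, power),
                  "double:", base ** power)
```

With a power of 6 both precisions give ordinary finite numbers. With a power of 48 the single-precision result for 10.0 overflows to infinity and the one for 0.1 underflows to zero, and either can turn into NaN further down a calculation (for example through inf - inf or 0 * inf).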
#7 Updated by James Barnett about 4 years ago
- File segfault.cpt added
- File segfault.tpr added
- File test.edr added
I've attached a checkpoint file and tpr file. The segfault happened about 300 steps after this checkpoint.
Not sure if it'll help, but I ran from the last checkpoint and saved the energy file every step until it segfaulted again (test.edr). Several values are -nan on the last step.
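A small sketch of how the first bad step could be located once the energy terms have been written out as text, for example as an .xvg file from gmx energy. It assumes whitespace-separated numeric columns with the time in the first column and comment lines starting with `#` or `@`. The function name `first_nonfinite_row` and the sample rows are hypothetical and are not taken from test.edr.

```python
import math


def first_nonfinite_row(lines):
    """Return (time, bad_column_indices) for the first row containing NaN or inf.

    Column 0 is treated as the time; returns None if every value is finite.
    """
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and XVG comment/directive lines.
        if not stripped or stripped[0] in "#@":
            continue
        values = [float(field) for field in stripped.split()]
        bad = [i for i, v in enumerate(values[1:], start=1)
               if not math.isfinite(v)]
        if bad:
            return values[0], bad
    return None


# Hypothetical data for illustration only.
sample = [
    "# made-up example rows",
    '@ s0 legend "Potential"',
    "0.000  -1.0e5  300.1",
    "0.002  -1.0e5  -nan",
    "0.004  nan  nan",
]

if __name__ == "__main__":
    print(first_nonfinite_row(sample))
```

Knowing which term goes non-finite first, and at which step, narrows down where to take the checkpoint for a short reproducer.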