PME problem in 4.0.3 results in memory allocation crash
Created an attachment (id=350)
Tarball to reproduce problem
I have run into a bug evaluating energies of a single simulation snapshot. Energy evaluation works fine in Gromacs 3.3.1 and in 4.0.3 without PME, and also with PME on and with the partial charges on my small molecule turned off. But with the partial charges turned on, mdrun crashes instantly with the following error:
Program dmdrun, VERSION 4.0.3
Source code file: fftgrid.c, line: 78
Failed to allocated 578699296 bytes of aligned memory.
It seems to be a PME problem, since (a) it works fine without PME, and (b) the referenced code is the FFT code. At the same time, there is more to the problem than just PME, since running this works fine with the partial charges on the "Protein" (here a small molecule) turned off.
A tarball to reproduce the bug is attached. Either of the enclosed *.sh files will result in the described crash.
Note that I compiled with FFTW 2.
#3 Updated by David Mobley over 11 years ago
I didn't intentionally put that in my input file. See fullcoupling.mdp. Concerning PME, I just set:
fourierspacing = 0.1
; FFT grid size, when a value is 0 fourierspacing will be used =
fourier_nx = 0
fourier_ny = 0
fourier_nz = 0
; EWALD/PME/PPPM parameters =
pme_order = 6
ewald_rtol = 1e-06
epsilon_surface = 0
optimize_fft = no
The pme_order, ewald_rtol, and fourierspacing are what I want. I don't even know what nkx, nky, and nkz do...
If there is a problem with the topology let me know and I can fix it.
#4 Updated by Erik Lindahl over 11 years ago
I don't think this is a bug; the memory dimensions are quite correct for your system and PME settings (a spacing of 0.1 nm with all box dimensions 40 nm creates a 416^3 grid, which in double precision requires ~580 MB).
This error message simply means that the system returned NULL (out of memory) when we called malloc() with this size, and there isn't a whole lot we can do about that.
FFTW2/FFTW3/Intel MKL might behave slightly differently, depending both on the amount of memory they allocate internally and on whether we need an extra scratch array to copy data. We always try to work in place, but that is not possible with all FFT libraries.
Make sure you're compiling in 64-bit mode; otherwise you might be hitting the 2 GB process limit.
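For reference, a back-of-the-envelope check (a hypothetical sketch, not GROMACS source; the padding formula assumes the standard FFTW real-to-complex layout) reproduces the reported allocation size almost exactly:

```python
# Hypothetical sanity check, not GROMACS code: estimate the size of one
# real-space PME grid for the settings in this report.
n = 416                       # grid dimension GROMACS picked for a 40 nm box
                              # at 0.1 nm fourierspacing (400 rounded up to
                              # an FFT-friendly size, per comment #4)
bytes_per_real = 8            # double precision
padded_nz = 2 * (n // 2 + 1)  # real-to-complex padding of the last dimension
                              # (assumed FFTW-style layout)
grid_bytes = n * n * padded_nz * bytes_per_real
print(grid_bytes)             # 578699264
```

That is within 32 bytes of the 578699296 in the error message; the small remainder is presumably overhead from the aligned allocator.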
#5 Updated by David Mobley over 11 years ago
OK, but this is weird:
- It works fine with exactly the same system as long as the partial charges are turned off on the solute, but not the solvent.
- It works fine in 3.3.x
I don't get how that fits with the "it's just too big a system" scenario.
I guess I can do a workaround -- I'm just trying to reproduce a calculation I did in 3.3.x; if 4.x can't handle it I'll just have to revisit 3.3.x and redo with something 4.x can handle.
#6 Updated by David van der Spoel over 11 years ago
The problem happens when the B array is being allocated (because of free energy there are two FFT grids). Is this the difference between having the charges off and on, i.e. that when you turn them on you turn on free energy as well?
Nevertheless, the 2 GB limit is per block of memory even on a 32-bit machine, as far as I know, so it shouldn't be a problem. Both machines I used for testing had 4 GB of RAM; one was an Apple (32-bit), the other an x86_64 Linux box.
#7 Updated by David Mobley over 11 years ago
The B array should be allocated in all cases: in all cases I have free_energy on and there is a distinct B state. In the cases that work, the B state has modified charges on the solute; in the cases that don't work (in 4.0.3), both the A and B states have zero charges on the solute and it is the LJ interactions that are being modified.
I can compile with fftw3, I suppose, and see if that helps.
I'm on an x86_64 Linux box.
#8 Updated by Berk Hess over 11 years ago
PME is only aware of free energy when charges are changed.
I assume the error is simply due to an overly large memory-block request and that the error is handled correctly. Or is it not?
You probably didn't intend to use a box of more than 40 nm.
Can we close this bug?
#9 Updated by David Mobley over 11 years ago
Well, the problem did go away when I recompiled with fftw3.
I did actually intend to use a box that large. I was trying to minimize the PME contribution to the total energy for a test, but at the same time still use PME.
It is puzzling to me that, if this is not a bug, I can get away with doing this in (a) 3.3.x with FFTW2, (b) 4.0.x with FFTW3, but not (c) 4.0.x with FFTW2.
Anyway, I'm done with this, but I still think there may be some underlying problem here.
#10 Updated by Berk Hess over 11 years ago
I agree that it is somewhat strange that this depends on the FFTW version. But if the problem is running out of memory, it could simply be that you are close to the limit and FFTW3 is more memory efficient than FFTW2.
But Gromacs seems to give a clear error message.
So I would say there are no Gromacs issues here.