mdrun did not support large file offsets on lustre file system
When I continued a run on my x86_64 Linux cluster using the command "mdrun -deffnm prod -cpi prod.cpt -append", I got the following error:
Program mdrun, VERSION 4.5.4
Source code file: checkpoint.c, line: 1734
Fatal error:
The original run wrote a file called 'prod.xtc' which is larger than 2 GB, but mdrun did not support large file offsets. Can not append. Run mdrun with -noappend
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
I also tried recompiling GROMACS with the option --enable-largefile, but it still did not work.
When I compiled GROMACS using cmake, I found no option related to large files in the cmake interface. However, I did find the large file support entries in CMakeCache.txt listed below.
//Result of TEST_BIG_ENDIAN
//Result of test for large file support
I wonder whether "GMX_LARGEFILES:INTERNAL=1" means that large file support is OK on my cluster. If yes, why does mdrun still tell me that large file offsets are not supported?
I further note that the Lustre file system has 128-bit file identifiers. I therefore guess that this is the reason why mdrun did not support large file offsets.
#2 Updated by Roland Schulz over 8 years ago
Do you still have the previous CPT file from a point before the 2 GB were reached? If so, try to continue from that CPT file. Then, the next time you continue, the xtc file should again be over 2 GB; check whether it is now possible to continue.
The reason it is important to recreate the CPT file: if your earlier build didn't have large file support, then the CPT file is not OK. But the CPT file from before the xtc file got too large should be OK to continue from.
#3 Updated by Anonymous over 8 years ago
Thanks for the reply. Yes, when the xtc file does not exceed 2 GB, the restart is OK. The problem is that once the xtc file exceeds 2 GB, the restart fails with the cpt file generated after the 2 GB were reached.
My operating system is SUSE Linux 10SP2 with the kernel 2.6.18-53, and the file system is Lustre. The cluster consists of 2000 Intel(R) Xeon(R) CPU E5450. The versions of autoconf and automake are 2.59 and 1.96, respectively. Both gcc and icc have been tested.
Following the instructions from Berk in this post http://redmine.gromacs.org/issues/341, I made several tests as shown below:
(1) When the xtc file was 2.5 GB in size, I ran the command "gmxdump -cp prod.cpt" and got the following output:
number of output files = 3
output filename = prod.log
file_offset_high = 0
file_offset_low = 10072155
file_checksum_size = 1048576
file_checksum = 3e1969a8339568f1649381bbe0872650
output filename = prod.xtc
file_offset_high = -1
file_offset_low = -1
file_checksum_size = -1
file_checksum = 00000000000000000000000000000000
output filename = prod.edr
file_offset_high = 0
file_offset_low = 24934260
file_checksum_size = 1048576
file_checksum = e99f70e0b18b6ed79bb9df5090de5044
(2) grep SIZEOF src/config.h
#define SIZEOF_INT 4
#define SIZEOF_LONG_INT 8
#define SIZEOF_LONG_LONG_INT 8
#define SIZEOF_OFF_T 8
#define SIZEOF_VOIDP 8
(3) sizeof(int) = 4 and sizeof(off_t) = 8 are confirmed by the two programs provided by Berk.
(4) I modified the checkpoint.c to print some variables.
I do not know why the procedure xdr_int(xd,i) gives i = -1 with *xd larger than 2 GB. Could you give me any advice? Thank you very much!
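For readers who want to repeat test (3), here is a minimal sketch of such a size check. It is not Berk's original program; the function names off_t_is_large and print_offset_sizes are made up for illustration. The only assumption about the platform is that it honours the _FILE_OFFSET_BITS macro, as glibc does.

```c
/* Request 64-bit file offsets on platforms where off_t would otherwise
   be 32 bits wide. This must come before any system header. */
#define _FILE_OFFSET_BITS 64

#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if off_t is at least 64 bits wide and
   can therefore address files beyond 2 GB, otherwise 0. */
static int off_t_is_large(void)
{
    return sizeof(off_t) * CHAR_BIT >= 64;
}

/* Hypothetical helper: prints the sizes that are compared against the
   SIZEOF_INT and SIZEOF_OFF_T entries in config.h. */
static void print_offset_sizes(void)
{
    printf("sizeof(int)   = %zu\n", sizeof(int));
    printf("sizeof(off_t) = %zu\n", sizeof(off_t));
    printf("large file offsets: %s\n", off_t_is_large() ? "yes" : "no");
}
```

Calling print_offset_sizes() from main on the compile host shows whether the compiler and flags in use give a 64-bit off_t. It says nothing about the binary that wrote an existing checkpoint.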
#4 Updated by Roland Schulz over 8 years ago
- Status changed from New to Closed
The problem is that the binary which wrote the checkpoint had problems with large files. You need to restart from a previous checkpoint with the new binary. That new binary should create a checkpoint file which has an offset >0 (even after the XTC is bigger than 2 GB), and then restarting from that should work.
Only bug reports should be put on redmine. This is asking for help, which should be done on the mailing list. Please only reopen this bug or post further comments once there is evidence that this is a bug. All other questions should go to the list.
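To illustrate what an offset >0 looks like in the two fields printed by gmxdump, here is a rough sketch. It assumes that file_offset_high and file_offset_low are the upper and lower 32 bits of one 64-bit offset, and that -1/-1 marks an offset that could not be recorded. The struct and function names are hypothetical, and this is not the code from checkpoint.c.

```c
#include <stdint.h>

/* Hypothetical record mirroring the two per-file fields in the gmxdump output. */
typedef struct {
    int32_t high; /* upper 32 bits of the offset, or -1 if unknown */
    int32_t low;  /* lower 32 bits of the offset, or -1 if unknown */
} offset_pair;

/* Split a 64-bit offset into two 32-bit halves.
   A negative offset (position not available) is stored as -1/-1. */
static offset_pair split_offset(int64_t offset)
{
    offset_pair p;
    if (offset < 0) {
        p.high = -1;
        p.low = -1;
    } else {
        /* Converting a value above INT32_MAX to int32_t is
           implementation-defined; common compilers wrap around. */
        p.high = (int32_t)(uint32_t)((uint64_t)offset >> 32);
        p.low = (int32_t)(uint32_t)((uint64_t)offset & 0xFFFFFFFFu);
    }
    return p;
}

/* Recombine the halves; the -1/-1 sentinel yields -1 (offset unknown). */
static int64_t join_offset(offset_pair p)
{
    if (p.high == -1 && p.low == -1) {
        return -1;
    }
    return (int64_t)(((uint64_t)(uint32_t)p.high << 32)
                     | (uint64_t)(uint32_t)p.low);
}
```

Under these assumptions, a 2.5 GB xtc file written by a binary with working large file support would give high = 0 and a non-sentinel low value, instead of the -1/-1 pair seen for prod.xtc above.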