Bug #792

mdrun did not support large file offsets on lustre file system

Added by Anonymous about 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

When I continued a run on my x86_64 Linux cluster using the command "mdrun -deffnm prod -cpi prod.cpt -append", I got the error below:

Program mdrun, VERSION 4.5.4
Source code file: checkpoint.c, line: 1734 Fatal error:
The original run wrote a file called 'prod.xtc' which is larger than 2 GB, but mdrun did not support large file offsets. Can not append. Run mdrun with -noappend For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors

I also tried recompiling GROMACS with the alternative configure option --enable-largefile, but it still did not work.

When I compiled GROMACS using CMake, no large-file-related option appeared in the CMake interface; however, I found the option for large file support in CMakeCache.txt, listed below:
.....
//Result of TEST_BIG_ENDIAN
GMX_INTEGER_BIG_ENDIAN:INTERNAL=0
//Result of test for large file support
GMX_LARGEFILES:INTERNAL=1
......

I wonder if "GMX_LARGEFILES:INTERNAL=1" means that large file support is OK on my cluster. If so, why does mdrun still tell me that large file offsets are not supported?

I further note that the Lustre file system has 128-bit file identifiers. I guess this may be the reason why mdrun did not support large file offsets.

History

#1 Updated by Anonymous about 8 years ago

I searched for this issue on the internet and found the same problem reported before, but unfortunately it was never resolved. It should be pointed out that this bug is probably related to the file system, e.g., Lustre.
http://redmine.gromacs.org/issues/315

#2 Updated by Roland Schulz about 8 years ago

Do you still have a previous CPT file from a point before the 2 GB mark was reached? If so, try to continue from that CPT file. Then, the next time you continue, the xtc file should again be over 2 GB; check whether it is possible to continue at that point.

The reason it is important to recreate the CPT file: if your earlier build did not have large file support, then the CPT file is not OK. But the CPT file written before the xtc file got too large should be fine to continue from.

#3 Updated by Anonymous about 8 years ago

Thanks for the reply. Yes, when the xtc file does not exceed 2 GB, the restart works. The problem is that when the xtc file exceeds 2 GB, the restart fails with the cpt file generated after the 2 GB mark was reached.

My operating system is SUSE Linux 10 SP2 with kernel 2.6.18-53, and the file system is Lustre. The cluster consists of 2000 Intel(R) Xeon(R) E5450 CPUs. The versions of autoconf and automake are 2.59 and 1.96, respectively. Both gcc and icc have been tested.

Following the instructions from Berk in http://redmine.gromacs.org/issues/341, I made several tests, as shown below:

(1) When the size of the xtc file was 2.5 GB, I ran the command "gmxdump -cp prod.cpt" and got the following output:

number of output files = 3
output filename = prod.log
file_offset_high = 0
file_offset_low = 10072155
file_checksum_size = 1048576
file_checksum = 3e1969a8339568f1649381bbe0872650
output filename = prod.xtc
file_offset_high = -1
file_offset_low = -1
file_checksum_size = -1
file_checksum = 00000000000000000000000000000000
output filename = prod.edr
file_offset_high = 0
file_offset_low = 24934260
file_checksum_size = 1048576
file_checksum = e99f70e0b18b6ed79bb9df5090de5044

(2) grep SIZEOF src/config.h

#define SIZEOF_INT 4
#define SIZEOF_LONG_INT 8
#define SIZEOF_LONG_LONG_INT 8
#define SIZEOF_OFF_T 8
#define SIZEOF_VOIDP 8

(3) sizeof(int) = 4 and sizeof(off_t) = 8 are confirmed by the two programs provided by Berk.

(4) I modified the checkpoint.c to print some variables.
SIZEOF_GMX_OFF_T 8
INT_MAX 2147483647
LONG_MAX 9223372036854775807
SHRT_MAX 32767

I do not know why the procedure xdr_int(xd, i) gives i = -1 when the file offset is larger than 2 GB. Could you give me any advice? Thank you very much!

#4 Updated by Roland Schulz about 8 years ago

  • Status changed from New to Closed

The problem is that the binary which wrote the checkpoint had problems with large files. You need to restart from a previous checkpoint with the new binary. That new binary should create a checkpoint file which has an offset > 0 (even after the XTC is bigger than 2 GB), and then restarting from that should work.

Only bug reports should be put on Redmine; this is a request for help, which should go to the mailing list. Please only reopen this bug or post further comments if there is evidence that this is a bug. All other questions should go to the list.
