Project

General

Profile

Feature #1489

Don't solely rely on filename extension

Added by Roland Schulz over 5 years ago. Updated over 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
core library
Target version:
-
Difficulty:
uncategorized
Close

Description

It should be possible to read files even if the file-name-extenions is incorrect or missing. Examples:
  • incorrect:
    • backup files: #conf.gro.1#
    • odd names for vmd plugin files: lammpstrj
  • missing: named pipes: <(zcat conf.gro.gz)

We could either provide a way to specify the file-type as an extra option following the file-name. I'm not sure how easy that is to integrate into the command-line option parsing. Or we could use something like libmagic to detect the file-type from the header of the file.

History

#1 Updated by Teemu Murtola over 5 years ago

  • Subject changed from Don't solely rely on file-name-extension to Don't solely rely on filename extension
  • Category set to core library

There are about 50 calls to fn2ftp() that would need to be changed. And the information about the user-selected file type passed to all these locations. This is the biggest change here; doing something for the command-line parsing is easy. In particular for transport through existing C code, this requires some changes.

We could possibly piggyback the information into the returned file name string to avoid changing all the code that needs to transport the information opaquely, but that would mean going through the whole code base to ensure that the string is never used for anything else than functions that are aware of this encoding.

Detecting the format from reading the file can also be challenging for some of the uses. Currently, fn2ftp() is used whether to open the file in text or binary mode in Windows. Also, in order to support pipes, it probably means that we can't rely on rewinding the input file after reading enough to determine the type, which complicates things. And for text format files, we may need to read quite a lot to recognize the format (and the recognition code can be difficult as well). VMD plugins also bring complications, since the set of file types we potentially need to recognize is unbounded. The plugin code also seems to currently rely on the extension to recognize the format.

#2 Updated by David van der Spoel over 5 years ago

This is not a good idea IMHO. It will break people's workflows and mess up the backups if you can read and write random backup files. Command completion will not work either. The only way this could work somewhat reliably would be if there were magic numbers in files. Since we have a history of a few dozen filenames to support it will not work for those. In fact only the new tng file format has a useful magic number.

#3 Updated by Roland Schulz over 5 years ago

I'm OK with it, if you want to prohibit writing to backup files. Not sure it is needed because if the user wants to shoot himself in the foot it is already possible (mv, cp, ..). The example was meant for reading, where I would find it useful. But it was just an example.

Command completion doesn't need to be affected by this. It could stay as is. If the user wants to use a non-std file extension, he needs to type it in manually.

File detection is possible for files without magic number. E.g. libmagic (used by the file command) detects PDB file format (at least the version on OpenSuse). Thus I don't see any reason why this wouldn't work also for other Gromacs files. But I think I also tend to prefer the extra command line option. Of course it would be only needed if the file-extension isn't correct. Thus current scripts/workflow would work as is.

Possible formats I think might be possible:
- add at the end of file name (e.g. "-f conf.gro:gro")
- add to the end of file flag (e.g. "-f-gro conf.gro")
- add as extra flag before or after file flag (e.g. "-f conf.gro -ft gro")

#4 Updated by Mark Abraham over 5 years ago

Recognizing our own backup-file naming format when reading should be fairly straightforward to do, which lets us determine the right file format. There's probably bad code that depends on being able to parse out an extension manually, which we'd have to go and fix to use the proper interface. I don't think we should support writing to arbitrary file names because it will be too easy to misuse, and now you get mysterious failures later with no good way to find out how to untangle yourself.

Currently no form of named pipe works (automatic, or correctly named with mkfifo). Fixing the ability to use

1$ mkfifo temp.gro
1$ zcat archive.gro.gz > temp.gro
2$ gmx tool -f temp

would be something to do first, I think. Currently gmx tool stalls for reasons I have not looked into (even though zcat completes because it has delivered through the pipe). I think that should be doable (e.g. http://stackoverflow.com/questions/21468856/check-if-file-is-a-named-pipe-fifo-in-c).

Then, handling an automatically-named pipe such as

$ gmx tool -f <(zcat archive.gro.gz)

should just be a matter of choosing a method to detect/convey file type. Offhand, -ft gro seems best to me.

Then, I think https://gerrit.gromacs.org/#/c/3397/2 will make more sense.

#5 Updated by Roland Schulz over 5 years ago

Using named pipe is actually not straight forward. The first step is simple and removing an obsolete work-around for feof (see 3397/3). But we have many places were we open the file more than once. E.g.

    get_stx_coordnum(infile, &natom);
    init_t_atoms(&atoms, natom, TRUE);
    snew(x, natom);
    snew(v, natom);
    read_stx_conf(infile, title, &atoms, x, v, &ePBC, box);

Of course this is bad practice. This mean that the file is readed unnecessarily more than once (and uncompressed for compressed files). Instead the file should be opened just once and the same handle to that file should be used by get_stx_coordnum and read_stx_conf or all should be combined into one function (with the allocation inside).

I'm not sure we want to make those changes to the C Api or whether we should simply wait for this for the C++ api and make sure we don't design in that practice for the C++ api. If so adding support for named pipe (and removing the build in uncompress) should just wait.

Support for specifying the file-type could be added independent because it would still be useful for reading files named incorrectly (see other examples than named pipes in original report).

#6 Updated by Teemu Murtola over 5 years ago

It is not enough to just open the file once, it also needs to be only read once from the beginning to the end. There are also several places where we do a rewind() on the input file (including reading .gro files). .g96 has an explicit note about rewind() not working for pipes, so it instead opens the same file twice.

#7 Updated by Teemu Murtola over 5 years ago

One approach to do the file type propagation would be to declare a

struct gmx_filename_t
{
    char *path;
    int   ftp;
};

(perhaps named differently to make it clear this also contains the type), and make t_filenm and all the ftp2fn(), opt2fn() etc. use this. This could be propagated as a pointer to all functions that currently call fn2ftp(), and only populated once, at the source where the file type is known. This way, the compiler should catch all places where the type needs to propagate, unlike in the alternative of piggybacking the type into the string itself.

And as Roland says, this doesn't need to break anything that used to work; it would only allow for additional things, like inspecting the contents of backup files without renaming them first (what's the point of backups if you can't check which of them is the one you want?). And then it is straightforward to change the behavior in one place to, e.g., only support this file typing for input files, and/or only for special devices like named pipes or /dev/null.

Also available in: Atom PDF