Project

General

Profile

Feature #1193

planning for new trajectory file reading component

Added by Mark Abraham over 6 years ago. Updated over 3 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
analysis tools
Target version:
-
Difficulty:
uncategorized
Close

Description

Objectives
  • Provide a GROMACS component that can manage the process of reading
    frames from a trajectory file
  • Use standard C++ idioms (e.g. design patterns) to achieve
    flexibility and extensibility - adding support for a new kind of I/O,
    trajectory format or trajectory frame object should be just a matter
    of subclassing an interface and coding up the details
  • Provide an interface such that client code does not need to be aware
    of any low-level details of file reading or trajectory formats
  • Roughly, replace the functionality of read_first_frame() and
    read_next_frame() by the creation of a management object and looping
    over calls to its readNextFrame() method
Features
  • Client code will have responsibility for creating selections after
    trajectory frames have been read (I hope that's consistent with the
    current form of the selection code!)
  • New memory will be allocated for each new frame read, and stored in
    a shared_ptr
  • Memory for old frames will be deallocated when the last shared_ptr
    referring to it goes out of scope (best practice: use only named
    shared_ptrs)

See attached graphic for draft class diagram. Methods and members have
'+', '-' and '#' prefixes for public, private and protected visbility.
Unfortunately, there's apparently no visual indicator for virtual or
static methods. Explanations of how things might work follows.

TrajectoryReadManager
  • is intended for use in tools and mdrun -rerun
  • contains handles to various objects (FileIOManager,
    TrajectoryReaderStrategy, CoordinateFrame)
    • for taking care of details that vary with context
    • that are constructed by methods that implement the Factory Method
      design pattern (createReader(), createFileManager)
  • uses the objects' interfaces to manage the control and data flow
    between objects (but exception to that principle below)
  • has a method readNextFrame() that
    • implements the Template design pattern for the high-level
      frame-reading algorithm (read header, apply conditions, read body,
      apply conditions
    • calls concrete protected methods/hooks to do the work that will
      permit it to return
      • a correct frame in the form the client code expects,
      • an end-of-file event, or
      • exceptions that only a client can handle
    • handles all exceptions generated in this component (even if only
      to re-throw some other exception)
    • has hooks so that future subclasses can override them to
      specialize readNextFrame() if there is need
    • builds a coordinate frame suitable for use by the client by using
      the Prototype design pattern - that is, by cloning a prototype
      supplied by the client. (That prototype should have empty
      containers). This avoids needing some kind of enumeration of types of
      CoordinateFrame just so that client code can tell this code what to
      construct - the object type is its own identifier. So a serial tool
      would just build an empty PrimitiveFrame to pass in (and
      PrimitiveFrame has a static method to do just that).
  • returns only coordinate frames that satisfy conditions specified by
    the client (e.g. frames every 2ns) (This could be a seperate
    responsibility in a separate class. Probably we want the feature of
    skipping reading the body of a frame whose header failed the
    condition. Implementing that anywhere outside TrajectoryReaderStrategy
    requires that it make available the partly-filled temporary
    PrimitiveFrame. Is keeping "condition testing" as a "management role"
    despite violating encapsulation a lesser evil than
    • making a new class for condition testing (also violates
      encapsulation), or
    • adding that responsibility to the TrajectoryReaderStrategy?
      I'm not bothered by the "violation of encapsulation" because the
      violation is internal to the component, and the purpose is making the
      implementation work smoothly.)
TrajectoryReaderStrategy
  • is an interface class with all pure virtual functions
  • is built by a factory method of TrajectoryReadManager based on the
    file type (this is the only place where code that queries the file
    type exists)
  • implements the Strategy design pattern for reading trajectories in
    different formats
  • declares an interface that permits an implementation to support
    reading only a header before deciding whether to read the body or skip
    it
  • is provided by TrajectoryReadManager with a handle to
    a FileIOManager object
  • has a shared_ptr to a PrimitiveFrame object, which is renewed for
    each frame
  • provides a getter so that TrajectoryReadManager can access the product
  • may or may not have an acceptable class name!
Subclasses of TrajectoryReaderStrategy
  • implement reading of particular trajectory formats (e.g. PDB, GRO,
    G96, TNG, TRR, XTC)
  • call the FileIOManager object to get chunks of data
  • provide the logic for handling chunks of data in the context of the
    intended format
  • provide the logic for handling checks for consistency between frames
  • throw exceptions about valid data formats
  • fill the PrimitiveFrame with the supplied data (so, even if we're
    doing parallel I/O and setting up a domain decomposition, this object
    knows nothing about that)
  • the TNG-reading class will have to have some way for client code
    to get access to the arbitrary data it could (also) contain. Magnus,
    there's your job at some point!
FileIOManager
  • is an interface class with all pure virtual functions
  • declares pure virtual methods to open and close files, query file
    status, and for reading chunks of data
  • does not have an interface that client code can access (maybe? With
    this, we inhibit crude hacks that read some new file format. If
    there's no interface, people might get the idea they're not supposed
    to want one.)
Subclasses of FileIOManager
  • implement reading of files under particular protocols (e.g. stdio,
    XDR, MPI-I/O, mock reads for tests)
  • thow exceptions about I/O conditions
CoordinateFrame
  • is an interface class with all pure virtual functions
  • has containers of data like t_trxframe (should it?)
  • declares getters and setters for the contained data
  • declares a method to fill itself with data from a PrimitiveFrame
    (This provides a hook so that code that does the low-level reading
    doesn't have to know what representation the client wants to use when
    reading it. For example, if we were doing some kind of parallel I/O
    that built a PrimitiveFrame object on each process such that each
    contained some known fraction of the full trajectory frame, and then
    we wanted to do some kind of domain decomposition on that, a subclass
    of CoordinateFrame would provide inside this method the logic to do
    the necessary re-organization and communication. Client code would
    just instantiate an object of that subclass as the prototype for
    TrajectoryReadManager object, and it calls this method without having
    to know about the details. For serial I/O when the prototype is a
    PrimitiveFrame, then the implementation will just do a shallow copy of
    the shared_ptr.)
  • declares a clone() method (so that a TrajectoryReaderManager object
    can implement the Prototype design pattern)
PrimitiveFrame
  • implements CoordinateFrame pretty much just like t_trxframe (for the
    moment, anyway)
  • is used as the intermediate container by TrajectoryReaderStrategy
  • might have to have data copied if the final CoordinateFrame object
    is not either a PrimitiveFrame or contains similar containers of
    primitives, but that is the price of having low-level I/O code not
    having to know about the final data format)
  • has a static method to construct an empty PrimitiveFrame (e.g. for
    use by client code as a prototype to pass to TrajectoryReadManager)
DistributedFrame
  • would implement CoordinateFrame for some hypothetical
    distributed-data analysis tool
TNGArbitraryFrame
  • would maybe implement CoordinateFrame, but there's a lot of unique
    infrastructure required to permit client code to get arbitrary data
    that might not even be present

Related issues

Related to GROMACS - Feature #1500: Post-5.0 feature clean-up planNew01/20/2014

Associated revisions

Revision 30113ccc (diff)
Added by Mark Abraham almost 6 years ago

Move mdrun trajectory writing into wrapper function

Refs #1292, #1193, #1137

Change-Id: I3f19e0995ff7fab465184d5bab8c2683260af853

Revision 8df8c14d (diff)
Added by Mark Abraham almost 6 years ago

Create fileio module

This patch contains only code motion. There are no functional code
changes. Moves lots of I/O code into src/gromacs/fileio. This means
lots of changes to avoid having everything as a compile-time dependency
of everything else because everything is pulled in via typedefs.h,
etc. Note that src/gromacs/legacyheaders/filenm.h and
src/gromacs/legacyheaders/types/filenm.h have been consolidated into
src/gromacs/fileio/filenm.h.

I/O code in files named stat*[ch] now lives in various new files in
fileio.

Files outside of the module now #include its header files in a proper
way, e.g. #include #include "../fileio/filenm.h" or
"gromacs/fileio/filenm.h" according to whether they are an installed
header, or not. Files within the module are blessed and do not need
that qualifier.

This module installs most of its headers (because they're almost all
inter-dependent; gmxfio_int.h is not installed because it is
only useful internally, vmdio.h is not installed because it
relies on a header from src/external)

Files in new module
  • conform to preferred include-guard format.
  • have up-to-date copyright headers thanks to Teemu's automatic
    script
  • that are installed headers refer to other GROMACS include files via
    relative paths

Moves mdrun trajectory writing into wrapper function.

Removes small pieces of I/O code that was hiding behind "#if 0".

Some pieces of I/O code specific to the gmxpreprocess module have
remained there.

Moved a cppcheck suppression to follow matio.cpp to its new home.

Minor fix to xdrf.h logic, since it is now the subject of a CMake
test.

Refs #1292, #1193, #1137

Change-Id: I820036298d574966d596ab9e258ed8676e359184

History

#1 Updated by Roland Schulz over 6 years ago

Mark Abraham wrote:

  • Memory for old frames will be deallocated when the last shared_ptr
    referring to it goes out of scope (best practice: use only named
    shared_ptrs)

In general it is better to have single owner design. I don't see any advantage of having shared ownership here. So the user of read-frame should either pass in an allocated frame pointer or it should return a unique_ptr. (This is true for all shared_ptrs in this concept - I think they would be all(/most) better unique_ptrs).

TrajectoryReadManager
  • returns only coordinate frames that satisfy conditions specified by
    the client (e.g. frames every 2ns)

For parallel reading we want these conditions and also a split between cores. So in addition to something like "every 2ns" we also need e.g. "the n/10 (or n%10) of trajectory goes to the n-th core".
Also the specific format reader (e.g. XTC) should inform what type of skipping is (in)efficient. E.g. for XTC n%10 is very slow but n/10 is fast. We have started some of this in 1316 and also have some more in code which isn't uploaded yet.

Subclasses of TrajectoryReaderStrategy
  • call the FileIOManager object to get chunks of data

It would be good if the XTC/TRR/... reader/writer could be used flexible and not only with a file as the input/output. E.g. for parallel writing we needed to compress a XTC frame in memory. I think most OO APIs use streams for that purpose. http://www.gromacs.org/index.php?title=Developer_Zone/Programming_Guide/Allowed_C%2B%2B_Features says not to use iostreams. Is the rule intended to be applied only for actually output or do we also not want to use streams for intermediate processing steps?

#2 Updated by Teemu Murtola over 6 years ago

  • Description updated (diff)
Two quick comments on the overall structure. Will look at the details later.
  • The FileIOManager is a more general concept than just for trajectory I/O, so it could be better to split it out into a separate issue. We will have need for this kind of file objects also elsewhere in the code. There may be something specific that is necessary for trajectory I/O, but mostly it is general, with requirements to be able to mock it from tests etc. (@Roland: in the current design, it is possible to implement a FileIOManager that writes into a memory buffer instead of a real file). This has something in common with #950, but in my mind is separate.
  • If we want to abstract away the representation of the data such that the caller can decide it, to me it would feel better to have a simple interface (e.g., ITrajectoryFrameBuilder) that provides methods to populate the structure. The caller could then simply pass an implementation of this interface to TrajectoryReadManager (or even individual readFrame() calls). Writing will have different requirements, though, but that may need a somewhat different design anyhow.

PS. Reformatted the description to make the bullet lists look right.

#3 Updated by Mark Abraham over 6 years ago

Thanks for the feedback so far guys. I have a few quick comments to add to the original post:

Magnus Lundborg and I sat down last week and hacked through a bunch of the issues with trajectory I/O and came up with a design that vaguely resembles this. I've done a bunch of work on it since, but if there's things you think I've missed, Magnus, do speak up!

Actually refactoring the code into such a design is quite a different thing altogether. I anticipate writing a pile of integration tests and using them to validate incremental refactorings as described in Feathers and similar books. In general, such refactoring doesn't need a design to target, but here we know enough about the problem that we should probably have one to target.

Offhand, trajectory writing feels like a similar job, but there's not a lot of code I anticipate re-using, so I've left that out for now.

#4 Updated by David van der Spoel over 6 years ago

Don't have time to read everything now, but I had a student implement a C++ front end on top of Magnus' library. Haven't seen the result yet, but I will meet him tomorrow. Then the plan was to merge in this code into the tng_io library repository.

#5 Updated by Magnus Lundborg over 6 years ago

I don't have anything to add, that I can think of right now, from our discussions Mark.

I think the C++ front end will be a very good start from the TNG point of view. Of course it would have been easier for your student if this discussion had been finished before he/she started, but hopefully we can just tweak it and have something that will fit the grand plan. The current C++ front end is more or less just a layer on the current TNG C library, right? So, then we can add what is needed for this GROMACS component and there will (hopefully) be lots of flexibility for using TNG from C++, allowing standard API calls or the higher level GROMACS calls.

#6 Updated by David van der Spoel over 6 years ago

Indeed he implemented a C++ wrapper and added a new test program based on the old one to use the C++ functions.

#7 Updated by Mark Abraham over 6 years ago

Thanks for the feedback!

Roland Schulz wrote:

Mark Abraham wrote:

  • Memory for old frames will be deallocated when the last shared_ptr
    referring to it goes out of scope (best practice: use only named
    shared_ptrs)

In general it is better to have single owner design. I don't see any advantage of having shared ownership here. So the user of read-frame should either pass in an allocated frame pointer

Not sure quite what you mean by an "pass in an allocated frame pointer". The client code cannot anticipate the contents of the trajectory file. If it does, it then has to code explicit checks for "did the (part of the) frame object I allocated get filled with data from the trajectory file?" Client code will always have to handle cases like "I need velocities and they weren't read in." At least if the object that does the reading doesn't allocate memory for a container that isn't filled then there's a simple way for the client to know there's no real data.

Accordingly, the frame that client code would pass in for each frame read would be an empty shell. If so, then it may as well be a frame created by the frame reader. I was trying to achieve client code that looks like

some_smart_ptr<TheConcreteFrameType> protypeFrame = new TheConcreteFrameType(whatever,args);
TrajectoryReader reader(whatever,args,prototypeFrame);
do
{
  some_smart_ptr<CoordinateFrame> theFrame; // named smart pointer makes sure destructor is called when you want it
  reader.doRead(theFrame);
  if(theFrame->ok())
  {
    do_processing();
  }
} while (reader.notEOF());

Is there something gained by requiring the client to code:

  some_smart_ptr<CoordinateFrame> theFrame = new TheConcreteFrameType(whatever,args);

in each loop iteration?

or it should return a unique_ptr. (This is true for all shared_ptrs in this concept - I think they would be all(/most) better unique_ptrs).

Some of the smart pointers I'd deliberately left ambiguous, so we could choose their real pointer type once the role was more clear. I hadn't looked at unique_ptr yet, but it does look suitable. I think you're suggesting the above fragment is better implemented by:

some_smart_ptr<TheConcreteFrameType> protypeFrame = new TheConcreteFrameType(whatever,args);
TrajectoryReader reader(whatever,args,prototypeFrame);
do
{
  gmx_unique_ptr<CoordinateFrame> theFrame(reader.doRead()); // named smart pointer makes sure destructor is called when you want it
  if(theFrame->ok())
  {
    do_processing(theFrame);
  }
} while (reader.notEOF());

This seems to suit well the notion that the Trajectory Reader component serves the role of a factory.

TrajectoryReadManager
  • returns only coordinate frames that satisfy conditions specified by
    the client (e.g. frames every 2ns)

For parallel reading we want these conditions and also a split between cores. So in addition to something like "every 2ns" we also need e.g. "the n/10 (or n%10) of trajectory goes to the n-th core".

Sure. That kind of partitioning will be the role of MPI_IO_Manager. The client triggered the particular FileIOStrategy when they constructed the TrajectoryReadManager, but only the client and the MPI_IO_Manager object should know about that. The other objects just pass around the container of numbers as required.

Also the specific format reader (e.g. XTC) should inform what type of skipping is (in)efficient. E.g. for XTC n%10 is very slow but n/10 is fast. We have started some of this in 1316 and also have some more in code which isn't uploaded yet.

OK, I'll take a look. I expect that the kind of place to put "uh XTC and n%10 is stupid" warnings would be in the code that is constructing the two kinds of manager.

Subclasses of TrajectoryReaderStrategy
  • call the FileIOManager object to get chunks of data

It would be good if the XTC/TRR/... reader/writer could be used flexible and not only with a file as the input/output. E.g. for parallel writing we needed to compress a XTC frame in memory.

This kind of thing is awkward to implement. If you might write various kinds of file type (e.g. XTC, TNG) with various kinds of output strategy (std, MPI) then you need something like multiple dispatching. Eventually, some object has to know that it is writing an XTC file from a CoordinateFrame that contains distributed data via MPI I/O (or whatever) because all those details are needed to write any code.

I think most OO APIs use streams for that purpose. http://www.gromacs.org/index.php?title=Developer_Zone/Programming_Guide/Allowed_C%2B%2B_Features says not to use iostreams. Is the rule intended to be applied only for actually output or do we also not want to use streams for intermediate processing steps?

I've interpreted that to mean "use stdio rather than iostream for console I/O" rather than "don't use operator<<()". But I'm yet to see why "using streams" helps solve the problem of implementing the kind of multiple dispatching that we might need to do.

#8 Updated by Mark Abraham over 6 years ago

Teemu Murtola wrote:

Two quick comments on the overall structure. Will look at the details later.
  • The FileIOManager is a more general concept than just for trajectory I/O, so it could be better to split it out into a separate issue. We will have need for this kind of file objects also elsewhere in the code. There may be something specific that is necessary for trajectory I/O, but mostly it is general, with requirements to be able to mock it from tests etc. (@Roland: in the current design, it is possible to implement a FileIOManager that writes into a memory buffer instead of a real file). This has something in common with #950, but in my mind is separate.

Agreed, a more general component is probably wise.

  • If we want to abstract away the representation of the data such that the caller can decide it, to me it would feel better to have a simple interface (e.g., ITrajectoryFrameBuilder) that provides methods to populate the structure. The caller could then simply pass an implementation of this interface to TrajectoryReadManager (or even individual readFrame() calls). Writing will have different requirements, though, but that may need a somewhat different design anyhow.

I think I was putting this kind of functionality in CoordinateFrame::fillFrom(). If there's a way to get data from the raw I/O straight into the containers from wh ich it'll be used, then I'm all ears. I suspect "populating the structure" will require some deep container copies for the non-trivial cases. Having a "filler" function avoids needing the kind of interface I think you are suggesting. In the trivial case, the implementation of that function is just a shallow container copy.

#9 Updated by Teemu Murtola over 6 years ago

  • Description updated (diff)

Mark Abraham wrote:

Teemu Murtola wrote:

  • If we want to abstract away the representation of the data such that the caller can decide it, to me it would feel better to have a simple interface (e.g., ITrajectoryFrameBuilder) that provides methods to populate the structure. The caller could then simply pass an implementation of this interface to TrajectoryReadManager (or even individual readFrame() calls). Writing will have different requirements, though, but that may need a somewhat different design anyhow.

I think I was putting this kind of functionality in CoordinateFrame::fillFrom(). If there's a way to get data from the raw I/O straight into the containers from which it'll be used, then I'm all ears. I suspect "populating the structure" will require some deep container copies for the non-trivial cases. Having a "filler" function avoids needing the kind of interface I think you are suggesting. In the trivial case, the implementation of that function is just a shallow container copy.

My proposal doesn't change much about the fillFrom() function nor what is possible and what is not in terms of copying data. At its simplest, the ITrajectoryFrameBuilder would have only a single method, identical to your fillFrom(). It could even be such that the frame objects themselves implement ITrajectoryFrameBuilder.

What one would gain with my proposal is that there would be no need for the prototype, nor any of the complexity it brings. No need for inheritance hierarchy for the different frame structures (of course this hierarchy would be transfered to the builders, but the purpose of the inheritance would be much more clear there). No need for clone() methods that the trajectory I/O component needs to know about, and that actually only need to be able to clone an empty frame. And no uncertainty on what is the type of the frame returned by readNextFrame(); in your proposal, unless one adds templates, the readFrame() method will only return a CoordinateFrame, and all the callers need to cast that to the type they expect. My proposal may also require some templates, though, if one wants to be able to write code like

SpecialFrameBuilder builder;
SpecialFrameTypePointer frame;
do
{
    SpecialFrameTypePointer frame = reader.readNextFrame(builder);
}
while (...);

PS. Also, another try at reformatting the description...

#10 Updated by Teemu Murtola over 6 years ago

Mark Abraham wrote:

Roland Schulz wrote:

Mark Abraham wrote:

TrajectoryReadManager
  • returns only coordinate frames that satisfy conditions specified by
    the client (e.g. frames every 2ns)

For parallel reading we want these conditions and also a split between cores. So in addition to something like "every 2ns" we also need e.g. "the n/10 (or n%10) of trajectory goes to the n-th core".

Sure. That kind of partitioning will be the role of MPI_IO_Manager. The client triggered the particular FileIOStrategy when they constructed the TrajectoryReadManager, but only the client and the MPI_IO_Manager object should know about that. The other objects just pass around the container of numbers as required.

How does the MPI_IO_Manager in your design know what is a "frame"? If all it gets is generic requests for chunks of data through the FileIOStrategy interface (the same requests it would for serial I/O), how can it do anything reasonable for parallel I/O? I have to admit that I don't have that much experience with parallel I/O, but something seems to be missing here.

Subclasses of TrajectoryReaderStrategy
  • call the FileIOManager object to get chunks of data

It would be good if the XTC/TRR/... reader/writer could be used flexible and not only with a file as the input/output. E.g. for parallel writing we needed to compress a XTC frame in memory.

This kind of thing is awkward to implement. If you might write various kinds of file type (e.g. XTC, TNG) with various kinds of output strategy (std, MPI) then you need something like multiple dispatching. Eventually, some object has to know that it is writing an XTC file from a CoordinateFrame that contains distributed data via MPI I/O (or whatever) because all those details are needed to write any code.

Doesn't your proposed design have exactly the same problem? The file format specific reader does not know whether it is reading using MPI or something else, as it only gets a generic FileIOStrategy. I'm not sure whether this can work, as doing parallel I/O efficiently is inherently bound to the file format. It may work if we simply say that the only way of doing parallel I/O is reading complete frames in each process, but even then, some coordination is probably required at a higher level than just inside the MPI_IO_Manager. I think that is a much bigger problem than adding a reader/writer that operates on a memory buffer instead of a file.

#11 Updated by Mark Abraham over 6 years ago

Teemu Murtola wrote:

Mark Abraham wrote:

Roland Schulz wrote:

Mark Abraham wrote:

TrajectoryReadManager
  • returns only coordinate frames that satisfy conditions specified by
    the client (e.g. frames every 2ns)

For parallel reading we want these conditions and also a split between cores. So in addition to something like "every 2ns" we also need e.g. "the n/10 (or n%10) of trajectory goes to the n-th core".

Sure. That kind of partitioning will be the role of MPI_IO_Manager. The client triggered the particular FileIOStrategy when they constructed the TrajectoryReadManager, but only the client and the MPI_IO_Manager object should know about that. The other objects just pass around the container of numbers as required.

How does the MPI_IO_Manager in your design know what is a "frame"? If all it gets is generic requests for chunks of data through the FileIOStrategy interface (the same requests it would for serial I/O), how can it do anything reasonable for parallel I/O? I have to admit that I don't have that much experience with parallel I/O, but something seems to be missing here.

My memory of MPI file I/O (admittedly a few years old) was that it was about the same interface as MPI communication. You set up an access pattern, passed a local buffer and it got filled with the data that matched the pattern. So I'm still thinking that a FileIOStrategy interface with things like readVector<T>() is workable. Doubtless there are issues we'll see once we're in the trenches...

Subclasses of TrajectoryReaderStrategy
  • call the FileIOManager object to get chunks of data

It would be good if the XTC/TRR/... reader/writer could be used flexible and not only with a file as the input/output. E.g. for parallel writing we needed to compress a XTC frame in memory.

This kind of thing is awkward to implement. If you might write various kinds of file type (e.g. XTC, TNG) with various kinds of output strategy (std, MPI) then you need something like multiple dispatching. Eventually, some object has to know that it is writing an XTC file from a CoordinateFrame that contains distributed data via MPI I/O (or whatever) because all those details are needed to write any code.

Doesn't your proposed design have exactly the same problem?

Sure. I was intending to note the scale of the problem, not that it was solved. In practice, I expect we will be very selective about what file formats are worth implementing/using for parallel I/O.

The file format specific reader does not know whether it is reading using MPI or something else, as it only gets a generic FileIOStrategy. I'm not sure whether this can work, as doing parallel I/O efficiently is inherently bound to the file format. It may work if we simply say that the only way of doing parallel I/O is reading complete frames in each process, but even then, some coordination is probably required at a higher level than just inside the MPI_IO_Manager. I think that is a much bigger problem than adding a reader/writer that operates on a memory buffer instead of a file.

I haven't yet thought of a sensible use case for parallel trajectory reading filling a distributed data structure with a single frame. I'm hoping Roland (or gerrit 1316) will shed some light there. The nearest example to it that GROMACS has right now is the mdrun -rerun functionality, where we do a serial read and then use domain decomposition. There, DD has responsibility for shuffling the atoms & information to where they need to be to do effective work. I am struggling to imagine that we could have useful parallel trajectory reading of a single frame that didn't fill an object of a sexy CoordinateFrame that had DD-like abilities to know what data was needed where and to communicate as needed. If so, then the present problem reduces to the client code indicating what parallel I/O access pattern is wanted, MPI_IO_Manager doing the dirty work, and the data ending up in said sexy CoordinateFrame object.

However, if we might be using parallel I/O to do embarrassingly parallel analysis tasks, then the scope of this component would need to increase accordingly. Currently, it assumes there's only one logical frame around.

#12 Updated by Mark Abraham over 6 years ago

Teemu Murtola wrote:

Mark Abraham wrote:

Teemu Murtola wrote:

  • If we want to abstract away the representation of the data such that the caller can decide it, to me it would feel better to have a simple interface (e.g., ITrajectoryFrameBuilder) that provides methods to populate the structure. The caller could then simply pass an implementation of this interface to TrajectoryReadManager (or even individual readFrame() calls). Writing will have different requirements, though, but that may need a somewhat different design anyhow.

I think I was putting this kind of functionality in CoordinateFrame::fillFrom(). If there's a way to get data from the raw I/O straight into the containers from which it'll be used, then I'm all ears. I suspect "populating the structure" will require some deep container copies for the non-trivial cases. Having a "filler" function avoids needing the kind of interface I think you are suggesting. In the trivial case, the implementation of that function is just a shallow container copy.

My proposal doesn't change much about the fillFrom() function nor what is possible and what is not in terms of copying data. At its simplest, the ITrajectoryFrameBuilder would have only a single method, identical to your fillFrom(). It could even be such that the frame objects themselves implement ITrajectoryFrameBuilder.

What one would gain with my proposal is that there would be no need for the prototype, nor any of the complexity it brings. No need for inheritance hierarchy for the different frame structures (of course this hierarchy would be transfered to the builders, but the purpose of the inheritance would be much more clear there). No need for clone() methods that the trajectory I/O component needs to know about, and that actually only need to be able to clone an empty frame. And no uncertainty on what is the type of the frame returned by readNextFrame(); in your proposal, unless one adds templates, the readFrame() method will only return a CoordinateFrame, and all the callers need to cast that to the type they expect. My proposal may also require some templates, though, if one wants to be able to write code like
[...]

OK I see your point now. I hadn't foreseen the casting issue. So you're suggestiong either there's a parallel hierarchy of FrameBuilders, or perhaps a virtual method in each implementation of CoordinateFrame to "build the right object on the end of a smart pointer of the type the client code expects." The important thing here is that it is easy for the author of a tool to say once "I'm going to expect this kind of CoordinateFrame" and not to have to write lots of boilerplate code in their inner loop to make it work. I'll think more about this tomorrow.

PS. Also, another try at reformatting the description...

I haven't found the interface to attempt that, yet! Where is it?

#13 Updated by Teemu Murtola over 6 years ago

Mark Abraham wrote:

OK I see your point now. I hadn't foreseen the casting issue. So you're suggestiong either there's a parallel hierarchy of FrameBuilders, or perhaps a virtual method in each implementation of CoordinateFrame to "build the right object on the end of a smart pointer of the type the client code expects." The important thing here is that it is easy for the author of a tool to say once "I'm going to expect this kind of CoordinateFrame" and not to have to write lots of boilerplate code in their inner loop to make it work. I'll think more about this tomorrow.

I think that the code that the caller needs to write is something along these lines (not all details may be the best possible, but you should get the idea):

TrajectoryReader reader;
TrajectoryFrameBuilder frameBuilder;
while (reader.readNextFrame(frameBuilder))
{
    TrajectoryFramePointer frame = frameBuilder.buildFrame();
}

I don't think that this is too verbose, and the implementation has some nice properties (like that it doesn't require any templates). It's possible to also write a routine that wraps the readNextFrame() + buildFrame() part into a single function, but internally, I think that something like this could be the way to go for the implementation if we want to support multiple output structures.

PS. Also, another try at reformatting the description...

I haven't found the interface to attempt that, yet! Where is it?

Click "Update", then on "More" next to "Change properties" (you should have the necessary rights).

#14 Updated by Teemu Murtola over 6 years ago

Mark Abraham wrote:

My memory of MPI file I/O (admittedly a few years old) was that it was about the same interface as MPI communication. You set up an access pattern, passed a local buffer and it got filled with the data that matched the pattern. So I'm still thinking that a FileIOStrategy interface with things like readVector<T>() is workable. Doubtless there are issues we'll see once we're in the trenches...

Yes, this is how it works in general, but I see two issues:
  1. To set up an access pattern, you need to know the file format that you are reading, and possibly also how much data you are reading and how you intend to use it. The caller or the MPI IO manager cannot do this without help from the file format specific reader, and the file format specific reader can't do this without information from the caller.
  2. Static access patterns may work for some cases, but for trajectory formats, I would guess that in most cases one would want to adjust the pattern depending on the data. E.g., xtc does not have fixed-length frames, and a hypothetical DD-aware trajectory format that is easy to write in parallel would potentially have blocks of varying numbers of atoms within a single frame. And again this information needs to flow from the file format specific reader to the IO manager, which isn't easy to do generally in this design.

#15 Updated by Mark Abraham over 6 years ago

Teemu Murtola wrote:

Mark Abraham wrote:

OK I see your point now. I hadn't foreseen the casting issue. So you're suggestiong either there's a parallel hierarchy of FrameBuilders, or perhaps a virtual method in each implementation of CoordinateFrame to "build the right object on the end of a smart pointer of the type the client code expects." The important thing here is that it is easy for the author of a tool to say once "I'm going to expect this kind of CoordinateFrame" and not to have to write lots of boilerplate code in their inner loop to make it work. I'll think more about this tomorrow.

I think that the code that the caller needs to write is something along these lines (not all details may be the best possible, but you should get the idea):
[...]
I don't think that this is too verbose, and the implementation has some nice properties (like that it doesn't require any templates). It's possible to also write a routine that wraps the readNextFrame() + buildFrame() part into a single function, but internally, I think that something like this could be the way to go for the implementation if we want to support multiple output structures.

Yes, something like that is what I had in mind. Obviously, there will be arguments to those constructors that trigger what implementations will get used. And TrajectoryFrameBuilder needs some kind of polymorphism, so that the pointer returned from buildFrame has the right type to avoid a cast. As a side effect, the various Frame classes need not implement a common interface.

I'm still not sure why frameBuilder needs to be passed to reader.nextFrame. If the frameBuilder object is constant over the reading process, isn't it better to simplify the client code and do:

TrajectoryReader reader(various,args); // builds and owns an internal TrajectoryFrameBuilder
while (reader.readNextFrame())
{
    TrajectoryFramePointer frame = reader.getBuilder().buildFrame();
}

(where buildFrame() or its class is polymorphic so the right kind of pointer is returned). It's tempting to take the TrajectoryFrameBuilder out of the component interface entirely (which would require TrajectoryReader::buildFrame<FrameType>() or something), but that would couple the Reader to the implementations of the Frame classes too closely?

PS. Also, another try at reformatting the description...

I haven't found the interface to attempt that, yet! Where is it?

Click "Update", then on "More" next to "Change properties" (you should have the necessary rights).

Aha. Thanks. (Now you've removed one of the motivations I had for upgrading Redmine version some time soon!)

#16 Updated by Teemu Murtola over 6 years ago

Mark Abraham wrote:

Yes, something like that is what I had in mind. Obviously, there will be arguments to those constructors that trigger what implementations will get used. And TrajectoryFrameBuilder needs some kind of polymorphism, so that the pointer returned from buildFrame has the right type to avoid a cast. As a side effect, the various Frame classes need not implement a common interface.

Yes. But buildFrame() does not need to be virtual. In my example, the declarations would look something like this (most details omitted):

class ITrajectoryFrameBuilder
{
    virtual void fillFrom(unique_ptr<PrimitiveFrame>) = 0;
};

class TrajectoryFrame
{
};

class TrajectoryFrameBuilder : public ITrajectoryFrameBuilder
{
    TrajectoryFramePointer buildFrame();
};

class TrajectoryFrameReader
{
    bool readNextFrame(ITrajectoryFrameBuilder &);        
};

Support for another type of output frame would simply require implementing a different frame object (which has no relation to the TrajectoryFrame class), and a builder for it (which again only implements ITrajectoryFrameBuilder).

I'm still not sure why frameBuilder needs to be passed to reader.nextFrame. If the frameBuilder object is constant over the reading process, isn't it better to simplify the client code and do:
[...]
(where buildFrame() or its class is polymorphic so the right kind of pointer is returned).

This has the same return value problem as your original design. Since the getBuilder() method can only return a generic builder, it is not possible to return a correct type. When it is the caller's responsibility to construct the builder and own it, the caller can directly call methods in the concrete builder that return that specific type. It is possible to pass the builder to the constructor of the reader instead of each readNextFrame() call, but I think it is easier to understand the code when it is only passed where it is really needed.

It's tempting to take the TrajectoryFrameBuilder out of the component interface entirely (which would require TrajectoryReader::buildFrame<FrameType>() or something), but that would couple the Reader to the implementations of the Frame classes too closely?

It isn't very easy to make the builder invisible in the interface, unless you make the whole reader a TrajectoryReader<FrameType>, or alternatively construct a new builder for each frame. Adding a template method that has either the builder type or the frame type as a template parameter allows for some trickery, but it doesn't simplify the calling code much, so it may not be worth it.

#17 Updated by Mark Abraham over 6 years ago

Teemu Murtola wrote:

Mark Abraham wrote:

My memory of MPI file I/O (admittedly a few years old) was that it was about the same interface as MPI communication. You set up an access pattern, passed a local buffer and it got filled with the data that matched the pattern. So I'm still thinking that a FileIOStrategy interface with things like readVector<T>() is workable. Doubtless there are issues we'll see once we're in the trenches...

Yes, this is how it works in general, but I see two issues:
  1. To set up an access pattern, you need to know the file format that you are reading, and possibly also how much data you are reading and how you intend to use it. The caller or the MPI IO manager cannot do this without help from the file format specific reader, and the file format specific reader can't do this without information from the caller.
If a client had an XTC file to read with MPI I/O into a ConcreteCoordinateFrame distributed over 8 nodes, then the construction of the TrajectoryReadManager object would specify all these things (and perhaps whether the file access will be strided or chunked by 8). At that point, the TrajectoryReadManager configures an MPI I/O fileManager with the knowledge that it's expected to deliver the specified access pattern, and configures an XTC trajectoryReader. Later, in aTrajectoryReadManager.readNextFrame():
  • readNextHeader() is called, since there might be one to read
  • that calls trajectoryReader->readHeader(fileManager)
  • that is an XTC reader, so it really knows it needs to read whatever XTC header exists, and calls the fileManager to get the job done using methods that implement a broadcast access pattern (or something)
  • the XTC reader should now know enough to know how much data it expects later (but not that it'll be distributed over 8 processors)
  • now aTrajectoryReadManager can call areConditionsSatisfied()
  • then presumably readNextBody()
  • that calls trajectoryReader->readBody(fileManager)
  • which is an XTC reader, so it calls the file manager to give it whatever chunk of stuff comes next in the XTC format, perhaps with the size anticipated from the header
  • and that file manager knows it wants a chunk of stuff and still remembers it was configured with an access pattern
  • so it can allocate a container of a sufficient size for the size of the chunk modulo the access pattern and do the MPI I/O call
  • and the XTC reader can keep asking for whatever chunks it expects, getting fed back those chunks modulo the access pattern

The aTrajectoryReadManager and its trajectoryReader never need to know that the size of the buffers in the resulting PrimitiveFrame are about one eighth of the size they'd be if we were using serial I/O. Only the ConcreteCoordinateFrame type specified by the client needs to know that.

  1. Static access patterns may work for some cases, but for trajectory formats, I would guess that in most cases one would want to adjust the pattern depending on the data. E.g., xtc does not have fixed-length frames, and a hypothetical DD-aware trajectory format that is easy to write in parallel would potentially have blocks of varying numbers of atoms within a single frame. And again this information needs to flow from the file format specific reader to the IO manager, which isn't easy to do generally in this design.

This is what the header-reading phase is for, somewhat like we do now. A frame in such a DD-aware format was written in the knowledge of how much data was being written, and that kind of information should be written first and thus available to trajectoryReader once it has read the header. In this case, the MPI I/O reader has to be configured to provide an irregular access pattern. After reading the header, the reader must know that (say) the 8-fold DD had (102, 99, 97, 101, 100, 99, 101, 97) atoms and the reader had to be configured with its process ID, so the second process knows to tell its MPI I/O reader that it wants data range 102 through 200.

File formats like PDB that lack a proper header will probably have an empty readHeader(), and have to have their readBody() peek at the next line to work out when to stop. All that peeking/caching/EOF code is within the PDBTrajectoryReader object, so that's as good as it ever can be.

#18 Updated by Teemu Murtola over 6 years ago

In response to comment 17 by Mark Abraham:

All that may be workable, but I think my point is more that does doing this generically bring any value? The FileIOStrategy interface still needs a lot of parallelization-specific things in the interface (like specifying what is read using a "broadcast", what is the minimum element that can be chunked, configuring the irregular access pattern etc.). And since this will be part of the interface, also the serial strategies become more complex since they need to implement/call these functions. Making compromises in the design of the interface to keep it simpler may also mean that it is less future-proof in terms of ability to efficiently read different file formats in parallel.

An alternative design that avoids most of these issues would be that there would be a separate strategy for parallel I/O (e.g., SerialTRRTrajectoryReader and ParallelTRRTrajectoryReader). These can still share whatever implementation they can. And the I/O strategies can share some common interface, but this way the parallel reader could much more easily use the parallelization-specific interfaces without complicating everything else in the code (since it has a handle to the real MPI_IO_Manager instead of a generic interface). In this design, the reading strategy is responsible for constructing the I/O strategy (based on information it receives from the ReaderManager). Again, code for processing common caller options can be shared. This approach would also be more flexible for the cases where the trajectory is read using an external library, where it may be impossible to replace the I/O layer with our FileIOStrategy implementation.

#19 Updated by Mark Abraham over 6 years ago

Teemu Murtola wrote:

In response to comment 17 by Mark Abraham:

All that may be workable, but I think my point is more that does doing this generically bring any value? The FileIOStrategy interface still needs a lot of parallelization-specific things in the interface (like specifying what is read using a "broadcast", what is the minimum element that can be chunked, configuring the irregular access pattern etc.).

Yes, needing a generic interface that anticipates that some implementations will need broadcast semantics is a bit nasty.

And since this will be part of the interface, also the serial strategies become more complex since they need to implement/call these functions. Making compromises in the design of the interface to keep it simpler may also mean that it is less future-proof in terms of ability to efficiently read different file formats in parallel.

True.

An alternative design that avoids most of these issues would be that there would be a separate strategy for parallel I/O (e.g., SerialTRRTrajectoryReader and ParallelTRRTrajectoryReader). These can still share whatever implementation they can. And the I/O strategies can share some common interface, but this way the parallel reader could much more easily use the parallelization-specific interfaces without complicating everything else in the code (since it has a handle to the real MPI_IO_Manager instead of a generic interface). In this design, the reading strategy is responsible for constructing the I/O strategy (based on information it receives from the ReaderManager). Again, code for processing common caller options can be shared. This approach would also be more flexible for the cases where the trajectory is read using an external library, where it may be impossible to replace the I/O layer with our FileIOStrategy implementation.

Good points. I like that approach much more.

Assuming there's some implementation code common to *TRRTrajectoryReader, would we have an abstract TRRTrajectoryReaderStrategy partially implement TrajectoryReaderStrategy?

I'll redraft the component class diagram based on the feedback, have a think about what the object for frame-acceptance conditions should look like, post for further review, and start generating some more regression tests.

#20 Updated by Teemu Murtola over 6 years ago

Mark Abraham wrote:

Assuming there's some implementation code common to *TRRTrajectoryReader, would we have an abstract TRRTrajectoryReaderStrategy partially implement TrajectoryReaderStrategy?

I think it is better to have a separate class (e.g., TRRTrajectoryReaderHelper) that contains that common code, and then both trajectory readers would use that object by containment. There is an OO design principle "inherit to be reused, not to reuse", and this is one example of that. Some of my existing code still violates this, but I may work on refactoring those parts at some point, when I need to touch them for some other reason...

#21 Updated by Mark Abraham over 5 years ago

  • Target version changed from 5.0 to future

#22 Updated by Mark Abraham over 5 years ago

#23 Updated by Mark Abraham over 3 years ago

  • Status changed from New to Rejected
  • Target version deleted (future)

Life has gone in other directions

Also available in: Atom PDF