Task #1170

mdlib reorganization

Added by David van der Spoel about 6 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: future
Difficulty: uncategorized

Description

Following discussions on the organization of code in Stockholm on 2013/02/27, here is a (not-so-short) summary by Mark.

1. We plan to work hard on reducing header file dependencies. They're everywhere right now and the cost will explode in C++ if we don't fix it and keep it fixed. The good news is that our existing practice of using opaque pointers is still one of the good solutions in C++ (known by various names, including Handle-Body and the Pimpl idiom). An alternative is declaring an abstract class with the virtual functions of the desired interface and deriving implementation classes to contain the guts. Either way, client code only has to include a light header file and recompile only when the API changes. I've been doing a bunch of reading and should be able to come up with guidelines for when to use which approach. I've added my thoughts for how to fix t_forcerec below. (A rough sketch of the Pimpl idiom appears after this list.)

2. We'd like to have a way for mdrun to read its inputs and construct some kind of settings object (e.g. with a container of strings to query for settings) that can then get queried as we do construction of actual objects. It follows that the inner loop will not know about the settings object (else for every test we'll have to make a mock of that settings object...). This means that code for the inner loop only has to pass around values that change over the course of the simulation, not every little flag setting.

3. We discussed that it might be nice if we didn't have to require that people use grompp. David cited a case of making your .tpr and waiting 3 days for mdrun to start on the cluster, before you find out that the settings are invalid. That's not a problem that is solved by moving the grompp functionality inside mdrun - you still have to wait for mdrun to start! Some aspects of such invalidity depend on how mdrun was compiled, and grompp can't know that. For example in 4.6, GPUs require Verlet which doesn't work with implicit solvent. If the queue system doesn't let you jump the queue to do a quick small-scale test of your settings, then the sysadmins need to be reminded of the real-world needs of users :-). Was there another reason why we wanted mdrun to incorporate grompp functionality?

4. Other grompp issues:
  • We don't want to remove grompp entirely - being able to package a portable, reusable run input file is rather useful, particularly for bug reports.
  • We do want to reduce the code footprint of constructing and serializing the pre-processed run input file. Teemu has already done some work in this area, which we'd likely do well to take seriously ;-)
  • Some of the command line options for grompp should probably be moved into the input file (e.g. filename for position restraints reference?). Mark thinks this is essential if we want mdrun to be able to accept grompp-style inputs.

5. Berk suggested having the .tpr file be a fully-featured checkpoint. So we'd merge the 4.6 functionality of .cpt and .tpr files into whatever we call it. grompp and tpbconv would write that new kind of file, and mdrun would write it at checkpoint intervals. That will mean mdrun has to keep writing out the topology and run input settings every checkpoint interval, but that can just be a ready-to-go binary blob, so no big deal.
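
To make point 1 concrete, here is a minimal sketch of the Pimpl idiom; the class and member names are invented for illustration and are not a proposal for any particular GROMACS data structure.

    // Illustrative sketch only: names are invented, not proposed API.
    // pairlist_params.h -- the light header that client code includes
    #include <memory>

    class PairlistParams
    {
        public:
            PairlistParams();
            ~PairlistParams();             // defined where Impl is complete
            double cutoff() const;

        private:
            class Impl;                    // layout known only to the .cpp file
            std::unique_ptr<Impl> impl_;
    };

    // pairlist_params.cpp -- the only file that recompiles when the guts change
    class PairlistParams::Impl
    {
        public:
            Impl() : cutoff_(1.0) {}
            double cutoff_;
            // ... tables, flags and buffers stay hidden in here ...
    };

    PairlistParams::PairlistParams() : impl_(new Impl()) {}
    PairlistParams::~PairlistParams() {}
    double PairlistParams::cutoff() const { return impl_->cutoff_; }

The abstract-class alternative keeps the same light header, but instead declares pure virtual accessors and hides the data in a derived class defined only in the source file.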

======
Mark's thoughts on reorganizing data structures

Teemu flagged the need to reduce header file cross-dependencies in a teleconference a few weeks back. I think we all agree there. We plan to be pretty defensive about mdrun functionality while we grapple with C++ design decisions, but there are also some tool dependencies on forcerec. Those have to be broken.

Leaping in to do a massive untangle is a guaranteed recipe for disaster, particularly with our low test coverage. Controlled refactoring after introducing any relevant new test cases is essential.

C-style opaque pointers are already used in various parts of the code to encapsulate bits of t_forcerec. We should extend that approach, so that (say) t_forcerec is pretty much a list of opaque pointers. The declarations and definitions of those data structures will go into new respective .h and .c files. Basically we can lean on the compiler at all points to tell us when stuff isn't visible. If it compiles, then it's pretty likely to run unless we've mis-managed the memory.
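
Concretely, each such piece would get the familiar C-style split; the names here are invented, not an actual slice of t_forcerec.

    /* Illustrative sketch only: type and function names are invented. */

    /* ewald_tables.h -- all that t_forcerec and other clients ever see */
    typedef struct ewald_tables *ewald_tables_t;      /* opaque handle */

    ewald_tables_t init_ewald_tables(double beta);
    double         ewald_tables_spacing(ewald_tables_t tab);
    void           done_ewald_tables(ewald_tables_t tab);

    /* ewald_tables.c -- the only file that knows the struct layout */
    #include <stdlib.h>

    struct ewald_tables
    {
        double beta;
        double spacing;
        /* ... the actual table data lives here, invisible to clients ... */
    };

    ewald_tables_t init_ewald_tables(double beta)
    {
        ewald_tables_t tab = (ewald_tables_t) malloc(sizeof(*tab));
        tab->beta    = beta;
        tab->spacing = 1.0/1000.0;
        return tab;
    }

    double ewald_tables_spacing(ewald_tables_t tab) { return tab->spacing; }
    void   done_ewald_tables(ewald_tables_t tab)    { free(tab); }

t_forcerec then mostly holds an ewald_tables_t and its friends, and client code includes only the light header, so it recompiles only when that small API changes.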

Now we can grep for #includes of forcerec.h, add any new test cases we need, comment out one such include at a time and see what breaks. Then re-introduce the minimal set of .h files that will make it compile and pass tests. Rinse and repeat for a week or two :-)

Later on, those new .h and .c files will become the nucleus for C++ objects (either via Pimpl or interface construction, TBD later). By then, it will be rather more clear which existing functions operate on which kinds of objects, and we can better see what should be member functions or not.

The same approach probably goes for any other data structure that contains data for more than one kind of process. Perhaps t_inputrec will need some of the same kind of treatment - but point 2 above might be better done as a total re-write.


Related issues

Related to GROMACS - Task #1140: Class design for passing options and data (New)
Related to GROMACS - Feature #1182: improve trajectory writing to support parallel I/O (New, 2013-03-07)
Related to GROMACS - Feature #1670: create mdrun option checking mini-tool (New)

History

#1 Updated by David van der Spoel about 6 years ago

My point was rather that grompp can catch most errors instantaneously, without queueing for mdrun to start. We can in principle make grompp a library with a wrapper program that calls do_grompp(argc, argv). mdrun can either call that function or read a tpr file. The only problem there would be command-line clutter, but in fact grompp does not have a lot of options! It should be rather uncontroversial to move options to the mdp file, as there is little risk of introducing problems in people's workflows.
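
Sketched very roughly below; do_grompp comes from the sentence above, but the signature and split are only an assumption.

    /* Rough sketch only; the real entry point and signature would differ. */
    int do_grompp(int argc, char *argv[]);      /* grompp as a library call */

    /* The stand-alone grompp tool becomes a thin wrapper...               */
    int main(int argc, char *argv[])
    {
        return do_grompp(argc, argv);
    }

    /* ...and mdrun could call do_grompp() itself when handed .mdp/.top/.gro
     * input, or read a pre-made .tpr exactly as before.                    */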

Having mdrun write topology information to the .cpt file may actually be an issue on low-memory machines in HPC centers (Blue Gene?).

#2 Updated by Teemu Murtola about 6 years ago

David van der Spoel wrote:

> 2. We'd like to have a way for mdrun to read its inputs and construct some kind of settings object (e.g. with a container of strings to query for settings) that can then get queried as we do construction of actual objects. It follows that the inner loop will not know about the settings object (else for every test we'll have to make a mock of that settings object...). This means that code for the inner loop only has to pass around values that change over the course of the simulation, not every little flag setting.

There is a separate Redmine issue for this: #1140. I'm still not convinced that the described approach is the best; the alternative is to first gather the list of known settings from the modules that need them, and then directly pass the information into those modules while parsing the information (there can also be other alternatives). Both approaches have their good and bad sides, but I don't think that we should fixate on only one of these approaches without first looking at which one would be better (both from ease of implementation and code understandability/maintainability/extensibility points of view). No matter what we choose, I would strongly prefer that the implementation in src/gromacs/options/ is used, after adapting it as appropriate (right now, it only supports the latter approach of first specifying the allowed settings and then parsing them). I can certainly do the implementation work in the options part (after it is more clear what we want, and in a somewhat unknown schedule), or provide guidance if someone else wants to do that.
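
To make the contrast concrete, here is a toy sketch of the declare-first-then-parse style; it is not the real src/gromacs/options/ API, only the shape of the idea.

    // Toy sketch: NOT the real gromacs/options API, just the idea.
    #include <map>
    #include <string>

    class ToyOptions
    {
        public:
            // A module first declares the settings it needs, together with
            // where the parsed values should be stored.
            void addDouble(const std::string &name, double *store)
            {
                doubles_[name] = store;
            }
            // The parser then writes straight into the registered storage,
            // so modules never see the parsing machinery itself.
            bool assign(const std::string &name, double value)
            {
                std::map<std::string, double *>::iterator it = doubles_.find(name);
                if (it == doubles_.end())
                {
                    return false;    // unknown setting
                }
                *it->second = value;
                return true;
            }

        private:
            std::map<std::string, double *> doubles_;
    };

    // Usage: double rlist;
    //        options.addDouble("rlist", &rlist);
    //        ... parsing then calls options.assign("rlist", 1.2) ...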

The current tpr format causes some extra complications, no matter which approach we want to take: if the options are recognized based on strings, there needs to be a global mapping between the tpr binary contents and the strings they correspond to.

#3 Updated by Mark Abraham about 6 years ago

David van der Spoel wrote:

> My point was rather that grompp can catch most errors instantaneously, without queueing for mdrun to start. We can in principle make grompp a library with a wrapper program that calls do_grompp(argc, argv). mdrun can either call that function or read a tpr file. The only problem there would be command-line clutter, but in fact grompp does not have a lot of options! It should be rather uncontroversial to move options to the mdp file, as there is little risk of introducing problems in people's workflows.

I'm still not sure what functionality you'd like to see. It would be nice if grompp could be the place that does most of the checking of valid option combinations. There are going to be some things grompp can't check for, because not all combinations of input can run on all mdrun binaries (e.g. require PD or DD, or real MPI). Doubtless there are checks currently made by mdrun that should be in grompp, and I think those should move.

We can shuffle and duplicate the calls to pre-processing functionality as much as we like, but that is orthogonal to which code is responsible for checking the input for validity.

Another aspect of this is the proliferation of mdrun command-line options. Things that relate to the run-time performance of mdrun, or what mdrun outputs, belong on the command line. Stuff that affects physics belongs elsewhere. So I would move the -table*, -ei, -j, -replex, -reseed, -nex, -membed, -mp, -mn to the .mdp level. I can see the convenience of having some of these things on the mdrun command line, but the number of things we have to document in mdrun -h is starting to explode. If you want to tweak the user tables, having to re-run grompp (or use the suggested new "grompp front end to mdrun") is reasonable.

> Having mdrun write topology information to the .cpt file may actually be an issue on low-memory machines in HPC centers (Blue Gene?).

BlueGene/L had 500MB/core for all system and user space code and data. P and Q have rather more. The size of a .tpr is a hard upper bound for the size of the binary blob for topology - obviously it includes the "checkpoint" size too. What's the biggest .tpr anybody's ever used? How much larger was it than a .cpt it produced?

#4 Updated by Roland Schulz about 6 years ago

Mark Abraham wrote:

> BlueGene/L had 500MB/core for all system and user space code and data. P and Q have rather more. The size of a .tpr is a hard upper bound for the size of the binary blob for topology - obviously it includes the "checkpoint" size too. What's the biggest .tpr anybody's ever used? How much larger was it than a .cpt it produced?

Our largest system is tpr: 647M, cpt: 544M. To get around the memory problem we should support reading of tpr and writing of cpt/trr in parallel, so that we never need to store all coordinates/velocities/forces on a single MPI rank. That would allow us to simulate almost arbitrarily large systems.

#5 Updated by Mark Abraham about 6 years ago

Roland Schulz wrote:

> Mark Abraham wrote:
>
> > BlueGene/L had 500MB/core for all system and user space code and data. P and Q have rather more. The size of a .tpr is a hard upper bound for the size of the binary blob for topology - obviously it includes the "checkpoint" size too. What's the biggest .tpr anybody's ever used? How much larger was it than a .cpt it produced?
>
> Our largest system is tpr: 647M, cpt: 544M.

OK, so we have ~100M (647M minus 544M) as an estimate for the size of the binary blob of topology info that we'd have to preserve/generate to implement Berk's suggestion. That's uncomfortably large - running big simulation systems is what the big computers are for, and that's where we might run into tight memory constraints.

> To get around the memory problem we should support reading of tpr and writing of cpt/trr in parallel, so that we never need to store all coordinates/velocities/forces on a single MPI rank. That would allow us to simulate almost arbitrarily large systems.

Sure, that's something we should do better (and discuss in #1182). The (side) issue at hand is how big the non-[xvf] data might be.

Given the topology info can be a binary blob, it's not unreasonable to construct/copy it (before reading the .tpr [xvf] vectors, lest memory be really short!), then partition it up over (some of) the nodes to share the memory footprint. We can then tell the parallel I/O to do the pasting in the right order.

#6 Updated by Mark Abraham about 6 years ago

I found a possible replacement for the .mdp machinery. I haven't seen any serious flaws with it - see #1140.

#7 Updated by Mark Abraham almost 6 years ago

Bump. Still not clear to me what David is seeking with "grompp can catch most errors instantaneously without queueing for mdrun to start." Checks that depend on the execution context can't be run until that is set up. Other stuff can and should be in grompp so that a user can run a local grompp and get as much feedback as possible without queuing.

A decent design might be to have grompp read stuff in and sanitize, then run the generic checks. We could work out a way to have the mdrun checks that depend on the execution context be run in a sandbox. mdrun would actually use the same code for the real checks. But we could also call mdrun -check on a cluster head node with some kind of fake MPI environment that just lets the checks work without waiting to get through the queue. Setting up this kind of mock behaviour is something that C++ should be reasonably good at. It gets used often for testing.
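
For instance (a sketch only, with invented names rather than a proposed interface), the context-dependent checks could be written against an abstract execution context that a fake implementation can stand in for on a head node:

    // Illustrative sketch only: names are invented, not proposed interfaces.
    class ExecutionContext
    {
        public:
            virtual ~ExecutionContext() {}
            virtual int  numRanks() const       = 0;
            virtual bool haveGpuSupport() const = 0;
    };

    // The real implementation would query MPI and the build configuration;
    // a fake one lets "mdrun -check" run the same checks on a head node,
    // pretending to be the target machine.
    class FakeExecutionContext : public ExecutionContext
    {
        public:
            FakeExecutionContext(int ranks, bool gpus) : ranks_(ranks), gpus_(gpus) {}
            virtual int  numRanks() const       { return ranks_; }
            virtual bool haveGpuSupport() const { return gpus_; }

        private:
            int  ranks_;
            bool gpus_;
    };

    // The checking code takes the interface, so it cannot tell the difference.
    bool checkSettings(const ExecutionContext &context /*, settings, ... */)
    {
        // e.g. refuse input that needs more ranks or GPUs than are available
        return context.numRanks() > 0;
    }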

Letting mdrun bundle grompp functionality is pretty orthogonal - the workflow would just be "read, sanitize, generic checks, execution-context checks, run", but now it's all in the same executable.

Also, if we run into issues with feeble C++ compilers, it might be possible to short-circuit mdrun to do no checking at all on such architectures, because one area where things like exception support seem useful is in structuring the checking code. So on an architecture with feeble C++ support, we might have to tell users to do grompp, mdrun-local -check (if they want), mdrun-mpi (it's on their head if they didn't check).

#8 Updated by David van der Spoel almost 6 years ago

I just meant we cannot do away with grompp entirely, as it is practical for making many of the checks we need. Mark did not say he wanted to remove the functionality either, so sorry for the confusion.
It should be straightforward to make a read_tpx_from_top_and_gro_and_mdp function based on the grompp code, for using mdrun without grompp.
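
Roughly the shape such a function might take; the parameter list below is only a guess for illustration, not a worked-out interface.

    /* Sketch only: the parameter list is a rough guess, not a designed API. */
    void read_tpx_from_top_and_gro_and_mdp(const char *mdpFile,
                                           const char *topFile,
                                           const char *groFile,
                                           t_inputrec *ir,      /* run settings            */
                                           t_state    *state,   /* coordinates, velocities */
                                           gmx_mtop_t *mtop);   /* pre-processed topology  */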

#9 Updated by Mark Abraham almost 6 years ago

David van der Spoel wrote:

> I just meant we cannot do away with grompp entirely, as it is practical for making many of the checks we need. Mark did not say he wanted to remove the functionality either, so sorry for the confusion.
> It should be straightforward to make a read_tpx_from_top_and_gro_and_mdp function based on the grompp code, for using mdrun without grompp.

Oh, OK. I thought you were after some particular new functionality to help avoid wasting time iterating "queue, get told off by mdrun, fix .mdp settings". We can do that along the lines of some of the stuff in comment 7, but it might not be a priority unless it also helps us test MPI stuff in a mock context.

I definitely want to keep grompp - the market for bundled grompp+mdrun is niche.

#10 Updated by Mark Abraham almost 5 years ago

  • Target version changed from 5.0 to future

#11 Updated by Mark Abraham over 4 years ago

  • Related to Feature #1670: create mdrun option checking mini-tool added

#12 Updated by Mark Abraham over 2 years ago

  • Tracker changed from Feature to Task
