Task #2916

Updated by Mark Abraham over 1 year ago

We have a bunch of strings, particularly in the topology data structures (e.g. atom name, name of atom type, name of residue, name of residue type, element, FEP B-state versions of some of those).

For reasons that probably include minimizing size, we have a legacy t_symtab structure, into which strings can be inserted. Data structures like t_atom then contain pointers that refer to entries in the symtab. We need to modernize this so that we can improve the data structures that use these strings.

Option 1)
Migrate to using a std::string everywhere. This makes use, allocation and serialization trivial. Small-string optimizations are sometimes our friend here. However the .tpr file might e.g. double in size, because now alongside per-atom data like @x[3],v[3],q@ x[3],v[3],q we also have a string by value for at least some of the above types of data. That doubling is probably no big deal.

Option 2)
A staged replacement to use a new StringTable type that encapsulates some storage, provides an interface to insert into, can be serialized predictably, can return handles into it suitable for serialization of dependent data structures, and is at least somewhat efficient to look up once we're done inserting things into it. That could be implemented with e.g. a std::vector<string> that perhaps gets sorted when insertion is done, and if we can find a suitable C++14 version of then data structures like t_atom can hold such views and be serialized in terms of an index into the underlying container (which is roughly what t_symtab does now).

Option 2 would usefully generalize to a (future) data structure that is a std::unordered_map<std::string, StringTable> so that we can have separate tables for atom names, atom types, etc. that we refer to like @stringTables["atomname"].insert("CA")@. That will hopefully map naturally to a future key-value format to replace the .tpr where for each particle we index into a table of particle names, a table of possible residue types (which itself might index into a table of residue type names?). Having deserialized all the tables first, now when we read the particular per-particle fields that refer into them, we can have the correct string_view.