evaluate different storage layouts for GPU coordinates/charges/forces
In GPU code we currently use two different AoS layouts:
- for coordinates:
  xyzq for coordinates & charges in the nonbonded kernels, and xyz with a separate q in PME
- for forces:
There are significant drawbacks to using an AoS layout with 3-element short vectors (at least 2x the global memory transactions, plus shared/local memory bank conflicts). Padding to 4 elements, which allows 16-byte/thread vectorized gmem loads, costs 33% extra bandwidth and the same overhead in shared/local memory footprint, but this will often not be a limitation.
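The 33% figure follows directly from the element sizes. A minimal host-side sketch of that arithmetic, assuming 4-byte floats; `Xyz`, `Xyzw` and the helper functions are hypothetical stand-ins for illustration, not the types used in the code base:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two AoS element types (assumption, not the real types).
struct Xyz { float x, y, z; };                  // 3-element short vector
struct alignas(16) Xyzw { float x, y, z, w; };  // padded to 4 elements, 16-byte aligned

static_assert(sizeof(Xyz) == 12, "three packed 4-byte floats");
static_assert(sizeof(Xyzw) == 16, "four floats, loadable as one 16-byte vector");

// Extra bytes moved per element due to padding, as a percentage of the unpadded size.
constexpr double paddingOverheadPercent()
{
    return 100.0 * (sizeof(Xyzw) - sizeof(Xyz)) / sizeof(Xyz);
}

// Storage needed for n elements in each layout (applies equally to gmem and shared/local memory).
constexpr std::size_t bytesUnpadded(std::size_t n) { return n * sizeof(Xyz); }
constexpr std::size_t bytesPadded(std::size_t n) { return n * sizeof(Xyzw); }
```

Note that in the xyzq case the fourth slot carries the charge, so the padding is only truly wasted where the fourth element has no use, e.g. for forces.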
At the same time, AoS is convenient. SoA avoids the above AoS drawbacks, but with scattered access patterns it can waste L1/L2 cache, because each component of an element is fetched from a different cache line.
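To make the cache trade-off concrete, the following host-side sketch counts how many distinct memory sectors a gather of x, y, z and q touches in each layout. The 32-byte sector size, the back-to-back placement of the SoA component arrays, and both function names are assumptions made for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Assumed fetch granularity in bytes; the real value depends on the hardware.
constexpr std::size_t c_sectorBytes = 32;

// Distinct sectors touched when gathering 16-byte xyzq elements from an AoS buffer.
std::size_t sectorsTouchedAoS(const std::vector<std::size_t>& atomIndices)
{
    std::set<std::size_t> sectors;
    for (std::size_t i : atomIndices)
    {
        sectors.insert(i * 16 / c_sectorBytes);
    }
    return sectors.size();
}

// Distinct sectors touched when gathering x, y, z, q from four float arrays
// of numAtoms entries each, assumed to be stored back to back.
std::size_t sectorsTouchedSoA(const std::vector<std::size_t>& atomIndices, std::size_t numAtoms)
{
    std::set<std::size_t> sectors;
    for (std::size_t component = 0; component < 4; ++component)
    {
        for (std::size_t i : atomIndices)
        {
            sectors.insert((component * numAtoms + i) * 4 / c_sectorBytes);
        }
    }
    return sectors.size();
}
```

Under these assumptions, gathering four widely separated atoms (say indices 0, 1000, 2000, 3000 out of 4096) touches 4 sectors with AoS but 16 with SoA, whereas eight consecutive atoms touch 4 sectors in either layout. This is the scattered-access cost that an evaluation would need to measure on real kernels.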
We should evaluate the options and decide whether we can live with a single storage layout across GPU kernels.