Future of single-node parallel coding
We have already discussed the future of task-level parallelism quite a bit, since that's the way we are heading at the algorithm level, but as I've started to look more into this I've also realized there are a couple of choices to be discussed and made about the lower-level language and library features we use to implement those algorithms.
A few things:
- We should be restrictive with direct OpenMP usage. This is starting to proliferate a lot in the code since it's a quick hack, but OpenMP scaling frequently leaves things to be desired for higher node counts, so it is NOT a silver bullet. The more such direct code we write, the harder it will be to modularize and get rid of it in the future; we are already using OpenMP pragmas in ~160 different places.
- Can we come up with any way to start using our own thread/sync layer roughly on the current level of OpenMP? This would help us avoid even more OpenMP dependencies before we have the full task parallelism implemented.
- Should we design a general fine-grained lock structure for forces and other arrays where we do random-access memory updates? I think we could do something reasonably efficient by having a list where each entry corresponds to N bytes, or alternatively a hash covering all of memory with ~1000 different spinlock positions. Note that atomic operations will not work directly on floating-point data, let alone on SIMD data.
- We should consider creating locking structures that can work with transactional memory extensions. Intel appears to have screwed up theirs big time so they are disabling it in Haswell-EP, but on Power8 it should work. Any ideas how to do this best?
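To make the fine-grained lock idea above a bit more concrete, here is a minimal sketch of the "hash covering all of memory" variant. Everything in it is an assumption for illustration: the names `StripedSpinlocks` and `addForce` are hypothetical and not existing GROMACS code, and the table size (1024) and block size (64 bytes) are just placeholders for "~1000 positions" and "N bytes".

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch: a fixed table of spinlocks, indexed by a hash of the
// address being updated. Not an existing GROMACS class.
class StripedSpinlocks
{
public:
    static constexpr std::size_t c_numLocks     = 1024; // "~1000 spinlock positions"
    static constexpr std::size_t c_bytesPerLock = 64;   // assumed N bytes per entry

    // All addresses inside the same N-byte block map to the same slot;
    // blocks wrap around the table, so unrelated addresses may collide.
    static std::size_t slotFor(const void* address)
    {
        const auto bits = reinterpret_cast<std::uintptr_t>(address);
        return (bits / c_bytesPerLock) % c_numLocks;
    }

    void lock(const void* address)
    {
        std::atomic_flag& flag = locks_[slotFor(address)].flag;
        while (flag.test_and_set(std::memory_order_acquire))
        {
            std::this_thread::yield(); // back off while another thread holds the slot
        }
    }

    void unlock(const void* address)
    {
        locks_[slotFor(address)].flag.clear(std::memory_order_release);
    }

private:
    // Pad each flag to its own (assumed 64-byte) cache line to avoid false sharing.
    struct alignas(64) PaddedFlag
    {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    };
    std::array<PaddedFlag, c_numLocks> locks_;
};

// Add a contribution to one atom's force (x, y, z stored contiguously at f).
// Every updater must lock on the same base address of the atom for this to be safe.
void addForce(StripedSpinlocks& locks, float* f, float dx, float dy, float dz)
{
    assert(f != nullptr);
    locks.lock(f);
    f[0] += dx;
    f[1] += dy;
    f[2] += dz;
    locks.unlock(f);
}
```

The lock is keyed on the atom's base address rather than on every byte touched, so a 12-byte force that straddles two blocks still takes only one lock. Whether the hashing and spinning cost beats a plain reduction would have to be measured.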
#1 Updated by Roland Schulz over 6 years ago
2. My understanding is we can use the TBB API to replace our OpenMP constructs. In case we decide to go with the TBB API it seems odd to use some other API in between.
3. In what cases would random access memory updates be better than reduction? My impression is that it scales worse on CPUs and isn't reproducible (unless using fixed point).
4. It should be fixed in Haswell-EX. For what operations do you expect it to help in Gromacs?
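For comparison with point 3, a reduction-based update might look roughly like the sketch below. It is only an illustration under assumptions: `Pair` and `reduceForces` are hypothetical names, forces are reduced to one scalar per atom for brevity, and plain `std::thread` stands in for whatever threading layer is eventually chosen. Each thread accumulates into a private buffer, and the buffers are summed in a fixed thread order, so the result is reproducible for a given thread count.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical pair interaction: atom i gains f, atom j loses f.
struct Pair
{
    std::size_t i;
    std::size_t j;
    float       f;
};

std::vector<float> reduceForces(const std::vector<Pair>& pairs,
                                std::size_t              numAtoms,
                                std::size_t              numThreads)
{
    assert(numThreads > 0);

    // One private force buffer per thread, so no locking is needed while accumulating.
    std::vector<std::vector<float>> local(numThreads, std::vector<float>(numAtoms, 0.0f));

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        workers.emplace_back([&, t] {
            // Static block partition of the pair list over threads.
            const std::size_t begin = pairs.size() * t / numThreads;
            const std::size_t end   = pairs.size() * (t + 1) / numThreads;
            for (std::size_t p = begin; p < end; ++p)
            {
                local[t][pairs[p].i] += pairs[p].f;
                local[t][pairs[p].j] -= pairs[p].f;
            }
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }

    // Reduce in fixed thread order: same summation order on every run
    // for a given numThreads and pair list.
    std::vector<float> total(numAtoms, 0.0f);
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        for (std::size_t a = 0; a < numAtoms; ++a)
        {
            total[a] += local[t][a];
        }
    }
    return total;
}
```

The trade-off is memory and reduction time that grow with thread count times atom count, which is the cost the fine-grained locking idea tries to avoid.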