Task #2675: bonded CUDA offload task
implement heuristic fallback to CPU when there is too little work for GPU offload
When the bonded task is too small to run efficiently on the GPU, we should always fall back to the CPU path to avoid paying the cost of a ~13 us GPU kernel launch plus some GPU-side delay.
- determine the rough cross-over point below which the CPU path always takes less time than a GPU kernel launch;
- expose the bonded sparse reduction so that it can be used when reducing forces.
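
A minimal sketch of what such a fallback heuristic could look like, assuming a simple linear CPU cost model. The names BondedOffloadCosts, shouldRunBondedOnCpu and crossOverInteractionCount are hypothetical and not existing code, and the per-interaction CPU cost is a placeholder that would have to be measured; only the ~13 us launch overhead comes from the description above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical cost parameters; both values need to be benchmarked
// on the target hardware before being used for a real decision.
struct BondedOffloadCosts
{
    double gpuLaunchOverheadUs;     // fixed GPU kernel launch cost (~13 us per the task text)
    double cpuCostPerInteractionUs; // assumed average CPU cost per bonded interaction
};

// Returns true when the estimated CPU time for the bonded work is below
// the fixed GPU launch overhead, i.e. the GPU cannot win even if the
// kernel itself took zero time.
inline bool shouldRunBondedOnCpu(std::size_t numInteractions, const BondedOffloadCosts& costs)
{
    assert(costs.cpuCostPerInteractionUs > 0.0);
    const double estimatedCpuTimeUs =
            static_cast<double>(numInteractions) * costs.cpuCostPerInteractionUs;
    return estimatedCpuTimeUs < costs.gpuLaunchOverheadUs;
}

// Smallest interaction count at which the CPU is no longer guaranteed
// to beat the launch overhead under this linear model.
inline std::size_t crossOverInteractionCount(const BondedOffloadCosts& costs)
{
    assert(costs.cpuCostPerInteractionUs > 0.0);
    return static_cast<std::size_t>(
            std::ceil(costs.gpuLaunchOverheadUs / costs.cpuCostPerInteractionUs));
}
```

Comparing against the launch overhead alone gives a conservative threshold: it ignores the extra GPU-side delay and the kernel run time, so the real cross-over is likely somewhat higher than this estimate.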