Task #2675: bonded CUDA offload task
bonded GPU module timing
The timing implementation is straightforward but not critical: because of the buggy cudaEvent-based timing facilities, kernels cannot be timed when work is launched concurrently (i.e. in multiple streams).
Host-side (launch) timing should also be added to avoid leaking time into "Rest".
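
One possible shape for the host-side (launch) timing, as a minimal sketch in plain C++. All names here (LaunchTimer, ScopedLaunchTiming, timedLaunch) are hypothetical and not taken from the existing code; the callable passed to timedLaunch stands in for the asynchronous bonded kernel launch. The sketch measures only the wall-clock time the host spends inside the launch call, not kernel execution time, and accumulates it in its own counter so it is not attributed elsewhere.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical accumulator for the host-side cost of launching GPU work.
struct LaunchTimer
{
    using Clock = std::chrono::steady_clock;

    std::chrono::nanoseconds total{ 0 }; // summed wall-clock time spent in launch calls
    std::int64_t             count = 0;  // number of timed launches
};

// RAII helper: measures from construction to destruction and adds the
// elapsed time to the given LaunchTimer.
class ScopedLaunchTiming
{
public:
    explicit ScopedLaunchTiming(LaunchTimer& timer) :
        timer_(timer), start_(LaunchTimer::Clock::now())
    {
    }

    ~ScopedLaunchTiming()
    {
        const auto elapsed = LaunchTimer::Clock::now() - start_;
        timer_.total += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
        ++timer_.count;
    }

    ScopedLaunchTiming(const ScopedLaunchTiming&) = delete;
    ScopedLaunchTiming& operator=(const ScopedLaunchTiming&) = delete;

private:
    LaunchTimer&                   timer_;
    LaunchTimer::Clock::time_point start_;
};

// Wraps a launch call (assumed to return as soon as the work is enqueued)
// so that only the host-side launch overhead is measured.
template<typename LaunchFn>
void timedLaunch(LaunchTimer& timer, LaunchFn&& launch)
{
    ScopedLaunchTiming scope(timer);
    launch();
}
```

Because this uses a host clock rather than events, it does not depend on the event-based facilities and should be usable with multiple streams; the accumulated total could then be reported as a separate launch entry.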