Feature #2816: GPU offload / optimization for update & constraints, buffer ops and multi-GPU communication
Add cycle counting to StatePropagatorDataGpu
The time spent on H2D and D2H copies in StatePropagatorDataGpu needs to be accounted for, which requires introducing cycle counters to the object.
Add wallcycle counting to StatePropagatorDataGpu
Launch overheads are counted in the main GPU launch overhead counter,
with a separate subcounter for the launches issued by this object. The
CPU blocking wait is timed with a main counter of its own.
Note that this change introduces an mdtypes->timing->mdtypes cyclic
dependency, the warning for which is suppressed.
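
The counting scheme described above can be illustrated with a minimal, standalone C++ sketch. It does not use the GROMACS wallcycle API: the class CycleCounters, the enumerators, and the helpers timedLaunch and timedWait are all hypothetical names chosen for illustration, and std::chrono stands in for real cycle counting. The launch of a copy is charged to a shared main launch counter and, nested inside it, to a dedicated subcounter, while the blocking wait has a separate main counter.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical counter identifiers; these are NOT the GROMACS enumerators.
enum class MainCounter : std::size_t { LaunchGpu, WaitGpuStatePropagatorData, Count };
enum class SubCounter : std::size_t { LaunchStatePropagatorData, Count };

using Clock = std::chrono::steady_clock;

// Accumulated time and call count for one counter.
struct Entry
{
    Clock::duration   total{};
    Clock::time_point begin{};
    int               calls   = 0;
    bool              running = false;
};

// Minimal stand-in for a wallcycle-like object (assumption, for illustration).
class CycleCounters
{
public:
    void start(MainCounter c) { startEntry(main_[index(c)]); }
    void stop(MainCounter c) { stopEntry(main_[index(c)]); }
    void start(SubCounter c) { startEntry(sub_[index(c)]); }
    void stop(SubCounter c) { stopEntry(sub_[index(c)]); }

    const Entry& get(MainCounter c) const { return main_[index(c)]; }
    const Entry& get(SubCounter c) const { return sub_[index(c)]; }

private:
    template<typename E>
    static std::size_t index(E e) { return static_cast<std::size_t>(e); }

    static void startEntry(Entry& e)
    {
        assert(!e.running); // a counter must not be started twice
        e.running = true;
        e.begin   = Clock::now();
    }
    static void stopEntry(Entry& e)
    {
        assert(e.running); // a counter must be running to be stopped
        e.total += Clock::now() - e.begin;
        e.calls++;
        e.running = false;
    }

    std::array<Entry, static_cast<std::size_t>(MainCounter::Count)> main_{};
    std::array<Entry, static_cast<std::size_t>(SubCounter::Count)>  sub_{};
};

// Charge the enqueueing of an asynchronous copy to the main launch counter
// and, nested inside it, to the dedicated subcounter.
template<typename LaunchFn>
void timedLaunch(CycleCounters& wc, LaunchFn&& enqueueCopy)
{
    wc.start(MainCounter::LaunchGpu);
    wc.start(SubCounter::LaunchStatePropagatorData);
    enqueueCopy();
    wc.stop(SubCounter::LaunchStatePropagatorData);
    wc.stop(MainCounter::LaunchGpu);
}

// Charge the CPU-side blocking wait to its own main counter.
template<typename WaitFn>
void timedWait(CycleCounters& wc, WaitFn&& blockUntilDone)
{
    wc.start(MainCounter::WaitGpuStatePropagatorData);
    blockUntilDone();
    wc.stop(MainCounter::WaitGpuStatePropagatorData);
}
```

In this sketch the subcounter time is always a part of the main launch counter time, so the breakdown shows how much of the total launch overhead comes from this object, whereas the wait time is reported separately.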