Feature #1181

Implementing asymmetric offload to MIC (Xeon Phi Co-processor)

Added by Mikhail Plotnikov over 4 years ago. Updated over 1 year ago.

Status: In Progress
Priority: Normal
Assignee: Roland Schulz
Category: mdrun
Target version: -
Difficulty: uncategorized

Description

GROMACS has a highly optimized GPU offload scheme with auto-detection of GPUs and dynamic load balancing between CPU and GPU: it offloads the PP computations to the GPU and computes the PME part on the CPU. The same asymmetric parallelization approach can be implemented for the MIC co-processor and would be highly appreciated, as MIC has already been launched and is accessible to users.


Related issues

Related to GROMACS - Feature #1187: Optimized Verlet SIMD kernels for native MIC (Xeon Phi co-processor) (Closed)
Related to GROMACS - Feature #1394: 16-wide MIC SIMD bonded interaction kernels (Closed)
Related to GROMACS - Feature #1420: Fast force reduction for large number of threads (In Progress)

Associated revisions

Revision d28336a2 (diff)
Added by Roland Schulz over 3 years ago

Add SIMD support for Intel MIC

Only single precision is supported so far.

To compile in native mode:
CFLAGS="-mmic" CXXFLAGS="-mmic" cmake

Regression tests pass with ICC 14.

Instructions are not updated because only offload mode (#1181)
is useful to the user.

Part of #1187

Change-Id: I81a2022cfcecf634fdfaff5ce63ad82f0a5d4dee

History

#1 Updated by Szilárd Páll over 4 years ago

  • Category set to mdrun
To clarify things a bit, what we'd need for this is:
  • either the full task-parallelization framework;
  • or separate "PP nodes" similar to the PME ones that do nothing but calculate non-bonded (and perhaps bonded) forces and possibly do reduction on them.

The latter is probably considerably easier to implement. However, with the latter we might run into issues related to the (current) lack of task splitting and balancing, meaning that e.g. the non-bonded workload cannot be flexibly split between CPU and GPU; therefore, especially if the accelerator is relatively slow compared to the CPU, the CPU can end up idling a lot (especially in non-PME runs).

#2 Updated by Roland Schulz over 4 years ago

Why is either of these required? Can't we simply do an asynchronous offload of the kernel? This should be easier than adding PP nodes and would have the same effect. And wouldn't some very simple task splitting for the force kernel be easy enough to add? Also, we would need kernels using the MIC instruction set to get good performance (either MIC intrinsics or ispc).

#3 Updated by Szilárd Páll over 4 years ago

@Roland, Mikhail is working on the Verlet kernels for MIC, so that part of the issue is being addressed. Ideally, we should get LRNbi kernels for bondeds as well (@Mikhail: added you to the reviewers of patch set 2206, relevant in this context).

The reason I think PP nodes would be a far better solution is that offloading is pragma-driven, hence rather inflexible (and ugly as well).

Workload splitting is not hard, but splitting without any load balancing is not very useful, I think, and the load-balancing part can be a bit trickier.

All in all, I think the tasks are not difficult, but they require deep familiarity with the code, and right now it is not certain who will have time to collaborate with Mikhail on these features in the near future.

#4 Updated by Mikhail Plotnikov over 4 years ago

Roland Schulz wrote:

Why is either of these required? Can't we simply do an asynchronous offload of the kernel? This should be easier than adding PP nodes and would have the same effect.

Do you mean a pragma offload around an OpenMP loop? I don't think offloading at this level will be efficient due to overhead. Task decomposition is required, with balancing between PP<->PME tasks.

#5 Updated by Szilárd Páll over 4 years ago

Mikhail Plotnikov wrote:

Roland Schulz wrote:

Why is either of these required? Can't we simply do an asynchronous offload of the kernel? This should be easier than adding PP nodes and would have the same effect.

Do you mean a pragma offload around an OpenMP loop? I don't think offloading at this level will be efficient due to overhead. Task decomposition is required, with balancing between PP<->PME tasks.

That's what I thought, but I am not very familiar with the offloading mechanisms on MIC. Are there any ways to implement efficient asynchronous execution similar to what one would do with GPUs? How about _Cilk_offload? Does that only work with the implicit memory copy model? I assume the implicit memory copy model is not very efficient, is it?

In its simplest form, offloading the non-bonded calculations would involve executing the following operations asynchronously:
  • (transfer the pair list, on pair-search steps only)
  • transfer coordinates to the accelerator
  • launch the non-bonded kernel
  • transfer forces back from the accelerator
  • wait for the forces to arrive

If the above operations can be launched asynchronously and implementing this is simple enough, it could be worth trying even if the result is not optimal.
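
To make this concrete, here is a minimal sketch of how those five steps could map onto the Intel compiler's offload pragmas; the kernel, buffer names, and sizes are purely illustrative assumptions, not existing GROMACS code:

  /* Hypothetical sketch: asynchronous MIC offload of the non-bonded kernel
   * via Intel's offload pragmas. All names are invented for illustration. */
  #define ALLOC alloc_if(1) free_if(0)  /* allocate on the card, keep it  */
  #define REUSE alloc_if(0) free_if(0)  /* reuse an existing card buffer  */

  __attribute__((target(mic)))
  void nb_kernel_mic(const float *x, float *f,
                     const int *pairlist, int natoms, int npairs);

  static char sig;  /* tag used to signal completion of the async offload */

  void nb_offload_step(float *x, float *f, int *pairlist,
                       int natoms, int npairs, int bPairSearchStep)
  {
      if (bPairSearchStep) {
          /* 1. transfer the pair list (pair-search steps only) */
          #pragma offload_transfer target(mic:0) \
              in(pairlist : length(npairs) ALLOC)
      }

      /* 2.-4. copy x in, run the kernel, copy f out; the signal() clause
       * makes the offload asynchronous, so this call returns immediately.
       * x and f are assumed to have been allocated on the card at setup. */
      #pragma offload target(mic:0) signal(&sig) \
          in(x : length(3*natoms) REUSE) \
          out(f : length(3*natoms) REUSE) \
          nocopy(pairlist : length(npairs) REUSE)
      nb_kernel_mic(x, f, pairlist, natoms, npairs);

      /* ... CPU computes PME/bonded forces here, overlapping the MIC ... */

      /* 5. wait for the forces to arrive */
      #pragma offload_wait target(mic:0) wait(&sig)
  }

Everything up to the wait is non-blocking, so the CPU-side force work overlaps with the card, much like with a CUDA stream.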

#6 Updated by Mikhail Plotnikov over 4 years ago

Szilárd Páll wrote:

That's what I thought, but I am not very familiar with the offloading mechanisms on MIC. Are there any ways to implement efficient asynchronous execution similar to what one would do with GPUs?

Pragma offload should work fine. I just meant that the offload should be done at the highest possible level, not in the main computational loop. There is a possibility to emulate GPU offload with an environment variable documented in the manual:
GMX_EMULATE_GPU: emulate GPU runs by using algorithmically equivalent CPU reference code instead of GPU-accelerated functions.

This should enable all the machinery used for GPU offload. I think it's worth starting by offloading that reference code. Do you agree?
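
For reference, the emulation path is enabled simply through the environment when launching mdrun, e.g. (standard tpr input assumed):
GMX_EMULATE_GPU=1 mdrun -s topol.tpr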

#7 Updated by Berk Hess over 4 years ago

The GPU emulation code does exactly the opposite: it produces a pair list optimized for the GPU, but does not do the offloading.
Offloading only involves the 5 calls Szilard mentioned above, which are all made in src/mdlib/sim_util.c: nbnxn_cuda_...
I assume the offloading should not be slower than MPI communication with the MIC. Implementing offloading will be less work than implementing separate non-bonded-only nodes similar to the PME-only nodes. The latter could be somewhat more general, but it goes against our planned task parallelization scheme, where we would like to have one MPI rank per physical node.

#8 Updated by Szilárd Páll over 4 years ago

Mikhail Plotnikov wrote:

This should enable all the machinery used for GPU offload. I think it's worth starting by offloading that reference code. Do you agree?

All GPU offload code is (NVIDIA) GPU/CUDA specific and consists of:
  • GPU device and data management/initialization functionality;
  • the actual offload of the non-bonded computation, including asynchronously executed explicit data copies to/from the GPU and the non-bonded kernel.

The asynchronous CPU-GPU execution is achieved by enqueueing the aforementioned tasks in the GPU's work queue (through the CUDA runtime API). The GPU "emulation" mode doesn't do anything fancy: it simply omits calls to all CUDA API functions (including kernel invocations) and executes a plain C kernel algorithmically equivalent to the CUDA GPU kernels. Hence, GPU acceleration does not require separate PP ranks, as the GPU tasks are handled by the GPU's own task scheduler, and therefore there isn't much to (re)use for the MIC offload. Additionally, it really does not matter which plain C reference kernels are used, the GPU 8x8x8 ones or the "normal CPU" 4x4 ones.
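
For comparison, the host side of the CUDA pattern just described boils down to something like the following (an illustrative C sketch with invented names; the real code lives in nbnxn_cuda.cu):

  /* Illustrative host-side sketch of the asynchronous CUDA pattern;
   * nb_kernel_launch is a made-up wrapper around the kernel launch. */
  #include <cuda_runtime.h>

  extern void nb_kernel_launch(float *d_x, float *d_f, int natoms,
                               cudaStream_t s);

  void nb_gpu_step(cudaStream_t s, const float *h_x, float *h_f,
                   float *d_x, float *d_f, int natoms)
  {
      /* all three calls only enqueue work in the stream and return
       * immediately; h_x/h_f must be pinned for truly async copies */
      cudaMemcpyAsync(d_x, h_x, 3*natoms*sizeof(float),
                      cudaMemcpyHostToDevice, s);
      nb_kernel_launch(d_x, d_f, natoms, s);
      cudaMemcpyAsync(h_f, d_f, 3*natoms*sizeof(float),
                      cudaMemcpyDeviceToHost, s);
  }

  /* later, right before the force reduction:
   *     cudaStreamSynchronize(s);    wait for the forces to arrive */

A MIC equivalent would replace the stream with a signal tag and the explicit copies with offload clauses, as sketched in comment #5.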

[ text below copied over from the email I sent earlier today ]
If it is reasonable to implement the CPU-MIC data/control flow in a similar fashion to the current GPU acceleration, the starting point could be to implement the MIC offload module by mirroring the GPU offload functionality, see:
  • module API: src/mdlib/nbnxn_cuda/nbnxn_cuda.h
  • module source: src/mdlib/nbnxn_cuda/nbnxn_cuda.cu

I assume none of the complicated explicit data- and device-management functionality that GPUs require is needed for MIC, right?
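
To make the mirroring concrete, the public API of such a module might look roughly like this; this is a hypothetical header whose names merely mirror the pattern of the CUDA module and do not exist in the code:

  /* nbnxn_mic.h -- hypothetical MIC offload module API, mirroring the
   * structure of src/mdlib/nbnxn_cuda/nbnxn_cuda.h; all names invented */
  #ifndef NBNXN_MIC_H
  #define NBNXN_MIC_H

  typedef struct nbnxn_mic nbnxn_mic_t;

  /* one-time setup: initialize the MIC device and allocate card buffers */
  nbnxn_mic_t *nbnxn_mic_init(int dev_id);

  /* per pair-search step: push the new pair list to the card */
  void nbnxn_mic_init_pairlist(nbnxn_mic_t *mic, const void *pairlist);

  /* per MD step: async x copy + kernel launch, then async f copy-back */
  void nbnxn_mic_launch_kernel(nbnxn_mic_t *mic, const float *x);
  void nbnxn_mic_launch_cpyback(nbnxn_mic_t *mic, float *f);

  /* block until the forces have arrived on the host */
  void nbnxn_mic_wait_forces(nbnxn_mic_t *mic);

  void nbnxn_mic_free(nbnxn_mic_t *mic);

  #endif

If MIC indeed needs no explicit device/data management, nbnxn_mic_init could shrink to little more than picking the card and allocating the persistent buffers.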

#9 Updated by Roland Schulz almost 4 years ago

  • Target version changed from future to 5.0

#10 Updated by Szilárd Páll almost 4 years ago

Have you started working on the pragma-based offload?

This summer I looked at the code, and it seemed that the current coupling between the number of non-bonded tasks and the number of force buffers needs to be broken; the buffers and the reduction should probably be moved "closer" to the force kernel (perhaps thread-local storage plus a reduction immediately after the kernel, before transferring back a single array).

Additionally, it could be worth considering:
  • a binary reduction could be more efficient on MIC (see the sketch below this list);
  • overlapping x/f transfers by splitting up the coordinates (we're working on this for GPUs);
  • allowing the non-local kernels to start before all local work is done, but this may not be possible without asynchronous MPI.
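
On the first point, a binary (tree) reduction over per-thread force buffers could look like the sketch below; this is plain C/OpenMP with invented names, not existing code:

  /* Hypothetical sketch: tree reduction of nbuf per-thread force buffers,
   * each holding 3*natoms floats; the summed forces end up in fbuf[0]. */
  void reduce_forces_tree(float **fbuf, int nbuf, int natoms)
  {
      int stride;
      /* log2(nbuf) levels; at each level, buffer i accumulates buffer
       * i+stride, halving the number of live buffers */
      for (stride = 1; stride < nbuf; stride *= 2) {
          int i;
          #pragma omp parallel for schedule(static)
          for (i = 0; i < nbuf - stride; i += 2*stride) {
              float       *dst = fbuf[i];
              const float *src = fbuf[i + stride];
              int          j;
              for (j = 0; j < 3*natoms; j++) {
                  dst[j] += src[j];
              }
          }
      }
  }

The total work is the same as a linear sweep over the buffers, but the critical path shrinks from O(nbuf) to O(log nbuf) passes, which should matter with the hundreds of hardware threads on a KNC card.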

#11 Updated by Roland Schulz almost 4 years ago

No, we haven't started on this. We (mostly John Eblen in our group) have started with #1187. Both Mikhail and Chris Neale said they might be interested in doing this. If no one has time, John will do it after #1187 is done.

By buffer you mean a memory buffer (before the reduction), not the Verlet buffer (rlist-rcut), correct? Thanks for the comments. I think whoever starts this work should have a chat with you beforehand.

#12 Updated by Chris Neale almost 4 years ago

I am currently looking into this to see if I am capable of doing it. I will let you know if I make progress or decide it's beyond my abilities. I am available for discussion if anybody has suggestions.

#13 Updated by Szilárd Páll almost 4 years ago

Roland Schulz wrote:

By buffer you mean a memory buffer (before the reduction), not the Verlet buffer (rlist-rcut), correct? Thanks for the comments. I think whoever starts this work should have a chat with you beforehand.

I was referring to the force accumulation buffers, which are currently allocated early in the pair-list setup phase. As we will (most likely) use asymmetric offload, we will need a different number of buffers than the number of non-bonded pair lists (= the number of cores the pair search uses). Even if these two numbers matched, we'd still want to reduce on the MIC and transfer back a single force array (or two with DD) instead of sending back k*61 accumulation arrays for reduction on the CPU.

#14 Updated by Szilárd Páll almost 4 years ago

It is probably a good idea to discuss things before taking concrete action; please contact me (and/or Berk) for scheduling a meeting.

#15 Updated by Roland Schulz over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Roland Schulz

#16 Updated by Roland Schulz over 3 years ago

  • Related to Feature #1420: Fast force reduction for large number of threads added

#17 Updated by Roland Schulz over 3 years ago

  • Target version changed from 5.0 to 5.x

#18 Updated by Mark Abraham almost 2 years ago

  • Status changed from In Progress to Closed

I think this is never going to happen.

#19 Updated by Roland Schulz almost 2 years ago

  • Status changed from Closed to In Progress

We are working on it. And if it gives a speedup, it has a chance to get in, right?

#20 Updated by Mark Abraham almost 2 years ago

Maybe, but if KNL won't work in offload mode, it doesn't seem like a good focus for effort.

#21 Updated by Mark Abraham over 1 year ago

  • Target version deleted (5.x)

Are we still working on this?
