Performance Guidance for C++ AMP

07/26/2012

Some of our existing and upcoming blog posts illustrate how to get the best performance from C++ AMP, focusing on today’s DirectX 11 GPUs and on our v1 capabilities. This post serves as an index into current and future blog posts that can help you with performance tuning.

Measuring Performance of C++ AMP computations

Here are some links to help you measure performance characteristics accurately:

Optimizing Command Submission

A major performance concern in GPU computing is coordination between the CPU and the GPU. A large part of this cost, especially on discrete GPUs, comes from transferring data to and from the accelerator. Tuning it requires understanding how CPU operations, copies, and kernels can run concurrently with each other. Another overhead is the one-time cost incurred for each kernel launch from the CPU side. Some tips in this area can be found in these blog posts:

Understanding Compute-bound versus Memory-bound kernels

Put simply, every thread in a kernel reads some data, does some arithmetic, and writes some results. With many threads running concurrently, these basic steps overlap, so that while some threads are doing arithmetic others are reading data. We say that a kernel is memory-bound when it spends more net time reading and writing memory than doing arithmetic. Otherwise, we say it is compute-bound.

In the memory-bound case, you will want to focus your tuning first on doing less reading or less writing. Even if a loop looks like it should be compute-bound, particular memory access patterns may not use memory efficiently. If you make a change that reduces the amount of arithmetic and do not see a performance gain, the kernel is likely memory-bound, which suggests that you should focus on memory usage and efficiency first.
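The experiment described above can be sketched on the host side in standard C++. This is only an illustration under stated assumptions: `best_time_ms`, `classify`, and the 5% `tolerance` are hypothetical names and values introduced here, not part of C++ AMP. When timing a real C++ AMP kernel, the callable you pass in should block until the results are ready (for example by calling `accelerator_view::wait()` or `array_view::synchronize()`), because `parallel_for_each` returns before the kernel has finished.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: returns the best wall-clock time (in milliseconds)
// of `repeats` timed calls to `work`. One untimed warm-up call runs first
// so that one-time costs (such as runtime initialization and kernel
// compilation) do not distort the measurement.
template <typename Work>
double best_time_ms(Work&& work, int repeats = 5)
{
    using clock = std::chrono::steady_clock;
    work();  // untimed warm-up
    double best = std::numeric_limits<double>::max();
    for (int i = 0; i < repeats; ++i) {
        const auto start = clock::now();
        work();  // must not return before the results are ready
        const std::chrono::duration<double, std::milli> elapsed =
            clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}

enum class Bound { Memory, Compute };

// Hypothetical classifier: compares the time of the original kernel with
// the time of a variant that does less arithmetic but the same memory
// traffic. If the relative gain does not exceed `tolerance` (an assumed
// threshold), the kernel is likely memory-bound.
Bound classify(double baseline_ms, double reduced_arith_ms,
               double tolerance = 0.05)
{
    assert(baseline_ms > 0.0);
    const double gain = (baseline_ms - reduced_arith_ms) / baseline_ms;
    return gain > tolerance ? Bound::Compute : Bound::Memory;
}
```

You would time both variants with `best_time_ms` and pass the two results to `classify`. Treat the answer as a hint about where to look first, not as a definitive diagnosis.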

Optimizing compute-bound kernels

Here are some resources to help you optimize compute-bound kernels:

Avoid Aliased Invocation of parallel_for_each

Optimizing memory-bound kernels

Here are some resources to help you optimize memory-bound kernels:

Performance Oriented Samples

We encourage you to visit our official, comprehensive C++ AMP samples list and download them all. In this post we re-list the samples that are most oriented toward performance tuning:

Parallel Reduction using C++ AMP shows techniques for efficient use of memory and for coordination of threads within tiles.
Chunking data across multiple C++ AMP kernels illustrates optimization of memory copies between host and accelerators and between multiple accelerators.
Matrix transpose using C++ AMP highlights the importance of effective use of memory and the use of tiling to achieve it.
Matrix Multiplication Sample shows how to use tile memory to avoid redundant global memory loads by multiple threads in a tile.
Convolution Sample also shows how to use tile memory to avoid redundant memory accesses by multiple threads in a tile.
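To see why tile memory helps in samples such as matrix multiplication, here is a minimal CPU-side sketch in standard C++ that counts global memory loads for a naive and a tiled multiply. It is not the code from the samples: `Result`, `multiply_naive`, and `multiply_tiled` are hypothetical names, and the `tile_a` and `tile_b` buffers merely stand in for `tile_static` arrays. It assumes square, row-major matrices whose size is a multiple of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Result {
    std::vector<float> c;      // row-major n x n product
    std::size_t global_loads;  // number of reads from a and b
};

// Naive version: every output element reads a full row of a and a full
// column of b directly from "global" memory.
Result multiply_naive(const std::vector<float>& a,
                      const std::vector<float>& b, std::size_t n)
{
    Result r{std::vector<float>(n * n, 0.0f), 0};
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[row * n + k] * b[k * n + col];
                r.global_loads += 2;
            }
            r.c[row * n + col] = sum;
        }
    return r;
}

// Tiled version: each t x t tile of "threads" first copies one t x t block
// of a and of b into tile-local buffers (one element of each per thread),
// then every thread in the tile reuses those buffers.
Result multiply_tiled(const std::vector<float>& a,
                      const std::vector<float>& b,
                      std::size_t n, std::size_t t)
{
    assert(t > 0 && n % t == 0);
    Result r{std::vector<float>(n * n, 0.0f), 0};
    std::vector<float> tile_a(t * t), tile_b(t * t);  // stand-in for tile_static
    for (std::size_t tr = 0; tr < n; tr += t)
        for (std::size_t tc = 0; tc < n; tc += t)
            for (std::size_t k0 = 0; k0 < n; k0 += t) {
                for (std::size_t i = 0; i < t; ++i)
                    for (std::size_t j = 0; j < t; ++j) {
                        tile_a[i * t + j] = a[(tr + i) * n + (k0 + j)];
                        tile_b[i * t + j] = b[(k0 + i) * n + (tc + j)];
                        r.global_loads += 2;
                    }
                // In a real kernel, a tile barrier would be needed here.
                for (std::size_t i = 0; i < t; ++i)
                    for (std::size_t j = 0; j < t; ++j)
                        for (std::size_t k = 0; k < t; ++k)
                            r.c[(tr + i) * n + (tc + j)] +=
                                tile_a[i * t + k] * tile_b[k * t + j];
                // ...and another barrier before the buffers are overwritten.
            }
    return r;
}
```

In this model the naive version performs 2n³ global loads, while the tiled version performs 2n³/t, because each loaded element is shared by the t threads in the tile that need it.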

The best place to ask questions is our MSDN Forum, and some of the questions we receive there relate to C++ AMP performance. In case some of these questions and scenarios match the ones you are facing, you may want to visit these discussion threads: