Article
08/10/2015

April 2012

Volume 27 Number 04

C++ AMP - Introduction to Tiling in C++ AMP

By Daniel Moth | April 2012

This article covers a prerelease technology called C++ AMP that will ship with Visual Studio 11. All information is subject to change. For an introduction to C++ AMP, read "A Code-Based Introduction to C++ AMP" in this issue of MSDN Magazine.

Visual Studio 11 brings support for heterogeneous computing to the mainstream through C++ Accelerated Massive Parallelism (C++ AMP). I introduced C++ AMP in another article in this issue, which I consider prerequisite reading for this article. So if you haven’t read it yet, please do so, starting on p. 28.

Before I introduce you to the GPU programming performance-optimization technique called “tiling,” remember that in the previous article, you learned about index, extent, array_view, restrict(amp) and parallel_for_each. You’re able to use the C++ AMP API to implement your own data parallel algorithms, such as the matrix multiplication shared in the previous article and repeated here in Figure 1.

Figure 1 Matrix Multiplication, Simple Model

1  void MatMul(vector<int>& vC, const vector<int>& vA,
     const vector<int>& vB, int M, int N, int W )
2  {
3    array_view<const int, 2> a(M, W, vA), b(W, N, vB);
4    array_view<int, 2> c(M, N, vC);
5    c.discard_data();
6    parallel_for_each(c.extent, [=](index<2> idx) restrict(amp)
7    {
8      int row = idx[0]; int col = idx[1];
9      int sum = 0;
10     for(int i = 0; i < b.extent[0]; i++)
11       sum += a(row, i) * b(i, col);
12     c[idx] = sum;
13   });
14   c.synchronize();
15 }

If you aren’t familiar with matrix multiplication, here’s an online reference: bit.ly/HiuP.

In this article, I’ll introduce you to tiling, which is the most common optimization technique when programming GPUs. The only reason to introduce tiling in your algorithm is the extra level of performance that you can potentially achieve through data reuse and better memory-access patterns. Tiling allows you to take better advantage of the GPU memory hierarchy at a lower level than what you can do with the simple model, which you know from the introductory article.

There are two steps to tiling. First, you must explicitly tile the computation (this first step happens for you under the covers with the simple model); second, you must take advantage of tile_static memory (this second step doesn’t happen for you automatically, so you have to do it explicitly yourself).

tiled_extent Class

You know that parallel_for_each accepts an extent object as its first argument. The extent describes the compute domain—that is, how many threads (size) and of what shape (dimensionality) will execute the computation. Consider the following two extent examples:

extent<1> e(12);   // 12 threads in a single dimension
extent<2> ee(2,6); // 12 threads in two-dimensional space

There’s a tiled overload to parallel_for_each that accepts a tiled_extent argument. A tiled_extent describes how to break up the original extent into equally sized tiles. You can get a tiled_extent for up to a rank of only three. If you have more dimensions than that, you need to stick with the simple model or refactor your code. The total number of threads in a tile can’t exceed 1,024.

The easiest way to obtain a tiled_extent object is by calling the parameterless templated tile method on extent, which returns a tiled_extent for that extent object. For the two earlier extent examples you can write something such as the following to obtain corresponding tiled_extent objects:

tiled_extent<6> t_e = e.tile<6>();        // e.rank==t_e.rank
tiled_extent<2,2> t_ee = ee.tile<2, 2>(); // ee.rank==t_ee.rank

For a pictorial representation see Figure 2, which shows how tiling an extent partitions the data into smaller subsets, which in C++ AMP are called tiles.

Figure 2 tiled_extent Is an Extent Partitioned into Tiles

The numbers you choose to pass as template arguments must be known at compile time. They must evenly divide the global extent dimensions passed to the parallel_for_each:

e[0]=12 is divisible by t_e.tile_extent[0]=6
ee[0]=2 is divisible by t_ee.tile_extent[0]=2
ee[1]=6 is divisible by t_ee.tile_extent[1]=2

For performance reasons, the smallest tile size in the least-significant dimension should be 16. Tuning the tile size can yield better or worse performance results, depending on the hardware you use. My advice is, don’t try to do that—instead, pick a number that performs equally well across a broad set of hardware, starting with 16 and even multiples of it.

tiled_index Class

You know that in the parallel_for_each call you pass in a lambda with your code as the second argument. Your code receives an index object, and you can think of the index object as the thread ID. For example:

parallel_for_each(my_extent,[=](index<2> idx) restrict(amp){
  my_array_view[idx] = ...
});

When you tile the extent that you pass to the parallel_for_each, the signature of the lambda that you pass in accepts a tiled_index. A tiled_index consists of four index objects. For example, you can still get to the index object that you were expecting by using a property on the tiled_index object, as follows:

parallel_for_each(my_extent.tile<2,2>(),[=](tiled_index<2,2> t_idx) restrict(amp){
  my_array_view[t_idx.global] = ...
});

When writing tiled algorithms, you probably want to know the local index into the tile (not just the global index into the overall compute domain). You can get that index object through the local property of tiled_index. For some algorithms it’s useful to know the index of the tile in relation to the other tiles in the computation, and also the global index of the tile’s origin. You can access those index objects through the tile and tile_origin properties of tiled_index.

Using the two-dimensional extent of the earlier example (extent<2>(2,6).tile<2,2>()), you can see in Figure 3 the values of the aforementioned tiled_index properties for the highlighted square.

Figure 3 tiled_index Properties Returning Index Objects

Matrix Multiplication Revisited with Tiled parallel_for_each

In Figure 1 you saw a matrix multiplication implementation using the simple model of C++ AMP. How would you change that algorithm so it can be explicitly tiled with what you’ve learned so far (using tiled_extent and tiled_index)?

The solution is shown in Figure 4, with the changes from the earlier listing in bold.

Figure 4 Matrix Multiplication, Tiled, Step 1 Without Using tile_static

3  array_view<const int, 2> a(M, W, vA), b(W, N, vB);
4  array_view<int, 2> c(M, N, vC);
5  c.discard_data();
6  parallel_for_each(c.extent.tile<16,16>(),
     [=](tiled_index<16,16> t_idx) restrict(amp)
7  {
8  int row = t_idx.global[0]; int col = t_idx.global[1];
9  int sum = 0;
10 for(int i = 0; i < b.extent[0]; i++)
11   sum += a(row, i) * b(i, col);
12 c[t_idx.global] = sum;
13 });
14 c.synchronize();

On line 6 I invoked the tile method on the extent, picking a tile size (16x16 in this example), and I changed the lambda to accept a tiled_index with matching tile size template arguments. In the lambda body I replaced all idx occurrences with t_idx.global (lines 8 and 12). This mechanistic conversion is what you should do first for all your C++ AMP algorithms when you decide to tile them. It’s the first—but not the only—step on the journey from the simple model to the tiled model.

One thing to note as discussed earlier is that, with this change, you need to ensure the tile size picked evenly divides the global extent dimensions. My example assumes square matrices where each dimension is evenly divisible by 16. Also, it’s common practice to hoist the tile size out into a static const int variable or a template argument.

In the simple matrix multiplication sample in Figure 1, the system tiles the computation on your behalf, behind the scenes. So it’s implicitly tiled, instead of explicitly tiled, and you don’t have to worry about the divisibility requirement. What the simple model can’t do for you, and hence you have to do yourself, is the necessary step 2 of tiling: changing the algorithm to use tile_static memory and, typically, the usage of one or more of the other index objects. Before I delve into that, let’s take a detour to understand some hardware basics.

Short Background on GPU Memory Hierarchy

There’s a lot of online content that discusses GPU hardware characteristics. This content differs depending on which hardware vendor it focuses on, and even what hardware family from each vendor it describes. In this section I offer a high-level, cross-hardware and rough (each hardware vendor has its own terminology and its own subtleties) picture, from a C++ AMP developer’s perspective.

GPUs have global memory in which your array and array_view data resides. There are also registers for each thread, which is where your local variables typically go (unless your hardware driver has other ideas—for example, if you use too many registers for the hardware your code executes on, it will spill their contents into global memory). Accessing global memory is much slower than accessing registers—for example, it takes one GPU cycle for register access versus 1,000 cycles for a global memory access.

Additionally, near each of their processing elements, GPUs have a small scratchpad memory space (for example, 16KB to 48KB, depending on hardware). This is much faster to access than global memory; perhaps 10 cycles per access, for example. This memory area, also called the local data store, is a programmable cache. CPU caches are automatically and transparently managed for you, and hence any performance benefits are automatically granted to you. In contrast, you have to manage this GPU cache yourself by copying data into and out of it, hence, you have to opt in to the performance benefit. Some programming models call this “shared memory,” others call it “local memory” and yet others call it “group shared memory.” In C++ AMP this programmable cache is called “tile_static memory”—more on that later.

Next, let’s map a logical model to the GPU hardware model, starting with memory lifetimes and scope.

A piece of global memory is accessible to all threads, and its lifetime exceeds that of the lifetime of a parallel_for_each computation, so a subsequent parallel_for_each invocation can operate on the same data. A register value is only accessible by one thread, and its lifetime is that of the thread. A piece of tile_static memory is shared by a subset of all the threads, which in C++ AMP is called a tile of threads, and its lifetime is that of the tile. Now you’re starting to see why you need to tile your computation: Without knowing which tile a thread belongs to, you have no way of using this fast tile_static memory.

You can think of the tile_static memory as being “leased” by the tile of threads until the tile completes execution, at which point another tile takes over. So, logically speaking, threads from different tiles are isolated from each other with regard to tile_static memory and they appear to each have ownership of the tile_static memory.

Using the New tile_static Storage Class

To access tile_static memory you use the tile_static storage class. This is the second enhancement that C++ AMP makes to the C++ language (the other being restrict, which I covered in my previous article).

You can only use this new storage class in restrict(amp) functions, and only if the parallel_for_each is tiled, and only for variables declared within that function block. Pointers and references can’t be marked tile_static, and any implicit constructors/destructors of tile_static variables aren’t called. Typically (but not always) your tile_static variables are arrays, and they are typically proportional to the tile size, for example:

tile_static float t[16][16]; // Access t to access tile static memory

The most common way to take advantage of tile_static memory is to identify areas in global memory that you access more than once in your algorithm. You then copy those areas into tile_static memory (paying the price of global memory access only once) and then change your algorithm to use the tile_static copy of the data (accessing it very fast, repeatedly), instead of accessing the data in global memory multiple times. The programmable cache is small, so you can’t copy all of your array and array_view data to tile_static variables. A tiled algorithm is typically more complex because it needs to work around that by copying only a tile-size-worth of data from global memory.

Before you look at a more real-world example that shows the technique of the preceding paragraph, let’s look at a simplistic, contrived and buggy example, focused purely on the usage of tile_static where I try to sum all the elements of a matrix:

1  static const int TS = 2;
2  array_view<int, 2> av(2, 6, my_vector);
3  parallel_for_each(av.extent.tile<TS,TS>(),
     [=](tiled_index<TS,TS> t_idx) restrict(amp)
4  {       
5    tile_static int t[TS][TS];   
6    t[t_idx.local[0]][t_idx.local[1]] = av[t_idx.global];
7
8    if (t_idx.local == index<2>(0,0)) {
9      t[0][0] = t[0][0] + t[0][1] + t[1][0] + t[1][1];             
10     av[t_idx.tile_origin] = t[0][0];
11   }
12 });
13 int sum = av(0,0) + av(0,2) + av(0,4); // The three tile_origins

The technical term for the preceding code is: horrible. Lines 9 and 13 take advantage of the fact that the tile size is 2x2=4 and that the overall size is 2x6=12 (hence three tiles), when real implementations should be written in such a way so that changing the tile size or overall extent doesn’t mean changing any other code. The performance of this code is also horrible because it has a naïve algorithm and its branching is not amenable to data parallelization. There’s no reuse of the data in the algorithm, so the usage of tile_static memory isn’t even justified. There’s also a correctness error that I’ll mention later. However, as horrible as this code is, it’s simple enough for someone brand new to tile_static storage to be able to understand the code—that’s all that matters.

At line 6 each thread in each tile copies data to its tile_static memory from global memory. At line 9, only the first thread of each tile will sum the results for that tile into the first position of the tile_static memory for that tile. Then on line 10 that same thread (in each tile) will store the result back into global memory. Then the computation on the accelerator completes and, on the host side on line 13, the CPU thread adds up the three sums from the three tiles into the sum variable.

Did you spot a correctness error, specifically a race condition? On to the next section to learn about the final part of the tiled API, which helps us deal with that bug.

tile_barrier Class

The race condition in the previous example is between lines 6 and 9. On line 9 the thread with local index (0,0) assumes that all the values are already stored in the tile_static variable t, which is only true if all other threads in the tile have already performed line 6. Because threads generally execute independently of one another, that isn’t an assumption you can make.

What you need here is a way to codify the condition that line 9 must not be executed until all threads in the tile have executed line 6. The way you express that constraint is by using a tile_barrier and calling one of its wait methods. You can’t construct a tile_barrier object yourself, but you can obtain one through the barrier property from the tiled_index passed to your lambda. So you can fix the race condition by inserting the following line, after line 6:

7 t_idx.barrier.wait();

Note that you couldn’t have placed this line just before line 9 and within the conditional block. A call to tile_barrier::wait must appear at a location where all threads in a tile will either reach that location or all of them will bypass it. If the barrier was allowed to be placed in such locations, then if one thread in a tile didn’t execute the wait, the program would hang, waiting on that thread.

The compiler is able to flag many of these race conditions for you, and the ones it can’t, the debugger can find for you if in Visual Studio 11 you turn on GPU Memory Access Exceptions under the Exceptions dialog, from the Debug | Exceptions menu.

Example: Tiled Matrix Multiplication

Do you recall the matrix multiplication example in Figure 4? It’s now time for part 2 of the tiling exercise, where you’ll also take advantage of tile_static memory and inevitably will use local indexing and the tile_barrier object.

Before showing the solution, I’ll remind you that on my laptop machine the simple C++ AMP matrix multiplication yields a performance improvement of more than 40 times compared to the serial CPU code for M=N=W=1024. The tiled solution that you’re about to see, with TS=16 (so 16x16 = 256 threads per tile), is an additional two times faster than the simple model! That measurement includes the data transfer, so if you had transferred the data already and performed a number of computations on the data, then the kernel execution time would be much more than two times faster than the variant that doesn’t use tile_static memory. So what complexity price are you paying for that kind of performance boost?

The fully tiled matrix multiplication code is shown in Figure 5.

Figure 5 Matrix Multiplication, Tiled Plus Using tile_static Memory

1  void MatMul(vector<int>& vC, const vector<int>& vA,
     const vector<int>& vB, int M, int N, int W )
2  {
3    array_view<const int,2> a(M, W, vA), b(W, N, vB);
4    array_view<int,2> c(M, N, vC);  
5    c.discard_data();
6
7    parallel_for_each(c.extent.tile<TS,TS>(),
       [=](tiled_index<TS,TS> t_idx) restrict(amp) 
8    {
9      int row = t_idx.local[0]; int col = t_idx.local[1];
10     tile_static int locA[TS][TS], locB[TS][TS];
11     int sum = 0;
12     for (int i = 0; i < a.extent[1]; i += TS) {
13       locA[row][col] = a(t_idx.global[0], col + i);
14       locB[row][col] = b(row + i, t_idx.global[1]);
15       t_idx.barrier.wait();
16
17       for (int k = 0; k < TS; k++)
18         sum += locA[row][k] * locB[k][col];           
19       t_idx.barrier.wait();
20     }
21     c[t_idx.global] = sum;
22   });
23   c.synchronize();
24 }

While the code in Figure 5works for any tile size and any total extent size (conforming to the rules described earlier), the easiest way to understand what it does is to use small numbers. Small numbers don’t perform well at run time, but they do help us get a grip on what’s going on. So let’s assume tiles that are 2-by-2 (TS =2) and M=2, N=6 and W=4. This configuration is pictured in Figure 6.

Figure 6 Matrix Multiplication Example with 12 Threads in 2-by-2 Tiles

To understand the code, follow just one tile, the second tile (of the three) and just one specific thread: the thread that calculates the result 160 (this is highlighted in Figure 6—the highlighting includes the row of A and the column of B that this thread needs access to in order to calculate the results into the cell of C).

Let’s walk through the code of Figure 5 while keeping an eye on the helpful picture of Figure 6.

When i=0, the Parallel Watch window in Figure 7 shows the values of the relevant variables up until the inner loop on line 16.

Figure 7 When i=0, the Values of Other Variables Are Shown in Each Column, Where Each Row Is a Thread

As you can see, the four threads of the tile cooperated to copy data into the two tile_static arrays locA and locB from matrices A and B. Then, after they all met at the barrier on line 15, they entered the inner loop on line 17. See Figure 8 for the relevant variable values for both k=0 and then k=1.

Figure 8 Still i=0, the Values of Other Variables Are Shown for k=0 and k=1

At this point the thread you’re following calculates a partial sum of 1x4 and in the second iteration of variable k adds it to 2x10, making a total of 24 (see Figure 8). It then meets the other threads at the barrier on line 19. They’re now ready to go into a second iteration on the outer loop. Notice that the variable i will have the value of 2, and Figure 9 shows the relevant new values of the variables up until the barrier.

Figure 9 When i=2, the Values of Other Variables Are Shown in This Table

Once again, the four threads of the tile cooperated to copy data into the two tile_static arrays locA and locB from matrices A and B, moving to the right in matrix A and downward for matrix B. Then, after they all meet at the barrier on line 15, they again enter the inner loop on line 17; see Figure 10 for the relevant variable values for both k=0 and k=1.

At this point, according to Figure 10, the thread you’re following adds to 24 the product of 3x16 and on the second k iteration it further adds 4x22, making a grand-total sum of 160. It then meets the other threads at the barrier on line 19. The threads are now ready to go into a third iteration on the outer loop, but they realize the loop condition is no longer satisfied for variable i, so they skip to line 21. The thread you’re following uses its global index to update the global memory in C.

Figure 10 Still i=2, the Values of Other Variables Are Shown for k=0 and k=1

You can see now how each matrix element was copied once from global memory by each thread and then reused by each of the other threads in the tile. As I stated earlier, the performance increase comes from copying a piece of data once from global memory into tile_static memory, and then reusing that piece of data multiple times in the code.

To help you understand the code, I chose a tile size of 4 (TS=2) plus small extents, but in real code both of those would be larger. The performance numbers I shared earlier relied on 16x16=256 threads per tile, so I was reusing a piece of data 256 times from fast memory instead of accessing it each time from the global memory. The tiled matrix multiplication implementation additionally benefits from better memory-access patterns on matrix A (while matrices B and C are accessed efficiently in all implementations), but memory coalescing is beyond the scope of this article.

That’s the conclusion of the tiled matrix multiplication that takes advantage of tile_static memory. If you’d like more practice working out tiled algorithms, see the samples on the “Parallel Programming in Native Code” team blog at bit.ly/xZt05R.

The Tradeoff

In this article I introduced you to tiling, the most common optimization technique for optimizing your C++ AMP code.

You learned how to use three new C++ AMP classes (tiled_extent, tiled_index and tile_barrier), plus the tile_static storage class. You know that you can start with a simple (not explicitly tiled) implementation. Then you can modify the implementation by calling the tile function on the extent object (picking a tile size) and modifying your lambda signature to accept a tiled_index (with the same tile size template arguments).

That completed the mechanistic step 1. For the second and final step, you rewrote your algorithm to use tile_static memory with appropriate use of synchronization using a tile_barrier. That’s the creative step, where for each algorithm you have to come up with a brand-new solution. The simple and tiled implementations of matrix multiplication demonstrate the complexity introduced with tiling, and also why its usage can result in performance gains. The tradeoff is yours to opt in to, or not.

It’s worth mentioning that this kind of cache-aware programming can result in speed gains for your CPU code, too, because you’d be helping the automatic transparent cache-management system do a better job. Also, your tiled code becomes more amenable to vectorization by smart compilers.

Beyond my two articles, there’s a wealth of C++ AMP content (documentation, tips, patterns, links to videos and samples) on the Parallel Programming team blog (bit.ly/bWfC5Y) that I strongly encourage you to read.

Daniel Moth is a principal program manager in the Microsoft Developer Division.

Thanks to the following technical experts for reviewing this article: Steve Deitz, Yossi Levanoni, Robin Reynolds-Haertle, Stephen Toub and Weirong Zhu