Walkthrough: Debugging a C++ AMP Application

Article
07/10/2012

This topic demonstrates how to debug an application that uses C++ Accelerated Massive Parallelism (C++ AMP) to take advantage of the graphics processing unit (GPU). It uses a parallel-reduction program that sums up a large array of integers. This walkthrough illustrates the following tasks:

Launching the GPU debugger.
Inspecting GPU threads in the GPU Threads window.
Using the Parallel Stacks window to simultaneously observe the call stacks of multiple GPU threads.
Using the Parallel Watch window to inspect values of a single expression across multiple threads at the same time.
Flagging, freezing, thawing, and grouping GPU threads.
Executing all the threads of a tile to a specific location in code.

Prerequisites

Before you start this walkthrough:

Read C++ AMP Overview.
Make sure that line numbers are displayed in the text editor. For more information, see How to: Display Line Numbers in the Editor.
Make sure you are running Windows 8 or Windows Server 2012 to support debugging on the software emulator.

Note

Your computer might show different names or locations for some of the Visual Studio user interface elements in the following instructions. The Visual Studio edition that you have and the settings that you use determine these elements. For more information, see Visual Studio Settings.

To create the sample project

Start Visual Studio.
On the menu bar, choose File, New, Project.
Under Installed in the templates pane, choose Visual C++.
Choose Win32 Console Application, type AMPMapReduce in the Name box, and then choose the OK button.
Choose the Next button.
Clear the Precompiled header check box, and then choose the Finish button.
In Solution Explorer, delete stdafx.h, targetver.h, and stdafx.cpp from the project.

Open AMPMapReduce.cpp and replace its content with the following code.

// AMPMapReduce.cpp defines the entry point for the program.
// The program performs a parallel-sum reduction that computes the sum of an array of integers. 

#include <stdio.h>
#include <tchar.h>
#include <amp.h>

const int BLOCK_DIM = 32;

using namespace concurrency;

void sum_kernel_tiled(tiled_index<BLOCK_DIM> t_idx, array<int, 1> &A, int stride_size) restrict(amp)
{
    tile_static int localA[BLOCK_DIM];

    index<1> globalIdx = t_idx.global * stride_size;
    index<1> localIdx = t_idx.local;

    localA[localIdx[0]] =  A[globalIdx];

    t_idx.barrier.wait();

    // Aggregate all elements in one tile into the first element.
    for (int i = BLOCK_DIM / 2; i > 0; i /= 2) 
    {
        if (localIdx[0] < i) 
        {

            localA[localIdx[0]] += localA[localIdx[0] + i];
        }

        t_idx.barrier.wait();
    }

    if (localIdx[0] == 0)
    {
        A[globalIdx] = localA[0];
    }
}

int size_after_padding(int n)
{
    // The extent might have to be slightly bigger than num_stride to 
    // be evenly divisible by BLOCK_DIM. You can do this by padding with zeros.
    // The calculation to do this is BLOCK_DIM * ceil(n / BLOCK_DIM)
    return ((n - 1) / BLOCK_DIM + 1) * BLOCK_DIM;
}

int reduction_sum_gpu_kernel(array<int, 1> input) 
{
    int len = input.extent[0];

    //Tree-based reduction control that uses the CPU.
    for (int stride_size = 1; stride_size < len; stride_size *= BLOCK_DIM) 
    {
        // Number of useful values in the array, given the current
        // stride size.
        int num_strides = len / stride_size;  

        extent<1> e(size_after_padding(num_strides));

        // The sum kernel that uses the GPU.
        parallel_for_each(extent<1>(e).tile<BLOCK_DIM>(), [&input, stride_size] (tiled_index<BLOCK_DIM> idx) restrict(amp)
        {
            sum_kernel_tiled(idx, input, stride_size);
        });
    }

    array_view<int, 1> output = input.section(extent<1>(1));
    return output[0];
}

int cpu_sum(const std::vector<int> &arr) {
    int sum = 0;
    for (size_t i = 0; i < arr.size(); i++) {
        sum += arr[i];
    }
    return sum;
}

std::vector<int> rand_vector(unsigned int size) {
    srand(2011);

    std::vector<int> vec(size);
    for (size_t i = 0; i < size; i++) {
        vec[i] = rand();
    }
    return vec;
}

array<int, 1> vector_to_array(const std::vector<int> &vec) {
    array<int, 1> arr(vec.size());
    copy(vec.begin(), vec.end(), arr);
    return arr;
}

int _tmain(int argc, _TCHAR* argv[])
{
    std::vector<int> vec = rand_vector(10000);
    array<int, 1> arr = vector_to_array(vec);

    int expected = cpu_sum(vec);
    int actual = reduction_sum_gpu_kernel(arr);

    bool passed = (expected == actual);
    if (!passed) {
        printf("Actual (GPU): %d, Expected (CPU): %d", actual, expected);
    }
    printf("sum: %s\n", passed ? "Passed!" : "Failed!"); 

    getchar();

    return 0;
}

On the menu bar, choose File, Save All.
In Solution Explorer, open the shortcut menu for AMPMapReduce, and then choose Properties.
In the Property Pages dialog box, under Configuration Properties, choose C/C++, Precompiled Headers.
For the Precompiled Header property, select Not Using Precompiled Headers, and then choose the OK button.
On the menu bar, choose Build, Build Solution.

Debugging the CPU Code

In this procedure, you will use the Local Windows Debugger to make sure that the CPU code in this application is correct. The segment of the CPU code in this application that is especially interesting is the for loop in the reduction_sum_gpu_kernel function. It controls the tree-based parallel reduction that is run on the GPU.

To debug the CPU code

In Solution Explorer, open the shortcut menu for AMPMapReduce, and then choose Properties.
In the Property Pages dialog box, under Configuration Properties, choose Debugging. Verify that Local Windows Debugger is selected in the Debugger to launch list.
Return to the Code Editor.
Set breakpoints on the lines of code shown in the following illustration (approximately lines 67 line 70).

CPU breakpoints
On the menu bar, choose Debug, Start Debugging.
In the Locals window, observe the value for stride_size until the breakpoint at line 70 is reached.
On the menu bar, choose Debug, Stop Debugging.

Debugging the GPU Code

This section shows how to debug the GPU code, which is the code contained in the sum_kernel_tiled function. The GPU code computes the sum of integers for each "block" in parallel.

To debug the GPU code

In Solution Explorer, open the shortcut menu for AMPMapReduce, and then choose Properties.
In the Property Pages dialog box, under Configuration Properties, choose Debugging.
In the Debugger to launch list, select Local Windows Debugger.
In the Debugger Type list, select GPU Only.
Choose the OK button.
Set a breakpoint at line 30, as shown in the following illustration.

GPU breakpoint
On the menu bar, choose Debug, Start Debugging. The breakpoints in the CPU code at lines 67 and 70 are not executed during GPU debugging because those lines of code are executed on the CPU.

To use the GPU Threads window

To open the GPU Threads window, on the menu bar, choose Debug, Windows, GPU Threads.

You can inspect the state the GPU threads in the GPU Threads window that appears.
Dock the GPU Threads window at the bottom of Visual Studio. Choose the Expand Thread Switch button to display the tile and thread text boxes. The GPU Threads window shows the total number of active and blocked GPU threads, as shown in the following illustration.

GPU Threads window

There are 313 tiles allocated for this computation. Each tile contains 32 threads. Because local GPU debugging occurs on a software emulator, there are four active GPU threads. The four threads execute the instructions simultaneously and then move on together to the next instruction.

In the GPU threads window, there are four GPU threads active and 28 GPU threads blocked at the tile_barrier::wait statement defined at about line 21 (t_idx.barrier.wait();). All 32 GPU threads belong to the first tile, tile[0]. An arrow points to the row that includes the current thread. To switch to a different thread, use one of the following methods:
- In the row for the thread to switch to in the GPU Threads window, open the shortcut menu and choose Switch To Thread. If the row represents more than one thread, you will switch to the first thread according to the thread coordinates.
- Enter the tile and thread values of the thread in the corresponding text boxes and then choose the Switch Thread button.
The Call Stack window displays the call stack of the current GPU thread.

To use the Parallel Stacks window

To open the Parallel Stacks window, on the menu bar, choose Debug, Windows, Parallel Stacks.

You can use the Parallel Stacks window to simultaneously inspect the stack frames of multiple GPU threads.
Dock the Parallel Stacks window at the bottom of Visual Studio.
Make sure that Threads is selected in the list in the upper-left corner. In the following illustration, the Parallel Stacks window shows a call-stack focused view of the GPU threads that you saw in the GPU Threads window.

Parallel Stacks window

32 threads went from _kernel_stub to the lambda statement in the parallel_for_each function call and then to the sum_kernel_tiled function, where the parallel reduction occurs. 28 out of the 32 threads have progressed to the tile_barrier::wait statement and remain blocked at line 22, whereas the other 4 threads remain active in the sum_kernel_tiled function at line 30.

You can inspect the properties of a GPU thread that are available in the GPU Threads window in the rich DataTip of the Parallel Stacks window. To do this, rest the mouse pointer on the stack frame of sum_kernel_tiled. The following illustration shows the DataTip.

GPU thread DataTip

For more information about the Parallel Stacks window, see Using the Parallel Stacks Window.

To use the Parallel Watch window

To open the Parallel Watch window, on the menu bar, choose Debug, Windows, Parallel Watch, Parallel Watch 1.

You can use the Parallel Watch window to inspect the values of an expression across multiple threads.
Dock the Parallel Watch 1 window to the bottom of Visual Studio. There are 32 rows in the table of the Parallel Watch window. Each corresponds to a GPU thread that appeared in both the GPU Threads window and the Parallel Stacks window. Now, you can enter expressions whose values you want to inspect across all 32 GPU threads.
Select the Add Watch column header, enter localIdx, and then choose the Enter key.
Select the Add Watch column header again, type globalIdx, and then choose the Enter key.
Select the Add Watch column header again, type localA[localIdx[0]], and then choose the Enter key.

You can sort by a specified expression by selecting its corresponding column header.

Select the localA[localIdx[0]] column header to sort the column. The following illustration shows the results of sorting by localA[localIdx[0]].

Results of sort

You can export the content in the Parallel Watch window to Excel by choosing the Excel button and then choosing Open in Excel. If you have Excel installed on your development computer, this opens an Excel worksheet that contains the content.
In the upper-right corner of the Parallel Watch window, there's a filter control that you can use to filter the content by using Boolean expressions. Enter localA[localIdx[0]] > 20000 in the filter control text box and then choose the Enter key.

The window now contains only threads on which the localA[localIdx[0]] value is greater than 20000. The content is still sorted by the localA[localIdx[0]] column, which is the sorting action you performed earlier.

Flagging GPU Threads

You can mark specific GPU threads by flagging them in the GPU Threads window, the Parallel Watch window, or the DataTip in the Parallel Stacks window. If a row in the GPU Threads window contains more than one thread, flagging that row flags all threads that are contained in the row.

To flag GPU threads

Select the [Thread] column header in the Parallel Watch 1 window to sort by tile index and thread index.
On the menu bar, choose Debug, Continue, which causes the four threads that were active to progress to the next barrier (defined at line 32 of AMPMapReduce.cpp).
Choose the flag symbol on the left side of the row that contains the four threads that are now active.

The following illustration shows the four active flagged threads in the GPU Threads window.

Active threads in the GPU Threads window

The Parallel Watch window and the DataTip of the Parallel stacks window both indicate the flagged threads.
If you want to focus on the four threads that you flagged, you can choose to show, in the GPU Threads, Parallel Watch, and Parallel Stacks windows, only the flagged threads.

Choose the Show Flagged Only button on any of the windows or on the Debug Location toolbar. The following illustration shows the Show Flagged Only button on the Debug Location toolbar.

Show Flagged Only button

Now the GPU Threads, Parallel Watch, and Parallel Stacks windows display only the flagged threads.

Freezing and Thawing GPU Threads

You can freeze (suspend) and thaw (resume) GPU threads from either the GPU Threads window or the Parallel Watch window. You can freeze and thaw CPU threads the same way; for information, see How to: Use the Threads Window.

To freeze and thaw GPU threads

Choose the Show Flagged Only button to display all the threads.
On the menu bar, choose Debug, Continue.
Open the shortcut menu for the active row and then choose Freeze.

The following illustration of the GPU Threads window shows that all four threads are frozen.

Frozen threads in the GPU Threads window

Similarly, the Parallel Watch window shows that all four threads are frozen.
On the menu bar, choose Debug, Continue to allow the next four GPU threads to progress past the barrier at line 22 and to reach the breakpoint at line 30. The GPU Threads window shows that the four previously frozen threads remain frozen and in the active state.
On the menu bar, choose Debug, Continue.
From the Parallel Watch window, you can also thaw individual or multiple GPU threads.

To group GPU threads

On the shortcut menu for one of the threads in the GPU Threads window, choose Group By, Address.

The threads in the GPU Threads window are grouped by address. The address corresponds to the instruction in disassembly where each group of threads is located. 24 threads are at line 22 where the tile_barrier::wait Method is executed. 12 threads are at the instruction for the barrier at line 32. Four of these threads are flagged. Eight threads are at the breakpoint at line 30. Four of these threads are frozen. The following illustration shows the grouped threads in the GPU Threads window.

Grouped threads in the GPU Threads window
You can also perform the Group By operation by opening the shortcut menu for the data grid of the Parallel Watch window, choosing Group By, and then choosing the menu item that corresponds to how you want to group the threads.

Running All Threads to a Specific Location in Code

You run all the threads in a given tile to the line that contains the cursor by using Run Current Tile To Cursor.

To run all threads to the location marked by the cursor

On the shortcut menu for the frozen threads, choose Thaw.
In the Code Editor, put the cursor in line 30.
On the shortcut menu for the Code Editor, choose Run Current Tile To Cursor.

The 24 threads that were previously blocked at the barrier at line 21 have progressed to line 32. This is shown in the GPU Threads window.