Very poor C++ AMP (memory) performance on Xbox Series S

WTrei 6 Reputation points
2021-03-01T07:24:31.123+00:00

Hey Guys,

As mentioned yesterday in another question, I am trying to run a C++ AMP app on an Xbox Series S.

To be more precise, the workload comes from the domain of computational fluid dynamics and is mostly memory heavy. It is similar to a matrix multiplication, but with a somewhat more complicated kernel and less streamlined memory access patterns. In general, the bottleneck is loading multiple pages of data from large buffers and processing them quickly.

The kernel needs almost no communication with the CPU side, so the main loop looks like this:

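while (!done) {
    // Copy the updated 1 kB of input into the host side of ioBuffer
    // (data() already returns a pointer, so no address-of is needed)
    memcpy(ioBuffer.data(), &updatedData, 1024);

    // Put work to card
    ioBuffer.synchronize_to(acc_v);

    // Run the work
    concurrency::extent<2> workSize(65536, 60);
    parallel_for_each(acc_v, workSize.tile<64,1>(), [=](tiled_index<64,1> idx) restrict(amp) {
        work_GPU(idx, ioBuffer, tmpBuffer0, tmpBuffer1, tmpBuffer2);
    });

    // Copy the results back to the host (blocks until the kernel has finished)
    ioBuffer.synchronize();
    // Check if we are done
}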

Note that the tmpBuffers hold data that was filled in before the main loop with the help of the ioBuffer, so they themselves are never synchronized with the host side. In total they are 3.5 GB in size (I would like to use 4.5 GB, but that is another topic). The ioBuffer is 1 kB in size, and every call to synchronize_to targets my primary accelerator. The same is true for every kernel launch, which is always performed on that accelerator.

Now the central question: why is this so slow?
I use a Ryzen 2600 + RX 6800 as a development system. When I run my app locally, the total time between ioBuffer.synchronize_to(acc_v); and ioBuffer.synchronize(); is about 81 ms. This is well within the expected range for this kind of work, and it matches an OpenCL kernel I wrote earlier for the same task (79.5 ms on average). But when I load the app (which, apart from this, is mostly a simple one-page GUI written in XAML) onto the Xbox Series S, the same code takes about 4.5 seconds. I would expect around 160-200 ms if the code hits the faster memory segment of the Xbox and ~750 ms if it hits the slower one, so over 4.5 seconds is far outside the expected range.

Note that I made sure the accelerator I use is the GPU part of the APU (it reports itself as SraKmd_arden); using the WARP accelerator or the CPU only is even slower than the already too slow GPU. One thing I observed is that the GPU load on my desktop shows only a very fine sawtooth pattern, staying constantly between 85 and 100%, while on the Xbox the GPU load briefly spikes to 100% every few seconds and then drops all the way back to 0. The thread that feeds the GPU is a std::thread that was created in the GUI thread and then detached; at the moment it only reports the duration the kernels took back to the GUI.

I would appreciate any ideas as to why the performance might be so different between these two devices, although I tried to mimic the Xbox hardware as well as possible on the desktop.

Edit:
I would like to append a question. I spoke above about the two memory segments. Is there any way to request the placement of the memory objects one creates? For example, can I ask for my AMP arrays or array_views to be put into the fast segment while stating that I am fine with other things moving to the slower one? If this is not possible, is there any documentation about the memory assignment logic (e.g. always put GPU / host memory first) or about other things that might trigger a prioritization?

Universal Windows Platform (UWP)
C++

1 answer

  1. Mshzhb 6 Reputation points
    2022-06-21T16:11:01.033+00:00