CPUSets for game development

Introduction

The Universal Windows Platform (UWP) is at the core of a wide range of consumer electronic devices. As such, it requires a general-purpose API that addresses the needs of all types of applications, from games to embedded apps to enterprise software running on servers. By leveraging the right information provided by the API, you can ensure your game runs at its best on any hardware.

CPUSets API

The CPUSets API provides control over which CPU sets are available for threads to be scheduled on. Two functions are available to control where threads are scheduled:

  • SetProcessDefaultCpuSets – This function can be used to specify which CPU sets new threads may run on if they are not assigned to specific CPU sets.
  • SetThreadSelectedCpuSets – This function allows you to limit the CPU sets a specific thread may run on.

If the SetProcessDefaultCpuSets function is never used, newly created threads may be scheduled on any CPU set available to your process. This section goes over the basics of the CPUSets API.
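The precedence between the two functions can be sketched in portable C++. This is a simplified model under stated assumptions, not the Windows implementation: ProcessModel, ThreadModel, and EffectiveCpuSets are hypothetical names introduced only for illustration, and an empty list stands for "the corresponding function was never called".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model types; they are not part of the Windows API.
using CpuSetId = std::uint32_t;

struct ProcessModel {
    std::vector<CpuSetId> availableSets;  // every CPU set available to the process
    std::vector<CpuSetId> defaultSets;    // empty: SetProcessDefaultCpuSets never called
};

struct ThreadModel {
    std::vector<CpuSetId> selectedSets;   // empty: SetThreadSelectedCpuSets never called
};

// Returns the CPU sets a thread may be scheduled on in this model:
// the thread's own selection first, then the process default,
// then every CPU set available to the process.
std::vector<CpuSetId> EffectiveCpuSets(const ProcessModel& process,
                                       const ThreadModel& thread) {
    assert(!process.availableSets.empty());  // a process always has at least one CPU set
    if (!thread.selectedSets.empty()) return thread.selectedSets;
    if (!process.defaultSets.empty()) return process.defaultSets;
    return process.availableSets;
}
```

In this model, a thread with no selection follows the process default, and a thread with a selection ignores it, which is the behavior the two bullet points describe.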

GetSystemCpuSetInformation

The first API used for gathering information is the GetSystemCpuSetInformation function. This function populates information in an array of SYSTEM_CPU_SET_INFORMATION objects provided by game code. The memory for the destination must be allocated by game code, and its size is determined by calling GetSystemCpuSetInformation itself. This requires two calls to GetSystemCpuSetInformation, as demonstrated in the following example.

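// First call: pass no buffer to learn the required size in bytes.
unsigned long size = 0;
HANDLE curProc = GetCurrentProcess();
GetSystemCpuSetInformation(nullptr, 0, &size, curProc, 0);

// Allocate a buffer of that size and view it as an array of CPU set entries.
std::unique_ptr<uint8_t[]> buffer(new uint8_t[size]);
PSYSTEM_CPU_SET_INFORMATION cpuSets = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.get());

// Second call: fill the buffer.
if (!GetSystemCpuSetInformation(cpuSets, size, &size, curProc, 0))
{
    // Error: call GetLastError() for details.
}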

Each instance of SYSTEM_CPU_SET_INFORMATION returned contains information about one unique processing unit, also known as a CPU set. This does not necessarily mean that it represents a unique physical piece of hardware. CPUs that utilize hyperthreading have multiple logical cores running on a single physical processing core. Scheduling multiple threads on different logical cores that reside on the same physical core allows hardware-level resource optimization that would otherwise require extra work at the kernel level. Two threads scheduled on separate logical cores of the same physical core must share CPU time, but run more efficiently than if they were scheduled on the same logical core.
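The relationship between logical and physical cores can be made concrete by grouping CPU sets by their physical core. The following is a minimal portable sketch: CpuSetRecord is a hypothetical stand-in for the Id and CoreIndex members of SYSTEM_CPU_SET_INFORMATION, and the function names are illustrative. It assumes all CPU sets belong to a single processor group.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for the SYSTEM_CPU_SET_INFORMATION fields used here.
struct CpuSetRecord {
    std::uint32_t id;        // mirrors CpuSet.Id
    std::uint8_t coreIndex;  // mirrors CpuSet.CoreIndex
};

// Maps each physical core index to the IDs of the CPU sets (logical cores) on it.
std::map<std::uint8_t, std::vector<std::uint32_t>> GroupByPhysicalCore(
    const std::vector<CpuSetRecord>& records) {
    std::map<std::uint8_t, std::vector<std::uint32_t>> byCore;
    for (const CpuSetRecord& r : records) {
        byCore[r.coreIndex].push_back(r.id);
    }
    return byCore;
}

// Reports whether any physical core hosts more than one CPU set.
bool HasSiblingLogicalCores(const std::vector<CpuSetRecord>& records) {
    const auto byCore = GroupByPhysicalCore(records);
    assert(byCore.size() <= records.size());  // never more cores than CPU sets
    return byCore.size() != records.size();
}
```

With four made-up records such as {256, 0}, {257, 0}, {258, 2}, {259, 2}, the map holds two physical cores with two logical cores each, so two threads pinned to 256 and 257 would share one physical core.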

SYSTEM_CPU_SET_INFORMATION

Each instance of this data structure returned from GetSystemCpuSetInformation contains information about a unique processing unit that threads may be scheduled on. Given the possible range of target devices, a lot of the information in the SYSTEM_CPU_SET_INFORMATION data structure may not be applicable to game development. Table 1 provides an explanation of the data members that are useful for game development.

Table 1. Data members useful for game development.

| Member name | Data type | Description |
| --- | --- | --- |
| Type | CPU_SET_INFORMATION_TYPE | The type of information in the structure. If the value is not CpuSetInformation, the entry should be ignored. |
| Id | unsigned long | The ID of the specified CPU set. This is the ID that should be used with CPU set functions such as SetThreadSelectedCpuSets. |
| Group | unsigned short | Specifies the "processor group" of the CPU set. Processor groups allow a PC to have more than 64 logical cores, and allow for hot swapping of CPUs while the system is running. It is uncommon to see a PC that is not a server with more than one group. Unless you are writing applications meant to run on large servers or server farms, it is best to use CPU sets in a single group because most consumer PCs will only have one processor group. All other values in this structure are relative to the Group. |
| LogicalProcessorIndex | unsigned char | Group relative index of the CPU set. |
| CoreIndex | unsigned char | Group relative index of the physical CPU core where the CPU set is located. |
| LastLevelCacheIndex | unsigned char | Group relative index of the last level cache associated with this CPU set. Unless the system utilizes NUMA nodes, this is the slowest cache, usually the L2 or L3 cache. |

The other data members provide information that is unlikely to describe CPUs in consumer PCs or other consumer devices and is unlikely to be useful. The information returned can then be used to organize threads in various ways. The Considerations for game development section of this white paper details a few ways to leverage this data to optimize thread allocation.

The following are some examples of the type of information gathered from UWP applications running on various types of hardware.

Table 2. Information returned from a UWP app running on a Microsoft Lumia 950. This is an example of a system that has multiple last level caches. The Lumia 950 features a Qualcomm Snapdragon 808 processor that contains a dual-core ARM Cortex-A57 CPU and a quad-core ARM Cortex-A53 CPU.
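Because entries whose Type is not CpuSetInformation should be ignored, code that reads the returned buffer typically checks Type on every entry and advances by the entry's own Size member (as the hyperthreading sample later in this paper does) rather than by a fixed stride. The following portable sketch illustrates that traversal. RecordHeader, kCpuSetInformation, and FindCpuSetRecords are hypothetical names, and the header layout is an assumption that only mirrors the leading Size and Type members, not the real Windows declaration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical header mirroring the Size and Type members that begin each
// entry; this is an illustrative stand-in, not the Windows structure.
struct RecordHeader {
    std::uint32_t size;  // total size of this record in bytes
    std::uint32_t type;  // kCpuSetInformation marks a usable record
};

constexpr std::uint32_t kCpuSetInformation = 0;  // assumed value for this sketch

// Returns the byte offsets of all records whose type is kCpuSetInformation.
std::vector<std::size_t> FindCpuSetRecords(const std::uint8_t* data,
                                           std::size_t length) {
    assert(data != nullptr || length == 0);
    std::vector<std::size_t> offsets;
    std::size_t offset = 0;
    while (offset + sizeof(RecordHeader) <= length) {
        RecordHeader header;
        std::memcpy(&header, data + offset, sizeof header);  // safe for any alignment
        if (header.size < sizeof(RecordHeader) || header.size > length - offset) {
            break;  // malformed record: stop instead of looping forever or overrunning
        }
        if (header.type == kCpuSetInformation) {
            offsets.push_back(offset);
        }
        offset += header.size;  // advance by the record's own size
    }
    return offsets;
}
```

Stepping by the stored size keeps the traversal correct even if a record is larger than the structure definition the code was compiled against.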


Table 3. Information returned from a UWP app running on a typical PC. This is an example of a system that uses hyperthreading; each physical core has two logical cores onto which threads can be scheduled. In this case, the system contained an Intel Xeon CPU E5-2620.


Table 4. Information returned from a UWP app running on a quad core Microsoft Surface Pro 4. This system had an Intel Core i5-6300 CPU.


SetThreadSelectedCpuSets

Now that information about the CPU sets is available, it can be used to organize threads. The handle of a thread created with CreateThread is passed to this function along with an array of IDs of the CPU sets that the thread can be scheduled on. One example of its usage is demonstrated in the following code.

// Assumes cpuSets (filled by GetSystemCpuSetInformation) holds at least two entries.
HANDLE audioHandle = CreateThread(nullptr, 0, AudioThread, nullptr, 0, nullptr);
unsigned long cores[] = { cpuSets[0].CpuSet.Id, cpuSets[1].CpuSet.Id };
SetThreadSelectedCpuSets(audioHandle, cores, 2);

In this example, a thread is created based on a function declared as AudioThread. This thread is then allowed to be scheduled on one of two CPU sets. Thread ownership of the CPU set is not exclusive. Threads that are created without being locked to a specific CPU set may take time from the AudioThread. Likewise, other threads may be locked to one or both of these CPU sets at a later time.

SetProcessDefaultCpuSets

SetProcessDefaultCpuSets is the counterpart to SetThreadSelectedCpuSets. When threads are created, they do not need to be locked to certain CPU sets. If you do not want these threads to run on specific CPU sets (those used by your render thread or audio thread, for example), you can use this function to specify which cores these threads are allowed to be scheduled on.

Considerations for game development

As we've seen, the CPUSets API provides a lot of information and flexibility when it comes to scheduling threads. Instead of taking the bottom-up approach of trying to find uses for this data, it is more effective to take the top-down approach of finding how the data can be used to accommodate common scenarios.

Working with time critical threads and hyperthreading

This method is effective if your game has a few threads that must run in real time along with other worker threads that require relatively little CPU time. Some tasks, like continuous background music, must run without interruption for an optimal gaming experience. Even a single frame of starvation for an audio thread may cause popping or glitching, so it is critical that it receives the necessary amount of CPU time every frame.

Using SetThreadSelectedCpuSets in conjunction with SetProcessDefaultCpuSets can ensure your heavy threads remain uninterrupted by any worker threads. SetThreadSelectedCpuSets can be used to assign your heavy threads to specific CPU sets. SetProcessDefaultCpuSets can then be used to make sure any unassigned threads are put on other CPU sets. In the case of CPUs that utilize hyperthreading, it's also important to account for logical cores on the same physical core. Worker threads should not be allowed to run on logical cores that share a physical core with a thread that you want to run with real-time responsiveness. The following code demonstrates how to determine whether a PC uses hyperthreading.

unsigned long retsize = 0;
(void)GetSystemCpuSetInformation( nullptr, 0, &retsize,
    GetCurrentProcess(), 0);
 
std::unique_ptr<uint8_t[]> data( new uint8_t[retsize] );
if ( !GetSystemCpuSetInformation(
    reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>( data.get() ),
    retsize, &retsize, GetCurrentProcess(), 0) )
{
    // Error!
}
 
std::set<DWORD> cores;
std::vector<DWORD> processors;
uint8_t const * ptr = data.get();
for( DWORD size = 0; size < retsize; ) {
    auto info = reinterpret_cast<const SYSTEM_CPU_SET_INFORMATION*>( ptr );
    if ( info->Type == CpuSetInformation ) {
         processors.push_back( info->CpuSet.Id );
         cores.insert( info->CpuSet.CoreIndex );
    }
    ptr += info->Size;
    size += info->Size;
}
 
bool hyperthreaded = processors.size() != cores.size();

If the system utilizes hyperthreading, it is important that the set of default CPU sets does not include any logical cores on the same physical core as any real-time threads. If the system does not use hyperthreading, it is only necessary to make sure that the default CPU sets do not include the same core as the CPU set running your audio thread.
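The rule above can be sketched as a small portable function that computes which CPU set IDs are safe to use as process defaults. CpuSetRecord and DefaultSetsAvoiding are hypothetical names introduced for this illustration, the record type is a simplified stand-in for the relevant SYSTEM_CPU_SET_INFORMATION members, and the sketch assumes a single processor group with unique CPU set IDs.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for the SYSTEM_CPU_SET_INFORMATION fields used here.
struct CpuSetRecord {
    std::uint32_t id;        // mirrors CpuSet.Id
    std::uint8_t coreIndex;  // mirrors CpuSet.CoreIndex
};

// Returns every CPU set ID that is not located on a physical core hosting
// one of the real-time CPU sets. Works with and without hyperthreading.
std::vector<std::uint32_t> DefaultSetsAvoiding(
    const std::vector<CpuSetRecord>& records,
    const std::set<std::uint32_t>& realTimeIds) {
    // Collect the physical cores occupied by real-time threads.
    std::set<std::uint8_t> reservedCores;
    for (const CpuSetRecord& r : records) {
        if (realTimeIds.count(r.id) != 0) {
            reservedCores.insert(r.coreIndex);
        }
    }
    // With unique IDs, each real-time CPU set reserves at most one core.
    assert(reservedCores.size() <= realTimeIds.size());

    // Keep only CPU sets on cores that no real-time thread uses.
    std::vector<std::uint32_t> defaults;
    for (const CpuSetRecord& r : records) {
        if (reservedCores.count(r.coreIndex) == 0) {
            defaults.push_back(r.id);
        }
    }
    return defaults;
}
```

The resulting list is the kind of ID array that would be passed to SetProcessDefaultCpuSets, while the real-time IDs go to SetThreadSelectedCpuSets for the audio or render thread.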

An example of organizing threads based on physical cores can be found in the CPUSets sample available on the GitHub repository linked in the Additional resources section.

Reducing the cost of cache coherence with last level cache

Cache coherency means that cached memory is the same across multiple hardware resources that act on the same data. If threads are scheduled on different cores but work on the same data, they may be working on separate copies of that data in different caches. In order to get correct results, these caches must be kept coherent with each other. Maintaining coherency between multiple caches is relatively expensive, but necessary for any multi-core system to operate. Additionally, it is completely out of the control of client code; the underlying system works independently to keep caches up to date by accessing shared memory resources between cores.

If your game has multiple threads that share an especially large amount of data, you can minimize the cost of cache coherency by ensuring that they are scheduled on CPU sets that share a last level cache. The last level cache is the slowest cache available to a core on systems that do not utilize NUMA nodes. It is extremely rare for a gaming PC to utilize NUMA nodes. If cores do not share a last level cache, maintaining coherency requires accessing higher-level, and therefore slower, memory resources. Locking two threads to separate CPU sets that share a cache and a physical core may provide even better performance than scheduling them on separate physical cores, provided they do not require more than 50% of the time in any given frame.

This code example shows how to determine whether threads that communicate frequently can share a last level cache.

unsigned long retsize = 0;
(void)GetSystemCpuSetInformation(nullptr, 0, &retsize,
    GetCurrentProcess(), 0);
 
std::unique_ptr<uint8_t[]> data(new uint8_t[retsize]);
if (!GetSystemCpuSetInformation(
    reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(data.get()),
    retsize, &retsize, GetCurrentProcess(), 0))
{
    // Error!
}
 
bool sharedcache = false;

// Group CPU sets by last level cache. Entries are walked using each entry's
// Size member instead of assuming a fixed sizeof(SYSTEM_CPU_SET_INFORMATION) stride.
std::map<unsigned char, std::vector<SYSTEM_CPU_SET_INFORMATION>> cachemap;
uint8_t const * ptr = data.get();
for (unsigned long offset = 0; offset < retsize; )
{
    auto cpuset = reinterpret_cast<const SYSTEM_CPU_SET_INFORMATION*>(ptr);
    if (cpuset->Type == CPU_SET_INFORMATION_TYPE::CpuSetInformation)
    {
        auto& sets = cachemap[cpuset->CpuSet.LastLevelCacheIndex];
        // A second CPU set on the same cache means at least one cache is shared.
        if (!sets.empty())
        {
            sharedcache = true;
        }
        sets.push_back(*cpuset);
    }
    ptr += cpuset->Size;
    offset += cpuset->Size;
}

The cache layout illustrated in Figure 1 is an example of the type of layout you might see from a system. This figure is an illustration of the caches found in a Microsoft Lumia 950. Inter-thread communication occurring between CPU 256 and CPU 260 would incur significant overhead because it would require the system to keep their L2 caches coherent.

Figure 1. Cache architecture found on a Microsoft Lumia 950 device.

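Once CPU sets are grouped by last level cache, choosing where to place threads that communicate heavily becomes a lookup. The sketch below is illustrative only: CpuSetRecord, ShareLastLevelCache, and LargestSharedCacheGroup are hypothetical names, and any IDs and cache indices fed to it in examples are made-up values for a two-cluster layout, not data read from a device.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for the SYSTEM_CPU_SET_INFORMATION fields used here.
struct CpuSetRecord {
    std::uint32_t id;                  // mirrors CpuSet.Id
    std::uint8_t lastLevelCacheIndex;  // mirrors CpuSet.LastLevelCacheIndex
};

// True when both CPU sets report the same last level cache.
bool ShareLastLevelCache(const std::vector<CpuSetRecord>& records,
                         std::uint32_t idA, std::uint32_t idB) {
    int cacheA = -1;  // -1 means "ID not found"
    int cacheB = -1;
    for (const CpuSetRecord& r : records) {
        if (r.id == idA) cacheA = r.lastLevelCacheIndex;
        if (r.id == idB) cacheB = r.lastLevelCacheIndex;
    }
    assert(cacheA >= 0 && cacheB >= 0);  // both IDs must come from the records
    return cacheA == cacheB;
}

// Returns the IDs of the largest group of CPU sets sharing one last level cache.
std::vector<std::uint32_t> LargestSharedCacheGroup(
    const std::vector<CpuSetRecord>& records) {
    std::map<std::uint8_t, std::vector<std::uint32_t>> byCache;
    for (const CpuSetRecord& r : records) {
        byCache[r.lastLevelCacheIndex].push_back(r.id);
    }
    std::vector<std::uint32_t> best;
    for (const auto& entry : byCache) {
        if (entry.second.size() > best.size()) best = entry.second;
    }
    return best;
}
```

For a layout in which CPU sets 256 through 259 report one cache index and 260 through 261 report another, ShareLastLevelCache is false for the pair 256 and 260, so threads that exchange a lot of data would be better placed within a single group.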

Summary

The CPUSets API available for UWP development provides a considerable amount of information and control over your multithreading options. The added complexity compared to previous multithreading APIs for Windows development comes with a learning curve, but the increased flexibility ultimately allows for better performance across a range of consumer PCs and other hardware targets.

Additional resources