CPUSets for game development

Introduction

The Universal Windows Platform (UWP) is at the core of a wide range of consumer electronic devices. As such, it requires a general-purpose API that addresses the needs of all types of applications, from games to embedded apps to enterprise software running on servers. By using the right information provided by the API, you can ensure your game runs at its best on any hardware.

CPUSets API

The CPUSets API provides control over which CPU sets are available for threads to be scheduled on. Two functions are available to control where threads are scheduled:

  • SetProcessDefaultCpuSets – This function specifies which CPU sets new threads may run on if they are not assigned to specific CPU sets.
  • SetThreadSelectedCpuSets – This function limits the CPU sets that a specific thread may run on.

If the SetProcessDefaultCpuSets function is never used, newly created threads may be scheduled on any CPU set available to your process. This section goes over the basics of the CPUSets API.

GetSystemCpuSetInformation

The first API used for gathering information is the GetSystemCpuSetInformation function. This function populates an array of SYSTEM_CPU_SET_INFORMATION objects provided by game code. The memory for the destination must be allocated by game code, and its size is determined by calling GetSystemCpuSetInformation itself. This requires two calls to GetSystemCpuSetInformation, as demonstrated in the following example.

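// First call: pass a null buffer to learn the required size in bytes.
unsigned long size = 0;
HANDLE curProc = GetCurrentProcess();
GetSystemCpuSetInformation(nullptr, 0, &size, curProc, 0);

// Allocate the destination buffer.
std::unique_ptr<uint8_t[]> buffer(new uint8_t[size]);

PSYSTEM_CPU_SET_INFORMATION cpuSets = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.get());

// Second call: fill the buffer with one entry per CPU set.
GetSystemCpuSetInformation(cpuSets, size, &size, curProc, 0);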

Each instance of SYSTEM_CPU_SET_INFORMATION returned contains information about one unique processing unit, also known as a CPU set. This does not necessarily mean that it represents a unique physical piece of hardware. CPUs that use hyperthreading have multiple logical cores running on a single physical processing core. Scheduling multiple threads on different logical cores that reside on the same physical core allows hardware-level resource optimization that would otherwise require extra work at the kernel level. Two threads scheduled on separate logical cores on the same physical core must share CPU time, but they run more efficiently than if they were scheduled on the same logical core.
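The relationship between logical and physical cores can be made concrete by grouping CPU set IDs by physical core. The following is a minimal, platform-independent sketch. The CpuSetInfo struct, the CoreKey alias, and the groupByPhysicalCore function are hypothetical names introduced for illustration; the struct only mirrors the Id, Group, and CoreIndex fields that Windows code would copy out of each SYSTEM_CPU_SET_INFORMATION entry.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified mirror of the fields used in this sketch.
struct CpuSetInfo {
    uint32_t id;         // CpuSet.Id
    uint16_t group;      // CpuSet.Group
    uint8_t  coreIndex;  // CpuSet.CoreIndex (group-relative)
};

// CoreIndex is relative to the processor group, so a physical core is
// identified by the pair (group, coreIndex).
using CoreKey = std::pair<uint16_t, uint8_t>;

// Returns, for each physical core, the IDs of the CPU sets (logical cores)
// that reside on it.
std::map<CoreKey, std::vector<uint32_t>> groupByPhysicalCore(
    const std::vector<CpuSetInfo>& sets)
{
    std::map<CoreKey, std::vector<uint32_t>> byCore;
    for (const CpuSetInfo& s : sets) {
        byCore[CoreKey{s.group, s.coreIndex}].push_back(s.id);
    }
    return byCore;
}
```

A physical core whose list holds more than one ID hosts several logical cores, so threads pinned to those IDs share that core's execution resources.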

SYSTEM_CPU_SET_INFORMATION

Each instance of this data structure returned from GetSystemCpuSetInformation contains information about a unique processing unit that threads may be scheduled on. Given the possible range of target devices, much of the information in the SYSTEM_CPU_SET_INFORMATION data structure may not be applicable to game development. Table 1 explains the data members that are useful for game development.

Table 1. Data members useful for game development.

| Member name | Data type | Description |
| --- | --- | --- |
| Type | CPU_SET_INFORMATION_TYPE | The type of information in the structure. If the value of this is not CpuSetInformation, it should be ignored. |
| Id | unsigned long | The ID of the specified CPU set. This is the ID that should be used with CPU set functions such as SetThreadSelectedCpuSets. |
| Group | unsigned short | Specifies the "processor group" of the CPU set. Processor groups allow a PC to have more than 64 logical cores, and allow for hot swapping of CPUs while the system is running. It is uncommon to see a PC that is not a server with more than one group. Unless you are writing applications meant to run on large servers or server farms, it is best to use CPU sets in a single group because most consumer PCs will only have one processor group. All other values in this structure are relative to the Group. |
| LogicalProcessorIndex | unsigned char | Group-relative index of the CPU set. |
| CoreIndex | unsigned char | Group-relative index of the physical CPU core where the CPU set is located. |
| LastLevelCacheIndex | unsigned char | Group-relative index of the last level cache associated with this CPU set. This is the slowest cache unless the system utilizes NUMA nodes, usually the L2 or L3 cache. |
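
Because every index other than Id is relative to Group, one simple approach is to restrict thread placement to the CPU sets of a single processor group. The sketch below assumes a hypothetical CpuSetInfo struct that mirrors only the Id and Group members, and a hypothetical helper named cpuSetIdsInGroup; neither is part of the Windows API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified mirror of the fields used in this sketch.
struct CpuSetInfo {
    uint32_t id;     // CpuSet.Id
    uint16_t group;  // CpuSet.Group
};

// Returns the IDs of all CPU sets that belong to the given processor group,
// preserving the order in which they were reported.
std::vector<uint32_t> cpuSetIdsInGroup(const std::vector<CpuSetInfo>& sets,
                                       uint16_t group)
{
    std::vector<uint32_t> ids;
    for (const CpuSetInfo& s : sets) {
        if (s.group == group) {
            ids.push_back(s.id);
        }
    }
    return ids;
}
```

On a typical consumer PC with one processor group, calling this with group 0 is expected to return every CPU set.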

The other data members provide information that is unlikely to describe CPUs in consumer PCs or other consumer devices, and so is unlikely to be useful. The data returned can then be used to organize threads in various ways. The Considerations for game development section of this white paper details a few ways to use this data to optimize thread allocation.

The following are some examples of the type of information gathered from UWP applications running on various types of hardware.

Table 2. Information returned from a UWP app running on a Microsoft Lumia 950. This is an example of a system that has multiple last level caches. The Lumia 950 features a Qualcomm Snapdragon 808 processor that contains dual-core ARM Cortex-A57 and quad-core ARM Cortex-A53 CPUs.

[Image: Table 2]

Table 3. Information returned from a UWP app running on a typical PC. This is an example of a system that uses hyperthreading; each physical core has two logical cores onto which threads can be scheduled. In this case, the system contained an Intel Xeon CPU E5-2620.

[Image: Table 3]

Table 4. Information returned from a UWP app running on a quad core Microsoft Surface Pro 4. This system had an Intel Core i5-6300 CPU.

[Image: Table 4]

SetThreadSelectedCpuSets

Now that information about the CPU sets is available, it can be used to organize threads. The handle of a thread created with CreateThread is passed to this function along with an array of IDs of the CPU sets that the thread can be scheduled on. One example of its usage is demonstrated in the following code.

// Create the audio thread, then restrict it to the first two CPU sets.
HANDLE audioHandle = CreateThread(nullptr, 0, AudioThread, nullptr, 0, nullptr);
unsigned long cores[] = { cpuSets[0].CpuSet.Id, cpuSets[1].CpuSet.Id };
SetThreadSelectedCpuSets(audioHandle, cores, 2);

In this example, a thread is created based on a function declared as AudioThread. This thread is then allowed to be scheduled on one of two CPU sets. Thread ownership of the CPU set is not exclusive. Threads that are created without being locked to a specific CPU set may take time from the AudioThread. Likewise, other threads may also be locked to one or both of these CPU sets at a later time.

SetProcessDefaultCpuSets

The converse to SetThreadSelectedCpuSets is SetProcessDefaultCpuSets. When threads are created, they do not need to be locked into certain CPU sets. If you do not want these threads to run on specific CPU sets (those used by your render thread or audio thread, for example), you can use this function to specify which CPU sets these threads are allowed to be scheduled on.
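
One way to build the list for SetProcessDefaultCpuSets is to start from every CPU set ID and remove the IDs reserved for dedicated threads. The following is a minimal sketch of that set difference only; the function name defaultCpuSetIds and its parameters are hypothetical, and the IDs are assumed to have been collected beforehand from GetSystemCpuSetInformation.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Returns every CPU set ID in `all` that is not listed in `reserved`,
// preserving the original order. `reserved` holds the IDs already given to
// dedicated threads (for example, a render or audio thread).
std::vector<uint32_t> defaultCpuSetIds(const std::vector<uint32_t>& all,
                                       const std::set<uint32_t>& reserved)
{
    std::vector<uint32_t> defaults;
    for (uint32_t id : all) {
        if (reserved.count(id) == 0) {
            defaults.push_back(id);
        }
    }
    return defaults;
}
```

In Windows code the IDs would be stored as unsigned long values, and the resulting array and its element count would then be passed to SetProcessDefaultCpuSets together with the process handle.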

Considerations for game development

As we've seen, the CPUSets API provides a lot of information and flexibility when it comes to scheduling threads. Instead of taking the bottom-up approach of trying to find uses for this data, it is more effective to take the top-down approach of finding how the data can be used to accommodate common scenarios.

Working with time critical threads and hyperthreading

This method is effective if your game has a few threads that must run in real time along with other worker threads that require relatively little CPU time. Some tasks, like continuous background music, must run without interruption for an optimal gaming experience. Even a single frame of starvation for an audio thread may cause popping or glitching, so it is critical that it receives the necessary amount of CPU time every frame.

Using SetThreadSelectedCpuSets in conjunction with SetProcessDefaultCpuSets can ensure your heavy threads remain uninterrupted by any worker threads. SetThreadSelectedCpuSets can be used to assign your heavy threads to specific CPU sets. SetProcessDefaultCpuSets can then be used to make sure any unassigned threads are put on other CPU sets. In the case of CPUs that use hyperthreading, it's also important to account for logical cores on the same physical core. Worker threads should not be allowed to run on logical cores that share a physical core with a thread that you want to run with real-time responsiveness. The following code demonstrates how to determine whether a PC uses hyperthreading.

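// Requires <windows.h>, <cstdint>, <memory>, <set>, and <vector>.

// First call: query the required buffer size in bytes.
unsigned long retsize = 0;
(void)GetSystemCpuSetInformation( nullptr, 0, &retsize,
    GetCurrentProcess(), 0);

// Second call: fill the buffer with the CPU set entries.
std::unique_ptr<uint8_t[]> data( new uint8_t[retsize] );
if ( !GetSystemCpuSetInformation(
    reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>( data.get() ),
    retsize, &retsize, GetCurrentProcess(), 0) )
{
    // Error!
}

std::set<DWORD> cores;          // distinct physical core indices
std::vector<DWORD> processors;  // one ID per CPU set (logical core)
uint8_t const * ptr = data.get();
for( DWORD size = 0; size < retsize; ) {
    auto info = reinterpret_cast<const SYSTEM_CPU_SET_INFORMATION*>( ptr );
    if ( info->Type == CpuSetInformation ) {
         processors.push_back( info->CpuSet.Id );
         cores.insert( info->CpuSet.CoreIndex );
    }
    // Advance by the size reported in each entry.
    ptr += info->Size;
    size += info->Size;
}

// More CPU sets than physical cores means some cores host several logical cores.
bool hyperthreaded = processors.size() != cores.size();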

If the system uses hyperthreading, it is important that the set of default CPU sets does not include any logical cores on the same physical core as any real-time threads. If the system does not use hyperthreading, it is only necessary to make sure that the default CPU sets do not include the same core as the CPU set running your audio thread.
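
The rule above can be sketched as a filter that drops every CPU set sharing a physical core with a real-time thread. This is an illustrative, platform-independent sketch: the CpuSetInfo struct, the CoreKey alias, and the function defaultsAvoidingRealTimeCores are hypothetical names, and the struct mirrors only the Id, Group, and CoreIndex members of SYSTEM_CPU_SET_INFORMATION.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified mirror of the fields used in this sketch.
struct CpuSetInfo {
    uint32_t id;         // CpuSet.Id
    uint16_t group;      // CpuSet.Group
    uint8_t  coreIndex;  // CpuSet.CoreIndex (group-relative)
};

// A physical core is identified by (group, coreIndex).
using CoreKey = std::pair<uint16_t, uint8_t>;

// Returns the CPU set IDs that may serve as process defaults: every CPU set
// whose physical core hosts none of the real-time CPU sets.
std::vector<uint32_t> defaultsAvoidingRealTimeCores(
    const std::vector<CpuSetInfo>& sets,
    const std::set<uint32_t>& realTimeIds)
{
    // Pass 1: record the physical cores used by real-time threads.
    std::set<CoreKey> busyCores;
    for (const CpuSetInfo& s : sets) {
        if (realTimeIds.count(s.id) != 0) {
            busyCores.insert(CoreKey{s.group, s.coreIndex});
        }
    }
    // Pass 2: keep only CPU sets located on other physical cores.
    std::vector<uint32_t> defaults;
    for (const CpuSetInfo& s : sets) {
        if (busyCores.count(CoreKey{s.group, s.coreIndex}) == 0) {
            defaults.push_back(s.id);
        }
    }
    return defaults;
}
```

The same function covers both cases: without hyperthreading each physical core holds one CPU set, so only the real-time CPU sets themselves are excluded.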

An example of organizing threads based on physical cores can be found in the CPUSets sample available on the GitHub repository linked in the Additional resources section.

Reducing the cost of cache coherence with last level cache

Cache coherency is the concept that cached memory is the same across multiple hardware resources that act on the same data. If threads are scheduled on different cores but work on the same data, they may be working on separate copies of that data in different caches. To get correct results, these caches must be kept coherent with each other. Maintaining coherency between multiple caches is relatively expensive, but necessary for any multi-core system to operate. Additionally, it is completely out of the control of client code; the underlying system works independently to keep caches up to date by accessing shared memory resources between cores.

If your game has multiple threads that share an especially large amount of data, you can minimize the cost of cache coherency by ensuring that they are scheduled on CPU sets that share a last level cache. The last level cache is the slowest cache available to a core on systems that do not utilize NUMA nodes. It is extremely rare for a gaming PC to utilize NUMA nodes. If cores do not share a last level cache, maintaining coherency requires accessing higher level, and therefore slower, memory resources. Locking two threads to separate CPU sets that share a cache and a physical core may provide even better performance than scheduling them on separate physical cores, provided that neither thread requires more than 50% of the time in any given frame.

This code example shows how to determine whether threads that communicate frequently can share a last level cache.

// Requires <windows.h>, <cstdint>, <map>, <memory>, and <vector>.

unsigned long retsize = 0;
(void)GetSystemCpuSetInformation(nullptr, 0, &retsize,
    GetCurrentProcess(), 0);

std::unique_ptr<uint8_t[]> data(new uint8_t[retsize]);
if (!GetSystemCpuSetInformation(
    reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(data.get()),
    retsize, &retsize, GetCurrentProcess(), 0))
{
    // Error!
}

bool sharedcache = false;

// Group CPU sets by the index of their last level cache.
std::map<unsigned char, std::vector<SYSTEM_CPU_SET_INFORMATION>> cachemap;
uint8_t const* ptr = data.get();
for (unsigned long offset = 0; offset < retsize; )
{
    // Advance by each entry's own Size rather than assuming a fixed stride.
    auto cpuset = reinterpret_cast<const SYSTEM_CPU_SET_INFORMATION*>(ptr);
    if (cpuset->Type == CPU_SET_INFORMATION_TYPE::CpuSetInformation)
    {
        auto& sharers = cachemap[cpuset->CpuSet.LastLevelCacheIndex];
        sharers.push_back(*cpuset);
        // A second CPU set on the same cache means the cache is shared.
        if (sharers.size() > 1)
        {
            sharedcache = true;
        }
    }
    ptr += cpuset->Size;
    offset += cpuset->Size;
}

The cache layout illustrated in Figure 1 is an example of the type of layout you might see from a system. This figure is an illustration of the caches found in a Microsoft Lumia 950. Inter-thread communication occurring between CPU 256 and CPU 260 would incur significant overhead because it would require the system to keep their L2 caches coherent.
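
Whether two particular CPU sets would incur this overhead can be checked by comparing their last level cache indices. The sketch below is platform-independent and illustrative only: the CpuSetInfo struct and the shareLastLevelCache function are hypothetical names, and the struct mirrors only the Id, Group, and LastLevelCacheIndex members of SYSTEM_CPU_SET_INFORMATION.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified mirror of the fields used in this sketch.
struct CpuSetInfo {
    uint32_t id;                   // CpuSet.Id
    uint16_t group;                // CpuSet.Group
    uint8_t  lastLevelCacheIndex;  // CpuSet.LastLevelCacheIndex (group-relative)
};

// Returns true when both CPU set IDs are known and their CPU sets sit behind
// the same last level cache. Unknown IDs yield false.
bool shareLastLevelCache(const std::vector<CpuSetInfo>& sets,
                         uint32_t idA, uint32_t idB)
{
    auto find = [&sets](uint32_t id) {
        return std::find_if(sets.begin(), sets.end(),
            [id](const CpuSetInfo& s) { return s.id == id; });
    };
    auto a = find(idA);
    auto b = find(idB);
    if (a == sets.end() || b == sets.end()) {
        return false;
    }
    // The cache index is group-relative, so the group must match as well.
    return a->group == b->group &&
           a->lastLevelCacheIndex == b->lastLevelCacheIndex;
}
```

Threads that exchange data frequently are good candidates for CPU sets where this check returns true.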

Figure 1. Cache architecture found on a Microsoft Lumia 950 device.

[Image: Lumia 950 cache]

Summary

The CPUSets API available for UWP development provides a considerable amount of information and control over your multithreading options. The added complexity compared to previous multithreading APIs for Windows development comes with a learning curve, but the increased flexibility ultimately allows for better performance across a range of consumer PCs and other hardware targets.

Additional resources