While I was looking for an insightful paper, I stumbled upon a very nice one called "Scheduling Page Table Walks for Irregular GPU Applications"[1], published in ISCA 2018. This paper made me read another paper called "Architectural Support for Address Translation on GPUs"[2].
Both papers are packed with useful information, so I would like to dive a little deeper into the paging process and the TLB in GPUs.
Let me start by asking a simple question: what is virtual memory?
Virtual memory is an abstraction that allows each application to believe it has its own large, contiguous address space, even though the physical memory may be smaller, fragmented, or shared. This abstraction is maintained by the system's Memory Management Unit (MMU) and page tables, which translate virtual addresses (VA) to physical addresses (PA). Each virtual page (typically 4KB in size) maps to a physical frame in memory, and this mapping is stored in a page table (a tiny code sketch of this split appears right after the list below). Virtual memory is useful because applications can be sandboxed from each other, developers get a large, flat address space, common libraries or buffers can be mapped into multiple processes, and, on CPUs, infrequently used pages can be moved to disk. While GPUs originally didn't use virtual memory, modern GPUs (e.g., NVIDIA Pascal, AMD Vega, CDNA, etc.) now support it for the following reasons:
To support multiple applications running concurrently
To allow memory oversubscription
To enable Unified Memory (CPU and GPU sharing virtual address space)
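To make the page-to-frame mapping concrete, here is a minimal sketch (my own toy example, not from either paper) that splits a virtual address into a virtual page number and an offset, looks the VPN up in a single-level page table, and forms the physical address. Real page tables are multi-level trees, but the idea is the same:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

int main() {
    constexpr uint64_t kPageBits = 12;                  // 4KB pages
    constexpr uint64_t kOffsetMask = (1ull << kPageBits) - 1;

    // Toy single-level page table: virtual page number -> physical frame number.
    std::unordered_map<uint64_t, uint64_t> page_table = {
        {0x12345, 0x00042},   // VPN 0x12345 lives in physical frame 0x42
    };

    uint64_t va  = 0x12345ABC;                          // virtual address
    uint64_t vpn = va >> kPageBits;                     // virtual page number
    uint64_t off = va & kOffsetMask;                    // offset inside the page

    auto it = page_table.find(vpn);
    if (it != page_table.end()) {
        uint64_t pa = (it->second << kPageBits) | off;  // frame number + offset
        std::printf("VA 0x%llx -> PA 0x%llx\n",
                    (unsigned long long)va, (unsigned long long)pa);
    } else {
        std::puts("page fault: no mapping for this VPN");
    }
    return 0;
}
```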
However, GPU virtual memory faces unique challenges:
Massive thread counts: tens of thousands of threads can issue memory requests concurrently.
Latency Sensitivity: Page table lookups are expensive and easily stall warps.
TLB pressure: GPUs generate many more TLB accesses per cycle than CPUs due to high warp-level parallelism.
What is the TLB in a GPU? Where is it located and what does it do?
A TLB is a small, fast cache that stores recently used virtual-to-physical address translations. Every memory instruction that uses a virtual address must first check the TLB to see if the translation is already cached. If it hits, it gets the physical address immediately; if it misses, the hardware must go to the page table, which resides in memory. GPUs can have thousands of warps in flight, each issuing memory accesses, so the TLB is under extreme pressure. Without a well-performing TLB, miss rates spike, each miss triggers a long page walk, and warps stall waiting for their address translations to complete. This is especially bad in irregular workloads, where memory access patterns lack locality.
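To pin down what "a small, fast cache of translations" means, here is a rough sketch of a TLB as a data structure (my own simplified model; the 128-entry capacity and LRU policy are illustrative assumptions, not the configuration studied in the papers):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <optional>
#include <unordered_map>

// Toy fully-associative TLB with LRU replacement.
class Tlb {
public:
    explicit Tlb(std::size_t capacity) : capacity_(capacity) {}

    // Hit: return the physical frame number. Miss: caller must do a page walk.
    std::optional<uint64_t> lookup(uint64_t vpn) {
        auto it = entries_.find(vpn);
        if (it == entries_.end()) return std::nullopt;
        lru_.splice(lru_.begin(), lru_, it->second.lru_pos);  // mark most recent
        return it->second.pfn;
    }

    // Insert a translation after the page walk resolves it.
    void fill(uint64_t vpn, uint64_t pfn) {
        if (entries_.count(vpn)) return;
        if (entries_.size() >= capacity_) {   // evict the least recently used
            entries_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(vpn);
        entries_[vpn] = {pfn, lru_.begin()};
    }

private:
    struct Entry { uint64_t pfn; std::list<uint64_t>::iterator lru_pos; };
    std::size_t capacity_;
    std::list<uint64_t> lru_;                       // most recently used at front
    std::unordered_map<uint64_t, Entry> entries_;
};

int main() {
    Tlb tlb(128);                       // e.g., a 128-entry per-SM TLB
    if (!tlb.lookup(0x12345)) {         // first access misses...
        tlb.fill(0x12345, 0x42);        // ...so a (slow) page walk fills it
    }
    return tlb.lookup(0x12345) ? 0 : 1; // second access to the same page hits
}
```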
During the pandemic, I confined myself at home and started doing fun projects, one of which was a memory heap manager in the OS. You can find the source code and the project here. In the CPU world, a thread requests access to some data. The request goes to the L1 cache; if it hits, we are fine. If not, it goes to the L2/LLC; if that hits we are fine, and if not, it is an LLC miss and the cache line must be brought from memory into the LLC and L1 (the concepts of inclusive and exclusive caches apply here). Before the hardware can do any of this with a physical address, the MMU must translate the virtual address: it first checks the TLB to see if the VA-to-PA mapping is already cached. If yes, it quickly translates the VA to a PA and we're fine. If not, it walks the page table to find the translation, which is time consuming, and if the page table entry shows the page is not resident in memory, a page fault is raised and the OS brings that page from disk into memory. In modern GPUs, this process is handled by dedicated GPU-side MMU hardware, and in practice the address translation step (via the TLB and page tables) happens before any cache lookup. Why? Because the caches are indexed by physical addresses, so we must translate the VA to a PA first. Let's walk through a memory access in a GPU step by step:
STEP 0: A thread issues a load or store using a virtual address. This happens for both CPUs and GPUs, because modern systems use virtual memory even for GPU threads.
STEP 1: Address translation begins: the hardware tries to translate the VA to a PA via the TLBs. It first checks the L1 TLB (per SM or per core). If it hits, the PA is generated immediately and the access proceeds to the L1 cache lookup. If it misses, it checks the L2 TLB (shared by several SMs or the entire GPU). If that hits, the PA is ready and the cache lookup proceeds; otherwise, on a miss, the request must go through the page walk process.
STEP 2: Page table walk: page tables are usually organized hierarchically, so the walker must fetch the entries of successive levels before it can produce the final translation, which means 3 to 4 dependent memory accesses. Once the mapping is resolved, the new VA-to-PA translation is inserted into the TLBs. This is where GPU warps stall, and many concurrent page walks can severely congest the memory system.
STEP 3: Cache hierarchy lookup: now that we have the PA, we check the L1 data cache. On an L1 miss we go to the L2/LLC lookup, and on an L2/LLC miss we go to DRAM. If the page is not present in DRAM, a page fault happens and the OS brings it in from disk.
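To tie the steps together, here is a rough back-of-the-envelope cost model of this flow (all latencies are invented, round numbers purely for illustration, not measurements from any real GPU); it just shows why a TLB miss that triggers a full page walk dwarfs everything else:

```cpp
#include <cstdio>

// Invented illustrative latencies, in cycles.
constexpr int kL1TlbHit   = 1;
constexpr int kL2TlbHit   = 20;
constexpr int kPageWalk   = 4 * 100;   // 3-4 dependent memory accesses
constexpr int kL1CacheHit = 30;
constexpr int kL2CacheHit = 200;
constexpr int kDram       = 500;

// Rough cost model of the flow above: translation first (the data caches are
// physically addressed), then the cache hierarchy using the resulting PA.
int load_latency(bool l1_tlb, bool l2_tlb, bool l1_cache, bool l2_cache) {
    int cycles = 0;
    if (l1_tlb)      cycles += kL1TlbHit;    // STEP 1: per-SM L1 TLB hit
    else if (l2_tlb) cycles += kL2TlbHit;    //         shared L2 TLB hit
    else             cycles += kPageWalk;    // STEP 2: full page table walk

    if (l1_cache)      cycles += kL1CacheHit;  // STEP 3: L1 data cache
    else if (l2_cache) cycles += kL2CacheHit;  //         L2/LLC
    else               cycles += kDram;        //         DRAM
    return cycles;
}

int main() {
    std::printf("best case  (all hits):   %d cycles\n",
                load_latency(true, true, true, true));
    std::printf("worst case (all misses): %d cycles\n",
                load_latency(false, false, false, false));
}
```

Even with these made-up numbers, the worst case is orders of magnitude slower than the best case, which is why stalled page walks hurt irregular GPU workloads so much.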
This observation made me read the second paper [2] to understand the TLB more deeply.
When a program wants to read data from memory, it uses a virtual address. But before the hardware can fetch the actual data from memory, that virtual address must be translated to a physical address using the TLB. This raises a question: When should the TLB be accessed relative to the cache? There are three main ways to organize how caches and TLBs interact:
Virtually-Indexed, Virtually-Tagged (VIVT): The cache is accessed directly using the virtual address. It's fast because it skips address translation, but it can cause problems like synonyms (two virtual addresses map to the same physical address) and homonyms (the same virtual address maps to different physical addresses in different contexts).
Physically-Indexed, Physically-Tagged (PIPT): The cache access waits for the TLB to translate the VA to a PA, then uses the PA both to index the cache and to match the tag.
Virtually-Indexed, Physically-Tagged (VIPT): It uses part of the virtual address to index (look up) the cache, and uses the physical address from the TLB to confirm it found the right data (tag match). This way, the TLB lookup and the cache indexing happen in parallel.
Thanks to GeeksforGeeks.
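To see why VIPT works, here is a small sketch (with assumed parameters: 4KB pages, 64-byte lines, 64 sets) showing that when the index bits fall entirely inside the page offset, the set index is identical in the VA and the PA, so the cache can start indexing before the TLB finishes:

```cpp
#include <cstdint>
#include <cstdio>

// VIPT sketch with assumed parameters: 4KB pages, 64-byte lines, 64 sets,
// so the index bits sit entirely inside the page offset.
constexpr uint64_t kPageBits = 12;
constexpr uint64_t kLineBits = 6;   // 64-byte cache line
constexpr uint64_t kSetBits  = 6;   // 64 sets

uint64_t set_index(uint64_t va) {   // taken from the VIRTUAL address
    return (va >> kLineBits) & ((1ull << kSetBits) - 1);
}

uint64_t tag(uint64_t pa) {         // taken from the PHYSICAL address
    return pa >> (kLineBits + kSetBits);
}

int main() {
    // Because kLineBits + kSetBits <= kPageBits, the index bits are identical
    // in the VA and the PA, so the set can be selected while the TLB is still
    // translating; the PA is only needed afterwards, for the tag comparison.
    uint64_t va = 0x12345ABC;
    std::printf("set index from VA: %llu\n", (unsigned long long)set_index(va));
}
```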
GPUs need Memory Management Units (MMUs) to support virtual memory, which means both TLBs and page table walkers (PTWs). An efficient approach is to have one TLB per shader core, which saves area and power with acceptable performance. Typically, a TLB with 128 entries and 3-4 ports is enough to get close to maximum performance. In practice, many threads within a warp access the same or nearby addresses, especially in workloads with good memory locality. Before addresses reach the TLB, they pass through a coalescer, which merges similar memory accesses, so the result is far fewer than 32 distinct TLB lookups per warp. [2] shows that page divergence is typically 3-4 on average, so 3-4 TLB ports give you roughly 90% of the performance of a fully multi-ported (32-port) TLB at much lower area and power cost.

In addition, a TLB in a GPU can be blocking or non-blocking. A blocking TLB stalls the whole warp (or pipeline) on a miss, whereas a non-blocking TLB allows hits to continue while a miss is being resolved, using Miss Status Holding Registers (MSHRs). An MSHR tracks a pending TLB miss and holds the warp/thread info waiting on it; when the page walk completes and fills the TLB, the MSHR entry resolves and wakes up the stalled warp.
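Here is a toy sketch of how a non-blocking TLB might track misses with MSHRs (the structure and naming are my own assumptions for illustration): each outstanding miss gets one entry keyed by VPN, later warps that miss on the same page merge into it instead of launching duplicate page walks, and the walk's completion wakes every waiting warp:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy MSHR file for a non-blocking TLB: one entry per outstanding missing VPN,
// holding the warps that are waiting on it.
struct MshrFile {
    std::unordered_map<uint64_t, std::vector<int>> pending;  // VPN -> waiting warp IDs

    // Returns true if a new page walk must be started for this VPN.
    bool on_tlb_miss(uint64_t vpn, int warp_id) {
        auto [it, is_new] = pending.try_emplace(vpn);
        it->second.push_back(warp_id);   // this warp waits; other warps keep hitting
        return is_new;
    }

    // Called when the page walk completes and the translation fills the TLB.
    void on_walk_done(uint64_t vpn) {
        for (int warp : pending[vpn])
            std::printf("waking warp %d waiting on VPN 0x%llx\n",
                        warp, (unsigned long long)vpn);
        pending.erase(vpn);
    }
};

int main() {
    MshrFile mshrs;
    if (mshrs.on_tlb_miss(0x42, /*warp_id=*/0)) { /* start a page walk for VPN 0x42 */ }
    mshrs.on_tlb_miss(0x42, /*warp_id=*/7);       // merged: no duplicate walk
    mshrs.on_walk_done(0x42);                     // both warps wake up
}
```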
Here, the coalescer caught my attention, so I asked: where is this coalescer located? The coalescer operates after the address generation stage but before both the cache and the TLB. Each warp's threads compute their virtual addresses, and the coalescer groups memory requests that fall in the same cache line (to minimize cache accesses) and in the same virtual page (to minimize TLB accesses). Coalescing is why you get far fewer TLB accesses than threads per warp, which is what lets those 3-4 ports suffice.
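As a quick illustration of what the coalescer buys you (sizes are assumptions: 128-byte lines, 4KB pages), here is a sketch that counts how many distinct cache lines and pages one warp's 32 addresses touch; a unit-stride pattern collapses to a single line and a single page, while a scattered pattern could need up to 32 of each:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

// Toy coalescer: given one warp's 32 virtual addresses, count how many distinct
// cache lines and pages they touch.
int main() {
    constexpr uint64_t kLineBits = 7;    // 128-byte cache lines
    constexpr uint64_t kPageBits = 12;   // 4KB pages

    std::vector<uint64_t> warp_addrs;
    for (int lane = 0; lane < 32; ++lane)
        warp_addrs.push_back(0x10000000ull + 4 * lane);   // unit-stride float loads

    std::unordered_set<uint64_t> lines, pages;
    for (uint64_t va : warp_addrs) {
        lines.insert(va >> kLineBits);   // one cache request per distinct line
        pages.insert(va >> kPageBits);   // one TLB lookup per distinct page
    }
    std::printf("cache requests: %zu, TLB lookups (page divergence): %zu\n",
                lines.size(), pages.size());
    // This coalesced pattern needs 1 line and 1 page; a fully scattered pattern
    // could need up to 32 of each, which is what pressures the TLB ports.
}
```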
References
[1] Shin, S., Cox, G., Oskin, M., Loh, G. H., Solihin, Y., Bhattacharjee, A., & Basu, A. (2018). Scheduling Page Table Walks for Irregular GPU Applications. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 180–192.
[2] Pichai, B., Hsu, L., & Bhattacharjee, A. (2014). Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. ACM SIGARCH Computer Architecture News, 42(1), 743–758.