Translation Lookaside Buffer (TLB)
While exploring how the TLB mechanism works in GPU systems, I came across a very nice paper called Barre Chord [1], published in ISCA 2024. In this article, along with reviewing the innovations of that paper, I intend to describe and clarify the problems that the address translation process faces in multi-chiplet GPU systems.
The authors of Barre Chord showcase the scalability problem of virtual memory translation in MCM-GPUs through a combination of microarchitectural simulation, performance profiling, and empirical observation of translation behavior across chiplets. Their methodology focuses on quantifying the translation overhead that arises when multiple GPU chiplets operate in parallel under the traditional GPU virtual memory model, especially when translation is handled by centralized or per-chiplet but naïvely distributed mechanisms.
To analyze the translation bottleneck, the authors model an MCM-GPU with multiple chiplets using MGPUSim. This model incorporates a unified virtual address space shared across chiplets, per-chiplet local TLBs and page walkers, a shared multi-level page table, and an inter-chiplet network for communication and page table accesses. They simulate realistic GPU workloads across four chiplets and measure the impact of translation requests and TLB behavior under different memory access patterns.
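To make the modeled translation path concrete, here is a minimal sketch of the components listed above: a per-chiplet L2 TLB, a shared page table, and an inter-chiplet penalty for remote PTE accesses. This is not MGPUSim code; the class names, capacities, and cycle counts are illustrative assumptions.

```python
# Minimal sketch of the modeled translation path (not MGPUSim code).
# Class names, capacities, and cycle counts are illustrative assumptions.

LOCAL_WALK_CYCLES = 200    # assumed cost of a page walk served from local memory
REMOTE_HOP_CYCLES = 150    # assumed extra cost when the PTE sits on another chiplet

class PageTable:
    """Flat stand-in for the shared multi-level page table."""
    def __init__(self, num_chiplets):
        self.num_chiplets = num_chiplets

    def home_chiplet(self, vpn):
        return vpn % self.num_chiplets        # assume PTEs interleaved across chiplets

    def walk(self, vpn):
        return vpn ^ 0xABCDE                  # fake VPN -> PPN mapping for the sketch

class Chiplet:
    def __init__(self, cid, page_table, l2_tlb_entries=1024):
        self.cid = cid
        self.page_table = page_table          # shared across all chiplets
        self.l2_tlb = {}                      # vpn -> ppn, capacity-limited below
        self.l2_tlb_entries = l2_tlb_entries

    def translate(self, vpn):
        """Return (ppn, cycles) for one translation request."""
        if vpn in self.l2_tlb:
            return self.l2_tlb[vpn], 1        # L2 TLB hit
        cycles = LOCAL_WALK_CYCLES
        if self.page_table.home_chiplet(vpn) != self.cid:
            cycles += REMOTE_HOP_CYCLES       # cross-chiplet page table access
        ppn = self.page_table.walk(vpn)
        if len(self.l2_tlb) >= self.l2_tlb_entries:
            self.l2_tlb.pop(next(iter(self.l2_tlb)))   # naive FIFO-style eviction
        self.l2_tlb[vpn] = ppn
        return ppn, cycles

# Four chiplets sharing one page table, as in the simulated configuration.
pt = PageTable(num_chiplets=4)
chiplets = [Chiplet(cid, pt) for cid in range(4)]
print(chiplets[0].translate(vpn=5))   # first access misses and walks remotely
print(chiplets[0].translate(vpn=5))   # second access hits in the local L2 TLB
```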
The authors conduct two quantitative measurements which reveal that the TLB is one of the main bottlenecks of MCM-GPU systems. The first observation comes from tweaking the PTW (page table walker) count. The figure below shows an almost linear speedup with more PTWs. However, when the system shifts to an infinite number of PTWs, the average speedup plateaus around 2×. This implies that adding more PTWs only reduces the queuing effect while leaving the other latencies in the translation process unchanged. Moreover, this approach incurs area and power overhead and is not scalable.
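A back-of-the-envelope queuing model shows why the benefit saturates: with a fixed walk latency, extra walkers only shrink the time a miss spends waiting for a free walker, so translation latency bottoms out at the walk latency itself. This models translation latency only, not end-to-end speedup, and the burst size and cycle counts are assumptions rather than the paper's data.

```python
# Toy queuing model of the PTW-count experiment. The cycle counts and the
# burst size are assumptions, not numbers from the paper.

WALK_CYCLES = 300          # assumed fixed latency of one page table walk
BURST_MISSES = 64          # assumed L2 TLB misses arriving in one burst

def avg_translation_latency(num_walkers):
    """Average completion time when misses are drained in walker-sized batches."""
    total, served, batch = 0, 0, 0
    while served < BURST_MISSES:
        in_batch = min(num_walkers, BURST_MISSES - served)
        batch += 1
        total += batch * WALK_CYCLES * in_batch    # batch k finishes at k * WALK_CYCLES
        served += in_batch
    return total / BURST_MISSES

baseline = avg_translation_latency(8)
for walkers in (8, 16, 32, 10**6):                 # 10**6 stands in for "infinite" PTWs
    latency = avg_translation_latency(walkers)
    print(f"{walkers:>7} walkers: avg latency {latency:.0f} cycles, "
          f"speedup {baseline / latency:.1f}x")
# Beyond some point the queue is empty and only the irreducible WALK_CYCLES remain.
```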
The second measurement isolates the impact of the IOMMU from that of intra-MCM resources. This is done by increasing the capacity of the MSHRs (miss status holding registers) of the L2 TLB in every chiplet. On average, doubling the MSHR capacity brings only around a 6% performance boost, while the vast majority of applications do not experience any speedup. This reveals that the bottleneck is not the capacity to hold outstanding translation misses but the capability to process them.
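Little's law gives a quick intuition for this result: the number of misses usefully in flight is bounded by the product of the downstream service rate (the IOMMU and walkers) and the walk latency, so MSHR entries beyond that bound sit idle. The service rate and latency below are illustrative assumptions.

```python
# Little's-law sanity check for the MSHR experiment.
# The service rate and walk latency are illustrative assumptions.

iommu_throughput = 0.05    # translations the IOMMU path can retire per cycle (assumed)
walk_latency = 400         # cycles each outstanding translation stays in flight (assumed)

# Outstanding misses the translation path can actually keep busy:
useful_outstanding = iommu_throughput * walk_latency    # = 20 entries

for mshr_entries in (16, 32, 64):
    in_flight = min(mshr_entries, useful_outstanding)
    throughput = in_flight / walk_latency
    print(f"{mshr_entries:>2} MSHR entries -> {throughput:.3f} translations/cycle")
# Once MSHR capacity exceeds what the IOMMU can drain (~20 here), doubling it
# again leaves translation throughput unchanged: holding more misses does not
# help when the bottleneck is processing them.
```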
The authors also critically evaluate whether two well-known techniques, PTE prefetching and superpages, can alleviate address translation bottlenecks in multi-chip-module (MCM) GPUs. Their analysis is clear: both methods are largely ineffective in this context, and here is why. In traditional CPU and single-GPU systems, PTE prefetching works well when memory accesses exhibit spatial locality, that is, when the addresses accessed by a workload are close together (e.g., consecutive virtual page numbers, or VPNs). In MCM-GPUs, however, each chiplet runs many threads simultaneously, leading to highly interleaved and less predictable memory access patterns, and the VPN gaps between consecutive accesses are large and irregular (as the figure below illustrates). As a result, the PTE prefetcher cannot accurately guess which PTEs to fetch, drastically reducing its effectiveness.
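The effect is easy to reproduce with a toy next-N PTE prefetcher: on a CPU-like sequential VPN stream it covers almost every miss, while on a stream with large, irregular VPN gaps its coverage collapses. The streams and prefetch degree below are assumptions chosen only for illustration.

```python
# Toy next-N PTE prefetcher to illustrate why irregular VPN gaps hurt it.
# The access streams and the prefetch degree are illustrative assumptions.
import random

def prefetch_coverage(vpn_stream, degree=4):
    """Fraction of first-touch misses whose PTE was already prefetched."""
    prefetched, seen, covered = set(), set(), 0
    for vpn in vpn_stream:
        if vpn not in seen:
            if vpn in prefetched:
                covered += 1
            # on a miss, also fetch the next `degree` consecutive PTEs
            prefetched.update(range(vpn + 1, vpn + 1 + degree))
            seen.add(vpn)
    return covered / len(seen)

sequential = list(range(4096))                         # CPU-like spatial locality
irregular = random.sample(range(1 << 20), 4096)        # large, irregular VPN gaps

print("sequential stream coverage:", prefetch_coverage(sequential))
print("irregular stream coverage: ", prefetch_coverage(irregular))
```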
Superpages, on the other hand, aim to reduce TLB misses by mapping a large, contiguous range of virtual memory onto contiguous physical memory (e.g., using 2MB pages instead of 4KB ones). While this reduces the number of TLB entries and page walks (the reach arithmetic is sketched after this list), MCM-GPUs introduce complications:
Contiguity in Physical Memory Is Rare: Each chiplet manages its own local memory. Allocating a large, contiguous physical block that spans across multiple chiplets is impractical due to fragmentation and NUMA constraints.
Distributed Page Mapping Creates Imbalance: With superpages, large contiguous allocations may end up being placed on fewer chiplets. This can lead to an increase in remote memory accesses and more frequent page migration.
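For context, the reach arithmetic that makes superpages attractive in the first place is simple (the working-set and TLB sizes below are assumptions); the catch in an MCM-GPU is that every 2MB mapping must also be physically contiguous within one chiplet's local memory.

```python
# TLB-reach arithmetic behind the superpage argument (sizes are assumptions).
WORKING_SET = 8 * 2**30      # assume an 8 GiB GPU working set
TLB_ENTRIES = 1024           # assumed per-chiplet L2 TLB capacity

for label, page_size in (("4 KiB", 4 * 2**10), ("2 MiB", 2 * 2**20)):
    reach = TLB_ENTRIES * page_size
    pages_needed = WORKING_SET // page_size
    print(f"{label} pages: TLB reach {reach / 2**20:.0f} MiB, "
          f"{pages_needed} mappings for the working set")
# 2 MiB pages raise TLB reach from 4 MiB to 2 GiB, but each 2 MiB mapping must
# be physically contiguous inside one chiplet's local memory, which
# fragmentation and NUMA-aware placement make hard to guarantee.
```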
Furthermore, the authors try to improve address translation performance by adding a shared L2 TLB across all chiplets. Their findings suggest that L2 TLB sharing offers only marginal performance benefits. Specifically, they run an oracle experiment by simulating a shared L2 TLB that is 4× larger in capacity, has 4× more bandwidth, incurs no inter-chiplet communication latency, and pays zero latency penalty for sharing. Even in this highly favorable and idealized setting, the shared L2 TLB improves performance by only about 6% on average. The figure below shows that fewer than half of the benchmarked applications show any noticeable speedup at all. This minimal gain is attributed to the advanced page allocation policy and the lack of contiguity in virtual and physical page mapping. Therefore, the authors conclude that L2 TLB sharing is not a scalable or effective approach for MCM-GPUs, especially in the presence of optimized page mapping strategies.
To sum up, the address translation mechanism in current MCM-GPU systems suffers from two main problems:
Problem 1: Translation Latency Breakdown Across Chiplets
The authors observe that as threads issue memory accesses, TLB misses and page table walks create substantial pressure on the translation infrastructure. The study finds that:
A large fraction of TLB misses on one chiplet require accessing remote page table entries located in the memory of another chiplet.
Page table walks often span multiple levels, and with memory split across chiplets, each level may reside on a different chiplet, resulting in cross-chiplet traffic for translation metadata.
This cross-chiplet translation latency dominates memory access latency when TLB hit rates are not high, as the sketch below illustrates.
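A small sketch shows how quickly these effects add up when the levels of a multi-level walk are spread over chiplets; the per-level and per-hop cycle counts are assumptions, not measured values.

```python
# Sketch of a 4-level page walk whose levels land on different chiplets.
# Latencies and the level-to-chiplet placement are illustrative assumptions.

LOCAL_MEM_CYCLES = 100     # assumed local memory access per page-table level
CROSS_HOP_CYCLES = 150     # assumed extra cost when a level is on a remote chiplet

def walk_latency(requesting_chiplet, level_homes):
    """Total cycles for one walk; level_homes[i] = chiplet holding level i."""
    cycles = 0
    for home in level_homes:
        cycles += LOCAL_MEM_CYCLES
        if home != requesting_chiplet:
            cycles += CROSS_HOP_CYCLES
    return cycles

print("all levels local :", walk_latency(0, [0, 0, 0, 0]))   # 400 cycles
print("levels scattered :", walk_latency(0, [2, 1, 3, 0]))   # 850 cycles
```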
Problem 2: Redundant Translations
The authors provide metrics showing that many threads in a warp or block access memory addresses that map to the same or nearby pages. Yet, under traditional translation models, each thread may independently trigger a page table walk, because the architecture does not coalesce translation requests. This leads to:
Duplicated page table lookups
Unnecessary bandwidth consumption on the inter-chiplet network
Contended page walker queues
This insight is critical: while data accesses are coalesced, translation requests are not, which exposes a mismatch in granularity that results in performance waste.
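The mismatch is easy to picture: the same warp whose 32 data accesses coalesce into a handful of cache line requests can still emit 32 separate translation requests for a single page. The sketch below illustrates the idea of coalescing translation requests by VPN; it is only an illustration of the granularity argument, not Barre Chord's actual mechanism.

```python
# Illustration of the granularity mismatch: grouping a warp's addresses by
# page before translation collapses duplicate requests into one walk.
# Page size and addresses are assumptions; this is not Barre Chord's design.

PAGE_SHIFT = 12    # 4 KiB pages

def translation_requests(warp_addresses, coalesce):
    """Number of page-walk requests the warp would issue on a TLB miss."""
    vpns = [addr >> PAGE_SHIFT for addr in warp_addresses]
    return len(set(vpns)) if coalesce else len(vpns)

# 32 threads streaming through one array: all addresses fall on the same page.
warp = [0x10_0000 + 4 * tid for tid in range(32)]

print("uncoalesced page-walk requests:", translation_requests(warp, coalesce=False))  # 32
print("coalesced page-walk requests:  ", translation_requests(warp, coalesce=True))   # 1
```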