The evolution of GPU architectures has led to the emergence of Multi-Chip Module (MCM) GPUs, which provide a scalable solution to overcome the physical and power limitations of monolithic GPUs. By interconnecting multiple GPU modules within a package, MCM-GPUs enhance performance, efficiency, and scalability. However, as the number of GPU modules increases, the interconnect Network-on-Chip (NoC) becomes a critical performance bottleneck due to growing traffic demands.
My research focuses on characterizing the traffic behavior of MCM-GPUs, analyzing NoC performance, and exploring architectural improvements to support large-scale MCM-GPU systems. The study aims to scale current MCM-GPU designs from 4 GPU modules to 16, analyze traffic patterns, and explore synthetic traffic modeling for accelerated design-space exploration.
The first key aspect of this research is to scale the state-of-the-art MCM-GPU architecture from 4 GPU modules to 16 and investigate its performance behavior. As the GPU module count increases, interconnect complexity, data traffic, and bandwidth requirements grow non-linearly. I have already conducted an in-depth performance sensitivity analysis by systematically increasing bandwidth within each GPU-count scenario to assess how bandwidth impacts overall system efficiency. Moreover, understanding the nature of traffic among GPU modules is essential for optimizing NoC designs and improving system performance. The second part of this research therefore focuses on capturing and analyzing traffic characteristics to identify fundamental traffic burst behaviors and to study how traffic evolves over time and space. The results obtained so far are insightful, and I have prepared a paper targeting IEEE Micro, ACM TACO, or IEEE CAL; a sketch of the burstiness analysis is shown below.
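To make the traffic-burst analysis concrete, the following is a minimal Python sketch of the kind of post-processing applied to packet traces. The trace format (a CSV with `cycle`, `src_module`, `dst_module` columns) and the file name `mcm_trace.csv` are assumptions for illustration, not the exact output of my simulation flow; the burstiness metric shown here is simply the coefficient of variation of packet inter-arrival times per module pair.

```python
# Minimal sketch (hypothetical trace format): quantify traffic burstiness per
# GPU-module pair using the coefficient of variation (CV) of inter-arrival times.
import csv
from collections import defaultdict
from statistics import mean, pstdev

def burstiness_per_link(trace_path):
    """trace_path: CSV with columns cycle,src_module,dst_module (assumed format)."""
    arrivals = defaultdict(list)
    with open(trace_path) as f:
        for row in csv.DictReader(f):
            key = (int(row["src_module"]), int(row["dst_module"]))
            arrivals[key].append(int(row["cycle"]))

    cv = {}
    for link, cycles in arrivals.items():
        cycles.sort()
        gaps = [b - a for a, b in zip(cycles, cycles[1:])]
        if len(gaps) > 1 and mean(gaps) > 0:
            # CV > 1 suggests bursty injection on this module-to-module link.
            cv[link] = pstdev(gaps) / mean(gaps)
    return cv

if __name__ == "__main__":
    for (src, dst), value in sorted(burstiness_per_link("mcm_trace.csv").items()):
        print(f"GPU{src} -> GPU{dst}: inter-arrival CV = {value:.2f}")
```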
The second research aspect of my PhD is characterizing NoC performance in MCM-GPU systems. As we observed in previous research, the communication efficiency between GPU modules significantly impacts overall IPC. One of the most critical factors affecting NoC performance is packet latency, particularly tail latency, which represents the slowest packet transmission times in the system. Our study reveals a strong negative correlation between tail latency and IPC. In essence, this research dives deep into NoC performance characterization, focusing on packet latency analysis, identifying bottlenecks, and proposing improvements to mitigate high tail latency and enhance overall NoC efficiency. I am currently finalizing the draft and hope that this work will be accepted at IEEE Micro, ACM TACO, IEEE CAL, ACM SIGMETRICS, or IEEE Transactions on Parallel and Distributed Systems.
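As an illustration of the tail-latency study, here is a small sketch that computes the 99th-percentile packet latency per NoC configuration and correlates it with IPC. The configuration names, the lognormally distributed latency samples, and the IPC values are placeholders standing in for real simulator output; only the metric computation itself reflects the analysis described above.

```python
# Minimal sketch (placeholder data): p99 tail latency per configuration and its
# Pearson correlation with IPC, which our characterization expects to be negative.
import numpy as np

def tail_latency(latencies_cycles, percentile=99):
    """Return the given percentile of per-packet latencies (in cycles)."""
    return float(np.percentile(latencies_cycles, percentile))

# Placeholder results: one entry per simulated NoC bandwidth configuration.
rng = np.random.default_rng(0)
configs = {
    "bw_1x": {"latencies": rng.lognormal(4.0, 0.6, 10_000), "ipc": 1.42},
    "bw_2x": {"latencies": rng.lognormal(3.6, 0.5, 10_000), "ipc": 1.78},
    "bw_4x": {"latencies": rng.lognormal(3.3, 0.4, 10_000), "ipc": 2.05},
}

p99 = np.array([tail_latency(c["latencies"]) for c in configs.values()])
ipc = np.array([c["ipc"] for c in configs.values()])

# Pearson correlation between p99 tail latency and IPC.
r = np.corrcoef(p99, ipc)[0, 1]
print(f"corr(p99 tail latency, IPC) = {r:.2f}")
```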
I have also started a new research topic on generating synthetic traffic that is representative of the traffic in multi-chip GPU systems. The motivation behind this project is that evaluating NoC performance in multi-chip GPU systems is generally very time-consuming and tedious, because workloads take a very long time to execute in full-system simulators such as AccelSim. Anyone who wants to explore the MCM-GPU NoC design space has to wait a long time for results (as I did during my PhD!). My idea is to propose a single statistical model per workload that captures its traffic characteristics in an MCM-GPU system. With such a model in hand, we can generate synthetic traffic in a cycle-accurate simulator such as BookSim and, by tweaking NoC parameters, reproduce performance trends that are close enough to the real ones. For example, suppose we run several workloads in full-system simulation at different bandwidths. If we have statistical traffic models of those workloads and generate synthetic traffic from them, we can reproduce performance results similar to those of the full-system simulations with a turnaround time that is several orders of magnitude shorter. This allows researchers to accelerate NoC design-space exploration. The simulator I am developing for this purpose is based on BookSim, and I believe the results of this work can be published at ISPASS. A sketch of the model-fitting and generation flow is shown below.
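The sketch below illustrates one possible shape of such a statistical traffic model, under simplifying assumptions: each source-destination pair is modeled by a lognormal fit of its packet inter-arrival times, and synthetic packets are drawn from that fit. This is not the final model; it only shows how a fitted model could drive a BookSim-style injection process. The trace format `(cycle, src, dst)` and the function names are hypothetical.

```python
# Illustrative sketch only: fit per-pair inter-arrival statistics from a
# full-system trace, then draw a synthetic packet stream from the fitted model.
import numpy as np
from collections import defaultdict

def fit_model(trace):
    """trace: iterable of (cycle, src, dst) packets. Returns per-pair lognormal fits."""
    gaps = defaultdict(list)
    last = {}
    for cycle, src, dst in sorted(trace):
        key = (src, dst)
        if key in last:
            gaps[key].append(max(cycle - last[key], 1))
        last[key] = cycle
    model = {}
    for key, g in gaps.items():
        logs = np.log(g)
        model[key] = (float(logs.mean()), float(logs.std()))  # lognormal mu, sigma
    return model

def generate(model, horizon_cycles, seed=0):
    """Draw synthetic (cycle, src, dst) packets up to horizon_cycles per pair."""
    rng = np.random.default_rng(seed)
    packets = []
    for (src, dst), (mu, sigma) in model.items():
        t = 0.0
        while True:
            t += rng.lognormal(mu, sigma)
            if t >= horizon_cycles:
                break
            packets.append((int(t), src, dst))
    return sorted(packets)
```

In this scheme, the generated `(cycle, src, dst)` tuples would be fed to the BookSim-based simulator as an injection process in place of the full-system trace.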
Potential areas for future work
The Network-on-Chip (NoC) in MCM-GPU systems is still in its infancy, with significant research gaps in scalability, performance optimization, and traffic-aware design. Unlike traditional monolithic GPUs, where intra-GPU communication is handled efficiently by internal crossbars or hierarchical interconnects, MCM-GPUs introduce new interconnect challenges because of the need for efficient inter-module communication across separate GPU dies. These challenges stem from the lack of an optimized NoC infrastructure tailored for multi-GPU workloads, creating an urgent need for deeper architectural exploration. Through this work, I seek to contribute to the next-generation NoC designs that will shape the future of scalable multi-GPU computing.