Modern general-purpose graphic processing units (GPGPUs) have emerged as pervasive alternatives for parallel high-performance computing. The extreme multithreading in modern GPGPUs demands a large register file (RF), which is typically organized into multiple banks to support the massive parallelism. Although a heavily banked structure benefits RF throughput, its associated area and energy costs with diminishing performance gains greatly limit the future RF scaling. In this paper, we propose an improved RF design with bank stealing techniques, which enable a high RF throughput with compact area. By deeply investigating the GPGPU microarchitecture, we find that the state-of-the-art RF designs’ is far from optimal due to the deficiency in bank utilization, which is the intrinsic limitation to a high RF throughput and a compact RF area. We investigate the causes for bank conflicts and identify that most conflicts can be eliminated by leveraging the fact that the highly banked RF oftentimes experiences underutilization. This is especially true in GPGPUs, where multiple ready warps are available at the scheduling stage with their operands to be wisely coordinated. In this paper, we propose two lightweight bank stealing techniques that can opportunistically fill the idle banks and register entries for better operand service. Using the proposed architecture, the average GPGPU performance can be improved under a smaller energy budget with significant area saving, which makes it promising for sustainable RF scaling.
DOI 인용 스타일