Modern general-purpose graphics processing units (GPGPUs) have emerged as pervasive alternatives for parallel high-performance computing. The extreme multithreading in modern GPGPUs demands a large register file (RF), which is typically organized into multiple banks to support the massive parallelism. Although a heavily banked structure benefits RF throughput, its area and energy costs, combined with diminishing performance gains, greatly limit future RF scaling. In this paper, we propose an improved RF design with bank stealing techniques, which enable high RF throughput in a compact area. By investigating the GPGPU microarchitecture in depth, we find that state-of-the-art RF designs are far from optimal because of poor bank utilization, which is the intrinsic obstacle to high RF throughput and a compact RF area. We investigate the causes of bank conflicts and find that most conflicts can be eliminated by exploiting the fact that a highly banked RF is often underutilized. This is especially true in GPGPUs, where multiple ready warps are available at the scheduling stage and their operand accesses can be carefully coordinated. We propose two lightweight bank stealing techniques that opportunistically fill idle banks and register entries for better operand service. With the proposed architecture, average GPGPU performance improves under a smaller energy budget with significant area savings, which makes the design promising for sustainable RF scaling.
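The observation that bank conflicts and idle banks tend to coexist can be illustrated with a small toy model. The sketch below is not the paper's design: the register-to-bank mapping `(reg + warp) % num_banks`, the single-ported banks, and the names `bank_of` and `bank_usage` are all assumptions made for illustration, and the bank stealing techniques themselves are not modeled.

```python
from collections import Counter


def bank_of(reg: int, warp: int, num_banks: int) -> int:
    # Assumed mapping, for illustration only: interleave registers
    # across banks and offset by warp ID.
    return (reg + warp) % num_banks


def bank_usage(operands, num_banks):
    """Summarize one batch of operand reads.

    operands: list of (warp, reg) reads issued together.
    Returns (conflicts, idle_banks, cycles), assuming each bank
    serves one read per cycle (single-ported).
    """
    load = Counter(bank_of(reg, warp, num_banks) for warp, reg in operands)
    conflicts = sum(n - 1 for n in load.values())  # reads that must wait
    idle_banks = num_banks - len(load)             # banks with no request
    cycles = max(load.values(), default=0)         # busiest bank sets latency
    return conflicts, idle_banks, cycles


# One warp reading r0, r4, r8 from a 4-bank RF: every read maps to bank 0.
reads = [(0, 0), (0, 4), (0, 8)]
print(bank_usage(reads, num_banks=4))  # (2, 3, 3)
```

In this toy case two reads stall on bank 0 while three banks sit idle, which is the kind of underutilization the abstract points to as an opportunity.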