Systems and methods for dynamically choosing a processing element for a compute kernel
IPC classification information
Country / Type
United States (US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06F-009/45
G06F-009/455
G06F-009/48
Application number
US-0349427
(2012-01-12)
Patent number
US-8458680
(2013-06-04)
Inventor / Address
Crutchfield, William Y.
Grant, Brian K.
Papakipos, Matthew N.
Applicant / Address
Google Inc.
Agent / Address
Morgan, Lewis & Bockius LLP
Citation information
Times cited: 13
Patents cited: 46
Abstract
A runtime system implemented in accordance with the present invention provides an application platform for parallel-processing computer systems. Such a runtime system enables users to leverage the computational power of parallel-processing computer systems to accelerate/optimize numeric and array-intensive computations in their application programs. This enables greatly increased performance of high-performance computing (HPC) applications.
Representative claims
1. A computer-implemented method, comprising: at a runtime system executing at a parallel-processing computer that includes multiple types of processing elements using shared memory: at runtime: receiving one or more operation requests issued by an application; choosing a respective type of processing element for at least one of the one or more operation requests, the at least one operation request corresponding to an intrinsic operation or a primitive operation, wherein the respective type of processing element is selected from the group consisting of single-core/multi-core central processing units, graphics processing units and single-core/multi-core co-processors; identifying, from a library, a precompiled, processor-specific compute kernel corresponding to the intrinsic operation or the primitive operation of the at least one operation request, the library comprising a plurality of processor-specific compute kernels corresponding to a plurality of intrinsic operations or primitive operations; and preparing one or more compute kernels for the at least one operation request, wherein the one or more compute kernels include the at least one precompiled, processor-specific compute kernel, and are configured to execute on the respective type of processing element.
2. The computer-implemented method of claim 1, wherein the precompiled, processor-specific compute kernels in the library are compressed and/or encrypted.
3. The computer-implemented method of claim 1, wherein the precompiled, processor-specific compute kernels are hand-coded for a specific type of processor.
4. The computer-implemented method of claim 1, wherein the intrinsic operation corresponding to the at least one operation request is selected from a group consisting of: matrix multiplication, matrix solvers, fast Fourier transforms, convolutions and LU decomposition.
5. The computer-implemented method of claim 1, wherein the primitive operation corresponding to the at least one operation request is selected from a group consisting of: arithmetic operations and trigonometric functions.
6. The computer-implemented method of claim 1, wherein the respective type of processing element is chosen based, at least in part, on one of a predefined computer resource requirement metric of the one or more operation requests and a predefined workload requirement metric of the parallel-processing computer running the runtime system.
7. The computer-implemented method of claim 1, wherein the one or more operation requests correspond to application program interface function calls.
8. The computer-implemented method of claim 1, wherein the identified precompiled, processor-specific compute kernel corresponding to the intrinsic operation or the primitive operation is in a GPU binary instruction set architecture format.
9. A parallel-processing computer system, comprising: memory; multiple types of processing elements using shared memory; and at least one program stored in the memory and executed by the multiple types of processing elements, the at least one program including a runtime system comprising instructions for: at runtime: receiving one or more operation requests issued by an application; choosing a respective type of processing element for at least one of the one or more operation requests, the at least one operation request corresponding to an intrinsic operation or a primitive operation, wherein the respective type of processing element is selected from the group consisting of single-core/multi-core central processing units, graphics processing units and single-core/multi-core co-processors; identifying, from a library, a precompiled, processor-specific compute kernel corresponding to the intrinsic operation or the primitive operation of the at least one operation request, the library comprising a plurality of processor-specific compute kernels corresponding to a plurality of intrinsic operations or primitive operations; and preparing one or more compute kernels for the at least one operation request, wherein the one or more compute kernels include the at least one precompiled, processor-specific compute kernel, and are configured to execute on the respective type of processing element.
10. The parallel-processing computer system of claim 9, wherein the precompiled, processor-specific compute kernels in the library are compressed and/or encrypted.
11. The parallel-processing computer system of claim 9, wherein the precompiled, processor-specific compute kernels are hand-coded for a specific type of processor.
12. The parallel-processing computer system of claim 9, wherein the intrinsic operation corresponding to the at least one operation request is selected from a group consisting of: matrix multiplication, matrix solvers, fast Fourier transforms, convolutions and LU decomposition.
13. The parallel-processing computer system of claim 9, wherein the primitive operation corresponding to the at least one operation request is selected from a group consisting of: arithmetic operations and trigonometric functions.
14. The parallel-processing computer system of claim 9, wherein the respective type of processing element is chosen based, at least in part, on one of a predefined computer resource requirement metric of the one or more operation requests and a predefined workload requirement metric of the parallel-processing computer running the runtime system.
15. The parallel-processing computer system of claim 9, wherein the one or more operation requests correspond to application program interface function calls.
16. The parallel-processing computer system of claim 9, wherein the identified precompiled, processor-specific compute kernel corresponding to the intrinsic operation or the primitive operation is in a GPU binary instruction set architecture format.
17. A non-transitory computer readable storage medium storing one or more programs comprising instructions for: at runtime: receiving one or more operation requests issued by an application; choosing a respective type of processing element for at least one of the one or more operation requests, the at least one operation request corresponding to an intrinsic operation or a primitive operation, wherein the respective type of processing element is selected from the group consisting of single-core/multi-core central processing units, graphics processing units and single-core/multi-core co-processors; identifying, from a library, a precompiled, processor-specific compute kernel corresponding to the intrinsic operation or the primitive operation of the at least one operation request, the library comprising a plurality of processor-specific compute kernels corresponding to a plurality of intrinsic operations or primitive operations; and preparing one or more compute kernels for the at least one operation request, wherein the one or more compute kernels include the at least one precompiled, processor-specific compute kernel, and are configured to execute on the respective type of processing element; and wherein the one or more programs are configured to be executed by a parallel-processing computer that includes multiple types of processing elements using shared memory.
18. The non-transitory computer readable storage medium of claim 17, wherein the precompiled, processor-specific compute kernels in the library are compressed and/or encrypted.
19. The non-transitory computer readable storage medium of claim 17, wherein the precompiled, processor-specific compute kernels are hand-coded for a specific type of processor.
20. The non-transitory computer readable storage medium of claim 17, wherein the intrinsic operation corresponding to the at least one operation request is selected from a group consisting of: matrix multiplication, matrix solvers, fast Fourier transforms, convolutions and LU decomposition.
21. The non-transitory computer readable storage medium of claim 17, wherein the primitive operation corresponding to the at least one operation request is selected from a group consisting of: arithmetic operations and trigonometric functions.
22. The non-transitory computer readable storage medium of claim 17, wherein the respective type of processing element is chosen based, at least in part, on one of a predefined computer resource requirement metric of the one or more operation requests and a predefined workload requirement metric of the parallel-processing computer running the runtime system.
23. The non-transitory computer readable storage medium of claim 17, wherein the one or more operation requests correspond to application program interface function calls.
24. The non-transitory computer readable storage medium of claim 17, wherein the identified precompiled, processor-specific compute kernel corresponding to the intrinsic operation or the primitive operation is in a GPU binary instruction set architecture format.
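The flow recited in claims 1, 2 and 6 (receive an operation request, choose a type of processing element, look up a precompiled processor-specific kernel in a library, and prepare it) might be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the names `PEType`, `OperationRequest`, `KERNEL_LIBRARY`, `choose_pe_type` and `prepare_kernel`, the element-count threshold, the `gpu_busy` flag, the placeholder kernel bytes, and the CPU fallback are all assumptions introduced here. zlib compression stands in for the "compressed" library of claim 2.

```python
import zlib
from dataclasses import dataclass
from enum import Enum


class PEType(Enum):
    """Types of processing element named in claim 1."""
    CPU = "cpu"
    GPU = "gpu"
    COPROCESSOR = "coprocessor"


@dataclass(frozen=True)
class OperationRequest:
    op: str            # intrinsic or primitive operation, e.g. "matmul" or "add"
    num_elements: int  # hypothetical resource-requirement metric


# Hypothetical library: (operation, PE type) -> compressed kernel bytes.
# The byte strings are placeholders, not real kernel binaries.
KERNEL_LIBRARY = {
    ("matmul", PEType.CPU): zlib.compress(b"cpu-matmul-binary"),
    ("matmul", PEType.GPU): zlib.compress(b"gpu-matmul-binary"),
    ("add", PEType.CPU): zlib.compress(b"cpu-add-binary"),
}


def choose_pe_type(request, gpu_busy, threshold=10_000):
    """Pick a PE type from a request-size metric and a workload metric.

    Both the threshold and the gpu_busy flag are assumed stand-ins for the
    "predefined" metrics of claim 6.
    """
    if request.num_elements >= threshold and not gpu_busy:
        return PEType.GPU
    return PEType.CPU


def prepare_kernel(request, gpu_busy=False):
    """Return (PE type, decompressed kernel bytes) for one operation request."""
    pe = choose_pe_type(request, gpu_busy)
    blob = KERNEL_LIBRARY.get((request.op, pe))
    if blob is None:
        # Assumed fallback: use the CPU kernel when none exists for the chosen type.
        pe = PEType.CPU
        blob = KERNEL_LIBRARY[(request.op, pe)]
    return pe, zlib.decompress(blob)
```

For example, `prepare_kernel(OperationRequest("matmul", 1_000_000))` selects the GPU entry in this sketch, while the same request with `gpu_busy=True` falls back to the CPU entry.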
Patents cited by this patent (46)
Wu, Gansha; Lueh, Guei Yuan; Shi, Xiaohua, Apparatus and methods for restoring synchronization to object-oriented software applications in managed runtime environments.
Tang Jun ; So John Ling Wing, Computer operating process allocating tasks between first and second processors at run time based upon current processor load.
Alain Charles Azagury IL; Michael Factor IL; Gera Goft IL; Shlomit Pinter IL; Esther Yeger-Lotem IL, Group communication system with flexible member model.
Kielstra, Allan Henry; Stepanian, Levon Sassoon; Stoodley, Kevin Alexander, Method and apparatus for transforming Java Native Interface function calls into simpler operations during just-in-time compilation.
Gupta Rajiv ; Worley ; Jr. William S., Out-of-order execution using encoded dependencies between instructions in queues to determine stall values that control.
Wright, Gregory M.; Wolczko, Mario I.; Seidl, Matthew L., Reducing the overhead involved in executing native code in a virtual machine through binary reoptimization.
Spix George A. (Eau Claire WI) Wengelski Diane M. (Eau Claire WI) Hawkinson Stuart W. (Eau Claire WI) Johnson Mark D. (Eau Claire WI) Burke Jeremiah D. (Eau Claire WI) Thompson Keith J. (Eau Claire W, System and method for controlling a highly parallel multiprocessor using an anarchy based scheduler for parallel executi.
Craig Chambers ; Susan J. Eggers ; Brian K. Grant ; Markus Mock ; Matthai Philipose, System and method for performing selective dynamic compilation using run-time information.
Demetriou, Christopher G.; Papakipos, Matthew N.; Gibbs, Noah L., Systems and methods for debugging an application running on a parallel-processing computer system.
Crutchfield, William Y.; Grant, Brian K.; Papakipos, Matthew N., Systems and methods for dynamically choosing a processing element for a compute kernel.
Ankireddipally, Lakshmi Narasimha; Yeh, Ryh-Wei; Nichols, Dan; Devesetti, Ravi, Transaction data structure for process communications among network-distributed applications.