Retargetting an application program for execution by a general purpose processor
IPC classification information
Country / Type: United States (US) Patent, Registered
International Patent Classification (IPC, 7th edition): G06F-009/38
Application number: US-0407711 (2009-03-19)
Registration number: US-8612732 (2013-12-17)
Inventors:
Grover, Vinod
Aarts, Bastiaan Joannes Matheus
Murphy, Michael
Beylin, Boris
Kolhe, Jayant B.
Saylor, Douglas
Applicant: NVIDIA Corporation
Agent: Patterson + Sheridan, L.L.P.
Citation information
Times cited: 8
Patents cited: 18
Abstract
One embodiment of the present invention sets forth a technique for translating application programs written using a parallel programming model for execution on a multi-core graphics processing unit (GPU) so that they can be executed by a general purpose central processing unit (CPU). Portions of the application program that rely on specific features of the multi-core GPU are converted by a translator for execution by a general purpose CPU. The application program is partitioned into regions of synchronization independent instructions. The instructions are classified as convergent or divergent, and divergent memory references that are shared between regions are replicated. Thread loops are inserted to ensure correct sharing of memory between various threads during execution by the general purpose CPU.
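The effect of partitioning at a barrier, replicating a divergent value, and inserting thread loops can be illustrated with a minimal Python sketch. This is not the patent's translator: the kernel, the names `run_cta_on_cpu`, `run_cta_naively`, and `CTA_SIZE`, and the one-dimensional cooperative thread array (CTA) are all assumptions chosen for illustration. The hypothetical kernel has each thread write one element to shared memory, wait at a barrier, and then read its neighbor's element.

```python
"""Sketch: running one cooperative thread array (CTA) on a single CPU thread.

Hypothetical GPU kernel being emulated (pseudocode, not taken from the patent):

    tmp = inp[tid] * 2            # divergent: depends on the thread ID
    shared[tid] = tmp
    barrier()                     # synchronization barrier
    out[tid] = tmp + shared[(tid + 1) % CTA_SIZE]
"""

CTA_SIZE = 4  # assumed 1-D CTA size, chosen for illustration


def run_cta_on_cpu(inp):
    """Hand-translated version: one thread loop per barrier-free region."""
    shared = [0] * CTA_SIZE
    out = [0] * CTA_SIZE
    # `tmp` is divergent and used in both regions, so it is replicated:
    # one slot per thread instead of a single scalar.
    tmp = [0] * CTA_SIZE

    # Thread loop around region 1 (the instructions before the barrier).
    for tid in range(CTA_SIZE):
        tmp[tid] = inp[tid] * 2
        shared[tid] = tmp[tid]

    # The barrier needs no code of its own: finishing the loop above means
    # every thread has completed region 1 before any thread enters region 2.

    # Thread loop around region 2 (the instructions after the barrier).
    for tid in range(CTA_SIZE):
        out[tid] = tmp[tid] + shared[(tid + 1) % CTA_SIZE]
    return out


def run_cta_naively(inp):
    """Incorrect serialization: runs each thread to completion, ignoring the barrier."""
    shared = [0] * CTA_SIZE
    out = [0] * CTA_SIZE
    for tid in range(CTA_SIZE):
        tmp = inp[tid] * 2
        shared[tid] = tmp
        # Reads a neighbor's slot that may not have been written yet.
        out[tid] = tmp + shared[(tid + 1) % CTA_SIZE]
    return out
```

With the input `[1, 2, 3, 4]`, `run_cta_on_cpu` returns `[6, 10, 14, 10]`, whereas `run_cta_naively` returns `[2, 4, 6, 10]` because most threads read a neighbor's shared-memory slot before that neighbor has written it. Splitting the loop at the barrier is what restores the ordering the GPU barrier would have enforced.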
Representative claim
1. A computer-implemented method for translating an application program for execution by a general purpose processor, the method comprising: receiving the application program written using a parallel programming model for execution on a multi-core graphics processing unit; partitioning the application program into regions of synchronization independent instructions to produce a partitioned application program; determining a plurality of variance vectors associated with a first region in the regions of synchronization independent instructions, wherein each variance vector indicates dependence of a different statement in the region on zero or more cooperative thread array dimensions; and inserting a loop around the first region to produce a translated application program for execution by the general purpose processor, wherein the loop iterates over only the cooperative thread array dimensions that correspond to the thread dimensions in the plurality of variance vectors.
2. The method of claim 1, further comprising, prior to the step of partitioning, identifying a synchronization barrier instruction within the application program.
3. The method of claim 2, wherein the first region includes instructions that are before the synchronization barrier instruction and a second region of the partitioned application program includes instructions that are after the synchronization barrier instruction.
4. The method of claim 3, wherein the step of inserting a loop includes inserting a first loop around the first region of the partitioned application program to ensure that all threads in a cooperative thread array will complete execution of the first region of the partitioned application program before any one of the threads in the cooperative thread array begins execution of the second region of the partitioned application program.
5. The method of claim 2, wherein the application program is represented as a control flow graph including basic block nodes connected by edges.
6. The method of claim 5, wherein the step of partitioning includes replacing the synchronization barrier instruction with an edge to separate one of the block nodes into a first basic block node corresponding to a first region and a second basic block node corresponding to a second region.
7. The method of claim 1, further comprising the step of classifying the partitioned application program to identify each statement as either convergent or divergent with respect to a cooperative thread array dimension in the cooperative thread array dimensions.
8. A non-transitory computer-readable medium that includes instructions that, when executed by a processing unit, cause the processing unit to translate an application program for execution by a general purpose processor, by performing the steps of: receiving the application program written using a parallel programming model for execution on a multi-core graphics processing unit; partitioning the application program into regions of synchronization independent instructions to produce a partitioned application program; determining a plurality of variance vectors associated with a first region in the regions of synchronization independent instructions, wherein each variance vector indicates dependence of a different statement in the region on zero or more cooperative thread array dimensions; and inserting a loop around the first region to produce a translated application program for execution by the general purpose processor, wherein the loop iterates over only the cooperative thread array dimensions that correspond to the thread dimensions in the plurality of variance vectors.
9. The non-transitory computer-readable medium of claim 8, further comprising, prior to the step of partitioning, identifying a synchronization barrier instruction within the application program.
10. The non-transitory computer-readable medium of claim 9, wherein the first region includes instructions that are before the synchronization barrier instruction and a second region of the partitioned application program includes instructions that are after the synchronization barrier instruction.
11. The non-transitory computer-readable medium of claim 10, wherein the step of inserting the loop includes inserting a first loop around the first region of the partitioned application program to ensure that all threads in a cooperative thread array will complete execution of the first region of the partitioned application program before any one of the threads in the cooperative thread array begins execution of the second region of the partitioned application program.
12. The non-transitory computer-readable medium of claim 9, wherein the application program is represented as a control flow graph including basic block nodes connected by edges.
13. The non-transitory computer-readable medium of claim 12, wherein the step of partitioning includes replacing the synchronization barrier instruction with an edge to separate one of the block nodes into a first basic block node corresponding to a first region and a second basic block node corresponding to a second region.
14. The non-transitory computer-readable medium of claim 8, further comprising the step of classifying the partitioned application program to identify each statement as either convergent or divergent with respect to a cooperative thread array dimension in the cooperative thread array dimensions.
15. A computing system configured to translate an application program for execution by a general purpose processor, comprising: a processor configured to execute a translator; and a system memory coupled to the processor and configured to store the translator, a first application program, and a second application program, the first application program written using a parallel programming model for execution on a multi-core graphics processing unit, the second application program configured for execution by the general purpose processor, and the translator configured to: receive the first application program; partition the first application program into regions of synchronization independent instructions to produce a partitioned application program; determine a plurality of variance vectors associated with a first region in the regions of synchronization independent instructions, wherein each variance vector indicates dependence of a different statement in the region on zero or more cooperative thread array dimensions; and insert a loop around the first region to produce a translated application program for execution by the general purpose processor, wherein the loop iterates over only the cooperative thread array dimensions that correspond to the thread dimensions in the plurality of variance vectors.
16. The computing system of claim 15, wherein the translator is further configured to identify a synchronization barrier instruction within the application program.
17. The computing system of claim 16, wherein the first region includes instructions that are before the synchronization barrier instruction and a second region of the partitioned application program includes instructions that are after the synchronization barrier instruction.
18. The computing system of claim 17, wherein the step of inserting the loop includes inserting a first loop around the first region of the partitioned application program to ensure that all threads in a cooperative thread array will complete execution of the first region of the partitioned application program before any one of the threads in the cooperative thread array begins execution of the second region of the partitioned application program.
19. The computing system of claim 16, wherein the application program is represented as a control flow graph including basic block nodes connected by edges and the step of partitioning includes replacing the synchronization barrier instruction with an edge to separate a one of the block nodes into a first basic block node corresponding to a first region and a second basic block node corresponding to a second region.
20. The computing system of claim 15, further comprising the step of classifying the partitioned application program to identify each statement as either convergent or divergent with respect to a cooperative thread array dimension in the cooperative array dimensions.
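The variance vectors recited in the independent claims can be pictured with a small Python sketch. The record does not describe how the translator computes them, so the propagation rule below is an assumption: a statement depends on a CTA dimension if it references that thread-ID dimension directly or reads a variable that already depends on it. The `Stmt` type, the function names, and the restriction to a straight-line region in which each variable is assigned once are likewise hypothetical.

```python
from dataclasses import dataclass

DIMS = ("x", "y", "z")  # cooperative thread array (CTA) dimensions


@dataclass(frozen=True)
class Stmt:
    """One statement of a straight-line region (hypothetical representation)."""
    target: str                          # variable written by the statement
    reads: tuple = ()                    # variables read by the statement
    tid_dims: frozenset = frozenset()    # thread-ID dimensions referenced directly


def variance_vectors(region):
    """Return one variance vector (a frozenset of CTA dimensions) per statement.

    Assumed rule: a statement varies with a dimension if it uses that
    thread-ID dimension directly or reads a variable that varies with it.
    """
    variance_of = {}  # variable name -> dimensions it depends on
    vectors = []
    for stmt in region:
        vec = set(stmt.tid_dims)
        for name in stmt.reads:
            vec |= variance_of.get(name, frozenset())
        variance_of[stmt.target] = frozenset(vec)
        vectors.append(frozenset(vec))
    return vectors


def loop_dims(vectors):
    """Dimensions the thread loop around the region must iterate over."""
    needed = frozenset().union(*vectors)
    return [d for d in DIMS if d in needed]


# Example region: only the x dimension of the thread ID is ever used.
EXAMPLE_REGION = [
    Stmt("i", tid_dims=frozenset({"x"})),   # i = tid_x
    Stmt("scale"),                          # scale = 3 (depends on no dimension)
    Stmt("v", reads=("i", "scale")),        # v = a[i] * scale
]
```

For `EXAMPLE_REGION`, `variance_vectors` yields `{x}`, the empty set, and `{x}`, so `loop_dims` returns `["x"]`: under these assumptions the inserted loop would iterate over the x dimension only and skip y and z. An empty vector corresponds to a statement that depends on zero dimensions, which matches the "zero or more" wording of the claims.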
Ichinose,Katsumi; Moriya,Katsuyoshi, Information processing method and recording medium therefor capable of enhancing the executing speed of a parallel processing computing device.
Pan,Jielin; Yuan,Baosheng, Method, apparatus, and system for building a compact model for large vocabulary continuous speech recognition (LVCSR) system.
Callahan, II, Charles David; Shields, Keith Arnett; Briggs, III, Preston Pengra, Parallelism performance analysis based on execution trace information.
Tanaka, Yasuyuki, System for controlling assignment of a plurality of modules of a program to available execution units based on speculative executing and granularity adjusting.
Fetterman, Michael; Carlton, Stewart Glenn; Choquette, Jack Hilaire; Gadre, Shirish; Giroux, Olivier; Hahn, Douglas J.; Heinrich, Steven James; Hill, Eric Lyell; McCarver, Charles; Paranjape, Omkar; Rajendran, Anjana; Selvanesan, Rajeshwaran, Pre-scheduled replays of divergent operations.