IPC classification information
Country / Type |
United States(US) Patent
Granted
|
International Patent Classification (IPC, 7th edition) |
|
Application number |
US-0415075
(2009-03-31)
|
Registration number |
US-8776030
(2014-07-08)
|
Inventor / Address |
- Grover, Vinod
- Aarts, Bastiaan Joannes Matheus
- Murphy, Michael
|
Applicant / Address |
|
Agent / Address |
Patterson & Sheridan, LLP
|
Citation information |
Times cited: 12
Patents cited: 20
|
Abstract
One embodiment of the present invention sets forth a technique for translating application programs written using a parallel programming model for execution on multi-core graphics processing unit (GPU) for execution by general purpose central processing unit (CPU). Portions of the application program that rely on specific features of the multi-core GPU are converted by a translator for execution by a general purpose CPU. The application program is partitioned into regions of synchronization independent instructions. The instructions are classified as convergent or divergent and divergent memory references that are shared between regions are replicated. Thread loops are inserted to ensure correct sharing of memory between various threads during execution by the general purpose CPU.
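The thread-loop idea in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `run_cta_serialized`, the arrays `shared` and `out`, and the toy kernel body are all hypothetical, and the sketch assumes a one-dimensional thread block with a single barrier separating two regions.

```python
def run_cta_serialized(block_dim):
    """Run one cooperative thread array (CTA) on a single CPU thread.

    Hypothetical GPU kernel being emulated (one thread per tid):
        shared[tid] = tid * 2
        <synchronization barrier>
        out[tid] = shared[(tid + 1) % block_dim]
    """
    shared = [0] * block_dim  # stands in for per-CTA shared memory
    out = [0] * block_dim

    # Thread loop around region 1 (statements before the barrier).
    for tid in range(block_dim):
        shared[tid] = tid * 2

    # No explicit barrier is needed: region 1 has finished for every
    # thread before the loop for region 2 starts.

    # Thread loop around region 2 (statements after the barrier).
    for tid in range(block_dim):
        out[tid] = shared[(tid + 1) % block_dim]

    return out


print(run_cta_serialized(4))
```

Because each region runs to completion for all threads before the next region begins, every thread reads its neighbor's value only after that neighbor has written it, which is the guarantee the barrier provided on the GPU.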
Representative claims
1. A computer-implemented method for partitioning an application program that comprises a plurality of statements, the method comprising: selecting a statement in the plurality of statements to analyze; if the statement is not a synchronization barrier instruction, then adding the statement to a current partition, or if the statement is a synchronization barrier instruction or if the statement is a start of a control-flow construct that includes a synchronization barrier instruction, then ending the current partition and storing the current partition in an output list of partitions; beginning a new current partition and repeating the steps of selecting, adding, and ending until all statements in the plurality of statements have been analyzed; annotating each statement in the application program with a corresponding variance vector that is a representation of a set configured to indicate thread dimensions on which the statement depends; and reordering statements in a partition in the output list of partitions to cause statements in the partition that have fewer dimensions in their corresponding variance vectors to precede statements in the partition that have more dimensions in their corresponding variance vectors.
2. The method of claim 1, further comprising the step of partitioning the control-flow construct when the statement represents an end of the control flow construct.
3. The method of claim 1, further comprising the step of recursively partitioning the control-flow construct when the control-flow construct includes a synchronization barrier instruction.
4. The method of claim 1, further comprising the step of executing the partitioned application program by the general purpose processor.
5. The method of claim 1, wherein a first region of the partitioned application program includes instructions that are before the synchronization barrier instruction and a second region of the partitioned application program includes instructions that are after the synchronization barrier instruction.
6. The method of claim 5, further comprising the step of inserting a first loop nest around the first region of the partitioned application program to ensure that all threads in a cooperative thread array will complete execution of the first region of the partitioned application program before any one of the threads in the cooperative thread array begins execution of the second region of the partitioned application program.
7. The method of claim 1, wherein a new partition is created when two paths of control-flow that originate in different partitions meet to form a reconvergence point of a branch in the application program.
8. The method of claim 1, wherein a portion of the application program beginning at a reconvergence point of a branch is replicated and appended to each potentially preceding partition.
9. The method of claim 1, wherein each variance vector comprises a sequence of bits, each bit indicating whether a corresponding statement is dependent on a particular thread dimension.
10. The method of claim 9, wherein each variance vector includes one bit for each of three thread dimensions.
11. A non-transitory computer-readable medium that includes instructions that, when executed by a processing unit, cause the processing unit to partition an application program as part of translating the application program for execution by a general purpose processor, by performing the steps of: selecting a statement in the plurality of statements to analyze; if the statement is not a synchronization barrier instruction, then adding the statement to a current partition, or if the statement is a synchronization barrier instruction or if the statement is a start of a control-flow construct that includes a synchronization barrier instruction, then ending the current partition and storing the current partition in an output list of partitions; beginning a new current partition and repeating the steps of selecting, adding, and ending until all statements in the plurality of statements have been analyzed; annotating each statement in the application program with a corresponding variance vector that is a representation of a set configured to indicate thread dimensions on which the statement depends; and reordering statements in a partition in the output list of partitions to cause statements in the partition that have fewer dimensions in their corresponding variance vectors to precede statements in the partition that have more dimensions in their corresponding variance vectors.
12. The non-transitory computer-readable medium of claim 11, further comprising the step of partitioning the control-flow construct when the statement represents an end of the control flow construct.
13. The non-transitory computer-readable medium of claim 11, further comprising the step of recursively partitioning the control-flow construct when the control-flow construct includes a synchronization barrier instruction.
14. The non-transitory computer-readable medium of claim 11, wherein a first region of the partitioned application program includes instructions that are before the synchronization barrier instruction and a second region of the partitioned application program includes instructions that are after the synchronization barrier instruction.
15. The non-transitory computer-readable medium of claim 14, further comprising the step of inserting a first loop nest around the first region of the partitioned application program to ensure that all threads in a cooperative thread array will complete execution of the first region of the partitioned application program before any one of the threads in the cooperative thread array begins execution of the second region of the partitioned application program.
16. The computer-readable medium of claim 11, wherein a portion of the application program beginning at a reconvergence point of a branch is replicated and appended to each potentially preceding partition.
17. A computing system configured to partition an application program that comprises a plurality of statements, comprising: a processor configured to execute a translator; and a system memory coupled to the processor and configured to store the translator, a first application program, and a second application program, the first application program written using a parallel programming model for execution on a multi-core graphics processing unit, the second application program configured for execution by the general purpose processor, and the translator configured to: select a statement in the plurality of statements to analyze; if the statement is not a synchronization barrier instruction, then add the statement to a current partition, or if the statement is a synchronization barrier instruction or if the statement is a start of a control-flow construct that includes a synchronization barrier instruction, then end the current partition and store the current partition in an output list of partitions; store the current partition to an output list of partitions and starting a new partition when the statement is a synchronization barrier instruction or when the statement represents a start of a control-flow construct that includes a synchronization barrier instruction; begin a new current partition and repeat the steps of selecting, adding, and ending until all statements in the plurality of statements have been analyzed; annotate each statement in the application program with a corresponding variance vector that is a representation of a set configured to indicate thread dimensions on which the statement depends; and reorder statements in a partition in the output list of partitions to cause statements in the partition that fewer dimensions in their corresponding variance vectors to precede statements in the partition that have more dimensions in their corresponding variance vectors.
18. The computing system of claim 17, wherein the translator is further configured to partition the control-flow construct when the statement represents an end of the control flow construct.
19. The computing system of claim 17, wherein the translator is further configured to recursively partition the control-flow construct when the control flow construct includes a synchronization barrier instruction.
20. The computing system of claim 17, wherein a first region of the second application program includes instructions that are before the synchronization barrier instruction and a second region of the second application program includes instructions that are after the synchronization barrier instruction.
21. The computing system of claim 20, wherein the translator is further configured to insert a first loop nest around the first region of the second application program to ensure that all threads in a cooperative thread array will complete execution of the first region of the second application program before any one of the threads in the cooperative thread array begins execution of the second region of the second application program.
22. The computing system of claim 17, wherein the translator is further configured to replicate a portion of the first application program beginning at a reconvergence point of a branch and append the portion of the first application program to each potentially preceding partition.
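The partitioning and reordering steps recited in claims 1, 9, and 10 can be sketched in Python as follows. This is an illustrative reading only, not the patented implementation: the `Stmt` class, the `partition` and `reorder` functions, and the bit constants `X`, `Y`, `Z` are hypothetical names. The sketch assumes a flat statement list (no control-flow constructs, so the recursive cases of claims 2 and 3 are omitted) and assumes the statements in a partition have no data dependencies that would forbid reordering.

```python
from dataclasses import dataclass

# Assumed encoding: one bit per thread dimension (claims 9 and 10).
X, Y, Z = 0b001, 0b010, 0b100


@dataclass
class Stmt:
    text: str
    variance: int = 0        # variance vector as a 3-bit mask
    is_barrier: bool = False


def partition(stmts):
    """Split a flat statement list at synchronization barriers."""
    partitions, current = [], []
    for s in stmts:
        if s.is_barrier:
            if current:                  # end the current partition
                partitions.append(current)
            current = []                 # begin a new current partition
        else:
            current.append(s)
    if current:
        partitions.append(current)
    return partitions


def reorder(part):
    """Place statements that depend on fewer thread dimensions first.

    sorted() is stable, so statements with equal counts keep their order.
    Simplification: data dependencies between statements are ignored.
    """
    return sorted(part, key=lambda s: bin(s.variance).count("1"))


stmts = [
    Stmt("a = tid_x + tid_y", X | Y),
    Stmt("b = 4"),
    Stmt("c = tid_x", X),
    Stmt("barrier", is_barrier=True),
    Stmt("d = tid_z", Z),
]
result = [[s.text for s in reorder(p)] for p in partition(stmts)]
print(result)
```

In this example the first partition is reordered so that the thread-invariant statement `b = 4` comes first, followed by the statement that depends on one dimension and then the one that depends on two.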