[Patent] Counter-based delay of dependent thread group execution


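IPC classification information
Country / Type	United States (US) Patent, Granted
International Patent Classification (IPC 7th edition)	G06F-009/40
Application number	UP-0535871 (2006-09-27)
Registration number	US-7526634 (2009-07-01)
Inventor / Address	Duluk, Jr., Jerome F.; Lew, Stephen D.; Nickolls, John R.
Applicant / Address	Nvidia Corporation
Agent / Address	Townsend and Townsend and Crew LLP
Citation information	Times cited: 34; Patents cited: 4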

Abstract

Systems and methods for synchronizing processing work performed by threads, cooperative thread arrays (CTAs), or "sets" of CTAs. A central processing unit can load launch commands for a first set of CTAs and a second set of CTAs in a pushbuffer, and specify a dependency of the second set upon completion of execution of the first set. A parallel or graphics processor (GPU) can autonomously execute the first set of CTAs and delay execution of the second set of CTAs until the first set of CTAs is complete. In some embodiments the GPU may determine that a third set of CTAs is not dependent upon the first set, and may launch the third set of CTAs while the second set of CTAs is delayed. In this manner, the GPU may execute launch commands out of order with respect to the order of the launch commands in the pushbuffer.
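The scheduling behavior described in the abstract can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the patented hardware logic: the names `LaunchCommand`, `counter_id`, `depends_on`, and `simulate` are hypothetical, each running set is assumed to complete one CTA per simulation step, and every set is assumed to contain at least one CTA.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LaunchCommand:
    name: str
    num_ctas: int                     # CTAs in this set (assumed >= 1)
    counter_id: int                   # hypothetical reference-counter identifier
    depends_on: Optional[int] = None  # counter_id of the set this one waits on


def simulate(pushbuffer: List[LaunchCommand]) -> List[str]:
    """Return the order in which launch commands actually start."""
    completed = {c.counter_id: 0 for c in pushbuffer}
    total = {c.counter_id: c.num_ctas for c in pushbuffer}
    waiting = list(pushbuffer)  # kept in pushbuffer order
    running: List[LaunchCommand] = []
    launch_order: List[str] = []

    while waiting or running:
        # Launch phase: scan in pushbuffer order, skipping delayed commands.
        still_waiting = []
        for cmd in waiting:
            dep = cmd.depends_on
            if dep is None or completed[dep] == total[dep]:
                running.append(cmd)
                launch_order.append(cmd.name)
            else:
                still_waiting.append(cmd)  # dependency not yet satisfied
        waiting = still_waiting
        if not running:
            raise RuntimeError("remaining commands can never launch")

        # Execute phase (assumption): one CTA per running set completes,
        # incrementing that set's reference counter.
        for cmd in running:
            completed[cmd.counter_id] += 1
        running = [c for c in running
                   if completed[c.counter_id] < total[c.counter_id]]

    return launch_order


pushbuffer = [
    LaunchCommand("first", num_ctas=4, counter_id=0),
    LaunchCommand("second", num_ctas=2, counter_id=1, depends_on=0),
    LaunchCommand("third", num_ctas=3, counter_id=2),
]
print(simulate(pushbuffer))
```

In this run the third set starts while the second is still delayed, so the start order differs from the pushbuffer order, which is the out-of-order behavior the abstract describes.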

Representative claims

What is claimed is:

1. A method, comprising: executing a first set of thread arrays in a processor, wherein the first set of thread arrays comprises a first set of thread groups, wherein thread groups from the first set of thread groups execute, in parallel, instructions associated with the first process, and wherein a first set of thread groups is associated with a first reference counter that increments upon completion of execution of each thread group from the first set of thread groups; specifying a second set of thread arrays as dependent on a status of execution of the first set of thread arrays; and delaying execution of the second set of thread arrays in the processor based on the status of execution of the first set of thread arrays, wherein delaying execution of the second set of thread arrays further comprises: counting a number of thread groups that have completed execution in the first set of thread arrays; and delaying until the number of thread groups that have completed execution equals a number of thread groups in the first set of thread arrays.

2. The method of claim 1 wherein the status of execution of the first set of thread arrays includes an indication that each thread array in the set has completed execution.

3. The method of claim 1 wherein delaying execution comprises determining based on a comparison of a launch counter and a completion counter whether the first set of thread arrays has completed execution.

4. The method of claim 1 wherein executing the first set of thread arrays comprises incrementing a launch counter for each launched thread array in the first set of thread arrays.

5. The method of claim 1 wherein executing the first set of thread arrays comprises incrementing a completion counter for each thread array in the first set of thread arrays that has completed execution.

6. The method of claim 1 further comprising, while delaying execution of the second set of thread arrays, executing a third set of thread arrays in the processor.

7. A method, comprising: loading from a central processing unit into a pushbuffer coupled to a processing unit: a first launch command for a first process, wherein the first launch command is to be executed by a first set of thread groups, wherein thread groups from the first set of thread groups operate in parallel to execute instructions associated with the first process, and wherein the first set of thread groups is associated with a reference counter, the reference counter being incremented upon completion of execution of each of the thread groups from the first set of thread groups, a second launch command for a second process, wherein the second launch command is to be executed by a second set of thread groups, and wherein thread groups from the second set of thread groups execute in parallel instructions associated with the second process, and a dependency between the first process and the second process, wherein the dependency associates the first reference counter with the second launch command; executing the first launch command in the processing unit; and delaying execution of the second launch command in the processing unit based on the dependency, wherein execution of the second launch command is delayed until the first reference counter equals a number of thread groups in the first set of thread groups used to execute instructions associated with the first process.

8. The method of claim 7 wherein the dependency between the first process and the second process includes an indication that the first process has completed execution.

9. The method of claim 7 wherein delaying execution comprises determining based on a comparison of a launch counter and a completion counter whether the first process has completed execution.

10. The method of claim 7 wherein the reference counter further comprises a launch counter, and wherein executing the first launch command comprises incrementing the launch counter for each launched thread group in the first set of thread groups in the first process.

11. The method of claim 8 wherein the reference counter further comprises a completion counter, and wherein executing the first process comprises incrementing the completion counter for each thread group in the first set of thread groups in the first process, and wherein delaying execution of the second launch command further comprises delaying the execution of the second launch command until the completion counter equals the launch counter.

12. The method of claim 7 further comprising, while delaying execution of the second command, executing a third process in the processor.

13. A system, comprising: a central processing unit configured to generate: a first launch command for a first process, wherein the first launch command is to be executed by a first set of thread groups, wherein thread groups from the first set of thread groups operate in parallel to execute instructions associated with the first process, and wherein the first set of thread groups is associated with a reference counter, the reference counter being incremented upon completion of execution of each of the thread groups from the first set of thread groups, a second launch command for a second process, including a dependency of the second process upon the first process, wherein the second launch command is to be executed by a second set of thread groups, wherein thread groups from the second set of thread groups execute, in parallel, instructions associated with the second process, and a third launch command for a third process, wherein the third launch command is to be executed by a third set of thread groups, wherein thread groups from the third set of thread groups execute, in parallel, instructions associated with the third process; a pushbuffer coupled to the central processing unit, the pushbuffer configured to sequentially receive the first, second, and third launch commands; and a processing unit configured to: execute the first launch command to complete the first process, delay execution of the second launch command based on the dependency; and execute the third launch command while the second process is delayed.

14. The system of claim 13 wherein executing the third launch command for the third process while the second process is delayed comprises utilizing resources in the processing unit that would not otherwise be utilized.

15. The system of claim 13 wherein executing the third launch command comprises executing instructions out of order in the processing unit with respect to an order of launch commands in the pushbuffer.

16. The system of claim 13 wherein executing the third launch command comprises determining that the third process may execute while the second process is delayed.

17. The system of claim 15 wherein the processing unit includes logic configured to determine that the third process is not dependent upon a result of the first process or the second process.

18. The system of claim 17 wherein the logic is configured to determine that the third process is not dependent upon a result of the first process by analyzing a reference counter identifier assigned to the first process.
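The two completion checks recited in the claims can be sketched side by side: comparing the completed count against the known number of thread groups (claims 1 and 7), and comparing a completion counter against a launch counter (claims 3, 9, and 11). The class and function names below are hypothetical, and the sketch assumes the counter comparison is consulted only after every thread group of the first set has been launched, since two counters at zero would also compare equal.

```python
from dataclasses import dataclass


@dataclass
class CounterPair:
    """Hypothetical launch/completion counter pair for one set of thread groups."""
    launched: int = 0
    completed: int = 0

    def on_launch(self) -> None:
        # Incremented once per launched thread group.
        self.launched += 1

    def on_complete(self) -> None:
        # Incremented once per thread group that finishes execution.
        if self.completed >= self.launched:
            raise ValueError("completion reported for a group never launched")
        self.completed += 1

    def counters_match(self) -> bool:
        # Completion check by comparing the two counters.
        return self.completed == self.launched


def dependent_may_launch(counters: CounterPair, groups_in_first_set: int) -> bool:
    # Completion check against the known size of the first set.
    return counters.completed == groups_in_first_set


first = CounterPair()
for _ in range(3):
    first.on_launch()
first.on_complete()
first.on_complete()
still_delayed = not dependent_may_launch(first, groups_in_first_set=3)
first.on_complete()
released = dependent_may_launch(first, groups_in_first_set=3) and first.counters_match()
```

After two of three completions the dependent launch stays delayed; once the third thread group completes, both checks agree that the first set is done.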

Patents cited by this patent (4)

Nishihara Kazunori ; Hiramatsu Takaai, Condition variable to synchronize high level communication between processing threads.
Kanai,Tatsunori; Maeda,Seiji; Yoshii,Kenichiro, Local memory management system with plural processors.
Savov,Andrey I.; Garg,Man M., Method and apparatus for serving a request queue.
Dmitry Robsman, Thread optimization.

Patents citing this patent (34)

Farrar, Timothy Paul Lottes; Llamas, Ignacio; Wexler, Daniel Elliot; Duttweiler, Craig Ross, API for launching work on a processor.
Gopalakrishnan, Liji; Fruchter, Vlad; Lai, Lawrence; Batra, Pradeep; Woo, Steven C.; Ellis, Wayne Frederick, Communication via a memory interface.
Gopalakrishnan, Liji; Fruchter, Vlad; Lai, Lawrence; Batra, Pradeep; Woo, Steven C.; Ellis, Wayne Frederick, Communication via a memory interface.
Cuadra, Philip Alexander; Abdalla, Karim M.; Duluk, Jr., Jerome F.; Durant, Luke; Luiz, Gerald F.; Purcell, Timothy John; Shah, Lacky V., Compute work distribution reference counters.
Chhodavdia, Avdhesh; Fenton, Michael S.; Nayak, Sheethal Somesh, Configurable synchronized processing of multiple operations.
Eldar, Avigdor; Teomim, Noam; Manevitch, Alexandra; Goldman, Amos; Zour, Liad Aben; Weisel, Orly; Kellerman, Raizy, Connected component labeling in graphics processors.
Subrahmanyam, Pratap; Venkatasubramanian, Rajesh, Consistent and efficient mirroring of nonvolatile memory state in virtualized environments by remote mirroring memory addresses of nonvolatile memory to which cached lines of the nonvolatile memory have been flushed.
Subrahmanyam, Pratap; Venkatasubramanian, Rajesh, Consistent and efficient mirroring of nonvolatile memory state in virtualized environments where dirty bit of page table entries in non-volatile memory are not cleared until pages in non-volatile memory are remotely mirrored.
Shah, Lacky V.; Abdalla, Karim M.; Treichler, Sean J.; de Waal, Abraham B., Controlling work distribution for processing tasks.
Adya, Atul; Wolman, Alastair; Dunagan, John D, Crisscross cancellation protocol.
Adya, Atul; Wolman, Alastair; Dunagan, John D, Crisscross cancellation protocol.
Adya, Atul; Wolman, Alastair; Dunagan, John D, Crisscross cancellation protocol.
Adya, Atul; Wolman, Alastair; Dunagan, John D, Crisscross cancellation protocol.
Adya, Atul; Wolman, Alastair; Dunagan, John D., Crisscross cancellation protocol.
Adya, Atul; Wolman, Alastair; Dunagan, John D., Crisscross cancellation protocol.
Teruyama, Tatsuo; Satoh, Jin, Drawing apparatus and method for processing plural pixels in parallel.
Shalev, Tal; Pinhas, Ariel, Event-driven state-machine sequencer.
Buck, Ian A.; Aarts, Bastiaan, Expressing parallel execution relationships in a sequential programming language.
Larson, Michael K., Fast queries in a multithreaded queue of a graphics system.
Bourd, Alexei V.; Sharp, Colin Christopher; Garcia Garcia, David Rigel; Zhang, Chihong, Inter-processor communication techniques in a multiple-processor computing platform.
Bourd, Alexei V.; Sharp, Colin Christopher; Garcia Garcia, David Rigel; Zhang, Chihong, Inter-processor communication techniques in a multiple-processor computing platform.
Bourd, Alexei Vladimirovich; Sharp, Colin Christopher; Garcia Garcia, David Rigel; Zhang, Chihong, Inter-processor communication techniques in a multiple-processor computing platform.
Duluk, Jr., Jerome F.; Hall, Jesse David; Cuadra, Philip Alexander; Abdalla, Karim M., Methods and apparatus for auto-throttling encapsulated compute tasks.
Perego, Richard E.; Batra, Pradeep; Woo, Steven; Lai, Lawrence; Yeung, Chi-Ming, Methods and systems for mapping a peripheral function onto a legacy memory interface.
Perego, Richard E.; Batra, Pradeep; Woo, Steven; Lai, Lawrence; Yeung, Chi-Ming, Methods and systems for mapping a peripheral function onto a legacy memory interface.
Perego, Richard E.; Batra, Pradeep; Woo, Steven; Lai, Lawrence; Yeung, Chi-Ming, Methods and systems for mapping a peripheral function onto a legacy memory interface.
Surti, Prasoonkumar; Piazza, Thomas A., Ordering threads as groups in a multi-threaded, multi-core graphics compute system.
Takano, Fumiyo, Parallel processing device, parallel processing method, optimization device, optimization method and computer program.
Wu, Haihua; Gould, Julia A.; Tang, Li-An, Power efficient hybrid scoreboard method.
Budge, Brian C., Processing of loops with internal data dependencies using a parallel processor.
Denisenko, Dmitry; Czajkowski, Tomasz, Repartitioning and reordering of multiple threads into subsets based on possible access conflict, for sequential access to groups of memory banks in a shared memory.
Pirvu, Marius, Running time of short running applications by effectively interleaving compilation with computation in a just-in-time environment.
Durant, Luke, Techniques for managing the execution order of multiple nested tasks executing on a parallel processor.
Coon, Brett W.; Nickolls, John R.; Lindholm, John Erik; Stoll, Robert J.; Wang, Nicholas; Choquette, Jack Hilaire, Thread group scheduler for computing on a parallel thread processor.
