Runtime optimization of an application executing on a parallel computer
IPC classification information
Country/Type
United States (US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06F-009/46
G06F-015/16
G06F-011/34
G06F-009/52
G06F-009/54
Application number (filing date)
US-0663545
(2012-10-30)
Patent number (issue date)
US-8898678
(2014-11-25)
Inventors / Address
Faraj, Daniel A.
Smith, Brian E.
Applicant / Address
International Business Machines Corporation
Attorney or agent / Address
Biggers Kennedy Lenart Spraggins LLP
Citation information
Times cited: 0
Cited patents: 81
Abstract
Identifying a collective operation within an application executing on a parallel computer; identifying a call site of the collective operation; determining whether the collective operation is root-based; if the collective operation is not root-based: establishing a tuning session and executing the collective operation in the tuning session; if the collective operation is root-based, determining whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes identified the collective operation at the same call site, establishing a tuning session and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.
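The decision flow in the abstract can be sketched as follows, combined with the min/max call-site agreement test from claim 3. This is a minimal single-process Python illustration and not IBM's implementation. It rests on several assumptions. The communicator is simulated as a list holding each rank's call-site identifier. The helper names (`decide`, `same_call_site`, `allreduce_min`) and the integer call-site IDs are hypothetical. The tuning session itself is reduced to a returned label.

For the agreement test, each ID is packed as the pair (id, -id). A single elementwise MIN reduction then yields both the minimum and the negated maximum. This is one way to meet the "single other collective operation" wording of claim 1. All ranks used the same call site exactly when the minimum equals the maximum.

```python
from enum import Enum

# Root-based collectives as enumerated in claim 2; anything else is treated as non-root-based.
ROOT_BASED = {"broadcast", "scatter", "gather", "reduce"}


class Action(Enum):
    TUNE = "execute in a tuning session"
    NO_TUNE = "execute without a tuning session"


def allreduce_min(vectors):
    """Stand-in for one elementwise MIN allreduce over all ranks' buffers.

    In a real job every rank would receive this same result.
    """
    return [min(column) for column in zip(*vectors)]


def same_call_site(call_sites):
    """True when every rank reported the same (hypothetical integer) call-site ID."""
    # Packing (id, -id) lets one MIN reduction return min(id) and -max(id).
    lo, neg_hi = allreduce_min([[cs, -cs] for cs in call_sites])
    return lo == -neg_hi


def decide(op_name, call_sites):
    """Tune/no-tune decision for one collective; call_sites[i] is rank i's ID."""
    if op_name not in ROOT_BASED:
        # Not root-based: tune without checking call sites.
        return Action.TUNE
    if same_call_site(call_sites):
        # Root-based and all ranks agree on the call site.
        return Action.TUNE
    # Root-based but ranks disagree: run the collective untuned.
    return Action.NO_TUNE


if __name__ == "__main__":
    print(decide("allreduce", [7, 7, 9]).name)   # TUNE (no call-site check)
    print(decide("broadcast", [7, 7, 7]).name)   # TUNE
    print(decide("broadcast", [7, 7, 9]).name)   # NO_TUNE
```

In an MPI program, the stand-in would presumably be one `MPI_Allreduce` over a two-element integer buffer with `MPI_MIN`. Every rank then reaches the same tune/no-tune decision without a second collective.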
Representative claims
1. An apparatus for runtime optimization of an application executing on a parallel computer, the parallel computer having a plurality of compute nodes organized into a communicator, the apparatus comprising a computer processor and a computer memory operatively coupled to the computer processor, the computer memory having disposed within it computer program instructions that when executed by the computer processor cause the apparatus to carry out the steps of: determining, by each compute node, whether a collective operation is root-based; if the collective operation is not root-based, establishing a tuning session administered by a self tuning module for the collective operation in dependence upon an identifier of a call site of the collective operation and executing the collective operation in the tuning session; if the collective operation is root-based, determining, through use of a single other collective operation, whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes executing the application identified the collective operation at the same call site, establishing a tuning session administered by the self tuning module for the collective operation in dependence upon the identifier of the call site of the collective operation and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.
2. The apparatus of claim 1 wherein a root-based collective operation comprises one of: a broadcast operation, a scatter operation, a gather operation, or a reduce operation.
3. The apparatus of claim 1 wherein determining whether all compute nodes executing the application identified the collective operation at the same call site further comprising performing on all the compute nodes of the communicator an ‘allreduce’ collective operation to identify the minimum and maximum values of all of the identified call sites.
4. The apparatus of claim 1 further comprising computer program instructions that when executed by the computer processor cause the apparatus to carry out the steps of: selecting, for a particular collective operation of the application in dependence upon one or more tuning sessions for the particular collective operation, one or more algorithms to carry out the particular collective operation, the one or more algorithms representing an optimized set of algorithms to carry out the particular collective operation; recording the one or more selected algorithms; and during a subsequent execution of the application and without performing another tuning session, carrying out the particular collective operation of the application with the recorded selected algorithms.
5. The apparatus of claim 4 wherein recording the one or more selected algorithms from the tuning session further comprises recording, in association with the one or more selected algorithms, an identifier of the call site for the particular collective operation, a message size, and a communicator identifier.
6. The apparatus of claim 4 wherein: recording the one or more selected algorithms from the tuning session further comprises identifying any of the tuned collective operations that are non-critical collective operations; and carrying out the particular collective operation of the application with the recorded selected algorithms further comprises carrying out the non-critical collective operations with standard messaging module algorithms.
7. A computer program product for runtime optimization of an application executing on a parallel computer, the parallel computer having a plurality of compute nodes organized into a communicator, the computer program product disposed in a computer readable hardware storage medium, the computer program product comprising computer program instructions that when executed by a processor cause a computer to carry out the steps of: determining, by each compute node, whether a collective operation is root-based; if the collective operation is not root-based, establishing a tuning session administered by a self tuning module for the collective operation in dependence upon an identifier of a call site of the collective operation and executing the collective operation in the tuning session; if the collective operation is root-based, determining, through use of a single other collective operation, whether all compute nodes executing the application identified the collective operation at the same call site; if all compute nodes executing the application identified the collective operation at the same call site, establishing a tuning session administered by the self tuning module for the collective operation in dependence upon the identifier of the call site of the collective operation and executing the collective operation in the tuning session; and if all compute nodes executing the application did not identify the collective operation at the same call site, executing the collective operation without establishing a tuning session.
8. The computer program product of claim 7 wherein a root-based collective operation comprises one of: a broadcast operation, a scatter operation, a gather operation, or a reduce operation.
9. The computer program product of claim 7 wherein determining whether all compute nodes executing the application identified the collective operation at the same call site further comprising performing on all the compute nodes of the communicator an ‘allreduce’ collective operation to identify the minimum and maximum values of all of the identified call sites.
10. The computer program product of claim 7 further comprising computer program instructions that when executed by a processor cause a computer to carry out the steps of: selecting, for a particular collective operation of the application in dependence upon one or more tuning sessions for the particular collective operation, one or more algorithms to carry out the particular collective operation, the one or more algorithms representing an optimized set of algorithms to carry out the particular collective operation; recording the one or more selected algorithms; and during a subsequent execution of the application and without performing another tuning session, carrying out the particular collective operation of the application with the recorded selected algorithms.
11. The computer program product of claim 10 wherein recording the one or more selected algorithms from the tuning session further comprises recording, in association with the one or more selected algorithms, an identifier of the call site for the particular collective operation, a message size, and a communicator identifier.
12. The computer program product of claim 10 wherein: recording the one or more selected algorithms from the tuning session further comprises identifying any of the tuned collective operations that are non-critical collective operations; and carrying out the particular collective operation of the application with the recorded selected algorithms further comprises carrying out the non-critical collective operations with standard messaging module algorithms.
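Claims 4 through 6 (and 10 through 12) describe persisting the outcome of tuning so that later runs can skip the tuning session. The sketch below fills in several details that the claims leave open:

- The lowest measured time is the selection criterion.
- Only one algorithm is kept per entry, although the claims allow one or more.
- The table is serialized as JSON.
- The names `TuningKey`, `record`, `lookup`, `STANDARD_ALGORITHM` and the algorithm labels are hypothetical.

Entries are keyed by call-site identifier, message size and communicator identifier, as in claim 5. Entries flagged as non-critical fall back to the standard messaging-module algorithm, as in claim 6.

```python
import json
from dataclasses import dataclass

# Hypothetical label for the messaging module's default (untuned) algorithm.
STANDARD_ALGORITHM = "standard"


@dataclass(frozen=True)
class TuningKey:
    """The values claim 5 records alongside the selected algorithm."""
    call_site: int
    message_size: int
    communicator: int

    def as_str(self):
        # JSON object keys must be strings, so flatten the key.
        return f"{self.call_site}:{self.message_size}:{self.communicator}"


def record(results, non_critical):
    """Serialize the choices made during tuning sessions.

    results maps TuningKey -> {algorithm label: measured seconds};
    non_critical is a set of TuningKeys flagged as non-critical.
    """
    table = {}
    for key, timings in results.items():
        table[key.as_str()] = {
            # Assumed criterion: the fastest measured candidate wins.
            "algorithm": min(timings, key=timings.get),
            "non_critical": key in non_critical,
        }
    return json.dumps(table)


def lookup(saved, key):
    """Later run: pick an algorithm from the saved table, with no new tuning."""
    entry = json.loads(saved).get(key.as_str())
    if entry is None or entry["non_critical"]:
        # Untuned or non-critical collectives use the standard algorithm.
        return STANDARD_ALGORITHM
    return entry["algorithm"]


if __name__ == "__main__":
    big = TuningKey(call_site=7, message_size=4096, communicator=0)
    small = TuningKey(call_site=9, message_size=64, communicator=0)
    saved = record(
        {big: {"binomial_tree": 1.8e-4, "ring": 2.5e-4},
         small: {"binomial_tree": 3.0e-6, "ring": 5.0e-6}},
        non_critical={small},
    )
    print(lookup(saved, big))     # binomial_tree
    print(lookup(saved, small))   # standard
```

A key with no saved entry also falls back to the standard algorithm. An example would be a root-based collective whose call sites disagreed and which was therefore never tuned. That fallback is a design choice of this sketch; the claims do not specify it.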
Patents cited by this patent (81)
Wilford, Bruce; Dan, Yie-Fong, Architecture for high speed class of service enabled linecard.
Gorin, Allen L.; Lewine, Robert N.; Makofsky, Patrick A.; Shively, Richard R., Binary tree multiprocessor.
Wingard, Drew E.; Rosseel, Geert Paul; Tomlinson, Jay S.; Robinson, Lisa A., Communications system and method with multilevel connection identification.
Wingard, Drew Eric; Rosseel, Geert Paul; Tomlinson, Jay S.; Robinson, Lisa A., Communications system and method with multilevel connection identification.
Willis, John Christopher; Newshutz, Robert Neill, Compiler-oriented apparatus for parallel compilation, simulation and execution of computer programs and hardware models.
Blackard, Joe Wayne; Gillaspy, Richard Adams; Henthorn, William John; Petersen, Lynn Erich; Russell, Lance W.; Shippy, Gary Roy, Data processing system and method for pacing information transfers in a communications network.
Basso, Claude; Calvignac, Jean Louis; Heddes, Marco C.; Logan, Joseph Franklin; Verplanken, Fabrice Jean, Data structures for efficient processing of multicast transmissions.
Kloth, Axel K.; Andrews, Warner; Bergantino, Paul; Bicknell, Jeremy; Fu, Daniel; De Leon, Moshe; Mills, Stephen M., Dynamic bandwidth allocation for wide area networks.
Barzilai, Tsipora P.; Chen, Mon-Song; Kadaba, Bharath K.; Kaplan, Marc A., Flow control for high speed networks.
Blackmore, Robert S.; Chang, Fu Chung; Chaudhary, Piyush; Gildea, Kevin J.; Goscinski, Jason E.; Govindaraju, Rama K.; Grice, Donald G.; Helmer, Jr., Leonard W.; Heywood, Patricia E.; Hochschild, Peter H.; Houston, John S.; Kim, Chulho; Martin, Steven J., Half RDMA and half FIFO operations.
Burns, Randal Chilton; Goel, Atul; Long, Darrell D. E.; Rees, Robert Michael, Lease based safety protocol for distributed system with multiple networks.
Diedrich, Richard Alan; Kiel, Harvey Gene, Method and apparatus for multimedia data interchange with pacing capability in a distributed data processing system.
Shtayer, Ronen; Naveh, Alon; Joffe, Alexander, Method and apparatus for pacing asynchronous transfer mode (ATM) data cell transmission.
Crawley, Eric S.; Zhang, Zhaohui; Salkewicz, William M.; Sanchez, Cheryl A., Method and apparatus for providing quality of service routing in a network.
Levin, Vladimir K.; Karatanov, Vjacheslav V.; Jalin, Valerii V.; Titov, Alexandr; Agejev, Vjacheslav M.; Patrikeev, Andrei; Jablonsky, Sergei V.; Korneev, Victor V.; M, Method for deadlock-free message passing in MIMD systems using routers and buffers.
Arimilli, Lakshminarayana B.; Arimilli, Ravi K.; Rajamony, Ramakrishnan; Speight, William E., Performing collective operations using software setup and partial software execution at leaf nodes in a multi-tiered full-graph interconnect architecture.
Archer, Charles J.; Blocksome, Michael A.; Peters, Amanda E.; Ratterman, Joseph D.; Smith, Brian E., Reducing power consumption while performing collective operations on a plurality of compute nodes.
Daruwalla, Feisal; Forster, James R.; Roeck, Guenter E.; Woundy, Richard M.; Thomas, Michael A., Routing protocol based redundancy design for shared-access networks.
Ray, Amar N.; Bugenhagen, Michael K.; Morrill, Robert J.; Chakravarthy, Cadathur V., System and method for adjusting the window size of a TCP packet through network elements.
Blandy, Geoffrey Owen; Saba, Maher Afif, System and method for instruction burst performance profiling for single-processor and multi-processor systems.
Schumacher, Larry Lee; Gonzales-Tuchmann, Agustin; Yogman, Laurence Tobin; Dingman, Paul C., System for deadlock condition detection and correction by allowing a queue limit of a number of data tokens on the queue to increase.
Levy, Henry M.; Feeley, Michael J.; Karlin, Anna R.; Morgan, William E.; Thekkath, Chandramohan A., Using global memory information to manage memory in a computer network.
Advani, Deepak Mohan; Byron, Michael Justin; Hansell, Steven Robert; Li, Todd Ming Chun; Marino, John Paul; Panda, Rajendra Datta; Pierce, James Andrew; Wang, Ko-Yang; Weinel, Dennis George; Welch, Ro, Visualization tool for graphically displaying trace data.
Advani, Deepak Mohan; Byron, Michael Justin; Hansell, Steven Robert; Li, Todd Ming Chun; Marino, John Paul; Panda, Rajendra Datta; Pierce, James Andrew; Wang, Ko-Yang; Weinel, Dennis George; Welch, Ro, Visualization tool for graphically displaying trace data produced by a parallel processing computer.