Performing an allreduce operation using shared memory
IPC classification information
Country / Type
United States (US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06F-009/46
G06F-013/00
G06F-007/38
Application number
US-0754782
(2007-05-29)
Registration number
US-8161480
(2012-04-17)
Inventors / Address
Archer, Charles J.
Dozsa, Gabor
Ratterman, Joseph D.
Smith, Brian E.
Applicant / Address
International Business Machines Corporation
Agent / Address
Biggers & Ohanian, LLP
Citation information
Times cited: 21
Cited patents: 44
Abstract
Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
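The dispatch scheme in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: threads stand in for processing cores, an ordinary list stands in for shared memory, and the names `JobStatus`, `claim_next`, `run_core`, and `allreduce_sum` are hypothetical. The record does not say how work units are sized, so the fixed `chunk` parameter is also an assumption.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class JobStatus:
    """Hypothetical job status object: a list of work units plus a cursor."""
    work_units: List[Callable[[], None]]
    next_index: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)

    def claim_next(self) -> Optional[Callable[[], None]]:
        # Atomically hand out the next unclaimed work unit, or None when done.
        with self.lock:
            if self.next_index >= len(self.work_units):
                return None
            unit = self.work_units[self.next_index]
            self.next_index += 1
            return unit


def run_core(job: JobStatus) -> None:
    # An "available core" repeatedly determines and performs the next unit.
    while (unit := job.claim_next()) is not None:
        unit()


def allreduce_sum(arrays: List[List[int]], num_cores: int = 4,
                  chunk: int = 2) -> List[int]:
    n = len(arrays[0])
    result = [0] * n  # stands in for a shared-memory result buffer

    def make_unit(lo: int, hi: int) -> Callable[[], None]:
        def unit() -> None:
            # Units cover disjoint slices, so no two units write one element.
            for i in range(lo, hi):
                result[i] = sum(a[i] for a in arrays)
        return unit

    # The core that received the instruction establishes the job status object.
    job = JobStatus([make_unit(lo, min(lo + chunk, n))
                     for lo in range(0, n, chunk)])
    threads = [threading.Thread(target=run_core, args=(job,))
               for _ in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Because any idle thread takes whichever unit comes next, a slow core simply completes fewer units rather than stalling the others, which is the apparent motivation for dispatching through a shared job status object.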
Representative claims
1. A method for performing an allreduce operation using shared memory, the method comprising: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit; providing, by each available core to each other available core, a read-only window into a local memory buffer containing an array to be reduced by the allreduce operation; and updating, by each available core on the compute node, the job status object with a descriptor of the buffer into which the read-only window is provided.
2. The method of claim 1 wherein establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units further comprises assigning a plurality of threads for executing the shared memory allreduce work units.
3. The method of claim 1 further comprising: copying, by each available core on the compute node into separate shared memory buffers, an array to be reduced by the allreduce operation; and updating, by each available core on the compute node, the job status object with a descriptor of the shared memory buffer into which the available core copied the array.
4. The method of claim 1 wherein performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises performing a reduction operation on elements of arrays to be reduced.
5. The method of claim 1 wherein: the compute node is connected to a plurality of other compute nodes through a data communications network; and performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises transmitting, by that available core, local reduction results to one or more of the other compute nodes through the network.
6. The method of claim 1 wherein: the compute node is connected to a plurality of other compute nodes through a data communications network; performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises: receiving global reduction results through the network, and storing the global reduction results into shared memory; and the method further comprises copying, by each available core on the compute node, the global reduction results from the shared memory to local memory of the available core.
7. A compute node for performing an allreduce operation using shared memory, the compute node comprising a plurality of processing cores, computer memory operatively coupled to the plurality of processing cores, the computer memory having disposed within it computer program instructions for: receiving, by at least one of the plurality of processing cores on the compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit; providing, by each available core to each other available core, a read-only window into a local memory buffer containing an array to be reduced by the allreduce operation; and updating, by each available core on the compute node, the job status object with a descriptor of the buffer into which the read-only window is provided.
8. The compute node of claim 7 wherein the computer memory also has disposed within it computer program instructions capable of: copying, by each available core on the compute node into separate shared memory buffers, an array to be reduced by the allreduce operation; and updating, by each available core on the compute node, the job status object with a descriptor of the shared memory buffer into which the available core copied the array.
9. The compute node of claim 7 wherein performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises performing a reduction operation on elements of arrays to be reduced.
10. The compute node of claim 7 wherein: the compute node is connected to a plurality of other compute nodes through a data communications network; and performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises transmitting, by that available core, local reduction results to one or more of the other compute nodes through the network.
11. The compute node of claim 7 wherein: the compute node is connected to a plurality of other compute nodes through a data communications network; performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises: receiving global reduction results through the network, and storing the global reduction results into shared memory; and the computer memory also has disposed within it computer program instructions capable of copying, by each available core on the compute node, the global reduction results from the shared memory to local memory of the available core.
12. A computer program product for performing an allreduce operation using shared memory, the computer program product disposed upon a computer-readable, recordable storage medium, the computer program product comprising computer program instructions for: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit; providing, by each available core to each other available core, a read-only window into a local memory buffer containing an array to be reduced by the allreduce operation; and updating, by each available core on the compute node, the job status object with a descriptor of the buffer into which the read-only window is provided.
13. The computer program product of claim 12 wherein establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units further comprises assigning a plurality of threads for executing the shared memory allreduce work units.
14. The computer program product of claim 12 further comprising computer program instructions capable of: copying, by each available core on the compute node into separate shared memory buffers, an array to be reduced by the allreduce operation; and updating, by each available core on the compute node, the job status object with a descriptor of the shared memory buffer into which the available core copied the array.
15. The computer program product of claim 12 wherein performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises performing a reduction operation on elements of arrays to be reduced.
16. The computer program product of claim 12 wherein: the compute node is connected to a plurality of other compute nodes through a data communications network; and performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises transmitting, by that available core, local reduction results to one or more of the other compute nodes through the network.
17. The computer program product of claim 12 wherein: the compute node is connected to a plurality of other compute nodes through a data communications network; performing, by that available core on the compute node, that next shared memory allreduce work unit further comprises: receiving global reduction results through the network, and storing the global reduction results into shared memory; and the computer program product further comprises computer program instructions capable of copying, by each available core on the compute node, the global reduction results from the shared memory to local memory of the available core.
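The buffer handling recited in claims 3, 4, and 6 can be illustrated with a small single-node Python sketch. It rests on several assumptions that are not in the claims: threads stand in for cores, plain lists stand in for shared memory, a `threading.Barrier` provides the synchronization, and each core reduces a fixed stripe of elements rather than claiming work units from the job status object. With only one node, the local reduction result doubles as the "global" result, so the network steps of claims 5 and 6 are omitted. `JobStatus`, `register`, and `node_allreduce` are hypothetical names.

```python
import threading
from typing import Dict, List


class JobStatus:
    """Hypothetical job status object holding one buffer descriptor per core."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.descriptors: Dict[int, List[float]] = {}

    def register(self, core_id: int, shared_buffer: List[float]) -> None:
        # Record a descriptor of the buffer this core published (claim 3).
        with self._lock:
            self.descriptors[core_id] = shared_buffer


def node_allreduce(local_arrays: List[List[float]]) -> List[List[float]]:
    num_cores = len(local_arrays)
    length = len(local_arrays[0])
    job = JobStatus()
    shared_result = [0.0] * length  # stands in for shared memory
    barrier = threading.Barrier(num_cores)
    outputs: List[List[float]] = [[] for _ in range(num_cores)]

    def core(core_id: int) -> None:
        # Copy this core's array into its own shared buffer and publish it.
        job.register(core_id, list(local_arrays[core_id]))
        barrier.wait()  # every descriptor is registered past this point
        # Reduce a disjoint stripe of elements across all buffers (claim 4).
        for i in range(core_id, length, num_cores):
            shared_result[i] = sum(job.descriptors[c][i]
                                   for c in range(num_cores))
        barrier.wait()  # the complete result is in shared memory
        # Copy the result from shared memory to local memory (claim 6).
        outputs[core_id] = list(shared_result)

    threads = [threading.Thread(target=core, args=(c,))
               for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs
```

For example, `node_allreduce([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])` gives every core its own copy of `[5.0, 7.0, 9.0]`. The read-only window variant of claim 1 would skip the initial copy and register the local buffer itself, which this sketch does not model.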
Patents cited by this patent (44)
Scott Steven L. ; Pribnow Richard D. ; Logghe Peter G. ; Kunkel Daniel L. ; Schwoerer Gerald A., Adaptive congestion control mechanism for modular computer networks.
Kato Sadayuki,JPX ; Ishihata Hiroaki,JPX ; Horie Takeshi,JPX ; Inano Satoshi,JPX ; Shimizu Toshiyuki,JPX, Data gathering/scattering system for a plurality of processors in a parallel computer.
Connor, Patrick L.; McVay, Robert G., Direct memory access transfer reduction method and apparatus to overlay data on to scatter gather descriptors for bus-mastering I/O controllers.
Michael Olivier, Dynamically matching users for group communications based on a threshold degree of matching of sender and recipient predetermined acceptance criteria.
Archer, Charles J.; Ratterman, Joseph D., Executing scatter operation to parallel computer nodes by repeatedly broadcasting content of send buffer partition corresponding to each node upon bitwise OR operation.
Cypher Robert E. (Los Gatos CA) Sanz Jorge L. C. (Los Gatos CA), Hierarchical interconnection network architecture for parallel processing, having interconnections between bit-addressib.
Flaig Charles M. (Pasadena CA) Seitz Charles L. (San Luis Rey CA), Inter-computer message routing system with each computer having separate routinng automata for each dimension of the net.
Carmichael Richard D. ; Ward Joel M. ; Winchell Michael A., Method and apparatus for controlling (N+I) I/O channels with (N) data managers in a homogenous software programmable en.
Krishnamoorthy Ashok V. (11188 Caminito Rodar San Diego CA 92126) Kiamilev Fouad (c/o UNC Charlotte ; Dept. of EE ; Smith Hall Room 332 Charlotte NC 28223), Packet-switched self-routing multistage interconnection network having contention-free fanout, low-loss routing, and fan.
Yasuda Yoshiko,JPX ; Tanaka Teruo,JPX, Parallel computer system using properties of messages to route them through an interconnect network and to select virtua.
Wilkinson Paul Amba ; Dieffenderfer James Warren ; Kogge Peter Michael ; Schoonover Nicholas Jerome, Partitioning of processing elements in a SIMD/MIMD array processor.
Archer, Charles J.; K. A., Nysal Jan; Sharkawi, Sameh S., Executing an all-to-allv operation on a parallel computer that includes a plurality of compute nodes.
Archer, Charles J.; K.A., Nysal Jan; Sharkawi, Sameh S., Executing an all-to-allv operation on a parallel computer that includes a plurality of compute nodes.
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E., Improving efficiency of a global barrier operation in a parallel computer.
Chung, Won-young; Lee, Yong Surk; Park, Jong-su; Jeong, Ha-young, Method of performing collective communication according to status-based determination of a transmission order between processing nodes and collective communication system using the same.
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E., Performing a deterministic reduction operation in a parallel computer.
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E., Performing a deterministic reduction operation in a parallel computer.
Archer, Charles J.; Peters, Amanda E.; Smith, Brian E., Performing an all-to-all data exchange on a plurality of data buffers by performing swap operations.
Archer, Charles J.; Blocksome, Michael A.; Ratterman, Joseph D.; Smith, Brian E., Processing data communications events by awakening threads in parallel active messaging interface of a parallel computer.