Performing a local reduction operation on a parallel computer
IPC Classification Information
Country / Type
United States (US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06F-015/76
G06F-015/16
G06F-009/02
G06F-012/00
Application Number
US-0585993 (2012-08-15)
Registration Number
US-8458244 (2013-06-04)
Inventors / Address
Blocksome, Michael A.
Faraj, Daniel A.
Applicant / Address
International Business Machines Corporation
Agent / Address
Biggers & Ohanian, LLP
Citation Information
Times cited: 0
Cited patents: 55
Abstract
A parallel computer including compute nodes, each including two reduction processing cores, a network write processing core, and a network read processing core, each processing core assigned an input buffer. Copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing core's input buffer; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.
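The copy-and-reduce scheme in the abstract can be sketched in a few lines of Python. This is a minimal single-process simulation, not the patented implementation: summation as the reduction operator, the chunk size, and the even/odd split of interleaved chunks between the two reduction cores are all illustrative assumptions, and plain list copies stand in for shared-memory and DMA transfers.

```python
CHUNK = 4  # interleaving granularity (an assumed value)

def interleave(buf_a, buf_b, chunk=CHUNK):
    """Copy the two reduction cores' input buffers into one interleaved
    buffer in 'shared memory': chunks alternate A0, B0, A1, B1, ..."""
    out = []
    for i in range(0, len(buf_a), chunk):
        out.extend(buf_a[i:i + chunk])
        out.extend(buf_b[i:i + chunk])
    return out

def local_reduce(red0, red1, net_write, net_read, chunk=CHUNK):
    """Element-wise local reduction over the four input buffers.

    Core 0 handles even-numbered chunks and core 1 odd-numbered chunks
    (the 'every other interleaved chunk' split, an assumption), so each
    output element is produced by exactly one core via a four-way
    operation: four input elements, one output element (here, a sum).
    """
    inter = interleave(red0, red1, chunk)  # step 1: interleaved staging
    w_copy = list(net_write)               # step 2: one reduction core copies
    r_copy = list(net_read)                # step 2: the other core copies
    n = len(red0)
    result = [0] * n
    for core in (0, 1):  # the two reduction cores, simulated sequentially
        for base in range(core * chunk, n, 2 * chunk):
            for i in range(base, min(base + chunk, n)):
                c, p = i // chunk, i % chunk
                a = inter[2 * c * chunk + p]        # red0's chunk c
                b = inter[(2 * c + 1) * chunk + p]  # red1's chunk c
                result[i] = a + b + w_copy[i] + r_copy[i]
    return result
```

In hardware the two cores would run the chunk loops concurrently on disjoint halves of the buffers, which is what makes the split by alternating chunks contention-free.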
Representative Claims
1. A method of performing a local reduction operation on a parallel computer, the parallel computer comprising a plurality of compute nodes coupled for data communications with a high speed, low latency network, the compute nodes organized for collective operations, each compute node comprising a plurality of processing cores, each processing core assigned an input buffer, the processing cores including two reduction processing cores dedicated to executing reduction operations, a network processing core, the method comprising: copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by the reduction processing cores, contents of an input buffer of the network processing core; locally reducing in parallel by the reduction processing cores: the contents of the reduction processing cores' input buffers; every other interleaved chunk of the interleaved buffer; the copied contents of the network processing core's input buffer.

2. The method of claim 1 further comprising writing, by the network write processing core, results of the local reduction to the network for data processing by other compute nodes.

3. The method of claim 2 further comprising: receiving, by the network read processing core, results of the other compute node's data processing; storing the data processing results in a result buffer assigned to the network read processing core; and locally broadcasting the data processing results to each of the other processing cores including initiating, by the reduction processing cores, interleaved DMA transfers of the data processing results from the result buffer assigned to the network read processing core into result buffers assigned to each of the other processing cores.

4. The method of claim 1 wherein copying contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory, copying contents of the network write processing core's input buffer to shared memory, and copying contents of the network read processing core's input buffer to shared memory further comprise performing one or more direct memory access (‘DMA’) transfers.

5. The method of claim 1 wherein the local reduction includes performing, by each reduction processor, in parallel, one or more four-way mathematical operations with four input elements and one output element.

6. The method of claim 1 wherein the high speed, low latency network comprises data communications links coupling the compute nodes so as to organize the compute nodes as a tree.

7. An apparatus for performing a local reduction operation on a parallel computer, the parallel computer comprising a plurality of compute nodes coupled for data communications with a high speed, low latency network, the compute nodes organized for collective operations, each compute node comprising at least four processing cores, each processing core assigned an input buffer, the at least four processing cores including two reduction processing cores dedicated to executing reduction operations, a network write processing core dedicated to writing results of reduction operations to the network, and a network read processing core dedicated to receiving data from the network, the apparatus comprising a computer processor and a computer memory operatively coupled to the computer processor, the computer memory having disposed within it computer program instructions capable of: copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing cores' input buffers; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

8. The apparatus of claim 7 further comprising computer program instructions capable of writing, by the network write processing core, results of the local reduction to the network for data processing by other compute nodes.

9. The apparatus of claim 8 further comprising computer program instructions capable of: receiving, by the network read processing core, results of the other compute node's data processing; storing the data processing results in a result buffer assigned to the network read processing core; and locally broadcasting the data processing results to each of the other processing cores including initiating, by the reduction processing cores, interleaved DMA transfers of the data processing results from the result buffer assigned to the network read processing core into result buffers assigned to each of the other processing cores.

10. The apparatus of claim 7 wherein copying contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory, copying contents of the network write processing core's input buffer to shared memory, and copying contents of the network read processing core's input buffer to shared memory further comprise performing one or more direct memory access (‘DMA’) transfers.

11. The apparatus of claim 7 wherein the local reduction includes performing, by each reduction processor, in parallel, one or more four-way mathematical operations with four input elements and one output element.

12. The apparatus of claim 7 wherein the high speed, low latency network comprises data communications links coupling the compute nodes so as to organize the compute nodes as a tree.

13. A computer program product for performing a local reduction operation on a parallel computer, the parallel computer comprising a plurality of compute nodes coupled for data communications with a high speed, low latency network, the compute nodes organized for collective operations, each compute node comprising at least four processing cores, each processing core assigned an input buffer, the at least four processing cores including two reduction processing cores dedicated to executing reduction operations, a network write processing core dedicated to writing results of reduction operations to the network, and a network read processing core dedicated to receiving data from the network, the computer program product disposed in a computer readable storage medium, wherein the computer readable storage medium is not a signal, the computer program product comprising computer program instructions capable of: copying, in interleaved chunks by the reduction processing cores, contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory; copying, by one of the reduction processing cores, contents of the network write processing core's input buffer to shared memory; copying, by another of the reduction processing cores, contents of the network read processing core's input buffer to shared memory; and locally reducing in parallel by the reduction processing cores: the contents of the reduction processing cores' input buffers; every other interleaved chunk of the interleaved buffer; the copied contents of the network write processing core's input buffer; and the copied contents of the network read processing core's input buffer.

14. The computer program product of claim 13 further comprising computer program instructions capable of writing, by the network write processing core, results of the local reduction to the network for data processing by other compute nodes.

15. The computer program product of claim 14 further comprising computer program instructions capable of: receiving, by the network read processing core, results of the other compute node's data processing; storing the data processing results in a result buffer assigned to the network read processing core; and locally broadcasting the data processing results to each of the other processing cores including initiating, by the reduction processing cores, interleaved DMA transfers of the data processing results from the result buffer assigned to the network read processing core into result buffers assigned to each of the other processing cores.

16. The computer program product of claim 13 wherein copying contents of the reduction processing cores' input buffers to an interleaved buffer in shared memory, copying contents of the network write processing core's input buffer to shared memory, and copying contents of the network read processing core's input buffer to shared memory further comprise performing one or more direct memory access (‘DMA’) transfers.

17. The computer program product of claim 13 wherein the local reduction includes performing, by each reduction processor, in parallel, one or more four-way mathematical operations with four input elements and one output element.

18. The computer program product of claim 13 wherein the high speed, low latency network comprises data communications links coupling the compute nodes so as to organize the compute nodes as a tree.

19. The computer program product of claim 13 wherein the computer readable storage medium comprises a recordable medium.
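The local broadcast recited in claims 3, 9, and 15 can be sketched the same way. In the claimed design the reduction processing cores initiate interleaved DMA transfers out of the network read core's result buffer; in this sketch plain slice assignment stands in for DMA, and the even/odd division of chunks between the two reduction cores, the chunk size, and all names are illustrative assumptions.

```python
CHUNK = 4  # interleaving granularity (an assumed value)

def local_broadcast(read_result, other_results, chunk=CHUNK):
    """Broadcast the network read core's result buffer to every other core.

    The two reduction cores initiate the transfers: core 0 issues copies
    for even-numbered chunks and core 1 for odd-numbered chunks (an
    assumed division of labor), so the transfers are interleaved and the
    two initiators never touch the same chunk.
    """
    n = len(read_result)
    for core in (0, 1):  # the two reduction cores, simulated sequentially
        for base in range(core * chunk, n, 2 * chunk):
            for dest in other_results:
                # slice assignment stands in for one DMA transfer
                dest[base:base + chunk] = read_result[base:base + chunk]
    return other_results
```

After the call, every buffer in `other_results` holds a full copy of `read_result`, each half delivered by a different initiating core.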
Patents Cited by This Patent (55)
Scott, Steven L.; Pribnow, Richard D.; Logghe, Peter G.; Kunkel, Daniel L.; Schwoerer, Gerald A., Adaptive congestion control mechanism for modular computer networks.
Archer, Charles J.; Inglett, Todd A.; Ratterman, Joseph D.; Smith, Brian E., Configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks.
Kato, Sadayuki (JPX); Ishihata, Hiroaki (JPX); Horie, Takeshi (JPX); Inano, Satoshi (JPX); Shimizu, Toshiyuki (JPX), Data gathering/scattering system for a plurality of processors in a parallel computer.
Rhoades, John; Cameron, Ken; Winser, Paul; McConnell, Ray; Faulds, Gordon; McIntosh-Smith, Simon; Spencer, Anthony; Bond, Jeff; Dejaegher, Matthias; Halamish, Danny; Panesar, Gajinder, Data processing architectures for packet handling wherein batches of data packets of unpredictable size are distributed across processing elements arranged in a SIMD array operable to process different respective packet protocols at once while executing a single common instruction stream.
Michael Olivier, Dynamically matching users for group communications based on a threshold degree of matching of sender and recipient predetermined acceptance criteria.
Cypher, Robert E. (Los Gatos, CA); Sanz, Jorge L. C. (Los Gatos, CA), Hierarchical interconnection network architecture for parallel processing, having interconnections between bit-addressib.
Flaig, Charles M. (Pasadena, CA); Seitz, Charles L. (San Luis Rey, CA), Inter-computer message routing system with each computer having separate routing automata for each dimension of the net.
Blumrich, Matthias A.; Chen, Dong; Chiu, George L.; Cipolla, Thomas M.; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Kopcsay, Gerard V.; Mok, Lawrence S.; Takken, Todd E., Massively parallel supercomputer.
Carmichael, Richard D.; Ward, Joel M.; Winchell, Michael A., Method and apparatus for controlling (N+I) I/O channels with (N) data managers in a homogenous software programmable en.
Rangarajan, Vijay; Maniyar, Shyamsundar N.; Eatherton, William N., Method and apparatus for storing tree data structures among and within multiple memory channels.
Rangarajan, Vijay; Maniyar, Shyamsundar N.; Eatherton, William N., Method and apparatus for storing tree data structures among and within multiple memory channels.
Krishnamoorthy, Ashok V. (11188 Caminito Rodar, San Diego, CA 92126); Kiamilev, Fouad (c/o UNC Charlotte, Dept. of EE, Smith Hall Room 332, Charlotte, NC 28223), Packet-switched self-routing multistage interconnection network having contention-free fanout, low-loss routing, and fan.
Yasuda, Yoshiko (JPX); Tanaka, Teruo (JPX), Parallel computer system using properties of messages to route them through an interconnect network and to select virtua.
Wilkinson, Paul Amba; Dieffenderfer, James Warren; Kogge, Peter Michael; Schoonover, Nicholas Jerome, Partitioning of processing elements in a SIMD/MIMD array processor.
VanHuben, Gary Alan; Blake, Michael A.; Mak, Pak-kin, SMP clusters with remote resource managers for distributing work to other clusters while reducing bus traffic to a minimum.
Kil, David H.; Pottschmidt, David B., System and method for automatic generation of a hierarchical tree network and the use of two complementary learning algorithms, optimized for each leaf of the hierarchical tree network.