Shared address collectives using counter mechanisms
IPC classification information
Country / Type
United States (US) Patent
Granted
International Patent Classification (IPC, 7th edition)
G06F-015/16
G06F-015/167
Application number
US-0568115
(2009-09-28)
Registration number
US-8655962
(2014-02-18)
Inventors / Address
Blocksome, Michael
Dozsa, Gabor
Gooding, Thomas M.
Heidelberger, Philip
Kumar, Sameer
Mamidala, Amith R.
Miller, Douglas
Applicant / Address
International Business Machines Corporation
Agent / Address
Scully, Scott, Murphy & Presser, P.C.
Citation information
Times cited: 1
Patents cited: 7
Abstract
A shared address space on a compute node stores data received from a network and data to transmit to the network. The shared address space includes an application buffer that can be directly operated upon by a plurality of processes, for instance, running on different cores on the compute node. A shared counter is used for one or more of signaling arrival of the data across the plurality of processes running on the compute node, signaling completion of an operation performed by one or more of the plurality of processes, obtaining reservation slots by one or more of the plurality of processes, or combinations thereof.
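The arrival-signaling role of the shared counter can be illustrated with a minimal sketch. It is not the patented implementation: Python threads stand in for the processes on a compute node, a `bytearray` stands in for the application buffer in the shared address space, and the hypothetical `SharedCounter` class uses a condition variable to block where a process would otherwise poll. All names here are assumptions made for the example.

```python
import threading


class SharedCounter:
    """Hypothetical stand-in for a counter visible to all local processes."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def add(self, n):
        # Increment the counter and wake any waiters.
        with self._cond:
            self._value += n
            self._cond.notify_all()
            return self._value

    def wait_at_least(self, target):
        # Block until the counter reaches `target`; return the value seen.
        with self._cond:
            self._cond.wait_for(lambda: self._value >= target)
            return self._value


def run_arrival_demo(chunks, num_readers=3):
    total = sum(len(c) for c in chunks)
    buffer = bytearray(total)   # plays the role of the shared application buffer
    arrived = SharedCounter()   # byte count of data written so far
    copies = [bytearray(total) for _ in range(num_readers)]

    def receiver():
        # Writes each incoming chunk, then signals arrival via the byte count.
        offset = 0
        for chunk in chunks:
            buffer[offset:offset + len(chunk)] = chunk
            offset += len(chunk)
            arrived.add(len(chunk))

    def reader(idx):
        # Copies only the bytes the counter says have arrived.
        seen = 0
        while seen < total:
            available = arrived.wait_at_least(seen + 1)
            copies[idx][seen:available] = buffer[seen:available]
            seen = available

    threads = [threading.Thread(target=reader, args=(i,)) for i in range(num_readers)]
    threads.append(threading.Thread(target=receiver))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(buffer), [bytes(c) for c in copies]
```

Because the receiver increments the byte count only after a chunk is in the buffer, a reader that sees the count advance can copy the newly available range without any further handshake.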
Representative claims
1. A device for communication in message passing interface applications with multiple processes running on a compute node connected to a global network of compute nodes, comprising: a shared address space on a compute node, operable to store data coming in from a network and data to be written out to the network, wherein the shared address space includes an application buffer that can be directly operated upon by a plurality of processes; and a shared counter operable to be used for performing different roles during a message collective operation of a message passing interface, the different roles comprising signaling arrival of the data across the plurality of processes running on the compute node, signaling completion of an operation performed by one or more of the plurality of processes, obtaining reservation slots by one or more of the plurality of processes, wherein the shared counter is shared by and visible to the plurality of processes, wherein the shared counter functions as completion counter during a broadcast operation wherein a process of the plurality of processes designated as a master process receives broadcast data from the global network into a buffer associated with the master process and notifies other of the plurality of processes of the received data, wherein each of the other of the plurality of processes increments the completion counter after copying the received data to a memory region associated with said each of the other of the plurality of processes, and in response to the completion counter reaching a total number of the other of the plurality of processes, the master process reuses the buffer.
2. The device of claim 1, wherein the shared counter is operable for signaling arrival of the data across the plurality of processes running on the compute node and signaling completion of an operation.
3. The device of claim 2, wherein a process running on the compute node in response to receiving data from the network writes to the shared address space and increments a byte count in the shared counter, and wherein one or more other processes running on the compute node poll the shared counter and in response to detecting an increment in the byte count copy the data.
4. The device of claim 3, wherein the one or more other processes running on the compute node signal completion of copy operation in the shared counter.
5. The device of claim 1, wherein the shared counter includes a FIFO data structure, wherein one or more of the plurality of processes obtain a reservation slot in the FIFO.
6. The device of claim 5, wherein the shared counter further includes an atomic counter for indicating whether one or more of the plurality of processes read the FIFO, wherein broadcast operations may be performed between one of the plurality of processes and rest of the plurality of processes using the FIFO and the atomic counter.
7. The device of claim 1, wherein each of the plurality of processes runs on a separate core on the compute node.
8. The device of claim 1, wherein the shared address space includes send and receive buffers used for allreduce MPI operation.
9. The device of claim 1, wherein injection and reception of the data to and from a global network is handled exclusively by two separate dedicated cores, and local broadcast is handled by using another two cores to copy the data from an application buffer of a core dedicated to receiving the data from the global network, the application buffer being in the shared address space, and wherein the shared counter is used to synchronize the data.
10. The device of claim 1, wherein the device includes one counter for each connection to a global network.
11. The device of claim 1, wherein the shared address space and the shared counter are used to pipeline across different stages of operations performed by the plurality of processes.
12. A collective communication method for message passing interface applications with multiple processes running on a compute node, comprising: receiving data from a global network for performing a collective operation; writing the data directly into an application buffer of a core on a compute node receiving the data, the application buffer being in shared address space; signaling using a shared counter of the received data; and in response to the signaling, copying the data directly from the application buffer to a plurality of cores on the compute node different from the core receiving the data, wherein the shared counter is shared by and visible to the plurality of processes, wherein the shared counter functions as completion counter during a broadcast operation wherein a process of the plurality of processes designated as a master process receives broadcast data from the global network and notifies other of the plurality of processes of the received data, wherein each of the other of the plurality of processes increments the completion counter after copying the received data to a memory region associated with said each of the other of the plurality of processes, and in response to the completion counter reaching a total number of the other of the plurality of processes, the master process reuses the application buffer.
13. The method of claim 12, wherein the shared counter includes at least information associated with base address of the application buffer and total bytes written into the application buffer.
14. The method of claim 13, wherein the signaling using the shared counter includes incrementing the shared counter.
15. The method of claim 14, wherein an element in the FIFO data structure is reserved for one of the plurality of processes running on the compute node using an atomic counter.
16. The method of claim 13, wherein the shared counter further includes an atomic completion counter.
17. The method of claim 12, further including: dedicating each core on the compute node to a different task.
18. The method of claim 17, wherein injection and reception of the data to and from a global network is handled exclusively by two separate dedicated cores, and local broadcast is handled by using another two cores to copy the data from an application buffer of a core dedicated to receiving the data from the global network, the application buffer being in the shared address space, and wherein the shared counter is used to synchronize the data.
19. The method of claim 17, wherein an allreduce MPI operation is performed by the compute node by having one core in the compute node perform local sum and network reduction and rest of the cores on the compute node perform local data reduction, wherein the shared address space is used for data access and the shared counter is used for synchronization among the one core and the rest of the cores.
20. The method of claim 12, wherein the shared counter includes a FIFO data structure.
21. A computer readable storage medium storing a program of instructions executable by a machine to perform a collective communication method for message passing interface applications with multiple processes running on a compute node, comprising: receiving data from a global network for performing a collective operation; writing the data directly into an application buffer of a core on a compute node receiving the data, the application buffer being in shared address space; signaling using a shared counter of the received data; and in response to the signaling, copying the data directly from the application buffer to a plurality of cores on the compute node different from the core receiving the data, wherein the shared counter is shared by and visible to the plurality of processes, wherein the shared counter functions as completion counter during a broadcast operation wherein a process of the plurality of processes designated as a master process receives broadcast data from the global network and notifies other of the plurality of processes of the received data, wherein each of the other of the plurality of processes increments the completion counter after copying the received data to a memory region associated with said each of the other of the plurality of processes, and in response to the completion counter reaching a total number of the other of the plurality of processes, the master process reuses the application buffer.
22. The computer readable storage medium of claim 21, wherein the shared counter includes at least information associated with base address of the application buffer and total bytes written into the application buffer.
23. The computer readable storage medium of claim 22, wherein the signaling using the shared counter includes incrementing the shared counter.
24. The computer readable storage medium of claim 21, wherein the shared counter includes a FIFO data structure.
25. The computer readable storage medium of claim 21, wherein injection and reception of the data to and from a global network is handled exclusively by two separate dedicated cores, and local broadcast is handled by using another two cores to copy the data from an application buffer of a core dedicated to receiving the data from the global network, the application buffer being in the shared address space, and wherein the shared counter is used to synchronize the data.
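The completion-counter role recited in the independent claims (the master reuses its buffer only after every other process has copied the data) can be sketched as follows. This is an illustrative simulation under stated assumptions, not the claimed device: threads stand in for processes, a list of byte strings stands in for successive broadcasts arriving from the global network, and `run_broadcast_rounds`, `state`, and the other names are hypothetical.

```python
import threading


def run_broadcast_rounds(messages, num_peers=3):
    """Master reuses one buffer across rounds, gated by a completion counter."""
    lock = threading.Condition()
    # Shared state: last round announced, and how many peers finished copying it.
    state = {"round": 0, "completed": 0}
    buffer = bytearray(max(len(m) for m in messages))
    lengths = [len(m) for m in messages]
    received = [[] for _ in range(num_peers)]

    def master():
        for i, msg in enumerate(messages):
            with lock:
                # Reuse the buffer only once every peer has copied the previous data.
                lock.wait_for(lambda: i == 0 or state["completed"] == num_peers)
                state["completed"] = 0
                buffer[:len(msg)] = msg      # stands in for data from the network
                state["round"] = i + 1       # notify peers that data is available
                lock.notify_all()
        with lock:
            # Wait for the final round to be copied before finishing.
            lock.wait_for(lambda: state["completed"] == num_peers)

    def peer(idx):
        for i in range(len(messages)):
            with lock:
                lock.wait_for(lambda: state["round"] > i)
            # Copy outside the lock: the master cannot overwrite the buffer
            # until this peer has incremented the completion counter.
            data = bytes(buffer[:lengths[i]])
            with lock:
                state["completed"] += 1
                lock.notify_all()
            received[idx].append(data)

    threads = [threading.Thread(target=peer, args=(i,)) for i in range(num_peers)]
    threads.append(threading.Thread(target=master))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

For example, `run_broadcast_rounds([b"first", b"second!"], num_peers=4)` returns four lists, each holding both messages in order, even though the master wrote them into the same buffer. The counter here is protected by a lock for simplicity; an atomic increment would serve the same purpose.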
Patents cited by this patent (7)
Mitchell A. Bauman ; Eugene A. Rodi ; Douglas E. Morrissey, Directory based cache coherency system supporting multiple instruction processor and input/output caches.