Multistream processing memory-and barrier-synchronization method and apparatus
Country/Type: United States (US) patent, granted
International Patent Classification (IPC, 7th edition): G06F-012/00; G06F-009/52; G06F-009/46
Application Number: US-0643741 (filed 2003-08-18)
Patent Number: US-7437521 (granted 2008-10-14)
Inventors: Scott, Steven L.; Faanes, Gregory J.; Stephenson, Brick; Moore, Jr., William T.; Kohn, James R.
Applicant: Cray Inc.
Agent: Schwegman, Lundberg & Woessner, P.A.
Citation Information: cited by 38 patents; cites 84 patents
Abstract
A method and apparatus to provide specifiable ordering between and among vector and scalar operations within a single streaming processor (SSP) via a local synchronization (Lsync) instruction that operates within a relaxed memory consistency model. Various aspects of that relaxed memory consistency model are described. Further, a combined memory synchronization and barrier synchronization (Msync) for a multistreaming processor (MSP) system is described. Also described is a global synchronization (Gsync) instruction that provides synchronization even beyond a single MSP system. Advantageously, the pipeline or queue of pending memory requests does not need to be drained before the synchronization operation, nor is the processor required to refrain from determining addresses for, and inserting into the pipeline, subsequent memory accesses.
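The key idea in the abstract — synchronizing without draining the pipeline — can be sketched in a few lines of Python. This is an illustrative software model only, not the patented hardware; the names `SyncMarker`, `issue_lsync`, and the queue contents are invented for the sketch.

```python
from collections import deque

class SyncMarker:
    """Token dropped into a pending-request queue in place of draining it."""
    def __init__(self, sync_id):
        self.sync_id = sync_id

def issue_lsync(queues, sync_id):
    # Instead of waiting for every pending request to complete, place a
    # matching marker in each queue: in-flight requests stay queued ahead
    # of the marker, and new requests may be appended behind it at once.
    for q in queues:
        q.append(SyncMarker(sync_id))

vector_q = deque(["vload A", "vstore B"])   # pending vector references
scalar_q = deque(["load X"])                # pending scalar references
issue_lsync([vector_q, scalar_q], sync_id=1)
vector_q.append("vload C")  # a later request enters without any stall
```

The marker pair establishes the ordering point; requests ahead of a marker commit before requests behind the corresponding marker in the other queue, which is what lets issue continue uninterrupted.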
Representative Claims
What is claimed is:

1. An apparatus comprising: a memory interface; a plurality of queues connected to the memory interface, including a first queue and a second queue, wherein each of the plurality of queues holds pending memory requests and enforces an ordering in the commitment of the pending memory requests to memory; one or more instruction-processing circuits, wherein each instruction-processing circuit is operatively coupled through the plurality of queues to the memory interface and wherein each of the plurality of instruction-processing circuits inserts one or more memory requests into at least one of the queues based on a first memory operation instruction, inserts a first synchronization marker into the first queue and inserts a second synchronization marker into the second queue based on a synchronization operation instruction, and inserts one or more memory requests into at least one of the queues based on a second memory operation instruction; and a first synchronization circuit, operatively coupled to the plurality of queues, that selectively halts processing of further memory requests from the first queue based on the first synchronization marker reaching a predetermined point in the first queue until the corresponding second synchronization marker reaches a predetermined point in the second queue; wherein each of the memory requests is a memory reference, wherein the memory reference is generated as a result of execution of instructions by the instruction-processing circuits, wherein the first queue is used for only synchronization markers and vector memory references, and the second queue is used for only synchronization markers and scalar memory references, wherein the synchronization operation instruction is an Lsync V,S-type instruction, wherein the instruction-processing circuits include a data cache and wherein the Lsync V,S-type instruction prevents subsequent scalar references from accessing the data cache until all vector references have been sent to an external cache and all vector writes have caused any necessary invalidations of the data cache.

2. The apparatus of claim 1, wherein, for a second synchronization operation instruction, a corresponding synchronization marker is inserted into only the first queue.

3. The apparatus of claim 2, wherein the second synchronization operation instruction is an Lsync-type instruction.

4. The apparatus of claim 1, wherein the first queue includes two subqueues, including a first subqueue and a second subqueue, wherein the first subqueue is for holding the vector memory references and synchronization markers associated with the vector memory references and wherein the second subqueue is for holding a plurality of store data elements and synchronization markers associated with the store data elements, wherein each store data element in the second subqueue corresponds to one of the memory requests in the first subqueue, and wherein the store data elements are loaded into the second subqueue decoupled from the loading of the memory requests into the first subqueue.

5.
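Claim 4's decoupled subqueues — store addresses queued independently of their store data — might be modeled as below. This is a hypothetical Python sketch for illustration; `VectorStoreUnit`, `queue_address`, `queue_data`, and `commit_store` are invented names, not terms from the patent.

```python
from collections import deque

class VectorStoreUnit:
    """Decoupled address/data subqueues: an address can be queued before
    its store data has been generated, and vice versa (claim 4)."""
    def __init__(self):
        self.addr_subq = deque()   # memory requests (store addresses)
        self.data_subq = deque()   # store data elements

    def queue_address(self, addr):
        self.addr_subq.append(addr)

    def queue_data(self, value):
        self.data_subq.append(value)

    def commit_store(self, memory):
        # A store commits only once both its address and its data have
        # arrived; the two subqueues advance in lockstep here even though
        # they were filled at different times.
        if self.addr_subq and self.data_subq:
            memory[self.addr_subq.popleft()] = self.data_subq.popleft()
            return True
        return False

mem = {}
unit = VectorStoreUnit()
unit.queue_address(0x100)            # addresses known early
unit.queue_address(0x104)
stalled = unit.commit_store(mem)     # False: no data available yet
unit.queue_data(7)                   # data arrives later, decoupled
unit.commit_store(mem)               # now the first store commits
```

The decoupling lets address generation run ahead of data production (or vice versa), which is the point of loading the two subqueues independently.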
A method comprising: providing a memory interface; providing a plurality of queues connected to the memory interface, including a first queue and a second queue, wherein each of the plurality of queues holds pending memory requests and enforces an ordering in the commitment of the pending memory requests to memory; providing one or more instruction-processing circuits, wherein each instruction-processing circuit is operatively coupled through the plurality of queues to the memory interface; inserting one or more memory requests into at least one of the queues based on a first memory operation instruction executed in one of the instruction-processing circuits; inserting a first synchronization marker into the first queue and inserting a second synchronization marker into the second queue based on a synchronization operation instruction executed in one of the instruction-processing circuits; inserting one or more memory requests into at least one of the queues based on a second memory operation instruction executed in one of the instruction-processing circuits; processing memory requests from the first queue; and selectively halting further processing of memory requests from the first queue based on the first synchronization marker reaching a predetermined point in the first queue until the corresponding second synchronization marker reaches a predetermined point in the second queue; wherein each of the memory requests is a memory reference, wherein the first queue stores only synchronization markers and vector memory references, and wherein the second queue stores only synchronization markers and scalar memory references, wherein the synchronization operation instruction is an Lsync V,S-type instruction, wherein providing the instruction-processing circuits includes providing a data cache and wherein performing the Lsync V,S-type instruction includes preventing subsequent scalar references from accessing the data cache until all vector references have been sent to an external cache and all vector writes have caused any necessary invalidations of the data cache.

6. The method of claim 5, wherein the first queue includes two subqueues, including a first subqueue and a second subqueue, wherein the first subqueue stores the vector memory references and synchronization markers associated with the vector memory references and wherein the second subqueue stores a plurality of store data elements and synchronization markers associated with the store data elements, wherein each store data element in the second subqueue corresponds to one of the memory requests in the first subqueue, and wherein the store data elements are inserted into the second subqueue decoupled from the inserting of the memory requests into the first subqueue.

7. An apparatus comprising: a memory interface; a plurality of queues connected to the memory interface, including a first queue and a second queue, wherein each of the plurality of queues holds pending memory requests and enforces an ordering in the commitment of the pending memory requests to memory; one or more instruction-processing circuits, wherein each instruction-processing circuit is operatively coupled through the plurality of queues to the memory interface and wherein each of the plurality of instruction-processing circuits includes: means for inserting one or more memory requests into at least one of the queues based on a first memory operation instruction executed in one of the instruction-processing circuits; means for inserting a first synchronization marker into the first queue and inserting a second synchronization marker into the second queue based on a synchronization operation instruction executed in one of the instruction-processing circuits; means for inserting one or more memory requests into at least one of the queues based on a second memory operation instruction executed in one of the instruction-processing circuits; means for processing memory requests from the first queue; and means for selectively halting further processing of memory requests from the first queue based on the first synchronization marker reaching a predetermined point in the first queue until the corresponding second synchronization marker reaches a predetermined point in the second queue; wherein each of the memory requests is a memory reference, wherein the means for inserting into the first queue operates for only vector memory requests and synchronization markers, and the means for inserting into the second queue operates for only scalar memory requests and synchronization markers, wherein the synchronization operation instruction is an Lsync V,S-type instruction, wherein the instruction-processing circuits include a data cache and wherein the Lsync V,S-type instruction prevents subsequent scalar references from accessing the data cache until all vector references have been sent to an external cache and all vector writes have caused any necessary invalidations of the data cache.

8.
A system comprising: a plurality of processors, including a first processor and a second processor, wherein each of the processors includes: a memory interface; a plurality of Lsync queues connected to the memory interface, including a first Lsync queue and a second Lsync queue, wherein each of the plurality of Lsync queues holds pending memory requests and enforces an ordering in the commitment of the pending memory requests to memory; one or more instruction-processing circuits, wherein each instruction-processing circuit is operatively coupled through the plurality of Lsync queues to the memory interface and wherein each of the plurality of instruction-processing circuits inserts one or more memory requests into at least one of the Lsync queues based on a first memory operation instruction, inserts a first Lsync synchronization marker into the first Lsync queue and inserts a second Lsync synchronization marker into the second Lsync queue based on a synchronization operation instruction, and inserts one or more memory requests into at least one of the Lsync queues based on a second memory operation instruction; and an Lsync synchronization circuit, operatively coupled to the plurality of Lsync queues, that selectively halts processing of further memory requests from the first Lsync queue based on the first Lsync synchronization marker reaching a predetermined point in the first Lsync queue until the corresponding second Lsync synchronization marker reaches a predetermined point in the second Lsync queue; and one or more Msync circuits, wherein each of the Msync circuits is connected to the plurality of processors and wherein each of the Msync circuits includes: a plurality of Msync queues, including a first Msync queue and a second Msync queue, each of the plurality of Msync queues for holding a plurality of pending memory requests received from the Lsync queues, wherein the first Msync queue stores only Msync synchronization markers and memory requests from the first processor, and the second Msync queue stores only Msync synchronization markers and memory requests from the second processor; and an Msync synchronization circuit, operatively coupled to the plurality of Msync queues, that selectively halts further processing of the memory requests from the first Msync queue based on an Msync synchronization marker reaching a predetermined point in the first Msync queue until a corresponding Msync synchronization marker from the second processor reaches a predetermined point in the second Msync queue; wherein each of the memory requests is a memory reference, wherein the memory reference is generated as a result of execution of instructions by instruction-processing circuits in each processor, wherein each processor includes a data cache and wherein each Msync synchronization circuit includes an external cache, wherein the data cache and the external cache are used to perform an Lsync V,S-type instruction, wherein the Lsync V,S-type instruction prevents subsequent scalar references from accessing the data cache until all vector references have been sent to the external cache in a corresponding Msync synchronization circuit and all vector writes have caused any necessary invalidations of the data cache.

9. The system of claim 8, wherein the Msync synchronization circuit includes a plurality of stall lines, wherein each of the stall lines is connected to one of the plurality of Msync queues and wherein each of the stall lines is for halting further processing of the memory requests from a corresponding Msync queue.

10.
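The Msync behavior in claims 8 and 9 — each per-processor queue stalling at its marker until every other queue's marker arrives — amounts to a barrier enforced at the memory system. A minimal sketch, assuming a software model with one queue per processor (the function name and queue contents are illustrative, not from the patent):

```python
from collections import deque

MARKER = "MSYNC"  # stand-in for an Msync synchronization marker

def drain_until_barrier(msync_queues):
    """Pop requests from each queue; a queue stalls at its MSYNC marker
    until every queue's marker reaches the head, then all markers retire
    together (the barrier)."""
    committed = []
    while any(msync_queues):
        progressed = False
        for q in msync_queues:
            if q and q[0] != MARKER:
                committed.append(q.popleft())
                progressed = True
        if not progressed:
            # Every non-empty queue is stalled at a marker: if all queues
            # have reached theirs, the barrier completes and they retire.
            if all(q and q[0] == MARKER for q in msync_queues):
                for q in msync_queues:
                    q.popleft()
            else:
                break  # a queue ran dry without a marker; stop draining
    return committed

q0 = deque(["p0-store", MARKER, "p0-load"])
q1 = deque(["p1-load", "p1-store", MARKER, "p1-load2"])
done = drain_until_barrier([q0, q1])
```

The invariant this models is that every request ahead of a marker commits before any request behind the corresponding marker in another processor's queue, without either processor draining its pipeline first.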
A method comprising: providing a plurality of processors, including a first processor and a second processor, wherein each of the processors includes a memory interface, a plurality of Lsync queues connected to the memory interface, including a first Lsync queue and a second Lsync queue, wherein each of the plurality of Lsync queues holds pending memory requests and enforces an ordering in the commitment of the pending memory requests to memory, and one or more instruction-processing circuits, each of the instruction-processing circuits operatively coupled through the plurality of Lsync queues to the memory interface; providing one or more Msync circuits, wherein each of the Msync circuits is connected to the plurality of processors and wherein each of the Msync circuits includes a plurality of Msync queues, including a first Msync queue and a second Msync queue, each of the plurality of Msync queues operatively coupled to the plurality of Lsync queues in one of the plurality of processors; inserting one or more memory requests into at least one of the Lsync queues based on a first memory operation instruction executed in one of the instruction-processing circuits; inserting a first Lsync synchronization marker into the first Lsync queue and inserting a second Lsync synchronization marker into the second Lsync queue based on a synchronization operation instruction executed in one of the instruction-processing circuits; inserting one or more memory requests into at least one of the Lsync queues based on a second memory operation instruction executed in one of the instruction-processing circuits; processing memory requests from the first Lsync queue; selectively halting further processing of memory requests from the first Lsync queue based on the first Lsync synchronization marker reaching a predetermined point in the first Lsync queue until the corresponding second Lsync synchronization marker reaches a predetermined point in the second Lsync queue; inserting Msync synchronization markers and memory requests received from the Lsync queues in the first processor into the first Msync queue; inserting Msync synchronization markers and memory requests received from the Lsync queues in the second processor into the second Msync queue; and selectively halting further processing of the memory requests from the first Msync queue based on an Msync synchronization marker reaching a predetermined point in the first Msync queue until a corresponding Msync synchronization marker from the second processor reaches a predetermined point in the second Msync queue; wherein each of the memory requests is a memory reference, wherein selectively halting further processing of memory requests from the first Lsync queue includes performing an Lsync V,S-type instruction, wherein performing the Lsync V,S-type instruction includes preventing subsequent scalar references from accessing a data cache in the processor until all vector references have been sent to an external cache in a corresponding Msync synchronization circuit and all vector writes have caused any necessary invalidations of the data cache.

11. The method of claim 10, wherein selectively halting further processing of the memory requests from the first Msync queue includes sending a stall signal to the Msync queues.
Patents cited by this patent (84)
Nugent Steven F. (Portland OR), Adaptive message routing for multi-dimensional networks.
Bruckert William (Northboro MA) Bissett Thomas D. (Derry NH) Kovalcin David (Grafton MA) Nene Ravi (Chelmsford MA), Apparatus and method for documenting faults in computing modules.
Oberlin Steven M. (Chippewa Falls WI) Fromm Eric C. (Eau Claire WI), Barrier synchronization for distributed memory massively parallel processing systems.
Hall Barbara A. (Endwell NY) Huang Kevin C. (Endicott NY) Jabusch John D. (Endwell NY) Ngai Agnes Y. (Endwell NY), Central processing unit checkpoint retry for store-in and store-through cache systems.
Chen Steve S. (Chippewa Falls) Simmons Frederick J. (Neillsville) Spix George A. (Eau Claire) Wilson Jimmie R. (Eau Claire) Miller Edward C. (Eau Claire) Eckert Roger E. (Eau Claire) Beard Douglas R., Cluster architecture for a highly parallel scalar/vector multiprocessor system.
Mendelsohn Noah R. (Arlington MA) Perchik James (Cambridge MA) Hancock Thomas R. (Somerville MA), Component replacement control for fault-tolerant data processing system.
Nagashima, Shigeo; Torii, Shunichi; Omoda, Koichiro; Inagami, Yasuhiro, Data processing system including scalar data processor and vector data processor.
Papadopoulos Gregory M. (Acton MA) Nikhil Rishiyur S. (Arlington MA) Greiner Robert J. (Chandler AZ) Arvind (Arlington MA), Data processing system with synchronization coprocessor for multiple threads.
Papadopoulos Gregory M. (Burlington MA) Nikhil Rishiyur S. (Arlington MA) Greiner Robert J. (Chandler AZ) Arvind (Arlington MA), Data processing system with synchronization coprocessor for multiple threads.
Ogura Takao (Kawasaki JPX) Amemiya Shigeo (Kawasaki JPX) Tezuka Koji (Kawasaki JPX) Chujo Takafumi (Kawasaki JPX), Distributed control of telecommunication network for setting up an alternative communication path.
Flaig Charles M. (Pasadena CA) Seitz Charles L. (San Luis Rey CA), Inter-computer message routing system with each computer having separate routinng automata for each dimension of the net.
Thomas Basil Smith, III ; Robert Brett Tremaine, Memory system for permitting simultaneous processor access to a cache line and sub-cache line sectors fill and writeback to a system memory.
Beard Douglas R. (Eleva WI) Phelps Andrew E. (Eau Claire WI) Woodmansee Michael A. (Eau Claire WI) Blewett Richard G. (Altoona WI) Lohman Jeffrey A. (Eau Claire WI) Silbey Alexander A. (Eau Claire WI, Method and apparatus for chaining vector instructions.
Peterson John C. (Alta Loma CA) Chow Edward (San Dimas CA) Madan Herb S. (Marina del Rey CA), Method and apparatus for eliminating unsuccessful tries in a search tree.
Dion Rodgers ; Darrell Boggs ; Amit Merchant ; Rajesh Kota ; Rachel Hsu ; Keshavan Tiruvallur, Method and apparatus for processing an event occurrence within a multithreaded processor.
Fossum Tryggve (Northboro MA) Hetherington Ricky C. (Northboro MA) Fite ; Jr. David B. (Northboro MA) Manley Dwight P. (Holliston MA) McKeen Francis X. (Westboro MA) Murray John E. (Acton MA), Method and apparatus using a cache and main memory for both vector processing and scalar processing by prefetching cache.
Shiojiri Hirohisa (Tokyo JPX) Koga Toshio (Tokyo JPX), Method of adaptively multiplexing a plurality of video channel data using channel data assignment information obtained f.
Neches Philip M. (Pasadena CA), Multi processor sorting network for sorting while transmitting concurrently presented messages by message content to del.
Barrett Linda (Raleigh NC) Long Lynn D. (Chapel Hill NC) Menditto Louis F. (Raleigh NC) Stagg Arthur J. (Raleigh NC) Ward Raymond E. (Durham NC), Multi-path channel (MPC) interface with user transparent, unbalanced, dynamically alterable computer input/output channe.
den Haan, Petrus A. M.; Hopmans, Franciscus P. M., Multi-processor computer system with distributed memory and an interprocessor communication mechanism, and method for operating such mechanism.
Baum Richard I. (Poughkeepsie NY) Brotman Charles H. (Poughkeepsie NY) Rymarczyk James W. (Poughkeepsie NY), Multiprocessing packet switching connection system having provision for error correction and recovery.
Frink Craig R. (Chelmsford MA) Bryg William R. (Saratoga CA) Chan Kenneth K. (San Jose CA) Hotchkiss Thomas R. (Groton MA) Odineal Robert D. (Roseville CA) Williams James B. (Lowell MA) Ziegler Micha, Multiprocessor system for maintaining cache coherency by checking the coherency in the order of the transactions being i.
Nesheim William A. ; Guzovskiy Aleksandr, Multiprocessor system having mapping table in each node to map global physical addresses to local physical addresses of.
Deneau, Thomas M., Multiprocessor system implementing virtual memory using a shared memory, and a page replacement method for maintaining paged memory coherence.
Childs Philip L. (Endicott NY) Olnowich Howard T. (Endicott NY) Skovira Joseph F. (Binghamton NY), SYNC-NET- a barrier synchronization apparatus for multi-stage networks.
Nickolls John R. (Los Altos CA) Zapisek John (Cupertino CA) Kim Won S. (Fremont CA) Kalb Jeffery C. (Saratoga CA) Blank W. Thomas (Palo Alto CA) Wegbreit Eliot (Palo Alto CA) Van Horn Kevin (Mountain, Scalable processor to processor and processor-to-I/O interconnection network and method for parallel processing arrays.
Beard Douglas R. (Eleva WI) Phelps Andrew E. (Eau Claire WI) Woodmansee Michael A. (Eau Claire WI) Blewett Richard G. (Altoona WI) Lohman Jeffrey A. (Eau Claire WI) Silbey Alexander A. (Eau Claire WI, Scalar/vector processor.
Nakazato, Satoshi, Shared memory type vector processing system, including a bus for transferring a vector processing instruction, and control method thereof.
Dutton Patrick Francis ; Gregor Steven Lee ; Li Hehching Harry, Storage subsystem including an error correcting cache and means for performing memory to memory transfers.
Horie Takeshi (Kawasaki JPX) Ikesaka Morio (Yokohama JPX) Ishihata Hiroaki (Tokyo JPX), System for controlling communication between parallel computers.
Richard L. Frank ; Gopalan Arun ; Michael J. Cusson ; Daniel E. O'Shaughnessy, System for efficiently maintaining translation lockaside buffer consistency in a multi-threaded, multi-processor virtual memory system.
Van Loo William C. (Palo Alto CA) Ebrahim Zahir (Mountain View CA) Nishtala Satyanarayana (Cupertino CA) Normoyle Kevin (San Jose CA) Loewenstein Paul (Palo Alto CA) Coffin ; III Louis F. (San Jose C, Writeback cancellation processing system for use in a packet switched cache coherent multiprocessor system.
Ohlgren, Harry Carl Håkan; Lindquist, Carl Tobias, Allocating audio processing among a plurality of processing units with a global synchronization pulse.
Scott, Steven L.; Faanes, Gregory J., Decoupling of write address from its associated write data in a store to a shared memory in a multiprocessor system.
Guthrie, Guy L.; Helterhoff, Harmony L.; Jeremiah, Thomas L.; Ng, Alvan W.; Starke, William J.; Stuecheli, Jeffrey A.; Williams, Philip G., Empirically based dynamic control of acceptance of victim cache lateral castouts.
Cargnoni, Robert A.; Guthrie, Guy L.; Helterhoff, Harmony L.; Starke, William J.; Stuecheli, Jeffrey A.; Williams, Phillip G., Empirically based dynamic control of transmission of victim cache lateral castouts.
Ould-Ahmed-Vall, Elmoustapha; Doshi, Kshitij A.; Sair, Suleyman; Yount, Charles R., Instruction and logic to provide stride-based vector load-op functionality with mask duplication.
Ould-Ahmed-Vall, Elmoustapha; Doshi, Kshitij A.; Sair, Suleyman; Yount, Charles R., Instruction and logic to provide vector loads with strides and masking functionality.
Guthrie, Guy L.; Le, Hien M.; Ng, Alvan W.; Siegel, Michael S.; Williams, Derek E.; Williams, Phillip G., Lateral castout (LCO) of victim cache line in data-invalid state.
Sprangle, Eric; Rohillah, Anwar; Cavin, Robert; Forsyth, Tom; Abrash, Michael, Processor and system using a mask register to track progress of gathering and prefetching elements from memory.
Faanes, Gregory J.; Lundberg, Eric P.; Scott, Steven L.; Baird, Robert J., System and method for processing memory instructions using a forced order queue.
Sprangle, Eric; Rohillah, Anwar; Cavin, Robert; Forsyth, Andrew T.; Abrash, Michael, System and method for using a mask register to track progress of gathering and scattering elements between data registers and memory.
Daly, Jr., George William; Guthrie, Guy Lynn; Leavens, Ross Boyd; McDonald, Joseph Gerald; Siegel, Michael Steven; Starke, William John; Williams, Derek Edward, Techniques for write-after-write ordering in a coherency managed processor system that employs a command pipeline.
Eichenberger, Alexandre E.; Gschwind, Michael K.; Salapura, Valentina, Vector loads with multiple vector elements from a same cache line in a scattered load operation.