Embodiments relate to vector processing in an active memory device. An aspect includes a system for vector processing in an active memory device. The system includes memory in the active memory device and a processing element in the active memory device. The processing element is configured to perform a method including decoding an instruction with a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in parallel is determined. Execution of the sub-instructions is repeated in parallel for multiple iterations, by the processing element, based on the iteration count. Multiple locations in the memory are accessed in parallel based on the execution of the sub-instructions.
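The repeat-by-iteration-count behavior summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class `VectorInstruction`, its fields, the fixed `stride`, and the function `execute` are all hypothetical names chosen for illustration. A serial loop stands in for hardware that would run the sub-instructions in parallel, and the sketch feeds each loaded value straight to the arithmetic operation, which is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VectorInstruction:
    """Hypothetical encoding of one instruction holding two parallel sub-instructions."""
    base_address: int             # address read by the memory access sub-instruction on iteration 0
    stride: int                   # address step between iterations (an assumption, not from the text)
    alu_op: Callable[[int], int]  # operation applied by the arithmetic-logical sub-instruction
    iteration_count: int          # number of times the sub-instructions are repeated


def execute(instr: VectorInstruction, memory: Dict[int, int]) -> List[int]:
    """Repeat the load and arithmetic sub-instructions once per iteration.

    Each iteration touches a different memory location, so the whole
    instruction accesses multiple locations. Real hardware would overlap
    these steps; this model runs them one after another.
    """
    results = []
    for i in range(instr.iteration_count):
        address = instr.base_address + i * instr.stride  # memory access sub-instruction
        loaded = memory[address]
        results.append(instr.alu_op(loaded))             # arithmetic-logical sub-instruction
    return results


memory = {0x100 + 8 * i: i for i in range(4)}
instr = VectorInstruction(base_address=0x100, stride=8,
                          alu_op=lambda x: x * 2 + 1, iteration_count=4)
print(execute(instr, memory))  # prints [1, 3, 5, 7]
```

A single decoded instruction therefore stands for `iteration_count` load and compute steps, which is the sense in which the instruction behaves as a vector operation.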
Representative claim
1. A system for vector processing in an active memory device, the system comprising:
a memory in the active memory device; and
a processing element in the active memory device, the processing element configured to perform a method comprising:
decoding, in the processing element, an instruction comprising a plurality of sub-instructions to execute in parallel;
determining an iteration count to repeat execution of the sub-instructions in parallel based on decoding an iteration count source field of the instruction that defines whether to set the iteration count based on an iteration count field of the instruction or based on an iteration count register;
repeating execution of the sub-instructions in parallel for multiple iterations, by the processing element, based on the iteration count;
accessing multiple locations in the memory in parallel based on the execution of the sub-instructions;
identifying a lane control sub-instruction in the instruction based on the decoding of the instruction, the lane control sub-instruction controlling a sequence of instruction execution and positioned in parallel with the sub-instructions to execute in parallel; and
executing the lane control sub-instruction, by the processing element, only once after execution of the sub-instructions is performed in parallel for multiple iterations.

2. The system of claim 1, wherein the sub-instructions comprise at least a pair of a memory access sub-instruction in parallel with an arithmetic-logical sub-instruction, and the processing element is further configured to perform:
flowing the memory access sub-instruction to a load-store unit in the processing element; and
flowing the arithmetic-logical sub-instruction to an arithmetic logic unit in the processing element to execute the memory access sub-instruction in parallel with the arithmetic-logical sub-instruction.

3. The system of claim 2, wherein the processing element is further configured to perform:
accessing one or more of: a vector computation register file and a scalar computation register file in the processing element for operands to execute the memory access sub-instruction in the load-store unit; and
accessing one or more of: the vector computation register file and the scalar computation register file in the processing element for operands to execute the arithmetic-logical sub-instruction in the arithmetic logic unit.

4. The system of claim 3, wherein the processing element is further configured to perform:
partitioning at least one of the operands as a plurality of sub-elements based on a data type of the arithmetic-logical sub-instruction;
performing, by the arithmetic logic unit, an operation of the arithmetic-logical sub-instruction in parallel execution slots on each of the sub-elements; and
computing, by the load-store unit, an address per sub-element.

5. The system of claim 3, wherein the processing element is further configured to perform:
flowing an output of the load-store unit to one or more of: the load-store unit, an effective-to-real address translation unit, a load-store queue, the vector computation register file, and the scalar computation register file; and
flowing an output of the arithmetic logic unit to one or more of: the arithmetic logic unit, the load-store unit, the vector computation register file, and the scalar computation register file.

6. The system of claim 3, wherein the processing element is partitioned into multiple processing slices operable in parallel, each processing slice comprising a pair of the load-store unit and the arithmetic logic unit, and an associated pair of the vector computation register file and the scalar computation register file, and the processing element is further configured to perform:
flowing an output of the arithmetic logic unit of one processing slice to an input of one or more of: the load-store unit and the arithmetic logic unit.

7. The system of claim 3, wherein the processing element is further configured to perform:
performing an error check on the operands prior to executing the memory access sub-instruction and the arithmetic-logical sub-instruction.

8. The system of claim 1, wherein the lane control sub-instruction is a branch sub-instruction executed by the processing element during execution of a last iteration of the instruction based on conditions evaluated during execution of a first element of the instruction.

9. A system for vector processing in an active memory device, the system comprising:
a memory in the active memory device; and
a processing element in the active memory device, the processing element configured to perform a method comprising:
receiving, in the processing element, a command from a requestor;
fetching, in the processing element, an instruction based on the command, the instruction being fetched from an instruction buffer in the processing element;
decoding, in the processing element, the instruction comprising a plurality of sub-instructions to execute in parallel;
determining an iteration count to repeat execution of the sub-instructions in parallel based on decoding an iteration count source field of the instruction that defines whether to set the iteration count based on an iteration count field of the instruction or based on an iteration count register;
repeating execution of the sub-instructions in parallel for multiple iterations, by the processing element, based on the iteration count;
accessing multiple locations in the memory in parallel based on the execution of the sub-instructions;
identifying a lane control sub-instruction in the instruction based on the decoding of the instruction, the lane control sub-instruction controlling a sequence of instruction execution and positioned in parallel with the sub-instructions to execute in parallel; and
executing the lane control sub-instruction, by the processing element, only once after execution of the sub-instructions is performed in parallel for multiple iterations.

10. The system of claim 9, wherein the processing element is further configured to perform:
fetching a special instruction from the instruction buffer to load a new instruction from the memory; and
replacing an entry in the instruction buffer with the new instruction based on executing the special instruction.

11. The system of claim 9, wherein the active memory device is a three-dimensional memory cube, the memory is divided into three-dimensional blocked regions as memory vaults, and accessing multiple locations in the memory is performed through one or more memory controllers in the active memory device.

12. The system of claim 9, wherein the sub-instructions comprise at least a pair of a memory access sub-instruction in parallel with an arithmetic-logical sub-instruction, and the processing element is further configured to perform:
flowing the memory access sub-instruction to a load-store unit in the processing element; and
flowing the arithmetic-logical sub-instruction to an arithmetic logic unit in the processing element to execute the memory access sub-instruction in parallel with the arithmetic-logical sub-instruction.

13. The system of claim 12, wherein the processing element is further configured to perform:
accessing one or more of: a vector computation register file and a scalar computation register file in the processing element for operands to execute the memory access sub-instruction in the load-store unit; and
accessing one or more of: the vector computation register file and the scalar computation register file in the processing element for operands to execute the arithmetic-logical sub-instruction in the arithmetic logic unit.

14. The system of claim 13, wherein the processing element is further configured to perform:
partitioning at least one of the operands as a plurality of sub-elements based on a data type of the arithmetic-logical sub-instruction;
performing, by the arithmetic logic unit, an operation of the arithmetic-logical sub-instruction in parallel execution slots on each of the sub-elements; and
computing, by the load-store unit, an address per sub-element.

15. The system of claim 13, wherein the processing element is further configured to perform:
performing an error check on the operands prior to executing the memory access sub-instruction and the arithmetic-logical sub-instruction;
based on detecting a correctable error, freezing instruction processing, fixing the correctable error, and resuming instruction processing; and
based on detecting an uncorrectable error, freezing instruction processing and notifying a main processor.

16. The system of claim 13, wherein the processing element is further configured to perform:
generating an address for the memory access sub-instruction;
translating the generated address to a real address of the memory;
checking for an address translation fault based on translating the generated address; and
based on identifying the address translation fault, freezing instruction processing, notifying the main processor, waiting for a response from the main processor, fixing a problem causing the address translation fault, and resuming instruction processing.

17. The system of claim 13, wherein the processing element is further configured to perform:
detecting an exception based on executing the arithmetic-logical sub-instruction; and
based on detecting the exception, freezing instruction processing and notifying the main processor.

18. The system of claim 13, wherein the processing element is further configured to perform:
decrementing the iteration count based on executing the memory access sub-instruction and the arithmetic-logical sub-instruction;
based on decrementing the iteration count to zero, decoding the lane control sub-instruction from the instruction;
based on determining that the lane control sub-instruction is one of: a return sub-instruction and a pause sub-instruction, freezing instruction processing and notifying a main processor; and
based on determining that the lane control sub-instruction is one of: a branch sub-instruction and a no-operation sub-instruction, adjusting a current instruction address to identify a next instruction in the instruction buffer.
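
The control flow recited in claims 1 and 18 (selecting the iteration count source, decrementing the count, then handling the lane control sub-instruction once) can be sketched in Python as follows. This is only an illustration under stated assumptions: `LaneControl`, `DecodedInstruction`, `run_instruction`, the boolean `count_from_register` flag, and the "next address is current address plus one" rule for a no-operation are hypothetical choices, not details taken from the claims.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Tuple


class LaneControl(Enum):
    """Lane control sub-instruction kinds named in claim 18."""
    BRANCH = auto()
    NOP = auto()
    RETURN = auto()
    PAUSE = auto()


@dataclass
class DecodedInstruction:
    """Hypothetical decoded form; the field names are illustrative only."""
    count_from_register: bool  # models the iteration count source field
    count_field: int           # models the iteration count field of the instruction
    lane_control: LaneControl
    branch_target: int = 0     # assumed to be used only when lane_control is BRANCH


def run_instruction(instr: DecodedInstruction, count_register: int, address: int,
                    body: Callable[[], None]) -> Tuple[int, bool]:
    """Return (next_instruction_address, frozen) after one vector instruction."""
    # The source field selects between the count register and the count field.
    remaining = count_register if instr.count_from_register else instr.count_field
    while remaining > 0:
        body()          # stands in for the parallel sub-instructions
        remaining -= 1  # the iteration count is decremented after each pass
    # The lane control sub-instruction is handled once, after the final iteration.
    if instr.lane_control in (LaneControl.RETURN, LaneControl.PAUSE):
        return address, True  # freeze; a real device would also notify the main processor
    if instr.lane_control is LaneControl.BRANCH:
        return instr.branch_target, False
    return address + 1, False  # no-operation: fall through to the next buffer entry
```

For example, an instruction with `count_from_register=False`, `count_field=3`, and `LaneControl.NOP` at address 10 would call `body` three times and return `(11, False)`, regardless of the value passed as `count_register`.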