Scheduler of program instructions for streaming vector processor having interconnected functional units
원문보기
IPC분류정보
국가/구분
United States(US) Patent
등록
국제특허분류(IPC7판)
G06F-009/50
G06F-009/46
G06F-009/44
출원번호
US-0184772
(2002-06-28)
발명자
/ 주소
May,Philip E.
Moat,Kent Donald
Essick, IV,Raymond B.
Chiricescu,Silviu
Lucas,Brian Geoffrey
Norris,James M.
Schuette,Michael Allen
Saidi,Ali
출원인 / 주소
Motorola, inc.
인용정보
피인용 횟수 :
9인용 특허 :
61
초록▼
A method for scheduling a computation for execution on a computer with a number of interconnected functional units. The computation is representable by a data-flow graph with a number of nodes connected by edge. A loop-period of the computation is calculated and the nodes are scheduled for throughpu
A method for scheduling a computation for execution on a computer with a number of interconnected functional units. The computation is representable by a data-flow graph with a number of nodes connected by edge. A loop-period of the computation is calculated and the nodes are scheduled for throughput by assigning an execution cycle and a functional unit to each node of the data-flow graph. The scheduling of flexible nodes is adjusted to minimize the number of interconnections required in each execution cycle. The edges of the data-flow graph are allocated to one or more of the interconnections between functional units. The scheduling method may be used, for example, to optimize the interconnection fabric design for an ASIC or as part of a compiler for a re-configurable streaming vector processor.
대표청구항▼
What is claimed is: 1. A method for scheduling a computation for execution on a computer comprising a plurality of functional units interconnected by a plurality of interconnections, the computation being representable by a data-flow graph having a plurality of nodes and a plurality of edges and th
What is claimed is: 1. A method for scheduling a computation for execution on a computer comprising a plurality of functional units interconnected by a plurality of interconnections, the computation being representable by a data-flow graph having a plurality of nodes and a plurality of edges and the method comprising: (a) computing a loop-period of the computation; (b) attempting to schedule the plurality of nodes within the loop period for throughput by assigning an execution cycle and a functional unit to each node of the plurality of nodes; (c) adjusting the scheduling of flexible nodes of the plurality of nodes to reduce the number of interconnections required in any execution cycle if the number of interconnections required exceeds the number of interconnections in the plurality of interconnections; and (d) allocating the plurality of edges to one or more of the plurality of interconnections. 2. A method in accordance with claim 1, wherein one or more of the functional units is partitioned into two or more slices, the method further comprising: mapping nodes of the data-flow graph onto slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in an execution cycle. 3. A method in accordance with claim 2, and wherein the mapping nodes of the data-flow graph onto slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in an execution cycle comprises: computing a set of execution cycles number for which the number of interconnections required is greater than the number of interconnections in the plurality of interconnections; computing tail-times for each node that is the source of an edge that intersects the set of execution cycles; and mapping nodes onto slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in a cycle. 4. A method in accordance with claim 2, further comprising: computing the set of execution cycles for which the number of interconnections required is greater than the number of interconnections in the plurality of interconnections; computing lead-times for each node that is the destination of an edge that intersects the set of execution cycles allocated to a cycle of the first set of execution cycles; and mapping nodes onto the slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in a cycle. 5. A method in accordance with claim 2, wherein slugs are used to discard results from unused slices of the one of more of the partitioned functional units. 6. A method in accordance with claim 1, wherein the edges of the plurality of edges are allocated so that values are stored at one or more of: an input of a functional unit; an output of a functional unit; a storage entry in the interconnection fabric; and a trampoline node. 7. A method in accordance with claim 1, wherein the plurality of interconnections comprises a re-configurable interconnect fabric having a plurality of links and wherein the edges of the plurality of edges are allocated so that values live at one or more of: an output of a functional unit; a storage entry in the interconnection fabric; a trampoline node; and an input of a functional unit. 8. A method in accordance with claim 1, wherein the scheduling the plurality of nodes for throughput by assigning an execution cycle and a functional unit to each node of the plurality of nodes comprises: (b1) attempting to schedule the plurality of node within the loop-period; and (b2) while the attempt to schedule the plurality of node within the loop-period is unsuccessful, increasing the loop-period and repeating from (b1). 9. A method in accordance with claim 1, wherein the allocating the plurality of edges to one or more of the plurality of interconnections comprises (d1) attempting to allocate the plurality of edges to one or more of the plurality of interconnections; and (d2) if the attempt to allocate the plurality of edges is unsuccessful, increasing the loop-period and repeating from (d1). 10. A method in accordance with claim 1, wherein the plurality of interconnections comprises a re-configurable interconnect fabric having a plurality of links and wherein the allocation of an edge of the plurality of edges is ordered as: the input or output of the functional unit to which the node is assigned; a storage entry in the interconnection fabric; and the input or output of a free functional unit. 11. A method in accordance with claim 1, further comprising splitting the data-flow graph into a number of partitions, corresponding to the number of iterations that are executed in parallel when a steady state operation of the computer has been achieved. 12. A method in accordance with claim 1, further comprising overlapping the schedules for two or more adjacent iterations to obtain a higher throughput. 13. A method in accordance with claim 1, wherein consecutive iterations are scheduled to use different functional unit instances. 14. A method in accordance with claim 1, wherein two schedules are computed, one for maximum throughput and one for minimum latency, and wherein a schedule of the two schedules is selected in accordance with the number of iterations to be performed. 15. A method in accordance with claim 1, wherein the resulting schedule is represented as one of a set of very long instruction words and microcode instructions. 16. A method for minimizing the number of interconnections required by a computer to execute a computation, the computer comprising a plurality of functional units interconnected by a plurality of interconnections, the computation being representable by a data-flow graph having a plurality of nodes and a plurality of edges and the method comprising: (a) computing a loop-period of the computation; (b) attempting to schedule the plurality of nodes within the loop period by assigning an execution cycle and a functional unit to each node of the plurality of nodes; (c) adjusting the scheduling the plurality of nodes to minimize the number of interconnections required in any execution cycle; (d) adjusting the scheduling of the plurality of nodes to increase throughput if the throughput is below a predetermined minimum throughput; (e) adjusting the scheduling of the plurality of nodes to decrease latency if the latency exceeds a predetermined maximum latency; and (f) allocating the plurality of edges to one or more of the plurality of interconnections. 17. A computer readable medium containing instructions which, when executed on a first computer, carry out a process of scheduling a computation for execution on a second computer, the second computer having a plurality of functional units interconnected by a plurality of interconnections, and the computation being representable by a data-flow graph having a plurality of nodes and a plurality of edges, the process of scheduling comprising: (a) computing a loop-period of the computation; (b) attempting to schedule the plurality of nodes within the loop period for throughput by assigning an execution cycle and a functional unit to each node of the plurality of nodes; (c) adjusting the scheduling of flexible nodes of the plurality of nodes to reduce the number of interconnections required in each execution cycle if the number of interconnections required is greater than the number of interconnection in the plurality of interconnections; and (d) allocating the plurality of edges to one or more of the plurality of interconnections. 18. A computer readable medium in accordance with claim 17, wherein one or more of the functional units is partitioned into two or more slices, the process further comprising: assigning slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in an execution cycle. 19. A computer readable medium in accordance with claim 18, wherein the assigning slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in an execution cycle comprises: computing a set of execution cycles number for which the number of interconnections required is greater than the number of interconnections in the plurality of interconnections; computing tail-times for each node allocated to a cycle of the set of execution cycles; and mapping nodes to slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in a cycle. 20. A computer readable medium in accordance with claim 18, further comprising: computing the set of execution cycles for which the number of interconnections required is greater than the number of interconnections in the plurality of interconnections; computing lead-times for each node that is the destination of an edge that intersects the set of execution cycles allocated to a cycle of the first set of execution cycles; and mapping nodes to the slices of the one of more of the partitioned functional units so as to reduce the number of interconnections required in a cycle. 21. A computer readable medium in accordance with claim 17, wherein the allocating the plurality of edges to one or more of the plurality of interconnections comprises (d1) attempting to allocate the plurality of edges to one or more of the plurality of interconnections; and (d2) if the attempt to allocate the plurality of edges is unsuccessful, increasing the loop-period and repeating from (d1). 22. A computer readable medium in accordance with claim 17 where the first and second computers are the same computer. 23. An application specific integrated circuit for performing a computation representable by a data-flow graph having a plurality of nodes and a plurality of edges, the application specific integrated circuit having a plurality of functional units interconnected by a plurality of interconnections, wherein the number of interconnections in the plurality of interconnections is determined by: (a) computing a loop-period of the computation; (b) attempting to schedule the plurality of nodes within the loop period for throughput by assigning an execution cycle and a functional unit to each node of the plurality of nodes; (c) adjusting the scheduling of flexible nodes of the plurality of nodes to minimizing the number of interconnections required in each execution cycle; and (d) allocating the plurality of edges to one or more of the plurality of interconnections. 24. An application specific integrated circuit in accordance with claim 23, wherein the allocating the plurality of edges to one or more of the plurality of interconnections comprises (d1) attempting to allocate the plurality of edges to one or more of the plurality of interconnections; and (d2) if the attempt to allocate the plurality of edges is unsuccessful, increasing the loop-period and repeating from (d1).
연구과제 타임라인
LOADING...
LOADING...
LOADING...
LOADING...
LOADING...
이 특허에 인용된 특허 (61)
Lane Allen Smith, Address generation utilizing an adder, a non-sequential counter and a latch.
Franssen Frank,BEX ; van Swaaij Michael,BEX ; Nachtergaele Lode,BEX ; Samsom Hans,BEX ; Catthoor Francky,BEX ; De Man Hugo,BEX, Control flow and memory management optimization.
Gallup Michael G. ; Goke L. Rodney ; Seaton ; Jr. Robert W. ; Lawell Terry G. ; Osborn Stephen G. ; Tomazin Thomas J., Data processing system and method thereof.
Scales ; III Hunter Ledbetter ; Diefendorff Keith Everett ; Olsson Brett ; Dubey Pradeep Kumar ; Hochsprung Ronald Ray ; Beavers Bradford Byron ; Burgess Bradley G. ; Snyder Michael Dean ; May Cathy , Data processing system for processing vector data and method therefor.
Kodosky Jeffrey L. ; McKaskle Greg ; Kay Meg Fletcher, Method and apparatus for providing improved type compatibility and data structure organization in a graphical data flow.
Agarwal Ramesh Chandra ; Groves Randall Dean ; Gustavson Fred G. ; Johnson Mark A. ; Lyon Terry L. ; Olsson Brett ; Shearer James B., Method and system in a data processing system for loading and storing vectors in a plurality of modes.
Hunt Peter D. (Pleasanton CA) Elliott Jon K. (Pleasanton CA) Tobias Richard J. (San Jose CA) Herring Alan J. (San Jose CA) Morgan Craig R. (San Jose CA) Hiller John A. (Palo Alto CA), Method for automated deployment of a software program onto a multi-processor architecture.
Nishiyama Hiroyasu,JPX ; Kikuchi Sumio,JPX ; Mori Noriyasu,JPX ; Nishimoto Akira,JPX ; Takeuchi Yooichi,JPX, Method for controlling a processor for power-saving in a computer for executing a program, compiler medium and processo.
Chow Frederick ; Kennedy Robert ; Liu Shin-Ming ; Lo Raymond ; Tu Peng ; Chan Sun C., Method, system, and computer program product for performing register promotion via load and store placement optimization within an optimizing compiler.
Pechanek Gerald G. ; Revilla Juan Guillermo ; Barry Edwin F., Methods and apparatus for dynamic very long instruction word sub-instruction selection for execution time parallelism in an indirect very long instruction word processor.
York Richard,GBX ; Frances Hedley James,GBX ; Symes Dominic,GBX ; Biles Stuart,GBX, Non-instruction base register addressing in a data processing apparatus.
Gupta Rajiv ; Berson David A. ; Fang Jesse Z., Optimizing code by exploiting speculation and predication with a cost-benefit data flow analysis based on path profiling information.
Ku Charlene S. ; Stearns Charles C. ; Tao Olive T., Partitioned decompression of audio data using audio decoder engine for computationally intensive processing.
Betker Michael R. ; Fernando John S. ; Lemmon Frank ; Whalen Shaun P., Pointer register indirectly addressing a second register in the processor core of a digital processor.
Kung Hsiang-Tsung (Pittsburgh PA) Hsu Feng-Hsiung (Pittsburgh PA) Sussman Alan L. (Pittsburgh PA) Nishizawa Teiji (Nara JPX), Programmable interconnection chip for computer system functional modules.
Jones, Michael B.; Draves, Jr., Richard P.; Rosu, Daniela; Rosu, Marcel-Catalin, Providing predictable scheduling of programs using a repeating precomputed schedule.
Michael B. Jones ; Richard P. Draves, Jr. ; Daniela Rosu ; Marcel-Catalin Rosu, Providing predictable scheduling of programs using a repeating precomputed schedule.
Shebanow Michael C. (Austin TX) Alsup Mitchell K. (Dripping Springs TX) Scales Hunter L. (Austin TX) Hoekstra George P. (Austin TX), Randomly accessible memory having time overlapping memory accesses.
Epstein David A. (Ossining NY) Gilley Glenn G. (Chapel Hill NC) McAuliff Kevin P. (Peekskill NY), System and method for efficiently executing directed acyclic graphs.
Dally William J. ; Rixner Scott Whitney ; Grossman Jeffrey P. ; Buehler Christopher James, System and method for performing compound vector operations.
Rehg,James M.; Knobe,Kathleen, System for computing the optimal static schedule using the stored task execution costs with recent schedule execution costs.
Omoda Koichiro (Sagamihara JPX) Torii Shunichi (Musashino JPX) Nagashima Shigeo (Hachioji JPX) Inagami Yasuhiro (Hadano JPX) Nakagawa Takayuki (Hadano JPX), Vector processor for reordering vector data during transfer from main memory to vector registers.
Ashar, Pranav; Raghunathan, Anand; Bhattacharya, Subhrajit; Gupta, Aarti, Verification of scheduling in the presence of loops using uninterpreted symbolic simulation.
Gschwind, Michael K.; Salapura, Valentina, Techniques for identifying instructions for decode-time instruction optimization grouping in view of cache boundaries.
※ AI-Helper는 부적절한 답변을 할 수 있습니다.