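IPC Classification Information
Country / Type | United States (US) Patent, Granted
International Patent Classification (IPC, 7th edition) |
Application number | UP-0871245 (2004-06-18)
Registration number | US-7756919 (2010-08-02)
Inventors / Address | Dean, Jeffrey; Ghemawat, Sanjay
Applicant / Address |
Agent / Address | Morgan, Lewis & Bockius LLP
Citation information | Times cited: 32; Patents cited: 4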
Abstract
A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
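The split the abstract describes, between application-independent modules and application-specific operations, can be illustrated with a small single-machine sketch. This is not the patented implementation: the names `run_map_reduce`, `map_task`, `reduce_task`, `word_count_map`, and `word_count_reduce` are hypothetical, a thread pool stands in for the multiple processors, and in-memory dictionaries stand in for the intermediate data structures.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_map_reduce(blocks, map_fn, reduce_fn, num_partitions=2, workers=4):
    """Application-independent driver (hypothetical): it never inspects
    what map_fn or reduce_fn actually compute."""

    def map_task(block):
        # One intermediate data structure per reduce partition.
        partitions = [defaultdict(list) for _ in range(num_partitions)]
        for key, value in map_fn(block):
            partitions[hash(key) % num_partitions][key].append(value)
        return partitions

    def reduce_task(index, map_outputs):
        # Collect this partition's pairs from every map task,
        # then combine the values that share a key.
        grouped = defaultdict(list)
        for partitions in map_outputs:
            for key, values in partitions[index].items():
                grouped[key].extend(values)
        return {key: reduce_fn(key, values) for key, values in grouped.items()}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map tasks run concurrently, one per input block.
        map_outputs = list(pool.map(map_task, blocks))
        # Reduce tasks run concurrently, one per partition.
        reduced = list(
            pool.map(lambda i: reduce_task(i, map_outputs), range(num_partitions))
        )

    output = {}
    for part in reduced:
        output.update(part)
    return output


# Application-specific operations (assumed example: word count).
def word_count_map(block):
    for word in block.split():
        yield word, 1


def word_count_reduce(key, values):
    return sum(values)


counts = run_map_reduce(["a b a", "b c", "a"], word_count_map, word_count_reduce)
print(counts)
```

Swapping in a different `map_fn` and `reduce_fn` changes the computation without touching the driver, which is the separation the abstract draws between the two kinds of module.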
Representative claims
What is claimed is: 1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: a plurality of processes executing on a plurality of interconnected processors; the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes; wherein the supervisory process is for assigning input data blocks of the set of input data to respective map processes of the plurality of map processes; wherein each of the plurality of map processes includes an application-independent map module for retrieving an input data block assigned thereto by the supervisory process, reading portions of the input data block, and applying an application-specific map operation to the input data block to produce intermediate key-value pairs, wherein at least two of the plurality of map processes operate simultaneously so as to perform the map operation in parallel on multiple input data blocks; a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and wherein each of the plurality of reduce processes includes an application-independent reduce module for retrieving a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and applying an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output, wherein at least two of the plurality of reduce processes operate simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs. 2. 
The system of claim 1, wherein the supervisory process is configured to automatically determine a number of distinct map tasks and a number of distinct reduce tasks to perform the data processing job, and to automatically assign the map tasks and reduce tasks to the map processes and reduce processes in accordance with availability of the processes executing on the plurality of interconnected processors. 3. The system of claim 2, wherein the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks, and wherein the supervisory process maintains status information with respect to map tasks awaiting assignment to a process. 4. The system of claim 2, wherein the system includes a file system for storing files and making files available to the processes executing on the plurality of interconnected processors, and wherein the intermediate data structures are files stored in the file system. 5. A method of performing a large-scale data processing job, comprising, in a distributed and parallel processing environment including a set of interconnected computer systems: 
executing a plurality of processes on a plurality of interconnected processors, the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes; in the supervisory process, assigning input data blocks of a set of input data to respective map processes of the plurality of map processes; in each of the plurality of map processes, executing an application-independent map module to retrieve an input data block assigned thereto by the supervisory process, to read portions of the input data block and to apply an application-specific map operation to the input data block to produce intermediate key-value pairs; including operating at least two of the plurality of the map processes simultaneously so as to perform the map operation in parallel on multiple input data blocks; storing the intermediate key-value pairs in a plurality of intermediate data structures; and in each of the plurality of reduce processes, executing an application-independent reduce module to retrieve a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and to apply an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output data; including operating at least two of the plurality of reduce processes simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs. 6. 
The method of claim 5, including, in the supervisory process, automatically determining a number of distinct map tasks and a number of distinct reduce tasks to perform the data processing job, and automatically assigning the map tasks and reduce tasks to the map processes and reduce processes in accordance with availability of the processes executing on the plurality of interconnected processors. 7. The method of claim 6, wherein the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks, and wherein the supervisory process maintains status information with respect to map tasks awaiting assignment to a process. 8. The method of claim 5, including storing the intermediate data structures as files in a file system, and in the plurality of reduce processes, retrieving respective ones of the files so as to retrieve the respective subsets of the intermediate key-value pairs. 9. A system for large-scale processing of data in a distributed and parallel processing environment comprising: a set of interconnected computing systems, each having memory and one or more processors; one or more application-independent map modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application-independent map modules automatically parallelize the map operation across multiple processors in the parallel processing environment; a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and one or more application-independent reduce modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for retrieving 
the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data. 10. The system of claim 9, wherein the one or more application-independent reduce modules automatically parallelize the reduce operation across multiple processors in the parallel processing environment. 11. The system of claim 9, wherein the reduce operation includes sorting the intermediate key-value pairs and combining values having the same key. 12. The system of claim 9, wherein the intermediate data structure includes one or more intermediate data files coupled to each map module for storing intermediate data values. 13. The system of claim 9, wherein the output data is written to a file system accessible via a distributed network. 14. The system of claim 9, further comprising: a work queue master module coupled to at least one of the map and reduce modules and configured to assign at least one of the map and reduce operations to an idle process coupled to a distributed network. 15. The system of claim 9, wherein at least one application-specific map operation includes an application-specific combiner operation for combining initial data values produced by at least one application-specific map operation so as to produce the intermediate key-values. 16. The system of claim 9, wherein at least one application-specific map operation includes an application-specific combiner operation for combining initial data values produced by at least one application-specific map operation having shared key values so as to produce the intermediate key-value pairs. 17. 
A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: a set of interconnected computing systems, each having memory and one or more processors; a set of application independent map modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for reading portions of input files containing data, and for applying at least one user-specified, application-specific map operation to the data to produce intermediate key-value pairs; a set of intermediate data structures distributed throughout the distributed and parallel processing environment for storing the intermediate key-value pairs; and a set of application independent reduce modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for applying at least one user-specified, application-specific reduce operation to the intermediate key-value pairs so as to combine intermediate values sharing the same key. 18. The system of claim 17, wherein at least one of the map and reduce operations is automatically parallelized across multiple processors in the parallel processing environment. 19. The system of claim 17, wherein the key and the combined intermediate values are written to a file system on a distributed network. 20. The system of claim 17, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation so as to produce the intermediate key-value pairs. 21. The system of claim 17, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation having shared key values so as to produce the intermediate key-value pairs. 22. 
A large-scale data processing method in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: in one or more application-independent map modules, reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application independent map modules automatically parallelize the map operation across multiple processors in the parallel processing environment; storing the intermediate key-value pairs produced by the map operation in a plurality of intermediate data structures adapted for storing the intermediate key-value pairs; and in one or more application-independent reduce modules, retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data. 23. The method of claim 22, further comprising: automatically parallelizing the reduce operation across multiple processors in the parallel processing environment using an application independent methodology; and storing the output data produced by the reduce operation. 24. The method of claim 23, wherein the input data includes key-value pairs. 25. The method of claim 24, wherein the reduce operation includes sorting the intermediate key-value pairs. 26. The method of claim 22, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation so as to produce the intermediate key-value pairs. 27. The method of claim 22, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation having shared key values so as to produce the intermediate key-value pairs. 28. 
A computer-readable storage medium having stored thereon instructions, which, when executed by a plurality of processors, cause the plurality of processors to perform the operations of: in one or more application-independent map modules, reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application-independent map modules automatically parallelize the map operation across multiple processors in a parallel processing environment; storing the intermediate key-value pairs produced by the map operation in a plurality of intermediate data structures adapted for storing the intermediate key-value pairs; and in one or more application-independent reduce modules, retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data. 29. The computer-readable storage medium of claim 28, wherein the instructions additionally include instructions, to perform the operations of: automatically parallelizing the reduce operation across multiple processors in the parallel processing environment using an application independent methodology; and storing the output data produced by the reduce operation. 30. 
A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: means for applying at least one application-specific map operation to input data to produce intermediate key-value pairs; means for automatically parallelizing the map operation across multiple processors in the parallel processing environment using an application independent methodology; means for storing the intermediate key-value pairs produced by the map operation; means for retrieving a respective subset of the intermediate key-value pairs and applying at least one application-specific reduce operation to the intermediate key-value pairs including combining respective intermediate values sharing the same key; means for automatically parallelizing the reduce operation across multiple processors in the parallel processing environment using an application independent methodology; and means for storing output data values produced by the reduce operation.
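Several dependent claims (16, 21, and 27, for example) recite a combiner operation that merges initial data values sharing a key before the intermediate key-value pairs are produced. A minimal sketch of that idea follows, assuming a word-count map operation; `map_with_combiner` and `word_map` are hypothetical names, and the claims do not prescribe this structure.

```python
from collections import defaultdict


def map_with_combiner(block, map_fn, combine_fn):
    """Apply map_fn to a block, then merge values that share a key
    before emitting intermediate key-value pairs (hypothetical helper)."""
    initial = defaultdict(list)
    for key, value in map_fn(block):
        initial[key].append(value)
    # Emit one intermediate pair per distinct key rather than one per occurrence.
    return [(key, combine_fn(values)) for key, values in initial.items()]


def word_map(block):
    # Assumed application-specific map operation: emit (word, 1) per word.
    for word in block.split():
        yield word, 1


pairs = map_with_combiner("to be or not to be", word_map, sum)
print(pairs)
```

Here six initial values collapse into four intermediate pairs, so less data has to be stored in the intermediate data structures and later retrieved by the reduce side.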