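IPC Classification Information
Country / Type | United States (US) Patent, Granted
International Patent Classification (IPC, 7th edition) |
Application No. | US-0686292 (2010-01-12)
Registration No. | US-8612510 (2013-12-17)
Inventor / Address | Dean, Jeffrey; Ghemawat, Sanjay
Applicant / Address |
Agent / Address | Morgan, Lewis & Bockius LLP
Citation Information | Times cited: 10; Patents cited: 22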
Abstract
A large-scale data processing system and method for processing data in a distributed and parallel processing environment. The system includes an application-independent framework for processing data having a plurality of application-independent map modules and reduce modules. These application-independent modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations. The system also includes a plurality of user-specified, application-specific operators, for use with the application-independent framework to perform a user-specified data processing operation on a user-specified set of input files. The application-specific operators include: a map operator and a reduce operator. The map operator is applied by the application-independent map modules to input data in the user-specified set of input files to produce intermediate data values. The reduce operator is applied by the application-independent reduce modules to process the intermediate data values to produce final output data.
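The split the abstract describes, between an application-independent framework and user-specified map and reduce operators, can be illustrated with a minimal single-process sketch. This is not the patented implementation: it runs sequentially, keeps everything in memory, and omits parallelization, fault tolerance, and I/O scheduling. The names `run_job`, `word_count_map`, and `word_count_reduce` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def run_job(inputs, map_op, reduce_op):
    """Application-independent driver (hypothetical): it never inspects what
    map_op or reduce_op compute, it only moves data between them."""
    intermediate = defaultdict(list)
    # Map phase: apply the user-specified map operator to every input file.
    for name, contents in inputs.items():
        for key, value in map_op(name, contents):
            intermediate[key].append(value)
    # Reduce phase: apply the user-specified reduce operator per key.
    return {key: reduce_op(key, values)
            for key, values in sorted(intermediate.items())}


def word_count_map(name, contents):
    """Application-specific map operator: emit (word, 1) for each word."""
    for word in contents.split():
        yield word.lower(), 1


def word_count_reduce(key, values):
    """Application-specific reduce operator: sum the counts for one word."""
    return sum(values)


if __name__ == "__main__":
    files = {"a.txt": "the quick fox", "b.txt": "The lazy dog the end"}
    print(run_job(files, word_count_map, word_count_reduce))
```

Swapping in a different pair of operators changes the computation without touching `run_job`, which is the sense in which the driver is application-independent in this sketch.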
Representative Claim
1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: a plurality of interconnected processors; an application-independent framework for processing data, including: a plurality of application-independent map modules that execute on at least a first subset of the plurality of interconnected processors; and, a plurality of application-independent reduce modules, distinct from the plurality of application-independent map modules, that execute on at least a second subset of the plurality of interconnected processors, wherein the application-independent map modules and application-independent reduce modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations; wherein the plurality of application-independent map modules and the plurality of application-independent reduce modules are independent of a plurality of user-specified, application-specific operators and operations; and the plurality of user-specified, application-specific operators, for use with the application-independent framework, perform the user-specified data processing operations on a user-specified set of input files, the plurality of user-specified, application-specific operators including: a map operator that produces intermediate data values, wherein the map operator is applied by the application-independent map modules to input data in the user-specified set of input files to produce the intermediate data values; and a reduce operator that produces final output data for at least one of the user-specified data processing operations, wherein the reduce operator is applied by the application-independent reduce modules to process the intermediate data values to produce the final output data for the user-specified data processing operation.
2. The system of claim 1, wherein the application-independent framework for processing data further includes a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values.
3. The system of claim 1, wherein: the application-independent framework for processing data further includes a set of intermediate data structures for storing the intermediate data values; and the plurality of user-specified, application-specific operators further comprise a partition operator, and the partition operator is applied to at least a subset of the intermediate data values to specify, for each respective intermediate data value in the subset of intermediate data values, a respective intermediate data structure of the set of intermediate data structures in which to store the respective intermediate data value.
4. The system of claim 1, wherein: the plurality of user-specified, application-specific operators further comprise a combiner operator; and the combiner operator is applied by the application-independent map modules in conjunction with the map operator to produce the intermediate data values by combining a plurality of initial data values produced by the application of the map operator to the input data.
5. The system of claim 1, wherein the application independent framework for processing data further includes: a plurality of worker processes, each worker process including one of the application-independent map modules and one of the application-independent reduce modules; and an application-independent supervisory process for: determining, for the user-specified data processing operation, a plurality of data processing tasks including a plurality of map tasks specifying input data to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; and assigning the data processing tasks to idle ones of the worker processes.
6. The system of claim 5, wherein: the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks; and the application-independent supervisory process maintains status information with respect to map tasks awaiting assignment to a worker process.
7. The system of claim 5, wherein: the set of interconnected computer systems are grouped into a plurality of datacenters; and when assigning the data processing tasks to idle ones of the worker processes, the supervisory process preferentially assigns data processing tasks for data stored on computer systems in a respective datacenter to worker processes that are running on computer systems in the respective datacenter.
8. The system of claim 1, wherein the application-independent operators include application-independent operators that automatically provide fault tolerance when performing the user-specified data processing operations.
9. The system of claim 1, wherein the application-independent operators include application-independent operators that automatically handle I/O scheduling when performing the user-specified data processing operations.
10. A method of performing large-scale processing of data in a distributed and parallel processing environment comprising: receiving, from a user, a plurality of user-specified, application-specific operators, for use with an application-independent framework to perform user-specified data processing operations on a user-specified set of input files, the plurality of user-specified, application-specific operators including a map operator and a reduce operator; and processing data on a set of interconnected computer systems in the application-independent framework, each of the computer systems comprising one or more processors and memory, the application independent framework including: a plurality of application-independent map modules; and, a plurality of application-independent reduce modules, distinct from the plurality of application-independent map modules, wherein the application-independent map modules and application-independent reduce modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations; wherein the plurality of application-independent map modules and the plurality of application-independent reduce modules are independent of a plurality of user-specified, application-specific operators and operations; the data processing including: producing intermediate data values by using the application-independent map modules to apply the map operator to input data in the user-specified set of input files; and producing output data for at least one of the user-specified data processing operations by using the application-independent reduce modules to apply the reduce operator to process the intermediate data values.
11. The method of claim 10, wherein processing data on the set of interconnected computer systems in the application-independent framework includes storing the intermediate data values in a set of intermediate data structures distributed among a plurality of the interconnected computing systems.
12. The method of claim 10, wherein processing data on the set of interconnected computer systems in the application-independent framework includes storing the intermediate data values in a set of intermediate data structures; wherein the plurality of user-specified, application-specific operators further include a partition operator; and the method further includes applying the partition operator to at least a subset of the intermediate data values to specify, for each respective intermediate data value in the subset of intermediate data values, a respective intermediate data structure of the set of intermediate data structures in which to store the respective intermediate data value.
13. The method of claim 10, wherein the plurality of user-specified, application-specific operators further include a combiner operator; and the method further includes applying the combiner operator in conjunction with the map operator to produce the intermediate data values by combining a plurality of initial data values produced by the application of the map operator to the input data.
14. The method of claim 10, wherein the application independent framework for processing data further includes a plurality of worker processes, each worker process including one of the application-independent map modules and one of the application-independent reduce modules, and an application-independent supervisory process; and the data processing further includes: determining, via the application-independent supervisory process, a plurality of data processing tasks including a plurality of map tasks specifying input data to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; and assigning, via the application-independent supervisory process, the data processing tasks to idle ones of the worker processes.
15. The method of claim 14, wherein: the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks; and the data processing further includes maintaining, via the application-independent supervisory process, status information with respect to map tasks awaiting assignment to a worker process.
16. The method of claim 14, wherein the set of interconnected computer systems are grouped into a plurality of datacenters; and assigning the data processing tasks to idle ones of the worker processes includes preferentially assigning, via the application-independent supervisory process, data processing tasks for data stored on computer systems in a respective datacenter to worker processes that are running on computer systems in the respective datacenter.
17. The method of claim 10, including automatically providing fault tolerance, via the application-independent operators, when performing the user-specified data processing operations.
18. The method of claim 10, including automatically handling I/O scheduling, via the application-independent operators, when performing the user-specified data processing operations.
19. A non-transitory computer-readable storage medium storing one or more programs configured for execution by a plurality of processors of a set of interconnected computer systems, the one or more programs comprising instructions to be executed by one or more of the plurality of processors so as to: receive, from a user, a plurality of user-specified, application-specific operators, for use with an application-independent framework to perform user-specified data processing operations on a user-specified set of input files, the plurality of user-specified application-specific operators including a map operator and a reduce operator; and process data on a set of interconnected computer systems in the application-independent framework, each of the computer systems comprising one or more processors and memory, the application independent framework including: a plurality of application-independent map modules; and, a plurality of application-independent reduce modules, distinct from the plurality of application-independent map modules, wherein the application-independent map modules and application-independent reduce modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations; wherein: the plurality of application-independent map modules and the plurality of application-independent reduce modules are independent of a plurality of user-specified, application-specific operators and operations; the application-independent map modules include instructions to be executed by one or more of the plurality of processors so as to produce intermediate data values by applying the map operator to input data in the user-specified set of input files; and the application-independent reduce modules include instructions to be executed by one or more of the plurality of processors so as to produce output data for at least one of the user-specified data processing operations by applying the reduce operator to process the intermediate data values.
20. The non-transitory computer-readable storage medium of claim 19, wherein processing data on the set of interconnected computer systems in the application-independent framework includes storing the intermediate data values in a set of intermediate data structures distributed among a plurality of the interconnected computing systems.
21. The non-transitory computer-readable storage medium of claim 19, wherein processing data on the set of interconnected computer systems in the application-independent framework includes storing the intermediate data values in a set of intermediate data structures; wherein the plurality of user-specified, application-specific operators further comprise a partition operator; and the one or more programs further comprise instructions to apply the partition operator to at least a subset of the intermediate data values to specify, for each respective intermediate data value in the subset of intermediate data values, a respective intermediate data structure of the set of intermediate data structures in which to store the respective intermediate data value.
22. The non-transitory computer-readable storage medium of claim 19, wherein the plurality of user-specified, application-specific operators further comprise a combiner operator; and the one or more programs further comprise instructions to apply the combiner operator in conjunction with the map operator to produce the intermediate data values by combining a plurality of initial data values produced by the application of the map operator to the input data.
23. The non-transitory computer-readable storage medium of claim 19, wherein the application independent framework for processing data further includes a plurality of worker processes, each worker process including one of the application-independent map modules and one of the application-independent reduce modules, and an application-independent supervisory process; and the one or more programs further comprise instructions to: determine, via the application-independent supervisory process, a plurality of data processing tasks including a plurality of map tasks specifying input data to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; and assign, via the application-independent supervisory process, the data processing tasks to idle ones of the worker processes.
24. The non-transitory computer-readable storage medium of claim 23, wherein: the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks; and the one or more programs further comprise instructions to maintain, via the application-independent supervisory process, status information with respect to map tasks awaiting assignment to a worker process.
25. The non-transitory computer-readable storage medium of claim 23, wherein the set of interconnected computer systems are grouped into a plurality of datacenters; and assigning the data processing tasks to idle ones of the worker processes includes preferentially assigning, via the application-independent supervisory process, data processing tasks for data stored on computer systems in a respective datacenter to worker processes that are running on computer systems in the respective datacenter.
26. The non-transitory computer-readable storage medium of claim 19, the one or more programs further comprise instructions to automatically provide fault tolerance, via the application-independent operators, when performing the user-specified data processing operations.
27. The non-transitory computer-readable storage medium of claim 19, the one or more programs further comprise instructions to automatically handle I/O scheduling, via the application-independent operators, when performing the user-specified data processing operations.
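Several dependent claims name mechanisms that are easier to picture in code: a partition operator that picks the intermediate data structure for each value, a combiner operator applied alongside the map operator, and a supervisory process that hands tasks to idle workers while preferring workers in the datacenter that holds the data. The sketch below illustrates those ideas under stated assumptions; it is not the patented implementation. Every name (`partition_op`, `combiner_op`, `run_map_task`, `assign_tasks`) is hypothetical, CRC32 is just one possible partitioning choice, and the scheduler is a toy that ignores failures and reduce tasks.

```python
import zlib
from collections import defaultdict, deque


def partition_op(key, num_partitions):
    """Hypothetical partition operator: choose the intermediate structure for a key."""
    # zlib.crc32 is stable across runs, unlike the built-in hash() on strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combiner_op(key, values):
    """Hypothetical combiner operator: pre-aggregate values from one map task."""
    return sum(values)


def run_map_task(records, map_op, num_partitions):
    """Apply map_op, combine locally, then route each combined value to a partition."""
    initial = defaultdict(list)
    for record in records:
        for key, value in map_op(record):
            initial[key].append(value)
    partitions = [dict() for _ in range(num_partitions)]
    for key, values in initial.items():
        partitions[partition_op(key, num_partitions)][key] = combiner_op(key, values)
    return partitions


def assign_tasks(pending, idle_workers):
    """Supervisor sketch: give each idle worker one task, preferring local data.

    pending: deque of (task_id, datacenter) awaiting assignment.
    idle_workers: list of (worker_id, datacenter).
    Returns {worker_id: task_id}; unassigned tasks stay in `pending`.
    """
    assignments = {}
    for worker_id, worker_dc in idle_workers:
        if not pending:
            break
        # Prefer a task whose data is in the worker's datacenter, else the oldest task.
        local = next((t for t in pending if t[1] == worker_dc), None)
        task = local if local is not None else pending[0]
        pending.remove(task)
        assignments[worker_id] = task[0]
    return assignments
```

When there are more map tasks than workers, whatever remains in `pending` after a call to `assign_tasks` plays the role of the status information about tasks awaiting assignment.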