[Patent] System and method for large-scale data processing using an application-independent framework


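IPC classification information
Country / Type	United States (US) Patent, granted
International Patent Classification (IPC 7th edition)	G06F-017/30 G06F-009/54 G06F-009/48
Application number	US-0099806 (2013-12-06)
Registration number	US-9612883 (2017-04-04)
Inventor / Address	Dean, Jeffrey; Ghemawat, Sanjay
Applicant / Address	Google Inc.
Agent / Address	Morgan, Lewis & Bockius LLP
Citation information	Times cited: 0; Patents cited: 27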

Abstract

A large-scale data processing system and method for processing data in a distributed and parallel processing environment is disclosed. The system comprises a set of interconnected computing systems, each having one or more processors and memory. The set of interconnected computing systems include: a set of application-independent map modules for reading portions of input files containing data, and for producing intermediate data values by applying at least one user-specified, application-specific map operation to the data; a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values; and a set of application-independent reduce modules, distinct from the plurality of application-independent map modules, for producing final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values.
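
The division of labor described in the abstract can be illustrated with a minimal single-process sketch. It is not the patented implementation: it runs sequentially rather than across interconnected machines, and every name in it (`partition`, `run_map_module`, `run_reduce_module`, `run_job`, and the word-count operations) is hypothetical. The CRC32-based choice of intermediate structure is likewise an assumption made for illustration. What the sketch does show is that the framework functions never inspect what the user-specified map and reduce operations compute.

```python
import zlib
from collections import defaultdict


def partition(key, num_structures):
    """Pick an intermediate structure for a key (assumed scheme: CRC32 modulo)."""
    return zlib.crc32(key.encode("utf-8")) % num_structures


def run_map_module(portion, map_op, intermediate):
    """Application-independent: apply the user's map_op and store what it emits."""
    for key, value in map_op(portion):
        intermediate[partition(key, len(intermediate))][key].append(value)


def run_reduce_module(structure, reduce_op):
    """Application-independent: apply the user's reduce_op to each key's values."""
    return {key: reduce_op(key, values) for key, values in structure.items()}


def run_job(portions, map_op, reduce_op, num_structures=3):
    """Run all map modules, then all reduce modules (sequentially in this sketch)."""
    intermediate = [defaultdict(list) for _ in range(num_structures)]
    for portion in portions:
        run_map_module(portion, map_op, intermediate)
    output = {}
    for structure in intermediate:
        output.update(run_reduce_module(structure, reduce_op))
    return output


# Application-specific operations supplied by the user (word count).
def word_count_map(text):
    for word in text.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    return sum(counts)


counts = run_job(
    ["the quick fox", "the lazy dog", "The end"],
    word_count_map,
    word_count_reduce,
)
print(sorted(counts.items()))
```

Swapping in a different pair of map and reduce operations (for example, one that emits `(url, 1)` per log line) requires no change to the framework functions, which is the sense in which they are application-independent.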

Representative claims

1. A system for large-scale processing of data in a distributed and parallel processing environment, comprising: a set of interconnected computing systems, each having one or more processors and memory, the set of interconnected computing systems including: a plurality of worker processes executing on the set of interconnected computing systems; an application-independent supervisory process executing on the set of interconnected computing systems, for: determining, for input files, a plurality of data processing tasks including a plurality of map tasks specifying data from the input files to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; and assigning the data processing tasks to idle ones of the worker processes; a set of application-independent map functions, executed by a first subset of the plurality of worker processes, for reading portions of the input files containing data, and for producing intermediate data values by applying at least one user-specified, application-specific map operation to the data, wherein the set of application-independent map functions are independent of the at least one user-specified, application-specific map operation; a set of intermediate data structures distributed among a plurality of the interconnected computing systems for storing the intermediate data values; and a set of application-independent reduce functions, distinct from the set of application-independent map functions, the set of application-independent reduce functions executed by a second subset of the plurality of worker processes for producing the final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values, wherein the set of application-independent reduce functions are independent of the at least one user-specified, application-specific reduce operation.
2. The system of claim 1, wherein at least one of the map and reduce operations is automatically parallelized across multiple processors in the distributed and parallel processing environment using an application-independent methodology.
3. The system of claim 1, wherein the set of interconnected computing systems applies a partition operation to at least a subset of the intermediate data values, and for each respective intermediate data value in the at least a subset of the intermediate data values, the partition operation specifies a respective intermediate data structure of the set of intermediate data structures in which to store the respective intermediate data value.
4. The system of claim 1, wherein a respective application-specific map operation includes an application-specific combiner operation for combining initial values produced by the respective application-specific map operation so as to produce the intermediate data values.
5. The system of claim 1, wherein: the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks; and the supervisory process maintains status information with respect to map tasks awaiting assignment to a worker process.
6. The system of claim 1, wherein: the set of interconnected computer systems are grouped into a plurality of datacenters; when assigning the data processing tasks to idle ones of the worker processes, the supervisory process preferentially assigns data processing tasks for data stored on computer systems in a respective datacenter to worker processes that are running on computer systems in the respective datacenter.
7. The system of claim 1, wherein the map and reduce operations are implemented on different processors coupled to a distributed network.
8. The system of claim 7, wherein the final output data is written to a file system on the distributed network.
9. A method of performing large-scale processing of data in a distributed and parallel processing environment, comprising: at a set of interconnected computing systems, each having one or more processors and memory: executing a plurality of worker processes; executing an application-independent supervisory process on the set of interconnected computing systems, for: determining, for input files, a plurality of data processing tasks including a plurality of map tasks specifying data from the input files to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; and assigning the data processing tasks to idle ones of the worker processes: using a set of application-independent map functions, executed by a first subset of the plurality of worker processes, to read portions of the input files containing data and produce intermediate data values by applying at least one user-specified, application-specific map operation to the data; storing the intermediate data values in a set of intermediate data structures distributed among a plurality of the interconnected computing systems; and using a set of application-independent reduce functions, distinct from the set of application-independent map functions, to produce the final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values, wherein the set of application-independent reduce functions are executed by a second subset of the plurality of worker processes; wherein the set of application-independent map functions and the set of application-independent reduce functions are independent of application-specific operators and operations including the at least one user-specified, application-specific map operation and the at least one user-specified, application-specific reduce operation.
10. The method of claim 9, including applying a partition operation to at least a subset of the intermediate data values, wherein for each respective intermediate data value in the at least a subset of the intermediate data values, the partition operation specifies a respective intermediate data structure of the set of intermediate data structures in which to store the respective intermediate data value.
11. The method of claim 9, wherein a respective application-specific map operation includes an application-specific combiner operation for combining initial values produced by the respective application-specific map operation so as to produce the intermediate data values.
12. The method of claim 9, wherein: the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks; and the supervisory process maintains status information with respect to map tasks awaiting assignment to a worker process.
13. A non-transitory computer readable storage medium storing one or more programs configured for execution by a plurality processors of a set of interconnected computing systems, the one or more programs comprising instructions to be executed by the plurality of processors so as to: execute a plurality of worker processes on the set of interconnected computing systems; execute an application-independent supervisory process on the set of interconnected computing systems, for: determining, for input files, a plurality of data processing tasks including a plurality of map tasks specifying data from the input files to be processed into intermediate data values and a plurality of reduce tasks specifying intermediate data values to be processed into final output data; and assigning the data processing tasks to idle ones of the worker processes; use a set of application-independent map functions, executed by a first subset of the plurality of worker processes, to read portions of the input files containing data and produce intermediate data values by applying at least one user-specified, application-specific map operation to the data; store the intermediate data values in a set of intermediate data structures distributed among a plurality of the interconnected computing systems; and use a set of application-independent reduce functions, distinct from the set of application-independent map functions, to produce the final output data by applying at least one user-specified, application-specific reduce operation to the intermediate data values, wherein the set of application-independent reduce functions are executed by a second subset of the plurality of worker processes; wherein the set of application-independent map functions and the set of application-independent reduce functions are independent of application-specific operators and operations, including the at least one user-specified, application-specific map operation and the at least one user-specified, application-specific reduce operation.
14. The non-transitory computer readable storage medium of claim 13, wherein one or more programs further comprise instructions to be executed by the plurality of processors so as to apply a partition operation to at least a subset of the intermediate data values, wherein for each respective intermediate data value in the at least a subset of the intermediate data values, the partition operation specifies a respective intermediate data structure of the set of intermediate data structures in which to store the respective intermediate data value.
15. The non-transitory computer readable storage medium of claim 13, wherein a respective application-specific map operation includes an application-specific combiner operation for combining initial values produced by the respective application-specific map operation so as to produce the intermediate data values.
16. The non-transitory computer readable storage medium of claim 13, wherein: the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks; and the supervisory process maintains status information with respect to map tasks awaiting assignment to a worker process.
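
The supervisory behavior recited in claims 1, 5, and 6 (assigning tasks to idle workers, tracking tasks that are still awaiting assignment, and preferring workers in the datacenter that holds a task's data) might be sketched as follows. This is an illustrative toy under stated assumptions, not the claimed system: the `Task`, `Worker`, and `Supervisor` classes, the three status strings, and the datacenter labels are all hypothetical, and the greedy first-match policy is only one of many ways to express a preference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    datacenter: str               # hypothetical label: where the task's data lives
    status: str = "waiting"       # waiting -> in_progress -> completed
    worker: Optional[str] = None  # name of the worker the task was assigned to


@dataclass
class Worker:
    name: str
    datacenter: str
    idle: bool = True


class Supervisor:
    """Toy supervisory process: keeps per-task status and assigns to idle workers."""

    def __init__(self, tasks, workers):
        self.tasks = {t.task_id: t for t in tasks}
        self.workers = {w.name: w for w in workers}

    def waiting_tasks(self):
        """Status information for tasks still awaiting assignment."""
        return [t for t in self.tasks.values() if t.status == "waiting"]

    def assign(self):
        """Give each idle worker one waiting task, preferring same-datacenter data."""
        assignments = []
        for worker in self.workers.values():
            if not worker.idle:
                continue
            waiting = self.waiting_tasks()
            if not waiting:
                break
            local = [t for t in waiting if t.datacenter == worker.datacenter]
            task = (local or waiting)[0]  # fall back to a remote task if none is local
            task.status, task.worker = "in_progress", worker.name
            worker.idle = False
            assignments.append((task.task_id, worker.name))
        return assignments

    def complete(self, task_id):
        """Mark a task as done and return its worker to the idle pool."""
        task = self.tasks[task_id]
        task.status = "completed"
        self.workers[task.worker].idle = True


# Four map tasks but only two workers, so two tasks must wait for a later round.
supervisor = Supervisor(
    tasks=[Task("map-0", "west"), Task("map-1", "east"),
           Task("map-2", "east"), Task("map-3", "west")],
    workers=[Worker("w_east", "east"), Worker("w_west", "west")],
)
first_round = supervisor.assign()
still_waiting = [t.task_id for t in supervisor.waiting_tasks()]
supervisor.complete("map-1")
second_round = supervisor.assign()
print(first_round, still_waiting, second_round)
```

In this run each worker first receives a task whose data sits in its own datacenter, and the remaining tasks stay in the `waiting` state until a worker reports completion. A real scheduler would also need to handle worker failure and the dependency of reduce tasks on map output, neither of which this sketch attempts.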

Patents cited by this patent (27)

McMillen Robert J. ; Watson M. Cameron ; Chura David J., Computer system using a master processor to automatically reconfigure faulty switch node that is detected and reported.
Hardwick Jonathan C.,GBX, Dynamic load balancing among processors in a parallel computer.
Shimon Muller ; Denton E. Gentry, Jr. ; John E. Watkins ; Linda T. Cheng, High performance network interface.
Liu, Huan, Infrastructure for parallel programming of clusters of machines.
Dean, Jeffrey; Ghemawat, Sanjay, Large-scale data processing in a distributed and parallel processing enviornment.
Matsushita Masayuki,JPX ; Ugajin Atsushi,JPX, Management system and method for parallel computer system.
Dageville, Benoit; Amor, Patrick A., Managing parallel execution of work granules according to their affinity.
Waddington William H. ; Cohen Jeffrey I., Method and apparatus for parallel processing aggregates using intermediate aggregate values.
Eichstaedt Matthias ; Lu Qi ; Teng Shang-Hua, Method and apparatus for parallel profile matching in a large scale webcasting system.
Tsuchida Masashi,JPX ; Masai Kazuo,JPX ; Torii Shunichi,JPX, Method and system of database divisional management for parallel database system.
Matsuzawa Hirofumi,JPX ; Fukuda Takeshi,JPX, Method for executing aggregate queries, and computer system.
Waddington William H. ; Tan Leng Leng ; Grewell Patricia, Method for managing shared resources in a multiprocessing computer system.
Waddington William H. ; Tan Leng Leng ; Grewell Patricia, Method for managing termination of a lock-holding process using a waiting lock.
Ekanadham Kattamuri ; Moreira Jose Eduardo ; Naik Vijay Krishnarao, Method for resource control in parallel environments using program organization and run-time support.
van Driel, Marinus A., Method for the automatic generation of an interactive electronic equipment documentation package.
Allen, Terry Dennis; Desai, Paramesh S.; Shibamiya, Akira; Tie, Hong Sang; Tsang, Annie S., Method, system, and program for optimizing database query execution.
Chan Lee ; Richard A. Weier ; Robert F. Krick, Multi-tag system and method for cache read/write.
Sudzilouski, Uladzislau; Zaika, Igor, Multi-threaded processes for opening and saving documents.
Douglas P. Brown ; Allen N. Diaz ; Donald R. Pederson, Multi-threading, multi-tasking architecture for a relational database management system.
Hardwick Jonathan C.,GBX, Nested parallel 2D Delaunay triangulation method.
Gulko, Abraham; Mellor, David, Parallel computing system, method and architecture.
Ogi Yoshifumi,JPX, Parallel processor apparatus in which data is divided in sequential stages.
Hirooka, Takashi; Ohta, Hiroshi; Iitsuka, Takayoshi; Kikuchi, Sumio, Parallel program generating method.
Bookman, Lawrence A.; Blair, David Albert; Rosenthal, Steven M.; Krawitz, Robert Louis; Beckerle, Michael J.; Callen, Jerry Lee; Razdow, Allen M.; Mudambi, Shyam R., Segmentation and processing of continuous data streams using transactional semantics.
Dean, Jeffrey; Ghemawat, Sanjay, System and method for efficient large-scale data processing.
Dean, Jeffrey; Ghemawat, Sanjay, System and method for large-scale data processing using an application-independent framework.
Malewicz, Grzegorz; Dvorsky, Marian; Colohan, Christopher B.; Thomson, Derek P.; Levenberg, Joshua Louis, System and method for limiting the impact of stragglers in large-scale parallel data processing.
