[Patent] Large-scale data processing in a distributed and parallel processing environment

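IPC Classification Information
Country / Type	United States (US) Patent, Granted
International Patent Classification (IPC 7th ed.)	G06F-015/16
Application No.	UP-0871245 (2004-06-18)
Registration No.	US-7756919 (2010-08-02)
Inventors / Address	Dean, Jeffrey; Ghemawat, Sanjay
Applicant / Address	Google Inc.
Agent / Address	Morgan, Lewis & Bockius LLP
Citation Info	Times cited: 32; Patents cited: 4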

Abstract

A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
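The division of labor described in the abstract can be illustrated with a minimal single-machine sketch. It is not the patented implementation: the driver `run_job`, the partitioning by `hash(key)`, the use of a thread pool in place of multiple processors, and the word-count operations are all assumptions chosen for illustration. The driver plays the role of the application-independent modules, while `word_count_map` and `word_count_reduce` stand in for the application-specific operations.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_job(blocks, map_fn, reduce_fn, num_partitions=2):
    """Hypothetical application-independent driver: parallel map, then parallel reduce."""

    def map_task(block):
        # Each map task writes to its own set of intermediate structures,
        # one per reduce partition (partitioning by hash is an assumption).
        parts = [defaultdict(list) for _ in range(num_partitions)]
        for key, value in map_fn(block):
            parts[hash(key) % num_partitions][key].append(value)
        return parts

    def reduce_task(index, map_outputs):
        # Retrieve this partition's pairs from every map task and
        # combine the values that share the same key.
        grouped = defaultdict(list)
        for parts in map_outputs:
            for key, values in parts[index].items():
                grouped[key].extend(values)
        return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}

    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, blocks))
        reduced = list(pool.map(lambda i: reduce_task(i, map_outputs), range(num_partitions)))

    output = {}
    for partial in reduced:
        output.update(partial)
    return output


# Application-specific operations (word count, used only as an example).
def word_count_map(block):
    for word in block.split():
        yield word.lower(), 1


def word_count_reduce(key, values):
    return sum(values)


counts = run_job(["the quick fox", "the lazy dog", "The end"], word_count_map, word_count_reduce)
```

Swapping in a different `map_fn` and `reduce_fn` changes the job without touching the driver, which is the separation the abstract draws between application-independent modules and application-specific operations.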

Representative Claims

What is claimed is:
1. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: a plurality of processes executing on a plurality of interconnected processors; the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes; wherein the supervisory process is for assigning input data blocks of the set of input data to respective map processes of the plurality of map processes; wherein each of the plurality of map processes includes an application-independent map module for retrieving an input data block assigned thereto by the supervisory process, reading portions of the input data block, and applying an application-specific map operation to the input data block to produce intermediate key-value pairs, wherein at least two of the plurality of map processes operate simultaneously so as to perform the map operation in parallel on multiple input data blocks; a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and wherein each of the plurality of reduce processes includes an application-independent reduce module for retrieving a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and applying an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output, wherein at least two of the plurality of reduce processes operate simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs.
2. The system of claim 1, wherein the supervisory process is configured to automatically determine a number of distinct map tasks and a number of distinct reduce tasks to perform the data processing job, and to automatically assign the map tasks and reduce tasks to the map processes and reduce processes in accordance with availability of the processes executing on the plurality of interconnected processors.
3. The system of claim 2, wherein the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks, and wherein the supervisory process maintains status information with respect to map tasks awaiting assignment to a process.
4. The system of claim 2, wherein the system includes a file system for storing files and making files available to the processes executing on the plurality of interconnected processors, and wherein the intermediate data structures are files stored in the file system.
5. A method of performing a large-scale data processing job, comprising, in a distributed and parallel processing environment including a set of interconnected computer systems: executing a plurality of processes on a plurality of interconnected processors, the plurality of processes including a supervisory process for coordinating a data processing job for processing a set of input data, and a plurality of map processes and a plurality of reduce processes; in the supervisory process, assigning input data blocks of a set of input data to respective map processes of the plurality of map processes; in each of the plurality of map processes, executing an application-independent map module to retrieve an input data block assigned thereto by the supervisory process, to read portions of the input data block and to apply an application-specific map operation to the input data block to produce intermediate key-value pairs; including operating at least two of the plurality of the map processes simultaneously so as to perform the map operation in parallel on multiple input data blocks; storing the intermediate key-value pairs in a plurality of intermediate data structures; and in each of the plurality of reduce processes, executing an application-independent reduce module to retrieve a respective subset of the intermediate key-value pairs from a subset of the intermediate data structures and to apply an application-specific reduce operation to the respective subset of intermediate key-value pairs, including combining respective intermediate values sharing the same key, to provide output data; including operating at least two of the plurality of reduce processes simultaneously so as to perform the reduce operation in parallel on multiple respective subsets of the intermediate key-value pairs.
6. The method of claim 5, including, in the supervisory process, automatically determining a number of distinct map tasks and a number of distinct reduce tasks to perform the data processing job, and automatically assigning the map tasks and reduce tasks to the map processes and reduce processes in accordance with availability of the processes executing on the plurality of interconnected processors.
7. The method of claim 6, wherein the number of map tasks exceeds in number the plurality of processes to which the supervisory process can assign map tasks, and wherein the supervisory process maintains status information with respect to map tasks awaiting assignment to a process.
8. The method of claim 5, including storing the intermediate data structures as files in a file system, and in the plurality of reduce processes, retrieving respective ones of the files so as to retrieve the respective subsets of the intermediate key-value pairs.
9. A system for large-scale processing of data in a distributed and parallel processing environment comprising: a set of interconnected computing systems, each having memory and one or more processors; one or more application-independent map modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application-independent map modules automatically parallelize the map operation across multiple processors in the parallel processing environment; a plurality of intermediate data structures, the intermediate data structures adapted for storing the intermediate key-value pairs; and one or more application-independent reduce modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data.
10. The system of claim 9, wherein the one or more application-independent reduce modules automatically parallelize the reduce operation across multiple processors in the parallel processing environment.
11. The system of claim 9, wherein the reduce operation includes sorting the intermediate key-value pairs and combining values having the same key.
12. The system of claim 9, wherein the intermediate data structure includes one or more intermediate data files coupled to each map module for storing intermediate data values.
13. The system of claim 9, wherein the output data is written to a file system accessible via a distributed network.
14. The system of claim 9, further comprising: a work queue master module coupled to at least one of the map and reduce modules and configured to assign at least one of the map and reduce operations to an idle process coupled to a distributed network.
15. The system of claim 9, wherein at least one application-specific map operation includes an application-specific combiner operation for combining initial data values produced by at least one application-specific map operation so as to produce the intermediate key-values.
16. The system of claim 9, wherein at least one application-specific map operation includes an application-specific combiner operation for combining initial data values produced by at least one application-specific map operation having shared key values so as to produce the intermediate key-value pairs.
17. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: a set of interconnected computing systems, each having memory and one or more processors; a set of application independent map modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for reading portions of input files containing data, and for applying at least one user-specified, application-specific map operation to the data to produce intermediate key-value pairs; a set of intermediate data structures distributed throughout the distributed and parallel processing environment for storing the intermediate key-value pairs; and a set of application independent reduce modules, stored in the memory of one or more of the computing systems for execution by one or more processors of one or more of the computing systems, for applying at least one user-specified, application-specific reduce operation to the intermediate key-value pairs so as to combine intermediate values sharing the same key.
18. The system of claim 17, wherein at least one of the map and reduce operations is automatically parallelized across multiple processors in the parallel processing environment.
19. The system of claim 17, wherein the key and the combined intermediate values are written to a file system on a distributed network.
20. The system of claim 17, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation so as to produce the intermediate key-value pairs.
21. The system of claim 17, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation having shared key values so as to produce the intermediate key-value pairs.
22. A large-scale data processing method in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: in one or more application-independent map modules, reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application independent map modules automatically parallelize the map operation across multiple processors in the parallel processing environment; storing the intermediate key-value pairs produced by the map operation in a plurality of intermediate data structures adapted for storing the intermediate key-value pairs; and in one or more application-independent reduce modules, retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data.
23. The method of claim 22, further comprising: automatically parallelizing the reduce operation across multiple processors in the parallel processing environment using an application independent methodology; and storing the output data produced by the reduce operation.
24. The method of claim 23, wherein the input data includes key-value pairs.
25. The method of claim 24, wherein the reduce operation includes sorting the intermediate key-value pairs.
26. The method of claim 22, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation so as to produce the intermediate key-value pairs.
27. The method of claim 22, wherein the map operation includes an application-specific combiner operation for combining initial data values produced by the map operation having shared key values so as to produce the intermediate key-value pairs.
28. A computer-readable storage medium having stored thereon instructions, which, when executed by a plurality of processors, cause the plurality of processors to perform the operations of: in one or more application-independent map modules, reading input data and applying at least one application-specific map operation to the input data to produce intermediate key-value pairs, wherein the one or more application-independent map modules automatically parallelize the map operation across multiple processors in a parallel processing environment; storing the intermediate key-value pairs produced by the map operation in a plurality of intermediate data structures adapted for storing the intermediate key-value pairs; and in one or more application-independent reduce modules, retrieving the intermediate key-value pairs from the intermediate data structures and applying at least one application-specific reduce operation to the intermediate key-value pairs, including combining respective intermediate values sharing the same key to provide output data.
29. The computer-readable storage medium of claim 28, wherein the instructions additionally include instructions, to perform the operations of: automatically parallelizing the reduce operation across multiple processors in the parallel processing environment using an application independent methodology; and storing the output data produced by the reduce operation.
30. A system for large-scale processing of data in a distributed and parallel processing environment including a set of interconnected computing systems, comprising: means for applying at least one application-specific map operation to input data to produce intermediate key-value pairs; means for automatically parallelizing the map operation across multiple processors in the parallel processing environment using an application independent methodology; means for storing the intermediate key-value pairs produced by the map operation; means for retrieving a respective subset of the intermediate data values key-value pairs and applying at least one application-specific reduce operation to the intermediate key-value pairs including combining respective intermediate values sharing the same key; means for automatically parallelizing the reduce operation across multiple processors in the parallel processing environment using an application independent methodology; and means for storing output data values produced by the reduce operation.
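Claims 3 and 7 describe a supervisory process that keeps status information for map tasks awaiting assignment when there are more tasks than processes. The following is a minimal sketch of that bookkeeping, assuming a hypothetical `Supervisor` class, three illustrative status labels, and a simplified round-based simulation in which every assigned task finishes immediately. None of these names come from the patent.

```python
from collections import deque


class Supervisor:
    """Hypothetical coordinator that tracks the status of each map task."""

    def __init__(self, task_ids):
        task_ids = list(task_ids)
        # Status labels ("waiting", "in_progress", "completed") are assumptions.
        self.status = {task_id: "waiting" for task_id in task_ids}
        self.waiting = deque(task_ids)
        self.assigned_to = {}

    def assign(self, worker_id):
        """Hand the next waiting task to an available worker, or None if none remain."""
        if not self.waiting:
            return None
        task_id = self.waiting.popleft()
        self.status[task_id] = "in_progress"
        self.assigned_to[task_id] = worker_id
        return task_id

    def complete(self, task_id):
        self.status[task_id] = "completed"

    def done(self):
        return all(state == "completed" for state in self.status.values())


supervisor = Supervisor(task_ids=range(5))  # five map tasks
workers = ["w0", "w1"]                      # only two processes available
rounds = 0
while not supervisor.done():
    # Each available worker takes at most one task per round, then reports back.
    running = [(worker, supervisor.assign(worker)) for worker in workers]
    for worker, task_id in running:
        if task_id is not None:
            supervisor.complete(task_id)
    rounds += 1
```

With five tasks and two workers the simulation needs three rounds, and the tasks not yet handed out remain in the `waiting` queue with a `"waiting"` status until a worker becomes available.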

Patents Cited by This Patent (4)

McMillen Robert J.; Watson M. Cameron; Chura David J., Computer system using a master processor to automatically reconfigure faulty switch node that is detected and reported.
Hardwick Jonathan C., GBX, Dynamic load balancing among processors in a parallel computer.
Shimon Muller; Denton E. Gentry, Jr.; John E. Watkins; Linda T. Cheng, High performance network interface.
Matsushita Masayuki, JPX; Ugajin Atsushi, JPX, Management system and method for parallel computer system.

Patents Citing This Patent (32)

Beda, III, Joseph S.; Czajkowski, Grzegorz J.; Zhao, Jerry, Clustering for parallel processing.
Beda, III, Joseph S.; Czajkowski, Grzegorz J.; Zhao, Yonggang, Clustering for parallel processing.
Harvey, Richard H.; McDonald, Justin J.; Ramsay, Ronald W., Directory partitioned system and method.
Narang, Ankur; Soman, Jyothish, Distributed data scalable adaptive map-reduce framework.
Narang, Ankur; Soman, Jyothish, Distributed data scalable adaptive map-reduce framework.
Verma, Abhishek; Cherkasova, Ludmila; Kumar, Vijay S., Job scheduling based on map stage and reduce stage duration.
Lee, Seung Won; Lee, Shi Hwa; Kang, Dong-In; Kang, Mikyung, Method and system for dynamically parallelizing application program.
Lee, Seung Won; Lee, Shi Hwa; Kang, Dong-In; Kang, Mikyung, Method and system for dynamically parallelizing application program.
Bogrett, Steven, Method and system for performing transactional updates in a key-value store.
Douetteau, Florian; Boubrik, Abdelmajid; Bordier, Jeremie; Luzzardi, Andrea; Moal, Tanguy, Method and system for processing information of a stream of information.
Goda, Kazuo; Yamada, Hiroyuki; Kitsuregawa, Masaru; Kawamura, Nobuo; Fujiwara, Shinji; Mogi, Kazuhiko, Parallel data processing system, computer, and parallel data processing method.
Singh, Gyanit; Chiu, Chi-Hsien; Sundaresan, Neelakantan, Parallel data stream processing system.
Florissi, Patricia G. S.; Vijendra, Sudhir, Parallel modeling and execution framework for distributed computation and file system access.
Lipton, Daniel; Lhotak, Vladimir, Providing resumption data in a distributed processing system.
Chandramouli, Badrish; Goldstein, Jonathan; Jin, Xin; Raman, Balan Sethu; Duan, Songyun, Real-time-ready behavioral targeting in a large-scale advertisement system.
Ellis, Edric; Martin, Jocelyn; Stefansson, Halldor N., State management for task queues.
Pike, Robert C.; Quinlan, Sean; Dorward, Sean M.; Dean, Jeffrey; Ghemawat, Sanjay, System and method for analyzing data records.
Dean, Jeffrey; Ghemawat, Sanjay, System and method for large-scale data processing using an application-independent framework.
Dean, Jeffrey; Ghemawat, Sanjay, System and method for large-scale data processing using an application-independent framework.
Malewicz, Grzegorz; Dvorsky, Marian; Colohan, Christopher B.; Thomson, Derek P.; Levenberg, Joshua Louis, System and method for limiting the impact of stragglers in large-scale parallel data processing.
Malewicz, Grzegorz; Dvorsky, Marian; Colohan, Christopher B.; Thomson, Derek P.; Levenberg, Joshua Louis, System and method for limiting the impact of stragglers in large-scale parallel data processing.
Malewicz, Grzegorz; Dvorsky, Marian; Colohan, Christopher B.; Thomson, Derek P.; Levenberg, Joshua Louis, System and method for limiting the impact of stragglers in large-scale parallel data processing.
Malewicz, Grzegorz; Dvorsky, Marian; Colohan, Christopher B.; Thomson, Derek P.; Levenberg, Joshua Louis, System and method for limiting the impact of stragglers in large-scale parallel data processing.
Shao, Feng, System and method for offline data generation for online system analysis.
Gupta, Rajeev; Ravindra, Padmashree; Roy, Prasan, System and method for shared execution of mixed data flows.
Gupta, Rajeev; Ravindra, Padmashree; Roy, Prasan, System and method for shared execution of mixed data flows.
Maddhirala, Anil K.; Subbarayan, Ravikumar; Chinnathambi, Senthil K., System and method of multithreaded processing across multiple servers.
Dasdan, Ali, System and/or method for balancing allocation of data among reduce processes by reallocation.
Lam, Wang Chee; Liu, Lu; Siripurapu, Taraka Subrahmanya Prasad; Rajaraman, Anand; Vacheri, Zoheb; Doan, AnHai, Systems and methods for event stream processing.
Lam, Wang Chee; Liu, Lu; Siripurapu, Taraka Subrahmanya Prasad; Rajaraman, Anand; Vacheri, Zoheb; Doan, AnHai, Systems and methods for event stream processing.
Vijendra, Sudhir; Florissi, Patricia G. S., Worldwide distributed file system model.
Florissi, Patricia G. S.; Vijendra, Sudhir, Worldwide distributed job and tasks computational model.
