[Patent] Data profiling


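IPC classification information
Country / Type	United States (US) Patent, Granted
International Patent Classification (IPC 7th ed.)	G06F-017/30
Application number	US-0941402 (2004-09-15)
Registration number	US-8868580 (2014-10-21)
Inventor / Address	Gould, Joel; Feynman, Carl; Bay, Paul
Applicant / Address	Ab Initio Technology LLC
Agent / Address	Fish & Richardson P.C.
Citation information	Times cited: 6; Patents cited: 25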

Abstract

Processing data includes profiling data from a data source, including reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data. The data is then processed from the data source. This processing includes accessing the stored profile information and processing the data according to the accessed profile information.
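
The two phases in the abstract (profile first, then process using the stored profile) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names `profile_source` and `process_with_profile`, the in-memory list standing in for the data source, the chosen statistics, and the null-filling rule are all assumptions made for illustration.

```python
from collections import Counter


def profile_source(records):
    """Read the records once, computing summary data while reading."""
    summary = {}
    for record in records:  # single pass over the (hypothetical) data source
        for field, value in record.items():
            stats = summary.setdefault(
                field, {"count": 0, "nulls": 0, "values": Counter()}
            )
            stats["count"] += 1
            if value is None:
                stats["nulls"] += 1
            else:
                stats["values"][value] += 1
    # Profile information derived from the summary data.
    return {
        field: {
            "count": s["count"],
            "nulls": s["nulls"],
            "distinct": len(s["values"]),
            "most_common": s["values"].most_common(1)[0][0] if s["values"] else None,
        }
        for field, s in summary.items()
    }


def process_with_profile(records, profile):
    """Re-read the data and process it according to the stored profile.

    Assumed processing rule (illustrative only): replace a missing value
    with the most common value observed for that field during profiling.
    """
    return [
        {f: (profile[f]["most_common"] if v is None else v) for f, v in rec.items()}
        for rec in records
    ]


records = [
    {"state": "MA", "zip": "02421"},
    {"state": "MA", "zip": "02139"},
    {"state": None, "zip": "10001"},
]
profile = profile_source(records)                # phase 1: profile and store
result = process_with_profile(records, profile)  # phase 2: process using the profile
```

In this sketch the profile is kept in a dictionary; the abstract only requires that profile information be stored and later accessed, so any persistent store could play that role.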

Representative claims

1. A method for processing data including:
reading data records, in a dataflow graph, from a data source, the dataflow graph including components and links, wherein the links direct flows of data between components;
profiling the data records using the dataflow graph including sending the data records on a first link to a partitioning component, the partitioning component partitioning the data records among a plurality of partitions;
generating, in each partition, by a canonicalize component in communication with the partitioning component, a flow of census elements on a respective one of a first plurality of links including generating a plurality of census elements for each data record, each census element including: a field of the data, a corresponding value occurring within the field of the data record;
generating, in each partition, by a rollup component in communication with a corresponding canonicalize component, a flow of output census elements on a respective one of a second plurality of links including combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
combining the flows of output census elements in each partition, and partitioning the combined flow of output census elements by the field and the value;
adding counts of the number of occurrences of the same value for the same field for the partitioned flows of output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the flow of data records;
storing profile information based on the single census elements; and
processing data from the data source, including accessing the stored profile information, reading the data from the data source after profiling the data from the data source, processing the data according to the accessed profile information, and outputting a result of the processing.

2. The method of claim 1 wherein profiling the data is performed without maintaining a copy of the data outside the data source.

3. The method of claim 2 wherein the data includes variable record structure records with at least one of a conditional field and a variable number of fields.

4. The method of claim 3 wherein profiling the data includes interpreting the variable record structure records.

5. The method of claim 1 wherein the data source includes a data storage system.

6. The method of claim 5 wherein the data storage system includes a database system.

7. The method of claim 1 wherein profiling the data includes counting a number of occurrences for each of a set of distinct values for a field.

8. The method of claim 7 wherein storing profile information includes storing statistics for the field based on the single census elements.

9. The method of claim 1 further including maintaining a metadata store that contains metadata related to the data source.

10. The method of claim 9 wherein storing the profile information includes updating the metadata related to the data source.

11. The method of claim 9 wherein profiling the data records and processing the data each make use of metadata for the data source.

12. The method of claim 1 wherein profiling the data records from the data source further includes determining a format specification based on the profile information.

13. The method of claim 1 wherein profiling the data records further includes determining a validation specification based on the profile information.

14. The method of claim 13 wherein processing the data includes identifying invalid records in the data based on the validation specification.

15. The method of claim 1 wherein profiling the data records further includes specifying data transformation instructions based on the profile information.

16. The method of claim 15 wherein processing the data records includes applying the transformation instructions to the data.

17. The method of claim 1 wherein processing the data includes importing the data into a data storage subsystem.

18. The method of claim 17 wherein processing the data includes validating the data prior to importing the data into a data storage subsystem.

19. The method of claim 18 wherein validating the data includes comparing characteristics of the data to reference characteristics for said data.

20. The method of claim 19 wherein the reference characteristics include statistical properties of the data.

21. The method of claim 1 wherein reading data includes reading the data from a parallel data source, each part of the parallel data source being processed by a different one of a first set of parallel processors and/or computers.

22. Software stored on a non-transitory computer-readable storage medium including executable instructions for causing a computer system to:
read data records, in a dataflow graph, from a data source, the dataflow graph including components and links, wherein the links direct flows of data between components;
profile the data records using the dataflow graph including sending the data records on a first link to a partitioning component, the partitioning component partitioning the data records among a plurality of partitions;
generating, in each partition, a flow of census elements, by a canonicalize component in communication with the partitioning component, on a respective one of a first plurality of links including generating a plurality of census elements identifying a field and a corresponding value for each data record, each census element including: a field of the data, a corresponding value occurring within the field of the data record;
generating, in each partition, a flow of output census elements, by a rollup component in communication with a corresponding canonicalize component, on a respective one of a second plurality of links including combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
combining the flows of output census elements in each partition, and partitioning the combined flow of output census elements by the field and the value;
adding counts of the number of occurrences of the same value for the same field for the partitioned flows of output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the flow of data records;
store profile information based on the single census elements; and
process data from the data source by accessing the stored profile information, reading the data from the data source after profiling the data from the data source, processing the data according to the accessed profile information, and outputting a result of the processing.

23. A data processing system including:
a computer system, including a plurality of processors;
a data source accessible to the computer system;
a data storage subsystem including a non-transitory computer-readable storage medium in communication with the computer system;
with a dataflow graph configured to execute on a plurality of the processors, the dataflow graph including components and links, wherein the links direct flows of data between components and the components including:
a read data component configured to read data records from a data source;
a first partitioning component connected to the read data component by a link and configured to partition the data records among a plurality of partitions corresponding to different processors;
a plurality of canonicalize components, each canonicalize component in communication with the first partitioning component and configured to generate a flow of census elements including generating a plurality of census elements for each data record, each census element including: a field of the data, a corresponding value occurring within the field of the data record;
a plurality of local rollup components, each local rollup component in communication with a canonicalize component and configured to generate, in each partition, a flow of output census elements including combining occurrences of census elements having the same value for the same field into an output census element including the field, the value, and a count of the number of combined census elements;
a second partitioning component connected to each local rollup component in the plurality of local rollup components by a link and configured to combine the flows of output census elements in each partition, and partition the combined flow of output census elements by the field and the value;
a plurality of global rollup components, each global rollup component connected to the second partitioning component by a link and configured to: add counts of the number of occurrences of the same value for the same field for the partitioned flows of output census elements to produce, for each field and corresponding value, a single census element that includes a total count of occurrences of that field and corresponding value in the flow of data records, and store profile information based on the single census elements; and
a processing module connected over communication paths to the data source and the data storage subsystem and configured to execute on the computer system to access the stored profile information, to read data from the data source after the profiling module reads the data from the data source, to process the data from the data source according to the accessed profile information, and to output a result of the processing.

24. The method of claim 1 wherein storing profile information includes storing the total count from the single element for a field and a corresponding value.

25. The method of claim 1, wherein each census element further includes a flag indicating whether the value included in the element is valid.

26. The method of claim 1, wherein each census element further includes a flag indicating whether the value included in the element corresponds to a pre-determined null value.

27. The method of claim 1 wherein the data records are filtered, by a filtering component, to limit profiling to a field of each data record.

28. The method of claim 1 wherein the data records are filtered, by a filtering component, including: determining that a data record is invalid; and sending the data record to an invalid records component on a link.
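
The census pipeline recited in claims 1, 22, and 23 (partition, canonicalize, local rollup, repartition by field and value, global rollup) can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the claimed dataflow graph: every function name here is hypothetical, round-robin is an assumed choice for the first partitioning step, Python's built-in `hash` stands in for the key-based partitioner, and the partitions run one after another rather than on separate processors.

```python
from collections import Counter, defaultdict


def partition_round_robin(records, n):
    """First partitioning step: spread records across n partitions."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def canonicalize(records):
    """Emit one census element (field, value) for each field of each record."""
    for rec in records:
        for field, value in rec.items():
            yield (field, value)


def local_rollup(census_elements):
    """Combine identical (field, value) elements within one partition into counts."""
    return Counter(census_elements)


def partition_by_key(local_counts, n):
    """Second partitioning step: route each (field, value) key to one partition.

    Every partial count for a given key lands in the same partition, so a
    single global rollup sees all occurrences of the keys it owns.
    """
    parts = [defaultdict(list) for _ in range(n)]
    for counts in local_counts:
        for key, count in counts.items():
            parts[hash(key) % n][key].append(count)
    return parts


def global_rollup(partial_counts):
    """Add partial counts to get a single census element per (field, value)."""
    return {key: sum(counts) for key, counts in partial_counts.items()}


def profile_records(records, n_partitions=2):
    parts = partition_round_robin(records, n_partitions)
    local = [local_rollup(canonicalize(p)) for p in parts]
    census = {}
    for shard in partition_by_key(local, n_partitions):
        census.update(global_rollup(shard))
    return census


records = [
    {"state": "MA", "city": "Lexington"},
    {"state": "MA", "city": "Boston"},
    {"state": "NY", "city": "Albany"},
    {"state": "MA", "city": "Boston"},
]
census = profile_records(records)
```

The local rollup reduces the volume of data that has to cross the second partitioning step: each partition forwards one counted element per distinct (field, value) pair instead of one element per record field. The validity and null flags of claims 25 and 26 are omitted from this sketch.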

Patents cited by this patent (25)

Miller, Timothy Edward; Tate, Brian Don; Rollins, Anthony Lowell, Analytic logical data model.
Reiner, David; Miller, Jeffrey M.; Wheat, David C., Apparatus and method for decomposing database queries for database management system including multiprocessor digital d.
Santosuosso, John M., Byte-code representations of actual data to reduce network traffic in database transactions.
Klein, Laurence C., Computer system and method of data analysis.
Leppard, Andrew, Considering multiple lookups in bloom filter decision making.
Pathria, Anu K; Allmon, Andrea L; de Traversay, Jean; Ianakiev, Krassimir G; Suresh, Nallan C; Tyler, Michael K, Consistency modeling of healthcare claims to detect fraud and abuse.
Caldwell, Donald F; Church, Kenneth Ward; Fowler, Glenn Stephen, Data compression method and apparatus.
Okaue, Takumi, Data processing system, data processing method, and program providing medium.
Gibbons, Phillip B., Distinct sampling system and a method of distinct sampling for optimizing distinct value query estimates.
Stanfill, Craig W.; Lasser, Clifford A.; Lordi, Robert D., Executing computations expressed as graphs.
Gould, Joel; Feynman, Carl; Bay, Paul, Functional dependency data profiling.
Gould, Joel; Feynman, Carl; Bay, Paul, Joint field profiling.
Kliebhan, Daniel F., Method and apparatus for merging telephone switching office databases.
Bayliss, David, Method and system for processing data records.
Rathbun, Kyle R., Method for creating and using parallel data structures.
Chen, Li-Wen; Feng, Hwa Chung, Method for dynamically creating a profile.
Homma, Koichi (Yokohama JPX); Kagami, Akira (Kawasaki JPX); Akashi, Kichizo (Ebina JPX); Hirata, Shigeki (Kamakura JPX); Mori, Hiroshi (Ebina JPX); Aizawa, Takayuki (Matsudo JPX), Method of multi-dimensional analysis and display for a large volume of record information items and a system therefor.
Reed, Kenneth L.; Hariharan, Hari S.; Saito, Michiko, Multi-dimensional segmentation for use in a customer interaction.
Ditlow, Gary S.; Dooling, Daria R.; Moran, David E.; Williams, Ralph J., Partitioning and load balancing graphical shape data for parallel applications.
Ananian, John Allen, Personalized interactive digital catalog profiling.
Santosuosso, John M., Pre-formatted column-level caching to improve client performance.
Bookman, Lawrence A.; Blair, David Albert; Rosenthal, Steven M.; Krawitz, Robert Louise; Beckerle, Michael J.; Callen, Jerry Lee; Razdow, Allen; Mudambi, Shyam R., Segmentation and processing of continuous data streams using transactional semantics.
Gavan, John; Paul, Kevin; Richards, Jim; Dallas, Charles A.; Van Arkel, Hans; Herrington, Cheryl; Mahone, Saralyn; Curtis, Terril J.; Wagner, James J., System and method for detecting and managing fraud.
Agrawal, Rakesh; Shafer, John Christopher, System and method for parallel mining of association rules in databases.
Blakeley, Jose A.; Zhang, Hongang; Rathakrishnan, Balaji; Venkatesh, Ramachandran; Sezgin, Beysim; Boukouvalas, Alexios; Galindo Legaria, Cesar A.; Carlin, Peter A., System and method for providing user defined aggregates in a database system.

Patents citing this patent (6)

Bostick, James E.; Ganci, Jr., John M.; Singh, Arvind; Wenk, David S., Automated value analysis in legacy data.
Wuchner, Egon, Management apparatus and method for managing data elements.
Bach, Edward; Oberdorf, Richard; Larson, Brond, Managing lineage information.
Johnson, Robert; Dimitrov, Boris, Methods for stratified sampling-based query execution.
Johnson, Robert; Abraham, Lior; Johnson, Ann; Dimitrov, Boris; Fossgreen, Don, System and methods for rapid data analysis.
Johnson, Robert; Abraham, Lior; Johnson, Ann; Dimitrov, Boris; Fossgreen, Don, Systems and methods for rapid data analysis.

