IPC분류정보
국가/구분 |
United States(US) Patent
등록
|
국제특허분류(IPC7판) |
|
출원번호 |
UP-0599655
(2005-03-18)
|
등록번호 |
US-7599927
(2009-10-20)
|
우선권정보 |
FR-04 03556(2004-04-05) |
국제출원번호 |
PCT/FR05/000673
(2005-03-18)
|
§371/§102 date |
20070717
(20070717)
|
국제공개번호 |
WO05/101292
(2005-10-27)
|
발명자
/ 주소 |
|
출원인 / 주소 |
|
대리인 / 주소 |
|
인용정보 |
피인용 횟수 :
18 인용 특허 :
5 |
초록
▼
The present invention relates to searching content, particularly for at least one extract common to a first data file and a second data file. The method comprises a preliminary step of preparing at least the first file by (a) dividing the first file into a series of data packets having a predetermi
The present invention relates to searching content, particularly for at least one extract common to a first data file and a second data file. The method comprises a preliminary step of preparing at least the first file by (a) dividing the first file into a series of data packets having a predetermined size, and identifying packet addresses in said file, (b) combining each packet address with a digital signature that defines one of three fuzzy logic states, namely true, false and indeterminate, and is the result of a combinatorial computation on data from said file; whereafter said method comprises performing an actual search for a common extract by (c) comparing the fuzzy logic states combined with each packet address of the first file with fuzzy logic states determined on the basis of data from the second file, and (d) removing from said common extract search the respective address pairs from the first and second files that have the respective logic states true and false or false and true, and retaining the other address pairs that identify data packets that may comprise said common extract.
대표청구항
▼
The invention claimed is: 1. A method implemented by computer means of searching content, for at least one extract common to a first file and to a second file, in the form of binary data, being searched for, wherein the method comprises: a prior preparation of the first file at least, comprising th
The invention claimed is: 1. A method implemented by computer means of searching content, for at least one extract common to a first file and to a second file, in the form of binary data, being searched for, wherein the method comprises: a prior preparation of the first file at least, comprising the following steps: a) segmenting the first file into a succession of data packets, of chosen size, and identifying addresses of packets in said file, b) associating with the address of each packet a digital signature defining a fuzzy logic state from among at least three states: "true", "false" and "undetermined", said signature resulting from a combinatorial calculation on data emanating from said file, and the search for common extract, itself, comprising the following steps: c) comparing the fuzzy logic states associated with each packet address of the first file, with fuzzy logic states determined on the basis of data emanating from the second file, d) eliminating from said search for common extract, pairs of respective addresses of the first and second files whose respective logic states are "true" and "false" or "false" and "true", and preserving the other pairs of addresses identifying data packets liable to comprise said common extract. 2. The method as claimed in claim 1, wherein, in step b), a data packet is assigned the state: "true" if all the data of the packet satisfy a first condition, "false" if all the data of the packet satisfy a second condition, contrary to the first condition, and "undetermined" if certain data of the packet satisfy the first condition, while other data of the packet satisfy the second condition. 3. The method as claimed in claim 1, wherein a processing prior to step b) is applied to the data of a file, said processing comprising the following steps: a1) the data of the file are considered as a string of samples obtained at a predetermined sampling frequency Fe, and of values coded according to a binary representation code, and a2) a digital filter is applied to said samples fn, said filter being adapted to minimize a probability of obtaining the "undetermined" state for the digital signatures associated with the packets of samples. 4. The method as claimed in claim 3, wherein the application of said digital filter amounts to: applying a spectral transform to the sampled data, applying a low-pass filter to said spectral transform, and applying an inverse spectral transform after said low-pass filter. 5. The method as claimed in claim 4, wherein the low-pass filter operates on a frequency band comprising substantially the interval: [-Fe/2(k-1),+Fe/2(k-1)], where Fe is said sampling frequency, and k is the number of samples that a packet comprises. 6. The method as claimed in claim 4, wherein said digital filter comprises a predetermined number of coefficients of like value, and wherein the frequency response of the associated low-pass filter is expressed, as a function of frequency f, by an expression of the type: sin(PI·f·T)/(PI·f·T), where sin( ) is the sine function, and with: PI=3.1416, and T=(K-1)/Fe where K is said predetermined number of coefficients and Fe said sampling frequency. 7. The method as claimed in claim 6, wherein said predetermined number of coefficients of the filter (2K+1) is greater than or equal to 2k-1, where k is the number of samples that a packet comprises. 8. The method as claimed in claim 3, wherein: said digital filter is a mean value filter of a predetermined number (2K+1) of coefficients, the difference between two successive filtered samples (rn+1-rn) is proportional to the difference between two unfiltered samples (fn+K+1-fn-K), respectively of a first rank and of a second rank, said ranks being spaced apart by said predetermined number of coefficients, and the calculation of said filtered samples is performed by utilizing this relation to reduce the number of calculation operations to be performed. 9. The method as claimed in claim 3, wherein: the "true" state is assigned to the address of a packet if, for this packet, all the filtered samples have a value greater than a chosen reference value, the "false" state is assigned to the address of a packet if, for this packet, all the filtered samples have a value less than a chosen reference value, and the "undetermined" state is assigned to the address of a packet if, for this packet, the filtered samples have, for certain of them, a value less than said reference value, and, for other filtered samples, a value greater than said reference value. 10. The method as claimed in claim 9, wherein, for any filtered sample rn of given order n, said reference value Vref is calculated by averaging the values of the unfiltered samples fk over a chosen number of unfiltered consecutive samples Kref about an unfiltered sample fn of the same given order n. 11. The method as claimed in claim 10, wherein the values of the filtered samples are made relative, for comparison, to a zero threshold value, and wherein said filtered samples r'n are expressed by a sum of the type: fn+k are unfiltered samples obtained in step a1), K is the number of coefficients of the digital filter, preferably chosen to be even, and Kref is said number of unfiltered samples around an unfiltered sample fn, preferably chosen to be even and greater than said number of coefficients K. 12. The method as claimed in claim 11, wherein said sum is applied to the unfiltered samples fn a plurality of times, according to a processing performed in parallel, while respectively varying the number of coefficients K. 13. The method as claimed in claim 3, wherein each filtered sample rn is expressed as a sum of the type: f(n+i) are unfiltered samples, filteri are coefficients of a digital filter, integrating, as the case may be, a threshold value referred to zero, and wherein a number k of unfiltered samples that a packet comprises is chosen, at minimum equal to 2 and less than or equal to an expression of the type: (TEF-I1-I2+1)/2, where TEF is a desired minimum size of the common extracts searched for. 14. The method as claimed in claim 13, wherein: for a given value TEF of the desired minimum size of common extracts searched for, a span of usable values for said number k of unfiltered samples that a packet comprises is determined, and, for each usable value of the number k, an optimal size TES is determined of a succession of data of digital signatures, for which succession the detection of a common extract of size TEF is guaranteed, and wherein said optimal size TES is less than or equal to an expression of the type: E[(TEF-I1-I2+1)/k]-1, where E(X) designates the integer part of X. 15. The method as claimed in claim 3, wherein the method comprises, for the first file: the sampling at a chosen sampling frequency, the digital filtering corresponding to a low-pass filtering in the frequency space, and the combination of the filtered samples to obtain digital signatures in the "true", "false" or "undetermined" state, associated with the respective addresses of the first file, while the method comprises, for the second file: the sampling at a chosen sampling frequency, the digital filtering corresponding to a low-pass filtering in the frequency space, and the logic state associated with each packet of filtered samples is determined on the basis of the logic state associated with a single filtered sample chosen born each packet, in such a way as to obtain digital signatures comprising only "true" or "false" logic states and thus to improve the selectivity of the comparison of the digital signatures. 16. The method as claimed in claim 15, wherein, if the logic state associated with an address of the first file is "true" or "undetermined", while the logic state associated with an address of the second file is "true", the pair of said addresses is retained born the search of common extract, if the logic state associated with an address of the first file is "false" or "undetermined", while the logic state associated with an address of the second file is "false", the pair of said addresses is retained for the search for common extract, while the other pairs of addresses are excluded born the search. 17. The method as claimed in claim 1, wherein the fuzzy states associated with the first file at least are each coded on at least two bits. 18. The method as claimed in claim 17, wherein the fuzzy states determined for a least number of coefficients K are coded on least significant bits and the fuzzy states determined for a larger number of coefficients K are coded on subsequent bits, up to a chosen total number of bits. 19. The method as claimed in claim 1, the two files to be compared comprising data representative of alphanumeric characters, and in particular representative of a text and/or a computer or a genetic code, wherein the method comprises: a first group of steps comprising the formation of the digital signatures and their comparison, for a coarse search, and a second group of steps comprising an identicalness comparison in the spans of addresses satisfying the coarse comparison, and in that the data of a file are considered as a string of samples, with a chosen number k of samples per packet, and wherein the value of this chosen number k is optimized initially by searching for a minimum of comparison operations to be performed. 20. The method as claimed in claim 19, wherein, for the optimization of the chosen number k of samples per packet, account is taken of a total number: of operations of comparison of digital signatures to be performed, and of operations of identicalness comparison of data to be performed thereafter, and wherein said total number of operations is a minimum for a finite set of numbers k. 21. The method as claimed in claim 19, wherein an information relating to a minimum desired size of common extracts searched for is obtained, and used to optimize said chosen number k of samples per packet, and wherein the optimal number k of samples per packet varies substantially as said minimum size so that the larger desired minimum size of common extracts searched for, the shorter the duration of the search for common extract. 22. The method as claimed in claim 1, wherein it comprises the search for common extracts consists of a single group of steps comprising the formation of the digital signatures and their comparison, and wherein the number of data items per packet is optimized by initially fixing a confidence index characterizing an acceptable threshold of probability of false detection of common extracts. 23. The method as claimed in claim 22, the first and second files being files of samples of digitized signals, wherein the method comprises a step of preprocessing of the data and a taking into account of the data associated with portions of signal of level greater than a noise reference. 24. The method as claimed in claim 22, the first and second files are files of samples of digitized signals, wherein the method provides for a step of consolidation of the search results, preferably by adjustment of relative sizes of the packets of the first and second files, in such a way as to tolerate a discrepancy in respective speeds of retrieval of the first and second files. 25. The method as claimed in claim 1, wherein one at least of the first and second files is a data stream, and wherein the method of searching for common extracts is executed in real time. 26. A computer program product, stored in a memory of a central unit of a computer or on a removable medium to cooperate with a reader of said central unit, the program comprising instructions for conducting steps for searching at least one extract common to a first file and to a second file, in the form of binary data, said steps comprising: a prior preparation of the first file at least, including the following steps: a) segmenting the first file into a succession of data packets, of chosen size, and identifying addresses of packets in said file, b) associating with the address of each packet a digital signature defining a fuzzy logic state from among at least three states: "true", "false" and "undetermined", said signature resulting from a combinatorial calculation on data emanating from said file, and the search for common extract, itself, including the following steps: c) comparing the fuzzy logic states associated with each packet address of the first file, with fuzzy logic states determined on the basis of data emanating from the second file, d) eliminating from said search for common extract, pairs of respective addresses of the first and second files whose respective logic states are "true" and "false" or "false" and "true", and preserving the other pairs of addresses identifying data packets liable to comprise said common extract. 27. A computer device, comprising a memory for storing at least first and second files, for the search for at least one extract common to the first file and the second file, wherein it comprises a memory suitable for storing the instructions of a computer program product as claimed in claim 26. 28. A computer installation, comprising: a first computer entity suitable for storing a first file, a second computer entity suitable for storing a second file, and means of communications between the first and second computer units, wherein at least one of said entities comprises a memory suitable for storing the computer program product as claimed in claim 26, for the search of extract common to the first and second files. 29. The installation as claimed in claim 28, wherein the entity storing the computer program product is designed to perform a remote update of one of the first and second files with respect to the other of the first and second files. 30. A data structure stored on a computer-readable storage medium used for a search of at least one extract common to a first and a second file, the data structure being representative of the first file, the data structure being obtained by the implementation of the following steps: a) segmenting the first file into a succession of data packets of chosen size, and identifying addresses of packets in said file, b) associating with the address of each packet a digital signature defining a fuzzy logic state from among at least three states: "true", "false" and "undetermined", said signature resulting from a combinatorial calculation on data emanating from said file, the data structure thus comprising a succession of addresses identifying addresses of the first file, a fuzzy logic state from among the states: "true", "false" and "undetermined", being assigned to each of said addresses of the data structure, c) comparing the fuzzy logic states associated with each packet address of the first file, with fuzzy logic states determined on the basis of data emanating from the second file, d) eliminating from said search for common extract, pairs of respective addresses of the first and second files whose respective logic states are "true" and "false" or "false" and "true", and preserving the other pairs of addresses identifying data packets liable to comprise said common extract.
※ AI-Helper는 부적절한 답변을 할 수 있습니다.