Delta compression after identity deduplication
원문보기
IPC분류정보
국가/구분
United States(US) Patent
등록
국제특허분류(IPC7판)
G06F-017/00
G06F-007/00
G06F-011/14
출원번호
US-0291998
(2008-11-14)
등록번호
US-8751462
(2014-06-10)
발명자
/ 주소
Huang, Mark
Lee, Edward K.
Li, Kai
Shilane, Philip
Wallace, Grant
Zhu, Ming Benjamin
출원인 / 주소
EMC Corporation
대리인 / 주소
Van Pelt, Yi & James LLP
인용정보
피인용 횟수 :
10인용 특허 :
12
초록▼
Delta compression after identity deduplication is disclosed. A first data segment is determined to be identical to a first previous data segment. A second data segment, not determined to be identical to a second previous data segment, is then determined to be similar to a third previous data segment
Delta compression after identity deduplication is disclosed. A first data segment is determined to be identical to a first previous data segment. A second data segment, not determined to be identical to a second previous data segment, is then determined to be similar to a third previous data segment.
대표청구항▼
1. A system for processing data, comprising: a deduplicating system, wherein the deduplicating system comprises a processor and a storage unit, wherein the deduplicating system is configured for determining, using the processor, whether a first data segment is identical to a first previously stored
1. A system for processing data, comprising: a deduplicating system, wherein the deduplicating system comprises a processor and a storage unit, wherein the deduplicating system is configured for determining, using the processor, whether a first data segment is identical to a first previously stored data segment in the storage unit, wherein in the event that the first data segment is determined to be identical to the first previously stored data segment, a reference to the first data segment is stored in the storage unit; anda delta compression system is configured for determining, only in the event that the first data segment is determined not to be identical to the first previously stored data segment, whether the first data segment is similar to a second previously stored data segment, wherein the first data segment is determined to be similar to the second previously stored data segment using a sketch function and the sketch function comprises one or more functions that returns a same value for similar data segments;wherein in the event that the first data segment is determined to be similar to the second previously stored data segment, the delta compression system is further configured for computing an encoding of the first data segment, wherein the encoding comprises determining one or more differences between the first data segment and the second previously stored data segment, and storing, in the storage unit, the first data segment using a sequence comprising the one or more differences, one or more first sequence locations corresponding to each of the one or more differences, a reference to the second previously stored data segment, and one or more second sequence references, wherein the one or more second sequence references corresponding to a sequence of data from within the second previously stored segment identifying the subset of the second previously stored segment; andwherein in the event that the first data segment is not determined to be similar to the second previously stored data segment and that the first data segment is determined not to be identical to the first stored data segment, the delta compression system is further configured for storing the first data segment in the storage unit. 2. The system as in claim 1, wherein the deduplicating system receives a data stream or data block. 3. The system as in claim 2, wherein the deduplicating system breaks the data stream or data block into a plurality of data segments. 4. The system as in claim 1, wherein determining that the first data segment is identical comprises: determining a first data segment ID associated with the first data segment;determining whether the first data segment ID is identical to a previously stored ID in an ID index. 5. The system as in claim 4, where determining the first data segment ID associated with the first data segment uses one or more of the following: a fingerprint function, a hash function, a cryptographic hash function, and a digital signature. 6. The system as in claim 1, further comprising compressing the encoding of the first data segment in the event that the first data segment is determined to be similar to the second previously stored data segment. 7. The system as in claim 1, further comprising transmitting the encoding of the first data segment in the event that the first data segment is determined to be similar to the second previously stored data segment. 8. The system as in claim 1, further comprising replicating the encoding of the first data segment in the event that the first data segment is determined to be similar to the second previously stored data segment. 9. The system as in claim 1, wherein the encoding is based at least in part on the second previously stored data segment. 10. The system as in claim 1, wherein the encoding comprises an indication of a set of data blocks in the first data segment not present in the second previously stored data segment and an indication of a set of data blocks in the second previously stored data segment. 11. The system as in claim 1, wherein in the event that the first data segment is determined to be similar to the second previously stored data segment, the delta compression system further comprises determining whether the encoding is smaller than the first data segment. 12. The system as in claim 1, wherein the sketch function comprises a hash function. 13. The system as in claim 1, wherein the sketch function comprises a plurality of hash functions. 14. The system as in claim 1, wherein the sketch function comprises one or more functions that returns a similar value for similar data segments. 15. The system as in claim 1, wherein the sketch function comprises one or more functions that may return a same value for similar data segments. 16. The system as in claim 1, wherein the sketch function comprises one or more functions that may return a similar value for similar data segments. 17. The system as in claim 16, wherein sketch function values are determined to be similar based on one or more of the following methods: numeric difference, hamming distance, locality-sensitive-hashing, or nearest-neighbor-search. 18. The system as in claim 1, wherein in the event that the first data segment is determined to be similar to the second previously stored data segment, the delta compression system further comprises determining whether the first data segment is similar to one or more previously stored segments in addition to the second previously stored data segment. 19. The system as in claim 18, wherein in the event that the first data segment is determined to be similar to the one or more previously stored data segments, the encoding is based at least in part on the second previously stored data segment and the one or more additional similar previously stored data segments. 20. The system as in claim 18, wherein the one or more previously stored data segments and second previously stored data segment are identified based at least in part on one or more of the following: temporal locality, spatial locality, ease of access, expected compression, or frequency of selection for other compressed segments. 21. The system as in claim 1, wherein the second previously stored data segment was stored as an encoding of a third previously stored data segment. 22. A method for processing data, comprising: determining, using a processor, whether a first data segment is identical to a first previously stored data segment, wherein in the event that the first data segment is determined to be identical to the first previously stored data segment, a reference to the first data segment is stored in a storage unit; anddetermining, only in the event that the first data segment is determined not to be identical to the first previously stored data segment, whether the first data segment is similar to a second previously stored data segment, wherein the first data segment is determined to be similar to the second previously stored data segment using a sketch function, wherein the sketch function comprises one or more functions that returns a same value for similar data segments;in the event that the first data segment is determined to be similar to the second previously stored data segment, computing an encoding of the first data segment, wherein the encoding comprises determining one or more differences between the first data segment and the second previously stored data segment, and storing, in the storage unit, the first data segment using a sequence comprising the one or more differences, one or more first sequence locations corresponding to each of the one or more differences, a reference to the second previously stored data segment, and one or more second sequence references, wherein the one or more second sequence references corresponding to a sequence of data from within the second previously stored segment identifying the subset of the second previously stored segment; andin the event that the first data segment is not determined to be similar to the second previously stored data segment and that the first data segment is determined not to be identical to the first stored data segment, storing the first data segment in the storage unit. 23. The method as in claim 22, further comprising receiving a data stream or data block. 24. The method as in claim 22, further comprising breaking the data stream or data block into a plurality of data segments. 25. The method as in claim 22, wherein determining that the first data segment is identical comprises: determining a first data segment ID associated with the first data segment;determining whether the first data segment ID is identical to a previously stored ID in an ID index. 26. The method as in claim 25, where determining the first data segment ID associated with the first data segment uses one or more of the following: a fingerprint function, a hash function, a cryptographic hash function, and a digital signature. 27. The method as in claim 22, further comprising compressing the encoding of the first data segment in the event that the first data segment is determined to be similar to the second previously stored data segment. 28. The method as in claim 22, further comprising transmitting the encoding of the first data segment in the event that the first data segment is determined to be similar to the second previously stored data segment. 29. The method as in claim 22, further comprising replicating the encoding of the first data segment in the event that the first data segment is determined to be similar to the second previously stored data segment. 30. The method as in claim 22, wherein the encoding is based at least in part on the second previously stored data segment. 31. The method as in claim 22, wherein the encoding comprises an indication of a set of data blocks in the first data segment not present in the second previously stored data segment and an indication of a set of data blocks in the second previously stored data segment. 32. The method as in claim 22, further comprising determining whether the encoding is smaller than the first data segment in the event that the first data segment is determined to be similar to the second previously stored data segment. 33. The method as in claim 22, wherein the sketch function comprises a hash function. 34. The method as in claim 22, wherein the sketch function comprises a plurality of hash functions. 35. The method as in claim 22, wherein the sketch function comprises one or more functions that returns a similar value for similar data segments. 36. The method as in claim 22, wherein the sketch function comprises one or more functions that may return the same value for similar data segments. 37. The method as in claim 22, wherein the sketch function comprises one or more functions that may return a similar value for similar data segments. 38. The method as in claim 37, wherein sketch function values are determined to be similar based on one or more of the following methods: numeric difference, hamming distance, locality-sensitive-hashing, or nearest-neighbor-search. 39. The method as in claim 22, wherein in the event that the first data segment is determined to be similar to the second previously stored data segment, determining whether the first data segment is similar to one or more previously stored segments in addition to the second previously stored data segment. 40. The method as in claim 39, wherein the encoding is based at least in part on the second previously stored data segment and the one or more additional similar previously stored data segments. 41. The method as in claim 39, wherein the one or more previously stored data segments and second previously stored data segment are identified based at least in part on one or more of the following: temporal locality, spatial locality, ease of access, expected compression, or frequency of selection for other compressed segments. 42. The method as in claim 22, wherein the second previously stored data segment was stored as an encoding of a third previously stored data segment. 43. A computer program product for processing data, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for: determining, using a processor, whether a first data segment is identical to a first previously stored data segment, wherein in the event that the first data segment is determined to be identical to the first previously stored data segment, a reference to the first data segment is stored in a storage unit; anddetermining, only in the event that the first data segment is determined not to be identical to the first previously stored data segment, whether the first segment is similar to a second previously stored data segment, wherein the first data segment is determined to be similar to the second previously stored data segment using a sketch function and the sketch function comprises one or more functions that returns a same value for similar data segments;in the event that the first data segment is determined to be similar to the second previously stored data segment, computing an encoding of the first data segment, wherein the encoding comprises determining one or more differences between the first data segment and the second previously stored data segment, and storing, in the storage unit, the first data segment using a sequence comprising the one or more differences, one or more first sequence locations corresponding to each of the one or more differences, a reference to the second previously stored data segment, and one or more second sequence references, wherein the one or more second sequence references corresponding to a sequence of data from within the second previously stored segment identifying the subset of the second previously stored segment; andin the event that the first data segment is not determined to be similar to the second previously stored data segment and that the first data segment is determined not to be identical to the first stored data segment, storing the first data segment in the storage unit.
연구과제 타임라인
LOADING...
LOADING...
LOADING...
LOADING...
LOADING...
이 특허에 인용된 특허 (12)
Farber, David A.; Lachman, Ronald D., De-duplication of data in a data processing system.
Andrei Z. Broder ; Steven C. Glassman ; Charles G. Nelson ; Mark S. Manasse ; Geoffrey G. Zweig, Method for clustering closely resembling data objects.
Miklos Ajtai ; Randal Chilton Burns ; Ronald Fagin ; Larry Joseph Stockmeyer, System and method for differential compression of data from a plurality of binary sources.
Ting, Daniel; Zheng, Ling; Manley, Stephen L.; DeStefano, John Frederick, System and method for managing data deduplication of storage systems utilizing persistent consistency point images.
Morris Robert J. T. (Los Gatos CA), System and method for reducing storage requirement in backup subsystems utilizing segmented compression and differencing.
※ AI-Helper는 부적절한 답변을 할 수 있습니다.