Probabilistic summary data structure based encoding for garbage collection in backup systems
원문보기
IPC분류정보
국가/구분
United States(US) Patent
등록
국제특허분류(IPC7판)
G06F-012/00
G06F-017/30
G06F-007/00
G06F-017/00
출원번호
UP-0829756
(2007-07-27)
등록번호
US-7783682
(2010-09-13)
발명자
/ 주소
Patterson, R. Hugo
출원인 / 주소
EMC Corporation
대리인 / 주소
Blakely Sokoloff Taylor & Zafman LLP
인용정보
피인용 횟수 :
47인용 특허 :
28
초록▼
A method and apparatus for different embodiments of probabilistic summary data structure based encoding for garbage collection are described. In one embodiment, a method comprises generating a probabilistic summary data structure that represents active blocks of data within a storage device based on
A method and apparatus for different embodiments of probabilistic summary data structure based encoding for garbage collection are described. In one embodiment, a method comprises generating a probabilistic summary data structure that represents active blocks of data within a storage device based on identifications of the active blocks or the data within the active blocks. The method also includes performing garbage collection of at least a portion of the storage device based on the probabilistic summary data structure.
대표청구항▼
What is claimed is: 1. A backup system for performing garbage collection of an allocated address space in a set of one or more storage devices comprising: a set of one or more active storage trees each that are maintained within the set of storage devices, the set of active storage trees representi
What is claimed is: 1. A backup system for performing garbage collection of an allocated address space in a set of one or more storage devices comprising: a set of one or more active storage trees each that are maintained within the set of storage devices, the set of active storage trees representing a version of multiple data, each leaf node of the set of active storage trees representing a block of data from the multiple data that has been backed up onto the set of storage devices; and a garbage collection logic, executing on a processor, responsive to deletion of a version of the multiple data, to generate a probabilistic summary data structure, wherein the probabilistic summary data structure represents blocks of data that are referenced by the set of active storage trees, wherein the garbage collection logic includes an encoder to generate the probabilistic summary data structure, and wherein the encoder includes a hash module to perform a first plurality of hash functions on the referenced blocks to generate a set of offset values and to set bits within the probabilistic summary data structure that are at offsets equal to the set of offset values generated by the plurality of hash functions. 2. The backup system of claim 1, wherein the garbage collection logic further comprises: a locator to identify blocks of data that are referenced by the set of active storage trees by walking at least a portion of the set of active storage trees; and a collector to clean the allocated address space based on the probabilistic summary data structure to mark as unallocated blocks of data that are no longer referenced by the set of active storage trees. 3. The backup system of claim 2, wherein the encoder further comprises: an initialization module to set bits within the probabilistic summary data structure to zero. 4. The backup system of claim 3, wherein the collector comprises: a bit vector generator comprising a hash module to perform the first set of hash functions on each of the blocks within the allocated address space to generate a set of offset values and to set bits in the bit vector that are offsets equal to the set of offset values; an evaluator to compare the bit vector with the probabilistic summary data structure to determine if each of the blocks within the allocated address space is referenced; a copier to copy the determined referenced blocks into an unallocated address space within the set of storage devices; and a deallocator module to mark the allocated address space as unallocated. 5. The backup system of claim 4, wherein new data is to overwrite the blocks of data stored within the previously allocated address space that is marked as unallocated. 6. The backup system of claim 1, further comprising a location table referenced by the leaf nodes to identify the backup locations of the blocks of data in the set of one or more storage devices. 7. The backup system of claim 1, further comprising a computing device to access the set of storage devices, the computing device including a memory for storing the probabilistic summary data structure. 8. The backup system of claim 7, wherein a size of the probabilistic summary data structure is less than a size of the memory of the computing device. 9. The backup system of claim 1, wherein two different nodes from two different storage trees may reference a same node. 10. The backup system of claim 9, wherein one node makes multiple references to a different node. 11. The backup system of claim 10, wherein two different nodes of a same storage tree reference a same node in the same storage tree. 12. The backup system of claim 11, wherein two different nodes from a same file in a storage tree reference a same node in the storage tree. 13. The apparatus of claim 1, wherein the garbage collection logic also generates a probabilistic summary data structure based on a tracking of the unreferenced blocks of data. 14. A backup system for performing garbage collection of an allocated address space in a set of one or more storage devices comprising: a set of one or more active storage trees each representing a version of multiple data, each leaf node of the set of active storage trees representing a block of data from the multiple data that has been backed up onto the set of storage devices; and a garbage collection logic, executing on a processor, responsive to deletion of a version of the multiple data, to walk at least a portion of the set of active storage trees and record in a probabilistic summary data structure based on a Bloom filter which of the blocks of data identified by the part of the set of active storage trees that was walked are still referenced within the multiple data, wherein the garbage collection logic includes an encoder for generating the probabilistic summary data structure based on a Bloom filter, and wherein the encoder includes a Bloom filter to perform a first set of one or more hash functions on the still referenced blocks to generate a set of offset values and to set bits within the probabilistic summary data structure that are at offsets equal to the set of offset values generated by the first set of hash functions. 15. The backup system of claim 14, wherein the garbage collection logic further comprises: a collector for cleaning the allocated address space based on the probabilistic summary data structure to mark as unallocated blocks of data that are no longer referenced by the set of active storage trees. 16. The backup system of claim 15, wherein the encoder further comprises: an initialization module to set bits within the probabilistic summary data structure to zero. 17. The backup system of claim 16, wherein the collector comprises: a bit vector generator comprising the Bloom filter to perform the first set of hash functions on each of the blocks within the allocated address space to generate a set of offset values and to set bits in the bit vector that are offsets equal to the set of offset values; an evaluator to compare the bit vector with the probabilistic summary data structure to determine if each of the blocks within the allocated address space is referenced; a copier to copy the determined referenced blocks into an unallocated address space within the storage device; and a deallocator module to mark the allocated address space as unallocated. 18. The backup system of claim 17, wherein new data is to overwrite the blocks of data stored within the previously allocated address space that is marked as unallocated. 19. An apparatus for performing garbage collection of a range of addresses within an allocated address space of a set of one or more storage devices comprising: a backup system on a computer to back up a file system, said backup system comprising: a backup logic to generate and maintain a plurality of active storage trees, each active storage tree comprising a plurality of leaf nodes each representing blocks of data stored in the set of storage devices; a determination module to determine which range of addresses is to be cleaned within the allocated address space; a first encoder to encode blocks of data within the range of addresses into a first probabilistic summary data structure, wherein the first encoder includes a first Bloom filter to perform a first set of hash functions on each block of data within the range of addresses to be cleaned to generate a set of offset values and to set bits within the first probabilistic summary data structure that are at offsets equal to the set of offset values generated by the first set of hash functions; a locator to identify blocks of data that are referenced by the plurality of active storage trees by walking the plurality of active storage trees; an evaluator to determine blocks of data that are both referenced by the set of active storage trees and within the range of addresses to be cleaned based on the first probabilistic summary data structure; a second encoder to encode the blocks of data that are both referenced and within the range of addresses to be cleaned into a second probabilistic summary data structure, wherein the second encoder includes a second Bloom filter to perform a second set of one or more hash functions on each of the referenced blocks of data that are both referenced and within the range of addresses to be cleaned to generate a set of offset values and to set bits in the second probabilistic summary data structure that are at offsets equal to the set of offset values generated by the second set of hash functions; and a collector to clean the range of addresses within the allocated address space based on the second probabilistic summary data structure to mark as unallocated blocks of data that are no longer referenced. 20. The apparatus of claim 19, wherein the first encoder to encode blocks of data within the range of addresses into the first probabilistic summary data structure comprises: an initialization module to set bits within the first probabilistic summary data structure to zero. 21. The apparatus of claim 20, wherein the evaluator to determine blocks of data that are both referenced and within the range of addresses to be cleaned based on the first probabilistic summary data structure comprises: a bit vector generator comprising the first Bloom filter to generate a bit vector for each of the referenced blocks by performing the first set of hash functions on the referenced blocks to generate a set of offset values and setting bits in the bit vector that are at offsets equal to the set of offset values generated by the first set of hash functions; and an evaluator to compare the bit vector with the first probabilistic summary data structure to determine if the referenced block is in the range of addresses to be cleaned. 22. The apparatus of claim 21, wherein performing the first set of hash functions on each block of data within the range of addresses to be cleaned includes applying the Bloom filter to an identification of each of the blocks or to data within each of the blocks. 23. The apparatus of claim 21, wherein the second encoder to encode the blocks of data into a second probabilistic summary data structure comprises: an initialization module to set bits within the second probabilistic summary data structure to zero. 24. The apparatus of claim 23, wherein the collector to clean the range of addresses to be cleaned within the allocated address space based on the second probabilistic summary data structure comprises: a bit vector generator comprising the second Bloom filter to perform the second set of hash functions on each block of data within the range of addresses to be cleaned to generate a set of offset values and to set bits in the bit vector that are at offsets equal to the set of offset values generated by the second set of hash functions; an evaluator to compare the bit vector with the second probabilistic summary data structure to determine if the block of data is both referenced and within the range of addresses to be cleaned; a copier to copy each block of data that is determined to be both referenced and within the range to be cleaned to an unallocated address space within the storage device; and a deallocator module to mark the range of addresses to be cleaned as unallocated. 25. The apparatus of claim 24, wherein new data is to overwrite the blocks of data stored within the previously allocated address space that is marked as unallocated. 26. The apparatus of claim 24, wherein performing the second set of hash functions on the blocks of data includes applying the Bloom filter to an identification of each of the blocks or to data within each of the blocks. 27. The apparatus of claim 19, further comprising a location table referenced by the leaf nodes to identify the backup locations of the blocks of data in the set of storage devices. 28. The apparatus of claim 19, further comprising a computing device to access the set of storage devices, the computing device including a memory having stored therein the first and second probabilistic summary data structures. 29. The apparatus of claim 28, wherein a size of the probabilistic summary data structure is less than a size of the memory of the computing device. 30. An apparatus comprising: a backup system on a computer to back up a file system, said backup system including, a backup logic to generate a set of trees each representing backup snapshots of said file system at different times by recording references to blocks of backed up data stored in a set of one or more storage devices; and a garbage collection logic coupled to access said set of trees to at least approximate garbage collection of unreferenced ones of said blocks of data by tracking, with a Bloom filter, unreferenced ones of said blocks, wherein the garbage collection logic is to generate a summary data structure based on a tracking of the unreferenced ones of said blocks, wherein a size of the summary data structure is less than a size of a local memory of the backup system, wherein the garbage collection logic includes a hash module to perform a first plurality of hash functions on different ones of the blocks to generate a set of offset values and to set bits within the probabilistic summary data structure that are at offsets equal to the set of offset values generated by the plurality of hash functions. 31. The apparatus of claim 30, wherein blocks of backed up data from two different trees references a same block of backed up data. 32. An apparatus for performing garbage collection in a range of addresses to be cleaned in an allocated address space of one or more storage devices, wherein the allocated address space has stored therein blocks of data that are referenced by a set of one or more active storage trees and blocks of data that are no longer referenced by the set of one or more active storage trees comprising: a backup logic to generate the set of one or more active storage trees, each storage tree comprising one or more leaf nodes representing blocks of data stored within the one or more storage devices on a computer; and a first encoder to determine the range of addresses to be cleaned and to encode blocks of data within the range of addresses to be cleaned into a first probabilistic summary data structure, wherein the first encoder includes, a hash module to perform a first set of one or more hash functions on each block of data within the range of addresses to be cleaned, wherein each of the first set of one or more hash functions generates a set of one or more offset values, and a write logic to set bits within the first probabilistic summary data structure that are at offsets equal to the set of one or more offset values generated by the first set of one or more hash functions; a locator to identify blocks of data that are referenced by the set of one or more storage trees by walking the set of one or more active storage trees; an evaluator to determine blocks of data that are both referenced by the set of one or more storage trees and within the range of addresses to be cleaned based on the first probabilistic summary data structure; a second encoder to encode those blocks of data that are both referenced and within the range of addresses to be cleaned into a second probabilistic summary data structure, wherein the second encoder includes, a hash module to perform a second set of one or more hash functions on each of the blocks of data that are both referenced and within the range of addresses to be cleaned, wherein each of the second set of one or more hash functions generates a set of one or more offset values; and a write logic to set bits in the second probabilistic summary data structure that are at offsets equal to the set of one or more offset values generated by the second set of one or more hash functions; and a collector to clean the range of addresses to be cleaned within the allocated address space based on the second probabilistic summary data structure to mark as unallocated those blocks of data that are no longer referenced. 33. The apparatus of claim 32, further comprising a memory having stored therein the first and second probabilistic summary data structures. 34. The apparatus of claim 32, wherein the first encoder comprises: an initialization module to set bits within the first probabilistic summary data structure to zero. 35. The apparatus of claim 34, wherein the evaluator comprises: a bit vector generator comprising: a hash module to generate a bit vector for each of the referenced blocks by performing the first set of one or more hash functions on the referenced blocks, wherein each of the first set of one or more hash functions generates a set of one or more offset values; and a write logic to set bits in the bit vector that are at offsets equal to the set of one or more offset values generated by the first set of one or more hash functions; and a evaluator to compare the bit vector with the first probabilistic summary data structure to determine if the referenced block is within the range of addresses to be cleaned. 36. The apparatus of claim 35, wherein the second encoder comprises: an initialization module to set bits within the second probabilistic summary data structure to zero. 37. The apparatus of claim 36, wherein the collector comprises: a bit vector generator comprising: a hash module to perform the second set of one or more hash functions on each block of data within the range of addresses to be cleaned, wherein each of the second set of one or more hash functions generates a set of one or more offset values; and a write logic to set bits in the bit vector that are at offsets equal to the set of one or more offset values generated by the second set of one or more hash functions; and a evaluator to compare the bit vector with the second probabilistic summary data structure to determine if the block of data is both referenced and within the range of addresses to be cleaned; a copier to copy each block of data that is determined to be both referenced and within the range to be cleaned to an unallocated address space within the storage device; and a deallocator module to mark the range of addresses to be cleaned as unallocated. 38. The apparatus of claim 37, wherein new data overwrites the blocks of data stored within the range of addresses marked as unallocated.
Gladney Henry M. (Saratoga CA) Lorch Douglas J. (San Jose CA) Mattson Richard L. (San Jose CA), Communication for version management in a distributed information service.
Hitz David ; Malcolm Michael ; Lau James ; Rakitzis Byron, Method for maintaining consistent states of a file system and for creating user-accessible read-only copies of a file s.
Colgrove, John; Hayes, John; Hong, Bo; Wang, Feng; Miller, Ethan; Harmer, Craig, Adjusting a number of storage devices in a storage system that may be utilized to simultaneously service high latency operations.
Colgrove, John; Hayes, John; Hong, Bo; Wang, Feng; Miller, Ethan; Harmer, Craig, Maintaining a target number of storage devices for variable I/O response times in a storage system.
Colgrove, John; Hayes, John; Hong, Bo; Wang, Feng; Miller, Ethan; Harmer, Craig, Reducing a number of storage devices in a storage system that are exhibiting variable I/O response times.
※ AI-Helper는 부적절한 답변을 할 수 있습니다.