Memory efficient sanitization of a deduplicated storage system
IPC classification information
Country / Type
United States (US) Patent
Granted
International Patent Classification (IPC 7th edition)
G06F-012/00
G06F-003/06
G06F-012/08
Application number
US-0763508
(2013-02-08)
Registration number
US-9430164
(2016-08-30)
Inventors / Address
Botelho, Fabiano C.
Garg, Nitin
Shilane, Philip N.
Wallace, Grant
Applicant / Address
EMC Corporation
Agent / Address
Blakely, Sokoloff, Taylor & Zafman LLP
Citation information
Times cited: 6
Cited patents: 9
Abstract
Techniques for sanitizing a storage system are described herein. In one embodiment, for each file stored in the storage system, a list of fingerprints representing data chunks of the file is obtained. In such an embodiment, for each of the fingerprints, a first container storing a data chunk corresponding to the fingerprint is identified, and a storage location of the first container in which the data chunk is stored is determined. In one embodiment, a bit in a copy bit vector (CBV) is populated based on the identified container and the storage location. In one embodiment, after all of the bits corresponding to the data chunks of the first container have been populated in the CBV, data chunks represented by the CBV are copied from the first container to a second container, and records of the data chunks in the first container are erased.
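The flow in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the names `sanitize`, `set_bit`, `get_bit`, and `CHUNKS_PER_CONTAINER` are hypothetical, container IDs are assumed to be dense integers starting at 0, every container is assumed to hold a fixed number of chunk slots, and the pad value is assumed to be zero bytes. The sketch marks live locations in a bit vector, copies live chunks forward, then overwrites and releases the old container.

```python
CHUNKS_PER_CONTAINER = 4  # assumed fixed slot count per container (illustration only)


def set_bit(cbv, index):
    """Set bit `index` in a bytearray used as a bit vector."""
    cbv[index >> 3] |= 1 << (index & 7)


def get_bit(cbv, index):
    """Return bit `index` (0 or 1) from a bytearray used as a bit vector."""
    return (cbv[index >> 3] >> (index & 7)) & 1


def sanitize(storage, live_locations):
    """Sanitize `storage` in place and return the IDs released to the free pool.

    storage: dict container_id -> list of chunk bytes (IDs assumed dense, from 0).
    live_locations: iterable of (container_id, chunk_id) pairs referenced by files.
    """
    old_ids = sorted(storage)
    # One bit per chunk slot; every bit starts at 0, meaning "dead".
    cbv = bytearray((len(old_ids) * CHUNKS_PER_CONTAINER + 7) // 8)

    # Mark phase: set the bit for each location that some file references.
    for container_id, chunk_id in live_locations:
        set_bit(cbv, container_id * CHUNKS_PER_CONTAINER + chunk_id)

    free_pool = []
    next_id = len(old_ids)  # hypothetical allocator for fresh container IDs
    for container_id in old_ids:
        chunks = storage[container_id]
        base = container_id * CHUNKS_PER_CONTAINER
        live = [c for slot, c in enumerate(chunks) if get_bit(cbv, base + slot)]
        if len(live) == len(chunks):
            continue  # no dead chunk here, so the container is left alone
        if live:
            storage[next_id] = live  # copy live chunks forward to a new container
            next_id += 1
        # Overwrite every slot with a fixed pad value, then release the container.
        storage[container_id] = [b"\x00" * len(c) for c in chunks]
        free_pool.append(container_id)
    return free_pool
```

With two containers where only container 0 holds dead chunks, calling `sanitize` leaves container 1 untouched, copies the live chunks of container 0 into a new container, and returns `[0]` as the released ID. Skipping containers with no dead chunk mirrors the dependent claims on copying only when at least one dead chunk is present.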
Representative claims
1. A computer-implemented method for sanitizing a storage system, the method comprising:
for each of a plurality of files stored in a file system of the storage system, obtaining a list of fingerprints representing data chunks of the file from a checkpointed on disk fingerprint-to-container (FTC) index, wherein the data chunks are deduplicated data chunks, and wherein at least one data chunk is referenced by multiple files in the file system;
for each of the fingerprints, performing a lookup operation based on the fingerprint in a cache storing a plurality of cache entries, each mapping a fingerprint to a container identifier (ID) storing the corresponding data chunk and a chunk ID indicating a storage location of the data chunk within the container;
identifying a first container ID identifying a first container storing a data chunk corresponding to the fingerprint from a first cache entry matching the fingerprint,
determining from the first cache entry a first chunk ID identifying a storage location of the first container in which the data chunk is stored, and
in response to determining that the fingerprint is not found in the cache:
looking up the fingerprint in the FTC index to identify the first container ID storing the corresponding data chunk represented by the fingerprint;
reading, into the cache, metadata of the first container having the first container ID; and
looking up the first chunk ID, using the fingerprint, in the metadata of the first container having the first container ID;
populating a bit in a copy bit vector (CBV) based on the first container ID and the first chunk ID, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein a data chunk with a corresponding bit having a predetermined bit value in the CBV is a live data chunk, wherein a live data chunk is referenced by at least one of the files in the file system;
after all of the bits corresponding to the fingerprints in the plurality of files have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; and
erasing records of the data chunks in the first container after the live data chunks of the first container indicated by the CBV have been copied to the second container to reclaim a storage space associated with the first container, including padding a predetermined data value in the first container, and releasing the first container back to a pool of free containers for future reuse.

2. The method of claim 1, wherein performing a lookup operation in a cache comprises:
performing a lookup operation in an index based on the fingerprint to identify the first container; and
reading a metadata corresponding to the identified first container to determine the storage location, if the first container and storage location cannot be identified by the lookup operation in the cache.

3. The method of claim 2, further comprising storing the metadata obtained from the first container in the cache, the metadata including the fingerprint, a container identifier identifying the container storing the data chunk corresponding to the fingerprint, and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored.

4. The method of claim 2, further comprising:
receiving a data chunk to be stored in the storage system while sanitization is in progress;
storing, in a buffer, a container identifier of a container storing the data chunk and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored; and
populating a bit in the CBV based on the container identifier and storage location identifier stored in the buffer.

5. The method of claim 2, wherein the CBV is included in a container index that maps a fingerprint of a data chunk to a container storing the data chunk, wherein each of the bits in the CBV is set, at the start of the sanitization process, to a predetermined value indicating the corresponding data chunk is dead, and wherein a dead data chunk is not referenced by any of the files in the file system.

6. The method of claim 1, wherein data chunks are copied from the first container to a second container if the first container contains at least one dead data chunk.

7. The method of claim 1, wherein deduplication is disabled during the sanitization process.

8. A non-transitory computer-readable medium having instructions stored therein, which when executed by a computer, cause the computer to perform operations, the operations comprising:
for each of a plurality of files stored in a file system of the storage system, obtaining a list of fingerprints representing data chunks of the file from a checkpointed on disk fingerprint-to-container (FTC) index, wherein the data chunks are deduplicated data chunks, and wherein at least one data chunk is referenced by multiple files in the file system;
for each of the fingerprints, performing a lookup operation based on the fingerprint in a cache storing a plurality of cache entries, each mapping a fingerprint to a container identifier (ID) storing the corresponding data chunk and a chunk ID indicating a storage location of the data chunk within the container,
identifying a first container ID identifying a first container storing a data chunk corresponding to the fingerprint from a first cache entry matching the fingerprint,
determining from the first cache entry a first chunk ID identifying a storage location of the first container in which the data chunk is stored, and
in response to determining that the fingerprint is not found in the cache:
looking up the fingerprint in the FTC index to identify the first container ID storing the corresponding data chunk represented by the fingerprint;
reading, into the cache, metadata of the first container having the first container ID; and
looking up the first chunk ID, using the fingerprint, in the metadata of the first container having the first container ID;
populating a bit in a copy bit vector (CBV) based on the first container ID and the first chunk ID, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein a data chunk with a corresponding bit having a predetermined bit value in the CBV is a live data chunk, wherein a live data chunk is referenced by at least one of the files in the file system;
after all of the bits corresponding to the fingerprints in the plurality of files have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; and
erasing records of the data chunks in the first container after the live data chunks of the first container indicated by the CBV have been copied to the second container to reclaim a storage space associated with the first container, including padding a predetermined data value in the first container, and releasing the first container back to a pool of free containers for future reuse.

9. The non-transitory computer-readable medium of claim 8, wherein performing a lookup operation in a cache comprises:
performing a lookup operation in an index based on the fingerprint to identify the first container; and
reading a metadata corresponding to the identified first container to determine the storage location, if the first container and storage location cannot be identified by the lookup operation in the cache.

10. The non-transitory computer-readable medium of claim 9, wherein the operations further comprise storing the metadata obtained from the first container in the cache, the metadata including the fingerprint, a container identifier identifying the container storing the data chunk corresponding to the fingerprint, and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored.

11. The non-transitory computer-readable medium of claim 9, wherein the operations further comprise:
receiving a data chunk to be stored in the storage system while sanitization is in progress;
storing, in a buffer, a container identifier of a container storing the data chunk and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored; and
populating a bit in the CBV based on the container identifier and storage location identifier stored in the buffer.

12. The non-transitory computer-readable medium of claim 9, wherein the CBV is included in a container index that maps a fingerprint of a data chunk to a container storing the data chunk, and wherein each of the bits in the CBV is set, at the start of the sanitization process, to a predetermined value indicating the corresponding data chunk is dead, and wherein a dead data chunk is not referenced by any of the files in the file system.

13. The non-transitory computer-readable medium of claim 8, wherein data chunks are copied from the first container to the second container if the first container contains at least one dead data chunk.

14. The non-transitory computer-readable medium of claim 8, wherein deduplication is disabled during the sanitization process.

15. A data processing system, comprising:
a processor; and
a memory to store instructions, which when executed from the memory, cause the processor to perform operations, the operations including
for each of a plurality of files stored in a file system of the storage system, obtaining a list of fingerprints representing data chunks of the file from a checkpointed on disk fingerprint-to-container (FTC) index, wherein the data chunks are deduplicated data chunks, and wherein at least one data chunk is referenced by multiple files in the file system;
for each of the fingerprints, performing a lookup operation based on the fingerprint in a cache storing a plurality of cache entries, each mapping a fingerprint to a container identifier (ID) storing the corresponding data chunk and a chunk ID indicating a storage location of the data chunk within the container,
identifying a first container ID identifying a first container storing a data chunk corresponding to the fingerprint from a first cache entry matching the fingerprint,
determining from the first cache entry a first chunk ID identifying a storage location of the first container in which the data chunk is stored, and
in response to determining that the fingerprint is not found in the cache:
looking up the fingerprint in the FTC index to identify the first container ID storing the corresponding data chunk represented by the fingerprint;
reading, into the cache, metadata of the first container having the first container ID; and
looking up the first chunk ID, using the fingerprint, in the metadata of the first container having the first container ID;
populating a bit in a copy bit vector (CBV) based on the first container ID and the first chunk ID, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein a data chunk with a corresponding bit having a predetermined bit value in the CBV is a live data chunk, wherein a live data chunk is referenced by at least one of the files in the file system;
after all of the bits corresponding to the fingerprints in the plurality of files have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; and
erasing records of the data chunks in the first container after the live data chunks of the first container indicated by the CBV have been copied to the second container to reclaim a storage space associated with the first container, including padding a predetermined data value in the first container, and releasing the first container back to a pool of free containers for future reuse.

16. The system of claim 15, wherein performing a lookup operation in a cache comprises:
performing a lookup operation in an index based on the fingerprint to identify the first container; and
reading a metadata corresponding to the identified first container to determine the storage location, if the first container and storage location cannot be identified by the lookup operation in the cache.

17. The system of claim 16, wherein the operations further comprise storing, by the processor, the metadata obtained from the first container in the cache, the metadata including the fingerprint, a container identifier identifying the container storing the data chunk corresponding to the fingerprint, and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored.

18. The system of claim 16, wherein the operations further comprise:
receiving, by the processor, a data chunk to be stored in the storage system while sanitization is in progress;
storing by the processor, in a buffer, a container identifier of a container storing the data chunk and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored; and
populating by the processor, a bit in the CBV based on the container identifier and storage location identifier stored in the buffer.

19. The system of claim 16, wherein the CBV is included in a container index that maps a fingerprint of a data chunk to a container storing the data chunk, and wherein each of the bits in the CBV is set, at the start of the sanitization process, to a predetermined value indicating the corresponding data chunk is dead, and wherein a dead data chunk is not referenced by any of the files in the file system.

20. The system of claim 15, wherein data chunks are copied from the first container to the second container if the first container contains at least one dead data chunk.

21. The system of claim 15, wherein deduplication is disabled during the sanitization process.
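The lookup path recited in the independent claims (cache first, then the FTC index and container metadata on a miss) can be sketched as follows. This is an illustrative model under stated assumptions, not the claimed implementation: `locate` is a hypothetical name, the cache is an unbounded dict (a real cache would presumably be bounded with eviction), the FTC index is a dict from fingerprint to container ID, and a container's metadata is modeled as the list of fingerprints it stores, indexed by chunk ID.

```python
def locate(fingerprint, cache, ftc_index, container_metadata):
    """Return (container_id, chunk_id) for `fingerprint`.

    cache: dict fingerprint -> (container_id, chunk_id); filled on misses.
    ftc_index: dict fingerprint -> container_id (stand-in for the on-disk index).
    container_metadata: dict container_id -> list of fingerprints, by chunk ID.
    """
    hit = cache.get(fingerprint)
    if hit is not None:
        return hit  # cache entry already holds both container ID and chunk ID

    # Cache miss: the FTC index only tells us which container holds the chunk.
    container_id = ftc_index[fingerprint]

    # Read that container's metadata into the cache. Loading every entry, not
    # just the one requested, lets later lookups for neighbouring chunks hit.
    for chunk_id, stored_fp in enumerate(container_metadata[container_id]):
        cache[stored_fp] = (container_id, chunk_id)

    # The chunk ID is now found via the fingerprint in the cached metadata.
    return cache[fingerprint]
```

For example, after `locate("fp_b", cache, ftc_index, container_metadata)` misses and loads container 7, a later lookup for `"fp_a"` in the same container is answered from the cache without touching the index. The returned pair is what the claims use to select the bit to populate in the CBV.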
Patents cited by this patent (9)
Sekiguchi, Katsutomo, Computer-readable storage medium storing generational garbage collection program.
Agesen, Ole; Detlefs, David L.; White, Derek R., Method, apparatus, and article of manufacture for facilitating resource management for applications having two types of program code.
Shuler, Jr., Robert L. (Friendswood, TX), Real-time garbage collection for list processing using restructured cells for increased reference counter size.
Chambliss, David D.; Fischer-Toubol, Jonathan; Glider, Joseph S.; Harnik, Danny; Khaitzin, Ety; Kuttner, Yifat; Moser, Michael; Shatsky, Yosef, Direct lookup for identifying duplicate data in a data deduplication system.