Siyao Meng, Prashant Pogde
Dec 6, 2022. Last revised on Mar 15, 2023
This document describes the Ozone snapshot deletion key (blocks) garbage collection design that works with the current Ozone snapshot implementation based on RocksDB checkpoints. Specifically, this design centers around deletedTable
in OM.
Note: “Delayed” in the sense that the keys captured in previous snapshots won’t be cleaned up when the same keys are deleted from the active file system. And the clean up would only happen later when all snapshots holding those keys are deleted.
DeletedTable
In current Ozone design, keys are moved from keyTable
(for OBS buckets) or fileTable
(for FSO buckets) to deletedTable
when they are deleted from the file system in a Ratis transaction. After that, KeyDeletingService#KeyDeletingTask
background task in each run would pick up a maximum of 20,000
(by default, configurable) keys’ blocks from deletedTable
and issue deleteKeyBlocks
command to SCM on those blocks. By default, this task is triggered every 60 seconds, which is configurable as well.
DeletedTable
deletedTable
before its blocks would be removed. Thus, this is the best place to gate-keep block deletion in Ozone snapshots.KeyDeletingTask
inside deletedTable
until all snapshots have released those keys perfectly achieves the purpose of garbage collection. We don’t need to fully scan snapshot’s keyTable
upon its deletion to check if a key still exists in other snapshot (and we are not doing explicit block reference counting in Ozone). If a key exists in a deletedTable
, it must mean that key has been deleted by that point. This piece of information combined with the previous file system state (snapshot) is sufficient to judge whether the blocks can be reclaimed (deleted).deletedTable
approach.In this section, we will describe the core garbage collection logic, along with miscellaneous changes that would be required in all existing components (that we could think of) at this moment. If you find this hard to follow, check out the example in the section below which hopefully should be less abstract.
Assume all snapshots mentioned here are taken in the same bucket.