Design: Ozone Snapshot Deletion Garbage Collection based on key deletedTable

Siyao Meng, Prashant Pogde

Dec 6, 2022. Last revised on Mar 15, 2023

§1. Introduction

This document describes the Ozone snapshot deletion key (blocks) garbage collection design that works with the current Ozone snapshot implementation based on RocksDB checkpoints. Specifically, this design centers around deletedTable in OM.

§2. The Goal of Snapshot Deletion

Reverse the snapshot creation process.
Deal with any delayed key clean up.

Note: “Delayed” in the sense that the keys captured in previous snapshots won’t be cleaned up when the same keys are deleted from the active file system. And the clean up would only happen later when all snapshots holding those keys are deleted.

§3. Background

3.1. What is `DeletedTable`

In current Ozone design, keys are moved from keyTable (for OBS buckets) or fileTable (for FSO buckets) to deletedTable when they are deleted from the file system in a Ratis transaction. After that, KeyDeletingService#KeyDeletingTask background task in each run would pick up a maximum of 20,000 (by default, configurable) keys’ blocks from deletedTable and issue deleteKeyBlocks command to SCM on those blocks. By default, this task is triggered every 60 seconds, which is configurable as well.

3.2. Why use `DeletedTable`

With the current Ozone implementation, every single key deleted are moved to deletedTable before its blocks would be removed. Thus, this is the best place to gate-keep block deletion in Ozone snapshots.
Logically, delaying key blocks deletion in KeyDeletingTask inside deletedTable until all snapshots have released those keys perfectly achieves the purpose of garbage collection. We don’t need to fully scan snapshot’s keyTable upon its deletion to check if a key still exists in other snapshot (and we are not doing explicit block reference counting in Ozone). If a key exists in a deletedTable, it must mean that key has been deleted by that point. This piece of information combined with the previous file system state (snapshot) is sufficient to judge whether the blocks can be reclaimed (deleted).
An alternative garbage collection approach, which utilize snapshot diff list, requires a diff (or two) to be calculated before a snapshot deletion garbage collection can be performed, which could take a long time. Overall efficiency would be much better with the deletedTable approach.

§4. Algorithm: Core logic, and others necessary changes

In this section, we will describe the core garbage collection logic, along with miscellaneous changes that would be required in all existing components (that we could think of) at this moment. If you find this hard to follow, check out the example in the section below which hopefully should be less abstract.

Assume all snapshots mentioned here are taken in the same bucket.