【ETCD】Analysis of the underlying mechanism of ETCD Defrag

Posted by Hao Liang's Blog on Saturday, December 18, 2021

server/storage/backend/backend.go#defrag()

server/storage/backend/backend.go#defragdb()

2. Why do we need defrag?

When we use K8s clusters daily, if we frequently add or delete cluster data, we will find a strange phenomenon:

  • Even though the amount of object data in the cluster has not increased significantly, the disk space occupied by etcd data files is increasing.

Searching for relevant information, we find that etcd officially recommends using the defrag command of the etcdctl tool to defragment the data of each etcd node:

etcdctl  --endpoints=${ENDPOINTS} defrag --command-timeout=60s

Defragmentation can often release a large amount of disk space, preventing etcd's data size from reaching the configured --quota-backend-bytes limit, which would trigger the NOSPACE alarm and make the cluster read-only.

3. What exactly does ‘defrag’ do? Will the data be deleted?

etcd data storage mechanism

etcd data is stored in boltdb in the form of db files:

ls -l /path-to-etcd-dir/etcd/member/snap/db

-rw------- 1 root root 2270765056 Dec 18 17:34 db

The db file stores data in the form of pages. Pages are divided into the following categories:

  • meta page: records db metadata, including the file identifier, version number, page size, root bucket (pointing to a branch page or leaf page), the freelist page id, and other information.
  • freelist page: records which pages in the db are free and available for reuse.
  • branch page: records the indexes of leaf pages (leaf page ids), i.e. the mapping between keys and leaf page ids.
  • leaf page: records bucket or user key-value information.
  • free page: a page that holds no live data and is available for reuse.
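These page categories correspond to flag bits in boltdb's page header. The sketch below is a simplified model for illustration (boltdb's real page struct also carries a count, an overflow counter, and a data pointer); the flag values mirror the constants in boltdb's page.go, and a free page is simply tracked by the freelist rather than by a flag of its own:

```go
package main

import "fmt"

// Page-type flags as defined in boltdb's page.go.
const (
	branchPageFlag   = 0x01
	leafPageFlag     = 0x02
	metaPageFlag     = 0x04
	freelistPageFlag = 0x10
)

// pageHeader is a simplified sketch of boltdb's on-disk page header.
type pageHeader struct {
	id    uint64
	flags uint16
}

// pageType maps the header flags to the categories listed above.
func pageType(p pageHeader) string {
	switch {
	case p.flags&metaPageFlag != 0:
		return "meta"
	case p.flags&freelistPageFlag != 0:
		return "freelist"
	case p.flags&branchPageFlag != 0:
		return "branch"
	case p.flags&leafPageFlag != 0:
		return "leaf"
	default:
		return "unknown"
	}
}

func main() {
	pages := []pageHeader{
		{0, metaPageFlag}, {1, metaPageFlag}, // boltdb keeps two meta pages
		{2, freelistPageFlag},
		{3, branchPageFlag},
		{4, leafPageFlag},
	}
	for _, p := range pages {
		fmt.Printf("page %d: %s\n", p.id, pageType(p))
	}
}
```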

Bucket/bucket data structure

There are two related data structures in the boltdb source code: Bucket and bucket.

bucket is the page-oriented, on-file representation; its root field holds the page id of the bucket's root page (a branch page or leaf page). The root bucket itself is stored in the meta page, so the root bucket recorded in the meta page is the ancestor of all other buckets.

// bucket represents the on-file representation of a bucket.
// This is stored as the "value" of a bucket key. If the bucket is small enough,
// then its root page can be stored inline in the "value", after the bucket
// header. In the case of inline buckets, the "root" will be 0.
type bucket struct {
    root     pgid   // page id of the bucket's root-level page
    sequence uint64 // monotonically incrementing, used by NextSequence()
}

Bucket is the in-memory B+ tree representation of a bucket: it embeds bucket and adds the associated transaction, node cache, and other runtime state. When we access user key-values, we go through a Bucket's key-value information.

// Bucket represents a collection of key/value pairs inside the database.
type Bucket struct {
	*bucket
	tx       *Tx                // the associated transaction
	buckets  map[string]*Bucket // subbucket cache
	page     *page              // inline page reference
	rootNode *node              // materialized node for the root page.
	nodes    map[pgid]*node     // node cache

	// Sets the threshold for filling nodes when they split. By default,
	// the bucket will fill to 50% but it can be useful to increase this
	// amount if you know that your write workloads are mostly append-only.
	//
	// This is non-persisted across transactions so it must be set in every Tx.
	FillPercent float64
}

Boltdb logically manages all sub-bucket and key-value data through B+ Tree.

When a user needs to look up key-value data:

    1. etcd first finds the revision (version number) of the key through the treeIndex (a B-tree) maintained in etcd memory. (treeIndex maintains the mapping between keys and keyIndex revisions.)
    2. It then searches the buffer for the key-value corresponding to that revision.
    3. If the buffer misses, it performs a range search in boltdb's B+ tree based on the key's revision: it first finds the corresponding branch page id, then the leaf page id of the next layer through the index, and finally locates the user's key-value.
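The three-step lookup can be sketched with a toy model in Go. Here treeIndex, buffer, and boltdb are plain maps standing in for etcd's actual layers, and revision loosely mirrors etcd's (main, sub) revision pair; the names are illustrative, not etcd's real API:

```go
package main

import "fmt"

// revision loosely mirrors etcd's (main, sub) revision pair.
type revision struct{ main, sub int64 }

// Toy stand-ins for etcd's storage layers.
var (
	treeIndex = map[string]revision{} // key -> latest revision (in-memory index)
	buffer    = map[revision][]byte{} // buffer in front of boltdb
	boltdb    = map[revision][]byte{} // revision -> value, the on-disk B+ tree
)

// get follows the three steps: treeIndex, then buffer, then boltdb.
func get(key string) ([]byte, bool) {
	rev, ok := treeIndex[key] // 1. resolve key -> revision
	if !ok {
		return nil, false
	}
	if v, hit := buffer[rev]; hit { // 2. try the buffer first
		return v, true
	}
	v, ok := boltdb[rev] // 3. fall back to the B+ tree
	return v, ok
}

func main() {
	treeIndex["foo"] = revision{5, 0}
	boltdb[revision{5, 0}] = []byte("bar")
	v, ok := get("foo")
	fmt.Println(ok, string(v)) // prints: true bar
}
```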

When a user deletes a key through the boltdb API:

    1. The leaf page where the key-value resides deletes the corresponding data.
    2. The key and revision data in treeIndex are not deleted immediately; they are only marked for deletion and are removed at compaction. (Lazy deletion lets watchers observe every state of the key and prevents frequent B+ tree rebalancing.)
    3. As key-values keep being deleted, free leaf pages accumulate, and they are recorded in the freelist page.
    4. When new key-values are added later, these free pages are reused directly.
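Steps 3 and 4 — recording freed pages and handing them back before growing the file — can be sketched with a toy freelist. This is a simplified model of what boltdb's freelist achieves, not its actual implementation:

```go
package main

import "fmt"

// freelist is a toy version of boltdb's freelist: it records page ids
// that were freed and reuses them before growing the db file.
type freelist struct {
	free []uint64
	next uint64 // next brand-new page id if no free page is available
}

// allocate returns a freed page if one exists, otherwise grows the file.
func (f *freelist) allocate() uint64 {
	if n := len(f.free); n > 0 {
		id := f.free[n-1]
		f.free = f.free[:n-1]
		return id // reuse a freed page; the file does not grow
	}
	id := f.next
	f.next++
	return id // grow the file by one page
}

// release records a page as free, e.g. a leaf page emptied by deletes.
func (f *freelist) release(id uint64) { f.free = append(f.free, id) }

func main() {
	fl := &freelist{next: 4}   // pages 0-3 already in use
	fl.release(2)              // a leaf page emptied by deletes
	fmt.Println(fl.allocate()) // prints 2: the freed page is reused
	fmt.Println(fl.allocate()) // prints 4: only now does the file grow
}
```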

Because of these free pages, and the leaf pages left only partially filled after the user deletes key-values, the disk space occupied by boltdb keeps growing. The defrag operation releases this fragmented space by rewriting key-value data so that it is packed more compactly on disk.

defrag analysis

Let’s take a look at the source code of defrag again:

  • start defrag:
// server/storage/backend/backend.go
func (b *backend) defrag() error {
	
	...
	
	// create temporary db file
	dir := filepath.Dir(b.db.Path())
	temp, err := os.CreateTemp(dir, "db.tmp.*")
	
	...
	
	tdbp := temp.Name()
	tmpdb, err := bolt.Open(tdbp, 0600, &options)
	
	...

	dbp := b.db.Path()

	...
	
	// start executing defrag logic
	err = defragdb(b.db, tmpdb, defragLimit)
	...
	
	// After defrag is completed, rename the tmp file to a db file
	err = os.Rename(tdbp, dbp)
	if err != nil {
		b.lg.Fatal("failed to rename tmp database", zap.Error(err))
	}

	...
	
	return nil
}
  • defrag specific implementation logic:
// server/storage/backend/backend.go
func defragdb(odb, tmpdb *bolt.DB, limit int) error {
	// The newly created temporary boltdb starts write transactions.
	tmptx, err := tmpdb.Begin(true)
	if err != nil {
		return err
	}
	// If an exception occurs in defrag, the transaction will be rolled back.
	defer func() {
		if err != nil {
			tmptx.Rollback()
		}
	}()

	// The original boltdb starts read transactions
	tx, err := odb.Begin(false)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	c := tx.Cursor()

	count := 0
	// Traverse all top-level buckets of the original boltdb through the cursor
	for next, _ := c.First(); next != nil; next, _ = c.Next() {
		b := tx.Bucket(next)
		if b == nil {
			return fmt.Errorf("backend: cannot defrag bucket %s", string(next))
		}

		tmpb, berr := tmptx.CreateBucketIfNotExists(next)
		if berr != nil {
			return berr
		}
		tmpb.FillPercent = 0.9 // for bucket2seq write in for each

		if err = b.ForEach(func(k, v []byte) error {
			count++
			// When the key-value updated by each write transaction exceeds the limit (default 10000), a new write transaction will be reopened.
			// Avoid updating too many key-values in each write transaction and occupying too much memory.
			if count > limit {
				err = tmptx.Commit()
				if err != nil {
					return err
				}
				tmptx, err = tmpdb.Begin(true)
				if err != nil {
					return err
				}
				tmpb = tmptx.Bucket(next)
				tmpb.FillPercent = 0.9 // for bucket2seq write in for each

				count = 0
			}
			// Put the key-value obtained by traversal into the newly created temporary boltdb
			return tmpb.Put(k, v)
		}); err != nil {
			return err
		}
	}

	return tmptx.Commit()
}

As can be seen from the above code, defrag does the following things:

  • Create a temporary db file
  • Start a transaction, traverse the buckets in the old db (which hold the user's key-values), and write their contents into buckets in the new db
  • Commit the write transaction (persisting to disk) every 10,000 key-value writes (the defragLimit)
  • After all buckets' key-values have been written, rename the temporary db file to replace the original db

This process simply traverses the key-value data and rewrites it into fresh leaf pages. It releases the leaf pages with fragmented free space in the original boltdb and lets the new leaf pages be filled with key-values as fully as possible, so the disk space occupied by boltdb shrinks.
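The space saving can be illustrated with a toy repacking model: pages that deletes left half-full are rewritten into fully filled pages, which is in spirit what defrag's sequential copy into a fresh db file achieves. The page capacity and layout here are purely illustrative:

```go
package main

import "fmt"

const pageCap = 4 // a toy leaf page holds at most 4 key-values

// repack rewrites live key-values into fully filled pages, mirroring
// the effect of defrag's sequential copy into a fresh db.
func repack(pages [][]string) [][]string {
	var out [][]string
	cur := []string{}
	for _, p := range pages {
		for _, k := range p {
			cur = append(cur, k)
			if len(cur) == pageCap {
				out = append(out, cur)
				cur = []string{}
			}
		}
	}
	if len(cur) > 0 {
		out = append(out, cur) // last, possibly partial, page
	}
	return out
}

func main() {
	// After many deletes, each page is only half full: 8 live keys on 4 pages.
	fragmented := [][]string{
		{"a", "b"}, {"c", "d"}, {"e", "f"}, {"g", "h"},
	}
	compact := repack(fragmented)
	// The 2 pages saved here are what defrag gives back to the filesystem.
	fmt.Printf("pages before: %d, after: %d\n", len(fragmented), len(compact))
	// prints: pages before: 4, after: 2
}
```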

4. The difference between defrag and compact

  • defrag does not delete any data in boltdb; it frees disk space by defragmenting leaf pages, and is therefore lossless to the data.
  • compact deletes the historical version data of keys, and is therefore lossy to the data.