【Operating System】Go Runtime's MADV_FREE memory release issue

Posted by Hao Liang's Blog on Saturday, November 27, 2021

1. Background

Related issues:

runtime: memory not being returned to OS #22439

runtime: provide way to disable MADV_FREE

With applications compiled with Go 1.12 through 1.15 and running on Linux, a common symptom is that the resident set size (RSS) keeps growing the longer the process runs and never appears to be released.

2. Issue Analysis

Use pprof to analyze the memory statistics reported by the Go runtime. The fields below are those of runtime.MemStats as printed by pprof; their meanings are:

Reference: Go pprof memory indicator meaning memo

// The total number of bytes requested from the OS
// It is the sum of the various XxxSys indicators below. Contains the sum of the runtime heap, stack, and other internal data structures.
// It is virtual memory space. Not all are mapped into physical memory.
Sys

// see `Sys`
HeapSys

// The number of bytes of objects still in use and unused objects that have not been released by GC
// It should be smooth at ordinary times, but may appear jagged during gc.
HeapAlloc

// The number of bytes in spans that are in use.
// One detail is that if a span can contain multiple objects, as long as one object is in use, the entire span is counted.
// `HeapInuse` - `HeapAlloc` estimates memory that is dedicated to particular size classes but not currently used, so it can be handed out quickly.
HeapInuse

// Physical memory that has been returned to the OS and has not yet been reacquired by the heap.
HeapReleased

// The number of bytes in idle (unused) spans.
// This memory can be returned to the OS; the value also includes `HeapReleased`.
// Idle spans can be reused for heap allocation, or even converted into stack memory.
// `HeapIdle` - `HeapReleased` estimates memory that could be returned to the OS but is still retained by the runtime.
HeapIdle

// ---

// Same as `HeapAlloc`
Alloc

// Accumulated `Alloc`
// Accumulation means that it will continue to increase cumulatively after the program is started and will never decrease.
TotalAlloc

Lookups = 0

// Cumulative number of allocated heap objects
Mallocs

// Cumulative number of released heap objects
Frees

// Number of surviving objects. See `HeapAlloc`
// HeapObjects = `Mallocs` - `Frees`
HeapObjects

// ---
// The meaning of Inuse in XxxInuse below and the meaning of Sys in XxxSys are basically the same as `HeapInuse` and `HeapSys`
// There is no XxxIdle because they are all included in `HeapIdle`

// StackSys is basically equal to StackInuse, plus system thread-level stack memory
Stack = StackInuse / StackSys

// Memory used for MSpan structure
MSpan = MSpanInuse/MSpanSys

// Memory used for MCache structure
MCache = MCacheInuse/MCacheSys

// The following are the memory statistics of XxxSys used by the underlying internal data structures
BuckHashSys
GCSys
OtherSys

// ---
// The following is related to GC

// The trigger threshold for the next GC. When HeapAlloc reaches this value, GC will be triggered.
NextGC

// The time the most recent GC finished, in nanoseconds since the Unix epoch
LastGC

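// PauseNs: stop-the-world pause durations, in nanoseconds, of recent GC cycles (circular buffer)
// PauseEnd: the times at which those pauses ended, in nanoseconds since the Unix epoch (circular buffer)
// A cycle may contain several pauses; PauseNs records their sum and PauseEnd the end of the last one.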
PauseNs
PauseEnd

// The number of completed GC cycles
NumGC

// The number of times the application forces GC
NumForcedGC

// The fraction of the program's available CPU time used by GC since the program started, between 0 and 1
GCCPUFraction

Normally, as HeapReleased grows, RSS should shrink, because the Go runtime returns freed memory to the OS after GC. In this case, however, pprof shows HeapReleased growing while RSS stays at its previous level and does not decline. Why does this happen?

Let’s take a look at the Go runtime's memory-release code:

//src/runtime/mem_linux.go
func sysUnused(v unsafe.Pointer, n uintptr) {
	...
	var advise uint32
	// In Go 1.12 to 1.15, debug.madvdontneed defaults to 0, so _MADV_DONTNEED is not used by default to mark memory as reclaimable.
	if debug.madvdontneed != 0 {
		advise = _MADV_DONTNEED
	} else {
		advise = atomic.Load(&adviseUnused)
	}
	// Prioritize trying to use _MADV_FREE to mark recyclable memory
	if errno := madvise(v, n, int32(advise)); advise == _MADV_FREE && errno != 0 {
		// MADV_FREE was added in Linux 4.5. Fall back to MADV_DONTNEED if it is
		// not supported.
		atomic.Store(&adviseUnused, _MADV_DONTNEED)
		madvise(v, n, _MADV_DONTNEED)
	}
}

The code above shows that in Go 1.12 to 1.15 the runtime prefers _MADV_FREE when it calls the madvise system call to release memory. The madvise man page describes the two advice values as follows:


       MADV_DONTNEED
              ...

              Note that, when applied to shared mappings, MADV_DONTNEED
              might not lead to immediate freeing of the pages in the
              range.  The kernel is free to delay freeing the pages
              until an appropriate moment.  The resident set size (RSS)
              of the calling process will be immediately reduced
              however.
              
              ...
       MADV_FREE (since Linux 4.5)
              The application no longer requires the pages in the range
              specified by addr and len.  The kernel can thus free these
              pages, but the freeing could be delayed until memory
              pressure occurs.  For each of the pages that has been
              marked to be freed but has not yet been freed, the free
              operation will be canceled if the caller writes into the
              page.  After a successful MADV_FREE operation, any stale
              data (i.e., dirty, unwritten pages) will be lost when the
              kernel frees the pages.  However, subsequent writes to
              pages in the range will succeed and then kernel cannot
              free those dirtied pages, so that the caller can always
              see just written data.  If there is no subsequent write,
              the kernel can free the pages at any time.  Once pages in
              the range have been freed, the caller will see zero-fill-
              on-demand pages upon subsequent page references.

              ...

Put simply, when memory is released with MADV_DONTNEED, the RSS of the process drops immediately.

Memory marked with MADV_FREE is released lazily: the kernel reclaims the pages only when memory becomes tight, and until then they still count toward RSS and the process can reuse them simply by writing to them.
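The difference can be observed outside the Go heap with a small Linux-only experiment. The sketch below is an illustration under stated assumptions, not the runtime's own code: it maps 64 MiB of anonymous memory, touches every page, and prints VmRSS from /proc/self/status after each madvise call. The constant madvFree (value 8, the Linux definition of MADV_FREE) is spelled out because the frozen syscall package does not export it; parseVmRSS and rssKB are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// madvFree is the Linux value of MADV_FREE (available since Linux 4.5).
// Assumption: it is defined here because package syscall does not export it.
const madvFree = 8

// parseVmRSS extracts the VmRSS value (in kB) from the text of /proc/<pid>/status.
func parseVmRSS(status string) (uint64, bool) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "VmRSS:") {
			continue
		}
		fields := strings.Fields(line) // e.g. ["VmRSS:", "10240", "kB"]
		if len(fields) < 2 {
			return 0, false
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		return kb, err == nil
	}
	return 0, false
}

// rssKB returns the current RSS of this process in kB, or 0 if it cannot be read.
func rssKB() uint64 {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return 0
	}
	kb, _ := parseVmRSS(string(data))
	return kb
}

func main() {
	const size = 64 << 20 // 64 MiB, an arbitrary size large enough to show up in RSS
	mem, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		fmt.Println("mmap failed:", err)
		return
	}
	defer syscall.Munmap(mem)

	// Touch every page so the mapping is backed by physical memory.
	for i := 0; i < len(mem); i += os.Getpagesize() {
		mem[i] = 1
	}
	fmt.Println("after touching pages: RSS kB =", rssKB())

	if err := syscall.Madvise(mem, madvFree); err != nil {
		fmt.Println("MADV_FREE not supported here:", err)
	}
	fmt.Println("after MADV_FREE:      RSS kB =", rssKB())

	if err := syscall.Madvise(mem, syscall.MADV_DONTNEED); err != nil {
		fmt.Println("MADV_DONTNEED failed:", err)
	}
	fmt.Println("after MADV_DONTNEED:  RSS kB =", rssKB())
}
```

On a kernel that supports MADV_FREE and is not under memory pressure, the second reading would be expected to stay close to the first, and only the third to drop by roughly the size of the mapping. The exact numbers depend on the kernel and the environment.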

Go 1.16 restored MADV_DONTNEED as the default way to release memory on Linux.

Related commit: runtime: default to MADV_DONTNEED on Linux

Related code:

// src/runtime/runtime1.go
	debug.cgocheck = 1
	debug.invalidptr = 1
	if GOOS == "linux" {
		// On Linux, MADV_FREE is faster than MADV_DONTNEED,
		// but doesn't affect many of the statistics that
		// MADV_DONTNEED does until the memory is actually
		// reclaimed. This generally leads to poor user
		// experience, like confusing stats in top and other
		// monitoring tools; and bad integration with
		// management systems that respond to memory usage.
		// Hence, default to MADV_DONTNEED.
		debug.madvdontneed = 1
	}

Users on Go 1.12 to 1.15 can force the runtime to use MADV_DONTNEED by setting the environment variable GODEBUG=madvdontneed=1. For example, set it in the container spec of a Kubernetes Pod:

apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
  - env:
    - name: GODEBUG
      value: madvdontneed=1
    command:
    - etcd
...
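To confirm that the setting actually reached a process, the program can inspect its own GODEBUG variable at startup. This is a minimal sketch: godebugSetting is a hypothetical helper written for this example, and it only parses the comma-separated name=value format of the variable. It does not query the runtime's internal state.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

// godebugSetting looks up key in a GODEBUG-style string of the form
// "name=value,name=value". In this sketch the last occurrence wins.
func godebugSetting(godebug, key string) (string, bool) {
	value, found := "", false
	for _, kv := range strings.Split(godebug, ",") {
		parts := strings.SplitN(strings.TrimSpace(kv), "=", 2)
		if len(parts) == 2 && parts[0] == key {
			value, found = parts[1], true
		}
	}
	return value, found
}

func main() {
	fmt.Println("Go version:", runtime.Version())

	v, ok := godebugSetting(os.Getenv("GODEBUG"), "madvdontneed")
	if !ok {
		fmt.Println("madvdontneed is not set in GODEBUG; the runtime default applies")
		return
	}
	fmt.Println("GODEBUG madvdontneed =", v)
}
```

Printing this once in the startup log of the service is usually enough to verify that the Pod spec was applied as intended.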