Prometheus TSDB internals
Prometheus has local and remote storage. At startup, it creates a local storage, a remote storage, and a fanout storage. The fanout storage is a wrapper that combines a primary storage with multiple secondary storages and proxies reads and writes through to the underlying storages.
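Conceptually, the wrapper looks roughly like the following minimal sketch. The types and method here are illustrative only; they are not Prometheus' actual storage interfaces.
package storage

// Sample is a simplified timestamp/value pair (illustrative only).
type Sample struct {
	T int64
	V float64
}

// Storage is a simplified write-only storage interface (illustrative only).
type Storage interface {
	Write(s Sample) error
}

// fanout proxies writes to a primary storage and all secondary storages,
// mirroring the idea behind Prometheus' fanout storage.
type fanout struct {
	primary     Storage
	secondaries []Storage
}

func (f *fanout) Write(s Sample) error {
	if err := f.primary.Write(s); err != nil {
		return err
	}
	for _, sec := range f.secondaries {
		if err := sec.Write(s); err != nil {
			return err
		}
	}
	return nil
}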
Internally, Prometheus uses a struct called Head to maintain time series data in memory and to persist the in-memory data to disk.
When a local DB is opened via tsdb.Open(), it loads data from the write-ahead log and prepares the head for writes. It then starts a goroutine that compacts two-hour blocks once per minute (with exponential backoff on error).
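That background loop can be pictured roughly like the self-contained sketch below. The function name and the backoff cap are illustrative; only the once-per-minute cadence and the exponential backoff come from the behavior described above.
package tsdbsketch

import "time"

// runCompactionLoop sketches the background loop started after tsdb.Open():
// try to compact roughly once a minute and back off exponentially on error.
// compact is a stand-in for the real head/block compaction call.
func runCompactionLoop(stop <-chan struct{}, compact func() error) {
	backoff := time.Duration(0)
	for {
		select {
		case <-stop:
			return
		case <-time.After(time.Minute + backoff):
		}
		if err := compact(); err != nil {
			// Exponential backoff, capped so retries keep happening.
			if backoff == 0 {
				backoff = time.Second
			} else {
				backoff *= 2
			}
			if backoff > time.Minute {
				backoff = time.Minute
			}
			continue
		}
		backoff = 0
	}
}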
Memory series
In Prometheus, a series is a struct of type memSeries, which contains an ID (also called ref), a label set, and other fields.
A series holds a slice of in-memory chunks, each of type memChunk, which stores timestamp/value pairs (samples). Samples are appended (encoded) into the current chunk. When a certain condition is met, the chunk is marked as complete (it will later be compacted) and a new chunk is created (this is called cutting a chunk). The headChunk field always points to the chunk that samples are currently appended to.
type memSeries struct {
	sync.RWMutex

	ref          uint64
	lset         labels.Labels
	chunks       []*memChunk
	headChunk    *memChunk
	chunkRange   int64
	firstChunkID int

	nextAt        int64 // Timestamp at which to cut the next chunk.
	sampleBuf     [4]sample
	pendingCommit bool // Whether there are samples waiting to be committed to this series.

	app chunkenc.Appender // Current appender for the chunk.

	txs *txRing
}
What is the condition for cutting a new chunk? The nextAt field indicates that: if a sample's timestamp is greater than or equal to nextAt, a new chunk is created. nextAt is initially derived from the chunkRange field (2 hours by default), but once the chunk is 25% full of the target 120 samples, nextAt is recomputed based on the rate at which samples are arriving, so that the chunk ends up with roughly 120 samples. The number 120 comes from the Gorilla paper, where it offers a near-optimal compression ratio.
Tip: For more information about the Gorilla paper, please visit here.
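The recomputation of nextAt can be sketched roughly as follows; this is close to the chunk-end-time estimation used by the head in older versions, and it assumes the chunk is about a quarter full when called.
// computeChunkEndTime estimates when to cut the next chunk, given the chunk's
// start time, the current sample's timestamp, and the maximum allowed end time
// (start + chunkRange). It assumes the chunk is ~25% full and that samples
// keep arriving at the observed rate.
func computeChunkEndTime(start, cur, max int64) int64 {
	// How many chunks of the current density would fit into [start, max)?
	a := (max - start) / ((cur - start + 1) * 4)
	if a == 0 {
		return max
	}
	// Divide the remaining range so the chunk ends up with ~120 samples.
	return start + (max-start)/a
}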
Internally, memChunk is a wrapper around chunkenc.XORChunk. chunkenc.XORChunk and its appender implement the Gorilla encoding. Each XORChunk owns a bit stream into which samples are encoded and compressed. When the chunk is compacted, it checks whether the spare capacity of the underlying byte slice exceeds a threshold (32 bytes); if so, it shrinks the chunk by copying the data into a new, exactly-sized slice of bytes.
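Appending samples to an XOR chunk looks roughly like this; the import path is assumed here (in the older standalone tsdb repository it was github.com/prometheus/tsdb/chunkenc).
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/tsdb/chunkenc"
)

func main() {
	// Create an empty Gorilla-encoded (XOR) chunk and append ~120 samples.
	c := chunkenc.NewXORChunk()
	app, err := c.Appender()
	if err != nil {
		panic(err)
	}
	for i := int64(0); i < 120; i++ {
		app.Append(1000*i, float64(i)) // timestamp in ms, value
	}
	// The samples are encoded into a compact byte stream.
	fmt.Printf("samples=%d bytes=%d\n", c.NumSamples(), len(c.Bytes()))
}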
Head
Head handles reads and writes of time series data within a time window. It holds a lastSeriesID field of type uint64; the first series gets ID 1 and the ID is incremented by 1 for every new series.
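Handing out a new series ID can be sketched like this (the method name is illustrative, it requires the sync/atomic package, and it assumes lastSeriesID is a plain uint64 as described above):
// nextSeriesID sketches how the head hands out series IDs: atomically
// increment lastSeriesID, so the first series gets 1, the next 2, and so on.
func (h *Head) nextSeriesID() uint64 {
	return atomic.AddUint64(&h.lastSeriesID, 1)
}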
Stripe series
Head holds a series field of type stripeSeries. When a stripeSeries is created, its size defaults to 2^14 (16384) stripes.
stripeSeries is essentially a list of buckets. Each bucket has a map[uint64]*memSeries, a seriesHashmap, and a stripeLock. The map[uint64]*memSeries maps a series ID to the series itself. seriesHashmap maps the hash of a series' label set to a slice of *memSeries. Why a slice? Because different label sets can hash to the same value, so collisions need to be stored in a slice.
With this structure, a series can be found quickly both by its ID and by its label set. Given an ID, the bucket is located by id & (size - 1); given a label set hash, the bucket is located by hash & (size - 1).
type stripeSeries struct {
	size   int
	series []map[uint64]*memSeries
	hashes []seriesHashmap
	locks  []stripeLock
}
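A minimal sketch of the two lookups follows. getByID and getByHash are simplified stand-ins for the real methods; the sketch assumes stripeLock embeds a sync.RWMutex and treats seriesHashmap as the hash-to-slice map described above.
// getByID locates the bucket for a series ID and looks the series up in it.
// size must be a power of two so that `& (size - 1)` works as a fast modulo.
func (s *stripeSeries) getByID(id uint64) *memSeries {
	i := id & uint64(s.size-1)

	s.locks[i].RLock()
	series := s.series[i][id]
	s.locks[i].RUnlock()

	return series
}

// getByHash locates the bucket for a label-set hash and scans the collision
// slice for the series whose label set matches exactly.
func (s *stripeSeries) getByHash(hash uint64, lset labels.Labels) *memSeries {
	i := hash & uint64(s.size-1)

	s.locks[i].RLock()
	defer s.locks[i].RUnlock()

	for _, series := range s.hashes[i][hash] {
		if labels.Equal(series.lset, lset) {
			return series
		}
	}
	return nil
}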
Head appender
In every scrape, Head returns a new headAppender, which the scraper uses to add samples.
For adding samples, it has two methods: Add and AddFast. Add first checks whether the series identified by the given label set already exists in stripeSeries. If not, it creates one and adds it to stripeSeries and postings. For each label, it collects the possible values for the label name into values, and adds the label's name and value as keys of a symbols map. Note that stripeSeries, postings, values and symbols are all fields of the head. Finally, it adds the created series to the series field of headAppender (used later when writing the WAL) and calls AddFast to do the remaining work.
As its name suggests, AddFast runs faster because it doesn't need to create a series; it should only be called for a series that has already been created in the head. It looks up the series by ID in the head, and then appends the sample and the series to the samples slice and the sampleSeries slice respectively (the two slices correspond one-to-one).
type headAppender struct {
	head         *Head
	minValidTime int64 // No samples below this timestamp are allowed.
	mint, maxt   int64

	series       []record.RefSeries
	samples      []record.RefSample
	sampleSeries []*memSeries

	appendID, cleanupAppendIDsBelow uint64
}
When all samples of a scrape have been added to the headAppender, the scraper calls headAppender.Commit() to complete the append. Commit first checks whether any new series were created (series is not empty) and, if so, writes them to the write-ahead log. It then checks whether samples is non-empty and, if so, writes the samples to the write-ahead log as well. Finally, it iterates over samples and appends each sample to its series. Only at this point are all samples of the scrape truly added to their series.
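Putting it together, a scrape roughly follows the flow sketched below. This is a hedged sketch against the older Add/AddFast/Commit appender interface described above (recent Prometheus versions use a different interface), and it assumes the labels package at github.com/prometheus/prometheus/pkg/labels (the path varies by version).
// Appender mirrors the appender described above (Rollback omitted).
type Appender interface {
	Add(lset labels.Labels, t int64, v float64) (uint64, error)
	AddFast(ref uint64, t int64, v float64) error
	Commit() error
}

// scrapeOnce appends one sample, using a cached series ref for the fast path
// and falling back to Add when the series is not cached or the ref is stale.
func scrapeOnce(app Appender, cache map[string]uint64, lset labels.Labels, t int64, v float64) error {
	if ref, ok := cache[lset.String()]; ok {
		if err := app.AddFast(ref, t, v); err == nil {
			return app.Commit()
		}
		// Fall through to the slow path if the reference is no longer valid.
	}
	ref, err := app.Add(lset, t, v)
	if err != nil {
		return err
	}
	cache[lset.String()] = ref
	return app.Commit()
}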
Compact
The compactor first checks whether the head is compactable and, if so, compacts it. After that, it compacts persisted blocks. The head is considered compactable when its time range covers more than 1.5 times the chunk range.
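The check itself is tiny; it looks roughly like this sketch, which is close to the actual condition in older versions (1.5 times the chunk range expressed as three half-ranges):
// compactable reports whether the head covers more than 1.5x the chunk range,
// at which point a persistent block can be cut from it.
func (h *Head) compactable() bool {
	return h.MaxTime()-h.MinTime() > h.chunkRange/2*3
}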
Compact head
For this operation, it mainly calls Compactor.Write. It allocates a new ULID and initializes a BlockMeta with the min/max time range and compaction information. It then creates a temporary directory named <ULID>.tmp, populates the block, and writes the BlockMeta to a meta.json file in the block's root directory. After that, it creates an empty tombstones file by passing in an empty in-memory tombstone reader. Finally, it renames the temporary directory to the ULID.
The relevant code location, LeveledCompactor.write, is shown in the snippet below.
func (c *LeveledCompactor) write(dest string, meta *BlockMeta, blocks ...BlockReader) (err error) {
...
This mirrors the doc comment on DB.Compact: compact data if possible; after a successful compaction, blocks are reloaded, which also triggers deletion of blocks that fall out of the retention window. If no blocks are compacted, the retention window state doesn't change, so this alone is sufficient to reliably delete old data. Old blocks are only deleted on reload, based on the new block's parent information (see the DB.reload documentation for further details).
Postings
Postings are stored in a struct of type MemPostings, which holds a two-level map from label name to label value to series IDs. That is, the key of the top-level map is a label name and its value is another map; the key of that second-level map is a label value and its value is a slice of series IDs. This makes it fast to look up all series IDs for a given label.
It also maintains a special empty label, referred to by allPostingsKey. When a new series ID is added for a set of labels, the ID is also added under that empty label, so all series IDs can be retrieved quickly from a single postings entry.
type MemPostings struct {
	mtx     sync.RWMutex
	m       map[string]map[string][]uint64
	ordered bool
}
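A minimal sketch of how a series ID is registered in such a structure, including the special all-postings entry, follows. The add method and addToPostings helper are illustrative names, and the sketch assumes labels.Labels is a slice of name/value pairs as in older versions.
// add records a series ID under every label of its label set, plus under the
// special empty label so that "all series" can be listed from one entry.
func (p *MemPostings) add(id uint64, lset labels.Labels) {
	p.mtx.Lock()
	defer p.mtx.Unlock()

	addToPostings := func(name, value string) {
		vals, ok := p.m[name]
		if !ok {
			vals = map[string][]uint64{}
			p.m[name] = vals
		}
		vals[value] = append(vals[value], id)
	}

	for _, l := range lset {
		addToPostings(l.Name, l.Value)
	}
	// The empty label pair acts as the "all postings" key.
	addToPostings("", "")
}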
On-disk layout
The directory structure of a Prometheus server’s data directory looks something like this:
./data
├── 01BKGV7JBM69T2G1BGBGM6KB12
│   └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
│   └── meta.json
├── 01BKGV7JC0RY8A6MACW02A2PJD
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
└── wal
    ├── 00000002
    └── checkpoint.000001
Chunks
Each chunks file is composed of the following:
- A header section which contains a magic number, a format version, and some padding bytes.
- Multiple chunk sections, each of which contains a chunk of encoded timestamp/value pairs.
Chunk data is split into segment files; a new segment file is started when the current one exceeds 512MiB. A chunk in a segment file is referenced from the index file by a uint64: the lower 4 bytes are the chunk's offset within the file, and the upper 4 bytes are the segment sequence number.
Tip: For more information about the chunks file format, please visit here.
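Encoding and decoding such a reference can be sketched as follows (the helper names are illustrative):
// chunkRef packs a segment sequence number and a byte offset into a single
// uint64 reference: the upper 4 bytes hold the sequence number, the lower
// 4 bytes hold the offset within that segment file.
func chunkRef(seq, offset uint32) uint64 {
	return uint64(seq)<<32 | uint64(offset)
}

// splitChunkRef is the inverse: it recovers the segment sequence number and
// the offset from a reference stored in the index file.
func splitChunkRef(ref uint64) (seq, offset uint32) {
	return uint32(ref >> 32), uint32(ref)
}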
Index
The index file format is composed of the following sections:
- A header section which contains a magic number and format version.
- A symbol table section which holds a sorted list of deduplicated strings that are referenced by subsequent sections; this reduces the total index size.
- A series section which contains a sequence of series entries. Each entry holds the series' label set and references to its chunks; each chunk reference carries the chunk's time range and the offset in the chunks file at which the chunk data begins.
- Multiple label index sections which are no longer used.
- Multiple postings sections, each of which contains a list of the series associated with a given label name and value.
- A label index table section which is no longer used.
- A postings table section which contains a sequence of postings offset entries; each entry holds a label pair and the offset in the index file at which the corresponding postings (a sequence of series) begin.
- A TOC (table of contents) section which contains the offsets to the beginnings of the above sections.
Tip: For more information about the index file format, please visit here.
Tombstones
The tombstones file is composed of the following:
- A header section which contains a magic number and format version.
- A tombstones section which contains a sequence of tombstones; each tombstone contains a series ref and a time range.
Tip: For more information about the tombstones file format, please visit here.