prometheus-junkyard / tsdb

The Prometheus time series database layer.
License: Apache License 2.0
After ungraceful termination, samples in the WAL sometimes reference series that do not exist yet. This should strictly never happen, as all newly created series are added (in order) to the WAL before any of their samples.
Needs investigation, but no critical impact at this stage.
Currently block ranges allow fine-grained configuration. Practically, this has little to no benefit for Prometheus. It adds a lot of complexity, however, when dealing with those blocks in other ways, for example in backup procedures.
I propose switching to a time slicing approach where possible block ranges after different compaction levels are fixed, as well as their position in time. This prevents arbitrary overlapping and interleaving of blocks at a global level.
Suppose we want a maximum block size of 9 days, then our valid ranges are as follows:
The largest range is aligned with the 0 timestamp (epoch in Prometheus), smaller ones are always aligned with their parent range. Thus compaction will always result in larger blocks fitting the slices.
This can still be configurable in TSDB, especially for tests, but should be hard-coded in the using application.
For Prometheus I would fix it to the above values.
Alignment with wall-clock weeks would have been nice, but is not all that important. TSDB has no notion of wall clocks, and since Unix time 0 does not align with the beginning of a week, supporting this would just require a bunch of number shifting that doesn't seem worth it.
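The alignment rule above can be sketched in a few lines. This is a minimal illustration, not the tsdb implementation: the function name, the millisecond unit, and the empty-remainder handling are all assumptions; the only point is that every slice start is a fixed multiple of the slice width, counted from timestamp 0.

```go
package main

import "fmt"

// rangeStart returns the start of the fixed-width slice containing t,
// aligned to the Unix epoch (timestamp 0). Floor semantics so negative
// timestamps align correctly too. Because each larger range width is a
// multiple of the smaller ones, compacted blocks always fit the slices.
func rangeStart(t, width int64) int64 {
	r := t % width
	if r < 0 {
		r += width
	}
	return t - r
}

func main() {
	const day = int64(24 * 60 * 60 * 1000) // milliseconds
	// A 3-day slice containing a timestamp on day 7 starts at day 6.
	fmt.Println(rangeStart(7*day+123, 3*day) / day) // 6
}
```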
It seems that labels must be sorted before passing them to (*headAppender).Add(), otherwise the label sets will be considered distinct.
We should either ensure the labels are sorted (if the overhead is not significant) or document that the labels must be sorted.
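A tiny sketch of the pitfall, assuming nothing about the real API: the Label type and the key function here are stand-ins for tsdb's labels package, used only to show that the same pairs in a different order produce a different series identity unless sorted first.

```go
package main

import (
	"fmt"
	"sort"
)

// Label mimics the shape of tsdb's labels.Label for illustration.
type Label struct{ Name, Value string }

// sortLabels orders a label set by name, which is what the head
// appender effectively expects; unsorted input would be treated as a
// distinct (duplicate) series.
func sortLabels(ls []Label) []Label {
	sort.Slice(ls, func(i, j int) bool { return ls[i].Name < ls[j].Name })
	return ls
}

// key is a stand-in for however the appender identifies a label set.
func key(ls []Label) string { return fmt.Sprint(ls) }

func main() {
	a := sortLabels([]Label{{"job", "node"}, {"instance", "a:9100"}})
	b := sortLabels([]Label{{"instance", "a:9100"}, {"job", "node"}})
	fmt.Println(key(a) == key(b)) // true once both are sorted
}
```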
I got this crash on two different test servers scraping the same targets and having the same rules (but the two traces originate in different rules; one is a recording rule, the other is an alert, so I assume it's not a specific "query of death"). Both call traces follow:
This is built from commit 8c483e27d3d53b39ee0211c28eeeff7626c9ac99 in prometheus/prometheus.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1626deb]
goroutine 24276602 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.newChunkSeriesIterator(0xc54107e358, 0x1, 0x1, 0xc661317860)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:559 +0x3b
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*chunkSeries).Iterator(0xc6fd83ee70, 0x15ee512, 0xc4e8db21e0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:463 +0x41
github.com/prometheus/prometheus/storage/tsdb.series.Iterator(0x26a0940, 0xc6fd83ee70, 0x10, 0xc4e8db63c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:114 +0x31
github.com/prometheus/prometheus/storage/tsdb.(*series).Iterator(0xc54107e360, 0xc4e8db21e0, 0x493e0)
<autogenerated>:15 +0x56
github.com/prometheus/prometheus/promql.(*Engine).populateIterators.func2(0x7fd8658e8e28, 0xc4924a0c00, 0xeee5679e69)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:497 +0x1bd
github.com/prometheus/prometheus/promql.inspector.Visit(0xc83071f6c0, 0x7fd8658e8e28, 0xc4924a0c00, 0x7fd8658e8e28, 0x18ba1a0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:306 +0x3a
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8e28, 0xc4924a0c00)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:255 +0x58
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x269d0c0, 0xc83071f6e0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:275 +0x1d1
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8e80, 0xc74d118d60)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:285 +0x6ab
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8dd0, 0xc72d2c1c20)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:278 +0x510
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8d20, 0xc87f43dbc0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:281 +0x90c
github.com/prometheus/prometheus/promql.Inspect(0x7fd8658e8d20, 0xc87f43dbc0, 0xc83071f6c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:316 +0x4b
github.com/prometheus/prometheus/promql.(*Engine).populateIterators(0xc4f94df8c0, 0x7fd880377000, 0xc4924a0cc0, 0xc72d2c1c70, 0x48, 0x1a0ab60, 0x17d7c80, 0xc4924a0b40)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:513 +0x2f5
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc4f94df8c0, 0x7fd880377000, 0xc4924a0cc0, 0xc87f43dc00, 0xc72d2c1c70, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:348 +0x117
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc4f94df8c0, 0x7fd880377000, 0xc4924a0cc0, 0xc87f43dc00, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:328 +0x3a0
github.com/prometheus/prometheus/promql.(*query).Exec(0xc87f43dc00, 0x7fd8658e3000, 0xc466fde0c0, 0xed09fb730)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:171 +0x52
github.com/prometheus/prometheus/rules.RecordingRule.Eval(0xc4f07e766d, 0x23, 0x26a4440, 0xc4991cf680, 0x0, 0x0, 0x0, 0x7fd8658e3000, 0xc466fde0c0, 0xed09fb730, ...)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/recording.go:57 +0x109
github.com/prometheus/prometheus/rules.(*RecordingRule).Eval(0xc4991cf8c0, 0x7fd8658e3000, 0xc466fde0c0, 0xed09fb730, 0x17faa0e2, 0x275ad60, 0xc4f94df8c0, 0xc4204df73f, 0x0, 0xc4fc08b900, ...)
<autogenerated>:8 +0xe7
github.com/prometheus/prometheus/rules.(*Group).Eval.func1(0xc93696d020, 0x1b1a91f, 0x9, 0xc42064a9b0, 0xed09fb730, 0x17faa0e2, 0x275ad60, 0x26ac440, 0xc4991cf8c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:273 +0x1b1
created by github.com/prometheus/prometheus/rules.(*Group).Eval
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:321 +0x174
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1626deb]
goroutine 27015751 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.newChunkSeriesIterator(0xc7e5073288, 0x1, 0x1, 0xc8d5c71aa0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:559 +0x3b
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*chunkSeries).Iterator(0xc951d1f6b0, 0xc4221a9000, 0x455570)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:463 +0x41
github.com/prometheus/prometheus/storage/tsdb.series.Iterator(0x26a0940, 0xc951d1f6b0, 0xc8d5c78e80, 0xc5f0b5e580)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:114 +0x31
github.com/prometheus/prometheus/storage/tsdb.(*series).Iterator(0xc7e5073290, 0xc8d5c78e80, 0x493e0)
<autogenerated>:15 +0x56
github.com/prometheus/prometheus/promql.(*Engine).populateIterators.func2(0x7f521f953950, 0xc4eafa2600, 0xeee5679e69)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:497 +0x1bd
github.com/prometheus/prometheus/promql.inspector.Visit(0xc6e6060400, 0x7f521f953950, 0xc4eafa2600, 0x7f521f953950, 0x10)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:306 +0x3a
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f953950, 0xc4eafa2600)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:255 +0x58
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f953848, 0xc6e60582c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:281 +0x90c
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f9538f8, 0xc73a114190)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:278 +0x510
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f953848, 0xc6e6058340)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:281 +0x90c
github.com/prometheus/prometheus/promql.Inspect(0x7f521f953848, 0xc6e6058340, 0xc6e6060400)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:316 +0x4b
github.com/prometheus/prometheus/promql.(*Engine).populateIterators(0xc4203cb240, 0x7f521c113a30, 0xc4eafa26c0, 0xc73a1141e0, 0x48, 0x1a0ab60, 0x17d7c80, 0xc4eafa2540)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:513 +0x2f5
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc4203cb240, 0x7f521c113a30, 0xc4eafa26c0, 0xc6e6058380, 0xc73a1141e0, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:348 +0x117
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc4203cb240, 0x7f521c113a30, 0xc4eafa26c0, 0xc6e6058380, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:328 +0x3a0
github.com/prometheus/prometheus/promql.(*query).Exec(0xc6e6058380, 0x7f521f9dc4f8, 0xc420572c80, 0xed09fb76b)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:171 +0x52
github.com/prometheus/prometheus/rules.(*AlertingRule).Eval(0xc4201b6930, 0x7f521f9dc4f8, 0xc420572c80, 0xed09fb76b, 0xc5bcac8, 0x275ad60, 0xc4203cb240, 0xc42026fb5f, 0x0, 0x0, ...)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/alerting.go:159 +0x145
github.com/prometheus/prometheus/rules.(*Group).Eval.func1(0xc832a9a810, 0x1b18bfe, 0x8, 0xc420536b90, 0xed09fb76b, 0xc5bcac8, 0x275ad60, 0x26ac400, 0xc4201b6930)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:273 +0x1b1
created by github.com/prometheus/prometheus/rules.(*Group).Eval
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:321 +0x174
Especially when compacting larger time ranges, there are very notable spikes in memory usage.
Under querying load, the memory occupied by queries smooths the spikes to a good degree, but not entirely.
It would be good to reduce the spikes a fair bit. For that we first need a proper understanding of which allocations are actually causing them. Profiling so far has not given a clear answer.
At a glance, I noticed that no LICENSE or similar file is present, and none of the source files contain the typical license block from other Prometheus projects.
I assume this was just a minor oversight?
/cc @fabxc
Currently we write lists of chunks to a chunk file. Each chunk starts with a flag for its used compression. A CRC32 checksum over all chunks of an insert is written at the end. I'm currently pondering a few things:
I think for the last three points there's no clearly right way to do it, so I'm just interested in opinions.
goroutine 330766 [running]:
net/http.(*conn).serve.func1(0xc42a4dc000)
/usr/local/Cellar/go/1.8/libexec/src/net/http/server.go:1721 +0xd0
panic(0x24d7fa0, 0x3346720)
/usr/local/Cellar/go/1.8/libexec/src/runtime/panic.go:489 +0x2cf
github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb.(*headBlock).Querier.func1.1(0x74, 0x66, 0x1011300)
/Users/fabxc/repos/src/github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb/head.go:222 +0xb6
sort.medianOfThree_func(0xc432c9ac48, 0xc4331160c0, 0x74, 0x66, 0x58)
/usr/local/Cellar/go/1.8/libexec/src/sort/zfuncversion.go:53 +0x3e
sort.doPivot_func(0xc432c9ac48, 0xc4331160c0, 0x0, 0x75, 0x3c01770, 0x0)
/usr/local/Cellar/go/1.8/libexec/src/sort/zfuncversion.go:78 +0x5f0
sort.quickSort_func(0xc432c9ac48, 0xc4331160c0, 0x0, 0x75, 0xe)
/usr/local/Cellar/go/1.8/libexec/src/sort/zfuncversion.go:143 +0x80
sort.Slice(0x23c81c0, 0xc4331160a0, 0xc432c9ac48)
/usr/local/Cellar/go/1.8/libexec/src/sort/sort.go:251 +0xe3
github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb.(*headBlock).Querier.func1(0x32ad7a0, 0xc433116080, 0x1, 0x32ad7a0)
/Users/fabxc/repos/src/github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb/head.go:223 +0x23e
github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb.(*blockQuerier).Select(0xc42c2e0bc0, 0xc42c2ea690, 0x1, 0x1, 0xc42c2e1040, 0xc42c2e9bc0)
This occasionally happens after a fair amount of spam-querying.
This would help us query the remote storage from the correct timestamp and also enable other use cases such as prometheus/prometheus#2988.
This is trivial to implement, as it would be a method on *DB where we just return the MinTime of the oldest block.
We are currently testing the package within the package itself, i.e. via package tsdb as opposed to package tsdb_test.
I have seen that testing the API as a "user" of the package usually works well in the initial stages. We can have _internal_test.go files if we want to test internal functions. This will also give us a little more clarity into what to expose and what not to.
If you think this is a good approach, I will move the existing tests to be external.
Ref: 2nd point in https://medium.com/@benbjohnson/structuring-tests-in-go-46ddee7a25c#.rf85iysgp
In aggregation queries over longer time ranges, we see artifacts where result graphs are briefly interrupted.
This probably happens at block boundaries but is not currently caught by PromQL tests, even ones spanning multiple blocks.
This is probably a minor bug that just needs to be tracked down. Finding a PromQL test that can reproduce the behavior would be a good first step.
a.activeWriters may be decremented twice when Commit() returns an error:
func (a *headAppender) Commit() error {
	defer atomic.AddUint64(&a.activeWriters, ^uint64(0)) // decrements activeWriters
	defer putHeadAppendBuffer(a.samples)
	defer a.mtx.RUnlock()

func (a *headAppender) Rollback() error {
	a.mtx.RUnlock()
	atomic.AddUint64(&a.activeWriters, ^uint64(0)) // decrements again if called after a failed Commit
Currently blocks are stored with sequence numbers attached to them that reflect the ordering of their covered intervals (b-00001, b-00002, ...).
This is mostly nice for running ls and seeing the blocks and their size in order, but it has little purpose otherwise. Once we start backfilling, we potentially have to rename them all, or add a buffer in front and start with b-10001, for example. Overall, it might not be worth it.
We might just want to rename the block directory to the ULID that's also shown in their meta.json file.
There are various use cases to transform/move/... older data in some way. For example downsampling or shipping it off into a LTS. There are many ways to build those tools and ideally it won't be a concern of the core tsdb.
In theory, tsdb should act sufficiently atomically on all file systems we aim to support so that external tools could do those things out of band. We would need a way to toggle compaction (c.f. #4) and to trigger a reload of the file system state in applications using tsdb.
Currently we track "how old" a block is roughly by the "level" field in the meta.json file. It is generally incremented on every compaction. With compactions happening more frequently triggered by other mechanisms, e.g. for deletes, that field becomes relatively ambiguous/meaningless.
I propose to instead have a compaction.ancestors field, which contains a list of all blocks that are contained within the block. On compaction of multiple blocks, those lists are simply merged.
They can obviously get rather long (168 entries for 1 week at 1h min block size), but not beyond reasonable.
It could just be a list of ULIDs. But it's probably a good idea to make them objects, so we can add other fields in the future.
This gives us a generally better idea of how much compaction was done in total. It is also really helpful when reconciling backed-up data.
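The proposal above could look roughly like this in code. All names here are illustrative guesses (the actual meta.json schema is not fixed by this issue): ancestor entries are objects rather than bare ULID strings so fields can be added later, and merging on compaction is plain list concatenation.

```go
package main

import "fmt"

// blockDesc identifies an ancestor block. An object rather than a bare
// ULID string, so more fields can be added in the future, as proposed.
type blockDesc struct {
	ULID string `json:"ulid"`
}

// compactionMeta sketches the compaction section of meta.json with the
// proposed ancestors list; field names are assumptions.
type compactionMeta struct {
	Ancestors []blockDesc `json:"ancestors"`
}

// mergeAncestors combines the ancestor lists of the compacted source
// blocks into the new block's list.
func mergeAncestors(srcs ...compactionMeta) compactionMeta {
	var out compactionMeta
	for _, s := range srcs {
		out.Ancestors = append(out.Ancestors, s.Ancestors...)
	}
	return out
}

func main() {
	a := compactionMeta{Ancestors: []blockDesc{{"01A"}, {"01B"}}}
	b := compactionMeta{Ancestors: []blockDesc{{"01C"}}}
	fmt.Println(len(mergeAncestors(a, b).Ancestors)) // 3
}
```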
Currently we can query multiple SeriesSets via label selectors but cannot merge two of them together while eliminating duplicates.
Internally we have similar functionality, so this is merely about making it accessible. We mostly need this for federation, where we can have multiple AND'ed selectors (e.g. abc{de=~"fg"}) which are OR'd together.
Either we provide an external method OR'ing SeriesSets or extend the querying interface to accept more complex selectors, which would generally be trees.
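The OR-with-deduplication step can be sketched over plain sorted slices. This is a simplification under stated assumptions: real SeriesSets are iterators over label sets, and the string keys here are stand-ins for lexicographically ordered label sets; only the merge-and-drop-duplicates logic is the point.

```go
package main

import "fmt"

// mergeSorted unions two lexicographically sorted lists of series keys
// (stand-ins for label sets), dropping duplicates, the way an OR over
// two sorted SeriesSets would.
func mergeSorted(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // identical label set on both sides: emit once
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	fmt.Println(mergeSorted([]string{"a", "c"}, []string{"b", "c", "d"}))
	// [a b c d]
}
```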
When accessing an index file we currently create a list of postings which we wrap in a list iterator.
Instead we should implement an iterator directly acting on the plain bytes.
https://github.com/fabxc/tsdb/blob/2ef3682560a31bd03f0ba70eb6ec509512ad0de8/index.go#L701-L714
@gouthamve this should be a low hanging fruit but interesting as it is a nice example of efficiency of using iterators.
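The iterator-over-bytes idea can be shown in miniature. The fixed-width big-endian uint32 encoding below is an assumption made for illustration (the real index encoding is defined by the linked code); the point is that Next decodes lazily from the raw buffer instead of materializing a slice of postings first.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bigEndianPostings iterates directly over raw index bytes instead of
// first building a []uint32 and wrapping it in a list iterator.
// Assumes fixed-width big-endian uint32 series IDs.
type bigEndianPostings struct {
	data []byte
	cur  uint32
}

// Next decodes the next ID on demand; no up-front allocation.
func (p *bigEndianPostings) Next() bool {
	if len(p.data) < 4 {
		return false
	}
	p.cur = binary.BigEndian.Uint32(p.data)
	p.data = p.data[4:]
	return true
}

func (p *bigEndianPostings) At() uint32 { return p.cur }

func main() {
	raw := []byte{0, 0, 0, 7, 0, 0, 0, 42}
	it := &bigEndianPostings{data: raw}
	for it.Next() {
		fmt.Println(it.At())
	}
	// 7
	// 42
}
```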
An entry at the end of the index file documenting where the main sections start (and end) would help make analysis easier. For example, all postings lists of the inverted indices are currently stored uncompressed, which unexpectedly has very little size overhead.
Better understanding of what is how expensive size-wise makes prioritising improvements easier.
Currently all references/offsets to entries in index or chunk files are stored as absolute positions within those files. One consideration was making them relative to the start of their respective section. This needs said TOC in the index file as a prerequisite.
I was writing tests for blockQuerier and realised that it doesn't check that the returned values are inside (mint, maxt).
So blockQuerier returns all the values from a chunk if the chunk is partially inside (mint, maxt). This is because blockSeriesSet has no info about (mint, maxt), and populatedChunkSeries returns all chunks partially inside the time range without removing the values outside (mint, maxt).
At https://github.com/prometheus/tsdb/blob/778103b45060697e8452f3d00a6e8fe1f11306da/db.go#L522
There's:
// Store last byte of sequence number in 3rd byte of refernece.
return ref | (uint64(h.meta.Sequence^0xff) << 40), nil
The ^ should be a &. Similarly in AddFast.
The new model of Blocks that have time ranges lends itself to providing long-term storage of Prometheus data. Instead of just keeping Blocks on local storage and deleting old ones past a certain time, it would be possible to have them be pushed to blob storage like S3 or GCS.
The benefits are many:
The current compaction algorithm compacts all blocks (including old ones) according to "current-running" parameters of TSDB. We'd need to introduce a marker, compaction-fully-complete, which means that a given Block should no longer be compacted. Such a block would be subject to asynchronous upload to a blob storage bucket.
If TSDB has a blob storage configured, it will sync a configurable (tsdb.min.synced.horizon=4d) amount of blocks from blob storage to disk. This means that even if Prometheus is started from scratch, it can serve historic data in case its local disk gets corrupted.
WAL files are a problem, but I think it is acceptable to tell users that a certain amount of data loss can happen.
In case a query comes in that is not satisfiable from the local disk (because the stored Block horizon is not "wide enough"), it should be possible for Prometheus to download Blocks from cold storage on demand. The queries will be slow, but IMHO that's acceptable for old data.
This presents the problem of how much and for how long such "on demand" past data should be kept. I think a good solution here would be an LRU cache of a configured size, which would allow control of how much disk space is used.
Of course, hacking this into TSDB itself is a bad idea. However, all of the above can be implemented as a wrapper for a real TSDB, with the same interface that TSDB exposes:
Currently we cut a new chunk at 130 samples. If the sampling frequency fits 140 samples into a block, that results in uneven chunks. There are many similar scenarios causing similar imbalances.
The XOR encoding reaches good compression at 30 samples, near-ideal at 60, and certainly ideal at about 120 (c.f. http://www.vldb.org/pvldb/vol8/p1816-teller.pdf, page 1820).
I would propose to make a decision at the 30th sample of every chunk. We look at the chunk's average sample frequency so far and extrapolate how many more samples will fit until the end of the block.
We then check how many idealized chunks of 120 samples are necessary to fill it. We round that down to err on the side of better compression and determine the end timestamp of the chunk based on that.
Naive formula:
s = start time of current chunk
s' = start time of next chunk
m = max timestamp for the block
r = time range of chunk so far
s' = s + (m - s) / floor((m - s) / r / 4)
There are probably smarter heuristics. But this one is cheap, just runs once per chunk and does what we want most of the time I think.
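The naive formula above can be sketched directly. The function name and the run-to-block-end fallback for the degenerate case (fewer than one idealized chunk remaining, where the floor would be zero) are my assumptions; the arithmetic is exactly the formula as stated.

```go
package main

import "fmt"

// nextChunkEnd implements the naive formula: for a chunk starting at s
// that spans r so far (at its 30th sample), with m the max timestamp of
// the block, count how many idealized 120-sample chunks (4x the
// 30-sample range) fit into the remaining block, rounding down, and cut
// the current chunk accordingly: s' = s + (m - s) / floor((m - s) / (4r)).
func nextChunkEnd(s, m, r int64) int64 {
	n := (m - s) / (r * 4) // floor: idealized chunks that still fit
	if n <= 0 {
		return m // less than one ideal chunk left: run to block end
	}
	return s + (m-s)/n
}

func main() {
	// 30 samples over 300 time units, 3600 units left in the block:
	// three idealized chunks fit, so the chunk is cut at s + 1200.
	fmt.Println(nextChunkEnd(0, 3600, 300)) // 1200
}
```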
It should be calling Rollback.
https://github.com/prometheus/tsdb/blob/778103b45060697e8452f3d00a6e8fe1f11306da/db.go#L657
This TODO in the querier points out an important optimization: https://github.com/fabxc/tsdb/blob/a4be181d3cb836e7e77a412374f19230cfc9cf46/querier.go#L165-L177
Currently we take a label matcher, iterate over all values for the label, and add all matching values to our result. We then look up all postings lists for the resulting label/value pairs.
This is necessary for regexes or potentially user-defined matchers. But for the base matchers we define, we can do better – in particular for a trivial equality matcher.
We can try to upgrade the matcher interface to the equality matcher and then skip the iteration altogether and just look up the postings list directly.
That saves a lot of time when equality matching high-cardinality labels such as the instance label.
Later, one could do similar things for regexp matching. A lot of regexp matchers have a fixed prefix, which could be extracted to limit the total range of label values that are matched against the total regexp. This works because the label value index is sorted.
@gouthamve the first part is a low hanging fruit.
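The "upgrade" step can be shown with a type assertion. Everything here is a simplified stand-in (the matcher interface, equalMatcher, and a map as the label-value index are illustrative, not tsdb's types); the point is that the equality case skips scanning every label value.

```go
package main

import "fmt"

// matcher is a minimal stand-in for tsdb's label matcher interface.
type matcher interface{ Matches(v string) bool }

// equalMatcher is the trivial equality case we can upgrade to.
type equalMatcher struct{ value string }

func (m equalMatcher) Matches(v string) bool { return v == m.value }

// matchingValues shows the proposed fast path: for an equality matcher,
// skip iterating all label values and look up the one value directly.
// The index is just a set of known values, for illustration.
func matchingValues(idx map[string]bool, m matcher) []string {
	if em, ok := m.(equalMatcher); ok { // upgrade: direct lookup
		if idx[em.value] {
			return []string{em.value}
		}
		return nil
	}
	var out []string // slow path: scan every value
	for v := range idx {
		if m.Matches(v) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	idx := map[string]bool{"a:9100": true, "b:9100": true}
	fmt.Println(matchingValues(idx, equalMatcher{"a:9100"})) // [a:9100]
}
```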
We should add support to query and compact time-overlapping blocks. Probably no need for sophisticated handling of overlapping chunks.
This is a rather complex one and we have to discuss specifics. But once done, it makes our life for restoring from backups etc. a lot easier.
The numSeries and numSamples values after a block with tombstones is compacted are wrong.
While I have the fix ready for numSeries, numSamples is a little tricky. When we are re-encoding a chunk, we do not know the number of samples the old chunk holds.
cc @fabxc
I think we should add a method to IndexReader that is of the form:
SelectPostings(...labels.Matcher) Postings
Right now, we have the same functionality being implemented by the querier here: https://github.com/fabxc/tsdb/blob/master/querier.go#L125-L139
Why?
Indexing makes more sense for this compared to the querier. We will probably be able to use the internal data structures better instead of having to stick to the interface.
Also, we are instantiating a new Querier every time; if we want to add optimisations, such as caching, adding them to the index makes more sense.
Challenges
Because of the iterator pattern, we are lazily evaluating the "absent" metrics. But luckily, Postings is also an iterator, hence we can bring in the lazy evaluation here as well.
On startup we allocate one block in the past that won't get any data, to guarantee the time window (from the most recent sample into the past) in which data can be appended.
Typically this will not get any data and on compaction an empty block is created. With no series written, there's also no index. On querying that block we get an "index doesn't exist" error.
We should either stash the empty block completely or ensure that at least an empty index is created.
As we have indexes for all individual label pairs, the requirement that an index on the metric name always exists comes from the level of the using application. So the former option is probably preferable.
We run into similar problems when Prometheus has been shut down for a long time and a bunch of minimum-sized blocks is created to fill the empty time. (That happens because we currently ensure that the entire time range covered by blocks has no holes or overlaps.)
So I am working on tests and I am not sure if the behaviour of SeriesIterator.Seek is right. There are two variables (ts, tt) in the comment and I am assuming both are the same.
According to the definition, if there is a chunk with values at 1, 2, 4, 5, Seek(3) should return the value at 2, no? I am seeing that we return the value at 4.
Implement bulk imports for prefilling.
For federation in Prometheus we want fast access to the most recent sample. This is currently not served by the interface we have which only allows forward-streaming access to a series for a time window.
That's the interface actually reflecting properly how data is stored in the database. We do have an efficient cache of the most recent sample, but that is strictly speaking an implementation detail. Meaning: extending the SeriesIterator interface by a Last() call would seem like a workaround at best.
First, need for such a fast access should be benchmarked again in context of the new storage. If it turns out to still be necessary, I'm leaning towards exposing the in-memory block type and making it accessible – potentially as yet another interface.
The federation layer would then use this instead of the general SeriesIterator interface.
Sounds sane? @beorn7 @brian-brazil
https://github.com/prometheus/tsdb/blob/778103b45060697e8452f3d00a6e8fe1f11306da/head.go#L308
headAppender.Add uses a 31-bit random number to generate refs. A collision is likely at about 64k new series from a target/rule, which is almost certainly going to happen.
An approach such as a central counter would be safer, though 31 bits still feels a tad small.
We want to provide a way to delete data based on time series selectors as well as time windows. For persisted (generally immutable) blocks, this can be implemented by adding tombstone records that add information on deleted data.
Those tombstones are considered when querying and resolved on the next compaction of a block by actually dropping the data.
For in-memory blocks, we likely want to do something similar as cleaning up the inverted indices is too complex/expensive.
On top of this fine-grained retention policies can be added instead of a global one for all series.
Some of the files in https://github.com/prometheus/tsdb/tree/master/chunks are in large part adapted code from https://github.com/dgryski/go-tsz, but currently we just have the Prometheus copyright header on the top of those files. We should follow https://github.com/dgryski/go-tsz/blob/master/LICENSE.
I scrape 150k samples per second; without rules, CPU usage is 40% (v1.5.2: 300%).
After I added about 300 rules for aggregation, the CPU is fully used with 20% iowait (NVMe device). (The machine has 38 CPU cores; v1.5.2 used 600% without iowait.)
Attached file is the pprof result.
evaluation_interval: 1s
Our design generally allows easy backups. Necessary steps for this are:
To disable/enable the compactor and trigger snapshots we can simply rely on the using application to make the method accessible or provide an optional convenience HTTP server in the tsdb directly.
Prometheus currently has no downsampling support. It can be achieved via federation, but it's way too messy.
Maybe it is now possible to integrate downsampling support into the compaction process?
Another useful feature would be to have different TTLs for different metrics.
For example, in our setup a lot of metrics are aggregated via recording rules, and after "recording" they are never queried again.
I'm seeing a panic due to concurrent map writes in (*headAppender).Add():
https://github.com/prometheus/tsdb/blob/25d45465189ea9b7f2c894188c06d1853bfa79ba/head.go#L322-L323
Is the intention that the caller of Add() should implement their own locking? Is there a reason that Add() shouldn't manage a mutex itself?
If this is a documentation issue, I'm happy to raise a PR adding a comment to clarify.
So #103 is currently failing as we are depending on prometheus/prometheus for tests and a change broke the compatibility.
While for now the hack works, we ideally need to remove the dependency.
/cc @fabxc @brian-brazil
The procedure for more than two lists of different set operations works similarly. So the number of k set operations merely modifies the factor (O(k*n)) instead of the exponent (O(n^k)) of our worst-case lookup runtime. A great improvement.
-- https://fabxc.org/blog/2017-04-10-writing-a-tsdb/
That's not the most efficient way of performing a k-way merge. Your implementation indeed has O(nk) time complexity, but this short drop-in replacement:
func Merge(its ...Postings) Postings {
l := len(its)
switch l {
case 0:
return nil
case 1:
return its[0]
default:
m := l / 2
return newMergePostings(Merge(its[:m]...), Merge(its[m:]...))
}
}
has O(n log k) (hope I got it right). There are many other efficient implementations of k-way merge; one of the most popular (I think) is based on a binary heap (container/heap in Go land).
I could prepare a couple of implementations and benchmark them. Alternatively, we can take a look how other search engines do that (https://github.com/blevesearch/bleve ?) and go with that.
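The heap variant mentioned above can be sketched over a minimal Postings stand-in. The interface and the slice-backed iterator here are assumptions for self-containment (the real Postings yields uint64 refs via Next/At); the merge itself is a standard O(n log k) union with duplicate elimination using container/heap.

```go
package main

import (
	"container/heap"
	"fmt"
)

// postings is a minimal stand-in for tsdb's Postings iterator.
type postings interface {
	Next() bool
	At() uint64
}

// slicePostings iterates a sorted slice, for demonstration.
type slicePostings struct {
	s []uint64
	i int
}

func (p *slicePostings) Next() bool { p.i++; return p.i <= len(p.s) }
func (p *slicePostings) At() uint64 { return p.s[p.i-1] }

// postingsHeap orders live iterators by their current value.
type postingsHeap []postings

func (h postingsHeap) Len() int            { return len(h) }
func (h postingsHeap) Less(i, j int) bool  { return h[i].At() < h[j].At() }
func (h postingsHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *postingsHeap) Push(x interface{}) { *h = append(*h, x.(postings)) }
func (h *postingsHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// mergeAll performs an O(n log k) union of k sorted postings lists,
// dropping duplicates: repeatedly emit the smallest head, then advance
// that iterator and restore the heap property.
func mergeAll(its ...postings) []uint64 {
	h := make(postingsHeap, 0, len(its))
	for _, it := range its {
		if it.Next() {
			h = append(h, it)
		}
	}
	heap.Init(&h)
	var out []uint64
	for h.Len() > 0 {
		it := h[0]
		if v := it.At(); len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		if it.Next() {
			heap.Fix(&h, 0)
		} else {
			heap.Pop(&h)
		}
	}
	return out
}

func main() {
	a := &slicePostings{s: []uint64{1, 3, 5}}
	b := &slicePostings{s: []uint64{2, 3, 6}}
	fmt.Println(mergeAll(a, b)) // [1 2 3 5 6]
}
```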
func (s *populatedChunkSeries) Next() bool {
	for s.set.Next() {
		lset, chks := s.set.At()
		for i, c := range chks {
			if c.MaxTime < s.mint {
				chks = chks[1:]
				continue
			}
The length of chks is changed above, so the code below will not work as intended. It will panic with "slice bounds out of range".
I think it should be something like:
	var out int
	for i, c := range chks {
		if c.MaxTime < s.mint {
			out++
			continue
		}
		................
	}
	chks = chks[out:]
	if len(chks) == 0 {
		continue
	}
A memory block receives series from writers in random order. Based on that it builds the inverted index, which relies on monotonically increasing IDs for each series.
For querying, we want series to be in lexicographic order of their label sets to efficiently merge them between blocks.
For that we have a position mapper, which can efficiently reorder an iteration over the inverted index and give us the needed order on querying. Whenever a new series appears, the position mapper has to be updated.
To be strict by default, this is currently checked before querying, and the query blocks until the mapper is updated. This guarantees that any series is immediately visible after successful insertion. It also causes some queries to respond slowly (sorting hundreds to thousands of label sets may take several seconds) and is severely impacted in environments where a new series appears every few seconds.
Two points here:
Does a series have to be queryable as soon as Commit() on the insertion has returned, or are we fine with a delay of a few seconds? Should queriers block, or the writer creating a new series?
This might be a bit of an abstract issue as it's a deeper internal.
This project does not vendor its dependencies, so I had to go get them manually.
But when I tried to run tests I got the following.
$ go test
# github.com/prometheus/tsdb
head_test.go:26:2: cannot find package "github.com/prometheus/prometheus/pkg/labels" in any of:
/usr/local/go/src/github.com/prometheus/prometheus/pkg/labels (from $GOROOT)
/Users/telendt/goprojects/src/github.com/prometheus/prometheus/pkg/labels (from $GOPATH)
FAIL github.com/prometheus/tsdb [setup failed]
If I switch github.com/prometheus/prometheus to dev-2.0, I get:
$ go test
# github.com/prometheus/tsdb
./compact.go:60: undefined: prometheus.Registerer
./compact.go:90: undefined: prometheus.Registerer
./db.go:133: undefined: prometheus.Registerer
./db.go:155: undefined: prometheus.Registerer
FAIL github.com/prometheus/tsdb [build failed]
What's the prometheus version you test it against?
Comparing label sets is relatively expensive. The block design requires us to do a lot of those comparisons when merging series with the same label sets from different blocks.
This mostly affects range queries and gets more significant the more blocks a query spans. The performance penalty becomes relatively stronger the more series are queried.
While blocks should generally stay independent, it should be possible to compute (and potentially persist) mappings of series IDs from different block, which could significantly speed up merges.
This will be particularly tricky for memory blocks where series constantly get added.
There's a deadlock that hits Prometheus servers under load some time within 12 hours typically. It is known to occur in servers that are never queried so is likely located in the write path.
Goroutine dumps have proven entirely useless, as all waiting goroutines are waiting to acquire a lock.
Thus, it's most likely an inconsistent locking order or a locked procedure exiting without unlocking again.
I plan to work on tests most of next week. Should I help set up Travis while at it?
From a quick observation something about doing negative regular expression matching is not working correctly.
- ls that gives a tabular view of blocks and info about them

Lots of things are possible, but not all of them are super useful beyond being possible in theory.
Some are possible through Prometheus' APIs in one way or another. But it seems fine to have multiple interfaces to the same functionality.
I've just read Fabian's blog post, which gives a high-level introduction to the new V3 storage. I'd like to get more familiar with the binary layout of the index/chunks, but I'm not sure where to look (should I dive deep into the source code?). Are there any documents around that describe it, and if so, can you link them in the README?