prometheus-junkyard / tsdb

The Prometheus time series database layer.
License: Apache License 2.0
After ungraceful termination, samples in the WAL sometimes reference series that do not exist yet. This should strictly never happen, as all newly created series are added (in order) to the WAL before any of their samples.
Needs investigation, but no critical impact at this stage.
Currently block ranges allow fine-grained configuration. Practically, this has little to no benefit for Prometheus. It adds a lot of complexity, however, when dealing with those blocks in other ways, for example in backup procedures.
I propose switching to a time slicing approach where possible block ranges after different compaction levels are fixed, as well as their position in time. This prevents arbitrary overlapping and interleaving of blocks at a global level.
Suppose we want a maximum block size of 9 days, then our valid ranges are as follows:
The largest range is aligned with the 0 timestamp (epoch in Prometheus), smaller ones are always aligned with their parent range. Thus compaction will always result in larger blocks fitting the slices.
This can still be configurable in TSDB, especially for tests, but should be hard-coded in the using application.
For Prometheus I would fix it to the above values.
Alignment with wall-clock weeks would have been nice, but is not all that important. TSDB has no notion of wall clocks, and since Unix time 0 does not align with the beginning of a week, supporting this would just require a bunch of number shifting that doesn't seem worth it.
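The alignment rule above can be sketched in a few lines. This is a minimal illustration, not the tsdb implementation: the function name, the millisecond unit, and the empty-remainder handling are all assumptions; the only point is that every slice start is a fixed multiple of the slice width, counted from timestamp 0.

```go
package main

import "fmt"

// rangeStart returns the start of the fixed-width slice containing t,
// aligned to the Unix epoch (timestamp 0). Floor semantics so negative
// timestamps align correctly too. Because each larger range width is a
// multiple of the smaller ones, compacted blocks always fit the slices.
func rangeStart(t, width int64) int64 {
	r := t % width
	if r < 0 {
		r += width
	}
	return t - r
}

func main() {
	const day = int64(24 * 60 * 60 * 1000) // milliseconds
	// A 3-day slice containing a timestamp on day 7 starts at day 6.
	fmt.Println(rangeStart(7*day+123, 3*day) / day) // 6
}
```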
It seems that labels must be sorted before passing them to (*headAppender).Add(), otherwise the label sets will be considered distinct.
We should either ensure the labels are sorted (if the overhead is not significant) or document that the labels must be sorted.
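A tiny sketch of the pitfall, assuming nothing about the real API: the Label type and the key function here are stand-ins for tsdb's labels package, used only to show that the same pairs in a different order produce a different series identity unless sorted first.

```go
package main

import (
	"fmt"
	"sort"
)

// Label mimics the shape of tsdb's labels.Label for illustration.
type Label struct{ Name, Value string }

// sortLabels orders a label set by name, which is what the head
// appender effectively expects; unsorted input would be treated as a
// distinct (duplicate) series.
func sortLabels(ls []Label) []Label {
	sort.Slice(ls, func(i, j int) bool { return ls[i].Name < ls[j].Name })
	return ls
}

// key is a stand-in for however the appender identifies a label set.
func key(ls []Label) string { return fmt.Sprint(ls) }

func main() {
	a := sortLabels([]Label{{"job", "node"}, {"instance", "a:9100"}})
	b := sortLabels([]Label{{"instance", "a:9100"}, {"job", "node"}})
	fmt.Println(key(a) == key(b)) // true once both are sorted
}
```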
I got this crash on two different test servers scraping the same targets and having the same rules (but the two traces originate in different rules; one is a recording rule, the other is an alert, so I assume it's not a specific "query of death"). Both call traces follow:
This is built from commit 8c483e27d3d53b39ee0211c28eeeff7626c9ac99 in prometheus/prometheus.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1626deb]
goroutine 24276602 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.newChunkSeriesIterator(0xc54107e358, 0x1, 0x1, 0xc661317860)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:559 +0x3b
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*chunkSeries).Iterator(0xc6fd83ee70, 0x15ee512, 0xc4e8db21e0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:463 +0x41
github.com/prometheus/prometheus/storage/tsdb.series.Iterator(0x26a0940, 0xc6fd83ee70, 0x10, 0xc4e8db63c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:114 +0x31
github.com/prometheus/prometheus/storage/tsdb.(*series).Iterator(0xc54107e360, 0xc4e8db21e0, 0x493e0)
<autogenerated>:15 +0x56
github.com/prometheus/prometheus/promql.(*Engine).populateIterators.func2(0x7fd8658e8e28, 0xc4924a0c00, 0xeee5679e69)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:497 +0x1bd
github.com/prometheus/prometheus/promql.inspector.Visit(0xc83071f6c0, 0x7fd8658e8e28, 0xc4924a0c00, 0x7fd8658e8e28, 0x18ba1a0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:306 +0x3a
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8e28, 0xc4924a0c00)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:255 +0x58
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x269d0c0, 0xc83071f6e0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:275 +0x1d1
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8e80, 0xc74d118d60)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:285 +0x6ab
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8dd0, 0xc72d2c1c20)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:278 +0x510
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc83071f6c0, 0x7fd8658e8d20, 0xc87f43dbc0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:281 +0x90c
github.com/prometheus/prometheus/promql.Inspect(0x7fd8658e8d20, 0xc87f43dbc0, 0xc83071f6c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:316 +0x4b
github.com/prometheus/prometheus/promql.(*Engine).populateIterators(0xc4f94df8c0, 0x7fd880377000, 0xc4924a0cc0, 0xc72d2c1c70, 0x48, 0x1a0ab60, 0x17d7c80, 0xc4924a0b40)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:513 +0x2f5
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc4f94df8c0, 0x7fd880377000, 0xc4924a0cc0, 0xc87f43dc00, 0xc72d2c1c70, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:348 +0x117
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc4f94df8c0, 0x7fd880377000, 0xc4924a0cc0, 0xc87f43dc00, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:328 +0x3a0
github.com/prometheus/prometheus/promql.(*query).Exec(0xc87f43dc00, 0x7fd8658e3000, 0xc466fde0c0, 0xed09fb730)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:171 +0x52
github.com/prometheus/prometheus/rules.RecordingRule.Eval(0xc4f07e766d, 0x23, 0x26a4440, 0xc4991cf680, 0x0, 0x0, 0x0, 0x7fd8658e3000, 0xc466fde0c0, 0xed09fb730, ...)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/recording.go:57 +0x109
github.com/prometheus/prometheus/rules.(*RecordingRule).Eval(0xc4991cf8c0, 0x7fd8658e3000, 0xc466fde0c0, 0xed09fb730, 0x17faa0e2, 0x275ad60, 0xc4f94df8c0, 0xc4204df73f, 0x0, 0xc4fc08b900, ...)
<autogenerated>:8 +0xe7
github.com/prometheus/prometheus/rules.(*Group).Eval.func1(0xc93696d020, 0x1b1a91f, 0x9, 0xc42064a9b0, 0xed09fb730, 0x17faa0e2, 0x275ad60, 0x26ac440, 0xc4991cf8c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:273 +0x1b1
created by github.com/prometheus/prometheus/rules.(*Group).Eval
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:321 +0x174
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1626deb]
goroutine 27015751 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.newChunkSeriesIterator(0xc7e5073288, 0x1, 0x1, 0xc8d5c71aa0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:559 +0x3b
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*chunkSeries).Iterator(0xc951d1f6b0, 0xc4221a9000, 0x455570)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/querier.go:463 +0x41
github.com/prometheus/prometheus/storage/tsdb.series.Iterator(0x26a0940, 0xc951d1f6b0, 0xc8d5c78e80, 0xc5f0b5e580)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/storage/tsdb/tsdb.go:114 +0x31
github.com/prometheus/prometheus/storage/tsdb.(*series).Iterator(0xc7e5073290, 0xc8d5c78e80, 0x493e0)
<autogenerated>:15 +0x56
github.com/prometheus/prometheus/promql.(*Engine).populateIterators.func2(0x7f521f953950, 0xc4eafa2600, 0xeee5679e69)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:497 +0x1bd
github.com/prometheus/prometheus/promql.inspector.Visit(0xc6e6060400, 0x7f521f953950, 0xc4eafa2600, 0x7f521f953950, 0x10)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:306 +0x3a
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f953950, 0xc4eafa2600)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:255 +0x58
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f953848, 0xc6e60582c0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:281 +0x90c
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f9538f8, 0xc73a114190)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:278 +0x510
github.com/prometheus/prometheus/promql.Walk(0x269d140, 0xc6e6060400, 0x7f521f953848, 0xc6e6058340)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:281 +0x90c
github.com/prometheus/prometheus/promql.Inspect(0x7f521f953848, 0xc6e6058340, 0xc6e6060400)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/ast.go:316 +0x4b
github.com/prometheus/prometheus/promql.(*Engine).populateIterators(0xc4203cb240, 0x7f521c113a30, 0xc4eafa26c0, 0xc73a1141e0, 0x48, 0x1a0ab60, 0x17d7c80, 0xc4eafa2540)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:513 +0x2f5
github.com/prometheus/prometheus/promql.(*Engine).execEvalStmt(0xc4203cb240, 0x7f521c113a30, 0xc4eafa26c0, 0xc6e6058380, 0xc73a1141e0, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:348 +0x117
github.com/prometheus/prometheus/promql.(*Engine).exec(0xc4203cb240, 0x7f521c113a30, 0xc4eafa26c0, 0xc6e6058380, 0x0, 0x0, 0x0, 0x0)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:328 +0x3a0
github.com/prometheus/prometheus/promql.(*query).Exec(0xc6e6058380, 0x7f521f9dc4f8, 0xc420572c80, 0xed09fb76b)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/promql/engine.go:171 +0x52
github.com/prometheus/prometheus/rules.(*AlertingRule).Eval(0xc4201b6930, 0x7f521f9dc4f8, 0xc420572c80, 0xed09fb76b, 0xc5bcac8, 0x275ad60, 0xc4203cb240, 0xc42026fb5f, 0x0, 0x0, ...)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/alerting.go:159 +0x145
github.com/prometheus/prometheus/rules.(*Group).Eval.func1(0xc832a9a810, 0x1b18bfe, 0x8, 0xc420536b90, 0xed09fb76b, 0xc5bcac8, 0x275ad60, 0x26ac400, 0xc4201b6930)
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:273 +0x1b1
created by github.com/prometheus/prometheus/rules.(*Group).Eval
/home/bjoern/gocode/src/github.com/prometheus/prometheus/rules/manager.go:321 +0x174
Especially when compacting larger time ranges, there are very notable spikes in memory usage.
Under querying load, the memory occupied by queries smooths the spikes to a good degree, but not entirely.
It would be good to reduce the spikes a fair bit. For that we first need a proper understanding of which allocations are actually causing them. Profiling so far has not given a clear answer.
At a glance, I noticed that no LICENSE or similar file is present, and none of the source files contain the typical license block from other Prometheus projects.
I assume this was just a minor oversight?
/cc @fabxc
Currently we write lists of chunks to a chunk file. Each chunk starts with a flag for its used compression. A CRC32 checksum over all chunks of an insert is written at the end. I'm currently pondering a few things:
I think for the last three points there's no clearly right way to do it, so I'm just interested in opinions.
goroutine 330766 [running]:
net/http.(*conn).serve.func1(0xc42a4dc000)
/usr/local/Cellar/go/1.8/libexec/src/net/http/server.go:1721 +0xd0
panic(0x24d7fa0, 0x3346720)
/usr/local/Cellar/go/1.8/libexec/src/runtime/panic.go:489 +0x2cf
github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb.(*headBlock).Querier.func1.1(0x74, 0x66, 0x1011300)
/Users/fabxc/repos/src/github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb/head.go:222 +0xb6
sort.medianOfThree_func(0xc432c9ac48, 0xc4331160c0, 0x74, 0x66, 0x58)
/usr/local/Cellar/go/1.8/libexec/src/sort/zfuncversion.go:53 +0x3e
sort.doPivot_func(0xc432c9ac48, 0xc4331160c0, 0x0, 0x75, 0x3c01770, 0x0)
/usr/local/Cellar/go/1.8/libexec/src/sort/zfuncversion.go:78 +0x5f0
sort.quickSort_func(0xc432c9ac48, 0xc4331160c0, 0x0, 0x75, 0xe)
/usr/local/Cellar/go/1.8/libexec/src/sort/zfuncversion.go:143 +0x80
sort.Slice(0x23c81c0, 0xc4331160a0, 0xc432c9ac48)
/usr/local/Cellar/go/1.8/libexec/src/sort/sort.go:251 +0xe3
github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb.(*headBlock).Querier.func1(0x32ad7a0, 0xc433116080, 0x1, 0x32ad7a0)
/Users/fabxc/repos/src/github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb/head.go:223 +0x23e
github.com/prometheus/prometheus/vendor/github.com/fabxc/tsdb.(*blockQuerier).Select(0xc42c2e0bc0, 0xc42c2ea690, 0x1, 0x1, 0xc42c2e1040, 0xc42c2e9bc0)
This occasionally happens after a fair amount of spam-querying.
This would help us query the remote storage from the correct timestamp and also enable other use cases such as prometheus/prometheus#2988.
This is trivial to implement, as it would be a method on *DB where we just return the MinTime of the oldest block.
We are currently testing the package within the package itself, i.e. via package tsdb as opposed to package tsdb_test.
I have seen that testing the API as a "user" of the package usually works well in the initial stages. We can have _internal_test.go files if we want to test internal functions. This will also give us a little more clarity into what to expose and what not to.
If you think this is a good approach, I will move the existing tests to be external.
Ref: 2nd point in https://medium.com/@benbjohnson/structuring-tests-in-go-46ddee7a25c#.rf85iysgp
In aggregation queries over longer time ranges, we see artifacts where result graphs are briefly interrupted.
This probably happens at block boundaries but is not currently caught by PromQL tests, even ones spanning multiple blocks.
This is probably a minor bug that just needs to be tracked down. Finding a PromQL test that can reproduce the behavior would be a good first step.
a.activeWriters may be decremented twice when Commit() returns an error:
func (a *headAppender) Commit() error {
	defer atomic.AddUint64(&a.activeWriters, ^uint64(0)) // decrements activeWriters
	defer putHeadAppendBuffer(a.samples)
	defer a.mtx.RUnlock()

func (a *headAppender) Rollback() error {
	a.mtx.RUnlock()
	atomic.AddUint64(&a.activeWriters, ^uint64(0)) // decrements again if called after a failed Commit
Currently blocks are stored with sequence numbers attached to them that reflect the ordering of their covered intervals (b-00001, b-00002, ...).
This is mostly nice for running ls and seeing the blocks and their size in order, but it has little purpose otherwise. Once we start backfilling, we potentially have to rename them all, or add a buffer in front and start with b-10001, for example. Overall, it might not be worth it.
We might just want to rename the block directory to the ULID that's also shown in their meta.json file.
There are various use cases to transform/move/... older data in some way. For example downsampling or shipping it off into a LTS. There are many ways to build those tools and ideally it won't be a concern of the core tsdb.
In theory, tsdb should act sufficiently atomically on all file systems we aim to support so that external tools could do those things out of band. We would need a way to toggle compaction (c.f. #4) and to trigger a reload of the file system state in applications using tsdb.
Currently we track "how old" a block is roughly by the "level" field in the meta.json file. It is generally incremented on every compaction. With compactions happening more frequently triggered by other mechanisms, e.g. for deletes, that field becomes relatively ambiguous/meaningless.
I propose to instead have a compaction.ancestors field, which contains a list of all blocks that are contained within the block. On compaction of multiple blocks, those lists are simply merged.
They can obviously get rather long (168 entries for 1 week at 1h min block size), but not beyond reasonable.
It could just be a list of ULIDs. But it's probably a good idea to make them objects, so we can add other fields in the future.
This gives us a generally better idea of how much compaction was done in total. It is also really helpful when reconciling backed-up data.
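The proposal above could look roughly like this in code. All names here are illustrative guesses (the actual meta.json schema is not fixed by this issue): ancestor entries are objects rather than bare ULID strings so fields can be added later, and merging on compaction is plain list concatenation.

```go
package main

import "fmt"

// blockDesc identifies an ancestor block. An object rather than a bare
// ULID string, so more fields can be added in the future, as proposed.
type blockDesc struct {
	ULID string `json:"ulid"`
}

// compactionMeta sketches the compaction section of meta.json with the
// proposed ancestors list; field names are assumptions.
type compactionMeta struct {
	Ancestors []blockDesc `json:"ancestors"`
}

// mergeAncestors combines the ancestor lists of the compacted source
// blocks into the new block's list.
func mergeAncestors(srcs ...compactionMeta) compactionMeta {
	var out compactionMeta
	for _, s := range srcs {
		out.Ancestors = append(out.Ancestors, s.Ancestors...)
	}
	return out
}

func main() {
	a := compactionMeta{Ancestors: []blockDesc{{"01A"}, {"01B"}}}
	b := compactionMeta{Ancestors: []blockDesc{{"01C"}}}
	fmt.Println(len(mergeAncestors(a, b).Ancestors)) // 3
}
```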
Currently we can query multiple SeriesSets via label selectors but cannot merge two of them together while eliminating duplicates.
Internally we have similar functionality, so this is merely about making it accessible. We mostly need this for federation, where we can have multiple AND'ed selectors (e.g. abc{de=~"fg"}) which are OR'd together.
Either we provide an external method OR'ing SeriesSets or extend the querying interface to accept more complex selectors, which would generally be trees.
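The OR-with-deduplication step can be sketched over plain sorted slices. This is a simplification under stated assumptions: real SeriesSets are iterators over label sets, and the string keys here are stand-ins for lexicographically ordered label sets; only the merge-and-drop-duplicates logic is the point.

```go
package main

import "fmt"

// mergeSorted unions two lexicographically sorted lists of series keys
// (stand-ins for label sets), dropping duplicates, the way an OR over
// two sorted SeriesSets would.
func mergeSorted(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // identical label set on both sides: emit once
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	fmt.Println(mergeSorted([]string{"a", "c"}, []string{"b", "c", "d"}))
	// [a b c d]
}
```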
When accessing an index file we currently create a list of postings which we wrap in a list iterator.
Instead we should implement an iterator directly acting on the plain bytes.
https://github.com/fabxc/tsdb/blob/2ef3682560a31bd03f0ba70eb6ec509512ad0de8/index.go#L701-L714
@gouthamve this should be a low hanging fruit but interesting as it is a nice example of efficiency of using iterators.
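The iterator-over-bytes idea can be shown in miniature. The fixed-width big-endian uint32 encoding below is an assumption made for illustration (the real index encoding is defined by the linked code); the point is that Next decodes lazily from the raw buffer instead of materializing a slice of postings first.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bigEndianPostings iterates directly over raw index bytes instead of
// first building a []uint32 and wrapping it in a list iterator.
// Assumes fixed-width big-endian uint32 series IDs.
type bigEndianPostings struct {
	data []byte
	cur  uint32
}

// Next decodes the next ID on demand; no up-front allocation.
func (p *bigEndianPostings) Next() bool {
	if len(p.data) < 4 {
		return false
	}
	p.cur = binary.BigEndian.Uint32(p.data)
	p.data = p.data[4:]
	return true
}

func (p *bigEndianPostings) At() uint32 { return p.cur }

func main() {
	raw := []byte{0, 0, 0, 7, 0, 0, 0, 42}
	it := &bigEndianPostings{data: raw}
	for it.Next() {
		fmt.Println(it.At())
	}
	// 7
	// 42
}
```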
An entry at the end of the index file documenting where the main sections start (and end) would help make analysis easier. For example, all postings lists of the inverted indices are currently stored uncompressed, which unexpectedly has very little size overhead.
Better understanding of what is how expensive size-wise makes prioritising improvements easier.
Currently all references/offsets to entries in index or chunk files are stored as absolute positions within those files. One consideration was making them relative to the start of their respective section. This needs said TOC in the index file as a prerequisite.
I was writing tests for blockQuerier and realised that it doesn't check that the returned values are inside (mint, maxt).
So blockQuerier returns all the values from a chunk if the chunk is partially inside (mint, maxt). This is because blockSeriesSet has no info about (mint, maxt), and populatedChunkSeries returns all chunks partially inside the time range without removing the values outside (mint, maxt).
At https://github.com/prometheus/tsdb/blob/778103b45060697e8452f3d00a6e8fe1f11306da/db.go#L522
There's:
// Store last byte of sequence number in 3rd byte of refernece.
return ref | (uint64(h.meta.Sequence^0xff) << 40), nil
The ^ should be a &. Similarly in AddFast.
The new model of Blocks that have time ranges lends itself to providing long-term storage of Prometheus data. Instead of just keeping Blocks on local storage and deleting old ones past a certain time, it would be possible to have them be pushed to blob storage like S3 or GCS.
The benefits are many:
The current compaction algorithm compacts all blocks (including old ones) according to "current-running" parameters of TSDB. We'd need to introduce a marker, compaction-fully-complete, which means that a given Block should no longer be compacted. Such a block would be subject to asynchronous upload to a blob storage bucket.
If TSDB has a blob storage configured, it will sync a configurable (tsdb.min.synced.horizon=4d) amount of blocks from blob storage to disk. This means that even if Prometheus is started from scratch, it can serve historic data in case its local disk gets corrupted.
WAL files are a problem, but I think it is acceptable to tell users that a certain amount of data loss can happen.
In case a query comes in that is not satisfiable from the local disk (because the stored Block horizon is not "wide enough"), it should be possible for Prometheus to download Blocks from cold storage on demand. The queries will be slow, but IMHO that's acceptable for old data.
This presents the problem of how much and for how long such "on demand" past data should be kept. I think a good solution here would be an LRU cache of a configured size, which would allow control of how much disk space is used.
Of course, hacking this into TSDB itself is a bad idea. However, all of the above can be implemented as a wrapper for a real TSDB, with the same interface that TSDB exposes:
Currently we cut a new chunk at 130 samples. If the sampling frequency fits 140 samples into a block, that results in uneven chunks. There are many similar scenarios causing similar imbalances.
The XOR encoding reaches good compression at 30 samples, near-ideal at 60, and certainly ideal at about 120 (c.f. http://www.vldb.org/pvldb/vol8/p1816-teller.pdf, page 1820).
I would propose to make a decision at the 30th sample of every chunk. We look at the chunk's average sample frequency so far and extrapolate how many more samples will fit until the end of the block.
We then check how many idealized chunks of 120 samples are necessary to fill it. We round that down to err on the side of better compression and determine the end timestamp of the chunk based on that.
Naive formula:
s = start time of current chunk
s' = start time of next chunk
m = max timestamp for the block
r = time range of chunk so far
s' = s + (m - s) / floor((m - s) / r / 4)
There are probably smarter heuristics. But this one is cheap, just runs once per chunk and does what we want most of the time I think.
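The naive formula above can be sketched directly. The function name and the run-to-block-end fallback for the degenerate case (fewer than one idealized chunk remaining, where the floor would be zero) are my assumptions; the arithmetic is exactly the formula as stated.

```go
package main

import "fmt"

// nextChunkEnd implements the naive formula: for a chunk starting at s
// that spans r so far (at its 30th sample), with m the max timestamp of
// the block, count how many idealized 120-sample chunks (4x the
// 30-sample range) fit into the remaining block, rounding down, and cut
// the current chunk accordingly: s' = s + (m - s) / floor((m - s) / (4r)).
func nextChunkEnd(s, m, r int64) int64 {
	n := (m - s) / (r * 4) // floor: idealized chunks that still fit
	if n <= 0 {
		return m // less than one ideal chunk left: run to block end
	}
	return s + (m-s)/n
}

func main() {
	// 30 samples over 300 time units, 3600 units left in the block:
	// three idealized chunks fit, so the chunk is cut at s + 1200.
	fmt.Println(nextChunkEnd(0, 3600, 300)) // 1200
}
```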
It should be calling Rollback.
https://github.com/prometheus/tsdb/blob/778103b45060697e8452f3d00a6e8fe1f11306da/db.go#L657
This TODO in the querier points out an important optimization: https://github.com/fabxc/tsdb/blob/a4be181d3cb836e7e77a412374f19230cfc9cf46/querier.go#L165-L177
Currently we take a label matcher, iterate over all values for the label, and add all matching values to our result. We then look up all postings lists for the resulting label/value pairs.
This is necessary for regexes or potentially user-defined matchers. But for the base matchers we define, we can do better – in particular for a trivial equality matcher.
We can try to upgrade the matcher interface to the equality matcher and then skip the iteration altogether and just look up the postings list directly.
That saves a lot of time when equality matching high-cardinality labels such as the instance label.
Later, one could do similar things for regexp matching. A lot of regexp matchers have a fixed prefix, which could be extracted to limit the total range of label values that are matched against the total regexp. This works because the label value index is sorted.
@gouthamve the first part is a low hanging fruit.
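The "upgrade" step can be shown with a type assertion. Everything here is a simplified stand-in (the matcher interface, equalMatcher, and a map as the label-value index are illustrative, not tsdb's types); the point is that the equality case skips scanning every label value.

```go
package main

import "fmt"

// matcher is a minimal stand-in for tsdb's label matcher interface.
type matcher interface{ Matches(v string) bool }

// equalMatcher is the trivial equality case we can upgrade to.
type equalMatcher struct{ value string }

func (m equalMatcher) Matches(v string) bool { return v == m.value }

// matchingValues shows the proposed fast path: for an equality matcher,
// skip iterating all label values and look up the one value directly.
// The index is just a set of known values, for illustration.
func matchingValues(idx map[string]bool, m matcher) []string {
	if em, ok := m.(equalMatcher); ok { // upgrade: direct lookup
		if idx[em.value] {
			return []string{em.value}
		}
		return nil
	}
	var out []string // slow path: scan every value
	for v := range idx {
		if m.Matches(v) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	idx := map[string]bool{"a:9100": true, "b:9100": true}
	fmt.Println(matchingValues(idx, equalMatcher{"a:9100"})) // [a:9100]
}
```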
We should add support to query and compact time-overlapping blocks. Probably no need for sophisticated handling of overlapping chunks.
This is a rather complex one and we have to discuss specifics. But once done, it makes our life for restoring from backups etc. a lot easier.
The numSeries and numSamples values after a block with tombstones is compacted are wrong.
While I have the fix ready for numSeries, numSamples is a little tricky. When we are re-encoding a chunk, we do not know the number of samples the old chunk holds.
cc @fabxc
I think we should add a method to IndexReader that is of the form:
SelectPostings(...labels.Matcher) Postings
Right now, we have the same functionality being implemented by the querier here: https://github.com/fabxc/tsdb/blob/master/querier.go#L125-L139
Why?
Indexing makes more sense for this compared to the querier. We will probably be able to use the internal data structures better instead of having to stick to the interface.
Also, we are instantiating a new Querier every time; if we want to add optimisations, such as caching, adding them to the index makes more sense.
Challenges
Because of the iterator pattern, we are lazily evaluating the "absent" metrics. But luckily, Postings is also an iterator, hence we can bring in the lazy evaluation here as well.
On startup we allocate one block in the past that won't get any data, to guarantee the time window (from the most recent sample into the past) in which data can be appended.
Typically this will not get any data and on compaction an empty block is created. With no series written, there's also no index. On querying that block we get an "index doesn't exist" error.
We should either stash the empty block completely or ensure that at least an empty index is created.
As we have indexes for all individual label pairs, the requirement that an index on the metric name always exists comes from the level of the using application. So the former option is probably preferable.
We run into similar problems when Prometheus has been shut down for a long time and a bunch of minimum-sized blocks is created to fill the empty time. (That happens because we currently ensure that the entire time range covered by blocks has no holes or overlaps.)
So I am working on tests and I am not sure if the behaviour of SeriesIterator.Seek is right. There are two variables (ts, tt) in the comment and I am assuming both are the same.
According to the definition, if there is a chunk with values at 1, 2, 4, 5, Seek(3) should return the value at 2, no? I am seeing that we return the value at 4.
Implement bulk imports for prefilling.
For federation in Prometheus we want fast access to the most recent sample. This is currently not served by the interface we have which only allows forward-streaming access to a series for a time window.
That's the interface actually reflecting properly how data is stored in the database. We do have an efficient cache of the most recent sample, but that is strictly speaking an implementation detail. Meaning: extending the SeriesIterator interface by a Last() call would seem like a workaround at best.
First, need for such a fast access should be benchmarked again in context of the new storage. If it turns out to still be necessary, I'm leaning towards exposing the in-memory block type and making it accessible – potentially as yet another interface.
The federation layer would then use this instead of the general SeriesIterator interface.
Sounds sane? @beorn7 @brian-brazil
https://github.com/prometheus/tsdb/blob/778103b45060697e8452f3d00a6e8fe1f11306da/head.go#L308
headAppender.Add uses a 31-bit random number to generate refs. A collision is likely at about 64k new series from a target/rule, which is almost certainly going to happen.
An approach such as a central counter would be safer, though 31 bits still feels a tad small.
We want to provide a way to delete data based on time series selectors as well as time windows. For persisted (generally immutable) blocks, this can be implemented by adding tombstone records that add information on deleted data.
Those tombstones are considered when querying and resolved on the next compaction of a block by actually dropping the data.
For in-memory blocks, we likely want to do something similar as cleaning up the inverted indices is too complex/expensive.
On top of this fine-grained retention policies can be added instead of a global one for all series.
Some of the files in https://github.com/prometheus/tsdb/tree/master/chunks are in large part adapted code from https://github.com/dgryski/go-tsz, but currently we just have the Prometheus copyright header on the top of those files. We should follow https://github.com/dgryski/go-tsz/blob/master/LICENSE.
I scrape 150k samples per second; without rules, CPU usage is 40% (v1.5.2: 300%).
After I added about 300 rules for aggregation, the CPU is fully used with 20% iowait (NVMe device). (The machine has 38 CPU cores; v1.5.2 used 600% without iowait.)
Attached file is the pprof result.
evaluation_interval: 1s
Our design generally allows easy backups. Necessary steps for this are:
To disable/enable the compactor and trigger snapshots we can simply rely on the using application to make the method accessible or provide an optional convenience HTTP server in the tsdb directly.
Prometheus currently has no downsampling support. It can be achieved via federation, but it's way too messy.
Maybe it is now possible to integrate downsampling support into the compaction process?
Another useful feature would be to have different TTLs for different metrics.
For example, in our setup a lot of metrics are aggregated via recording rules, and after "recording" they are never queried again.
I'm seeing a panic due to concurrent map writes in (*headAppender).Add():
https://github.com/prometheus/tsdb/blob/25d45465189ea9b7f2c894188c06d1853bfa79ba/head.go#L322-L323
Is the intention that the caller of Add() should implement their own locking? Is there a reason that Add() shouldn't manage a mutex itself?
If this is a documentation issue, I'm happy to raise a PR adding a comment to clarify.
So #103 is currently failing as we are depending on prometheus/prometheus for tests and a change broke the compatibility.
While for now the hack works, we ideally need to remove the dependency.
/cc @fabxc @brian-brazil
The procedure for more than two lists of different set operations works similarly. So the number of k set operations merely modifies the factor (O(k*n)) instead of the exponent (O(n^k)) of our worst-case lookup runtime. A great improvement.
-- https://fabxc.org/blog/2017-04-10-writing-a-tsdb/
That's not the most efficient way of performing a k-way merge. Your implementation indeed has O(nk) time complexity, but this short drop-in replacement:
func Merge(its ...Postings) Postings {
l := len(its)
switch l {
case 0:
return nil
case 1:
return its[0]
default:
m := l / 2
return newMergePostings(Merge(its[:m]...), Merge(its[m:]...))
}
}
has O(n log k) (hope I got it right). There are many other efficient implementations of k-way merge; one of the most popular (I think) is based on a binary heap (container/heap in Go land).
I could prepare a couple of implementations and benchmark them. Alternatively, we can take a look how other search engines do that (https://github.com/blevesearch/bleve ?) and go with that.
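The heap variant mentioned above can be sketched over a minimal Postings stand-in. The interface and the slice-backed iterator here are assumptions for self-containment (the real Postings yields uint64 refs via Next/At); the merge itself is a standard O(n log k) union with duplicate elimination using container/heap.

```go
package main

import (
	"container/heap"
	"fmt"
)

// postings is a minimal stand-in for tsdb's Postings iterator.
type postings interface {
	Next() bool
	At() uint64
}

// slicePostings iterates a sorted slice, for demonstration.
type slicePostings struct {
	s []uint64
	i int
}

func (p *slicePostings) Next() bool { p.i++; return p.i <= len(p.s) }
func (p *slicePostings) At() uint64 { return p.s[p.i-1] }

// postingsHeap orders live iterators by their current value.
type postingsHeap []postings

func (h postingsHeap) Len() int            { return len(h) }
func (h postingsHeap) Less(i, j int) bool  { return h[i].At() < h[j].At() }
func (h postingsHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *postingsHeap) Push(x interface{}) { *h = append(*h, x.(postings)) }
func (h *postingsHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// mergeAll performs an O(n log k) union of k sorted postings lists,
// dropping duplicates: repeatedly emit the smallest head, then advance
// that iterator and restore the heap property.
func mergeAll(its ...postings) []uint64 {
	h := make(postingsHeap, 0, len(its))
	for _, it := range its {
		if it.Next() {
			h = append(h, it)
		}
	}
	heap.Init(&h)
	var out []uint64
	for h.Len() > 0 {
		it := h[0]
		if v := it.At(); len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		if it.Next() {
			heap.Fix(&h, 0)
		} else {
			heap.Pop(&h)
		}
	}
	return out
}

func main() {
	a := &slicePostings{s: []uint64{1, 3, 5}}
	b := &slicePostings{s: []uint64{2, 3, 6}}
	fmt.Println(mergeAll(a, b)) // [1 2 3 5 6]
}
```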
func (s *populatedChunkSeries) Next() bool {
	for s.set.Next() {
		lset, chks := s.set.At()
		for i, c := range chks {
			if c.MaxTime < s.mint {
				chks = chks[1:]
				continue
			}
The length of chks is changed above, so the code below will not work as intended. It will panic with "slice bounds out of range".
I think it should be something like:
	var out int
	for i, c := range chks {
		if c.MaxTime < s.mint {
			out++
			continue
		}
		................
	}
	chks = chks[out:]
	if len(chks) == 0 {
		continue
	}
A memory block receives series from writers in random order. Based on that it builds the inverted index, which relies on monotonically increasing IDs for each series.
For querying, we want series to be in lexicographic order of their label sets to efficiently merge them between blocks.
For that we have a position mapper, which can efficiently reorder an iteration over the inverted index and give us the needed order on querying. Whenever a new series appears, the position mapper has to be updated.
To be strict by default, this is currently checked before querying, and the query blocks until the mapper is updated. This guarantees that any series is immediately visible after successful insertion. It also causes some queries to respond slowly (sorting hundreds to thousands of label sets may take several seconds) and is severely impacted in environments where a new series appears every few seconds.
Two points here:
Does a series have to be queryable as soon as Commit() on the insertion has returned, or are we fine with a delay of a few seconds? Should queriers block, or the writer creating a new series?
This might be a bit of an abstract issue as it's a deeper internal.
This project does not vendor its dependencies, so I had to go get them manually.
But when I tried to run tests I got the following.
$ go test
# github.com/prometheus/tsdb
head_test.go:26:2: cannot find package "github.com/prometheus/prometheus/pkg/labels" in any of:
/usr/local/go/src/github.com/prometheus/prometheus/pkg/labels (from $GOROOT)
/Users/telendt/goprojects/src/github.com/prometheus/prometheus/pkg/labels (from $GOPATH)
FAIL github.com/prometheus/tsdb [setup failed]
If I switch github.com/prometheus/prometheus to dev-2.0, I get:
$ go test
# github.com/prometheus/tsdb
./compact.go:60: undefined: prometheus.Registerer
./compact.go:90: undefined: prometheus.Registerer
./db.go:133: undefined: prometheus.Registerer
./db.go:155: undefined: prometheus.Registerer
FAIL github.com/prometheus/tsdb [build failed]
What's the prometheus version you test it against?
Comparing label sets is relatively expensive. The block design requires us to do a lot of those comparisons when merging series with the same label sets from different blocks.
This mostly affects range queries and gets more significant the more blocks a query spans. The performance penalty becomes relatively stronger the more series are queried.
While blocks should generally stay independent, it should be possible to compute (and potentially persist) mappings of series IDs from different block, which could significantly speed up merges.
This will be particularly tricky for memory blocks where series constantly get added.
There's a deadlock that hits Prometheus servers under load some time within 12 hours typically. It is known to occur in servers that are never queried so is likely located in the write path.
Goroutine dumps have proven entirely useless, as all waiting goroutines are waiting to acquire a lock.
Thus, it's most likely an inconsistent locking order or a locked procedure exiting without unlocking again.
I plan to work on tests most of next week. Should I help set up Travis while at it?
From a quick observation something about doing negative regular expression matching is not working correctly.
- ls that gives a tabular view of blocks and info about them

Lots of things are possible, but not all of them are super useful beyond being possible in theory.
Some are possible through Prometheus' APIs in one way or another. But it seems fine to have multiple interfaces to the same functionality.
I've just read Fabian's blog post, which gives a high-level introduction to the new V3 storage. I'd like to get more familiar with the binary layout of the index/chunks, but I'm not sure where to look (should I dive deep into the source code?). Are there any documents around that describe it, and if so, can you link them in the README?