Comments (13)
Yes, the docs look like this:
https://grafana.com/docs/tempo/latest/configuration/s3/#lifecycle-policy
I think we're looking for a little more information here about how to actually configure this policy.
from tempo.
I think it makes more sense in `doRetention()` than `doCompaction()`, so it can be part of the block cleanup process. This one is tricky b/c it would be difficult to tell a "stranded" block from one that is currently being compacted. In fact, they would basically look the same.
Some more notes: when an S3 multipart upload fails to complete, the partially written object is not present in the normal object listings. However, there is a separate API to list outstanding multipart uploads, and to abort them. In the current Tempo backend implementation, the largest object (data) is written first. If this upload fails, then because no sibling files have been written yet (i.e. index, bloom filters), there are no traces of this block at all, except for the outstanding multipart uploads on that separate API. This means the proposed idea of detecting partially written blocks, determining their start time from object or folder timestamps, and then deleting the objects will not be that straightforward for S3.
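The separate APIs referred to here are S3's `ListMultipartUploads` and `AbortMultipartUpload`. As a minimal sketch of the age-based cleanup being discussed (the helper name, data shape, and 24-hour cutoff are illustrative assumptions, not Tempo code), a janitor could filter an already-fetched upload listing by initiation time:

```python
from datetime import datetime, timedelta, timezone

def stale_uploads(uploads, max_age=timedelta(hours=24), now=None):
    """Return (key, upload_id) pairs for multipart uploads older than max_age.

    `uploads` mimics the entries returned by S3's ListMultipartUploads API:
    {"Key": ..., "UploadId": ..., "Initiated": datetime}. Each returned pair
    could then be passed to AbortMultipartUpload.
    """
    now = now or datetime.now(timezone.utc)
    return [(u["Key"], u["UploadId"])
            for u in uploads
            if now - u["Initiated"] > max_age]

# Example: one upload started two days ago, one an hour ago.
now = datetime(2024, 1, 3, tzinfo=timezone.utc)
uploads = [
    {"Key": "single-tenant/blockA/data", "UploadId": "u1",
     "Initiated": now - timedelta(days=2)},
    {"Key": "single-tenant/blockB/data", "UploadId": "u2",
     "Initiated": now - timedelta(hours=1)},
]
print(stale_uploads(uploads, now=now))  # [('single-tenant/blockA/data', 'u1')]
```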
I have one question @joe-elliott :)
So, should this run on the `compactionLoop()`? Each time we `doCompaction()`, we clean up the blocks? Or is it more like the Janitor that we have on diskcache?
Okay!
So, from reading the `doRetention()` code: we iterate through the in-memory blocklist and compact anything that is past retention, and then we iterate over the compacted blocklist, cleaning up all the blocks.
But if we want to clean up "stranded" blocks in this phase, it's still tricky b/c the "stranded" blocks don't have an EndTime (they have neither a `meta.json` nor a `meta.compacted.json`).
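The two passes described above can be sketched roughly like this (a hypothetical illustration of the flow, not Tempo's actual `doRetention()`; the retention value and callback names are assumptions):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)  # illustrative retention window

def do_retention(blocklist, compacted_blocklist, now, mark_compacted, delete_block):
    """Two passes, mirroring the flow described above:
    1) mark live blocks past retention as compacted,
    2) clean up blocks already on the compacted list."""
    for block in blocklist:
        if now - block["end_time"] > RETENTION:
            mark_compacted(block)
    for block in compacted_blocklist:
        delete_block(block)

marked, deleted = [], []
now = datetime(2024, 1, 20, tzinfo=timezone.utc)
live = [{"id": "old", "end_time": now - timedelta(days=30)},
        {"id": "new", "end_time": now - timedelta(days=1)}]
do_retention(live, [{"id": "done"}], now,
             lambda b: marked.append(b["id"]),
             lambda b: deleted.append(b["id"]))
print(marked, deleted)  # ['old'] ['done']
```

Note that a stranded block never enters either pass: with no `meta.json` it has no `end_time` to compare against, which is exactly the problem raised here.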
Maybe we can collect the possible "stranded" blocks on each `doRetention()` run (well, with some frequency), and if the same blocks are still detected on the next loop (or... X time later), clean them up? Does that make some sense? 😄
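The collect-then-confirm idea could be sketched like this (a hypothetical helper, not Tempo code; "meta-less" here means a block with neither `meta.json` nor `meta.compacted.json`):

```python
class StrandedBlockJanitor:
    """Track meta-less blocks across runs; flag ones seen on two
    consecutive passes as candidates for deletion."""

    def __init__(self):
        self.candidates = set()

    def run(self, metaless_blocks):
        """Called once per doRetention() pass. Returns block IDs to delete."""
        still_stranded = self.candidates & set(metaless_blocks)
        # Anything currently meta-less becomes a candidate for the next pass.
        self.candidates = set(metaless_blocks)
        return sorted(still_stranded)

janitor = StrandedBlockJanitor()
print(janitor.run(["a", "b"]))  # [] -- first sighting, just record
print(janitor.run(["b", "c"]))  # ['b'] -- still meta-less on the next pass
```

The weakness, as the reply below points out, is that a block in the middle of a long compaction also looks meta-less on consecutive passes, so the interval between passes would have to exceed the longest plausible compaction.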
Yup. The problem is that compaction can take over an hour sometimes and is variable depending on how large you want to make your blocks. So it's difficult to wait a certain amount of time and then remove the block.
As I think about it, this is stickier than I had thought. Some alternatives for fixing this:
- Add a configurable timeout, defaulted to multiple hours. If a block has existed for multiple hours without a `meta.json` or `meta.compacted.json`, delete it.
- Just include the stranded blocks in the normal retention. However, if a compaction is occurring right at the edge of total retention, this might cause some weird errors.
- Instead of adding the above to the compactor, add it to the tempo-cli tool as something that can be run as a manual process.
- Do nothing and simply advise operators of this issue through documentation. Suggest the use of S3 or GCS lifecycle rules to clean up the old blocks. This is what Grafana is currently doing.
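The first alternative, the configurable timeout, reduces to a simple predicate. A minimal sketch (function name, file-set representation, and the 24-hour default are illustrative assumptions, not Tempo's implementation):

```python
from datetime import datetime, timedelta, timezone

def is_stranded(block_files, created_at, now, timeout=timedelta(hours=24)):
    """A block with no meta.json or meta.compacted.json that is older
    than `timeout` is considered stranded and safe to delete."""
    has_meta = "meta.json" in block_files or "meta.compacted.json" in block_files
    return not has_meta and (now - created_at) > timeout

now = datetime(2024, 1, 2, 12, tzinfo=timezone.utc)
old_metaless = is_stranded({"data", "index"}, now - timedelta(hours=30), now)
fresh_metaless = is_stranded({"data"}, now - timedelta(hours=2), now)
complete = is_stranded({"data", "index", "meta.json"}, now - timedelta(days=5), now)
print(old_metaless, fresh_metaless, complete)  # True False False
```

The catch discussed earlier still applies to `created_at`: on S3, a block whose first multipart upload never completed has no listable objects at all, so its age may only be visible via the multipart-upload APIs.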
For context, I made this issue when we were seeing regular compactor OOMs. In that scenario we were seeing hundreds of stranded blocks, and I wanted a way to clean them up. Now that things have stabilized, I think this issue is way less pressing.
I was thinking along the lines of option (1) as well. In order to tell how long a block has existed with no meta, maybe they could be tagged with a third kind of meta (`meta.inprogress.json`). It's hard to say how long compaction will ever take, so maybe something generous like 24 hours in progress == stranded.
Option (4), use lifecycle rules, is attractive and handles most use cases. Do we foresee common use of other backends like local file storage, where there is no lifecycle rule to rely on?
Maybe Option (4) can be paired with Option (3). I mean, using lifecycle rules for most use cases and giving the option to manually clean them up with tempo-cli. Probably some users would automate the process.
Option (1) looks great! Maybe, instead of creating another meta, we could retrieve the block creation time? Sounds like something handy to have.
Recently it occurred to me that this also happens on any rollout. When a new version of Tempo is rolled out, the compactors will stop mid-compaction and immediately exit. It's possible we should make compactors clean up on shutdown, but it seems like simply adding this "janitor" functionality would be an easier route.
Consensus is that since there is no straightforward way to fix this, it is better to rely on bucket lifecycle cleanup and do nothing else right now. To complete this ticket, let's document how best to clean up with lifecycle policies.
Question: do we know how Azure Blob storage is handled? We feel confident about S3 and GCS cleanup.
I just ran into the same issue. The documentation says you should set up a lifecycle policy, but I'm not sure what the filter for the policy should look like to avoid deleting actual traces after one day. The docs say:
> A lifecycle policy is recommended that deletes incomplete multipart uploads after one day.
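No object filter is needed for this: S3's `AbortIncompleteMultipartUpload` lifecycle action applies only to multipart uploads that were never completed, so it cannot delete finished trace blocks of any age. A minimal sketch of such a configuration (rule ID is a placeholder; adjust to your setup):

```json
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
```

This can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://lifecycle.json`. Rules using the `Expiration` action are the ones that would delete real objects after a number of days; the abort action above does not.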
@joe-elliott Will this still make sense after Parquet?
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.