Comments (13)

joe-elliott commented on July 16, 2024

yes, docs look like this:

https://grafana.com/docs/tempo/latest/configuration/s3/#lifecycle-policy

I think we're looking for a little more information here about how to actually configure this policy.
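
For reference, a minimal sketch of the kind of rule those docs describe: a bucket-wide lifecycle configuration that aborts incomplete multipart uploads after one day. This is my reading of the docs, not an official example:

```json
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
```

It could be applied with something like `aws s3api put-bucket-lifecycle-configuration --bucket <your-tempo-bucket> --lifecycle-configuration file://lifecycle.json`.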

joe-elliott commented on July 16, 2024

I think it makes more sense in doRetention() than doCompaction() so it can be part of the block cleanup process. This one is tricky b/c it would be difficult to tell a "stranded" block from one that is currently being compacted. In fact, they would basically look the same.

mdisibio commented on July 16, 2024

Some more notes: when an S3 multipart upload fails to complete, the partially written object does not appear in normal object listings. However, there is a separate API to list outstanding multipart uploads and to abort them. In the current Tempo backend implementation, the largest data object is written first. If that upload fails, no sibling files (i.e. index, bloom filters) have been written yet, so there is no trace of the block at all except for the outstanding multipart uploads on that separate API. This means the proposed idea of detecting partially written blocks, determining their start time from object or folder timestamps, and then deleting the objects will not be that straightforward for S3.
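
A minimal sketch of cleanup via that separate API, using aws-sdk-go v1 directly (illustrative only, not Tempo code; the bucket name is a placeholder and pagination of the listing is omitted):

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// abortStaleUploads lists outstanding multipart uploads and aborts any
// initiated before the cutoff. A real implementation would also follow
// the pagination markers in the list response.
func abortStaleUploads(svc *s3.S3, bucket string, olderThan time.Duration) error {
	out, err := svc.ListMultipartUploads(&s3.ListMultipartUploadsInput{
		Bucket: aws.String(bucket),
	})
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-olderThan)
	for _, u := range out.Uploads {
		if u.Initiated != nil && u.Initiated.Before(cutoff) {
			if _, err := svc.AbortMultipartUpload(&s3.AbortMultipartUploadInput{
				Bucket:   aws.String(bucket),
				Key:      u.Key,
				UploadId: u.UploadId,
			}); err != nil {
				return err
			}
			fmt.Println("aborted stale multipart upload:", *u.Key)
		}
	}
	return nil
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	// "my-tempo-bucket" is a placeholder bucket name.
	if err := abortStaleUploads(svc, "my-tempo-bucket", 24*time.Hour); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}
```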

dgzlopes commented on July 16, 2024

I have one question @joe-elliott :)

So, should this run in the compactionLoop()? Each time we doCompaction(), do we clean up the blocks?

Or is it more like the Janitor that we have in diskcache?

dgzlopes commented on July 16, 2024

Okay!

So, from reading the doRetention() code: we iterate through the in-memory blocklist and compact anything that is past retention, and then we iterate over the compacted blocklist, cleaning up all the blocks.

But if we want to clean up "stranded" blocks in this phase, it's still tricky b/c the "stranded" blocks don't have an EndTime (they have neither a meta.json nor a meta.compacted.json).

Maybe we can collect the possible "stranded" blocks on each doRetention() (well, at some frequency), and if the same blocks are still detected on the next loop (or... X time later), clean them up? Does that make sense? 😄
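
Something along those lines could be a small bookkeeping structure in the retention loop. A rough sketch with made-up names, not actual Tempo types:

```go
package retention

import "time"

// janitor remembers blocks seen without a meta.json or
// meta.compacted.json and only deletes a block if it is still
// meta-less after a grace period. All names are illustrative.
type janitor struct {
	suspects map[string]time.Time // block ID -> first sighting without meta
	grace    time.Duration        // how long a block may stay meta-less
}

// sweep runs once per retention pass with the IDs of blocks that
// currently have no meta. deleteBlock stands in for the backend call.
func (j *janitor) sweep(metaless []string, now time.Time, deleteBlock func(id string)) {
	seen := map[string]bool{}
	for _, id := range metaless {
		seen[id] = true
		first, ok := j.suspects[id]
		if !ok {
			j.suspects[id] = now // first sighting: record it, don't delete yet
			continue
		}
		if now.Sub(first) > j.grace {
			deleteBlock(id) // still meta-less after the grace period
			delete(j.suspects, id)
		}
	}
	// forget suspects that gained a meta since the last pass
	for id := range j.suspects {
		if !seen[id] {
			delete(j.suspects, id)
		}
	}
}
```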

joe-elliott commented on July 16, 2024

Yup. The problem is that compaction can sometimes take over an hour and is variable depending on how large you want to make your blocks. So it's difficult to wait a certain amount of time and then remove the block.

As I think about it, this is stickier than I had thought. Some alternatives for fixing this:

  1. Add a configurable timeout defaulted to multiple hours. If a block has existed for multiple hours w/o a meta.json or meta.compacted.json, delete it (see the sketch after this list).
  2. Just include the stranded blocks in the normal retention. However, if a compaction is occurring right at the edge of total retention, this might cause some weird errors.
  3. Instead of adding the above to the compactor, add it to the tempo-cli tool as something that can be run as a manual process.
  4. Do nothing and simply advise operators of this issue through documentation. Suggest the use of S3 or GCS lifecycle rules to clean up the old blocks. This is what Grafana is currently doing.
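
The core of option 1 could be as small as the following sketch (hypothetical helper, not Tempo's actual backend interface; the timeout would come from config):

```go
package retention

import "time"

// isStranded reports whether a block should be treated as stranded under
// option 1: it has neither meta.json nor meta.compacted.json and its
// oldest object is older than the configured timeout. Names are illustrative.
func isStranded(hasMeta, hasCompactedMeta bool, oldestObject time.Time, timeout time.Duration) bool {
	if hasMeta || hasCompactedMeta {
		return false // a live or already-compacted block, leave it alone
	}
	return time.Since(oldestObject) > timeout
}
```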

For context, I made this issue when we were seeing regular compactor OOMs. In that scenario we were seeing 100s of stranded blocks and I wanted a way to clean them up. Now that things have stabilized, I think this issue is way less pressing.

mdisibio commented on July 16, 2024

I was thinking along option (1) as well. In order to tell how long a block has existed with no meta, maybe blocks could be tagged with a third kind of meta (meta.inprogress.json); a sketch follows below. It's hard to say how long compaction will ever take, so maybe something generous like 24 hours in progress == stranded.
Option (4), using lifecycle rules, is attractive and handles most use cases. Do we foresee common use of other backends, like local file storage, where there is no lifecycle rule to rely on?
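
A rough sketch of what that third meta could carry, with made-up field names (assumes the marker is uploaded before any data objects):

```go
package inprogress

import (
	"encoding/json"
	"time"
)

// Meta is a hypothetical meta.inprogress.json written when compaction
// starts, so a janitor can tell how long a meta-less block has existed.
type Meta struct {
	BlockID   string    `json:"blockID"`
	StartedAt time.Time `json:"startedAt"`
}

// Marker serializes the marker a compactor could upload before writing
// any data objects for the new block.
func Marker(blockID string) ([]byte, error) {
	return json.Marshal(Meta{BlockID: blockID, StartedAt: time.Now().UTC()})
}

// Stranded reports whether the block has been "in progress" longer than
// a generous grace period (e.g. 24 hours) and can be cleaned up.
func Stranded(m Meta, grace time.Duration) bool {
	return time.Since(m.StartedAt) > grace
}
```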

dgzlopes commented on July 16, 2024

Maybe Option (4) can be paired with Option (3). I mean, using lifecycle rules for most use cases, and giving users the option to manually clean blocks up with tempo-cli. Probably some users would automate the process.

Option (1) looks great! Maybe, instead of creating another meta, we can retrieve the block creation time? Sounds like something handy to have.

joe-elliott commented on July 16, 2024

Recently it occurred to me that this also happens on any rollout. When a new version of Tempo is rolled out, the compactors stop mid-compaction and immediately exit. It's possible we should make compactors clean up on shutdown, but it seems like simply adding this "janitor" functionality would be an easier route.

mdisibio commented on July 16, 2024

Consensus is that since there is no straightforward way to fix this, it is better to rely on bucket lifecycle cleanup and do nothing else right now. To complete this ticket, let's document how best to clean up with lifecycle policies.

Question: do we know how Azure Blob Storage is handled? We feel confident in the S3 and GCS cleanup.

NiklasWagner commented on July 16, 2024

I just ran into the same issue. The documentation says you should set up a lifecycle policy, but I'm not sure what the filter for the policy should look like to avoid deleting actual traces after one day.

A lifecycle policy is recommended that deletes incomplete multipart uploads after one day.

https://github.com/grafana/tempo/blob/b3159d65c182fd187761afd9abbacb0c55bddb9e/docs/tempo/website/configuration/s3.md#lifecycle-policy
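
Worth noting (my reading of the S3 semantics, not something the Tempo docs state): the AbortIncompleteMultipartUpload lifecycle action only applies to multipart uploads that were never completed. It never touches completed objects, so a one-day rule cannot delete actual trace blocks, and no key filter is needed beyond scoping the rule to the Tempo bucket. The JSON sketch earlier in this thread should be safe as-is.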

knylander-grafana commented on July 16, 2024

@joe-elliott Will this still make sense after Parquet?

github-actions commented on July 16, 2024

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.
