
Comments (4)

keith-turner commented on June 2, 2024

> To provide metric(s) that are actionable you may need tablet info - and adding tablet info would drastically increase the cardinality of the metric - it would not be a good candidate for a tag.

I don't think per-tablet information is needed in the metrics system. Currently we have compaction services that bin tablets into different queues for compactions. We generate metrics on the number queued and running for each queue. If a user sees that the number of queued tablets is too high for a compaction service, it's an indication that further investigation is needed, but the metrics system does not provide the information needed to answer why too many things are queued.

Thinking along those same lines, maybe each compaction service could have another metric for problems. Not sure if we need a metric for each kind of problem or just a general problem counter per compaction service. But this falls into that general idea of binning tablets into different categories and counting those. If the counts for any category seem off, the metrics system cannot help you figure out why the counts are off, but it does indicate that action is needed.

Below is an example of what I am thinking about. There are two compaction services, each with two queues for running compactions. For each queue there are three metrics: the number of compactions currently running, the number queued, and the number that recently ran and failed. Also, each compaction service has a count of the number of tablets where it had a problem planning compactions. This planning problem counter would cover this issue. I think these are reasonable counters. When someone sees failed compactions, they have to go look at the compactor process logs and try to figure out what failed and why. When someone sees planning failures, they need to go look at the manager logs and see what happened.

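| Compaction service | Metric            | Count |
|--------------------|-------------------|-------|
| CS1                | small.running     | 50    |
| CS1                | small.queued      | 500   |
| CS1                | small.failed      | 0     |
| CS1                | large.running     | 100   |
| CS1                | large.queued      | 10000 |
| CS1                | large.failed      | 10    |
| CS1                | planning_problems | 20    |
| CS2                | small.running     | 30    |
| CS2                | small.queued      | 10    |
| CS2                | small.failed      | 0     |
| CS2                | large.running     | 500   |
| CS2                | large.queued      | 400   |
| CS2                | large.failed      | 0     |
| CS2                | planning_problems | 0     |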
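
The binning-and-counting idea above could be sketched roughly as follows. This is only an illustration in Python, not Accumulo code: the `ServiceCounters` class, its method names, and the `max_queued` threshold are hypothetical, and the numbers are copied from the example table.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceCounters:
    """Hypothetical per-compaction-service counters (not an Accumulo API)."""
    name: str
    counts: Counter = field(default_factory=Counter)

    def record(self, metric: str, value: int) -> None:
        self.counts[metric] = value

    def needs_attention(self, max_queued: int) -> List[str]:
        """Return the metrics that suggest someone should go read the logs."""
        flagged = []
        for metric, value in sorted(self.counts.items()):
            if metric.endswith(".queued") and value > max_queued:
                flagged.append(metric)
            elif metric.endswith(".failed") and value > 0:
                flagged.append(metric)
            elif metric == "planning_problems" and value > 0:
                flagged.append(metric)
        return flagged


# Values taken from the example table.
cs1 = ServiceCounters("CS1")
for metric, value in [("small.running", 50), ("small.queued", 500),
                      ("small.failed", 0), ("large.running", 100),
                      ("large.queued", 10000), ("large.failed", 10),
                      ("planning_problems", 20)]:
    cs1.record(metric, value)

cs2 = ServiceCounters("CS2")
for metric, value in [("small.running", 30), ("small.queued", 10),
                      ("small.failed", 0), ("large.running", 500),
                      ("large.queued", 400), ("large.failed", 0),
                      ("planning_problems", 0)]:
    cs2.record(metric, value)

# The limit of 1000 queued compactions is an assumed alerting threshold.
print(cs1.name, cs1.needs_attention(max_queued=1000))
print(cs2.name, cs2.needs_attention(max_queued=1000))
```

With that assumed limit, only CS1 is flagged (for its large queue, its failures, and its planning problems). The counters say where to look; the compactor or manager logs still have to say why.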

from accumulo.

EdColeman commented on June 2, 2024

Would there be an advantage to providing the "raw" numbers used in the calculation, say number of files, number of files selected / planned, instead of a "flag" for no files selected?

That would provide additional insight and allow post processing / alerting if that was desired. It would also allow for measuring and tuning by the user to achieve their goals.


keith-turner commented on June 2, 2024

> Would there be an advantage to providing the "raw" numbers used in the calculation, say number of files, number of files selected / planned, instead of a "flag" for no files selected?

Yes. Trying to figure out the best way to do that. That was what #4091 was opened about. Thinking that #4089 and #4090 can alert that a problem exists, and then we will need information to help solve the problem, which could come from #4091 or something else. The problem could occur for lots of tablets, so we want to avoid dumping lots of detailed information for each tablet when the problem happens. Need to somehow balance detecting the problem, fixing the problem, and not emitting too much logging.
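
One possible way to strike that balance, sketched here purely as an assumption and not as what any of the referenced issues implement: count every affected tablet so the metric stays accurate, but keep detailed log lines for only the first few. `ProblemReporter`, `detail_limit`, and the tablet ids below are hypothetical names.

```python
class ProblemReporter:
    """Hypothetical helper: full count for metrics, bounded detail for logs."""

    def __init__(self, detail_limit):
        self.detail_limit = detail_limit
        self.problem_count = 0   # would feed a per-service problem counter
        self.details = []        # bounded list of detailed log lines

    def report(self, tablet_id, reason):
        self.problem_count += 1
        if len(self.details) < self.detail_limit:
            self.details.append(f"{tablet_id}: {reason}")

    def summary(self):
        """Detailed lines plus one line summarizing what was suppressed."""
        lines = list(self.details)
        suppressed = self.problem_count - len(self.details)
        if suppressed > 0:
            lines.append(f"... {suppressed} more tablets with planning problems")
        return lines


reporter = ProblemReporter(detail_limit=3)
for i in range(20):
    reporter.report(f"tablet-{i}", "no files selected")

for line in reporter.summary():
    print(line)
```

Here 20 tablets hit the problem, the counter reflects all 20, and the log output is capped at three detailed lines plus one summary line.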


EdColeman commented on June 2, 2024

This may be tough - and possibly something not suitable for metrics. Not sure what the right approach would be other than logging.

To provide metric(s) that are actionable you may need tablet info - and adding tablet info would drastically increase the cardinality of the metric - it would not be a good candidate for a tag.
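
A rough back-of-the-envelope calculation shows why a tablet tag is a poor fit. The shape used here (two services, two queues each, three metrics per queue) mirrors the example discussed in this thread, while the tablet count of 100,000 is a made-up assumption. The result is an upper bound that assumes every tag combination is eventually observed.

```python
def series_count(services, queues_per_service, metrics_per_queue, tablets=1):
    """Upper bound on distinct time series if every tag combination appears."""
    return services * queues_per_service * metrics_per_queue * tablets


# Tagged only by compaction service and queue.
per_queue = series_count(2, 2, 3)

# Same metrics with an additional (hypothetical) tablet tag.
per_tablet = series_count(2, 2, 3, tablets=100_000)

print(per_queue, per_tablet)
```

The per-queue series count stays fixed as the table grows, whereas the tablet-tagged count scales linearly with the number of tablets.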

