Emit a metric for tablets that have more files than the scan max and are not compacting

keith-turner commented on June 2, 2024

Emit a metric for tablets that have more files than the scan max and are not compacting


Comments (4)

keith-turner commented on June 2, 2024

> To provide metric(s) that are actionable you may need tablet info - and adding tablet info would drastically increase the cardinality of the metric - it would not be a good candidate for a tag.

I don't think per-tablet information is needed in the metrics system. Currently we have compaction services that bin tablets into different queues for compactions. We generate metrics on the number queued and running for each queue. If a user sees that the number of queued tablets is too high for a compaction service, it's an indication that further investigation is needed, but the metrics system does not provide the information needed to answer the question of why too many things are queued.

Thinking along those same lines, maybe each compaction service could have another metric for problems. Not sure if we need a metric for each kind of problem or just a general problem counter per compaction service. But this falls into that general idea of binning tablets into different categories and counting those. If the counts for any category seem off, the metrics system cannot help you figure out why the counts are off, but it does indicate that action is needed.

Below is an example of what I am thinking about. There are two compaction services, each with two queues for running compactions. For each of the queues there are three metrics: the number of compactions currently running, the number queued, and the number that recently ran and failed. Also, each compaction service has a count of the number of tablets where it had a problem planning compactions. This planning problem counter would cover this issue. I think these are reasonable counters. When someone sees failed compactions, they have to go look at the compactor process logs and try to figure out what failed and why. When someone sees planning failures, they need to go look at the manager logs and see what happened.

Compaction service	Metric	count
CS1	small.running	50
CS1	small.queued	500
CS1	small.failed	0
CS1	large.running	100
CS1	large.queued	10000
CS1	large.failed	10
CS1	planning_problems	20
CS2	small.running	30
CS2	small.queued	10
CS2	small.failed	0
CS2	large.running	500
CS2	large.queued	400
CS2	large.failed	0
CS2	planning_problems	0
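
The binning-and-counting idea behind the table can be sketched roughly as follows. This is only an illustration under stated assumptions, not Accumulo code: `TabletState`, `count_by_bin`, and the status strings are hypothetical names. The point it shows is that no tablet id ever becomes part of a metric key.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TabletState:
    """Hypothetical snapshot of one tablet as seen by a compaction service."""
    service: str  # compaction service name, e.g. "CS1"
    queue: str    # queue within the service, e.g. "small"; "" when not queued
    status: str   # "running", "queued", "failed", or "planning_problem"


def count_by_bin(tablets):
    """Bin tablets by (service, metric name) and count each bin.

    Keys never contain a tablet id, so the number of distinct metrics is
    bounded by services x queues x statuses, not by the number of tablets.
    """
    counts = Counter()
    for t in tablets:
        if t.status == "planning_problem":
            # Planning problems are counted per service, not per queue.
            counts[(t.service, "planning_problems")] += 1
        else:
            counts[(t.service, f"{t.queue}.{t.status}")] += 1
    return counts


tablets = [
    TabletState("CS1", "small", "running"),
    TabletState("CS1", "small", "queued"),
    TabletState("CS1", "large", "failed"),
    TabletState("CS1", "", "planning_problem"),
    TabletState("CS1", "", "planning_problem"),
    TabletState("CS2", "large", "queued"),
]

counts = count_by_bin(tablets)
for (service, metric), n in sorted(counts.items()):
    print(f"{service}\t{metric}\t{n}")
```

A nonzero `planning_problems` count says nothing about which tablets are affected; it only signals that the manager logs are worth a look.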


EdColeman commented on June 2, 2024

Would there be an advantage to providing the "raw" numbers used in the calculation, say number of files, number of files selected / planned, instead of a "flag" for no files selected?

That would provide additional insight and allow post-processing / alerting if that was desired. It would also allow users to measure and tune to achieve their goals.
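
To make the difference concrete, here is a minimal sketch assuming a planning pass could report three raw numbers. `PlanningSample` and its fields are hypothetical names, and `scan_max_files` stands in for whatever the scan max limit is. The flag can be derived from the raw numbers, but not the other way around.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanningSample:
    """Hypothetical raw numbers from one compaction planning pass."""
    total_files: int     # files in the tablet when planning ran
    selected_files: int  # files the planner chose to compact
    scan_max_files: int  # assumed per-tablet scan file limit


def no_files_selected(sample):
    """The boolean 'flag' form: over the limit, yet nothing was planned."""
    return sample.total_files > sample.scan_max_files and sample.selected_files == 0


def files_over_limit(sample):
    """Only the raw numbers can say how far over the limit a tablet is."""
    return max(0, sample.total_files - sample.scan_max_files)


samples = [
    PlanningSample(total_files=40, selected_files=0, scan_max_files=30),
    PlanningSample(total_files=35, selected_files=10, scan_max_files=30),
    PlanningSample(total_files=12, selected_files=0, scan_max_files=30),
]

flagged = [s for s in samples if no_files_selected(s)]
worst = max(files_over_limit(s) for s in samples)
```

With the raw numbers, a user could alert on their own threshold (for example, only when a tablet is well past the limit) instead of on a fixed flag.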


keith-turner commented on June 2, 2024

> Would there be an advantage to providing the "raw" numbers used in the calculation, say number of files, number of files selected / planned, instead of a "flag" for no files selected?

Yes. Trying to figure out the best way to do that. That is what #4091 was opened about. Thinking that #4089 and #4090 can alert that a problem exists, and then we will need information to help solve the problem, which could come from #4091 or something else. The problem could occur for lots of tablets, so we want to avoid dumping lots of detailed information for each tablet when the problem happens. Need to somehow balance detecting the problem, fixing the problem, and not emitting too much logging.
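
One possible way to strike that balance, sketched here purely as an assumption and not as what #4091 proposes, is to count every problem but log details for only the first few tablets in each reporting interval. `BoundedProblemReporter` and its parameters are hypothetical.

```python
import logging

log = logging.getLogger("compaction.planning")


class BoundedProblemReporter:
    """Counts every planning problem but logs detail for only a few tablets.

    Hypothetical helper: the name, fields, and default limit are illustrative.
    """

    def __init__(self, max_detailed=3):
        self.max_detailed = max_detailed
        self.problem_count = 0
        self.detailed_count = 0

    def report(self, tablet_id, total_files, selected_files):
        """Record one problem; log its raw numbers only while under the limit."""
        self.problem_count += 1
        if self.detailed_count < self.max_detailed:
            self.detailed_count += 1
            log.warning("tablet %s: %d files, %d selected for compaction",
                        tablet_id, total_files, selected_files)

    def flush(self):
        """End the reporting interval: return (total, suppressed) and reset."""
        suppressed = self.problem_count - self.detailed_count
        if suppressed:
            log.warning("%d more tablets had planning problems (details suppressed)",
                        suppressed)
        summary = (self.problem_count, suppressed)
        self.problem_count = 0
        self.detailed_count = 0
        return summary


reporter = BoundedProblemReporter(max_detailed=3)
for i in range(10):
    reporter.report(f"tablet-{i}", total_files=40, selected_files=0)
summary = reporter.flush()
```

The total still feeds the counter metric, so detection is unaffected; only the volume of per-tablet log lines is capped.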


EdColeman commented on June 2, 2024

This may be tough - and possibly something not suitable for metrics. Not sure what the right approach would be other than logging.

To provide metric(s) that are actionable you may need tablet info - and adding tablet info would drastically increase the cardinality of the metric - it would not be a good candidate for a tag.
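
A back-of-the-envelope sketch of the cardinality concern, with all numbers assumed for illustration only: without a tablet tag the number of distinct time series depends on the number of services and metrics, while with a tablet tag it grows with the number of tablets.

```python
def series_without_tablet_tag(services, metrics_per_service):
    """Distinct time series when metrics are tagged by service only."""
    return services * metrics_per_service


def series_with_tablet_tag(tablets, metrics_per_tablet):
    """Distinct time series when every tablet gets its own tag value."""
    return tablets * metrics_per_tablet


# Illustrative, assumed numbers: 2 services with 7 metrics each,
# versus 100,000 tablets each carrying a single tagged metric.
low = series_without_tablet_tag(services=2, metrics_per_service=7)
high = series_with_tablet_tag(tablets=100_000, metrics_per_tablet=1)
print(low, high)
```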
