Comments (4)
I don't think per-tablet information is needed in the metrics system. Currently we have compaction services that bin tablets into different queues for compactions. We generate metrics on the number queued and running for each queue. If a user sees that the number of queued tablets is too high for a compaction service, it's an indication that further investigation is needed, but the metrics system does not provide the information needed to answer the question of why too many things are queued.
Thinking along those same lines, maybe each compaction service could have another metric for problems. Not sure if we need a metric for each kind of problem or just a general problem counter per compaction service. But this falls into that general idea of binning tablets into different categories and counting those. If the counts for any category seem off, the metrics system cannot help you figure out why the counts are off, but it does indicate that action is needed.
Below is an example of what I am thinking about. There are two compaction services, each with two queues for running compactions. For each of the queues there are three metrics: the number of compactions currently running, the number queued, and the number that recently ran and failed. Also, each compaction service has a count of the number of tablets where it had a problem planning compactions. This planning problem counter would cover this issue. I think these are reasonable counters. When someone sees failed compactions, they have to go look at the compactor process logs and try to figure out what failed and why. When someone sees planning failures, they need to go look at the manager logs and see what happened.
Compaction service | Metric | count |
---|---|---|
CS1 | small.running | 50 |
CS1 | small.queued | 500 |
CS1 | small.failed | 0 |
CS1 | large.running | 100 |
CS1 | large.queued | 10000 |
CS1 | large.failed | 10 |
CS1 | planning_problems | 20 |
CS2 | small.running | 30 |
CS2 | small.queued | 10 |
CS2 | small.failed | 0 |
CS2 | large.running | 500 |
CS2 | large.queued | 400 |
CS2 | large.failed | 0 |
CS2 | planning_problems | 0 |
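
The binning idea in the table above could be sketched roughly as follows. This is only an illustration, not Accumulo code: the function names `bin_tablets` and `needs_attention`, the event tuple layout, and the queued threshold of 1000 are all assumptions made for the example. The point it shows is that the tablet id is dropped before counting, so the metric keys stay limited to compaction service and metric name.

```python
from collections import Counter


def bin_tablets(events):
    """Count tablets per (service, metric) bin.

    Each event is a hypothetical tuple (service, queue, state, tablet_id).
    queue is None for service-level states such as planning_problems.
    The tablet id is deliberately not part of the key.
    """
    counts = Counter()
    for service, queue, state, _tablet_id in events:
        name = state if queue is None else f"{queue}.{state}"
        counts[(service, name)] += 1
    return counts


def needs_attention(counts, queued_threshold=1000):
    """Return the bins that suggest someone should go read the logs.

    queued_threshold is an assumed value; a real deployment would tune it.
    """
    flagged = []
    for (service, name), n in counts.items():
        is_problem = name.endswith(".failed") or name == "planning_problems"
        too_many_queued = name.endswith(".queued") and n > queued_threshold
        if (is_problem and n > 0) or too_many_queued:
            flagged.append((service, name))
    return sorted(flagged)


# The counts from the table above, keyed by (service, metric).
table = {
    ("CS1", "small.running"): 50, ("CS1", "small.queued"): 500,
    ("CS1", "small.failed"): 0, ("CS1", "large.running"): 100,
    ("CS1", "large.queued"): 10000, ("CS1", "large.failed"): 10,
    ("CS1", "planning_problems"): 20,
    ("CS2", "small.running"): 30, ("CS2", "small.queued"): 10,
    ("CS2", "small.failed"): 0, ("CS2", "large.running"): 500,
    ("CS2", "large.queued"): 400, ("CS2", "large.failed"): 0,
    ("CS2", "planning_problems"): 0,
}

if __name__ == "__main__":
    print(needs_attention(table))
```

With these assumed thresholds only CS1 is flagged, which matches the reading of the table: the counters say where to look, and the compactor or manager logs say why.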
from accumulo.
Would there be an advantage to providing the "raw" numbers used in the calculation, say the number of files and the number of files selected / planned, instead of a "flag" for no files selected?
That would provide additional insight and allow post-processing / alerting if that was desired. It would also let users measure and tune to achieve their goals.
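One way the raw numbers could be exposed without a per-tablet metric is to aggregate them per compaction service. The sketch below is hypothetical: `PlanningSample`, `aggregate`, and the field names are invented for illustration and do not correspond to existing Accumulo classes.

```python
from dataclasses import dataclass


@dataclass
class PlanningSample:
    """One planning attempt for one tablet (hypothetical shape)."""
    service: str   # compaction service that planned for the tablet
    files: int     # files the tablet had when planning ran
    selected: int  # files the planner selected for compaction


def aggregate(samples):
    """Roll per-tablet samples up into per-service totals."""
    totals = {}
    for s in samples:
        agg = totals.setdefault(
            s.service,
            {"tablets": 0, "files": 0, "selected": 0, "none_selected": 0},
        )
        agg["tablets"] += 1
        agg["files"] += s.files
        agg["selected"] += s.selected
        # The "flag" case: the tablet has files but nothing was selected.
        if s.files > 0 and s.selected == 0:
            agg["none_selected"] += 1
    return totals
```

A user could then alert on `none_selected` or on the ratio of `selected` to `files` per service, while the number of emitted values stays independent of the number of tablets.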
from accumulo.
> Would there be an advantage to providing the "raw" numbers used in the calculation, say the number of files and the number of files selected / planned, instead of a "flag" for no files selected?
Yes. I am trying to figure out the best way to do that, which is what #4091 was opened for. The thinking is that #4089 and #4090 can alert that a problem exists, and then information will be needed to help solve the problem, which could come from #4091 or something else. The problem could occur for lots of tablets, so we want to avoid dumping lots of detailed information for each tablet when it happens. We need to somehow balance detecting the problem, fixing the problem, and not emitting too much logging.
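One possible way to strike that balance, shown purely as a sketch, is to log details for only the first few affected tablets in each reporting interval and summarize the rest with a count. The class name `BoundedProblemLog` and the limit of three detailed entries are assumptions for the example, not anything proposed in #4091.

```python
class BoundedProblemLog:
    """Keep detailed messages for a few tablets and count the remainder."""

    def __init__(self, max_detailed=3):
        self.max_detailed = max_detailed  # assumed limit per interval
        self.detailed = []
        self.suppressed = 0

    def report(self, tablet, detail):
        """Record a problem; only the first max_detailed keep their detail."""
        if len(self.detailed) < self.max_detailed:
            self.detailed.append(f"{tablet}: {detail}")
        else:
            self.suppressed += 1

    def flush(self):
        """Return the lines to log for this interval and reset the state."""
        lines = list(self.detailed)
        if self.suppressed:
            lines.append(
                f"... and {self.suppressed} more tablets with the same problem"
            )
        self.detailed = []
        self.suppressed = 0
        return lines
```

The detailed lines give someone enough to start debugging, and the trailing count keeps the scale of the problem visible without one log entry per tablet.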
from accumulo.
This may be tough - and possibly something not suitable for metrics. Not sure what the right approach would be other than logging.
To provide metric(s) that are actionable you may need tablet info - and adding tablet info would drastically increase the cardinality of the metric - it would not be a good candidate for a tag.
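To make the cardinality concern concrete, here is a back-of-the-envelope calculation. It treats the number of distinct time series as bounded by the product of the number of values each tag can take. The figure of 100,000 tablets is an assumption chosen only for illustration; the two services and seven metrics come from the example table earlier in the thread.

```python
import math


def max_series(*tag_cardinalities):
    """Upper bound on distinct time series: product of tag value counts."""
    return math.prod(tag_cardinalities)


services = 2               # CS1 and CS2 in the example table
metrics_per_service = 7    # 3 per queue x 2 queues + planning_problems
assumed_tablets = 100_000  # illustrative assumption, not a measured value

without_tablet_tag = max_series(services, metrics_per_service)
with_tablet_tag = max_series(services, metrics_per_service, assumed_tablets)

if __name__ == "__main__":
    print(without_tablet_tag, with_tablet_tag)
```

Under these assumptions the bound grows from 14 series to 1,400,000 once a tablet tag is added, which is why tablet-level detail probably belongs in logs and not in metric tags.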
from accumulo.