teslamotors / kafka-helmsman
kafka-helmsman is a repository of tools that focus on automating a Kafka deployment
License: MIT License
I've started relying on the freshness tracker for kafka consumer health alerting. Recently some of the freshness tracker metrics seem to be unreliable. I have a topic with 900 partitions. Checking offset lag via the kafka API, I see per partition offset lags oscillating between 0 and 1k. In the attached graphs, I'm singling out a single partition. The first graph shows the freshness-derived lag, the second shows burrow's reported offset lag. I can't figure out why the freshness lag is so far off and oscillates between zero and many hours.
Have you encountered something like this before?
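For context on why a time-based lag can swing between zero and many hours while the offset lag only oscillates between 0 and 1k: freshness measures the *age* of the oldest unconsumed message, so even a small offset lag translates into a huge time lag if that message is old. The sketch below is a simplified model of that relationship, not the tracker's actual algorithm:

```python
# Simplified model of time-based freshness lag (NOT the tracker's exact
# implementation): zero when fully caught up, otherwise the age of the first
# message past the committed offset.

def freshness_lag_seconds(now, committed_offset, latest_offset, first_uncommitted_ts):
    """first_uncommitted_ts: produce timestamp (epoch seconds) of the first
    message the consumer has not yet committed."""
    if committed_offset >= latest_offset:
        return 0.0  # caught up: no uncommitted messages, lag is zero
    return max(0.0, now - first_uncommitted_ts)

# An offset lag of just 5 messages can mean hours of time lag if those
# messages were produced long ago.
```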
The pre-stop command is very helpful for health checks like lag on some topics, but there is no way to run a 'pre-start' command, which could be used for upgrades.
With the current Kafka version (2.4.1), quota enforcement was implemented through a ZooKeeper admin client, since the KafkaAdminClient only supports quota configuration with Kafka >= 2.6, on both the client and server side.
With the introduction of quota enforcement functionality in this project, we had to add in the Kafka server library (which contains the ZK admin client code), which in turn complicated the dependency environment with various Scala libraries and Scala bazel rules.
When we are ready to upgrade Kafka to >= 2.6, it would make sense to remove these Scala dependencies and go back to a lightweight dependencies.yaml with just Java libraries. This entails
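For illustration only (the exact schema and artifact versions are assumptions, not the project's actual file), a Java-only dependency list after moving quota enforcement onto the KafkaAdminClient might shrink to something like:

```yaml
# Hypothetical Java-only dependencies.yaml after upgrading to Kafka >= 2.6:
# only the client library is needed once the ZK admin client (and its Scala
# transitive dependencies) is dropped. Artifact coordinates are illustrative.
dependencies:
  - artifact: org.apache.kafka:kafka-clients:2.6.0
```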
Hi, with a clone of the repo on the master branch:
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel build //...:all
INFO: Analyzed 79 targets (0 packages loaded, 0 targets configured).
INFO: Found 79 targets...
INFO: Elapsed time: 0.070s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel version
Bazelisk version: v1.7.5
Build label: 3.4.1
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jul 14 06:27:53 2020 (1594708073)
Build timestamp: 1594708073
Build timestamp as int: 1594708073
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ javac -version
javac 1.8.0_282
The build command does not produce anything, no jar.
Thank you.
Topic enforcer cannot alter the replication factor once a topic has been created (Kafka doesn't allow it). Currently, an enforcer run finishes silently with no indication of replication factor drift even if it detects one. A better UX would be to log the drift and inform the user that it's not enforceable.
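A sketch of the suggested behavior (the function and its inputs are hypothetical, not the enforcer's actual API): compare desired against actual replication factors and surface a warning instead of staying silent.

```python
# Hypothetical drift check: dicts map topic name -> replication factor.
# In the real enforcer this would come from the desired config and the
# cluster's topic descriptions.

def replication_factor_drift(desired, actual):
    """Return human-readable warnings for topics whose replication factor drifted."""
    warnings = []
    for topic, want_rf in desired.items():
        have_rf = actual.get(topic)
        if have_rf is not None and have_rf != want_rf:
            warnings.append(
                f"topic '{topic}': replication factor is {have_rf}, config wants "
                f"{want_rf}; not enforceable after topic creation"
            )
    return warnings
```

Logging each returned warning at WARN level on every run would make the drift visible without changing enforcement semantics.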
Burrow already exposes an endpoint to get cluster information (https://github.com/linkedin/Burrow/wiki/http-request-kafka-cluster-detail). We should be able to use that to get the default bootstrap.servers for the cluster, rather than having to provide it as a config property. This only marginally helps us get around the rest of the config for things like SSL, but at least it would help keep things consistent.
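A minimal sketch of deriving bootstrap.servers from that endpoint's response, assuming the response shape shown in the Burrow wiki (a `module` object containing a `servers` list); the sample payload below is illustrative, not captured from a real cluster:

```python
# GET /v3/kafka/<cluster> returns cluster module detail; join its broker list
# into a bootstrap.servers string.

def bootstrap_servers_from_burrow(cluster_detail):
    """cluster_detail: parsed JSON body of Burrow's cluster-detail response."""
    if cluster_detail.get("error"):
        raise ValueError(cluster_detail.get("message", "Burrow returned an error"))
    return ",".join(cluster_detail["module"]["servers"])

# Illustrative payload following the documented shape.
sample = {
    "error": False,
    "message": "cluster module detail returned",
    "module": {"class-name": "kafka", "servers": ["kafka01:10251", "kafka02:10251"]},
}
print(bootstrap_servers_from_burrow(sample))  # kafka01:10251,kafka02:10251
```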
Here is a very lazy Docker example of ConsumerFreshness_deploy:
FROM openjdk:11-jre-slim
ADD ConsumerFreshness_deploy.jar ConsumerFreshness_deploy.jar
ADD conf.yaml conf.yaml
CMD java -jar ConsumerFreshness_deploy.jar --conf conf.yaml
version: "3"
services:
  burrow:
    build:
      context: ./burrow/
      dockerfile: Dockerfile
    volumes:
      - ./burrow/burrow.toml:/etc/burrow/burrow.toml
    ports:
      - 8000:8000
    depends_on:
      - zookeeper
      - kafka
  time_lag:
    build:
      context: ./tesla/
      dockerfile: Dockerfile
    volumes:
      - ./tesla/conf.yaml:/conf.yaml
      - ./tesla/ConsumerFreshness_deploy.jar:/ConsumerFreshness_deploy.jar
    ports:
      - 8099:8081
    depends_on:
      - burrow
      - kafka
...
Any suggestion is super welcome.
For example, when a consumer is not available from Burrow, we dump a large error message in the logs:
2022-06-29 09:03:52 ERROR [main] c.t.d.c.f.ConsumerFreshness:312 - Failed to read Burrow status for consumer example.missing.consumer. Skipping
java.io.IOException: Response was not successful: Response{protocol=http/1.1, code=404, message=Not Found, url=http://my.burrow/v3/kafka/my-cluster/consumer/example.missing.consumer/lag}
at com.tesla.data.consumer.freshness.Burrow.request(Burrow.java:95)
at com.tesla.data.consumer.freshness.Burrow.getConsumerGroupStatus(Burrow.java:111)
at com.tesla.data.consumer.freshness.Burrow$ClusterClient.getConsumerGroupStatus(Burrow.java:144)
at com.tesla.data.consumer.freshness.ConsumerFreshness.measureConsumer(ConsumerFreshness.java:307)
at com.tesla.data.consumer.freshness.ConsumerFreshness.measureCluster(ConsumerFreshness.java:271)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:440)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
But these consumers can legitimately be missing lag information if Burrow has includes/exclusions configured, so these error messages just clog the logs.
Conversely, it's hard to diagnose a bug for a consumer if you don't know what freshness tracker is seeing. For example, a consumer shows increasing lag, but Burrow and Kafka both say that it is up-to-date on the latest commit (this occurred recently). If this persists past a freshness-tracker restart, something is wonky in the tracker and you would want to turn on some debug logging (even if it is verbose) to see what is going on.
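The tracker logs through the com.tesla.data.consumer.freshness classes seen in the stack trace above. Assuming a standard Logback setup (an assumption; the project may use a different logging backend), both problems could be tuned with per-logger overrides:

```xml
<!-- Hypothetical logback.xml fragment: quiet the routine Burrow-lookup errors,
     or flip a level to DEBUG when diagnosing a single misbehaving consumer. -->
<configuration>
  <logger name="com.tesla.data.consumer.freshness.Burrow" level="WARN"/>
  <logger name="com.tesla.data.consumer.freshness.ConsumerFreshness" level="DEBUG"/>
</configuration>
```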
Currently, we are very generous with the failure constraints for a cluster, from ConsumerFreshness (ln 281-293):
// if all the consumer measurements succeed, then we return the cluster name
// otherwise, Future.get will throw an exception representing the failure to measure a consumer (and thus the
// failure to successfully monitor the cluster).
return Futures.whenAllSucceed(completedConsumers).call(client::getCluster, this.executor);
}
/**
* Measure the freshness for all the topic/partitions currently consumed by the given consumer group. To maintain
* the existing contract, a consumer measurement fails ({@link Future#get()} throws an exception) only if:
* - burrow group status lookup fails
* - execution is interrupted
* Failure to actually measure the consumer is swallowed into a log message & metric update; obviously, this is less
* than ideal for many cases, but it will be addressed later.
However, SSL connection issues (i.e. a misconfiguration) only show up when querying the consumers. So you can have a valid Burrow lookup for the cluster (because Burrow is configured correctly) but freshness fails for each consumer because the tracker is misconfigured. You would never know, though, from the kafka_consumer_freshness_last_success_run_timestamp metric, since it will not register the per-consumer failures.
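Until the failure contract is tightened, one partial workaround (a sketch; the Prometheus setup is an assumption) is to alert on staleness of the success timestamp, using only the metric named above. Note this still misses the swallowed per-consumer failures described here, which is exactly the gap:

```promql
# Fires when freshness tracker has not completed a successful run in 15 minutes.
time() - kafka_consumer_freshness_last_success_run_timestamp > 900
```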