teslamotors / kafka-helmsman
kafka-helmsman is a repository of tools that focus on automating a Kafka deployment
License: MIT License
I've started relying on the freshness tracker for kafka consumer health alerting. Recently some of the freshness tracker metrics seem to be unreliable. I have a topic with 900 partitions. Checking offset lag via the kafka API, I see per partition offset lags oscillating between 0 and 1k. In the attached graphs, I'm singling out a single partition. The first graph shows the freshness-derived lag, the second shows burrow's reported offset lag. I can't figure out why the freshness lag is so far off and oscillates between zero and many hours.
Have you encountered something like this before?
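For context on why a time-based lag can swing between zero and many hours while the offset lag only oscillates between 0 and 1k: freshness measures the *age* of the oldest unconsumed message, so even a small offset lag translates into a huge time lag if that message is old. The sketch below is a simplified model of that relationship, not the tracker's actual algorithm:

```python
# Simplified model of time-based freshness lag (NOT the tracker's exact
# implementation): zero when fully caught up, otherwise the age of the first
# message past the committed offset.

def freshness_lag_seconds(now, committed_offset, latest_offset, first_uncommitted_ts):
    """first_uncommitted_ts: produce timestamp (epoch seconds) of the first
    message the consumer has not yet committed."""
    if committed_offset >= latest_offset:
        return 0.0  # caught up: no uncommitted messages, lag is zero
    return max(0.0, now - first_uncommitted_ts)

# An offset lag of just 5 messages can mean hours of time lag if those
# messages were produced long ago.
```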
The pre-stop command is very helpful for health checks like lag on some topics, but there is no way to run a 'pre-start' command, which could be used for upgrades.
With the current Kafka version (2.4.1), quota enforcement was implemented through a ZooKeeper admin client, since the KafkaAdminClient only supports quota configuration with Kafka >= 2.6, on both the client and server side.
With the introduction of quota enforcement functionality in this project, we had to add in the Kafka server library (which contains the ZK admin client code), which in turn complicated the dependency environment with various Scala libraries and Scala bazel rules.
When we are ready to upgrade Kafka to >= 2.6, it would make sense to remove these Scala dependencies and go back to a lightweight dependencies.yaml with just Java libraries. This entails
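For illustration only (the exact schema and artifact versions are assumptions, not the project's actual file), a Java-only dependency list after moving quota enforcement onto the KafkaAdminClient might shrink to something like:

```yaml
# Hypothetical Java-only dependencies.yaml after upgrading to Kafka >= 2.6:
# only the client library is needed once the ZK admin client (and its Scala
# transitive dependencies) is dropped. Artifact coordinates are illustrative.
dependencies:
  - artifact: org.apache.kafka:kafka-clients:2.6.0
```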
Hi, with a clone of the repo on the master branch:
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel build //...:all
INFO: Analyzed 79 targets (0 packages loaded, 0 targets configured).
INFO: Found 79 targets...
INFO: Elapsed time: 0.070s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel version
Bazelisk version: v1.7.5
Build label: 3.4.1
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jul 14 06:27:53 2020 (1594708073)
Build timestamp: 1594708073
Build timestamp as int: 1594708073
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ javac -version
javac 1.8.0_282
The build command does not produce anything, no jar.
Thank you.
Topic enforcer cannot alter the replication factor once a topic has been created (Kafka doesn't allow it). Currently, an enforcer run finishes silently with no indication of replication factor drift even if it detects one. A better UX would be to log the drift and inform the user that it's not enforceable.
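A sketch of the suggested behavior (the function and its inputs are hypothetical, not the enforcer's actual API): compare desired against actual replication factors and surface a warning instead of staying silent.

```python
# Hypothetical drift check: dicts map topic name -> replication factor.
# In the real enforcer this would come from the desired config and the
# cluster's topic descriptions.

def replication_factor_drift(desired, actual):
    """Return human-readable warnings for topics whose replication factor drifted."""
    warnings = []
    for topic, want_rf in desired.items():
        have_rf = actual.get(topic)
        if have_rf is not None and have_rf != want_rf:
            warnings.append(
                f"topic '{topic}': replication factor is {have_rf}, config wants "
                f"{want_rf}; not enforceable after topic creation"
            )
    return warnings
```

Logging each returned warning at WARN level on every run would make the drift visible without changing enforcement semantics.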
Burrow already exposes an endpoint to get cluster information (https://github.com/linkedin/Burrow/wiki/http-request-kafka-cluster-detail). We should be able to use that to get the default bootstrap.servers for the cluster, rather than having to provide it as a config property. This only marginally helps us get around the rest of the config for things like SSL, but at least it would help keep things consistent.
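A minimal sketch of deriving bootstrap.servers from that endpoint's response, assuming the response shape shown in the Burrow wiki (a `module` object containing a `servers` list); the sample payload below is illustrative, not captured from a real cluster:

```python
# GET /v3/kafka/<cluster> returns cluster module detail; join its broker list
# into a bootstrap.servers string.

def bootstrap_servers_from_burrow(cluster_detail):
    """cluster_detail: parsed JSON body of Burrow's cluster-detail response."""
    if cluster_detail.get("error"):
        raise ValueError(cluster_detail.get("message", "Burrow returned an error"))
    return ",".join(cluster_detail["module"]["servers"])

# Illustrative payload following the documented shape.
sample = {
    "error": False,
    "message": "cluster module detail returned",
    "module": {"class-name": "kafka", "servers": ["kafka01:10251", "kafka02:10251"]},
}
print(bootstrap_servers_from_burrow(sample))  # kafka01:10251,kafka02:10251
```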
Here is a very lazy Docker example of ConsumerFreshness_deploy:
FROM openjdk:11-jre-slim
ADD ConsumerFreshness_deploy.jar ConsumerFreshness_deploy.jar
ADD conf.yaml conf.yaml
CMD java -jar ConsumerFreshness_deploy.jar --conf conf.yaml
version: "3"
services:
  burrow:
    build:
      context: ./burrow/
      dockerfile: Dockerfile
    volumes:
      - ./burrow/burrow.toml:/etc/burrow/burrow.toml
    ports:
      - 8000:8000
    depends_on:
      - zookeeper
      - kafka
  time_lag:
    build:
      context: ./tesla/
      dockerfile: Dockerfile
    volumes:
      - ./tesla/conf.yaml:/conf.yaml
      - ./tesla/ConsumerFreshness_deploy.jar:/ConsumerFreshness_deploy.jar
    ports:
      - 8099:8081
    depends_on:
      - burrow
      - kafka
...
Any suggestion is super welcome.
For example, when a consumer is not available from Burrow, we dump a large error message in the logs:
2022-06-29 09:03:52 ERROR [main] c.t.d.c.f.ConsumerFreshness:312 - Failed to read Burrow status for consumer example.missing.consumer. Skipping
java.io.IOException: Response was not successful: Response{protocol=http/1.1, code=404, message=Not Found, url=http://my.burrow/v3/kafka/my-cluster/consumer/example.missing.consumer/lag}
at com.tesla.data.consumer.freshness.Burrow.request(Burrow.java:95)
at com.tesla.data.consumer.freshness.Burrow.getConsumerGroupStatus(Burrow.java:111)
at com.tesla.data.consumer.freshness.Burrow$ClusterClient.getConsumerGroupStatus(Burrow.java:144)
at com.tesla.data.consumer.freshness.ConsumerFreshness.measureConsumer(ConsumerFreshness.java:307)
at com.tesla.data.consumer.freshness.ConsumerFreshness.measureCluster(ConsumerFreshness.java:271)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:440)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
But these consumers can legitimately be missing lag information if Burrow has includes/exclusions configured, so these error messages just clog the logs.
Conversely, it's hard to diagnose a bug for a consumer if you don't know what freshness tracker is seeing. For example, a consumer shows increasing lag, but Burrow and Kafka both say that it is up-to-date on the latest commit (this occurred recently). If this persists past a freshness-tracker restart, something is wonky in the tracker and you would want to turn on some debug logging (even if it is verbose) to see what is going on.
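The tracker logs through the com.tesla.data.consumer.freshness classes seen in the stack trace above. Assuming a standard Logback setup (an assumption; the project may use a different logging backend), both problems could be tuned with per-logger overrides:

```xml
<!-- Hypothetical logback.xml fragment: quiet the routine Burrow-lookup errors,
     or flip a level to DEBUG when diagnosing a single misbehaving consumer. -->
<configuration>
  <logger name="com.tesla.data.consumer.freshness.Burrow" level="WARN"/>
  <logger name="com.tesla.data.consumer.freshness.ConsumerFreshness" level="DEBUG"/>
</configuration>
```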
Currently, we are very generous with the failure constraints for a cluster, from ConsumerFreshness (ln 281-293):
// if all the consumer measurements succeed, then we return the cluster name
// otherwise, Future.get will throw an exception representing the failure to measure a consumer (and thus the
// failure to successfully monitor the cluster).
return Futures.whenAllSucceed(completedConsumers).call(client::getCluster, this.executor);
}
/**
* Measure the freshness for all the topic/partitions currently consumed by the given consumer group. To maintain
* the existing contract, a consumer measurement fails ({@link Future#get()} throws an exception) only if:
* - burrow group status lookup fails
* - execution is interrupted
* Failure to actually measure the consumer is swallowed into a log message & metric update; obviously, this is less
* than ideal for many cases, but it will be addressed later.
However, SSL connection issues (i.e. a misconfiguration) only show up when querying the consumers. So you can have a valid Burrow lookup for the cluster (because Burrow is configured correctly) but freshness fails for each consumer because the tracker is misconfigured. You would never know, though, from the kafka_consumer_freshness_last_success_run_timestamp metric, since it will not register the per-consumer failures.
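Until the failure contract is tightened, one partial workaround (a sketch; the Prometheus setup is an assumption) is to alert on staleness of the success timestamp, using only the metric named above. Note this still misses the swallowed per-consumer failures described here, which is exactly the gap:

```promql
# Fires when freshness tracker has not completed a successful run in 15 minutes.
time() - kafka_consumer_freshness_last_success_run_timestamp > 900
```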