Giter VIP home page Giter VIP logo

Comments (12)

oleksiyk avatar oleksiyk commented on August 17, 2024

When do you kill the broker? Before the send call or? Is it the only broker?

from kafka.

oleksiyk avatar oleksiyk commented on August 17, 2024

What is the fulfilment value in the first case?

from kafka.

jotto avatar jotto commented on August 17, 2024

It's extremely difficult to reproduce locally (apparently).

  • I'm using the same setup from this #63 (2 brokers) where one of the kafka brokers is loaded with config/server1.properties and the other is loaded with config/server2.properties.

  • I'm using this bash script to kill -9 whichever broker is currently the leader for mytopic:0

     serverid=$(./bin/kafka-topics.sh --zookeeper localhost --describe 2> /dev/null | egrep -o "Leader: ." | cut -d' ' -f2); ps aux | grep "config.server${serverid}" | awk '{print $2}' | xargs kill -9
  • Then I wait for the other broker to become leader and watch the log output from the no-kafka client. Most of the time the producer.send fulfills with:

     [ { topic: 'mytopic', partition: 0, error: null, offset: 3374 } ]
  • Then I boot the kafka broker up that was killed, wait for it to be recognized, and restart the whole process

  • Very rarely (in my testing), the producer.send is rejected with:

     { [KafkaError: This request is for a topic or partition that does not exist on this broker.]
       name: 'KafkaError',
       code: 'UnknownTopicOrPartition',
       message: 'This request is for a topic or partition that does not exist on this broker.' }

I don't yet know how to reproduce this other than manual retries as I've indicated above. It happened in production for us (AWS) on our first manual QA test of shutting down brokers and verifying no data loss.

from kafka.

jotto avatar jotto commented on August 17, 2024

I can reproduce with more regularity by publishing a message every 5ms (and randomly kill -9 1 of 2 brokers)

It looks like findLeader is failing more than once and then getting caught in the root/highest level catch of producer#send (instead of retried)

Adding a catch with a metadata refresh and retry to the promise returned from _prepareProduceRequest seems to solve the problem:

// Producer.prototype._send
(function _try(_data, attempt) {
    attempt = attempt || 1;

    return self._prepareProduceRequest(_data).then(function (requests) {
      // code removed for clarity
    }).catch(function(err) {
      // this is just quick and dirty, it doesn't respect retry limit
      return Promise.delay(task.options.retries.delay).then(function () {
          return self.client.updateMetadata().then(function () {
              return _try(_data, ++attempt);
          });
      });
    });
}(data))
.then(function () {
    task.resolve(result);
})
.catch(function (err) {
    task.reject(err);
});

If this is truly the root cause, the BDD spec would look something like this:

describe('Producer', function () {
  describe('#send', function () {
    describe('when findLeader fails more than once', function () {
      it('#send retries up to task.options.retries times');
    });
  });
}

This is my naive estimation of what the stack trace looks like:

  • #send
    • #_send
      • #_prepareProduceRequest
        • #client.findLeader
          • throw UnknownTopicOrPartition, and RETRY
          • throw UnknownTopicOrPartition again (and don't retry)
    • promise returned by #_send is rejected

and here is the contents of topicMetadata and brokerConnections when the error is thrown:

self.topicMetadata { mytopic: 
   { '0': 
      { error: [Object],
        partitionId: 0,
        leader: -1,
        replicas: [Object],
        isr: [] } } }
self.brokerConnections { '2': 
   Connection {
     options: 
      { host: '192.168.128.13',
        port: 9093,
        ssl: [Object],
        connectionTimeout: 3000,
        initialBufferSize: 262144 },
     connected: false,
     buffer: <Buffer 00 ... >,
     offset: 0,
     queue: [] } }

from kafka.

oleksiyk avatar oleksiyk commented on August 17, 2024

You nailed it down pretty correctly. The reason is that either findLeader or getTopicPartitions will only retry once and then throw on next failed attempt. However they work in assumption that only local copy of metadata can't be trusted. If the topic/partition isn't found in local copy - they refresh the metadata and lookup again. However in this particular case they still receive the outdated metadata from Kafka which they trust and thus throw lookup error.

While this can be fixed, and producer.send can easily retry on error thrown in _prepareProduceRequest this also means that producer.send will retry sending for topic/partitions that really don't exist in Kafka. I'm struggling with a decision here.

from kafka.

jotto avatar jotto commented on August 17, 2024

will retry sending for topic/partitions that really don't exist in Kafka

There is a scenario where they (topic/partitions) don't exist at all, but there's also a scenario where they temporarily don't exist (leader election recovering from a timed out broker).

Some food for thought:

  • Why does it matter that we retry a producer.send if the topic/partition doesn't exist as long as it's capped at a configured maximum attempts?
    • as a naive caller of producer.send why would UnknownTopicOrPartition during findLeader be treated any differently than UnknownTopicOrPartition during the actual send?
  • If we don't retry, then the caller bears more responsibility and/or knowledge of no-kafka implementation
    • the caller has to be aware of leader election topic/partitions issues and thus implement their own retry
      • if we're OK with this, it'd probably be helpful to change the error or add context so we know it was thrown during findLeader (hinting to the caller that this may be a retry-able error)

but I don't have much experience with the Java client so I don't know what the generally accepted behavior is.

from kafka.

oleksiyk avatar oleksiyk commented on August 17, 2024

Can you try the latest commit eb27877 ?

from kafka.

jotto avatar jotto commented on August 17, 2024

I pulled the latest 3.0, which includes eb27877 and I'm still seeing the same failure. In fact, if I add console.log statements within the if statement changed on line 58, I never see it called.

I'm not totally surprised though, eb27877 doesn't change how errors thrown during findLeader are handled does it? (unless I'm not thinking about the callstack correctly)

from kafka.

oleksiyk avatar oleksiyk commented on August 17, 2024

It does change actually. What is your producer code?

from kafka.

oleksiyk avatar oleksiyk commented on August 17, 2024

You can see by the following test that in case of wrong topic it doesn't throw now, it actually follows the same 'retries' logic and fulfils the promise with correct result (error result).

from kafka.

oleksiyk avatar oleksiyk commented on August 17, 2024

Try 2df4507

from kafka.

jotto avatar jotto commented on August 17, 2024

@oleksiyk 2df4507 appears to solve the problem, it makes sense why it solves the problem, and looks like a great solution.

Thanks for sticking with it.

from kafka.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.