When a Kafka leader is kill -9 'd, the producer has in

You nailed it down pretty correctly. The reason is that either <code class="notranslat

Can you try the latest commit <a class="commit-link" data-hovercard-type="commit" data

I pulled the latest 3.0, which includes <a class="commit-link" data-hovercard-type="co

You can see by the following <a href="https://github.com/oleksiyk/kafka/blob/3.0/test/

Producer sometimes throws (rejects promise) `UnknownTopicOrPartition` about kafka HOT 12 CLOSED

oleksiyk commented on August 17, 2024

Producer sometimes throws (rejects promise) `UnknownTopicOrPartition`

from kafka.

Comments (12)

oleksiyk commented on August 17, 2024

When do you kill the broker? Before the send call or? Is it the only broker?

from kafka.

oleksiyk commented on August 17, 2024

What is the fulfilment value in the first case?

from kafka.

jotto commented on August 17, 2024

It's extremely difficult to reproduce locally (apparently).

I'm using the same setup from this #63 (2 brokers) where one of the kafka brokers is loaded with config/server1.properties and the other is loaded with config/server2.properties.

I'm using this bash script to kill -9 whichever broker is currently the leader for mytopic:0

 serverid=$(./bin/kafka-topics.sh --zookeeper localhost --describe 2> /dev/null | egrep -o "Leader: ." | cut -d' ' -f2); ps aux | grep "config.server${serverid}" | awk '{print $2}' | xargs kill -9

Then I wait for the other broker to become leader and watch the log output from the no-kafka client. Most of the time the producer.send fulfills with:
```
 [ { topic: 'mytopic', partition: 0, error: null, offset: 3374 } ]
```
Then I boot the kafka broker up that was killed, wait for it to be recognized, and restart the whole process

Very rarely (in my testing), the producer.send is rejected with:

 { [KafkaError: This request is for a topic or partition that does not exist on this broker.]
   name: 'KafkaError',
   code: 'UnknownTopicOrPartition',
   message: 'This request is for a topic or partition that does not exist on this broker.' }

I don't yet know how to reproduce this other than manual retries as I've indicated above. It happened in production for us (AWS) on our first manual QA test of shutting down brokers and verifying no data loss.

from kafka.

jotto commented on August 17, 2024

I can reproduce with more regularity by publishing a message every 5ms (and randomly kill -9 1 of 2 brokers)

It looks like findLeader is failing more than once and then getting caught in the root/highest level catch of producer#send (instead of retried)

Adding a catch with a metadata refresh and retry to the promise returned from _prepareProduceRequest seems to solve the problem:

// Producer.prototype._send
(function _try(_data, attempt) {
    attempt = attempt || 1;

    return self._prepareProduceRequest(_data).then(function (requests) {
      // code removed for clarity
    }).catch(function(err) {
      // this is just quick and dirty, it doesn't respect retry limit
      return Promise.delay(task.options.retries.delay).then(function () {
          return self.client.updateMetadata().then(function () {
              return _try(_data, ++attempt);
          });
      });
    });
}(data))
.then(function () {
    task.resolve(result);
})
.catch(function (err) {
    task.reject(err);
});

If this is truly the root cause, the BDD spec would look something like this:

describe('Producer', function () {
  describe('#send', function () {
    describe('when findLeader fails more than once', function () {
      it('#send retries up to task.options.retries times');
    });
  });
}

This is my naive estimation of what the stack trace looks like:

#send
- #_send
  - #_prepareProduceRequest
    - #client.findLeader
      - throw UnknownTopicOrPartition, and RETRY
      - throw UnknownTopicOrPartition again (and don't retry)
- promise returned by #_send is rejected

and here is the contents of topicMetadata and brokerConnections when the error is thrown:

self.topicMetadata { mytopic: 
   { '0': 
      { error: [Object],
        partitionId: 0,
        leader: -1,
        replicas: [Object],
        isr: [] } } }
self.brokerConnections { '2': 
   Connection {
     options: 
      { host: '192.168.128.13',
        port: 9093,
        ssl: [Object],
        connectionTimeout: 3000,
        initialBufferSize: 262144 },
     connected: false,
     buffer: <Buffer 00 ... >,
     offset: 0,
     queue: [] } }

from kafka.

oleksiyk commented on August 17, 2024

You nailed it down pretty correctly. The reason is that either findLeader or getTopicPartitions will only retry once and then throw on next failed attempt. However they work in assumption that only local copy of metadata can't be trusted. If the topic/partition isn't found in local copy - they refresh the metadata and lookup again. However in this particular case they still receive the outdated metadata from Kafka which they trust and thus throw lookup error.

While this can be fixed, and producer.send can easily retry on error thrown in _prepareProduceRequest this also means that producer.send will retry sending for topic/partitions that really don't exist in Kafka. I'm struggling with a decision here.

from kafka.

jotto commented on August 17, 2024

will retry sending for topic/partitions that really don't exist in Kafka

There is a scenario where they (topic/partitions) don't exist at all, but there's also a scenario where they temporarily don't exist (leader election recovering from a timed out broker).

Some food for thought:

Why does it matter that we retry a producer.send if the topic/partition doesn't exist as long as it's capped at a configured maximum attempts?
- as a naive caller of producer.send why would UnknownTopicOrPartition during findLeader be treated any differently than UnknownTopicOrPartition during the actual send?
If we don't retry, then the caller bears more responsibility and/or knowledge of no-kafka implementation
- the caller has to be aware of leader election topic/partitions issues and thus implement their own retry
  - if we're OK with this, it'd probably be helpful to change the error or add context so we know it was thrown during findLeader (hinting to the caller that this may be a retry-able error)

but I don't have much experience with the Java client so I don't know what the generally accepted behavior is.

from kafka.

oleksiyk commented on August 17, 2024

Can you try the latest commit eb27877 ?

from kafka.

jotto commented on August 17, 2024

I pulled the latest 3.0, which includes eb27877 and I'm still seeing the same failure. In fact, if I add console.log statements within the if statement changed on line 58, I never see it called.

I'm not totally surprised though, eb27877 doesn't change how errors thrown during findLeader are handled does it? (unless I'm not thinking about the callstack correctly)

from kafka.

oleksiyk commented on August 17, 2024

It does change actually. What is your producer code?

from kafka.

oleksiyk commented on August 17, 2024

You can see by the following test that in case of wrong topic it doesn't throw now, it actually follows the same 'retries' logic and fulfils the promise with correct result (error result).

from kafka.

oleksiyk commented on August 17, 2024

Try 2df4507

from kafka.

jotto commented on August 17, 2024

@oleksiyk 2df4507 appears to solve the problem, it makes sense why it solves the problem, and looks like a great solution.

Thanks for sticking with it.

from kafka.

Producer sometimes throws (rejects promise) `UnknownTopicOrPartition` about kafka HOT 12 CLOSED

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent