Giter VIP home page Giter VIP logo

Comments (4)

nvanbenschoten avatar nvanbenschoten commented on July 17, 2024

(40003) job 967625704145649667: could not mark as succeeded: result is ambiguous: error=can't combine batch responses with errors, have errors <nil> and [NotLeaseHolderError] proxy failed, update client information desc: r71:/Tenant/1{0/Table/11-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING != et:desc:r72:/Tenant/10/Table/{12-25} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease:repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0; r72: replica not lease holder; current lease is repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0 [propagate] (last error: key range /Tenant/10/Table/15/1/967625704145649667/0-/Tenant/10/Table/54/1/967625704145649667/"legacy_progress"/1915-08-24T05:21:08.625059999Z/0/NULL outside of bounds of range /Tenant/10/Table/11-/Tenant/10/Table/12; suggested ranges: [desc: r71:/Tenant/10/Table/1{1-2} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=8], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING desc: r72:/Tenant/10/Table/{12-25} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING desc: r73:/Tenant/1{0/Table/25-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING])

@andrewbaptist do we expect this form of ambiguous error?

from cockroach.

andrewbaptist avatar andrewbaptist commented on July 17, 2024

I don't think we do. Let me take a closer look at where this "failure to combine" error is coming from.

from cockroach.

andrewbaptist avatar andrewbaptist commented on July 17, 2024

I understand how this can happen now, and I was worried about this exact scenario when I was fixing #123146. Specifically the scenario of the "client", "proxy node" and "meta2" having 3 different views of the world can cause this problem to happen. I thought about a reasonably easy fix to this but didn't implement it as part of that PR as it was a little larger than I would have liked.

The correct solution is to have 2 guarantees:

  1. Immediately reject requests if the proxy node detects that the client had stale information compared to what the proxy node has
  2. Not allow the proxy node to update its state information during its request evaluation.

The way the code is structured, we almost have both these guarantees, but they are not airtight right now. This should be a very rare case where we are hitting these scenarios, but based on the fact that this test is failing, it needs to be stronger.

Let me take a look at how easy it will be to tighten this up.

from cockroach.

andrewbaptist avatar andrewbaptist commented on July 17, 2024

@arulajmani - We should chat about this bug.

As I’m looking through this issue, I’m realizing I might have made a mistake with the way proxying is done on the proxy node. Specifically there are 3 parts of the proxy code, and it is the third one where there is an issue:

  1. On the sender side deciding if it should proxy and to whom (this is OK)
  2. On the proxy node, evaluating locally and then determining if it should proxy (this is also OK)
  3. On the proxy node, sending the proxy request through DistSender (this is the problem)

For the DistSender proxy piece, I had originally conceived of there being some ability to handle the proxy node having more up-to-date information than the client and therefore taking advantage of its local DistSender cache to send the request, but as I’ve gone through the last couple bugs related to this I’m realizing this was a bad decision. Specifically proxy requests should just use the Transport rather than the RangeCache at all.

So there are a couple options:

  1. Add a new method to DistSender (something like - SendProxyRequest) that basically validates the request against its local cache, and if the request carries stale information, it returns an error immediately, otherwise it creates a Transport and sends the request without all the retries in sendPartialBatch/sendToReplicas.
  2. Attempt to use the existing methods, but make sure that certain operations (like splitting a batch) can’t happen.

I prefer the first, but I need to be careful to ensure I handle things like Context cancellation and other errors that can happen. I don’t think the refactor will be very big and it would help reduce some of the complexity of the already complex

from cockroach.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.