Comments (4)
(40003) job 967625704145649667: could not mark as succeeded: result is ambiguous: error=can't combine batch responses with errors, have errors <nil> and [NotLeaseHolderError] proxy failed, update client information desc: r71:/Tenant/1{0/Table/11-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=7], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING != et:desc:r72:/Tenant/10/Table/{12-25} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease:repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0; r72: replica not lease holder; current lease is repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0 [propagate] (last error: key range /Tenant/10/Table/15/1/967625704145649667/0-/Tenant/10/Table/54/1/967625704145649667/"legacy_progress"/1915-08-24T05:21:08.625059999Z/0/NULL outside of bounds of range /Tenant/10/Table/11-/Tenant/10/Table/12; suggested ranges: [desc: r71:/Tenant/10/Table/1{1-2} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=8], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING desc: r72:/Tenant/10/Table/{12-25} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING desc: r73:/Tenant/1{0/Table/25-1} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=9], lease: repl=(n1,s1):1 seq=1 start=0,0 exp=1715366334.000521313,0 pro=1715366328.000521313,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING])
@andrewbaptist do we expect this form of ambiguous error?
from cockroach.
I don't think we do. Let me take a closer look at where this "failure to combine" error is coming from.
from cockroach.
I understand how this can happen now, and I was worried about this exact scenario when I was fixing #123146. Specifically the scenario of the "client", "proxy node" and "meta2" having 3 different views of the world can cause this problem to happen. I thought about a reasonably easy fix to this but didn't implement it as part of that PR as it was a little larger than I would have liked.
The correct solution is to have 2 guarantees:
- Immediately reject requests if the proxy node detects that the client had stale information compared to what the proxy node has
- Not allow the proxy node to update its state information during its request evaluation.
The way the code is structured, we almost have both these guarantees, but they are not airtight right now. This should be a very rare case where we are hitting these scenarios, but based on the fact that this test is failing, it needs to be stronger.
Let me take a look at how easy it will be to tighten this up.
from cockroach.
@arulajmani - We should chat about this bug.
As I’m looking through this issue, I’m realizing I might have made a mistake with the way proxying is done on the proxy node. Specifically there are 3 parts of the proxy code, and it is the third one where there is an issue:
- On the sender side deciding if it should proxy and to whom (this is OK)
- On the proxy node, evaluating locally and then determining if it should proxy (this is also OK)
- On the proxy node, sending the proxy request through
DistSender
(this is the problem)
For the DistSender
proxy piece, I had originally conceived of there being some ability to handle the proxy node having more up-to-date information than the client and therefore taking advantage of its local DistSender
cache to send the request, but as I’ve gone through the last couple bugs related to this I’m realizing this was a bad decision. Specifically proxy requests should just use the Transport rather than the RangeCache at all.
So there are a couple options:
- Add a new method to
DistSender
(something like -SendProxyRequest
) that basically validates the request against its local cache, and if the request carries stale information, it returns an error immediately, otherwise it creates a Transport and sends the request without all the retries insendPartialBatch/sendToReplicas
. - Attempt to use the existing methods, but make sure that certain operations (like splitting a batch) can’t happen.
I prefer the first, but I need to be careful to ensure I handle things like Context cancellation and other errors that can happen. I don’t think the refactor will be very big and it would help reduce some of the complexity of the already complex
from cockroach.
Related Issues (20)
- logictest: select_for_update is flaky HOT 14
- sql/tests: TestRandomSyntaxGeneration failed HOT 1
- cli/democluster: TestTransientClusterSimulateLatencies failed
- ccl/backupccl: TestDataDriven_user_defined_functions_in_defaults failed HOT 1
- pkg/sql/logictest/tests/fakedist-vec-off/fakedist-vec-off_test: TestLogic_select_for_update failed HOT 1
- roachtest: replicagc-changed-peers/restart=true failed HOT 2
- roachtest: tpce/c=5000/nodes=3 failed HOT 2
- pkg/sql/copy/copy_test: TestCopyFromRandom failed HOT 1
- kv/kvserver/intentresolver: TestIntentResolutionUnavailableRange failed HOT 2
- roachtest: tpce/c=5000/nodes=3 failed HOT 2
- roachtest: kv95/enc=false/nodes=3/mt-shared-process failed [azure] HOT 1
- roachtest: replicate/wide failed HOT 3
- pkg/sql/logictest/tests/local-vec-off/local-vec-off_test: TestLogic_select_for_update failed HOT 1
- admission: classify TTL writes as elastic work
- pkg/ccl/testccl/sqlccl/sqlccl_test: TestShowCreateRedactableValues failed
- changefeedccl: add data driven testing HOT 1
- Sentry: panic.go:884: runtime error: invalid memory address or nil pointer dereference (1) attached stack trace -- stack trace: | runtime.gopanic | GOROOT/src/runtime/panic.go:884 | runtime.p...
- Sentry: panic.go:884: runtime error: invalid memory address or nil pointer dereference (1) attached stack trace -- stack trace: | runtime.gopanic | GOROOT/src/runtime/panic.go:884 | github.co...
- roachtest: tpce/c=5000/nodes=3 failed HOT 1
- roachtest: unoptimized-query-oracle: OOM after a query with `st_memcollect` HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from cockroach.