Giter VIP home page Giter VIP logo

Comments (8)

andrewbaptist avatar andrewbaptist commented on June 21, 2024

Let's chat about this tomorrow if you are available. I have a few ideas how we could handle this, but I'm not sure what the best approach is. I can try and create a unit test that reproduces this behavior as well if its useful.

from cockroach.

blathers-crl avatar blathers-crl commented on June 21, 2024

cc @cockroachdb/replication

from cockroach.

nvanbenschoten avatar nvanbenschoten commented on June 21, 2024

I'm surprised that the lease proposal is not rejected immediately in this case, and that we instead perform the proposal and wait for a timeout.

I wonder if this is related to #118435.

If so, I also wonder whether we made the gap described in that issue bigger when we introduced CheckQuorum. Did CheckQuorum make it more common for followers to have a Lead value of None? Maybe not. We'd want to look at why these proposals are passing the check in maybeRejectUnsafeProposalLocked.

from cockroach.

andrewbaptist avatar andrewbaptist commented on June 21, 2024

To answer the question of why its not rejected:

Looking at the logs for replica_proposal_buf I see:

I240308 16:48:23.261892 254 kv/kvserver/replica_proposal_buf.go:724 ⋮ [T1,Vsystem,n2,s2,r123/2:‹/Table/106/1/8{49280…-67544…}›,raft] 937  proposing lease acquisition even though we're not the leader; the leader is unknown

I don't think its reasonable to not propose this. Once we are partitioned from the leader, we no longer know who the leader is and we begin to PreVote to become leader. Our PreVote will fail because the quorum still has connectivity to the leader and will reject our PreVote. However we don't "remember" in any meaningful way that we are failing to win a prevote.

from cockroach.

nvanbenschoten avatar nvanbenschoten commented on June 21, 2024

I don't think its reasonable to not propose this. Once we are partitioned from the leader, ...

Isn't what you're describing exactly why we should not propose this?

from cockroach.

erikgrinaker avatar erikgrinaker commented on June 21, 2024

I wonder if this is related to #118435.

Yeah, this is the same thing.

As noted on that issue, I think this will make replica circuit breakers mostly useless, and cause clients to get stuck on unavailable ranges again (which is the purpose of the replica circuit breakers). They will still trip in cases where there was already an in-flight proposal at the time quorum was lost, but otherwise there won't be a new proposal to trip them (consider e.g. the cold start case). I think this is better handled with DistSender-level range circuit breakers anyway, but they don't exist yet, and won't for 24.1.

I'm not sure why #120150 doesn't fail any tests. Possibly because all the tests actually use in-flight proposals like a Put when tripping the breaker, or because they all manually trip the breakers.

In any case, if we do this, then I'd be inclined to just tear out the replica circuit breakers entirely, since they don't serve much of a purpose anymore. We should be deliberate about the availability tradeoff we're making though.

from cockroach.

andrewbaptist avatar andrewbaptist commented on June 21, 2024

I don't fully understand the problematic case where this will cause clients to get stuck. I originally expected some test failures from #120150 and was also surprised when there weren't any.

I also agree that DistSender level range circuit breakers will handle certain failures better, but maybe we need a good test that demonstrates the problem first if we do implement this change.

from cockroach.

erikgrinaker avatar erikgrinaker commented on June 21, 2024

I don't fully understand the problematic case where this will cause clients to get stuck.

Start a single, lone replica in a RF=3 Raft group (i.e. without quorum), and send a request to it. Normally, this should error out once the circuit breaker trips after 60 seconds (as should all subsequent requests), but with that change I think they'll hang indefinitely instead. We need to propose something for the circuit breakers to trip (the condition is that a proposal hasn't gone through in 60 seconds), and if we don't propose the lease acquisition then nothing else will get proposed either, so the replica circuit breakers will never trip.

from cockroach.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.