
Comments (3)

arjan-bal commented on September 28, 2024

Root Cause

  • The balancer wrapper calls connect in a new goroutine: https://github.com/grpc/grpc-go/blame/bdd707e642e40cf75db5ac3f0f6af48077f48368/balancer_wrapper.go#L276C22-L276C22
  • This in turn calls addrConn.connect(), which briefly locks the mutex, checks that the channel is idle, and releases the mutex before calling resetTransport here:

    grpc-go/clientconn.go

    Lines 914 to 925 in bdd707e

        if ac.state != connectivity.Idle {
            if logger.V(2) {
                logger.Infof("connect called on addrConn in non-idle state (%v); ignoring.", ac.state)
            }
            ac.mu.Unlock()
            return nil
        }
        ac.mu.Unlock()
        ac.resetTransport()
        return nil
    }
  • resetTransport locks the mutex when it starts and, while holding it, sets the state to connecting to prevent parallel connections:

    grpc-go/clientconn.go

    Lines 1234 to 1262 in bdd707e

    func (ac *addrConn) resetTransport() {
        ac.mu.Lock()
        acCtx := ac.ctx
        if acCtx.Err() != nil {
            ac.mu.Unlock()
            return
        }
        addrs := ac.addrs
        backoffFor := ac.dopts.bs.Backoff(ac.backoffIdx)
        // This will be the duration that dial gets to finish.
        dialDuration := minConnectTimeout
        if ac.dopts.minConnectTimeout != nil {
            dialDuration = ac.dopts.minConnectTimeout()
        }
        if dialDuration < backoffFor {
            // Give dial more time as we keep failing to connect.
            dialDuration = backoffFor
        }
        // We can potentially spend all the time trying the first address, and
        // if the server accepts the connection and then hangs, the following
        // addresses will never be tried.
        //
        // The spec doesn't mention what should be done for multiple addresses.
        // https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md#proposed-backoff-algorithm
        connectDeadline := time.Now().Add(dialDuration)
        ac.updateConnectivityState(connectivity.Connecting, nil)

When addrConn.connect releases the mutex after checking for idleness, another call to addrConn.connect can come in and also see the channel as idle, because the first call's resetTransport hasn't yet acquired the lock and set the state to connecting. So we end up with two connection attempts in parallel.

A simple fix is to set the state to connecting while addrConn.connect still has the mutex locked. I tried it and it fixed the flakiness. Will discuss with the team and raise a PR.


arjan-bal commented on September 28, 2024

There is roughly 0.4% flakiness when run on forge: 399 failures out of 100000 runs.


arjan-bal commented on September 28, 2024

Investigation

It looks like the subchannel picks the same address twice in the failing runs:

tlogger.go:116: INFO clientconn.go:1329 [core] [Channel #522 SubChannel #523]Subchannel picks a new address "127.0.0.1:44757" to connect  (t=+1.285688ms)
tlogger.go:116: INFO clientconn.go:1329 [core] [Channel #522 SubChannel #523]Subchannel picks a new address "127.0.0.1:44757" to connect  (t=+1.429717ms)

tryAllAddrs is being called twice.

