Giter VIP home page Giter VIP logo

Comments (5)

easwars avatar easwars commented on July 18, 2024 1

@sebastianjonasson : You are correct in your assessment about when the timer is created in the DNS resolver's watcher goroutine. I was also going to fix it the same way you have attempted to fix it. But I wanted to have a test for it, which is why I asked you if you had an easy reproduction. Since you don't have an easy repro, I will work on a unit test for this scenario. Thank you for reporting this issue and all the information you have provided along the way.

from grpc-go.

purnesh42H avatar purnesh42H commented on July 18, 2024

@easwars could you verify if the the patch from PR #6668 can cause this delay?

from grpc-go.

easwars avatar easwars commented on July 18, 2024

I can see how #6668 could be the problem maker. I understand the problem, and the fix required as well.

Is there a way to easily reproduce the problem though (without any complicated k8s setup)? That would be helpful to verify the fix. Thanks.

from grpc-go.

sebastianjonasson avatar sebastianjonasson commented on July 18, 2024

Thanks for the response @easwars!

Unfortunately I've not been able to reproduce this locally. All debugging have been done in our cloud based dev environment.

I've done some additional testing and found that the problem still remains on master. However, I've been able to get this to work by checking out master and reverting the watcher function to use a timer instead of time.After. Example here: https://github.com/sebastianjonasson/grpc-go/pull/1/files.

It seems like the issue is that the current implementation assigns the duration instead of starting the timer as the previous (<=v1.59.0) implementation did. This means that the current implementation will always wait 30s after receiving the <-rn message, while in the previous implementation the timer would have expired when receiving the <-rn message yielding an immediate resolve.

var waitTime time.Duration
if err == nil {
// Success resolving, wait for the next ResolveNow. However, also wait 30
// seconds at the very least to prevent constantly re-resolving.
backoffIndex = 1
waitTime = MinResolutionInterval
select {
case <-d.ctx.Done():
return
case <-d.rn:
}
} else {
// Poll on an error found in DNS Resolver or an error received from
// ClientConn.
waitTime = backoff.DefaultExponential.Backoff(backoffIndex)
backoffIndex++
}
select {
case <-d.ctx.Done():
return
case <-internal.TimeAfterFunc(waitTime):
}

To try to prove my case I've added the following debug logs for master and my test branch to highlight the behavior:

  • debug: start timer
  • debug: wait for timer
  • debug: timer done

As you can see in the logs for master we wait 30s upon receiving the <-rn message, while the logs for the test branch shows immediate resolution since the timer has already expired.

Logs master:

2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel switches to new LB policy "round_robin"
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel created
2024/05/17 09:30:48 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 09:30:48 INFO: [dns] debug: start timer
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY
2024/05/17 09:30:48 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:2):{{Addr: "10.10.10.40:8000", ServerName: "", }}]}
2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel Connectivity change to READY
2024/05/17 09:32:05 INFO: [core] [Channel #1]Channel Connectivity change to SHUTDOWN
2024/05/17 09:32:05 INFO: [core] [Channel #1]Closing the name resolver
2024/05/17 09:32:05 INFO: [core] [Channel #1]ccBalancerWrapper: closing
2024/05/17 09:32:05 INFO: [core] [Channel #1 SubChannel #4]Subchannel Connectivity change to SHUTDOWN
2024/05/17 09:32:05 INFO: [core] [Channel #1 SubChannel #4]Subchannel deleted
2024/05/17 09:32:05 INFO: [core] [Channel #1]Channel deleted
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2024/05/17 09:37:16 INFO: [dns] debug: wait for timer
2024/05/17 09:37:16 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to IDLE
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 09:37:16 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:17 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:18 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:21 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:25 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:46 INFO: [dns] debug: timer done
2024/05/17 09:37:46 INFO: [core] [Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "10.10.10.50:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.10.10.50:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()

Logs test branch:

2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel created
2024/05/17 11:35:54 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 11:35:54 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 11:35:54 INFO: [dns] debug: start timer
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.20.50:8000" to connect
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY
2024/05/17 11:35:54 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:2):{{Addr: "10.10.20.50:8000", ServerName: "", }}]}
2024/05/17 11:35:54 INFO: [core] [Channel #1]Channel Connectivity change to READY
2024/05/17 11:37:15 INFO: [core] [Channel #1]Channel Connectivity change to SHUTDOWN
2024/05/17 11:37:15 INFO: [core] [Channel #1]Closing the name resolver
2024/05/17 11:37:15 INFO: [core] [Channel #1]ccBalancerWrapper: closing
2024/05/17 11:37:15 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to SHUTDOWN
2024/05/17 11:37:15 INFO: [core] [Channel #1 SubChannel #2]Subchannel deleted
2024/05/17 11:37:15 INFO: [core] [Channel #1]Channel deleted
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2024/05/17 11:46:18 INFO: [dns] debug: wait for timer
2024/05/17 11:46:18 INFO: [dns] debug: timer done
2024/05/17 11:46:18 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to IDLE
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.20.50:8000" to connect
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 11:46:18 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.20.50:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.20.50:8000: connect: connection refused"
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.20.50:8000: connect: connection refused"
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE
2024/05/17 11:46:18 INFO: [core] [Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "10.10.20.140:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.10.20.140:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #4]Subchannel created

from grpc-go.

sebastianjonasson avatar sebastianjonasson commented on July 18, 2024

Great! Thanks for looking into this!

from grpc-go.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.