
Comments (14)

dfawley avatar dfawley commented on September 27, 2024

We probably need more information to be able to help, here.

In our production environment our telemetry will indicate that one of our gRPC servers returns code UNAVAILABLE about once per day. What is particularly interesting is that all evidence indicates that the request completes processing through the middleware and the application layer but somehow fails in what seems to be the gRPC internals and is out of our control.

What is your telemetry? You say it's otel -- is that google.golang.org/grpc/stats/opentelemetry or something else? On what side (client/server) is it reporting UNAVAILABLE? What about your middleware?

Interestingly, the trace does seem to indicate that the client gives up on the request, since the span is 14ms on the client side but 340ms on the server side.

Do you know what status the server is reporting for the RPC in this case? Is it a success? We'd really like to see some client-side debug logs here if you are able to get it. Possibly the client starts the RPC, then the connection is lost, but the server doesn't notice that until after it processes the RPC?

from grpc-go.

Sovietaced avatar Sovietaced commented on September 27, 2024

What is your telemetry? You say it's otel -- is that google.golang.org/grpc/stats/opentelemetry or something else?

This is the stats handler implementation we are using: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go

Do you know what status the server is reporting for the RPC in this case?

The server reports UNAVAILABLE, but the server side middleware does not see any errors. So yeah, it is possible that by the time the server tries to write the response the connection is gone. It's just a bit annoying for us since we alert on a few status codes (INTERNAL, UNAVAILABLE) and this alert fires off once a day.

I just confirmed that the client side is getting CANCELLED because its context is timing out or being cancelled. It still seems strange to me that this results in UNAVAILABLE on the server side, though.

We are running with full logging enabled on the server side; we can do the same on the client side if the logs are not helpful the next time this issue occurs.


arjan-bal avatar arjan-bal commented on September 27, 2024

@Sovietaced, based on the fact that the nginx ingress reports an HTTP 499, can you confirm whether the client is closing the transport (and not just cancelling the request context) before the server sends back a response?


dfawley avatar dfawley commented on September 27, 2024

This is the stats handler implementation we are using: open-telemetry/opentelemetry-go-contrib@main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go

We provide one as I described earlier, which we recommend using instead. I hope you understand that we cannot support this one contributed to the otel repo.

It seems strange to me that this results in UNAVAILABLE on the server side though.

Server-side status codes mostly only make sense if the server actually responds to the RPC. If our library is converting client cancelation into UNAVAILABLE errors, we'd need some help determining where that's happening (ideally in the form of some minimal reproduction code).

There's also always potential races where the server responds with a status code but the client cancels the RPC before it receives that response.


vtolstov avatar vtolstov commented on September 27, 2024

I have the same error, but in my case the frontend is grpc-java calling one of my endpoints.
My repro case:

  1. Send a first request to the endpoint with invalid data, so my backend returns a gRPC error such as Unauthorized.
  2. Immediately send a second request to the same endpoint with valid data: my backend responds with code 0 (OK), but the nginx ingress logs a 499 status code and grpc-java receives a gRPC error.


Sovietaced avatar Sovietaced commented on September 27, 2024

We provide one as I described earlier, which we recommend using instead. I hope you understand that we cannot support this one contributed to the otel repo.

I understand; however, if you look at the stats_handler in the OTEL repo, it seems to merely observe status codes as they are passed to it, so I don't think the stats_handler is the issue.

There's also always potential races where the server responds with a status code but the client cancels the RPC before it receives that response.

Yeah, given that this seems to happen only once a day, I assume this is some weird race. Coincidentally, now that I have enabled the additional logging environment variables, it hasn't reproduced yet :)


Sovietaced avatar Sovietaced commented on September 27, 2024

Alright. The issue just happened twice and I have server-side logs.


2024-09-12 15:16:06.630 | 2024/09/12 22:16:06 INFO: [transport] [server-transport 0xc0002f0c00] Closing: EOF
2024-09-12 15:16:06.630 | 2024/09/12 22:16:06 INFO: [transport] [server-transport 0xc0002f0c00] loopyWriter exiting with error: transport closed by client
2024-09-12 15:16:06.664 | 2024/09/12 22:16:06 WARNING: [core] [Server #6]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"

From what I can see, this log line is unique to the issue:

[Server #6]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"

The client in this case is a server making a call to an auth service (prior to handling the incoming RPC request). Our error logging middleware logs the following.

{"time":"2024-09-12T22:16:06.629011334Z","level":"ERROR","source":{"<redacted>","file":"<redacted>","line":38},"msg":"encountered unexpected error: failed to call auth service: rpc error: code = Canceled desc = context canceled","error_message":"failed to call auth service: rpc error: code = Canceled desc = context canceled","method":"<redacted>"}

The client stats_handler reports status code 1 (UNKNOWN), which is strange. I'll have to look more into that behavior.


dfawley avatar dfawley commented on September 27, 2024

From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.


Sovietaced avatar Sovietaced commented on September 27, 2024

From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.

The server is reporting UNAVAILABLE and the clients are reporting UNKNOWN.


dfawley avatar dfawley commented on September 27, 2024

Thanks for the correction. In that case we'd want to see some logs on the client to determine what's going on. It's possible something else happened, like the headers were received but the data was cut off due to the connection loss, and then the codec errored decoding it (just a guess). Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.


Sovietaced avatar Sovietaced commented on September 27, 2024

Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.

Alright. I will update all of our clients with the increased logging to see what we can find.

