Comments (14)
We probably need more information to be able to help here.
> In our production environment, our telemetry indicates that one of our gRPC servers returns code UNAVAILABLE about once per day. What is particularly interesting is that all evidence indicates the request completes processing through the middleware and the application layer, but somehow fails in what seems to be the gRPC internals, out of our control.
What is your telemetry? You say it's otel -- is that google.golang.org/grpc/stats/opentelemetry or something else? On what side (client/server) is it reporting UNAVAILABLE? What about your middleware?
Interestingly, the trace does seem to indicate that the client gives up on the request, since the span is 14ms on the client side but 340ms on the server side.
Do you know what status the server is reporting for the RPC in this case? Is it a success? We'd really like to see some client-side debug logs here if you are able to get them. Possibly the client starts the RPC, then the connection is lost, but the server doesn't notice that until after it processes the RPC?
> What is your telemetry? You say it's otel -- is that google.golang.org/grpc/stats/opentelemetry or something else?
This is the stats handler implementation we are using: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go
> Do you know what status the server is reporting for the RPC in this case?
The server reports UNAVAILABLE, but the server-side middleware does not see any errors. So yeah, it is possible that by the time the server tries to write the response, the connection is gone. It's just a bit annoying for us, since we alert on a few status codes (INTERNAL, UNAVAILABLE) and this alert fires once a day.
I just confirmed that the client side is getting CANCELLED due to the context getting timed out/cancelled. It seems strange to me that this results in UNAVAILABLE on the server side though.
We are running with full logging enabled on the server side; we can do the same on the client side if the logs are not helpful the next time this issue occurs.
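For anyone reproducing this: "full logging" here means grpc-go's internal logger at maximum verbosity, which can be enabled with the `GRPC_GO_LOG_SEVERITY_LEVEL=info` and `GRPC_GO_LOG_VERBOSITY_LEVEL=99` environment variables, or in-process as in this minimal sketch:

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Send info/warning output to stdout and errors to stderr, with
	// verbosity 99 so the [core] and [transport] internals log their
	// state transitions (the same effect as the two env vars above).
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stdout, os.Stdout, os.Stderr, 99))
}
```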
@Sovietaced, based on the fact that the nginx ingress reports an HTTP 499, can you confirm whether the client is closing the transport (and not just cancelling the request context) before the server sends back a response?
> This is the stats handler implementation we are using: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go
We provide one as I described earlier, which we recommend using instead. I hope you understand that we cannot support the one contributed to the otel repo.
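For context, wiring up the grpc-provided module looks roughly like the sketch below, assuming the experimental google.golang.org/grpc/stats/opentelemetry package (as of v1.66) and an OTel SDK meter provider configured elsewhere in the application:

```go
package main

import (
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"google.golang.org/grpc"
	"google.golang.org/grpc/stats/opentelemetry"
)

// newInstrumentedServer attaches grpc-go's own OpenTelemetry stats handler
// to a server; the client-side equivalent is opentelemetry.DialOption.
func newInstrumentedServer(mp *sdkmetric.MeterProvider) *grpc.Server {
	return grpc.NewServer(opentelemetry.ServerOption(opentelemetry.Options{
		MetricsOptions: opentelemetry.MetricsOptions{MeterProvider: mp},
	}))
}
```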
> It seems strange to me that this results in UNAVAILABLE on the server side though.
Server-side status codes mostly only make sense if the server actually responds to the RPC. If our library is converting client cancellation into UNAVAILABLE errors, we'd need some help determining where that's happening (ideally in the form of some minimal reproduction code).
There are also always potential races where the server responds with a status code but the client cancels the RPC before it receives that response.
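A minimal, self-contained sketch of that race, using the standard health service to avoid codegen; the 15ms/300ms timings are invented to mirror the 14ms/340ms spans above. The handler returns OK, but the client's deadline has long expired, so the server's attempt to write the status can fail with the "failed to write status" warning seen later in this thread:

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// slowHealth is a stand-in handler that outlives the client's patience.
type slowHealth struct {
	healthpb.UnimplementedHealthServer
}

func (slowHealth) Check(ctx context.Context, _ *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	time.Sleep(300 * time.Millisecond) // mimic the ~340ms server-side span
	// The handler returns OK, but the client gave up long ago; the server's
	// attempt to write this status may fail with "transport is closing".
	return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_SERVING}, nil
}

func main() {
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, slowHealth{})
	go func() { _ = srv.Serve(lis) }()

	conn, err := grpc.NewClient(lis.Addr().String(), grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Mimic the 14ms client-side span: a deadline far shorter than the handler.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Millisecond)
	defer cancel()
	_, err = healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	log.Printf("client saw: %v", err) // DeadlineExceeded here; Cancelled if the context is cancelled outright

	time.Sleep(time.Second) // give the handler time to finish and try to write
	srv.Stop()
}
```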
I have the same error, but in my case the frontend is grpc-java calling some endpoint.
My repro case:
- send a first request to the endpoint with invalid data, so my backend sends a gRPC error like Unauthorized
- immediately send a second request to the same endpoint with valid data: my backend responds with code 0 (OK), but the nginx ingress logs say it returned a 499 status code, and grpc-java receives a gRPC error.
> We provide one as I described earlier, which we recommend using instead. I hope you understand that we cannot support the one contributed to the otel repo.
I understand; however, if you look at the stats_handler in the otel repo, it seems to merely observe the status codes passed to it, so I don't think the stats_handler is the issue.
> There are also always potential races where the server responds with a status code but the client cancels the RPC before it receives that response.
Yeah, given that this seems to happen only once a day, I assume this is some weird race. Coincidentally, now that I have enabled the additional logging environment variables, it hasn't reproduced yet :)
Alright. The issue just happened twice, and I have server-side logs.
2024-09-12 15:16:06.664 | 2024/09/12 22:16:06 WARNING: [core] [Server #6]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2024-09-12 15:16:06.630 | 2024/09/12 22:16:06 INFO: [transport] [server-transport 0xc0002f0c00] loopyWriter exiting with error: transport closed by client
2024-09-12 15:16:06.630 | 2024/09/12 22:16:06 INFO: [transport] [server-transport 0xc0002f0c00] Closing: EOF
From what I can see, this log line is unique to the issue:
> [Server #6]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
The client in this case is a server making a call to an auth service (prior to handling the incoming RPC request). Our error logging middleware logs the following.
{"time":"2024-09-12T22:16:06.629011334Z","level":"ERROR","source":{"<redacted>","file":"<redacted>","line":38},"msg":"encountered unexpected error: failed to call auth service: rpc error: code = Canceled desc = context canceled","error_message":"failed to call auth service: rpc error: code = Canceled desc = context canceled","method":"<redacted>"}
--
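For reference, the middleware shape being described is roughly a unary server interceptor that calls the auth service with the incoming RPC's context; a hedged sketch, where the AuthClient interface and its Authorize method are hypothetical stand-ins for the real generated client:

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/protobuf/types/known/emptypb"
)

// AuthClient stands in for the generated client of the (hypothetical) auth service.
type AuthClient interface {
	Authorize(ctx context.Context, in *emptypb.Empty, opts ...grpc.CallOption) (*emptypb.Empty, error)
}

// authInterceptor rejects RPCs that the auth service does not approve.
func authInterceptor(auth AuthClient) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
		// The outbound auth call inherits the incoming RPC's context, so it is
		// cancelled ("context canceled" in the log above) the moment the
		// original caller disconnects.
		if _, err := auth.Authorize(ctx, &emptypb.Empty{}); err != nil {
			return nil, fmt.Errorf("failed to call auth service: %w", err)
		}
		return handler(ctx, req)
	}
}
```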
The client stats_handler reports status code 1 (UNKNOWN), which is strange. I'll have to look more into that behavior.
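One way to dig into that is a second, trivial client-side stats handler that logs each RPC's final status next to what otelgrpc records; a minimal sketch, relying on recent grpc-go appending each WithStatsHandler option rather than replacing the previous one:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc/stats"
	"google.golang.org/grpc/status"
)

// statusLogger is a stats.Handler that only logs the final status of each RPC.
type statusLogger struct{}

func (statusLogger) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context   { return ctx }
func (statusLogger) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (statusLogger) HandleConn(context.Context, stats.ConnStats)                       {}

func (statusLogger) HandleRPC(_ context.Context, s stats.RPCStats) {
	if end, ok := s.(*stats.End); ok {
		// status.Code(nil) is codes.OK, so successes are logged as OK.
		log.Printf("rpc finished: code=%v err=%v", status.Code(end.Error), end.Error)
	}
}
```

It would be registered on the client with `grpc.WithStatsHandler(statusLogger{})` alongside the otelgrpc handler.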
From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.
> From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.
The server is reporting UNAVAILABLE and the clients are reporting UNKNOWN.
Thanks for the correction. In that case we'd want to see some logs on the client to determine what's going on. It's possible something else happened, like the headers were received but the data was cut off due to the connection loss, and then the codec errored decoding it (just a guess). Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.
> Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.
Alright. I will update all of our clients with the increased logging to see what we can find.