Comments (14)
It looks like there were connection issues in the cluster right after it started. I haven't seen this "Can't find decompressor for snappy" gRPC error before; checking whether the KV team knows anything about it.
I240602 05:16:32.549270 383 kv/kvserver/replica_command.go:1754 ⋮ [T1,n1,replicate,s1,r11/1:‹/Table/{7-8}›] 208 could not successfully add and upreplicate LEARNER replica(s) on [n2,s2], rolling back: error sending couldn't accept ‹range_id:11 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:2 store_id:2 replica_id:5 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:RECOVERY type:INITIAL term:6 first_index:32 sender_queue_name:REPLICATE_QUEUE sender_queue_priority:10001 descriptor_generation:7 queue_on_delegate_len:-1 snap_id:396c82fb-f663-4c05-bbaf-4a56bdee8181 ›: unable to dial n2: ‹breaker open›
E240602 05:16:31.290980 1123 2@rpc/context.go:2412 ⋮ [T1,n1,rnode=2,raddr=‹localhost:42203›,class=default,rpc] 166 unable to connect (is the peer up and reachable?): initial connection heartbeat failed: grpc: ‹Can't find decompressor for snappy› [code 12/Unimplemented]
E240602 05:16:32.550892 2965 2@rpc/context.go:2412 ⋮ [T1,n1,rnode=2,raddr=‹localhost:42203›,class=default,rpc] 358 unable to connect (is the peer up and reachable?): initial connection heartbeat failed: grpc: ‹Can't find decompressor for snappy› [code 12/Unimplemented]
E240602 05:16:37.043408 383 kv/kvserver/queue.go:1142 ⋮ [T1,n1,replicate,s1,r34/1:‹/Table/3{2-3}›] 546 failed to replicate after 5 retries
from cockroach.
Looks odd. We register the snappy de-/compressor in init of the rpc package:
Lines 224 to 226 in 4c06ddd
and opt into it when making a connection:
Lines 1672 to 1686 in 4c06ddd
All nodes should be able to find it, unless we're talking to a server that somehow does not use the rpc package and hasn't registered snappy.
from cockroach.
@rafiss I don't know how to run this test, could you hint? Can you reliably repro it?
from cockroach.
Interestingly, the troubled node localhost:42203 not only fails incoming connections; it also can't initiate an rpc handshake with the other 2 nodes, because its own registry doesn't have snappy:
$ grep "closing connection after" log.txt | grep -o "raddr=.*class" | sort | uniq
raddr=‹localhost:43077›,class
raddr=‹localhost:44245›,class
Sounds like some infra flake / memory corruption on this node.
from cockroach.
The test can be run with either of these commands:
❯ ./dev testlogic base --config=cockroach-go-testserver-23.1 --files=mixed_version_udf
❯ ./dev test pkg/sql/logictest/tests/cockroach-go-testserver-23.1 -f=TestLogic_mixed_version_udf_execute_privileges
No, it doesn't repro reliably, so if you are fine with closing it out, that's fine for me too.
from cockroach.
Reopening since we have other occurrences of this in #125133 and #125151 on a different branch.
from cockroach.
@rafiss WAIDW?
$ dev testlogic base --config=cockroach-go-testserver-23.1 --files=mixed_version_udf
WARNING: no tests found
$ dev test pkg/sql/logictest/tests/cockroach-go-testserver-23.1 -f=TestLogic_mixed_version_udf_execute_privileges
ERROR: could not query for tests within pkg/sql/logictest/tests/cockroach-go-testserver-23.1:all: got error exit status 7
UPD: got it, needed to change 23.1 to 23.2 on master.
from cockroach.
No luck catching this by just stressing the test.
from cockroach.
The weird thing is that I can't find the "Can't find decompressor for" substring or its variations in either the CRDB or the grpc-go codebase. The closest thing in the Go gRPC codebase is the error "grpc: Decompressor is not installed for grpc-encoding %q", which can be returned in a bunch of places when the compressor is not registered.
I wonder if our request is proxied through some non-CRDB/Go server, or gets to one by mistake (e.g. the CI machine has some other service at this port). Occurrences of the "Can't find decompressor for" error string can only be found in the Java gRPC repo. So are we talking to some Java server?
This still smells like an infra flake.
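The reasoning above can be written down as a small sketch: match the error text against the two message shapes mentioned in this comment and guess which gRPC implementation produced it. The function name `guessImplementation` is hypothetical, and the mapping relies only on the observation above that one string was found in the Java gRPC repo and the other in grpc-go; it is a heuristic, not a definitive fingerprint.

```go
package main

import (
	"fmt"
	"strings"
)

// guessImplementation is a hypothetical helper. It maps an error message to
// the gRPC implementation in which, per the comment above, that message
// text was found. Anything else is reported as unknown.
func guessImplementation(msg string) string {
	switch {
	case strings.Contains(msg, "Can't find decompressor for"):
		return "grpc-java"
	case strings.Contains(msg, "Decompressor is not installed for grpc-encoding"):
		return "grpc-go"
	default:
		return "unknown"
	}
}

func main() {
	msgs := []string{
		"grpc: Can't find decompressor for snappy [code 12/Unimplemented]",
		`grpc: Decompressor is not installed for grpc-encoding "snappy"`,
		"unable to dial n2: breaker open",
	}
	for _, m := range msgs {
		fmt.Printf("%s: %s\n", guessImplementation(m), m)
	}
}
```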
from cockroach.
@rafiss Do you know the specifics of this test to tell if there can be any Java servers involved?
from cockroach.
There aren't any Java servers in the test; the "special" thing this test does is that it uses the cockroach-go/testserver library to run CRDB, rather than an in-mem test cluster. This testserver library runs a real CRDB binary in a different process. Actually, for this test it's a 3 node cluster, so there are 3 CRDB processes that are started, but nothing there uses Java.
The theory about talking to the wrong server could check out though. Each node in the test needs to identify the ports that are used by the other nodes. Maybe something is wrong with how those ports are discovered. The port discovery code is here.
Looking at this test failure, it doesn't appear to mis-select a port.
From n1, I see that it talks to n2 (on port 34107) normally:
I240605 19:41:54.876683 1403 2@rpc/peer.go:527 ⋮ [T1,Vsystem,n1,rnode=2,raddr=‹localhost:34107›,class=system,rpc] 124 ‹connection is now healthy›
But n1 has this error talking to n3 (on port 34121) every time it tries to use the gRPC connection.
E240605 19:41:55.024044 1892 2@rpc/peer.go:601 ⋮ [T1,Vsystem,n1,rnode=3,raddr=‹localhost:34121›,class=default,rpc] 165 failed connection attempt‹ (last connected 0s ago)›: grpc: ‹Can't find decompressor for snappy› [code 12/Unimplemented]
The same pattern is visible on n2 logs -- n2 is able to connect to n1 (on port 46835), but cannot reach n3 (on port 34121).
The n3 logs show that it is in fact listening on port 34121:
I240605 19:41:54.914524 93 1@server/server.go:1992 ⋮ [T1,Vsystem,n3] 83 starting grpc/postgres server at ‹127.0.0.1:34121›
I240605 19:41:54.914542 93 1@server/server.go:1993 ⋮ [T1,Vsystem,n3] 84 advertising CockroachDB node at ‹localhost:34121›
Later in n3 logs we see that the connection to n1 is flapping (the pattern below is repeated 100s of times throughout the n3 logs):
E240605 19:41:55.922810 416 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=?,raddr=‹localhost:46835›,class=system,rpc] 106 disconnected (was healthy for 1.002s): grpc: ‹initial connection heartbeat failed: grpc: Can't find decompressor for snappy [code 12/Unimplemented]› [code 2/Unknown]
I240605 19:41:55.926426 416 2@rpc/peer.go:527 ⋮ [T1,Vsystem,n3,rnode=?,raddr=‹localhost:46835›,class=system,rpc] 107 ‹connection is now healthy (after 0s)›
E240605 19:41:55.944389 709 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹localhost:46835›,class=system,rpc] 108 disconnected (was healthy for 1.001s): grpc: ‹initial connection heartbeat failed: grpc: Can't find decompressor for snappy [code 12/Unimplemented]› [code 2/Unknown]
I240605 19:41:55.947610 709 2@rpc/peer.go:527 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹localhost:46835›,class=system,rpc] 109 ‹connection is now healthy (after 0s)›
E240605 19:41:55.963308 421 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹localhost:46835›,class=default,rpc] 110 disconnected (was healthy for 1.001s): grpc: ‹initial connection heartbeat failed: grpc: Can't find decompressor for snappy [code 12/Unimplemented]› [code 2/Unknown]
An interesting thing about these failures is that the first one was 5 days ago on the release-24.1.0-rc branch. But that branch has had no code changes since 22 days ago. Since that first failure, we've seen tests fail this way on the master branch too. If this was caused by a recent change, then that change probably was not in the CRDB repo. It could be in infrastructure.
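To compare nodes the way the walkthrough above does by hand, one option is to tally the snappy failures per remote address in a node's log. The following is a minimal sketch under stated assumptions: the helper name `countSnappyFailures` is hypothetical, the sample lines are shortened versions of the entries quoted above, and the pattern assumes the `raddr=‹host:port›` tag format seen in those entries.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// raddrRE extracts the remote address from the raddr=‹...› log tag.
var raddrRE = regexp.MustCompile(`raddr=‹([^›]+)›`)

// countSnappyFailures is a hypothetical helper that counts, per remote
// address, the log lines mentioning the snappy decompressor error.
func countSnappyFailures(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "Can't find decompressor for snappy") {
			continue
		}
		if m := raddrRE.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Shortened sample lines modeled on the n1 log entries quoted above.
	logs := `I240605 [n1,rnode=2,raddr=‹localhost:34107›,class=system,rpc] connection is now healthy
E240605 [n1,rnode=3,raddr=‹localhost:34121›,class=default,rpc] failed connection attempt: grpc: Can't find decompressor for snappy
E240605 [n1,rnode=3,raddr=‹localhost:34121›,class=system,rpc] failed connection attempt: grpc: Can't find decompressor for snappy`
	for addr, n := range countSnappyFailures(logs) {
		fmt.Printf("%s: %d failures\n", addr, n)
	}
}
```

Running this over each node's log would show at a glance which peer addresses are associated with the error and which connect cleanly.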
from cockroach.
Posting the logs from each node:
from cockroach.
This issue has not occurred in 10 days, and we don't have a reliable repro. I'm closing this as an unsolved mystery.
from cockroach.
I spoke too soon... it happened again: #126032
from cockroach.