
Comments (14)

rafiss commented on July 18, 2024

It looks like there were connection issues in the cluster right after it started. I haven't seen this "Can't find decompressor for snappy" gRPC error before; I'm checking whether the KV team knows anything about it.

I240602 05:16:32.549270 383 kv/kvserver/replica_command.go:1754 ⋮ [T1,n1,replicate,s1,r11/1:‹/Table/{7-8}›] 208  could not successfully add and upreplicate LEARNER replica(s) on [n2,s2], rolling back: error sending couldn't accept ‹range_id:11 coordinator_replica:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:2 store_id:2 replica_id:5 type:LEARNER > delegated_sender:<node_id:1 store_id:1 replica_id:1 type:VOTER_FULL > priority:RECOVERY type:INITIAL term:6 first_index:32 sender_queue_name:REPLICATE_QUEUE sender_queue_priority:10001 descriptor_generation:7 queue_on_delegate_len:-1 snap_id:396c82fb-f663-4c05-bbaf-4a56bdee8181 ›: unable to dial n2: ‹breaker open›


E240602 05:16:31.290980 1123 2@rpc/context.go:2412 ⋮ [T1,n1,rnode=2,raddr=‹localhost:42203›,class=default,rpc] 166  unable to connect (is the peer up and reachable?): initial connection heartbeat failed: grpc: ‹Can't find decompressor for snappy› [code 12/Unimplemented]

E240602 05:16:32.550892 2965 2@rpc/context.go:2412 ⋮ [T1,n1,rnode=2,raddr=‹localhost:42203›,class=default,rpc] 358  unable to connect (is the peer up and reachable?): initial connection heartbeat failed: grpc: ‹Can't find decompressor for snappy› [code 12/Unimplemented]

E240602 05:16:37.043408 383 kv/kvserver/queue.go:1142 ⋮ [T1,n1,replicate,s1,r34/1:‹/Table/3{2-3}›] 546  failed to replicate after 5 retries

from cockroach.

pav-kv commented on July 18, 2024

Looks odd. We register the snappy de-/compressor in init of the rpc package:

cockroach/pkg/rpc/snappy.go

Lines 224 to 226 in 4c06ddd

func init() {
	encoding.RegisterCompressor(snappyCompressor{})
}

and opt into it when making a connection:

cockroach/pkg/rpc/context.go

Lines 1672 to 1686 in 4c06ddd

// Request request compression. Note that it's the client that
// decides to opt into compressions; the server accepts either
// compressed or decompressed payloads, and the specific codec used
// is named in the request (so different clients can use different
// compression algorithms.)
//
// On a related note, this configuration uses our own snappy codec.
// We believe it works better than the gzip codec provided natively
// by grpc, although the specific reason is now lost to history. It
// would be possible to change/simplify this, and since it's for
// each client to decide changing this will not require much
// cross-version compatibility dance.
if rpcCtx.rpcCompression {
	dialOpts = append(dialOpts, grpc.WithDefaultCallOptions(grpc.UseCompressor((snappyCompressor{}).Name())))
}

All nodes should be able to find it. Unless we're talking to a server that somehow does not use the rpc package, and hasn't registered snappy.

pav-kv commented on July 18, 2024

@rafiss I don't know how to run this test, could you hint? Can you reliably repro it?

pav-kv commented on July 18, 2024

Interestingly, the troubled node localhost:42203 doesn't just fail incoming connections. It also can't initiate an RPC handshake with the other 2 nodes, because its own registry doesn't have snappy:

$ grep "closing connection after" log.txt | grep -o "raddr=.*class" | sort | uniq
raddr=‹localhost:43077›,class
raddr=‹localhost:44245›,class

Sounds like some infra flake / memory corruption on this node.

rafiss commented on July 18, 2024

The test can be run with either of these commands:

❯ ./dev testlogic base --config=cockroach-go-testserver-23.1 --files=mixed_version_udf

❯ ./dev test pkg/sql/logictest/tests/cockroach-go-testserver-23.1 -f=TestLogic_mixed_version_udf_execute_privileges

No, it doesn't repro reliably, so if you are fine with closing it out, that's fine for me too.

pav-kv commented on July 18, 2024

Reopening since we have other occurrences of this in #125133 and #125151 on a different branch.

pav-kv commented on July 18, 2024

@rafiss WAIDW?

$ dev testlogic base --config=cockroach-go-testserver-23.1 --files=mixed_version_udf
WARNING: no tests found
$ dev test pkg/sql/logictest/tests/cockroach-go-testserver-23.1 -f=TestLogic_mixed_version_udf_execute_privileges
ERROR: could not query for tests within pkg/sql/logictest/tests/cockroach-go-testserver-23.1:all: got error exit status 7

UPD: got it, needed to change 23.1 to 23.2 on master.

pav-kv commented on July 18, 2024

No luck catching this by just stressing the test.

pav-kv commented on July 18, 2024

The weird thing is that I can't find the "Can't find decompressor for" substring or its variations in either the CRDB or the grpc-go codebase. The closest thing in the Go gRPC codebase is the error "grpc: Decompressor is not installed for grpc-encoding %q", which can be returned in a bunch of places when the compressor is not registered.

I wonder if our request is proxied through some non-CRDB/Go server, or gets to one by mistake (e.g. the CI machine has some other service on this port). Occurrences of the "Can't find decompressor for" error string can only be found in the Java gRPC repo. So are we talking to some Java server?

This still smells like an infra flake.
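The wording check above can be turned into a small log triage helper. This is a rough sketch that relies only on the two error strings quoted in this comment and on where they were reportedly found; `guessImpl` is a hypothetical name, and a match is a hint about the peer, not proof of which implementation answered.

```go
package main

import (
	"fmt"
	"strings"
)

// guessImpl maps an error message to the gRPC implementation whose codebase
// reportedly contains that wording, per the search described in this thread.
func guessImpl(msg string) string {
	switch {
	case strings.Contains(msg, "Can't find decompressor for"):
		return "grpc-java"
	case strings.Contains(msg, "Decompressor is not installed for grpc-encoding"):
		return "grpc-go"
	default:
		return "unknown"
	}
}

func main() {
	// Error text as it appears in the logs quoted in this thread.
	fmt.Println(guessImpl("grpc: Can't find decompressor for snappy [code 12/Unimplemented]"))
}
```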

pav-kv commented on July 18, 2024

@rafiss Do you know the specifics of this test to tell if there can be any Java servers involved?

rafiss commented on July 18, 2024

There aren't any Java servers in the test. The "special" thing this test does is that it uses the cockroach-go/testserver library to run CRDB, rather than an in-memory test cluster. This testserver library runs a real CRDB binary in a different process. For this test, it's a 3-node cluster, so 3 CRDB processes are started, but nothing there uses Java.

The theory about talking to the wrong server could check out, though. Each node in the test needs to identify the ports that are used by the other nodes. Maybe something is wrong with how those ports are discovered. The port discovery code is here:

https://github.com/cockroachdb/cockroach-go/blob/2c9d026f19fba1cf30c7a1880db4a668bc5d26e2/testserver/testservernode.go#L61-L81

Looking at this test failure, it doesn't appear to mis-select a port.

From n1, I see that it talks to n2 (on port 34107) normally:

I240605 19:41:54.876683 1403 2@rpc/peer.go:527 ⋮ [T1,Vsystem,n1,rnode=2,raddr=‹localhost:34107›,class=system,rpc] 124  ‹connection is now healthy›

But n1 has this error talking to n3 (on port 34121) every time it tries to use the gRPC connection.

E240605 19:41:55.024044 1892 2@rpc/peer.go:601 ⋮ [T1,Vsystem,n1,rnode=3,raddr=‹localhost:34121›,class=default,rpc] 165  failed connection attempt‹ (last connected 0s ago)›: grpc: ‹Can't find decompressor for snappy› [code 12/Unimplemented]

The same pattern is visible in the n2 logs: n2 is able to connect to n1 (on port 46835), but cannot reach n3 (on port 34121).

The n3 logs show that it is in fact listening on port 34121:

I240605 19:41:54.914524 93 1@server/server.go:1992 ⋮ [T1,Vsystem,n3] 83  starting grpc/postgres server at ‹127.0.0.1:34121›
I240605 19:41:54.914542 93 1@server/server.go:1993 ⋮ [T1,Vsystem,n3] 84  advertising CockroachDB node at ‹localhost:34121›

Later in n3 logs we see that the connection to n1 is flapping (the pattern below is repeated 100s of times throughout the n3 logs):

E240605 19:41:55.922810 416 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=?,raddr=‹localhost:46835›,class=system,rpc] 106  disconnected (was healthy for 1.002s): grpc: ‹initial connection heartbeat failed: grpc: Can't find decompressor for snappy [code 12/Unimplemented]› [code 2/Unknown]
I240605 19:41:55.926426 416 2@rpc/peer.go:527 ⋮ [T1,Vsystem,n3,rnode=?,raddr=‹localhost:46835›,class=system,rpc] 107  ‹connection is now healthy (after 0s)›
E240605 19:41:55.944389 709 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹localhost:46835›,class=system,rpc] 108  disconnected (was healthy for 1.001s): grpc: ‹initial connection heartbeat failed: grpc: Can't find decompressor for snappy [code 12/Unimplemented]› [code 2/Unknown]
I240605 19:41:55.947610 709 2@rpc/peer.go:527 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹localhost:46835›,class=system,rpc] 109  ‹connection is now healthy (after 0s)›
E240605 19:41:55.963308 421 2@rpc/peer.go:580 ⋮ [T1,Vsystem,n3,rnode=1,raddr=‹localhost:46835›,class=default,rpc] 110  disconnected (was healthy for 1.001s): grpc: ‹initial connection heartbeat failed: grpc: Can't find decompressor for snappy [code 12/Unimplemented]› [code 2/Unknown]

An interesting thing about these failures is that the first one was 5 days ago on the release-24.1.0-rc branch. But that branch has had no code changes since 22 days ago. Since that first failure, we've seen tests fail this way on the master branch too. If this was caused by a recent change, then that change probably was not in the CRDB repo. It could be in infrastructure.

rafiss commented on July 18, 2024

Posting the logs from each node:

rafiss commented on July 18, 2024

This issue has not occurred in 10 days, and we don't have a reliable repro. I'm closing this as an unsolved mystery.

rafiss commented on July 18, 2024

I spoke too soon... it happened again: #126032
