Giter VIP home page Giter VIP logo

Comments (3)

rharding6373 avatar rharding6373 commented on August 26, 2024

Here's the last log line from changefeedccl, followed by a range split on the table with the changefeed:

I240410 08:13:30.370337 14481640 ccl/changefeedccl/changefeed_dist.go:399 ⋮ [T10,Vcluster 10,nsql1,client=127.0.0.1:53676,hostssl,user=‹sinklessfeeduser›] 348      using bulk oracle for DistSQL planning
I240410 08:13:31.681238 14739761 kv/kvserver/replica_command.go:474 ⋮ [T1,Vsystem,n1,split,s1,r75/1:‹/Tenant/1{0/Table/…-1}›] 349  initiating a split of this range at key /Tenant/10/Table/106 [r76] (span config)

This is followed by some suspicious log messages about waiting to acquire read/write latches and slow replica RPCs that hint at an overloaded node:

W240410 08:13:46.254287 14500272 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n1,tenant=10,s1,r75/1:‹/Tenant/1{0/Table/…-1}›] 352  have been waiting 15s to acquire read latch ‹/Tenant/10/Table/106/{1-2}›@1712736810.305663866,0 for request Scan [/Tenant/10/Table/106/1,/Tenant/10/Table/106/2), [txn: 0b98bc3f], [max_span_request_keys: 21], [target_bytes: 524288], held by write latch ‹/Tenant/10/Table/106/1/7687/0›@1712736809.446934585,0 for request ResolveIntent [/Tenant/10/Table/106/1/‹7687›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7688›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7689›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7690›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7691›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7692›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7693›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7694›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7695›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7696›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7697›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7698›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7699›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7700›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7701›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7702›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7703›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7704›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7705›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7706›/‹0›],... 75 skipped ..., ResolveIntent [/Tenant/10/Table/106/1/‹7782›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7783›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7784›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7785›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹7786›/‹0›], [max_span_request_keys: 0], [target_bytes: 4194304]
W240410 08:13:46.842363 14739761 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n1,s1,r75/1:‹/Tenant/1{0/Table/…-1}›] 353  have been waiting 15s to acquire write latch ‹/Local/Range/Tenant/10/Table/28/RangeDescriptor›@0,0 for request ConditionalPut [/Local/Range/Tenant/10/Table/28/‹RangeDescriptor›], [txn: 277aebbf], held by read latch ‹/Local/Range/Tenant/10/Table/28/RangeDescriptor›@0,0 for request Put [/Tenant/10/Table/39/2/‹"\x80"›/‹"\x01\x01\x80\xf5|%>Z?HC\x8axUq\xb1\xdf\xc2\xf4"›/‹0›], EndTxn(commit) [/Tenant/10/Table/39/2/‹"\x80"›/‹"\x01\x01\x80\xf5|%>Z?HC\x8axUq\xb1\xdf\xc2\xf4"›/‹0›], [txn: f5cabda2]
W240410 08:13:47.494836 14916786 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n1,tenant=10,s1,r75/1:‹/Tenant/1{0/Table/…-1}›] 354  have been waiting 15s to acquire write latch ‹/Tenant/10/Table/53/1/959009280289734657/"legacy_progress"/1915-09-23T15:46:26.840194999Z/0›@1712736812.159804618,0 for request ConditionalPut [/Tenant/10/Table/53/1/‹959009280289734657›/‹"legacy_progress"›/‹1915-09-23T15:46:26.840194999Z›/‹0›], [txn: f490c81d], held by write latch ‹/Tenant/10/Table/53/1/959009280289734657/"legacy_progress"{-/PrefixEnd}›@1712736812.159804618,0 for request DeleteRange [/Tenant/10/Table/53/1/‹959009280289734657›/‹"legacy_progress"›,/Tenant/10/Table/53/1/‹959009280289734657›/‹"legacy_progress"›/‹PrefixEnd›), [txn: f490c81d], [max_span_request_keys: 600], [target_bytes: 0]
W240410 08:13:52.090733 15823194 kv/kvserver/spanlatch/manager.go:610 ⋮ [T1,Vsystem,n1,tenant=10,s1,r75/1:‹/Tenant/1{0/Table/…-1}›] 355  have been waiting 15s to acquire read latch ‹/Tenant/10/Table/39/2/"\x80"/"\x01\x01\x80\xf5|%>Z?HC\x8axUq\xb1\xdf\xc2\xf4"/0›@1712736817.589248033,0 for request Get [/Tenant/10/Table/39/2/‹"\x80"›/‹"\x01\x01\x80\xf5|%>Z?HC\x8axUq\xb1\xdf\xc2\xf4"›/‹0›], [txn: 6bf55098], [can-forward-ts], held by write latch ‹/Tenant/10/Table/39/2/"\x80"/"\x01\x01\x80\xf5|%>Z?HC\x8axUq\xb1\xdf\xc2\xf4"/0›@1712736811.268706311,0 for request Put [/Tenant/10/Table/39/2/‹"\x80"›/‹"\x01\x01\x80\xf5|%>Z?HC\x8axUq\xb1\xdf\xc2\xf4"›/‹0›], EndTxn(commit) [/Tenant/10/Table/39/2/‹"\x80"›/‹"\x01\x01\x80\xf5|%>Z?HC\x8axUq\xb1\xdf\xc2\xf4"›/‹0›], [txn: f5cabda2]
I240410 08:13:52.841673 18855912 kv/kvserver/replica_consistency.go:215 ⋮ [T1,Vsystem,n1,consistencyChecker,s1,r4/1:‹/System{/tsd-tse}›] 356  triggering stats recomputation to resolve delta of {ContainsEstimates:21544 LastUpdateNanos:1712736832702299269 LockAge:0 GCBytesAge:0 LiveBytes:1463 LiveCount:-2713 KeyBytes:-156604 KeyCount:-2713 ValBytes:158067 ValCount:-2713 IntentBytes:0 IntentCount:0 LockBytes:0 LockCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0 RangeValBytes:0 SysBytes:0 SysCount:0 AbortSpanBytes:0}
I240410 08:14:22.851139 13590173 kv/kvserver/queue.go:1345 ⋮ [T1,Vsystem,n1,replicate] 357  purgatory is now empty
W240410 08:14:23.347330 13134734 kv/kvserver/spanlatch/manager.go:654 ⋮ [T1,Vsystem,n1,s1,r75/1:‹/Tenant/1{0/Table/…-1}›,raft] 358  ResolveIntent [/Tenant/10/Table/106/1/‹9287›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9288›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9289›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9290›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9291›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9292›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9293›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9294›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9295›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9296›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9297›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9298›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9299›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9300›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9301›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9302›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9303›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9304›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9305›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9306›/‹0›],... 75 skipped ..., ResolveIntent [/Tenant/10/Table/106/1/‹9382›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9383›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9384›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9385›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9386›/‹0›], [max_span_request_keys: 0], [target_bytes: 4194304] has held latch for 52092021362 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.
W240410 08:14:23.349499 14513298 kv/kvclient/kvcoord/dist_sender.go:2756 ⋮ [T1,Vsystem,n1,s1] 359  slow replica RPC: have been waiting 52.88s (0 attempts) for RPC ResolveIntent [/Tenant/10/Table/106/1/‹9687›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9688›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9689›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9690›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9691›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9692›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9693›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9694›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9695›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9696›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9697›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9698›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9699›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9700›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9701›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9702›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9703›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9704›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9705›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9706›/‹0›],... 75 skipped ..., ResolveIntent [/Tenant/10/Table/106/1/‹9782›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9783›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9784›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9785›/‹0›], ResolveIntent [/Tenant/10/Table/106/1/‹9786›/‹0›], [max_span_request_keys: 0], [target_bytes: 4194304] to replica (n1,s1):1; resp: ‹(err: <nil>), *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, ... 75 skipped ..., *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse, *kvpb.ResolveIntentResponse›
...
I240410 08:14:23.476187 14739761 kv/kvserver/split_queue.go:274 ⋮ [T1,Vsystem,n1,split,s1,r75/1:‹/Tenant/10/Table/{28-106}›] 393  range split took 51.795883344s    , exceeding threshold of 2s

Timeout for the test is 30s (5m for race, which this test was not), so I'm not sure why it errored with "unexpected EOF" instead of a timeout, which seems to indicate the changefeed closed before any messages were read.

from cockroach.

rharding6373 avatar rharding6373 commented on August 26, 2024

I also see a lot of messages in the kv distribution log like:

I240410 08:13:25.473800 13589326 13@kv/kvserver/replicate_queue.go:767 ⋮ [T1,Vsystem,n1,replicate,s1,r1/1:‹/{Min-System/NodeL…}›] 1  error processing replica: 0     of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster

Here are the logs for convenience: changefeedccltest.log,
changefeedccltest-kv-distribution.log

There doesn't look like there's any issue in CDC here. It seems like something happened during the test (a node died?) that caused KV to take a long time. Asking for KV to take a quick before closing as a flake or lengthening the timeout our test uses for this operation.

from cockroach.

arulajmani avatar arulajmani commented on August 26, 2024

Looks like overload. From a quick read, this is a single server test, so latches being held for >= 15 seconds is extremely surprising. We also see a few log lines littered throughout:

I240410 08:13:32.085325 13134734 kv/kvserver/store_raft.go:699 ⋮ [T1,Vsystem,n1,s1,r75/1:‹/Tenant/1{0/Table/…-1}›,raft] 350  raft ready handling: 0.81s [append=0.00s, apply=0.81s, , other=0.00s], wrote [append-batch=49 B, apply=1.2 KiB (1)] pebble stats: [commit-wait 80ns sem 520ns]; node might be overloaded

from cockroach.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.