Comments (19)
Also, the 0.1-second sleep in scylla-ccm/ccmlib/scylla_node.py (line 1378 in 65872f6) floods the log with messages.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/dtest-gw0.log
The following messages are repeated every 0.1 seconds for 360 seconds:
08:17:22,692 788 urllib3.connectionpool DEBUG connectionpool.py :228 | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
08:17:22,694 788 urllib3.connectionpool DEBUG connectionpool.py :456 | test_rebuild_many_keyspaces: http://127.0.10.1:10000 "GET /gossiper/endpoint/live HTTP/1.1" 200 14
08:17:22,696 788 urllib3.connectionpool DEBUG connectionpool.py :228 | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
08:17:22,698 788 urllib3.connectionpool DEBUG connectionpool.py :456 | test_rebuild_many_keyspaces: http://127.0.10.1:10000 "GET /storage_service/nodes/joining HTTP/1.1" 200 2
08:17:22,799 788 urllib3.connectionpool DEBUG connectionpool.py :228 | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
$ wc dtest-gw0.log
13920 193064 2487237 dtest-gw0.log
from scylla-ccm.
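A minimal sketch of one way to suppress that per-request debug spam while polling (assumes the REST polling goes through requests/urllib3; `quiet_urllib3` is a hypothetical helper, not part of scylla-ccm):

```python
import logging
from contextlib import contextmanager

@contextmanager
def quiet_urllib3(level=logging.INFO):
    """Temporarily raise urllib3's log level so a 0.1s REST poll loop
    doesn't emit two DEBUG lines per iteration."""
    logger = logging.getLogger("urllib3.connectionpool")
    old_level = logger.level
    logger.setLevel(level)
    try:
        yield
    finally:
        logger.setLevel(old_level)

# Usage around the polling loop (illustrative):
# with quiet_urllib3():
#     while not all_endpoints_live():
#         time.sleep(0.1)
```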
INFO 2023-07-23 08:17:28,186 [shard 0] storage_service - Set host_id=ddb99668-493f-4171-ab7f-a1208ba4186f to be owned by node=127.0.10.1
INFO 2023-07-23 08:17:28,187 [shard 0] gossip - InetAddress 127.0.10.1 is now UP, status = NORMAL
INFO 2023-07-23 08:17:28,196 [shard 0] gossip - Live nodes seen in gossip: {127.0.10.1, 127.0.10.2}
INFO 2023-07-23 08:17:28,196 [shard 0] storage_service - Started waiting for normal state handlers to finish
INFO 2023-07-23 08:17:28,196 [shard 0] storage_service - Normal state handlers not yet finished for nodes (127.0.10.1, status=NORMAL)
INFO 2023-07-23 08:17:28,228 [shard 0] migration_manager - Requesting schema pull from 127.0.10.1:0
INFO 2023-07-23 08:17:28,228 [shard 0] migration_manager - Pulling schema from 127.0.10.1:0
INFO 2023-07-23 08:17:28,283 [shard 0] migration_manager - Requesting schema pull from 127.0.10.1:0
INFO 2023-07-23 08:17:28,296 [shard 0] storage_service - Finished waiting for normal state handlers; endpoints observed in gossip: (127.0.10.1, status=NORMAL), (127.0.10.2, status=UNKNOWN)
INFO 2023-07-23 08:17:28,296 [shard 0] storage_service - Waiting for nodes {127.0.10.2, 127.0.10.1} to be alive
INFO 2023-07-23 08:17:28,297 [shard 0] storage_service - Nodes {127.0.10.2, 127.0.10.1} are alive
from scylla-ccm.
Logging I'll take care of, I'll disable this for those calls
Should the default timeout be bigger for debug mode? 10 minutes, as it used to be?
from scylla-ccm.
If @nyh thinks we still need it, then 10 minutes should be enough, at least for small clusters.
But we need to see how long it takes on large clusters, where bootstrap repair needs to communicate with more nodes.
from scylla-ccm.
If @nyh thinks we still need it, then 10 minutes should be enough, at least for small clusters. But we need to see how long it takes on large clusters, where bootstrap repair needs to communicate with more nodes.
Maybe there shouldn't be any timeout at all, whatsoever? If the waiting code is not buggy, it will eventually stop waiting when the node becomes available. If the waiting code is buggy, the caller (dtest, Jenkins, etc.) will eventually stop on an overall timeout, so we're still safe.
Alternatively, maybe the timeout should be made configurable. It can default to something (e.g., 10 minutes), but if the user really wants to wait 60 minutes for a node on a huge cluster to come up, they can configure the timeout to 60 minutes.
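The configurable-with-a-default idea could look something like this (a sketch only; `resolve_wait_timeout` and the `CCM_WAIT_ALIVE_TIMEOUT` environment variable are hypothetical names, not existing scylla-ccm knobs):

```python
import os

DEFAULT_WAIT_ALIVE_TIMEOUT = 600  # seconds; the 10-minute default discussed above

def resolve_wait_timeout(explicit_timeout=None):
    """Pick the watch_rest_for_alive timeout: an explicit argument wins,
    then an environment override, then the built-in default."""
    if explicit_timeout is not None:
        return explicit_timeout
    env = os.environ.get("CCM_WAIT_ALIVE_TIMEOUT")
    if env is not None:
        return int(env)
    return DEFAULT_WAIT_ALIVE_TIMEOUT
```

A user waiting on a huge cluster could then export the variable to get 60 minutes without touching the code.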
from scylla-ccm.
I think we can also make the timeout be a linear function of the number of nodes in the cluster (at least for the first node in the loop, the rest can follow shortly after).
But I'm not sure it's worth it.
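For what it's worth, the linear scaling would be a one-liner (a sketch; the function name and the per-node constant are made up for illustration):

```python
def scaled_wait_timeout(n_nodes, base=600, per_node=30):
    """Grow the timeout linearly with cluster size, since bootstrap
    repair has to talk to every existing node."""
    return base + per_node * n_nodes
```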
from scylla-ccm.
Sigh, apparently 120 seconds in non-debug modes isn't enough and there's flakiness, see
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/314/testReport/update_cluster_layout_tests/TestLargeScaleCluster/Run_Dtest_Parallel_Cloud_Machines___LongDtest___long_split000___test_add_many_nodes_under_load/
ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds
from scylla-ccm.
And not only in the many-nodes test:
ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds
from scylla-ccm.
And it turns out that 600 seconds isn't enough either in debug mode if there are enough keyspaces and/or tables.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/rebuild_test/TestRebuild/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split090___test_rebuild_many_keyspaces/
and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/rebuild_test/TestRebuild/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split086___test_rebuild_many_tables/
We can extend the timeout indefinitely while there is bootstrap progress, like we do for the CQL listen message; maybe this is the way to go for test robustness.
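The extend-while-progressing idea can be sketched as an idle timeout that is pushed out whenever new progress is observed (hypothetical helper; `check_done` and `read_progress` are stand-ins for polling the REST API and scanning the log, respectively):

```python
import time

def wait_with_progress(check_done, read_progress, idle_timeout=600, poll=1.0):
    """Time out only if no new progress has been seen for idle_timeout
    seconds, instead of using one fixed overall deadline.
    check_done() returns True when the node is up; read_progress()
    returns an opaque marker (e.g. the offset of the last matching
    bootstrap log line)."""
    deadline = time.time() + idle_timeout
    last_progress = read_progress()
    while not check_done():
        now = time.time()
        progress = read_progress()
        if progress != last_progress:
            last_progress = progress
            deadline = now + idle_timeout  # progress seen: push the deadline out
        if now > deadline:
            raise TimeoutError(f"no bootstrap progress for {idle_timeout}s")
        time.sleep(poll)
```

As long as bootstrap keeps making progress the wait never expires, while a genuinely stuck node still fails within one idle interval.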
from scylla-ccm.
and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/cdc_test/TestCdc/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split084___test_cluster_expansion_with_cdc_Single_cluster_/ and similar variants
from scylla-ccm.
@bhalevy I'm not sure what the suggestion is here exactly, this function is waiting for bootstrap to end already...
from scylla-ccm.
Yes, it's polling the REST API, and the event it is waiting for doesn't happen until bootstrap is over.
I suggested watching the log for progress, as we do in wait_for_starting.
But come to think of it, I think we could just change the order of operations as follows, to run wait_for_starting before self.watch_rest_for_alive(others):
wait_timeout = timeout * 4 if timeout is not None else 420 if self.cluster.scylla_mode != 'debug' else 900
if wait_other_notice:
    for node, _ in marks:
        node.watch_rest_for_alive(self, timeout=wait_timeout)
if wait_for_binary_proto:
    from_mark = self.mark
    try:
        self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=wait_timeout)
    except TimeoutError as e:
        self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)
        self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=0)
elif wait_other_notice:
    self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)
if wait_other_notice:
    for node, _ in marks:
        self.watch_rest_for_alive(node, timeout=wait_timeout)
from scylla-ccm.
Why not do all the wait_other_notice logic after wait_for_binary_proto, as it is in scylla_cluster.py?
Why split it into two parts like that?
from scylla-ccm.
It's possible, although the other nodes notice this node as up before it starts listening for CQL, but that shouldn't matter.
However, if only wait_other_notice is set, we'd still need wait_for_starting to prevent flakiness.
from scylla-ccm.
It's possible, although the other nodes notice this node as up before it starts listening for CQL, but that shouldn't matter. However, if only wait_other_notice is set, we'd still need wait_for_starting to prevent flakiness.
Even though I don't know of any test that would do wait_other_notice without wait_for_binary_protocol ...
from scylla-ccm.
It should be OK to imply wait_for_binary_protocol when wait_other_notice is set.
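The implication amounts to one flag normalization at the top of start() (a sketch; the helper name is hypothetical):

```python
def normalize_start_flags(wait_for_binary_proto, wait_other_notice):
    """Waiting for other nodes to notice this one only makes sense once
    it has fully started, so wait_other_notice implies the
    binary-protocol wait as well."""
    if wait_other_notice:
        wait_for_binary_proto = True
    return wait_for_binary_proto, wait_other_notice
```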
from scylla-ccm.