self.watch_rest_for_alive(other_node) fails en masse during bootstrap in debug mode (scylla-ccm, open issue)

bhalevy commented on July 28, 2024

from scylla-ccm.

Comments (19)

bhalevy commented on July 28, 2024

Also, the 0.1-second sleep in scylla-ccm/ccmlib/scylla_node.py (line 1378 in 65872f6):

        time.sleep(0.1)

floods the log with messages.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/dtest-gw0.log
The following messages are repeated every 0.1 seconds for 360 seconds:

08:17:22,692 788     urllib3.connectionpool         DEBUG    connectionpool.py   :228  | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
08:17:22,694 788     urllib3.connectionpool         DEBUG    connectionpool.py   :456  | test_rebuild_many_keyspaces: http://127.0.10.1:10000 "GET /gossiper/endpoint/live HTTP/1.1" 200 14
08:17:22,696 788     urllib3.connectionpool         DEBUG    connectionpool.py   :228  | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
08:17:22,698 788     urllib3.connectionpool         DEBUG    connectionpool.py   :456  | test_rebuild_many_keyspaces: http://127.0.10.1:10000 "GET /storage_service/nodes/joining HTTP/1.1" 200 2
08:17:22,799 788     urllib3.connectionpool         DEBUG    connectionpool.py   :228  | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000

$ wc dtest-gw0.log
  13920  193064 2487237 dtest-gw0.log

bhalevy commented on July 28, 2024

All that, while on https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/1690100633894_rebuild_test.py%3A%3ATestRebuild%3A%3Atest_rebuild_many_keyspaces/node2.log:

INFO  2023-07-23 08:17:28,186 [shard 0] storage_service - Set host_id=ddb99668-493f-4171-ab7f-a1208ba4186f to be owned by node=127.0.10.1
INFO  2023-07-23 08:17:28,187 [shard 0] gossip - InetAddress 127.0.10.1 is now UP, status = NORMAL
INFO  2023-07-23 08:17:28,196 [shard 0] gossip - Live nodes seen in gossip: {127.0.10.1, 127.0.10.2}
INFO  2023-07-23 08:17:28,196 [shard 0] storage_service - Started waiting for normal state handlers to finish
INFO  2023-07-23 08:17:28,196 [shard 0] storage_service - Normal state handlers not yet finished for nodes (127.0.10.1, status=NORMAL)
INFO  2023-07-23 08:17:28,228 [shard 0] migration_manager - Requesting schema pull from 127.0.10.1:0
INFO  2023-07-23 08:17:28,228 [shard 0] migration_manager - Pulling schema from 127.0.10.1:0
INFO  2023-07-23 08:17:28,283 [shard 0] migration_manager - Requesting schema pull from 127.0.10.1:0
INFO  2023-07-23 08:17:28,296 [shard 0] storage_service - Finished waiting for normal state handlers; endpoints observed in gossip: (127.0.10.1, status=NORMAL), (127.0.10.2, status=UNKNOWN)
INFO  2023-07-23 08:17:28,296 [shard 0] storage_service - Waiting for nodes {127.0.10.2, 127.0.10.1} to be alive
INFO  2023-07-23 08:17:28,297 [shard 0] storage_service - Nodes {127.0.10.2, 127.0.10.1} are alive

fruch commented on July 28, 2024

I'll take care of the logging; I'll disable it for those calls.

Should the default timeout be bigger for debug mode? 10 minutes, as it used to be?
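
One way to silence the per-poll DEBUG lines only around the polling calls is a small context manager that temporarily raises the level of the `urllib3.connectionpool` logger (the logger name seen in the log excerpt above). This is a minimal sketch, not the change that went into scylla-ccm; the helper name `quiet_logger` is hypothetical.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logger(name="urllib3.connectionpool", level=logging.INFO):
    """Temporarily raise a logger's level and restore it on exit.

    `quiet_logger` is a hypothetical helper, not part of ccmlib.
    """
    logger = logging.getLogger(name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Restore the original level even if the wrapped code raises.
        logger.setLevel(previous)
```

The polling loop would then run inside `with quiet_logger(): ...`, so DEBUG output from other HTTP calls in the test stays unaffected.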

bhalevy commented on July 28, 2024

If @nyh thinks we still need it, then 10 minutes should be enough, at least for a small cluster.
But we need to see how long it takes on large clusters, where bootstrap repair needs to communicate with more nodes.

nyh commented on July 28, 2024

> If @nyh thinks we still need it, then 10 minutes should be enough, at least for a small cluster. But we need to see how long it takes on large clusters, where bootstrap repair needs to communicate with more nodes.

Maybe there shouldn't be any timeout at all, whatsoever? If the waiting code is not buggy, it will eventually stop waiting when the node becomes available. If the waiting code is buggy, the caller (dtest, Jenkins, etc.) will eventually stop on an overall timeout, so we're still safe.

Alternatively, maybe the timeout should be made configurable. It can default to something (e.g., 10 minutes), but if the user really wants to wait 60 minutes for a node on a huge cluster to come up, they can configure the timeout to 60 minutes.
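
Both options can be covered by one parameter: a configurable timeout where `None` means "no timeout at all". The following is a minimal sketch of that idea; the function name `wait_until` and its defaults are assumptions for illustration, and it raises Python's built-in `TimeoutError` rather than `ccmlib.node.TimeoutError`.

```python
import time


def wait_until(predicate, timeout=600.0, interval=1.0):
    """Poll predicate() until it returns a truthy value.

    timeout is in seconds; timeout=None waits forever and relies on the
    caller (dtest, Jenkins, etc.) to enforce an overall limit.
    `wait_until` is a hypothetical helper, not a ccmlib function.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not predicate():
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {timeout} seconds")
        time.sleep(interval)
```

For example, `wait_until(node_is_alive, timeout=3600)` would wait up to 60 minutes, where `node_is_alive` stands for whatever check the caller supplies.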

bhalevy commented on July 28, 2024

I think we can also make the timeout be a linear function of the number of nodes in the cluster (at least for the first node in the loop, the rest can follow shortly after).
But I'm not sure it's worth it.
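
For illustration, a linear timeout could look like the sketch below. The function name and the per-node increment are made up; only the 120-second base comes from the non-debug timeout mentioned in this thread.

```python
def scaled_timeout(num_nodes, base=120.0, per_node=30.0):
    """Return a timeout in seconds that grows linearly with cluster size.

    Hypothetical helper: `per_node` is an arbitrary illustrative value,
    not a measured one.
    """
    # A single-node cluster gets just the base timeout.
    return base + per_node * max(num_nodes - 1, 0)
```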

bhalevy commented on July 28, 2024

Sigh, apparently 120 seconds in non-debug modes isn't enough and there's flakiness, see
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/314/testReport/update_cluster_layout_tests/TestLargeScaleCluster/Run_Dtest_Parallel_Cloud_Machines___LongDtest___long_split000___test_add_many_nodes_under_load/

ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds

bhalevy commented on July 28, 2024

Not only in the many-nodes test:

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/316/testReport/update_cluster_layout_tests/TestUpdateClusterLayout/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split013___test_remove_garbage_members_from_group0_after_abort_decommission_Announcing_that_I_have_left_the_ring__/

ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds

bhalevy commented on July 28, 2024

And it turns out that 600 seconds isn't enough either in debug mode if there are enough keyspaces and/or tables.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/rebuild_test/TestRebuild/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split090___test_rebuild_many_keyspaces/
and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/rebuild_test/TestRebuild/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split086___test_rebuild_many_tables/

We can extend the timeout indefinitely while there is bootstrap progress, like we do for the CQL listen message. Maybe this is the way to go for test robustness.
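
A progress-aware wait of this kind can be sketched as follows: the deadline is pushed forward whenever a progress marker changes, so the wait fails only after a period with no progress at all. The names here are hypothetical and this is not how ccmlib implements its waits; `progress_marker` could be, for instance, a callable returning the current size of the node's log file.

```python
import time


def wait_with_progress(is_done, progress_marker, idle_timeout=120.0, interval=1.0):
    """Wait for is_done(); give up only after idle_timeout seconds without progress.

    progress_marker() returns any comparable value that changes whenever
    the node makes progress. Hypothetical helper, not part of ccmlib.
    """
    last = progress_marker()
    deadline = time.monotonic() + idle_timeout
    while not is_done():
        current = progress_marker()
        if current != last:
            # Progress was observed: remember it and extend the deadline.
            last = current
            deadline = time.monotonic() + idle_timeout
        elif time.monotonic() >= deadline:
            raise TimeoutError(f"no progress for {idle_timeout} seconds")
        time.sleep(interval)
```

With this shape the total wait is unbounded as long as the marker keeps changing, so an overall limit still has to come from the caller.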

bhalevy commented on July 28, 2024

Also https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/lwt_random_test/TestRandomPaxos/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split093___test_topology_grow/

bhalevy commented on July 28, 2024

and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/cdc_test/TestCdc/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split084___test_cluster_expansion_with_cdc_Single_cluster_/ and similar variants

bhalevy commented on July 28, 2024

and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split075___test_add_node_during_index_build_2/

bhalevy commented on July 28, 2024

and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/alternator_ttl_tests/TestAlternatorTTL/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split022___test_ttl_with_load_and_decommission/

fruch commented on July 28, 2024

@bhalevy I'm not sure what the suggestion is here exactly, this function is waiting for bootstrap to end already...

bhalevy commented on July 28, 2024

Yes, it's polling the REST API, and the event it is waiting for doesn't happen until bootstrap is over.
I suggested watching the log for progress, as we do in wait_for_starting.

But come to think of it, we could just change the order of operations as follows, running
wait_for_starting before self.watch_rest_for_alive(others):

        # 4x the caller's timeout if given; otherwise 420 seconds (900 in debug mode).
        wait_timeout = timeout * 4 if timeout is not None else (420 if self.cluster.scylla_mode != 'debug' else 900)

        # First, let the other nodes notice that this node is alive.
        if wait_other_notice:
            for node, _ in marks:
                node.watch_rest_for_alive(self, timeout=wait_timeout)

        if wait_for_binary_proto:
            from_mark = self.mark
            try:
                self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=wait_timeout)
            except TimeoutError:
                # CQL did not come up in time: wait for startup to finish,
                # then check the binary interface once more without waiting.
                self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)
                self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=0)
        elif wait_other_notice:
            self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)

        # Only now, after startup, wait for this node to see the others as alive.
        if wait_other_notice:
            for node, _ in marks:
                self.watch_rest_for_alive(node, timeout=wait_timeout)

fruch commented on July 28, 2024

Why not do all the wait_other_notice logic after wait_for_binary_proto, as it is in scylla_cluster.py?

Why split it into two parts like that?

bhalevy commented on July 28, 2024

It's possible, although the other nodes would notice this node as up before it starts listening for CQL, but that shouldn't matter.

However, if only wait_other_notice is set, we'd still need wait_for_starting to prevent flakiness.

fruch commented on July 28, 2024

> It's possible, although the other nodes would notice this node as up before it starts listening for CQL, but that shouldn't matter.
>
> However, if only wait_other_notice is set, we'd still need wait_for_starting to prevent flakiness.

Even though I don't know of any test that would use wait_other_notice without wait_for_binary_proto...

bhalevy commented on July 28, 2024

It should be OK to imply wait_for_binary_proto when wait_other_notice is set.
