Comments (19)

bhalevy commented on July 28, 2024

Also, the 0.1-second sleep in

time.sleep(0.1)

floods the log with messages.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/dtest-gw0.log
The following messages are repeated every 0.1 seconds for 360 seconds:

08:17:22,692 788     urllib3.connectionpool         DEBUG    connectionpool.py   :228  | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
08:17:22,694 788     urllib3.connectionpool         DEBUG    connectionpool.py   :456  | test_rebuild_many_keyspaces: http://127.0.10.1:10000 "GET /gossiper/endpoint/live HTTP/1.1" 200 14
08:17:22,696 788     urllib3.connectionpool         DEBUG    connectionpool.py   :228  | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
08:17:22,698 788     urllib3.connectionpool         DEBUG    connectionpool.py   :456  | test_rebuild_many_keyspaces: http://127.0.10.1:10000 "GET /storage_service/nodes/joining HTTP/1.1" 200 2
08:17:22,799 788     urllib3.connectionpool         DEBUG    connectionpool.py   :228  | test_rebuild_many_keyspaces: Starting new HTTP connection (1): 127.0.10.1:10000
$ wc dtest-gw0.log
  13920  193064 2487237 dtest-gw0.log

from scylla-ccm.

bhalevy commented on July 28, 2024

All that, while https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-debug/245/artifact/logs-full.debug.094/1690100633894_rebuild_test.py%3A%3ATestRebuild%3A%3Atest_rebuild_many_keyspaces/node2.log shows:

INFO  2023-07-23 08:17:28,186 [shard 0] storage_service - Set host_id=ddb99668-493f-4171-ab7f-a1208ba4186f to be owned by node=127.0.10.1
INFO  2023-07-23 08:17:28,187 [shard 0] gossip - InetAddress 127.0.10.1 is now UP, status = NORMAL
INFO  2023-07-23 08:17:28,196 [shard 0] gossip - Live nodes seen in gossip: {127.0.10.1, 127.0.10.2}
INFO  2023-07-23 08:17:28,196 [shard 0] storage_service - Started waiting for normal state handlers to finish
INFO  2023-07-23 08:17:28,196 [shard 0] storage_service - Normal state handlers not yet finished for nodes (127.0.10.1, status=NORMAL)
INFO  2023-07-23 08:17:28,228 [shard 0] migration_manager - Requesting schema pull from 127.0.10.1:0
INFO  2023-07-23 08:17:28,228 [shard 0] migration_manager - Pulling schema from 127.0.10.1:0
INFO  2023-07-23 08:17:28,283 [shard 0] migration_manager - Requesting schema pull from 127.0.10.1:0
INFO  2023-07-23 08:17:28,296 [shard 0] storage_service - Finished waiting for normal state handlers; endpoints observed in gossip: (127.0.10.1, status=NORMAL), (127.0.10.2, status=UNKNOWN)
INFO  2023-07-23 08:17:28,296 [shard 0] storage_service - Waiting for nodes {127.0.10.2, 127.0.10.1} to be alive
INFO  2023-07-23 08:17:28,297 [shard 0] storage_service - Nodes {127.0.10.2, 127.0.10.1} are alive

fruch commented on July 28, 2024

I'll take care of the logging; I'll disable it for those calls.

Should the default timeout be bigger in debug mode? 10 minutes, as it used to be?
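
A minimal sketch of one way to silence the polling noise, assuming it all comes from the `urllib3.connectionpool` logger (as the DEBUG lines in dtest-gw0.log suggest). `quiet_logger` is a hypothetical helper name, not an existing ccmlib function:

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logger(name="urllib3.connectionpool", level=logging.WARNING):
    """Temporarily raise a logger's level so its DEBUG records are dropped.

    Hypothetical helper: the idea is to wrap only the REST polling calls,
    so other urllib3 users keep their normal log level.
    """
    logger = logging.getLogger(name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Restore whatever level was configured before the block.
        logger.setLevel(previous)
```

The polling loop would then issue its requests inside `with quiet_logger():`, leaving the rest of the test log untouched.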

bhalevy commented on July 28, 2024

If @nyh thinks we still need it, then 10 minutes should be enough, at least for a small cluster.
But we need to see how long it takes on large clusters, where bootstrap repair needs to communicate with more nodes.

nyh commented on July 28, 2024

If @nyh thinks we still need it, then 10 minutes should be enough, at least for a small cluster. But we need to see how long it takes on large clusters, where bootstrap repair needs to communicate with more nodes.

Maybe there shouldn't be any timeout at all, whatsoever? If the waiting code is not buggy, it will eventually stop waiting when the node becomes available. If the waiting code is buggy, the caller (dtest, Jenkins, etc.) will eventually stop on an overall timeout, so we're still safe.

Alternatively, maybe the timeout should be made configurable. It can default to something (e.g., 10 minutes), but if the user really wants to wait 60 minutes for a node on a huge cluster to come up, they can configure the timeout to 60 minutes.
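
A rough sketch of what a configurable timeout could look like, assuming an explicit argument wins over an environment variable, which wins over the 10-minute default. The variable name `CCM_ALIVE_TIMEOUT` and the function `resolve_timeout` are made up for illustration; neither exists in ccm as far as this thread shows:

```python
import os

DEFAULT_TIMEOUT = 600  # seconds; the 10-minute default floated above


def resolve_timeout(explicit=None, env=os.environ):
    """Pick the wait timeout in seconds.

    Precedence: explicit argument, then the (hypothetical)
    CCM_ALIVE_TIMEOUT environment variable, then the default.
    """
    if explicit is not None:
        return float(explicit)
    raw = env.get("CCM_ALIVE_TIMEOUT")  # hypothetical variable name
    if raw:
        return float(raw)
    return float(DEFAULT_TIMEOUT)
```

With this shape, a user who wants to wait 60 minutes on a huge cluster would export `CCM_ALIVE_TIMEOUT=3600` without touching the tests.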

bhalevy commented on July 28, 2024

I think we can also make the timeout be a linear function of the number of nodes in the cluster (at least for the first node in the loop, the rest can follow shortly after).
But I'm not sure it's worth it
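
For illustration only, the linear scaling could be as small as the following. The function name and both constants are placeholders, not measured values:

```python
def scaled_timeout(num_nodes, base=120.0, per_node=30.0):
    """Timeout in seconds that grows linearly with cluster size.

    `base` and `per_node` are illustrative placeholders; real values
    would have to come from measuring bootstrap times.
    """
    return base + per_node * max(num_nodes, 0)
```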

bhalevy commented on July 28, 2024

Sigh, apparently 120 seconds in non-debug modes isn't enough and there's flakiness, see
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/314/testReport/update_cluster_layout_tests/TestLargeScaleCluster/Run_Dtest_Parallel_Cloud_Machines___LongDtest___long_split000___test_add_many_nodes_under_load/

ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds

bhalevy commented on July 28, 2024

And not only in the many-nodes test.

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/316/testReport/update_cluster_layout_tests/TestUpdateClusterLayout/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split013___test_remove_garbage_members_from_group0_after_abort_decommission_Announcing_that_I_have_left_the_ring__/

ccmlib.node.TimeoutError: watch_rest_for_alive() timeout after 120 seconds

bhalevy commented on July 28, 2024

And it turns out that 600 seconds isn't enough either in debug mode if there are enough keyspaces and/or tables.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/rebuild_test/TestRebuild/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split090___test_rebuild_many_keyspaces/
and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/rebuild_test/TestRebuild/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split086___test_rebuild_many_tables/

We could extend the timeout indefinitely as long as there is bootstrap progress, like we do for the CQL listen message. Maybe this is the way to go for test robustness.
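
A minimal sketch of the "extend while there is progress" idea, assuming progress can be observed as some value that changes over time (for example, the size of the node's log). `wait_with_progress`, `is_done` and `progress_marker` are hypothetical names; this is not how ccmlib currently implements any of its waits:

```python
import time


def wait_with_progress(is_done, progress_marker, idle_timeout,
                       poll_interval=0.1,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True.

    The deadline is pushed forward every time progress_marker() returns a
    new value, so the wait only fails after idle_timeout seconds without
    any observable progress. clock and sleep are injectable for testing.
    """
    last_marker = progress_marker()
    deadline = clock() + idle_timeout
    while not is_done():
        marker = progress_marker()
        if marker != last_marker:
            # Something changed since the last poll: reset the idle deadline.
            last_marker = marker
            deadline = clock() + idle_timeout
        if clock() >= deadline:
            raise TimeoutError(f"no progress for {idle_timeout} seconds")
        sleep(poll_interval)
```

The trade-off is that a node which keeps logging but never finishes would wait forever, so this relies on an overall job-level timeout as a backstop.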

bhalevy commented on July 28, 2024

Also https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/lwt_random_test/TestRandomPaxos/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split093___test_topology_grow/

bhalevy commented on July 28, 2024

and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/cdc_test/TestCdc/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split084___test_cluster_expansion_with_cdc_Single_cluster_/ and similar variants

bhalevy commented on July 28, 2024

and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/secondary_indexes_test/TestSecondaryIndexes/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split075___test_add_node_during_index_build_2/

bhalevy commented on July 28, 2024

and https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/246/testReport/alternator_ttl_tests/TestAlternatorTTL/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split022___test_ttl_with_load_and_decommission/

fruch commented on July 28, 2024

@bhalevy I'm not sure what exactly the suggestion is here; this function is already waiting for bootstrap to end...

bhalevy commented on July 28, 2024

Yes, it's polling the REST API, and the event it is waiting for doesn't happen until bootstrap is over.
I suggested watching the log for progress, as we do in wait_for_starting.

But come to think of it, we could just change the order of operations as follows, so that
wait_for_starting runs before self.watch_rest_for_alive(others):

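        # Explicit timeout wins; otherwise the default depends on the build mode.
        if timeout is not None:
            wait_timeout = timeout * 4
        elif self.cluster.scylla_mode != 'debug':
            wait_timeout = 420
        else:
            wait_timeout = 900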

        if wait_other_notice:
            for node, _ in marks:
                node.watch_rest_for_alive(self, timeout=wait_timeout)

        if wait_for_binary_proto:
            from_mark = self.mark
            try:
                self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=wait_timeout)
            except TimeoutError as e:
                self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)
                self.wait_for_binary_interface(from_mark=from_mark, process=self._process_scylla, timeout=0)
        elif wait_other_notice:
            self.wait_for_starting(from_mark=self.mark, timeout=wait_timeout)

        if wait_other_notice:
            for node, _ in marks:
                self.watch_rest_for_alive(node, timeout=wait_timeout)

fruch commented on July 28, 2024

Why not do all the wait_other_notice logic after wait_for_binary_proto, as is done in scylla_cluster.py?

Why split it into two parts like that?

bhalevy commented on July 28, 2024

It's possible, although the other nodes notice this node as up before it starts listening for CQL, but that shouldn't matter.

However, if only wait_other_notice is set, we'd still need wait_for_starting to prevent flakiness.

fruch commented on July 28, 2024

It's possible, although the other nodes notice this node as up before it starts listening for CQL, but that shouldn't matter.

However, if only wait_other_notice is set, we'd still need wait_for_starting to prevent flakiness.

Even so, I don't know of any test that would use wait_other_notice without wait_for_binary_proto...

bhalevy commented on July 28, 2024

It should be OK for wait_other_notice to imply wait_for_binary_proto.
