
scylla-ccm's Introduction

CCM (Cassandra Cluster Manager)

A script/library to create, launch and remove an Apache Cassandra cluster on localhost.

The goal of ccm and ccmlib is to make it easy to create, manage and destroy a small Cassandra cluster on a local box. It is meant for testing a Cassandra cluster.

Pointer to the Scylla CCM instructions (should really be merged here)

https://github.com/scylladb/scylla/wiki/Using-CCM

Scylla usage examples:

Creating a 3-node Scylla cluster:

$ ccm create my_cluster --scylla --vnodes -n 3 -v release:2022.2
$ ccm start
# Now wait...
$ ccm status
Cluster: 'my_cluster'
-----------------
node1: UP
node2: UP
node3: UP

The nodes will be available at 127.0.0.1, 127.0.0.2 and 127.0.0.3.

Creating a multi-datacenter cluster that has 3 nodes in dc1, 4 nodes in dc2 and 5 nodes in dc3:

$ ccm create my_multi_dc_cluster --scylla --vnodes -n 3:4:5 -v release:2022.2
$ ccm start
# Wait a lot...
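The `-n 3:4:5` syntax splits the node count across datacenters. A minimal sketch of how such a topology spec maps to per-DC counts (a hypothetical helper, not ccm's actual parser):

```python
def parse_topology(spec: str) -> dict:
    """Parse a ccm-style node-count spec like '3' or '3:4:5'
    into a mapping of datacenter name -> node count."""
    counts = [int(part) for part in spec.split(":")]
    return {f"dc{i + 1}": count for i, count in enumerate(counts)}

print(parse_topology("3:4:5"))  # {'dc1': 3, 'dc2': 4, 'dc3': 5}
print(parse_topology("3"))      # {'dc1': 3}
```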

Creating a cluster of nodes with a specific build-id:

Let's say you want to create a cluster of Scylla with build id f6e718548e76ccf3564ed2387b6582ba8d37793c (it's Scylla 2023.1.0~rc8-20230731.b6f7c5a6910c).

  1. Go to https://backtrace.scylladb.com and find your desired Scylla version
  2. Click the arrow down symbol (\/) to show all available download links
  3. Download the unified Scylla package (unified-pack-url-x86_64)
  4. Create a 3 node cluster:
# The unified package will be extracted to ~/.ccm/scylla-repository/my_custom_scylla
# Make sure that the version name (my_custom_scylla) is different for each unified package you use, otherwise ccm will use the previously extracted version.
ccm create my_cluster -n 3 --scylla --vnodes \
    --version my_custom_scylla \
    --scylla-unified-package-uri=/home/$USER/Downloads/scylla-enterprise-unified-2023.1.0\~rc8-0.20230731.b6f7c5a6910c.x86_64.tar.gz

Requirements

  • A working python installation (tested to work with python 3.11).

  • PyYAML (http://pyyaml.org/ -- `pip install pyyaml`)

  • ant (http://ant.apache.org/, on Mac OS X, brew install ant)

  • psutil (https://pypi.python.org/pypi/psutil)

  • Java (which version depends on the version of Cassandra you plan to use. If unsure, use Java 7 as it is known to work with current versions of Cassandra).

  • ccm only works on localhost for now. If you want to create multiple node clusters, the simplest way is to use multiple loopback aliases. On modern linux distributions you probably don't need to do anything, but on Mac OS X, you will need to create the aliases with

    sudo ifconfig lo0 alias 127.0.0.2 up
    sudo ifconfig lo0 alias 127.0.0.3 up
    ...
    

    Note that the usage section assumes that at least 127.0.0.1, 127.0.0.2 and 127.0.0.3 are available.
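You can verify that the loopback aliases are in place before creating a cluster. A small sketch that checks whether an address is bindable (this is an illustration, not ccm's own `check_socket_available`):

```python
import socket

def loopback_alias_available(addr: str, port: int = 0) -> bool:
    """Return True if the given loopback address can be bound
    (port 0 asks the OS for any free port)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((addr, port))
        return True
    except OSError:
        return False

for addr in ("127.0.0.1", "127.0.0.2", "127.0.0.3"):
    print(addr, loopback_alias_available(addr))
```

On Linux this should print True for all three; on Mac OS X it prints False for any alias you have not created yet.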

Known issues

Windows only:

  • node start pops up a window, stealing focus.
  • cli and cqlsh started from ccm show incorrect prompts on command-prompt
  • non nodetool-based command-line options fail (sstablesplit, scrub, etc)
  • cli_session does not accept commands.
  • To install psutil, you must use the .msi from pypi. pip install psutil will not work
  • You will need ant.bat in your PATH in order to build C* from source
  • You must run with an Unrestricted Powershell Execution-Policy if using Cassandra 2.1.0+
  • Ant installed via chocolatey will not be found by ccm, so you must create a symbolic link in order to fix the issue (as administrator):
    • cmd /c mklink C:\ProgramData\chocolatey\bin\ant.bat C:\ProgramData\chocolatey\bin\ant.exe

Installation

ccm uses python setuptools (with distutils fallback) so from the source directory run:

sudo ./setup.py install

ccm is available on the Python Package Index:

pip install ccm

There is also a Homebrew package available:

brew install ccm

You can also use ccm through Nix.

  • Spawn a temporary shell with ccm present, without installing it: `nix shell github:scylladb/scylla-ccm`
  • Install ccm: `nix profile install github:scylladb/scylla-ccm`
  • To remove or update ccm installed this way, first locate its index with `nix profile list`. Remove it with `nix profile remove <index>`, or update it with `nix profile upgrade <index>` (use `nix profile upgrade '.*'` to upgrade all Nix packages).

Nix

This project features experimental Nix flake. It allows ccm to be used as a dependency in other nix projects or to quickly launch a dev shell with all dependencies required to run and test the project.

How to setup Nix shell

  1. Install Nix: https://nixos.org/download.html - on Fedora you should probably use "Single-user installation", as there are some problems with multi-user due to SELinux.
  2. Activate required experimental features:
mkdir -p ~/.config/nix
echo "experimental-features = nix-command flakes" >> ~/.config/nix/nix.conf

If you installed Nix in multi-user mode, you will need to restart the Nix daemon.

  3. First option: use direnv. Install direnv (see https://direnv.net/docs/installation.html), cd into the project directory and execute `direnv allow .`. The dev env will then be activated automatically whenever you are in the project's directory, and unloaded when you leave it.
  4. Second option: use the `nix develop` command directly. This command launches a bash session with the dev env loaded. If you want to use your favourite shell instead, pass the `--command <shell>` flag to `nix develop` (for example: `nix develop --command zsh`).

If you want to install ccm using Nix, or launch a temporary shell with ccm - see "Installation" section.

Usage

Let's say you wanted to fire up a 3 node Cassandra cluster.

Short version

ccm create test -v 2.0.5 -n 3 -s

You will of course want to replace 2.0.5 with whichever version of Cassandra you want to test.

Longer version

ccm works from a Cassandra source tree (not the jars). There are two ways to tell ccm how to find the sources:

  1. If you have downloaded and compiled Cassandra sources, you can ask ccm to use those by initiating a new cluster with:

    ccm create test --install-dir=<path/to/cassandra-sources>

    or, from that source tree directory, simply

     ccm create test
    
  2. You can ask ccm to use a released version of Cassandra. For instance to use Cassandra 2.0.5, run

     ccm create test -v 2.0.5
    

    ccm will download the binary (from http://archive.apache.org/dist/cassandra), and set the new cluster to use it. This means that this command can take a few minutes the first time you create a cluster for a given version. ccm saves the compiled source in ~/.ccm/repository/, so creating a cluster for that version will be much faster the second time you run it (note however that if you create a lot of clusters with different versions, this will take up disk space).

Once the cluster is created, you can populate it with 3 nodes with:

ccm populate -n 3

Note: If you’re running on Mac OSX, create a new interface for every node besides the first, for example if you populated your cluster with 3 nodes, create interfaces for 127.0.0.2 and 127.0.0.3 like so:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Otherwise you will get the following error message:

(...) Inet address 127.0.0.1:9042 is not available: [Errno 48] Address already in use

After that execute:

ccm start

That will start 3 nodes on IP 127.0.0.[1, 2, 3] on port 9042 for native transport, port 7000 for the internal cluster communication and ports 7100, 7200 and 7300 for JMX. You can check that the cluster is correctly set up with

ccm node1 ring
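The default address/port layout described above can be summed up in a small sketch (a hypothetical helper for illustration, not part of ccmlib):

```python
def node_endpoints(n: int) -> dict:
    """Default ccm-style layout for node number n (1-based):
    one loopback alias per node, shared native-transport and
    storage ports, and a per-node JMX port."""
    return {
        "address": f"127.0.0.{n}",
        "native_transport": 9042,  # CQL clients
        "storage": 7000,           # inter-node cluster communication
        "jmx": 7000 + n * 100,     # 7100, 7200, 7300, ...
    }

for n in (1, 2, 3):
    print(node_endpoints(n))
```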

You can then bootstrap a 4th node with

ccm add node4 -i 127.0.0.4 -j 7400 -b

(populate is just a shortcut for adding multiple nodes initially)

ccm provides a number of conveniences, like flushing all of the nodes of the cluster:

ccm flush

or only one node:

ccm node2 flush

You can also easily look at the log file of a given node with:

ccm node1 showlog

Finally, you can get rid of the whole cluster (which will stop the nodes and remove all the data) with

ccm remove

The list of other provided commands is available through

ccm

Each command is then documented through the -h (or --help) flag. For instance ccm add -h describes the options for ccm add.

Source Distribution

If you'd like to use a source distribution instead of the default binary each time (for example, for Continuous Integration), you can prefix the cassandra version with source:, for example:

ccm create test -v source:2.0.5 -n 3 -s

Automatic Version Fallback

If 'binary:' or 'source:' is not explicitly specified in your version string, ccm will fall back to building the requested version from git if it cannot access the Apache mirrors.

Git and GitHub

To use the latest version from the canonical Apache Git repository, use the version name git:branch-name, e.g.:

ccm create trunk -v git:trunk -n 5

and to download a branch from a GitHub fork of Cassandra, you can prefix the repository and branch with github:, e.g.:

ccm create patched -v github:jbellis/trunk -n 1

Remote debugging

If you would like to connect to your Cassandra nodes with a remote debugger you have to pass the -d (or --debug) flag to the populate command:

ccm populate -d -n 3

That will populate 3 nodes on IP 127.0.0.[1, 2, 3] setting up the remote debugging on ports 2100, 2200 and 2300. The main thread will not be suspended so you don't have to connect with a remote debugger to start a node.

Alternatively you can also specify a remote port with the -r (or --remote-debug-port) flag while adding a node

ccm add node4 -r 5005 -i 127.0.0.4 -j 7400 -b

Where things are stored

By default, ccm stores all the node data and configuration files under ~/.ccm/cluster_name/. This can be overridden using the --config-dir option with each command.

DataStax Enterprise

CCM 2.0 supports creating and interacting with DSE clusters. The --dse option must be used with the ccm create command. See the ccm create -h help for assistance.

CCM Lib

The ccm facilities are available programmatically through ccmlib. This can be used to implement automated tests against Cassandra. A simple example of how to use ccmlib follows:

import ccmlib

CLUSTER_PATH="."
cluster = ccmlib.Cluster(CLUSTER_PATH, 'test', cassandra_version='2.0.5')
cluster.populate(3).start()
[node1, node2, node3] = cluster.nodelist()

# do some tests on the cluster/nodes. To connect to a node through the native
# protocol, the host and port of a node are available through
#   node.network_interfaces['binary']

cluster.flush()
node2.compact()

# do some other tests

# after the test, you can leave the cluster running, you can stop all nodes
# using cluster.stop() but keep the data around (in CLUSTER_PATH/test), or
# you can remove everything with cluster.remove()

-- Sylvain Lebresne [email protected]

scylla-ccm's People

Contributors

aboudreault, aholmberg, amnonh, asias, avikivity, bhalevy, denesb, driftx, enigmacurry, fruch, joaquincasares, juliayakovlev, kishkaru, lmr, lorak-mmk, mambocab, mike-tr-adamson, nutbunnies, orenef11, pauloricardomg, pcmanus, ptnapoleon, roydahan, shlomibalalis, slivne, spodkowinski, tchaikov, thobbs, vladzcloudius, yukim


scylla-ccm's Issues

FR: Allow overriding 'scylla-jmx' link source in node creation

scylla-jmx requires JDK 8 to run. scylla_node links the 'scylla-jmx' java symlink to a hardcoded '/usr/bin/java'.
On a dev machine running dtest, one has to keep switching 'alternatives' between the 'real' java stack (CDC drivers etc.) and this.

It would be neat if this could be at least somewhat controlled via the environment. Honoring a set JAVA_HOME springs to mind.
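A minimal sketch of the kind of environment-driven resolution the issue asks for. Both the lookup order and the `SCYLLA_JMX_JAVA_HOME` variable name are assumptions for illustration, not an existing ccm feature:

```python
import os
import shutil

def pick_jmx_java(env=None) -> str:
    """Resolve the java binary for scylla-jmx: prefer a caller-provided
    JAVA_HOME-style variable, fall back to whatever is on PATH, and only
    then to the hardcoded /usr/bin/java."""
    env = env if env is not None else os.environ
    home = env.get("SCYLLA_JMX_JAVA_HOME") or env.get("JAVA_HOME")
    if home:
        return os.path.join(home, "bin", "java")
    return shutil.which("java") or "/usr/bin/java"

print(pick_jmx_java({"JAVA_HOME": "/opt/jdk8"}))  # /opt/jdk8/bin/java
```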

adjust max-networking-io-control-blocks=100 by default

scylladb/scylladb@2cfc517 sets max_networking_aio_io_control_blocks to 50000 by default, causing us to run out of aio iocb slots and fail starting in setup_aio_context.

The signature for the error is:

ERROR 2021-07-20 15:54:51,154 [shard 7] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application
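A back-of-the-envelope sketch of why the 50000-per-shard default overflows on a multi-node local cluster, and why capping it at 100 helps (the 65536 figure is a common Linux default for fs.aio-max-nr, used here as an assumption):

```python
def aio_slots_needed(nodes: int, smp: int, blocks_per_shard: int) -> int:
    """Rough estimate: every shard of every local node reserves
    blocks_per_shard aio control blocks."""
    return nodes * smp * blocks_per_shard

TYPICAL_AIO_MAX_NR = 65536  # common fs.aio-max-nr default (assumption)

print(aio_slots_needed(3, 2, 50000))  # 300000 -- far over the limit
print(aio_slots_needed(3, 2, 100))    # 600 -- comfortably under it
```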

Cassandra 4.0 is unsupported

When trying to launch a Cassandra 4.0 cluster with scylla-ccm:

$ ccm create ccm_c4 -i 127.0.1. --id 0  -n 1 -v 4.0.0  --config-dir=/tmp/testc4
Current cluster is now: ccm_c4

$ ccm start --config-dir=/tmp/testc4
[node1 ERROR] Invalid yaml. Please remove properties [rpc_port] from your cassandra.yaml
Error starting node1.
Standard error output is:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ccmlib/cmds/cluster_cmds.py", line 648, in run
    if self.cluster.start(no_wait=self.options.no_wait,
  File "/usr/local/lib/python3.8/site-packages/ccmlib/cluster.py", line 448, in start
    raise NodeError("Error starting {0}.".format(node.name), p)
ccmlib.node.NodeError: Error starting node1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/ccm", line 75, in <module>
    cmd.run()
  File "/usr/local/lib/python3.8/site-packages/ccmlib/cmds/cluster_cmds.py", line 665, in run
    for line in e.process.stderr:
ValueError: I/O operation on closed file.

In the upstream repo, this problem is already fixed: https://github.com/riptano/ccm/blob/ce612ea71587bf263ed513cb8f8d5dfcf7c8dadb/ccmlib/node.py#L330

scylla_node do_stop does not stop scylla_jmx if the scylla node is not running

See #303 (comment)

do_stop checks first if self.is_running() which will return false if the node exited already, e.g. on error.
In this case we still need to stop/kill scylla_jmx, otherwise it stays up, holding jmx listen port.

From example:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/dtest.log

2021-02-16 11:13:14,540 30123   ccm                            DEBUG    | node2: Starting scylla: args=['/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-6rchw__r/test/node2/bin/scylla', '--options-file', '/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-6rchw__r/test/node2/conf/s
cylla.yaml', '--log-to-stdout', '1', '--api-address', '127.0.61.2', '--collectd-hostname', '76836c581099.node2', '--smp', '2', '--memory', '1024M', '--abort-on-seastar-bad-alloc', '--abort-on-lsa-bad-alloc=1', '--abort-on-internal-error', '1', '--developer-mode', 'true', '--default-log-level', 'info', '--collectd', 
'0', '--overprovisioned', '--prometheus-address', '127.0.61.2', '--unsafe-bypass-fsync', '1'] wait_other_notice=True wait_for_binary_proto=True
2021-02-16 11:13:16,060 24977   dtest                          DEBUG    | pushed_notifications_test.py:TestLocalhostPushedNotifications.restart_node_localhost_test - Test failed with errors: [(<pushed_notifications_test.TestLocalhostPushedNotifications testMethod=restart_node_localhost_test>, (<class 'ccmlib.node.Ti
meoutError'>, TimeoutError("16 Feb 2021 11:13:16 [node1] Missing: ['node is now in normal status|Starting listening for CQL clients']:\nScylla version 4.5.dev-0.20210215.495b7b559 with b.....\nSee system.log for remainder"), <traceback object at 0x7f116acbce60>))]

Failed due to https://github.com/scylladb/scylla-dtest/issues/1965
and with node1 failing to start (according to https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613473996068_pushed_notifications_test.TestLocalhostPushedNotifications.restart_node_localhost_test/node1.log)
it is not running so we don't stop scylla_jmx.

Next time Cluster ID 72 is used:

2021-02-16 13:47:50,886 2315    dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]

Starting scylla with --kernel-page-cache is not supported by older scylla versions

PR #317 caused a regression in rolling_upgrade_test.

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/780/artifact/logs-all.release.2/1617166227559_rolling_upgrade_test.RollingUpgradeTest.test_rolling_upgrade/node1.log:

Scylla version 4.3.0-0.20210110.000585522 with build-id 7fab6c35a5d403f3416292b37c9a9aa94fe4db1f starting ...
command used: "/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-5ns8k1n5/test/node1/bin/scylla --options-file /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-5ns8k1n5/test/node1/conf/scylla.yaml --log-to-stdout 1 --api-address 127.0.51.1 --collectd-hostname d8061d688155.node1 --smp 2 --memory 1024M --abort-on-seastar-bad-alloc --abort-on-lsa-bad-alloc 1 --abort-on-internal-error 1 --developer-mode true --default-log-level info --collectd 0 --overprovisioned --prometheus-address 127.0.51.1 --unsafe-bypass-fsync 1 --kernel-page-cache 1"
parsed command line options: [options-file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-5ns8k1n5/test/node1/conf/scylla.yaml, log-to-stdout: 1, api-address: 127.0.51.1, collectd-hostname: d8061d688155.node1, smp: 2, memory: 1024M, abort-on-seastar-bad-alloc, abort-on-lsa-bad-alloc: 1, abort-on-internal-error: 1, developer-mode: true, default-log-level: info, collectd: 0, overprovisioned, prometheus-address: 127.0.51.1, unsafe-bypass-fsync: 1, kernel-page-cache, (positional) 1]
error: unrecognised option '--kernel-page-cache'

The option was recently introduced in scylladb/scylladb@8785dd6
and is not available in any release up to and including 4.4.0.
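One way to avoid this class of regression is to gate new command-line options on the target Scylla version. A sketch under the assumption (from the issue text) that the flag is absent up to and including 4.4.0, so only 4.5+ gets it; this is illustrative, not ccm's actual logic:

```python
def scylla_args(version: tuple, kernel_page_cache: bool = True) -> list:
    """Build a scylla argument list, emitting --kernel-page-cache only
    for versions that understand it. The (4, 5) cutoff is an assumption
    based on the issue text."""
    args = ["--smp", "2"]
    if kernel_page_cache and version >= (4, 5):
        args += ["--kernel-page-cache", "1"]
    return args

print(scylla_args((4, 3)))  # no --kernel-page-cache for an old node
print(scylla_args((4, 5)))  # flag included for a new enough node
```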

"ccm create" fail with error: Command '['bash', '-c', '/home/artsiom/.ccm/scylla-repository/unstable/branch-5.0/latest/scylla-core-package/install.sh --prefix /home/artsiom/.ccm/scylla-repository/unstable/branch-5.0/latest --nonroot --supervisor']' returned non-zero exit status 1.

"ccm create command " fail with error: Command '['bash', '-c', '/home/artsiom/.ccm/scylla-repository/unstable/branch-5.0/latest/scylla-core-package/install.sh --prefix /home/artsiom/.ccm/scylla-repository/unstable/branch-5.0/latest --nonroot --supervisor']' returned non-zero exit status 1.

full output

ccm create --scylla -v unstable/branch-5.0:latest -n 2 tes4
S3 download: scylla-x86_64-unified-package-5.0.9.0.20230125.94b8baa79.tar.gz 100%|██████████| 453M/453M [00:20<00:00, 21.8MB/s]
Extracting /tmp/ccm-bv6m4p53.tar.gz (http://s3.amazonaws.com/downloads.scylladb.com/unstable/scylla/branch-5.0/relocatable/2023-01-25T02:52:16Z/scylla-x86_64-unified-package-5.0.9.0.20230125.94b8baa79.tar.gz, /tmp/tmpo8b9q434) as version unstable/branch-5.0/latest ...
Relocatable package format version 2 detected.
./install.sh: line 458: /home/artsiom/.config/systemd/user/scylla-server.service.d/nonroot.conf: No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/ccm", line 4, in <module>
    __import__('pkg_resources').run_script('ccm==2.0.5', 'ccm')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/EGG-INFO/scripts/ccm", line 75, in <module>
    cmd.run()
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/ccmlib/cmds/cluster_cmds.py", line 223, in run
    cluster = ScyllaCluster(self.path, self.name, install_dir=self.options.install_dir,
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/ccmlib/scylla_cluster.py", line 50, in __init__
    super(ScyllaCluster, self).__init__(path, name, partitioner,
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/ccmlib/cluster.py", line 71, in __init__
    dir, v = self.load_from_repository(version, verbose)
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/ccmlib/scylla_cluster.py", line 71, in load_from_repository
    install_dir, version = scylla_repository.setup(version, verbose)
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/ccmlib/scylla_repository.py", line 314, in setup
    run_scylla_unified_install_script(**args)
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/ccmlib/scylla_repository.py", line 524, in run_scylla_unified_install_script
    run('''{0}/install.sh --prefix {1} --nonroot{2}'''.format(
  File "/usr/local/lib/python3.8/dist-packages/ccm-2.0.5-py3.8.egg/ccmlib/scylla_repository.py", line 47, in run
    subprocess.check_call(['bash', '-c', cmd], cwd=cwd,
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bash', '-c', '/home/artsiom/.ccm/scylla-repository/unstable/branch-5.0/latest/scylla-core-package/install.sh --prefix /home/artsiom/.ccm/scylla-repository/unstable/branch-5.0/latest --nonroot --supervisor']' returned non-zero exit status 1.

workaround:
delete /home/artsiom/.config/systemd/user/scylla-server.service.d/nonroot.conf file

$ stat /home/artsiom/.config/systemd/user/scylla-server.service.d/nonroot.conf
  File: /home/artsiom/.config/systemd/user/scylla-server.service.d/nonroot.conf -> ../../../../.ccm/scylla-repository/unstable/master/202001250259/scylla/etc/systemd/system/scylla-server.service.d/nonroot.conf
  Size: 126       	Blocks: 8          IO Block: 4096   symbolic link
Device: 3fh/63d	Inode: 8782500     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: ( 1001/ artsiom)   Gid: ( 1001/ artsiom)
Access: 2023-01-09 12:32:03.668594981 +0100
Modify: 2023-01-09 12:32:03.668594981 +0100
Change: 2023-01-09 12:32:03.668594981 +0100
 Birth: -
$ cat /home/artsiom/.config/systemd/user/scylla-server.service.d/nonroot.conf
cat: /home/artsiom/.config/systemd/user/scylla-server.service.d/nonroot.conf: No such file or directory

cc @fruch

Can't stop two nodes concurrently with wait_other_notice=True.

The dtest "materialized_views_test.TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test" starts 4 nodes, and kills two of them with wait_other_notice=True.
The log contains something like this:

2019-01-17 14:42:10.215307 START: stop node node3
2019-01-17 14:42:10.215669 START: stop node node2
2019-01-17 14:42:17.325150 FINISH: stop node node2

The "stop node3" never finished,
and eventually produces

TimeoutError: 17 Jan 2019 12:52:18 [node2] Missing: ['127.0.0.3 is now [dead|DOWN]']:

What happens is that we stop node3, and then "wait for node2 to notice", which means grepping node2's log for message about node3 becoming dead. But since node2 has also died, it will obviously never notice.

The "waiting for notice" should stop waiting for a certain node's log if that node itself dies.

node::watch_for_log() is broken in the multi-patterns case

HEAD: af38330

Description
There is a bug in node.py (line 426) where an element is deleted from the list inside the loop that iterates over that same list.

This may lead to very unexpected results, as described here.

There also seems to be a missing break in the code after this line.

As a result, the function may match the same log line against more than one pattern, which produces very unexpected results: when you give X patterns, you expect the function to return only after X corresponding log lines have been seen.

Add ccm logging

During testing it would be helpful to have ccm debug logging in addition to dtest logging (which is saved in the dtest.log artifact).

@fruch:

CCM, as a library, should use logging with a specific logger namespace, ccm.*

We could then configure that logger in dtest to write to stdout as needed.
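The proposal above follows the standard library-logging convention: the library emits on a namespaced logger and configures no handlers, while the test harness decides where the records go. A minimal sketch (the `ccm.node` logger name is an assumption):

```python
import logging

# Library side: a namespaced logger, no handler configuration at all.
log = logging.getLogger("ccm.node")
log.debug("starting node %s", "node1")  # silent until someone configures it

# Harness (dtest) side: route everything under the 'ccm' namespace to stdout.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logging.getLogger("ccm").addHandler(handler)
logging.getLogger("ccm").setLevel(logging.DEBUG)

log.debug("now visible on stdout")
```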

Calling set_configuration_options twice on the same node causes test to fail

12:13:32 Traceback (most recent call last):
12:13:32   File "/data/jenkins/workspace/scylla-master-dtest/label/monster/mode/release/smp/2/urchin-dtest/bootstrap_test.py", line 48, in simple_bootstrap_test
12:13:32     node1.set_configuration_options(values={'initial_token': tokens[0]})
12:13:32   File "/data/jenkins/workspace/scylla-master-dtest/label/monster/mode/release/smp/2/urchin-ccm/ccmlib/node.py", line 259, in set_configuration_options
12:13:32     self.import_config_files()
12:13:32   File "/data/jenkins/workspace/scylla-master-dtest/label/monster/mode/release/smp/2/urchin-ccm/ccmlib/scylla_node.py", line 445, in import_config_files
12:13:32     self.__copy_logback_files()
12:13:32   File "/data/jenkins/workspace/scylla-master-dtest/label/monster/mode/release/smp/2/urchin-ccm/ccmlib/scylla_node.py", line 449, in __copy_logback_files
12:13:32     os.path.join(self.get_conf_dir(), 'logback-tools.xml'))
12:13:32   File "/data/jenkins/workspace/scylla-master-dtest/label/monster/mode/release/smp/2/urchin-ccm/ccmlib/scylla_node.py", line 464, in hard_link_or_copy
12:13:32     raise oserror
12:13:32 OSError: [Errno 17] File exists

Broken by 486cd61

add support for scylla relocatable package

This is a follow up on #150 which was resolved in a hackish way by creating ad-hoc wrapper script called scylla.sh that runs scylla using the dynamic loader.

As discussed with @avikivity, in addition to running a plain scylla binary (via CASSANDRA_DIR), we would like to add support to use a scylla relocatable package: install it, and run scylla using the installed dynamic libraries.

When using unified package and enabling `internode_encryption` nodes can't seem to communicate

for example dtest test_putget_with_internode_ssl:

generate_ssl_stores(self.test_path)
cluster.enable_internode_ssl(self.test_path, internode_encryption='all')
ERROR 2022-11-15 00:41:35,169 [shard 0] init - Startup failed: std::runtime_error (Failed to learn about other nodes' tokens during bootstrap. Make sure that:
 - the node can contact other nodes in the cluster,
 - the `ring_delay` parameter is large enough (the 30s default should be enough for small-to-middle-sized clusters),
 - a node with this IP didn't recently leave the cluster. If it did, wait for some time first (the IP is quarantined),
and retry the bootstrap.)

Logs:
https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/199/

dtest failed with Inet address 127.0.70.3:9042 is not available: [Errno 98] Address already in use

I have no clue what caused this, as there is no indication that the reported address was used by another dtest that might have run in parallel with the one that failed. Opening this issue to track the problem in case we keep hitting it.

scylla-ccm version @476c0d1928d8773748f9c7dabd532b49d5ed9ea1

Seen in dtest-release/75/testReport/materialized_views_test/TestMaterializedViews/mv_populating_from_existing_data_during_node_remove_test:

Stacktrace
  File "/usr/lib64/python2.7/unittest/case.py", line 367, in run
    testMethod()
  File "/jenkins/workspace/scylla-master/dtest-release@2/scylla-dtest/materialized_views_test.py", line 581, in mv_populating_from_existing_data_during_node_remove_test
    self._mv_populating_from_existing_data_during_changes_test('remove node', nodes=4, rf=3, mvs=10, prefill=40000, fail=False)
  File "/jenkins/workspace/scylla-master/dtest-release@2/scylla-dtest/materialized_views_test.py", line 596, in _mv_populating_from_existing_data_during_changes_test
    session = self.prepare(rf=rf, nodes=nodes, options={'prometheus_port': 0})
  File "/jenkins/workspace/scylla-master/dtest-release@2/scylla-dtest/materialized_views_test.py", line 89, in prepare
    cluster.start(jvm_args=jvm_args,wait_other_notice=True,wait_for_binary_proto=True)
  File "/jenkins/workspace/scylla-master/dtest-release@2/scylla-ccm/ccmlib/scylla_cluster.py", line 87, in start
    profile_options=profile_options)
  File "/jenkins/workspace/scylla-master/dtest-release@2/scylla-ccm/ccmlib/scylla_node.py", line 305, in start
    common.check_socket_available(itf)
  File "/jenkins/workspace/scylla-master/dtest-release@2/scylla-ccm/ccmlib/common.py", line 449, in check_socket_available
    raise UnavailableSocketError("Inet address %s:%s is not available: %s" % (addr, port, msg))

Inet address 127.0.70.3:9042 is not available: [Errno 98] Address already in use

node1.log

INFO  2019-04-01 01:27:42,925 [shard 0] rpc - client 127.0.70.2:7000: fail to connect: Connection refused
WARN  2019-04-01 01:27:42,925 [shard 0] gossip - Fail to send EchoMessage to 127.0.70.2: seastar::rpc::closed_error (connection is closed)
INFO  2019-04-01 01:27:42,925 [shard 0] gossip - InetAddress 127.0.70.3 is now UP, status = NORMAL

^^^^^^^^

INFO  2019-04-01 01:27:42,940 [shard 1] compaction - Compacting [/jenkins/workspace/scylla-master/dtest-release@2/scylla/.dtest/dtest-zBLB6V/test/node1/data/system/peers-37f71aca7dc2383ba70672528af04d4f/system-peers-ka-3-Data.db:level=0, /jenkins/workspace/scylla-master/dtest-release@2/scylla/.dtest/dtest-zBLB6V/test/node1/data/system/peers-37f71aca7dc2383ba70672528af04d4f/system-peers-ka-5-Data.db:level=0, ]
INFO  2019-04-01 01:27:42,942 [shard 0] rpc - client 127.0.70.2:7000: fail to connect: Connection refused
WARN  2019-04-01 01:27:42,942 [shard 0] gossip - Fail to send EchoMessage to 127.0.70.2: seastar::rpc::closed_error (connection is closed)
INFO  2019-04-01 01:27:42,949 [shard 1] compaction - Compacted 2 sstables to [/jenkins/workspace/scylla-master/dtest-release@2/scylla/.dtest/dtest-zBLB6V/test/node1/data/system/peers-37f71aca7dc2383ba70672528af04d4f/system-peers-ka-7-Data.db:level=0, ]. 15864 bytes to 11357 (~71% of original) in 9ms = 1.20MB/s. ~256 total partitions merged to 1.
INFO  2019-04-01 01:27:43,105 [shard 0] rpc - client 127.0.70.2:7000: fail to connect: Connection refused
INFO  2019-04-01 01:27:44,110 [shard 0] rpc - client 127.0.70.2:7000: fail to connect: Connection refused
INFO  2019-04-01 01:27:46,120 [shard 0] rpc - client 127.0.70.4:7000: fail to connect: Connection refused
INFO  2019-04-01 01:27:47,124 [shard 0] rpc - client 127.0.70.4:7000: fail to connect: Connection refused

YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated

See for example byo_build_tests_dtest/70/consoleFull:

18:31:43 /jenkins/workspace/scylla-master/manual-and-scheduled-tests/byo_build_tests_dtest/scylla-ccm/ccmlib/scylla_node.py:601: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
18:31:43   data = yaml.load(f)
18:31:43 /jenkins/workspace/scylla-master/manual-and-scheduled-tests/byo_build_tests_dtest/scylla-ccm/ccmlib/scylla_node.py:324: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

Please read https://msg.pyyaml.org/load for full details.
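The warning is fixed by passing an explicit loader. A minimal sketch of the change (using an inline document rather than the scylla.yaml file the code actually reads):

```python
import yaml

doc = "cluster_name: test\nnum_tokens: 256\n"

# Deprecated and unsafe: data = yaml.load(doc) with no Loader argument.
# Preferred fix: an explicit safe loader.
data = yaml.safe_load(doc)  # equivalent to yaml.load(doc, Loader=yaml.SafeLoader)
print(data)  # {'cluster_name': 'test', 'num_tokens': 256}
```

`safe_load` refuses to construct arbitrary Python objects, which is exactly what the warning is about; config files like cassandra.yaml/scylla.yaml never need the full loader.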

scylla_cluster: refactor start and stop to provide api to start_nodes and stop_nodes in parallel

Can be used by materialized views tests that start and stop multiple nodes.
Today this is done sequentially, waiting for each node to complete its start/stop before moving on to the next one, while we could initiate the start/stop process on all nodes in parallel, like we do in the cluster.{start,stop} methods, and then join the operation by monitoring the logs given the (wait_for_binary_proto, wait_other_notice) options.
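The parallel-initiate-then-join shape the issue describes can be sketched with a thread pool; `stop_one` is a caller-supplied callable standing in for a node's stop method (a hypothetical interface, not ccmlib's API):

```python
from concurrent.futures import ThreadPoolExecutor

def stop_nodes(nodes, stop_one):
    """Initiate stop on all nodes in parallel and join the results,
    instead of stopping each node to completion before the next."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(stop_one, nodes))

print(stop_nodes(["node1", "node2", "node3"], lambda n: f"{n} stopped"))
```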

core dumps of relocatable scylla cannot be debugged

Since b84d6dc and 468f559, when SCYLLA_DBUILD_SO_DIR is provided,
scylla_node.py runs scylla using the interpreter (ld-linux-x86-64.so.2).
This produces core dumps of the interpreter, not of scylla, which makes them useless for thread debugging. Instead we need to follow scylladb/scylla@698b72b5018868df6a839d08fd24c642db97ffcd: set the binary's interpreter using patchelf, and run it with the correct path to the loadable libs in LD_LIBRARY_PATH.
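A hedged sketch of the patchelf approach; the paths are illustrative, and the patchelf/LD_LIBRARY_PATH lines are shown as comments because they need a real relocatable Scylla tree:

```shell
INSTALL_DIR="$(mktemp -d)/scylla-install"   # illustrative layout
mkdir -p "$INSTALL_DIR/libreloc"

# 1. Rewrite the interpreter embedded in the binary's ELF header:
#    patchelf --set-interpreter "$INSTALL_DIR/libreloc/ld.so" "$INSTALL_DIR/scylla"
# 2. Run with the bundled shared objects on the library path, so core
#    dumps are produced for scylla itself, not for ld-linux-x86-64.so.2:
#    LD_LIBRARY_PATH="$INSTALL_DIR/libreloc" "$INSTALL_DIR/scylla" --version

ls -d "$INSTALL_DIR/libreloc"
```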

CCM fails to download relocatable packages - patchelf: std::bad_alloc

CCM branch: master (up to date)
Scylla branch: master
Packages attempted: various packages between 2020-06 and 2020-08

When trying to download a new scylla relocatable package (tried a few different packages) I receive the following exception after a few minutes with no error messages:

Relocatable package format version 2 detected.
Relocatable package format version 2 detected.
Relocatable package format version 2 detected.
patchelf: std::bad_alloc
E
======================================================================
ERROR: test_backup_task_progress (manager_backup_tests.TestScyllaMgmtBackup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/shlomib/projects/scylla-dtest/dtest.py", line 619, in setUp
    self.cluster = self._get_cluster(version=self.cassandra_version)
  File "/home/shlomib/projects/scylla-dtest/dtest.py", line 439, in _get_cluster
    cluster = ScyllaCluster(self.test_path, name, cassandra_version=scylla_version,
  File "/home/shlomib/projects/scylla-ccm/ccmlib/scylla_cluster.py", line 40, in __init__
    super(ScyllaCluster, self).__init__(path, name, partitioner,
  File "/home/shlomib/projects/scylla-ccm/ccmlib/cluster.py", line 62, in __init__
    dir, v = self.load_from_repository(version, verbose)
  File "/home/shlomib/projects/scylla-ccm/ccmlib/scylla_cluster.py", line 60, in load_from_repository
    install_dir, version = scylla_repository.setup(version, verbose)
  File "/home/shlomib/projects/scylla-ccm/ccmlib/scylla_repository.py", line 71, in setup
    run_scylla_install_script(os.path.join(
  File "/home/shlomib/projects/scylla-ccm/ccmlib/scylla_repository.py", line 243, in run_scylla_install_script
    run('''{0}/install.sh --root {1} --prefix {1} --prefix /opt/scylladb --nonroot'''.format(
  File "/home/shlomib/projects/scylla-ccm/ccmlib/scylla_repository.py", line 32, in run
    subprocess.check_call(['bash', '-c', cmd], cwd=cwd,
  File "/home/shlomib/.pyenv/versions/3.8.3/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bash', '-c', '/home/shlomib/.ccm/scylla-repository/unstable/master/2020-08-29T22_24_05Z/scylla-core-package/install.sh --root /home/shlomib/.ccm/scylla-repository/unstable/master/2020-08-29T22_24_05Z/scylla --prefix /home/shlomib/.ccm/scylla-repository/unstable/master/2020-08-29T22_24_05Z/scylla --prefix /opt/scylladb --nonroot']' returned non-zero exit status 1.

"ccm create" prints misleading error message when build/release/scylla is missing

$ ccm create scylla-2 --scylla --vnodes -n 2 --install-dir=/home/tgrabiec/src/scylla-ccm/../scylla2
Traceback (most recent call last):
  File "/home/tgrabiec/src/scylla-ccm/ccm", line 73, in <module>
    cmd.validate(parser, options, args)
  File "/home/tgrabiec/src/scylla-ccm/ccmlib/cmds/cluster_cmds.py", line 146, in validate
    common.validate_install_dir(options.install_dir)
  File "/home/tgrabiec/src/scylla-ccm/ccmlib/common.py", line 412, in validate_install_dir
    elif isDse(install_dir):
  File "/home/tgrabiec/src/scylla-ccm/ccmlib/common.py", line 359, in isDse
    raise ArgumentError('Installation directory does not contain a bin directory: %s' % install_dir)
ccmlib.common.ArgumentError: Installation directory does not contain a bin directory: /home/tgrabiec/src/scylla-ccm/../scylla2

dtests sometime fail with unable to connect to scylla-jmx

Most recent incidence:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/1657/testReport/bootstrap_test/TestBootstrap/start_stop_test/
and
https://jenkins.scylladb.com/view/master/job/scylla-master/job/next/1657/testReport/bootstrap_test/TestBootstrap/start_stop_test_node/

Traceback (most recent call last):
  File "/usr/lib64/python3.7/unittest/case.py", line 60, in testPartExecutor
    yield
  File "/usr/lib64/python3.7/unittest/case.py", line 645, in run
    testMethod()
  File "/jenkins/workspace/scylla-master/next@2/scylla-dtest/bootstrap_test.py", line 53, in start_stop_test
    cluster.start(wait_for_binary_proto=True, wait_other_notice=True)
  File "/jenkins/workspace/scylla-master/next@2/scylla-ccm/ccmlib/scylla_cluster.py", line 137, in start
    started = self.start_nodes(**args)
  File "/jenkins/workspace/scylla-master/next@2/scylla-ccm/ccmlib/scylla_cluster.py", line 109, in start_nodes
    profile_options=profile_options, no_wait=no_wait)
  File "/jenkins/workspace/scylla-master/next@2/scylla-ccm/ccmlib/scylla_node.py", line 514, in start
    raise NodeError(e_msg, scylla_process)
ccmlib.node.NodeError: Error starting node node1: unable to connect to scylla-jmx

It is not clear what the error is.
I'll open a pull request to narrow down the set of errors we accept for retry so that any other error will throw and become visible.
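A sketch of the narrowing described above (names are illustrative, not ccm's actual API): only the known transient jmx-connect failure is retried, while any other error propagates immediately and stays visible.

```python
RETRYABLE_MESSAGES = ("unable to connect to scylla-jmx",)

def start_with_retry(start_fn, retries=3):
    last_error = None
    for _ in range(retries):
        try:
            return start_fn()
        except RuntimeError as e:
            if not any(msg in str(e) for msg in RETRYABLE_MESSAGES):
                raise  # unknown error: fail loudly instead of masking it
            last_error = e
    raise last_error

# Demo: a start function that fails transiently once, then succeeds.
attempts = []
def flaky_start():
    attempts.append(1)
    if len(attempts) < 2:
        raise RuntimeError("unable to connect to scylla-jmx")
    return "started"

result = start_with_retry(flaky_start)
print(result)  # started
```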

start-all.sh unexpected hyphen

I'm passing an extra argument to 'start-all.sh' to enable the Prometheus admin API using:

./start-all.sh -b --web.enable-admin-api

The problem is that the Prometheus container fails with the following message:

Error parsing commandline arguments: unknown long flag '---web.enable-admin-api'
prometheus: error: unknown long flag '---web.enable-admin-api'

As you can see, there are 3 hyphens, and I only passed 2.

If I change my startup command to

./start-all.sh -b -web.enable-admin-api

it adds the extra hyphen and the containers start fine.

My Scylla manager version: 3.6.3

'ccm list' fails with "SyntaxError: invalid syntax" on Ubuntu 20 box

HEAD: f1a1b77

Description

$ ccm list                                                               
Traceback (most recent call last):                                                                                             
  File "/home/vladz/work/scylla-ccm/ccm", line 7, in <module>                                                                  
    from ccmlib import common                                                                                                  
  File "/home/vladz/work/scylla-ccm/ccmlib/common.py", line 489                                                                
    raise TimeoutError(f"Relocatables download still runs in parallel from another test after 60 min. "
                                                                                                      ^
SyntaxError: invalid syntax

The reason is the first line in ccm:

#!/usr/bin/env python

On this box, /usr/bin/python points to Python 2.7, and f-strings are a Python 3-only feature.

Can't add a new node to the cluster if I don't specify the data center explicitly.

In materialized_views_test.py I wanted to add a test doing something like:

        session = self.prepare(rf=rf)
        session.execute("CREATE TABLE t (id int PRIMARY KEY, v int)")
        node4 = new_node(self.cluster)
        node4.start(wait_other_notice=True, wait_for_binary_proto=True) 

This did not work... The fourth Scylla instance failed to start up. The problem was that the file
/home/nyh/.dtest/dtest-kawO4S/test/node4/conf/cassandra-rackdc.properties instead of containing

dc=dc1
rack=RAC1

as expected, contained only a comment with an explanation but no actual content, apparently copied from the default Cassandra configuration.

If I change the code to explicitly specify the data center:

        node4 = new_node(self.cluster, data_center='dc1')

Everything works.
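A hypothetical sketch of what new_node would need to do so the added node gets a real cassandra-rackdc.properties with the expected content (the helper name and defaults are illustrative):

```python
import os
import tempfile

def write_rackdc_properties(conf_dir, dc="dc1", rack="RAC1"):
    """Write a minimal cassandra-rackdc.properties with real content."""
    path = os.path.join(conf_dir, "cassandra-rackdc.properties")
    with open(path, "w") as f:
        f.write(f"dc={dc}\nrack={rack}\n")
    return path

conf_dir = tempfile.mkdtemp()
path = write_rackdc_properties(conf_dir)  # defaults match the expected file
print(open(path).read())
```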

support for nodetool REST

OK
I think I found a possible workaround for
#171

I was able to proceed and don't break cassandra simply by using

diff --cc ccmlib/node.py
index cc8f565,5ee75c6..0000000
--- a/ccmlib/node.py
+++ b/ccmlib/node.py
@@@ -704,8 -704,12 +704,12 @@@ class Node(object)
          if capture_output and not wait:
              raise common.ArgumentError("Cannot set capture_output while wait is False.")
          env = self.get_env()
+         if self.is_scylla():
+             host = self.address()
+         else:
 -            host = "localhost"
++            host = 'localhost'
          nodetool = self.get_tool('nodetool')
-         args = [nodetool, '-h', 'localhost', '-p', str(self.jmx_port)]
+         args = [nodetool, '-h', host, '-p', str(self.jmx_port)]
          args += cmd.split()
          if capture_output:
              p = subprocess.Popen(args, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

Launching Scylla Enterprise 2021 causes unrecognised argument --kernel-page-cache

When launching Scylla Enterprise 2021 with current master scylla-ccm, CCM wrongly adds --kernel-page-cache command line argument, which is not supported by Scylla Enterprise 2021. The offending code:

# The '--kernel-page-cache' was introduced by
# https://github.com/scylladb/scylla/commit/8785dd62cb740522d80eb12f8272081f85be9b7e from 4.5 version
current_node_version = self.node_install_dir_version() or self.cluster.version()
if parse_version(current_node_version) >= parse_version('4.5.dev'):
    args += ['--kernel-page-cache', '1']
if parse_version(current_node_version) >= parse_version('4.6') and '--max-networking-io-control-blocks' not in args:
    args += ['--max-networking-io-control-blocks', '1000']
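A runnable sketch of the comparison at fault: enterprise versions like 2021.1.10 compare as *greater* than the open-source 4.5/4.6 thresholds, so the gate wrongly adds the flags for enterprise builds too (this uses packaging.version; the deprecated distutils parse_version behaves similarly).

```python
from packaging.version import parse as parse_version

def extra_args(version_string):
    """Reproduce the version gate from scylla_node.py on a bare string."""
    args = []
    v = parse_version(version_string)
    if v >= parse_version("4.5.dev"):
        args += ["--kernel-page-cache", "1"]
    if v >= parse_version("4.6"):
        args += ["--max-networking-io-control-blocks", "1000"]
    return args

print(extra_args("4.4"))        # [] -- older open-source: no extra flags
print(extra_args("2021.1.10"))  # both flags added, though unsupported
```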

Reproducer:

$ mkdir /tmp/ccm123
$ SCYLLA_PRODUCT=enterprise ccm create ccm_1 -n 1 --scylla  -v release:2021.1.10 --config-dir=/tmp/ccm123
$ SCYLLA_PRODUCT=enterprise ccm start --config-dir=/tmp/ccm123
$ tail -n10 /tmp/ccm123/ccm_1/node1/logs/system.log
Scylla version 2021.1.10-0.20220410.e8e681dee with build-id f407f6d42e2570c781e916292747e651dd7f50da starting ...
command used: "/tmp/ccm123/ccm_1/node1/bin/scylla --options-file /tmp/ccm123/ccm_1/node1/conf/scylla.yaml --log-to-stdout 1 --api-address 127.0.0.1 --collectd-hostname localhost.localdomain.node1 --developer-mode true --smp 1 --memory 512M --default-log-level info --collectd 0 --overprovisioned --prometheus-address 127.0.0.1 --unsafe-bypass-fsync 1 --kernel-page-cache 1 --max-networking-io-control-blocks 1000"
parsed command line options: [options-file: /tmp/ccm123/ccm_1/node1/conf/scylla.yaml, log-to-stdout: 1, api-address: 127.0.0.1, collectd-hostname: localhost.localdomain.node1, developer-mode: true, smp: 1, memory: 512M, default-log-level: info, collectd: 0, overprovisioned, prometheus-address: 127.0.0.1, unsafe-bypass-fsync: 1, kernel-page-cache, (positional) 1, max-networking-io-control-blocks: 1000]
error: unrecognised option '--kernel-page-cache'

Try --help.

rewriting install.sh by sed causes syntax error

We are currently causing a bash syntax error in install.sh while merging scylladb/scylladb#7187, likely because we try to add one more bash function to call systemctl:

/jenkins/workspace/scylla-master/next/scylla/.ccm/scylla-repository/5d75ecb069f7db44b0457a634edf9ed596d3de95/scylla-core-package/install.sh: line 386: syntax error near unexpected token `fi'
21:33:39  start_stop_test_node (bootstrap_test.TestBootstrap) ... ERROR

Also see: https://groups.google.com/g/scylladb-dev/c/9mK1qMnPEV4/m/w3DeyMB7AQAJ

It's better to use an install.sh option to skip systemctl instead of rewriting the script with sed, which can easily break it.
I'm going to merge a patch to support --packaging with nonroot mode, so we can use the option:
scylladb/scylladb#7405

`add` command forgets which build mode we use

If I use

./ccm create t1 --scylla --vnodes -n 1 --install-dir=`pwd`/../scylla/build/dev

then nodes will be created using dev mode builds. However, further nodes launched via ccm add / ccm start will launch from the release build. This can be quite confusing.

Preparing CCM for manager 2.0's agent

The latest version of manager, 2.0, now requires an agent installed on every node of a cluster. In order to run dtest with manager 2.0, ccm needs to start an agent for each node.

Add a separate scylla node as the manager's backend

Currently, the scylla manager uses the same cluster that is used for testing as a backend. This creates a problematic situation on certain test cases. Can't we create a separate scylla cluster that will be used as the backend?

Using IPs instead of localhost for nodetool commands breaks dtests

ccm version 1545f3e
dtest version scylladb/scylla-dtest@1cd840d8c575fdc68723152f85ce0c1bcc085c3c

pull #170 assumes we listen on the cluster host address but it still listens on localhost:(7000 + cluster.id * 100 + node.id)

This change broke a few dtests. E.g.:
dtest-release/227/testReport/migration_test/TTLWithMigrate/big_table_with_ttls_test:

Error Message
Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/repository/3.11.3/bin/nodetool -h 127.0.42.1 -p 7142 refresh -- ks cf' failed; exit status: 1; stderr: nodetool: Failed to connect to '127.0.42.1:7142' - ConnectException: 'Connection refused (Connection refused)'.
Stacktrace
Traceback (most recent call last):
  File "/usr/lib64/python2.7/unittest/case.py", line 367, in run
    testMethod()
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-dtest/migration_test.py", line 937, in big_table_with_ttls_test
    count_query=count_query)
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-dtest/migration_test.py", line 959, in migrate_to_cassandra
    keyspace_names_list=[keyspace_name], table_names=[table_name])
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-dtest/scylla_tools.py", line 1141, in run_migration
    self.migrate_data_to_cassandra(nodes=nodes_list)
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-dtest/scylla_tools.py", line 1115, in migrate_data_to_cassandra
    node.nodetool("refresh -- {} {}".format(ks.replace('"', ''), table.replace('"', '')))
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-ccm/ccmlib/node.py", line 721, in nodetool
    raise NodetoolError(" ".join(args), exit_status, stdout, stderr)
NodetoolError: Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/repository/3.11.3/bin/nodetool -h 127.0.42.1 -p 7142 refresh -- ks cf' failed; exit status: 1; stderr: nodetool: Failed to connect to '127.0.42.1:7142' - ConnectException: 'Connection refused (Connection refused)'.
  1. We also need to pass to scylla-jmx com.sun.management.jmxremote.host
  2. I'm not sure if it's used anywhere but jconsole still uses localhost

"ccm create" fails with "-v release:2021.1.16"

When executing ccm create ccm_1 -i 127.0.1. -n 1 --scylla -v release:2021.1.16, ccm fails with:

S3 download: scylla-enterprise-x86_64-unified-package-2021.1.16.0.20221022.3224379e7.tar.gz100%|██████████| 1.12G/1.12G [00:35<00:00, 31.3MB/s]
Extracting /tmp/ccm-3_kulzdn.tar.gz (https://s3.amazonaws.com/downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2021.1/scylla-enterprise-x86_64-unified-package-2021.1.16.0.20221022.3224379e7.tar.gz, /tmp/tmpaf6m1de_) as version release/2021.1.16 ...
Relocatable package format version 2 detected.
Usage: install.sh [options]

Options:
  --root /path/to/root     alternative install root (default /)
  --prefix /prefix         directory prefix (default /usr)
  --python3 /opt/python3   path of the python3 interpreter relative to install root (default /opt/scylladb/python3/bin/python3)
  --housekeeping           enable housekeeping service
  --nonroot                install Scylla without required root priviledge
  --sysconfdir /etc/sysconfig   specify sysconfig directory name
  --help                   this helpful message
Traceback (most recent call last):
  File "/home/piotrgrabowski/.local/bin/ccm", line 75, in <module>
    cmd.run()
  File "/home/piotrgrabowski/.local/lib/python3.11/site-packages/ccmlib/cmds/cluster_cmds.py", line 223, in run
    cluster = ScyllaCluster(self.path, self.name, install_dir=self.options.install_dir,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotrgrabowski/.local/lib/python3.11/site-packages/ccmlib/scylla_cluster.py", line 50, in __init__
    super(ScyllaCluster, self).__init__(path, name, partitioner,
  File "/home/piotrgrabowski/.local/lib/python3.11/site-packages/ccmlib/cluster.py", line 71, in __init__
    dir, v = self.load_from_repository(version, verbose)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotrgrabowski/.local/lib/python3.11/site-packages/ccmlib/scylla_cluster.py", line 71, in load_from_repository
    install_dir, version = scylla_repository.setup(version, verbose)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/piotrgrabowski/.local/lib/python3.11/site-packages/ccmlib/scylla_repository.py", line 312, in setup
    run_scylla_unified_install_script(**args)
  File "/home/piotrgrabowski/.local/lib/python3.11/site-packages/ccmlib/scylla_repository.py", line 522, in run_scylla_unified_install_script
    run('''{0}/install.sh --prefix {1} --nonroot{2}'''.format(
  File "/home/piotrgrabowski/.local/lib/python3.11/site-packages/ccmlib/scylla_repository.py", line 47, in run
    subprocess.check_call(['bash', '-c', cmd], cwd=cwd,
  File "/usr/lib64/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bash', '-c', '/home/piotrgrabowski/.ccm/scylla-repository/release/2021.1.16/scylla-core-package/install.sh --prefix /home/piotrgrabowski/.ccm/scylla-repository/release/2021.1.16 --nonroot --supervisor']' returned non-zero exit status 1.

I think that one of the problems is a --supervisor flag passed to install.sh - it looks like 2021.1.16 does not support it, so install.sh fails.

nodetool_additional_test.py:TestNodetool.concurrent_repair_test occasionally fails to set up CLASSPATH

Seen in the following runs for example:
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/523/testReport/nodetool_additional_test/TestNodetool/concurrent_repair_test/
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/535/testReport/nodetool_additional_test/TestNodetool/concurrent_repair_test/

Standard Output
Failed repair  with  Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/2d91e5f6a06586cccc8d2904198ea44192f4a193/scylla-tools-java/bin/nodetool -h 127.0.9.1 -p 7109 repair' failed; exit status: 1; stderr: You must set the CASSANDRA_CONF and CLASSPATH vars

Failed verify_info  with  Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/2d91e5f6a06586cccc8d2904198ea44192f4a193/scylla-tools-java/bin/nodetool -h 127.0.9.1 -p 7109 info' failed; exit status: 1; stderr: You must set the CASSANDRA_CONF and CLASSPATH vars

Failed verify_info  with  Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/2d91e5f6a06586cccc8d2904198ea44192f4a193/scylla-tools-java/bin/nodetool -h 127.0.9.1 -p 7109 info' failed; exit status: 1; stderr: You must set the CASSANDRA_CONF and CLASSPATH vars

I managed to reproduce this locally after a few runs and the issue stems from node1/bin/cassandra.in.sh being empty,
causing CLASSPATH to remain unset.

$ ll /local/home/bhalevy/.dtest/dtest-4hdqgq43/test/node1/resources/cassandra/bin/cassandra.in.sh /local/home/bhalevy/.dtest/dtest-4hdqgq43/test/node1/bin/cassandra.in.sh
-rw-rw-r--. 1 bhalevy bhalevy    0 Jun 28 14:56 /local/home/bhalevy/.dtest/dtest-4hdqgq43/test/node1/bin/cassandra.in.sh
-rw-rw-r--. 3 bhalevy bhalevy 5687 Jun 23 17:44 /local/home/bhalevy/.dtest/dtest-4hdqgq43/test/node1/resources/cassandra/bin/cassandra.in.sh

The assigned jmx-port might be in use for large clusters

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/37/testReport/update_cluster_layout_tests/TestLargeScaleCluster/Run_Dtest_Parallel_Cloud_Machines___LongDtest___long_split000___test_add_50_nodes/

ccmlib.node.NodeError: Error starting node node32: unable to connect to scylla-jmx port 127.0.21.32:10221

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/37/artifact/logs-long.release.000/1655881382174_update_cluster_layout_tests.py%3A%3ATestLargeScaleCluster%3A%3Atest_add_50_nodes/node32_system.log.jmx

Using config file: /jenkins/workspace/scylla-master/dtest-daily-release/scylla/.dtest/dtest-97fehkcv/test/node32/conf/scylla.yaml
Error: JMX connector server communication error: service:jmx:rmi://47a05c919e49:10221
sun.management.AgentConfigurationError: java.rmi.server.ExportException: Port already in use: 10221; nested exception is: 
	java.net.BindException: Address already in use (Bind failed)
	at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:800)
	at sun.management.jmxremote.ConnectorBootstrap.startRemoteConnectorServer(ConnectorBootstrap.java:468)
	at sun.management.Agent.startAgent(Agent.java:262)
	at sun.management.Agent.startAgent(Agent.java:452)
Caused by: java.rmi.server.ExportException: Port already in use: 10221; nested exception is: 
	java.net.BindException: Address already in use (Bind failed)
	at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:346)
	at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:254)
	at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:412)
	at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
	at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:237)
	at sun.management.jmxremote.ConnectorBootstrap$PermanentExporter.exportObject(ConnectorBootstrap.java:199)
	at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:146)
	at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:122)
	at javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
	at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:796)
	... 3 more
Caused by: java.net.BindException: Address already in use (Bind failed)
	at java.net.PlainSocketImpl.socketBind(Native Method)
	at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
	at java.net.ServerSocket.bind(ServerSocket.java:390)
	at java.net.ServerSocket.<init>(ServerSocket.java:252)
	at java.net.ServerSocket.<init>(ServerSocket.java:143)
	at sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
	at sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
	at sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:670)
	at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:335)
	... 12 more

The jmx port is calculated in

def get_node_jmx_port(self, nodeid):
    return 7000 + nodeid * 100 + self.id

So for large enough clusters we get arbitrarily high ports, like 10221 above (7000 + 32*100 + 21).
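Reproducing the arithmetic from the formula above for the failing node (node 32 in cluster 21):

```python
def get_node_jmx_port(cluster_id, nodeid):
    # Same formula as in ccm, written as a free function for the demo.
    return 7000 + nodeid * 100 + cluster_id

print(get_node_jmx_port(21, 32))  # 10221, the port already in use above
```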

All but the first logging settings passed via SCYLLA_EXT_OPTS are ignored

Example:

export SCYLLA_EXT_OPTS="--logger-log-level paging=trace --logger-log-level storage_proxy=trace --logger-log-level migration_manager=trace --logger-log-level commitlog_replayer=trace --logger-log-level database=trace --logger-log-level batchlog_manager=trace"

Only the "paging" logger will be adjusted.

That's because duplicate options are filtered out by this code in scylla_node.py:

        while opts_i < len(scylla_ext_opts):
            if scylla_ext_opts[opts_i].startswith('-'):
                add = False
                if scylla_ext_opts[opts_i] not in args:
                    add = True
                    args.append(scylla_ext_opts[opts_i])
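A minimal reproduction of the filtering described above, plus a sketch of a fix: deduplicate repeatable flags on the (flag, value) pair instead of the flag token alone. This code is illustrative, not ccm's, and assumes ext_opts is a flat [flag, value, flag, value, ...] list.

```python
REPEATABLE = {"--logger-log-level"}

def merge_opts(args, ext_opts):
    for flag, value in zip(ext_opts[0::2], ext_opts[1::2]):
        if flag in REPEATABLE:
            # Keep every distinct (flag, value) pair, so all loggers survive.
            if (flag, value) not in zip(args[0::2], args[1::2]):
                args += [flag, value]
        elif flag not in args:  # ccm's current check: the bare token
            args += [flag, value]
    return args

ext = ["--logger-log-level", "paging=trace",
       "--logger-log-level", "storage_proxy=trace"]
merged = merge_opts([], ext)
print(merged)  # both logger settings are preserved
```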

`cluster.populate(n).start()` does not wait for nodes to start

and in effect some dtests are failing.

e.g. with the following setup: scylla-ccm d1e62ba (next and master at the moment of writing this issue) + scylla-dtest https://github.com/scylladb/scylla-dtest/commit/cd83fd8f8956befd1af256b91dee6456093413c0 (next and master at the moment of writing this issue)

running:

nosetests -v -s concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.snapshot_test

gives:

2020-01-31 15:08:22.142366 going to run tests sequentially
2020-01-31 15:08:22.142422 using the SingleClusterIdAllocator
snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges) ... 2020-01-31 15:08:22.143590 cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
2020-01-31 15:08:22.143644 Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
2020-01-31 15:08:22.189404 snapshot_test()
ERROR
2020-01-31 15:08:28.350926 Test failed with exception: <class 'cassandra.cluster.NoHostAvailable'>: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
2020-01-31 15:08:28.354925 stopping ccm cluster test at: /home/kbraun/.dtest/dtest-2k9Sq9

======================================================================
ERROR: snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kbraun/dev/scylla-dtest/concurrent_schema_changes_test.py", line 499, in snapshot_test
    session = self.cql_connection(node1)
  File "/home/kbraun/dev/scylla-dtest/dtest.py", line 704, in cql_connection
    protocol_version, port=port, ssl_opts=ssl_opts, **kwargs)
  File "/home/kbraun/dev/scylla-dtest/dtest.py", line 742, in _create_session
    session = cluster.connect()
  File "cassandra/cluster.py", line 1335, in cassandra.cluster.Cluster.connect
    with self._lock:
  File "cassandra/cluster.py", line 1371, in cassandra.cluster.Cluster.connect
    raise
  File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
    self.control_connection.connect()
  File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
    self._set_new_connection(self._reconnect_internal())
  File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
-------------------- >> begin captured logging << --------------------
dtest: DEBUG:  - going to run tests sequentially
dtest: DEBUG:  - using the SingleClusterIdAllocator
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - snapshot_test()
cassandra.cluster: WARNING: Cluster.__init__ called with contact_points specified, but load-balancing policies are not specified in some ExecutionProfiles. In the next major version, this will raise an error; please specify a load-balancing policy. (contact_points = ['127.0.0.1'], EPs without explicit LBPs = ('EXEC_P
ROFILE_DEFAULT',))
cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.1:9042:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 2928, in cassandra.cluster.ControlConnection._reconnect_internal
    return self._try_connect(host)
  File "cassandra/cluster.py", line 2950, in cassandra.cluster.ControlConnection._try_connect
    connection = self._cluster.connection_factory(host.endpoint, is_control_connection=True)
  File "cassandra/cluster.py", line 1292, in cassandra.cluster.Cluster.connection_factory
    return self.connection_class.factory(endpoint, self.connect_timeout, *args, **kwargs)
  File "cassandra/connection.py", line 471, in cassandra.connection.Connection.factory
    conn = cls(endpoint, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
    self._connect_socket()
  File "cassandra/connection.py", line 514, in cassandra.connection.Connection._connect_socket
    raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror or sockerr))
error: [Errno 111] Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused
cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
    self.control_connection.connect()
  File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
    self._set_new_connection(self._reconnect_internal())
  File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 7.293s

FAILED (errors=1)

Here's how the test starts:

    def snapshot_test(self):
        debug("snapshot_test()")
        cluster = self.cluster
        cluster.set_configuration_options(values={'experimental': True})
        cluster.populate(2).start()
        node1, node2 = cluster.nodelist()
        wait(2)
        session = self.cql_connection(node1)

By reading the code and debugging, I arrived at the conclusion that the reason is 83bdc2d, specifically these lines:

-        if no_wait and not verbose:
-            # waiting 2 seconds to check for early errors and for the
-            # pid to be set
-            time.sleep(2)
-        else:
-            for node, p, mark in started:
-                start_message = "Starting listening for CQL clients"
-                try:
-                    # updated code, scylla starts CQL only by default
-                    # process should not be checked for scylla as the
-                    # process is a boot script (that ends after boot)
-                    node.watch_log_for(start_message, timeout=600,
-                                       verbose=verbose, from_mark=mark)
-                except RuntimeError:
-                    raise Exception("Not able to find start "
-                                    "message '%s' in Node '%s'" %
-                                    (start_message, node.name))

the else branch was causing the wait in older ccm versions (which caused start() to take ~25 seconds longer to finish than it currently does). I guess this is now responsible for the wait:

        if wait_for_binary_proto:
            for node, _, mark in started:
                node.watch_log_for("Starting listening for CQL clients",
                                   verbose=verbose, from_mark=mark)

but

  1. waiting for binary proto should probably be different/something more than just waiting for a certain message appearing in the node's logs (see wait_for_binary_interface in ccmlib/node.py)
  2. lots of dtests do cluster.populate(n).start() (so wait_for_binary_proto=False) and probably assume that the cluster is ready after this:
[kbraun@localhost scylla-dtest]$ grep -r "populate.*start()" | wc -l
185
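A hedged sketch of a stronger readiness check than grepping the log, per point 1 above: actually connect to the node's native-transport port until it accepts connections (function name and the demo server are illustrative).

```python
import socket
import time

def wait_for_binary_interface(host, port, timeout=60.0):
    """Poll until a TCP connection to (host, port) succeeds, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)
    return False

# Demo against a local listening socket instead of a real node:
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
host, port = srv.getsockname()
ready = wait_for_binary_interface(host, port, timeout=5.0)
print(ready)  # True
srv.close()
```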

The dtest manager test failing due to Scylla version mismatch (got Scylla 3.0 and not 4.1)

The manager_health_check_test.test_health_check_metrics dtest failed because the reported cluster version was 3.0 and not 4.1 (the CCM version function returns an incorrect cluster version).

+----+----------+----------+------------+-----------+------+----------+--------------------------------+-----------------------------+--------------------------------------+
|    | CQL      | REST     | Address    | Uptime    | CPUs | Memory   | Scylla                         | Agent                       | Host ID                              |
+----+----------+----------+------------+-----------+------+----------+--------------------------------+-----------------------------+--------------------------------------+
| DN | -        | -        | 127.0.81.1 | -         | -    | -        | -                              | -                           | 4174891b-0ee1-4a98-97a8-6463a4a33eab |
| UN | UP (0ms) | UP (0ms) | 127.0.81.2 | 72h21m28s | 8    | 31.18GiB | 4.1.rc2-0.20200608.67348cd6e8e | 666.dev-0.20201014.fae52058 | 05e0e94e-9f9f-4cb6-93a9-5915a7e90c04 |
| UN | UP (0ms) | UP (0ms) | 127.0.81.3 | 72h21m28s | 8    | 31.18GiB | 4.1.rc2-0.20200608.67348cd6e8e | 666.dev-0.20201014.fae52058 | 553f4400-b619-4f31-ad20-77350c5e9d82 |
+----+----------+----------+------------+-----------+------+----------+--------------------------------+-----------------------------+--------------------------------------+

https://jenkins.scylladb.com/job/mermaid-master/job/mermaid-dtest/573/testReport/manager_health_check_tests/ManagerHealthCheckTest/test_health_check_metrics/

CCM doesn't notice that a node has aborted

This is particularly annoying in dtests: when one of the nodes throws an exception (e.g. during startup), the dtest keeps running until it times out (~10 minutes) because CCM doesn't see that the node stopped.
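A minimal sketch of the missing check: Popen.poll() returns None while the child is alive and the exit code once it has died, so ccm (or the dtest layer) could fail fast instead of waiting for a log message that will never come. The helper below is illustrative, not ccm's API:

```python
import subprocess
import sys

def node_has_aborted(process):
    """True if the node process has exited; poll() is non-blocking."""
    return process.poll() is not None

# Simulate a node that aborts during startup.
proc = subprocess.Popen([sys.executable, "-c", "raise SystemExit(42)"])
proc.wait()
print(node_has_aborted(proc), proc.returncode)  # True 42
```

In practice the check would run periodically alongside watch_log_for, raising as soon as the node process dies.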

Change the relocatable related command line to compare to cached versions

Right now doing

ccm create scylla-reloc-1 -n 1 --scylla --version unstable/master:390

and then

ccm create scylla-reloc-1 -n 1 --scylla --version unstable/master:390 --scylla-core-package-uri=../scylla-next/build/release/scylla-package.tar.gz

The second command won't have any effect (i.e. scylla-package.tar.gz won't be used), since versions are cached based only on the --version argument.
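One possible fix, sketched: derive the cache directory name from both the version string and a hash of the package URI, so a different --scylla-core-package-uri maps to a different cache entry. Here cache_dir_name is a hypothetical helper; current ccm keys the cache on the version alone:

```python
import hashlib

def cache_dir_name(version, package_uri=None):
    """Key the download cache on the version plus the package URI, if any.
    Illustrative sketch of the proposed behavior."""
    if package_uri is None:
        return version
    # A short stable digest keeps the directory name readable.
    digest = hashlib.sha256(package_uri.encode()).hexdigest()[:12]
    return "%s_%s" % (version, digest)
```

With this scheme the two ccm create invocations above would extract into different directories under ~/.ccm/scylla-repository instead of silently reusing the first one.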

local_quorum_bootstrap_test failed: type object got multiple values for keyword argument 'stderr'

381104e
scylladb/scylla-dtest@8126bd12f144afa6084f85c14a976e4c3f420297

Seen in dtest-release/41/testReport/bootstrap_test/TestBootstrap/local_quorum_bootstrap_test:

Stacktrace
  File "/usr/lib64/python2.7/unittest/case.py", line 367, in run
    testMethod()
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-dtest/bootstrap_test.py", line 304, in local_quorum_bootstrap_test
    stdout=tmpfile, stderr=subprocess.STDOUT)
  File "/jenkins/workspace/scylla-master/dtest-release/scylla-ccm/ccmlib/node.py", line 1121, in stress
    **kwargs)

type object got multiple values for keyword argument 'stderr'

It looks like local_quorum_bootstrap_test is passing stderr to node.stress in kwargs.

300             node1.stress(['user', 'profile=' + stress_config.name, 'ops(insert=1)',
301                           'n=500000', 'cl=LOCAL_QUORUM',
302                           '-rate', 'threads=5',
303                           '-errors', 'retries=2'],
304                          stdout=tmpfile, stderr=subprocess.STDOUT)

This popped up with 3e0696c (with capture_output=False).
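The collision can be reproduced in isolation: if a wrapper sets stderr itself and also forwards **kwargs, a caller-supplied stderr reaches Popen twice. stress_sketch below is a simplified stand-in for node.stress, not the actual ccmlib code:

```python
import subprocess

def stress_sketch(args, **kwargs):
    # Always redirects stderr itself, then forwards kwargs -- so a caller
    # passing stderr= triggers "got multiple values for keyword argument".
    return subprocess.Popen(args, stderr=subprocess.PIPE, **kwargs)

try:
    stress_sketch(["echo", "hi"], stderr=subprocess.STDOUT)
except TypeError as exc:
    print(exc)

# One fix: only supply a default, letting the caller's value win.
def stress_fixed(args, **kwargs):
    kwargs.setdefault("stderr", subprocess.PIPE)
    return subprocess.Popen(args, **kwargs)
```

This matches the traceback above: the test passes stderr=subprocess.STDOUT while node.stress already fills in stderr before spreading **kwargs.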
