
ceph-linode's Introduction

Repository of scripts to deploy Ceph in Linode

The repository has a collection of scripts that automate the deployment of Ceph within Linode. The primary use-case for this work is to allow rapid testing of Ceph at scale.

Why Linode?

Linode is a popular virtual private server (VPS) provider, but one of several. The primary reasons for selecting Linode were

  • Price. Linode is generally very affordable compared to the competition.

  • SSD local storage at no extra cost. Testing Ceph requires OSDs backed by local devices, and the SSDs on Linode are enterprise quality and well provisioned.

  • Friendly API. Most cloud providers offer a deployment API today; when this project was first written, few did.

Want to add another cloud provider? I'm all ears. Please talk to me by email (see the commit history for my email address).

Repository Organization

The repository has a number of utilities roughly organized as:

  • linode.py: script to rapidly create/configure/nuke/destroy Linodes.

  • cluster.json: the description of the cluster to deploy.

  • pre-config.yml: an ansible playbook to pre-configure Linodes with useful packages or utilities prior to installing Ceph.

  • cephadm.yml: an ansible playbook to install Ceph using cephadm.

  • playbooks/: ansible playbooks for running serial tests and collecting test artifacts and performance data. Note that most of these playbooks were written for testing CephFS.

  • scripts/ and misc/: miscellaneous scripts. Notably, workflow management scripts for testing CephFS are located here.

  • graphing/: graphing scripts using gnuplot and some ImageMagick utilities. These may be run on the artifacts produced by the ansible playbooks in playbooks/.

How-to Get Started:

🔥 Note 🔥 For non-toy deployments, it's recommended to use a dedicated Linode for running ansible. This reduces operation latency, avoids internet hiccups, lets you allocate enough RAM for memory-hungry ansible, and makes it quick to download test artifacts for archival. Generally, the more RAM/cores the better. Also: make sure to enable a private IP address on the ansible Linode, otherwise ansible will not be able to communicate with the Ceph cluster.

  • Set up a Linode account and get an API key.

    Put the key in ~/.linode.key:

    cat > ~/.linode.key
    ABCFejfASFG...
    ^D
  • Set up an SSH key if you have not already done so:

    ssh-keygen
  • Install necessary packages:

    CentOS Stream:

    dnf install epel-release
    dnf update
    dnf install git ansible python3-pip python3-netaddr jq rsync wget htop
    pip3 install linode_api4 notario

    Fedora:

    dnf install git ansible python3-notario python3-pip python3-netaddr jq rsync htop wget
    pip3 install linode_api4

    Arch Linux:

    pacman -Syu git ansible python3-netaddr python3-pip jq rsync htop wget
    pip3 install notario linode_api4
  • Clone ceph-linode:

    git clone https://github.com/batrick/ceph-linode.git
  • Copy cluster.json.sample to cluster.json and modify it to have the desired count and Linode plan for each daemon type (a rough sketch of such a file appears after this list). If you're planning to test CephFS, it is recommended to have 3+ MDSs, 2+ clients, and 8+ OSDs. The ansible playbook playbooks/cephfs-setup.yml will configure 4 OSDs to be dedicated to the metadata pool. Keep in mind that containerized Ceph daemons require more memory than bare-metal installations. It is recommended to use at least 4GB for all daemons; OSDs require at least 8GB.

🔥 Note 🔥 The OSD memory target is set automatically based on the memory available on the OSD node, with a floor of 4GB. If you use smaller OSDs (4GB or smaller), then you must configure the memory target manually by changing the Ceph config, for example as sketched below.
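
For example, to pin the target yourself, something along these lines should work (a sketch only: 2 GiB shown, adjust for your plan; /root/cephadm is the path used by cephadm.yml):

    ./ansible-ssh mon-000
    # run ceph inside the cephadm shell in case no ceph CLI is installed on the host
    /root/cephadm shell -- ceph config set osd osd_memory_target 2147483648   # 2 GiB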

  • Configure which Ceph version you want to deploy in settings.yml.

  • Start using:

    python3 linode.py launch
    source ansible-env.bash
    do_playbook cephadm.yml
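
For reference, the cluster.json mentioned above might end up shaped roughly like the following. This is only a sketch: the group names match the host groups used by the playbooks in this repository, but the exact field names and Linode plan identifiers are assumptions, so treat cluster.json.sample as authoritative.

    cat > cluster.json <<'EOF'
    {
        "mons":    { "count": 3, "plan": "g6-standard-4" },
        "mgrs":    { "count": 1, "plan": "g6-standard-4" },
        "mdss":    { "count": 3, "plan": "g6-standard-4" },
        "osds":    { "count": 8, "plan": "g6-standard-8" },
        "clients": { "count": 2, "plan": "g6-standard-4" }
    }
    EOF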

SSH to a particular machine

./ansible-ssh mon-000

Or any named node in the linodes JSON file.

Execute ansible commands against the cluster

source ansible-env.bash
ans -m shell -a 'echo im an osd' osds
ans -m shell -a 'echo im an mds' mdss
ans -m shell -a 'echo im a client' clients
...

You can also easily execute playbooks:

source ansible-env.bash
do_playbook foo.yml
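
For example, a CephFS test setup might chain the playbooks shipped in playbooks/ (the playbook names below are the ones used elsewhere in this repository; the comments describe their assumed purpose):

source ansible-env.bash
do_playbook playbooks/cephfs-setup.yml    # dedicate OSDs to the metadata pool
do_playbook playbooks/cephfs-create.yml   # create the file system
do_playbook playbooks/kernel-mount.yml    # kernel-mount it on the client nodes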

How-to nuke and repave your cluster:

Sometimes you want to start over from a clean slate. Destroying the cluster can incur unnecessary costs, though, because Linodes are billed by the hour no matter how little of the hour you use. It is often cheaper to nuke the Linodes instead: delete all configurations, destroy all disks, and so on, while keeping the instances allocated.

You can manually nuke the cluster whenever you want using:

python3 linode.py nuke

How-to destroy your cluster:

python3 linode.py destroy

The script works by destroying all the Linodes that belong to the group named in the LINODE_GROUP file, created by linode.py.

This deletes EVERYTHING and stops any further billing.
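
If you want to double-check which group is about to be destroyed, the group name is stored in the LINODE_GROUP file mentioned above:

cat LINODE_GROUP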

ceph-linode's People

Contributors

ajarr, batrick, bengland2, fullerdj, sidharthanup, tserlin


ceph-linode's Issues

latest kernel gets installed on the hosts, but hosts don't get rebooted

When testing, Jeff Layton observed that the latest kernel gets installed on the hosts, but the hosts don't get rebooted afterwards by the ansible scripts.

In pre-config.yaml, I see all packages are updated in the 'update_packages' section, and then we install a bunch of packages. Maybe installing them also somehow pulls in the latest kernel on the hosts?

would like a launch.sh option to just create the linodes and not run ceph-ansible

There are cases where it is desirable to create the linodes but do some additional steps before you run ceph-ansible. Example: when a new RHCS release comes out, CentOS doesn't have all the RPMs that it depends on, so you have to insert a repo containing those RPMs before you can install Ceph.

Suggestion: if the user doesn't specify the --ceph-ansible argument to launch.sh, just have launch.sh skip ceph-ansible and warn the user that they have to run it themselves. The nice thing about launch.sh is that it's idempotent - if you already created the linodes, then linode-launch.py won't create them again. But it takes a long time for ceph-ansible to fail, so just running launch.sh twice as it is today can be very expensive. So a user could do something like this:

launch.sh
# <make necessary changes to cluster hosts before running ceph-ansible
launch.sh --ceph-ansible /usr/share/ceph-ansible

Or they could just run ceph-ansible directly. If you agree with this, I could submit a PR for it; it's trivial to do.

strange intermittent failure of yum on linodes

I get this strange failure of yum that is non-reproducible: if I go back and re-run the command on the same cluster, it succeeds. Has anyone else seen that? I'm guessing the mirror site used by the yum repo was busy. Is there a way to make yum more robust in the face of this by retrying? I'm going to try the ansible yum module and see if that's more resilient. I could also try "yum -t --randomwait=1", because maybe having all these linodes hit the yum repo server at the same time contributes to the problem.

$ ansible -m shell -a 'yum install -y wget yum-utils' all
 [WARNING]: Consider using yum module rather than running yum
mgr-000 | FAILED | rc=1 >>


 One of the configured repositories failed (Unknown),
 and yum doesn't have enough cached data to continue. 
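
A sketch of the ansible yum-module approach mentioned above; the module is generally more graceful about locking and metadata than a raw shell call, though it is not a guaranteed fix for a flaky mirror:

ansible -m yum -a 'name=wget,yum-utils state=present' all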

firewall disabling?

with the ceph-ansible stable-3.1 branch I found this necessary at the end of pre-config.yml:

- hosts: all
  become: yes
  tasks:
  - name: disable firewall
    shell: "(systemctl stop firewalld && systemctl disable firewalld) || (systemctl stop iptables && systemctl disable iptables)"

but maybe a newer version of ceph-ansible handles this now?

kernel cephfs in linode

what's the best way to run kernel cephfs in linode? I'm running the centos-7 distro and there is no kernel module ceph.ko or ceph.ko.xz in /lib/modules/`uname -r`/kernel/fs/. When I try to mount using cephfs:

[root@li1741-92 ~]# mount -t ceph li1749-153:/cephfs /mnt/cephfs
failed to load ceph kernel module (1)
[root@li1741-92 ~]# find /lib/modules/`uname -r`/kernel -name 'ceph*'
find: ‘/lib/modules/4.9.36-x86_64-linode85/kernel’: No such file or directory
[root@li1741-92 ~]# uname -a
Linux li1741-92.members.linode.com 4.9.36-x86_64-linode85 #1 SMP Thu Jul 6 15:31:23 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@li1741-92 ~]# mkdir /mnt/ceph
[root@li1741-92 ~]# ceph-fuse /mnt/ceph
2017-10-09 22:21:26.392074 7fe409988040 -1 init, newargv = 0x55b07674f7a0 newargc=9
ceph-fuse[28473]: starting ceph client
ceph-fuse[28473]: starting fuse
[root@li1741-92 ~]# df /mnt/ceph
Filesystem     1K-blocks  Used Available Use% Mounted on
ceph-fuse       15593472     0  15593472   0% /mnt/ceph

fatal: [mon-000]: FAILED! =>

I can't start ceph-linode; the installation breaks at:
shell: /root/cephadm bootstrap --allow-fqdn-hostname --mon-ip {{ monitor_address }}

TASK [bootstrap octopus] **********************************************************************************************************************************************************************************
fatal: [mon-000]: FAILED! => {
    "changed": true,
    "cmd": "/root/cephadm bootstrap --allow-fqdn-hostname --mon-ip 45.56.94.64",
    "delta": "0:12:02.194868",
    "end": "2020-09-23 13:40:01.039117",
    "rc": 1,
    "start": "2020-09-23 13:27:58.844249"
}

STDERR:

INFO:cephadm:Verifying podman|docker is present...
INFO:cephadm:Verifying lvm2 is present...
INFO:cephadm:Verifying time synchronization is in place...
INFO:cephadm:Unit chronyd.service is enabled and running
INFO:cephadm:Repeating the final host check...
INFO:cephadm:podman|docker (/usr/bin/podman) is present
INFO:cephadm:systemctl is present
INFO:cephadm:lvcreate is present
INFO:cephadm:Unit chronyd.service is enabled and running
INFO:cephadm:Host looks OK
INFO:root:Cluster fsid: 98f5f4ec-fda0-11ea-8836-f23c922d25b7
INFO:cephadm:Verifying IP 45.56.94.64 port 3300 ...
INFO:cephadm:Verifying IP 45.56.94.64 port 6789 ...
INFO:cephadm:Mon IP 45.56.94.64 is in CIDR network 45.56.94.0/24
INFO:cephadm:Pulling container image ceph/daemon-base:v5.0.3-stable-5.0-octopus-centos-8-x86_64...
INFO:cephadm:Extracting ceph user uid/gid from container image...
INFO:cephadm:Creating initial keys...
INFO:cephadm:Creating initial monmap...
INFO:cephadm:Creating mon...
INFO:cephadm:Waiting for mon to start...
INFO:cephadm:Waiting for mon...
INFO:cephadm:/usr/bin/ceph:timeout after 60 seconds
INFO:cephadm:Non-zero exit code -9 from /usr/bin/podman run --rm --net=host --ipc=host -e CONTAINER_IMAGE=ceph/daemon-base:v5.0.3-stable-5.0-octopus-centos-8-x86_64 -e NODE_NAME=li896-64 -v /var/lib/ceph/98f5f4ec-fda0-11ea-8836-f23c922d25b7/mon.li896-64:/var/lib/ceph/mon/ceph-li896-64:z -v /tmp/ceph-tmpt1gd66zz:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmp3w_b573j:/etc/ceph/ceph.conf:z --entrypoint /usr/bin/ceph ceph/daemon-base:v5.0.3-stable-5.0-octopus-centos-8-x86_64 status
INFO:cephadm:mon not available, waiting (1/10)...
[... the same timeout / non-zero exit / "mon not available, waiting (N/10)" cycle repeats through attempt 10/10 ...]
ERROR: mon not available after 10 tries


MSG:

non-zero return code

PLAY RECAP ************************************************************************************************************************************************************************************************
client-000                 : ok=33   changed=27   unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
mds-000                    : ok=34   changed=27   unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
mgr-000                    : ok=34   changed=28   unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
mon-000                    : ok=37   changed=28   unreachable=0    failed=1    skipped=1    rescued=0    ignored=0   
mon-001                    : ok=33   changed=26   unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
mon-002                    : ok=33   changed=26   unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
osd-000                    : ok=36   changed=29   unreachable=0    failed=0    skipped=0    rescued=0    ignored=1   
osd-001                    : ok=36   changed=29   unreachable=0    failed=0    skipped=0    rescued=0    ignored=1   
osd-002                    : ok=36   changed=29   unreachable=0    failed=0    skipped=0    rescued=0    ignored=1   


real	21m29,087s
user	1m46,031s
sys	0m39,729s

I switched to the stable image; with the development version I get exactly the same error:

  - set_fact:
      CEPHADM_IMAGE: ceph/daemon-base:v5.0.3-stable-5.0-octopus-centos-8-x86_64
      CEPHADM_REPO: --release octopus

Ceph tries to start on mon-000 but I don’t understand why it doesn’t work

./ansible-ssh mon-000
ps -ax
   4619 ?        S      0:00 (sd-pam)
  33672 ?        I      0:00 [kworker/u2:0-events_unbound]
  36859 ?        I      0:00 [kworker/0:2-events_power_efficient]
  36935 ?        S      0:00 /usr/sbin/chronyd
  38931 ?        R      0:00 [kworker/u2:1-flush-8:0]
  41421 ?        Ss     0:00 /bin/bash /var/lib/ceph/98f5f4ec-fda0-11ea-8836-f23c922d25b7/mon.li896-64/unit.run
  41450 ?        Sl     0:00 /usr/bin/podman run --rm --net=host --ipc=host --privileged --group-add=disk --name ceph-98f5f4ec-fda0-11ea-8836-f23c922d25b7-mon.li896-64 -e CONTAINER_IMAGE=ceph/daemon-base
  41458 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c e3c4e602d08452830cfb768ecb187da134f160e73def334b1481e926ae0c884e -u e3c4e602d08452830cfb768ecb187da134f160e73def334b1481e926ae0c884e -r 
  41469 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41479 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 8438619ccb7b3b44503d01367da715e67e0bf07c422b642e139893d25fba4994 -u 8438619ccb7b3b44503d01367da715e67e0bf07c422b642e139893d25fba4994 -r 
  41491 ?        Ssl    0:00 /usr/bin/ceph-mon -n mon.li896-64 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug  --default-mon-
  41576 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 6a6d696288002676da7b66ea2acf9edb2a065f7075c506bc18a7bf0c483f66d6 -u 6a6d696288002676da7b66ea2acf9edb2a065f7075c506bc18a7bf0c483f66d6 -r 
  41587 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41629 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 6ae2ec2c90d62c2e0187fdf5251f14a1bc639189e9cf913e03ee60ad91a1a85c -u 6ae2ec2c90d62c2e0187fdf5251f14a1bc639189e9cf913e03ee60ad91a1a85c -r 
  41640 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41674 ?        Ss     0:00 sshd: root [priv]
  41677 ?        S      0:00 sshd: root@pts/1
  41678 pts/1    Ss     0:00 -bash
  41708 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 5581995d13c14704d35b7facf0212bad115dcc965928d448f7cc3b7ca9d84139 -u 5581995d13c14704d35b7facf0212bad115dcc965928d448f7cc3b7ca9d84139 -r 
  41719 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41761 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 02dba76f865472736a1631b5456f0dcc71cf0222bf58f37e1941f5b3545955a0 -u 02dba76f865472736a1631b5456f0dcc71cf0222bf58f37e1941f5b3545955a0 -r 
  41772 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41813 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 0cb2a8f64f2b694a0a97bddd3c710c12fbf2fcad1f7a4c9f8432cd7d2fc44e3c -u 0cb2a8f64f2b694a0a97bddd3c710c12fbf2fcad1f7a4c9f8432cd7d2fc44e3c -r 
  41824 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41867 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c ba18819bb3b978210d85015f9cf9d6bf9ab1fc949c09abf77d79c356ce443ac1 -u ba18819bb3b978210d85015f9cf9d6bf9ab1fc949c09abf77d79c356ce443ac1 -r 
  41878 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41924 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 40a2e69fed8781074f836f56806bee90a0036da0f9b8a5a268c6e4c54815a8f4 -u 40a2e69fed8781074f836f56806bee90a0036da0f9b8a5a268c6e4c54815a8f4 -r 
  41935 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  41965 ?        I      0:00 [kworker/0:0-events]
  41979 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c c3b0ed74fc6f64a1c525418efd0a7ad60b48d4283c36f3be96323e00bc69cc63 -u c3b0ed74fc6f64a1c525418efd0a7ad60b48d4283c36f3be96323e00bc69cc63 -r 
  41990 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  42033 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 4a58314c03aa1793868dd4b68153dbc1fbeb75654c298dd6cdc8aa169356a77c -u 4a58314c03aa1793868dd4b68153dbc1fbeb75654c298dd6cdc8aa169356a77c -r 
  42044 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  42087 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -s -c 8e202457a962fbd2a399b60abd98b6007c311a59b88605642860567513658d98 -u 8e202457a962fbd2a399b60abd98b6007c311a59b88605642860567513658d98 -r 
  42098 ?        Ssl    0:00 /usr/libexec/platform-python -s /usr/bin/ceph status
  42252 ?        I      0:00 [kworker/0:5-cgroup_destroy]
  42347 ?        I      0:00 [kworker/0:1-events]
  42423 ?        I      0:00 [kworker/0:4-events_power_efficient]
  42431 ?        Ss     0:00 /usr/sbin/anacron -s
  42434 ?        Ss     0:00 sshd: unknown [priv]
  42435 ?        S      0:00 sshd: unknown [net]
  42436 pts/1    R+     0:00 ps -ax

Could not re-install ceph after nuke

python3 linode.py launch
do_playbook cephadm.yml
python3 linode.py nuke
do_playbook cephadm.yml

throws the errors below:
Using /etc/ansible/ansible.cfg as config file
[WARNING]: Unable to parse /root/ceph-linode/ansible_inventory as an inventory source
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not
match 'all'

PLAY [all] **********************************************************************************************************
skipping: no hosts matched

PLAY [all] **********************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: mdss

PLAY [mdss] *********************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: osds

PLAY [osds] *********************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: mons

PLAY [mons] *********************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: clients

PLAY [clients] ******************************************************************************************************
skipping: no hosts matched

PLAY [all] **********************************************************************************************************
skipping: no hosts matched

PLAY [all] **********************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: mon-000

PLAY [mon-000] ******************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: mgrs
[WARNING]: Could not match supplied host pattern, ignoring: grafana-servers

PLAY [mons mgrs osds mdss grafana-servers] **************************************************************************
skipping: no hosts matched

PLAY [mon-000] ******************************************************************************************************
skipping: no hosts matched

PLAY [all] **********************************************************************************************************
skipping: no hosts matched

PLAY [clients] ******************************************************************************************************
skipping: no hosts matched

PLAY RECAP **********************************************************************************************************

suggestion: could we run ansible playbooks from inside the cluster whenever possible?

It is starting to work for me, but it takes a long time to start up the cluster and to find out whether a run has errors or not, because of two things:

  • ansible playbooks are all run from outside linode.com, which means that the response time is an order of magnitude longer
  • retry options mean that unless you are watching the script, it takes forever to find out that the input parameters were wrong.

Would it make sense to run the ansible playbooks from inside the cluster on the ceph-mgr host, in order to remove all that latency and limit retries? We could just install ansible on the ceph-mgr host, generate a private key for it, and install the corresponding public key on all the linodes.

ceph-ansible issue that impacts linode

This ceph-ansible issue means that you can't use the monitor_address_block ansible variable to set monitor addresses. Has anyone else encountered this? This seems like an absolute requirement for Linode because multiple addresses are associated with eth0, only one of which is on the subnet (192.168.128.0/17) shared by all the linodes in my cluster.

[root@li289-169 ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 66.228.39.169  netmask 255.255.255.0  broadcast 66.228.39.255
        inet6 2600:3c03::f03c:91ff:febd:18b7  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::f03c:91ff:febd:18b7  prefixlen 64  scopeid 0x20<link>
        ether f2:3c:91:bd:18:b7  txqueuelen 1000  (Ethernet)
...
[root@li289-169 ~]# ip a 
...
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether f2:3c:91:bd:18:b7 brd ff:ff:ff:ff:ff:ff
    inet 66.228.39.169/24 brd 66.228.39.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.152.194/17 brd 192.168.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2600:3c03::f03c:91ff:febd:18b7/64 scope global noprefixroute dynamic 
       valid_lft 2592000sec preferred_lft 604800sec
    inet6 fe80::f03c:91ff:febd:18b7/64 scope link 
       valid_lft forever preferred_lft forever
...

I worked around it by disabling the check in ceph-ansible that prevents monitor_address_block from being used.

diff --git a/roles/ceph-common/tasks/checks/check_mandatory_vars.yml b/roles/ceph-common/tasks/checks/check_mandatory_vars.yml
index 3525e0f3..5f203d1e 100644
--- a/roles/ceph-common/tasks/checks/check_mandatory_vars.yml
+++ b/roles/ceph-common/tasks/checks/check_mandatory_vars.yml
@@ -75,14 +75,6 @@
   tags:
     - package-install
 
-- name: make sure monitor_interface or monitor_address is defined
-  fail:
-    msg: "you must set monitor_interface or monitor_address"
-  when:
-    - monitor_interface == 'interface'
-    - monitor_address == "0.0.0.0"
-    - mon_group_name in group_names
-
 - name: make sure radosgw_interface or radosgw_address is defined
   fail:
     msg: "you must set radosgw_interface or radosgw_address"

don't want to use upstream Linux kernel 4.17

I want to use whatever kernel GRUB boots instead of Linux kernel 4.17, but I can't figure out how to get linode-launch.py to do that. Otherwise ceph-ansible will not mount an .iso image (unless I reboot the node!). Worse yet, I'm not really using centos7. Any suggestions?

have you ever seen this warning followed by authentication failure?

ceph-linode from the tip of your tree is not working for me. I suspect it's a cert problem; we've been getting rejected from the Dell DRAC console because of this as well. They tightened up crypto in newer versions and attempts to use non-secure older certs are failing. The symptom here is that the moment I try to get a list of datacenters from the Linode server, which is the first thing linode-launch.py does, I get this failure.

(Pdb)

/root/ceph-linode/linode-launch.py(12)<module>()
-> import linode.api as linapi
(Pdb)
/root/ceph-linode/linode-env/lib/python2.7/site-packages/linode/api.py:83: RuntimeWarning: using urllib instead of pycurl, urllib does not verify SSL remote certificates, there is a risk of compromised communication
warnings.warn(ssl_message, RuntimeWarning)
/root/ceph-linode/linode-launch.py(14)<module>()

....

(Pdb) n

/root/ceph-linode/linode-launch.py(142)launch()
-> datacenters = client.avail_datacenters()
(Pdb)
2018-07-31 20:48:48,439 DEBUG Parameters {'api_key': 'api_key: xxxx REDACTED xxxx', 'api_responseformat': 'json', 'api_action': 'avail.datacenters'}
2018-07-31 20:48:48,650 DEBUG Raw Response: {"ACTION":"avail.datacenters","DATA":{},"ERRORARRAY":[{"ERRORMESSAGE":"Authentication failed","ERRORCODE":4}]}
ApiError: ApiError()
/root/ceph-linode/linode-launch.py(142)launch()
-> datacenters = client.avail_datacenters()

CephFS mount succeeds and I can create a directory, but I cannot create a file

I run the following commands:

>do_playbook cephadm.yml 
>do_playbook playbooks/cephfs-setup.yml
>do_playbook playbooks/cephfs-create.yml
>do_playbook playbooks/kernel-mount.yml 

but the mount test fails
[screenshot: error-mount]

I can mount successfully using the [client.admin] key in /etc/ceph/ceph.client.admin.keyring, but I can't write to the file system, although I can create a directory.

[screenshot: error-admin]

It has something to do with the unknown PGs that belong to cephfs_data. I have reinstalled the system several times, but as soon as I run the playbooks/cephfs-create.yml script, the unknown PGs appear.

[screenshot: cephfs-data]

My hosts: [screenshot: ok]

linode-launch.py fails - unable to create linodes

I'm unable to create linodes when running the following command:

# LINODE_API_KEY=<snip> python ceph-linode/linode-launch.py
...
2018-05-21 18:50:57,867 ERROR list index out of range
Traceback (most recent call last):
  File "ceph-linode/linode-launch.py", line 136, in create
    do_create(*args, **kwargs)
  File "ceph-linode/linode-launch.py", line 54, in do_create
    plan = filter(lambda p: p[u'LABEL'].lower().find(str(machine['plan']).lower()) >= 0, plans)[0]
IndexError: list index out of range

This same command worked about 2 months ago.

Note: I'm running this command from another linode, on CentOS 7.

partial deployment of large linode clusters

I've been experimenting with some larger linode clusters, and what happens is that sometimes the Linode API rejects a node creation with an error like the one shown at the bottom. I think it means that there is no room at the inn: Linode just doesn't have the resources at that geographic site to create that many VMs.

My complaint is that this results in a set of linodes that are created but aren't in the display group, so that linode-destroy.py won't clean them up and I have to do this by hand. Sometimes this set can be pretty large. If one linode create fails, the other threads in the pool are aborted before they can add their new linodes to the display group (the string in LINODE_GROUP), since that is a separate call to the linode API.

Is there any way to change linode-launch.py so that a linode isn't created unless it is also added to the group? Cleanup would be simple then - just run linode-destroy.py

2018-08-30 14:48:05,067 DEBUG Raw Response: {"ACTION":"linode.create","DATA":{"LinodeID":9963999},"ERRORARRAY":[]}
2018-08-30 14:48:05,068 DEBUG Parameters {'linodeid': 9963999, 'alert_cpu_enabled': 0, 'label': u'ceph-d5d1b9-mds-001', 'api_responseformat': 'json', 'api_action': 'linode.update', 'watchdog': 1, 'api_key': 'api_key: xxxx REDACTED xxxx', 'lpm_displaygroup': u'ceph-d5d1b9'}
2018-08-30 14:48:05,097 DEBUG Raw Response: {"ACTION":"linode.create","DATA":{},"ERRORARRAY":[{"ERRORMESSAGE":"No open slots for this plan!","ERRORCODE":8}]}
2018-08-30 14:48:05,098 ERROR [{u'ERRORCODE': 8, u'ERRORMESSAGE': u'No open slots for this plan!'}]
Traceback (most recent call last):
  File "./linode-launch.py", line 133, in create
    do_create(*args, **kwargs)
  File "./linode-launch.py", line 63, in do_create
    node = client.linode_create(DatacenterID = datacenter, PlanID = plan[u'PLANID'], PaymentTerm = 1)
  File "/root/ceph-linode/linode-env/lib/python2.7/site-packages/linode/api.py", line 340, in wrapper
    return self.__send_request(request)
  File "/root/ceph-linode/linode-env/lib/python2.7/site-packages/linode/api.py", line 294, in __send_request
    raise ApiError(s['ERRORARRAY'])
ApiError: [{u'ERRORCODE': 8, u'ERRORMESSAGE': u'No open slots for this plan!'}]

iptables.yml: Enable default ipv6 firewall

Jeff Layton suggested that we enable a default IPv6 firewall with iptables.

Currently, on hosts set up by ceph-linode, I see no IPv6 rules:

# ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination     

After Jeff's suggestion,

# systemctl enable ip6tables
# systemctl start ip6tables
# ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all      anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     ipv6-icmp    anywhere             anywhere            
ACCEPT     all      anywhere             anywhere            
ACCEPT     tcp      anywhere             anywhere             state NEW tcp dpt:ssh
ACCEPT     udp      anywhere             fe80::/64            udp dpt:dhcpv6-client state NEW
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

linodes aren't in same subnet

I fired up a linode cluster with ceph-linode and discovered that the linodes weren't in the same subnet, so ceph-ansible couldn't configure them. I have never seen this before. Do you have to tell it to create them in the same subnet?

$ ansible -m shell -a 'ifconfig eth0 | awk "/inet /{print $2}"' all

mgr-000 | SUCCESS | rc=0 >>
inet 173.255.238.79 netmask 255.255.255.0 broadcast 173.255.238.255

client-000 | SUCCESS | rc=0 >>
inet 66.228.47.20 netmask 255.255.255.0 broadcast 66.228.47.255

osd-001 | SUCCESS | rc=0 >>
inet 50.116.48.61 netmask 255.255.255.0 broadcast 50.116.48.255

osd-002 | SUCCESS | rc=0 >>
inet 173.255.230.167 netmask 255.255.255.0 broadcast 173.255.230.255

osd-000 | SUCCESS | rc=0 >>
inet 69.164.222.67 netmask 255.255.255.0 broadcast 69.164.222.255

mds-000 | SUCCESS | rc=0 >>
inet 96.126.104.6 netmask 255.255.255.0 broadcast 96.126.104.255

mon-002 | SUCCESS | rc=0 >>
inet 45.33.65.125 netmask 255.255.255.0 broadcast 45.33.65.255

mon-000 | SUCCESS | rc=0 >>
inet 66.228.39.50 netmask 255.255.255.0 broadcast 66.228.39.255

mon-001 | SUCCESS | rc=0 >>
inet 45.33.81.10 netmask 255.255.255.0 broadcast 45.33.81.255
