ha_cluster's Issues

Add support for SBD devices

In tasks/cluster-enable-disable.yml there is a comment saying that SBD is always disabled because it is not yet supported:

# The role does not support configuring SBD yet, therefore we always disable it

Are there plans to support it in the future? I'm using this role to set up RHEL HA, and the official documentation mentions that SBD is supported by Red Hat.

Extend platform/version execution to allow either pcs or crmsh

By design, ha_cluster allows for additional platforms/versions. The first Ansible task executed within the role is set_vars.yml, which imports different variables per OS major.minor version.

Extending this principle to also include/execute a different set of Ansible tasks for each high-level command interface to Pacemaker would be valuable for future compatibility.

Suggestion for execution logic fork

  1. Add a new var _ha_cluster_pacemaker_shell to all existing /vars/<os_version>.yml files. For example:

/vars/RHEL9.yml

_ha_cluster_pacemaker_shell: pcs

  2. Create a new directory of Ansible tasks for each specific shell for Pacemaker. Move the existing files specific to the pcs shell into a /tasks/pcs directory, and create stub code for the crmsh shell in the future. For example:

Directory restructure:

/tasks/pcs/cluster-start-and-reload.yml
etc.

/tasks/crmsh/cluster-start-and-reload.yml
etc.

  3. Move the common Ansible tasks that execute the underlying Pacemaker binaries cibadmin and crm_mon into a separate tasks subdirectory. For example:

Directory restructure:

/tasks/common/create-and-push-cib.yml
/tasks/common/cluster-start-and-reload.yml

  4. Update the include_tasks calls accordingly. For example:

/tasks/main.yml

---
- name: Set platform/version specific variables
  include_tasks: set_vars.yml

...
...

- name: Start the cluster and reload corosync.conf
  include_tasks: "{{ _ha_cluster_pacemaker_shell }}/cluster-start-and-reload.yml"
...
...
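
For reference, here is a minimal sketch of what set_vars.yml could look like using Ansible's first_found lookup; the exact file list is illustrative, and the role's real set_vars.yml may differ:

/tasks/set_vars.yml

- name: Set platform/version specific variables
  include_vars: "{{ lookup('first_found', params) }}"
  vars:
    params:
      files:
        - "{{ ansible_distribution }}_{{ ansible_distribution_version }}.yml"
        - "{{ ansible_distribution }}_{{ ansible_distribution_major_version }}.yml"
        - "{{ ansible_distribution }}.yml"
        - "{{ ansible_os_family }}.yml"
      paths:
        - "{{ role_path }}/vars"

With the _ha_cluster_pacemaker_shell value loaded this way, the include_tasks fork above resolves to the right shell-specific task file per platform.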

Side benefits

  • Retains the flat design of the /tasks directory, but reduces its contents so it is easier to see the key/controlling Ansible task files and what they execute

Comparison of pcs and crmsh

Looking in the repository, there are a limited number of pcs shell commands in use. See the crmsh equivalents below:

# PCS Shell                   # CRMSH

pcs cluster                   crm node
pcs constraint <type>         crm configure <type>
pcs property                  crm configure property
pcs qdevice                   crm cluster init qdevice
pcs quorum                    corosync-quorumtool / corosync-qnetd-tool
pcs resource                  crm ra
pcs status                    crm status
pcs stonith                   crm ra

Refs:

corosync: Add hidden default options as variables

Issue:
pcs corosync commands currently create the corosync.conf file with predefined default values that are not exposed through variables.
Example: logging

https://github.com/linux-system-roles/ha_cluster/blob/main/tasks/shell_pcs/pcs-cluster-setup-pcs-0.10.yml does not specify logging, but a logging section is created by default.

Reason:

  1. Allow users to specify logging details during runtime
  2. Enable planned rework of crmsh jinja template to use other existing variables like ha_cluster_totem

Resolution:
It would be helpful if variables (e.g. ha_cluster_logging) were added to the corosync setup tasks and exposed in the defaults and README.
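
As a sketch, a hypothetical ha_cluster_logging variable (the name and structure are illustrative, not an agreed interface) could mirror the shape of ha_cluster_totem in the role defaults:

/defaults/main.yml

ha_cluster_logging:
  options:
    - name: to_logfile
      value: "yes"
    - name: logfile
      value: /var/log/cluster/corosync.log
    - name: to_syslog
      value: "yes"

The corosync setup tasks would then render these options into the logging section instead of relying on the hidden pcs defaults.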

Conscious Language: Please rename master branch to main branch

As part of the conscious language project, the master branch is to be renamed to the main branch.

Here are the instructions.

  1. Rename the master branch to main: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-branches-in-your-repository/renaming-a-branch
  2. Check this to ensure the default branch has been changed to main: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-branches-in-your-repository/changing-the-default-branch - this should keep the github history, as well as updating the default branch configuration and updating any existing PRs

If you use the gh CLI (highly recommended), you can use this to check which repos need to be updated:

gh repo list linux-system-roles -L 100 --json name,defaultBranchRef --source | \
  jq --raw-output '.[] | select(.defaultBranchRef.name == "master") | .name'

Thanks.

RHEL9 resource-agents no longer contain cloud agents

This is of course cloud specific, so I am not sure if you want to implement it here, as I don't see any cloud-platform-specific code.

The RedHat_9 var file should be enhanced with __ha_cluster_fullstack_node_packages from the main RedHat var file to cover the newly split package on RHEL 9.
Package to add: resource-agents-cloud

It contains:

/usr/lib/ocf/resource.d/heartbeat/aliyun-vpc-move-ip
/usr/lib/ocf/resource.d/heartbeat/aws-vpc-move-ip
/usr/lib/ocf/resource.d/heartbeat/aws-vpc-route53
/usr/lib/ocf/resource.d/heartbeat/awseip
/usr/lib/ocf/resource.d/heartbeat/awsvip
/usr/lib/ocf/resource.d/heartbeat/azure-events
/usr/lib/ocf/resource.d/heartbeat/azure-events-az
/usr/lib/ocf/resource.d/heartbeat/azure-lb
/usr/lib/ocf/resource.d/heartbeat/gcp-ilb
/usr/lib/ocf/resource.d/heartbeat/gcp-pd-move
/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-route
/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-vip

Explanation:
RHEL 9 changed the resource-agents package; it no longer contains the cloud resource agents. They are now in resource-agents-cloud.

Example for RHEL 9.2 on AWS:
resource-agents-4.10.0-34.el9_2.2.x86_64

[root@rhel9ha0 ~]# rpm -ql resource-agents | grep aws-vpc-move-ip
/usr/share/man/man7/ocf_heartbeat_aws-vpc-move-ip.7.gz
[root@rhel9ha0 ~]#

resource-agents-cloud-4.10.0-34.el9_2.2.x86_64

[root@rhel9ha0 ~]# rpm -ql resource-agents-cloud | grep aws-vpc-move-ip
/usr/lib/ocf/resource.d/heartbeat/aws-vpc-move-ip
/usr/share/man/man7/ocf_heartbeat_aws-vpc-move-ip.7.gz
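
A minimal sketch of the proposed change; only the resource-agents-cloud addition comes from this issue, and the rest of the package list is illustrative:

/vars/RedHat_9.yml

__ha_cluster_fullstack_node_packages:
  - corosync
  - pacemaker
  - resource-agents
  - resource-agents-cloud  # cloud agents split out of resource-agents on RHEL 9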

doc: semantic specificity

Looking at the README with fresh eyes, I notice the word Support is used frequently. Given this is a community repository, I would anticipate there is no (paid-for) support of the code, and the correct word to convey the intended meaning (particularly when translated) would instead be Compatibility / Compatible.

Example:

See sbd(8) man page, section 'Configuration via environment' for their description.
Supported options are:

overwrite not recognized on pcs cluster setup

The Ansible task 'Create a corosync.conf file content using pcs-0.10' errors on re-run with pcs cluster setup on RHEL 8.4.

Error:

option --overwrite not recognized

Diagnosis

This option does not exist; is this supposed to be --force?

Version:

[root@host-p ~]# pcs --version
0.10.4

Output:

TASK [fedora.linux_system_roles.ha_cluster : Create a corosync.conf file content using pcs-0.10] *********
fatal: [host-p]: FAILED! =>
{
    "changed": true,
    "cmd": [
        "pcs",
        "cluster",
        "setup",
        "--corosync_conf",
        "/tmp/ansible.5vjb9txg_ha_cluster_corosync_conf",
        "--overwrite",
        "--",
        "clusterhdb",
        "host-p",
        "host-s"
    ],
    "delta": "0:00:00.299788",
    "end": "2023-04-27 23:38:30.909534",
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2023-04-27 23:38:30.609746",
    "stderr": "",
    "stderr_lines": [],
    "stdout_lines": [
        "option --overwrite not recognized"
    ]
}
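
One hedged way to handle this would be probing the installed pcs before adding the flag; this is only a sketch (the task and __pcs_setup_help / node_list names are made up, and it is not necessarily how the role resolved the issue):

- name: Check whether pcs cluster setup supports --overwrite
  command: pcs cluster setup --help
  register: __pcs_setup_help
  changed_when: false

- name: Create a corosync.conf file content
  command: >-
    pcs cluster setup --corosync_conf /tmp/corosync.conf
    {{ '--overwrite' if '--overwrite' in __pcs_setup_help.stdout else '' }}
    -- {{ ha_cluster_cluster_name }} {{ node_list | join(' ') }}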

high-availability firewall service is not added on qdevice node

The ha_cluster_manage_firewall: true attribute does not alter the firewalld configuration for the qdevice node.

In tasks/main.yml, in the task "Install and configure HA cluster", the firewall.yml inclusion only applies when ha_cluster_cluster_present is true, which will always be false for the qdevice node.

It looks like the firewall.yml inclusion should also be added in tasks/shell_pcs/pcs-qnetd.yml; a sketch follows the output below.

Available firewalld services on the qdevice node after running the role:

[root@qdevice ~]#
[root@qdevice ~]# firewall-cmd --list-services
cockpit dhcpv6-client ssh
[root@qdevice ~]#
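
A minimal sketch of the suggested fix; placement within the file and the exact condition are illustrative:

/tasks/shell_pcs/pcs-qnetd.yml

- name: Configure firewall for the qnetd host
  include_tasks: firewall.yml  # resolved against the role's tasks directory
  when: ha_cluster_manage_firewall | bool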

SUSE Support and crmsh

Hello Team,

I am working on sap-linuxlab (community.sap_install), and our plan is to make sure that the role sap_ha_pacemaker_cluster, which consumes fedora.linux_system_roles, works correctly on SUSE systems.

I noticed that the groundwork for adopting non-pcs steps was already done thanks to Sean in #122, so adoption would consist of:

  • splitting main.yml based on os_family, because it currently contains pcs-related tasks directly (non-generic names); see the sketch after this list
  • creating variables for SUSE
  • creating separate tasks under shell_crmsh for crmsh, corosync, and sbd
  • replicating the cluster setup for SUSE from sap-linuxlab/community.sles-for-sap/cluster
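
A rough sketch of the os_family fork in main.yml, building on the _ha_cluster_pacemaker_shell idea from #122; the variable name and file layout are illustrative:

/vars/Suse.yml

_ha_cluster_pacemaker_shell: crmsh

/tasks/main.yml

- name: Install and configure HA cluster
  include_tasks: "shell_{{ _ha_cluster_pacemaker_shell }}/cluster-setup.yml"
  when: ha_cluster_cluster_present | bool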

There are a few things I wanted to ask for clarification on before proceeding with any changes in a fork:

  1. Are there any ongoing efforts for SUSE support, or is this free range to pick up?
  2. What is the status of ha_cluster and idempotency?

Eliminate non-inclusive language

Hello, we have a project to eliminate non-inclusive language from the linux-system-roles.

Running the utility woke (now supported in tox; please install the latest tox-lsr and run tox -e woke) reports two non-inclusive words: dummy and slave.

Can we replace them with more appropriate words? For the word dummy, placeholder and sample are recommended. For slave, the candidates are secondary, replica, responder, device, worker, proxy, and performer. Replacing dummy looks straightforward to me, but I am not certain about slave. Is it doable? Or if we replace it, does it break the HA cluster? Thanks!

pcs-qnetd: check-mode disabled for force removing configuration?

This new task is failing when run in check-mode against a cluster that was configured using a previous LSR version:

- name: Remove qnetd configuration

Collection                Version
------------------------- -------
fedora.linux_system_roles 1.30.5

Setup
The cluster was built using a previous version of the LSR (no qnetd support).
A dry run using --check was made against the existing cluster with the newer LSR (no change of input parameters).
The dry run fails because the task is designed to force an actual configuration change even for a check, and this fails due to the missing corosync-qnetd package.

Issue
Is it really desired to force-remove the qnetd config even during a --check run?
As a user I'm surprised by this behavior, as I would not expect any changes on the systems when explicitly running the playbook in check mode.
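
A hedged sketch of one way to make the removal honor check mode; the command shown is illustrative, and whether this matches the role's eventual fix is not confirmed here:

- name: Remove qnetd configuration
  command: pcs qdevice destroy net  # actual command in the role may differ
  when: not ansible_check_mode  # skip the destructive change during --check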

Setting cluster members' attributes.

I've been trying to set up a RHEL HA cluster using the ha_cluster system role, but I haven't found a way to define cluster members' attributes; these are required to define different cluster constraints.
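
For reference, pcs itself can set per-node attributes with pcs node attribute; a hedged sketch of driving that from Ansible until the role exposes a native option (the task layout and attribute values here are made up):

- name: Set cluster node attributes
  command: pcs node attribute {{ item.node }} {{ item.name }}={{ item.value }}
  loop:
    - { node: node1, name: site, value: dc1 }
    - { node: node2, name: site, value: dc2 }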
