
metrics's People

Contributors

andreasgerstmayr, briansmith0, dependabot[bot], github-actions[bot], i386x, kurik, natoscott, nhosoi, pcahyna, richm, spetrosi, sradco, ukulekek

metrics's Issues

Graph access ideas

I tried this role on RHEL 8.3 and it works well; however, a few items could be improved for graph access:

  • The Grafana port or URL is not mentioned anywhere in the README. While it may be obvious to many, it would still be helpful to spell it out in the README.
  • Ideally there would be an option to set up all components to listen on localhost only, so that making services accessible from outside is opt-in (or at least possible to opt out of); currently the services are exposed to the world after executing the role.
  • It could be explained, or perhaps even automated, how to open the needed firewall ports for remote access.
  • Accessing graphs over HTTPS doesn't seem to work; I get Error code: SSL_ERROR_RX_RECORD_TOO_LONG. Using a self-signed certificate would not be perfect, but it would at least allow HTTPS; currently there is no way around this error.

Thanks.
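Until firewall handling is automated in the role, remote graph access can be opened manually. A minimal sketch (not part of the role) using the `firewalld` module, assuming Grafana's default port 3000/tcp:

```yaml
# Sketch only: opens Grafana's default port; adjust the port or
# firewall zone if your deployment differs.
- name: Open the Grafana port for remote access
  hosts: all
  tasks:
    - name: Allow Grafana traffic through the firewall
      firewalld:
        port: 3000/tcp
        permanent: true
        immediate: true
        state: enabled
```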

pmrepconf is not available on all platforms

I have the following playbook:

# SPDX-License-Identifier: MIT
---
- name: Check if pcp2elasticsearch has been deployed
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_into_elasticsearch: yes

  tasks:
    - name: Check if pcp2elasticsearch is installed
      command: test -x /usr/bin/pcp2elasticsearch

This playbook fails on the following distros:

  • RHEL <= 8.3
  • Fedora <= 32
  • CentOS
    (I do not have a Debian box available, so I cannot provide info for this distro)

The error message printed by Ansible is as follows:

fatal: [10.0.138.13]: FAILED! => {"changed": false, "msg": "Unable to start service pcp2elasticsearch: Job for pcp2elasticsearch.service failed because the control process exited with error code.\nSee \"systemctl status pcp2elasticsearch.service\" and \"journalctl -xe\" for details.\n"}

The journalctl -xe shows as the real reason:

pcp2elasticsearch.service: Failed at step EXEC spawning /usr/bin/pmrepconf: No such file or directory

The pmrepconf tool is used in the unit file of the pcp2elasticsearch.service service. The issue is that pmrepconf was introduced in pcp version 5.2.0, while all the distros mentioned above ship an older pcp version where this tool is not available.

If this is intentional and the metrics_into_elasticsearch functionality is supposed to work only with pcp >= 5.2.0, let me know and I will modify my tests accordingly.
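If pcp >= 5.2.0 turns out to be a hard requirement, the check could be gated on the installed pcp version. A sketch of such a guard (task shape is illustrative, not the role's code):

```yaml
  tasks:
    - name: Gather package facts to determine the installed pcp version
      package_facts:

    - name: Check if pcp2elasticsearch is installed
      command: test -x /usr/bin/pcp2elasticsearch
      when: ansible_facts.packages['pcp'][0].version is version('5.2.0', '>=')
```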

Metrics sub-roles are not visible to Ansible in default configuration

The metrics role uses the following set of sub-roles:

  • performancecopilot_metrics_bpftrace
  • performancecopilot_metrics_mssql
  • performancecopilot_metrics_elasticsearch
  • performancecopilot_metrics_pcp
  • performancecopilot_metrics_grafana
  • performancecopilot_metrics_redis

Unfortunately these sub-roles are not visible to Ansible in the default configuration.

Let's have a playbook /root/myplaybook.yml in root's home directory:

# SPDX-License-Identifier: MIT
---
- name: My playbook
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_from_elasticsearch: yes

When the playbook is run with ansible-playbook /root/myplaybook.yml, it fails with the following error:

TASK [Setup Elasticsearch metrics] ********************************************************
ERROR! the role 'performancecopilot_metrics_pcp' was not found in /root/roles:/root/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles:/root

The error appears to be in '/usr/share/ansible/roles/metrics/roles/performancecopilot_metrics_elasticsearch/tasks/main.yml': line 24, column 11, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  import_role:
    name: performancecopilot_metrics_pcp
          ^ here

Note: a patch for issue #33 has been applied here, as the original pcp role does not exist at all.

The issue is: in the default installation/configuration, the metrics role cannot find its own sub-roles.
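As a workaround until the packaging is fixed, the sub-role directory can be added to Ansible's role search path explicitly. A sketch, assuming the install location shown in the error above:

```ini
# ansible.cfg (workaround sketch; the last path is taken from the
# error message above and may differ on your system)
[defaults]
roles_path = /root/roles:/usr/share/ansible/roles:/etc/ansible/roles:/usr/share/ansible/roles/metrics/roles
```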

tests_sanity_bpftrace failure in task Check if allowed users of bpftrace are configured

https://fedorapeople.org/groups/linuxsystemroles/logs/linux-system-roles-metrics-pull-linux-system-roles_metrics-60-0c55307-centos-8-20210202-173533/artifacts/ansible.log

TASK [Check if allowed users of bpftrace are configured] ***********************
task path: /tmp/tmpl79__1xz/tests/check_bpftrace.yml:6
fatal: [/cache/centos-8.qcow2]: FAILED! => {"changed": true, "cmd": "grep -w '^allowed_users' /var/lib/pcp/pmdas/bpftrace/bpftrace.conf | grep -wq 'pcptest'", "delta": "0:00:00.005343", "end": "2021-02-02 17:47:01.734173", "msg": "non-zero return code", "rc": 1, "start": "2021-02-02 17:47:01.728830", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

@natoscott @kurik

Rerunning the role fails after changing Grafana admin credentials

After running the role and then logging in to Grafana (which forces a password change), rerunning the role fails:

TASK [rhel-system-roles.metrics : Ensure graphing service runtime settings are configured] ************************************************************************************************************************
fatal: [localhost]: FAILED! => {"cache_control": "no-cache", "changed": false, "connection": "close", "content": "{"message":"Invalid username or password"}", "content_length": "42", "content_type": "application/json; charset=UTF-8", "date": "Mon, 16 Nov 2020 09:33:25 GMT", "elapsed": 0, "expires": "-1", "json": {"message": "Invalid username or password"}, "msg": "Status code was 401 and not [200]: HTTP Error 401: Unauthorized", "pragma": "no-cache", "redirected": false, "status": 401, "url": "http://admin:admin@localhost:3000/api/plugins/performancecopilot-pcp-app/settings", "x_frame_options": "deny"}

While the check is probably helpful initially, I think the role should somehow handle the case where Grafana has already been used, to allow rerunning the role. Thanks.

"Ensure performance metric collector authentication is configured" task fails when run on CentOS 9 Stream system

When running the Metrics role against a CentOS Stream 9 system, the "Ensure performance metric collector authentication is configured" task fails with error:

fatal: [c9s-server1.example.com]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: '__pcp_sasl_mechlist' is undefined"}

This is probably related to there being no metrics/roles/pcp/vars/CentOS_9.yml file to define this variable.

Here are the ansible_distribution variables gathered from my CentOS 9 Stream system:

"ansible_distribution": "CentOS",
"ansible_distribution_file_parsed": true,
"ansible_distribution_file_path": "/etc/redhat-release",
"ansible_distribution_file_variety": "RedHat",
"ansible_distribution_major_version": "9",
"ansible_distribution_release": "NA",
"ansible_distribution_version": "9",

rhel8.conf

Hi,
You added in your initial patch a rhel8.conf

[options]
version = 1

[rhel8-zeroconf]
interval = 1s
#proc metrics
proc.psinfo.cmd = ,
proc.psinfo.sname = ,
proc.psinfo.ppid = ,

What is it used for?
Is it required when installing pcp-zeroconf?

@lberk

MSSQL agent does not register itself in PMCD due to missing python3-pyodbc package

When metrics_from_mssql: yes is set in a playbook, the role installs the MSSQL agent. However, registration of this agent in PMCD fails due to the missing python3-pyodbc package.

The registration failure is not detected by the playbook run. The playbook run succeeds, but the MSSQL agent simply does not work because it never gets registered.

Note: This has been tested on Fedora-33 and RHEL-8.4-Development distros. As I do not have Debian or other distros available, I cannot confirm this issue on them.

Local role variable "role_name" conflicts with global variable of the same name

The role uses a locally defined variable role_name. Unfortunately, a global variable of the same name (see the Ansible docs) sometimes overrides the local value.

One example of such a conflict is the generation of the pcp2elasticsearch.service file from the roles/performancecopilot_metrics_elasticsearch/templates/pcp2elasticsearch.service.j2 template. The generated file looks like this:

[Unit]
Description=pcp-to-elasticsearch metrics export service
Documentation=man:pcp2elasticsearch(1)
After=network-online.target pmcd.service

[Service]
TimeoutSec=10
ExecStartPre=/usr/bin/pmrepconf -c \
             --option interval=60 \
             --option es_index=pcp \
             --option es_hostid= \
             --option es_server=http://localhost:9200 \
             --option es_search_type=pcp-/usr/share/ansible/roles/rhel-system-roles.metrics/roles/performancecopilot_metrics_elasticsearch \
             /etc/pcp/pcp2elasticsearch.conf
ExecStart=/usr/bin/pcp2elasticsearch --include-labels :metrics
Restart=on-failure

[Install]
WantedBy=multi-user.target

As can be seen, the value of the es_search_type command-line switch, which is defined as "{{ metrics_provider }}-{{ role_name }}", contains the full path of the role instead of the expected short role identifier such as metrics.

Perhaps the best way to avoid such namespace collisions would be to rename the locally defined variable role_name to a name that is not used globally.
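A sketch of such a rename (the replacement name `__metrics_role_id` is hypothetical; any name not reserved by Ansible would do):

```yaml
# In the role's defaults or vars, define a private identifier instead
# of relying on the reserved variable `role_name`:
__metrics_role_id: metrics

# The template option would then read:
#   --option es_search_type={{ metrics_provider }}-{{ __metrics_role_id }}
```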

tests_sanity_into_elasticsearch fails - __elasticsearch_packages_export_pcp is undefined

https://fedorapeople.org/groups/linuxsystemroles/logs/linux-system-roles-metrics-pull-linux-system-roles_metrics-60-0c55307-fedora-32-20210202-172849/artifacts/ansible.log

TASK [/tmp/tmp_puugelq/roles/elasticsearch : Establish Elasticsearch metrics export package names] ***
task path: /tmp/tmp_puugelq/roles/elasticsearch/tasks/main.yml:20
fatal: [/cache/fedora-32.qcow2]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: '__elasticsearch_packages_export_pcp' is undefined\n\nThe error appears to be in '/tmp/tmp_puugelq/roles/elasticsearch/tasks/main.yml': line 20, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Establish Elasticsearch metrics export package names\n  ^ here\n"}

@natoscott @kurik

Configuration of Elasticsearch, MSSQL and BPFtrace agents fail

Configuration of Elasticsearch agent fails with the following error:

fatal: [10.0.139.221]: FAILED! => {"changed": false, "checksum": "c076ec4531eacb5b8058b3a0a5b82a1218acf987", "msg": "Destination directory /etc/pcp/elasticsearch does not exist"}

The issue occurs when the following playbook runs:

# SPDX-License-Identifier: MIT
---
- name: Install Elastic search
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_from_elasticsearch: yes

The problem is in the roles/performancecopilot_metrics_elasticsearch/tasks/main.yml file, in the section named Ensure PCP Elasticsearch agent is configured, where Ansible tries to install an elasticsearch config file from a template into the /etc/pcp/elasticsearch directory. Unfortunately, the destination directory does not exist, because at that point the pcp-pmda-elasticsearch package, which owns the /etc/pcp/elasticsearch directory, is not yet installed.

  • Note 1: The same issue applies to roles/performancecopilot_metrics_mssql/tasks/main.yml and pcp-pmda-mssql
  • Note 2: The same issue applies to roles/performancecopilot_metrics_bpftrace/tasks/main.yml and pcp-pmda-bpftrace
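A possible fix (a sketch, not the role's actual code; the template source name is illustrative) is to install the PMDA package, which owns the destination directory, before templating the config file into it:

```yaml
- name: Ensure the PCP Elasticsearch PMDA package is installed
  package:
    name: pcp-pmda-elasticsearch
    state: present

- name: Ensure PCP Elasticsearch agent is configured
  template:
    src: elasticsearch.conf.j2   # illustrative template name
    dest: /etc/pcp/elasticsearch/elasticsearch.conf
```

The same ordering would apply to the mssql and bpftrace sub-roles.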

SASL authentication is not configured properly

I am using the following playbook:

# SPDX-License-Identifier: MIT
---
- name: Ensure that authentication is configured
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_from_bpftrace: yes
        metrics_username: pcptest
        metrics_password: tdlendle

  tasks:
    - name: Check if authentication functionality works
      shell: sasldblistusers2 -f /etc/pcp/passwd.db | grep -wq pcptest

    - name: Check if a client can access metrics
      command: pminfo -f -h "pcp://127.0.0.1?username=pcptest&password=tdlendle" disk.dev.read

Issue # 1

The expectation is that the role will set a SASL password for the pcptest user. Unfortunately, this does not happen and the /etc/pcp/passwd.db file is not created. Digging a bit deeper into the role, the problem is IMO in the file roles/performancecopilot_metrics_pcp/tasks/pmcd.yml, namely in its section Ensure performance metric collector SASL accounts are configured.

In this section the name of a SASL user is expected to be stored in a field saslname, but no such field is defined. The role uses the field sasluser instead, which is set to the expected value.

When I change saslname to sasluser in the roles/performancecopilot_metrics_pcp/tasks/pmcd.yml file, the role generates the expected /etc/pcp/passwd.db file.

Issue # 2

Even after applying the change described above (issue #1), the created /etc/pcp/passwd.db file is empty (contains no users). That is because the password for the user is set in roles/performancecopilot_metrics_pcp/tasks/pmcd.yml using the saslpasswd2 command with the -n switch, which prevents the command from storing credentials. Removing the -n switch fixes the issue: /etc/pcp/passwd.db then contains the password for the user, and the command sasldblistusers2 -f /etc/pcp/passwd.db | grep -wq pcptest on the host machine succeeds.

Issue # 3

After applying the fixes described above (issues #1 and #2), there is still one problem: the command pminfo -f -h "pcp://127.0.0.1?username=pcptest&password=tdlendle" disk.dev.read fails.
On the host system, the cyrus-sasl-scram package is not installed. When I install this package manually, everything starts to work as expected.

The cyrus-sasl-scram package is defined in the role, in the file roles/performancecopilot_metrics_pcp/vars/RedHat.yml, as the variable __pcp_packages_sasl. However, as far as I can see, this variable is not used anywhere else.
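Pulling the three fixes together, the account-creation step could look roughly like this (a sketch, not the role's exact code; the field names follow the issue text, and `saslpasswd2 -p` reads the password from stdin):

```yaml
- name: Ensure performance metric collector SASL accounts are configured
  shell: >-
    echo "{{ item.saslpassword }}" |
    saslpasswd2 -p -a pmcd "{{ item.sasluser }}"
  no_log: true
  loop:
    - { sasluser: "{{ metrics_username }}", saslpassword: "{{ metrics_password }}" }

- name: Ensure the SCRAM SASL mechanism is available
  package:
    name: cyrus-sasl-scram
    state: present
```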

Update field names collected from the openmetrics endpoint

I need to update the field names you mentioned in issue #3, and add additional metadata.
In Elasticsearch we send JSON documents.

This is an example of how it should look when we send to Elasticsearch:

{
	"time": "2017-01-19T:29:10+00:00",
	"collectd.processes.ps_code": 21635072,
	"dstypes": "gauge",
	"interval": 10.0,
	"host": "dhcp-0-135.tlv.redhat.com",
	"plugin": "processes",
	"plugin_instance": "collectd",
	"type": "ps_code",
	"type_instance": "",
	"ovirt": {
		"entity": "host",
		"host_id": "{{ ovirt_vds_vds_id }}",
		"engine_fqdn": "{{ ovirt_engine_fqdn }}",
		"cluster_name": "{{ ovirt_vds_cluster_name }}"
	},
	"tag": "project.ovirt-metrics-{{ ovirt_env_name }}",
	"hostname": "hostname",
	"ipaddr4": "ip address"
}

The fields under "ovirt" are additional metadata.
@lberk

"Install Redis packages" task fails when run on CentOS 9 Stream system

When running the Metrics role against a CentOS Stream 9 system, the "Install Redis packages" task fails with error:

fatal: [c9s-server1.example.com]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: '__redis_packages_extra' is undefined\n\nThe error appears to be in '/usr/share/linux-system-roles/metrics/roles/redis/tasks/main.yml': line 15, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Install Redis packages\n ^ here\n"}

This is probably related to there being no metrics/roles/redis/vars/CentOS_9.yml file to define this variable.

Here are the ansible_distribution variables gathered from my CentOS 9 Stream system:

    "ansible_distribution": "CentOS",
    "ansible_distribution_file_parsed": true,
    "ansible_distribution_file_path": "/etc/redhat-release",
    "ansible_distribution_file_variety": "RedHat",
    "ansible_distribution_major_version": "9",
    "ansible_distribution_release": "NA",
    "ansible_distribution_version": "9",

Specify grafana username/password

When you use the metrics role with metrics_graph_service set to true and then log in to the Grafana instance for the first time, you need to change the username and password. But when you run a playbook with the metrics role again, it fails on the Ensure graphing service runtime settings are configured task, since Ansible can no longer log in to Grafana.

TASK [/usr/share/linux-system-roles/metrics/roles/grafana : Ensure graphing service runtime settings are configured] ***                           
fatal: [centos]: FAILED! => {"cache_control": "no-cache", "changed": false, "connection": "close", "content": "{\"message\":\"invalid username or password\"}", "content_length": "42", "content_type": "application/json; charset=UTF-8", "date": "Sat, 11 Dec 2021 23:26:16 GMT", "elapsed": 0, "expires": "-1", "json": {"message": "invalid username or password"}, "msg": "Status code was 401 and not [200]: HTTP Error 401: Unauthorized", "pragma": "no-cache", "redirected": false, "status": 401, "url": "http://admin:admin@localhost:3000/api/plugins/performancecopilot-pcp-app/settings", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_xss_protection": "1; mode=block"}

Would it be possible to specify the Grafana username/password to prevent this failure? Ideally I would like to set the password directly in the playbook to avoid setting it on first login, but I'm not sure whether Grafana provides a simple way to do that.
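A hypothetical interface for this (these variable names do not exist in the role today) could look like:

```yaml
- role: linux-system-roles.metrics
  vars:
    metrics_graph_service: yes
    # Hypothetical options, not current role variables:
    metrics_graph_admin_user: admin
    metrics_graph_admin_password: "{{ vault_grafana_password }}"
```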

Missing default value for "elasticsearch_agent" causes "performancecopilot_metrics_elasticsearch" role to fail

The field elasticsearch_agent in the roles/performancecopilot_metrics_elasticsearch/tasks/main.yml file does not have a default value defined. This causes the role to fail when metrics_from_elasticsearch: yes is set in a playbook.

Here is an example playbook:

# SPDX-License-Identifier: MIT
---
- name: Ensure that the role runs
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_from_elasticsearch: yes

Here is the error message as reported by Ansible:

RUNNING HANDLER [performancecopilot_metrics_pcp : restart pmlogger] ***********************
fatal: [10.0.139.221]: FAILED! => {"msg": "The conditional check 'elasticsearch_agent | bool' failed. The error was: error while evaluating conditional (elasticsearch_agent | bool): 'elasticsearch_agent' is undefined\n\nThe error appears to be in '/usr/share/ansible/roles/metrics/roles/performancecopilot_metrics_pcp/handlers/main.yml': line 19, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: restart pmlogger\n  ^ here\n"}

[PCP] Add support for sending metrics to Viaq elasticsearch

We need to be able to send metrics to Viaq; for that we need support for cert-based authentication, Elasticsearch index parameters, buffer handling, and a back-off mechanism.

In the Rsyslog role, the request also sends the following parameters:
type="omelasticsearch"
name="{{ res.name | default('viaq-elasticsearch') }}"
server="{{ res.server_host | default('logging-es') }}"
serverport="{{ res.server_port | default(9200) | int }}"
template="viaq_template"
searchIndex="index_template"
dynSearchIndex="on"
searchType="com.redhat.viaq.common"
bulkmode="on"
writeoperation="create"
bulkid="id_template"
dynbulkid="on"
retryfailures="on"
retryruleset="try_es"
usehttps="on"

In Fluentd we set the following parameters:

@type elasticsearch
host {{ fluentd_elasticsearch_host }}
port {{ fluentd_elasticsearch_port }}
scheme https
client_cert {{ fluentd_elasticsearch_client_cert_path }}
client_key {{ fluentd_elasticsearch_client_key_path }}
ca_file {{ fluentd_elasticsearch_ca_cert_path }}
ssl_verify {{ fluentd_elasticsearch_ssl_verify|lower }}
target_index_key {{ fluentd_elasticsearch_target_index_key }}
remove_keys {{ fluentd_elasticsearch_remove_keys }}
type_name {{ fluentd_elasticsearch_type_name_metrics }}
request_timeout {{ fluentd_elasticsearch_request_timeout_metrics }}

Buffer configurations:
flush_interval {{ fluentd_flush_interval_metrics }}
buffer_chunk_limit {{ fluentd_buffer_chunk_limit_metrics }}
buffer_queue_limit {{ fluentd_buffer_queue_limit_metrics }}
buffer_queue_full_action {{ fluentd_buffer_queue_full_action_metrics }}
retry_wait {{ fluentd_retry_wait_metrics }}
retry_limit {{ fluentd_retry_limit_metrics }}
disable_retry_limit {{ fluentd_disable_retry_limit_metrics }}
max_retry_wait {{ fluentd_max_retry_wait_metrics }}
flush_at_shutdown {{ fluentd_flush_at_shutdown_metrics }}
num_threads {{ fluentd_num_threads_metrics }}
slow_flush_log_threshold {{ fluentd_slow_flush_log_threshold_metrics }}

Can you please update the status in PCP?
What is missing, and will it be possible to implement?

This is a blocker for oVirt.

@lberk @pcahyna @tabowling

Grafana install fails

The Galaxy playbook is failing to install Grafana. The error is:

fatal: [localhost]: FAILED! => {"changed": false, "msg": "No package matching 'grafana' found available, installed or updated", "rc": 126, "results": ["No package matching 'grafana' found available, installed or updated"]}

Grafana is available to be installed via YUM:

[justin@netmon2 pcp-install]$ sudo yum search grafana
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.datto.com
 * epel: mirror.grid.uchicago.edu
 * extras: mirrors.xtom.com
 * updates: centos.vwtonline.net
================================================================================= N/S matched: grafana =================================================================================
pcp-webapp-grafana.noarch : Grafana web application for Performance Co-Pilot (PCP)

The OS running the playbook is CentOS 7.6.

Multiple inputs to multiple outputs

Does PCP allow collecting metrics from multiple inputs and defining a
different output to each one?

Yes. PCP operates on a pull model. Each 'client' specifies which
metrics it wants from which pmcd (the pcp 'daemon') and they're
responded to accordingly.

For example, ovirt metrics (collected from collectd) to viaq (openshift logging elasticsearch),
and pcp-zeroconf metrics to local.

Yes. That would all work. They can all be different client tools as
well, i.e.:
ovirt metrics (collectd/write_prometheus) collected by pcp's
pmdaprometheus, then written to elasticsearch via pcp2elasticsearch
(either on the same host, remotely, or both).

And then pcp-zeroconf metrics to local. This would be done by pmlogger
logging the metrics to very storage-efficient archives.

Wrong setup of bpftrace users

When a playbook sets metrics_from_bpftrace: yes, the role generates a wrong value for the allowed_users field in the /etc/pcp/bpftrace/bpftrace.conf config file.

The allowed_users field in the generated /etc/pcp/bpftrace/bpftrace.conf config file looks as follows:

allowed_users = root,/usr/share/ansible/roles/metrics/roles/performancecopilot_metrics_bpftrace

The expectation is to have only the root user in the field; the path to the bpftrace sub-role is erroneous.

IMO the issue is in the tasks/main.yml file, where the section "Setup bpftrace metrics" contains the following line:

      - { user: "{{ role_name }}", sasluser: "{{ metrics_username }}", saslpassword: "{{ metrics_password }}" }

The user field expands to the path of the sub-role.

The role fails on RHEL-6 due to missing "cyrus-sasl-scram" package

Let's have the following playbook:

# SPDX-License-Identifier: MIT
---
- name: Ensure that the role runs
  hosts: all

  roles:
    - role: linux-system-roles.metrics

When this playbook runs on RHEL-6, it fails with the following error message:

fatal: [10.0.139.92]: FAILED! => {"changed": false, "msg": "No package matching 'cyrus-sasl-scram' found available, installed or updated", "rc": 126, "results": ["cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 providing cyrus-sasl-lib is already installed", "No package matching 'cyrus-sasl-scram' found available, installed or updated"]}

IMO the reason is in the file roles/performancecopilot_metrics_pcp/vars/RedHat.yml, where this package is requested to be installed. Unfortunately, the cyrus-sasl-scram package was introduced in RHEL/CentOS >= 7 and is not available on RHEL-6 or CentOS-6.

While on CentOS the cyrus-sasl-scram package is requested only on CentOS-7 and CentOS-8 (in the roles/performancecopilot_metrics_pcp/vars/CentOS_8.yml and roles/performancecopilot_metrics_pcp/vars/CentOS_7.yml files), on RHEL Ansible tries to install it on RHEL-6 as well.
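A sketch of a possible fix in the RedHat vars file, making the package list conditional on the major version (the exact variable shape is assumed from the role's naming convention):

```yaml
# roles/performancecopilot_metrics_pcp/vars/RedHat.yml (sketch)
__pcp_packages_sasl: "{{ ['cyrus-sasl-scram']
  if ansible_distribution_major_version | int >= 7
  else [] }}"
```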

The role uses a wrong role name when pointing to itself

In the metrics role, the following tasks point back to the role itself:

  • roles/performancecopilot_metrics_bpftrace/tasks/main.yml
  • roles/performancecopilot_metrics_elasticsearch/tasks/main.yml
  • roles/performancecopilot_metrics_mssql/tasks/main.yml

However, they point to it as the pcp role, while the role itself is called performancecopilot_metrics_pcp.

Any playbook enabling bpftrace, elasticsearch, or mssql fails on this issue. Renaming the role from pcp to performancecopilot_metrics_pcp in the three tasks above fixes the issue.

Configuration of bpftrace PMDA does not work on platforms where PCP version is less or equal to 5.1

Background
The bpftrace agent is delivered to a system via the pcp-pmda-bpftrace package. Starting from PCP version 5.2, the pcp-pmda-bpftrace package changed the layout of files on the filesystem. While in PCP <= 5.1 the PMDA files are located in the /var/lib/pcp/pmdas/bpftrace directory, in PCP >= 5.2 the files are located in /usr/libexec/pcp/pmdas/bpftrace and bpftrace.conf is located in the /etc/pcp/bpftrace/ directory. For PCP >= 5.2 there are symlinks from the old /var/lib/pcp/pmdas/bpftrace directory to the new locations, to preserve backward compatibility.

The issue
When metrics_from_bpftrace: yes is set in a playbook, the role generates the bpftrace.conf file in the /etc/pcp/bpftrace/ directory (so it supports PCP >= 5.2). However, when the role runs on a platform with PCP <= 5.1, the PMDA expects the config file to be in the /var/lib/pcp/pmdas/bpftrace directory. As such, platforms with PCP <= 5.1 do not have the bpftrace agent configured properly (a default config file is used instead).

List of affected platforms:

  • RHEL <= 8.3
  • CentOS - all currently available releases

The same issue also affects the configuration files of the mssql and elasticsearch PMDAs.
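One possible approach (a sketch; the variable name is illustrative) is to select the config directory based on the installed pcp version:

```yaml
- name: Gather package facts to determine the installed pcp version
  package_facts:

- name: Select the bpftrace configuration directory
  set_fact:
    __bpftrace_conf_dir: >-
      {{ '/etc/pcp/bpftrace'
         if ansible_facts.packages['pcp'][0].version is version('5.2', '>=')
         else '/var/lib/pcp/pmdas/bpftrace' }}
```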

PMCD is not restarted after installation of ElasticSearch agent

When I try to enable import of metrics from Elasticsearch, I need to manually restart pmcd on the host machine after the playbook run. Without the restart, the elasticsearch agent is not registered.

The role installs the pcp-pmda-elasticsearch package and creates the /var/lib/pcp/pmdas/elasticsearch/.NeedInstall file. However, pmcd is not restarted afterwards, so the pmcd startup script does not register the agent.

I am using a playbook like this one:

# SPDX-License-Identifier: MIT
---
- name: Make ElasticSearch metrics available
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_from_elasticsearch: yes

Note: When I run the playbook above, I do not see any execution of the restart pmcd handler in the log. However, if the playbook explicitly forces execution of handlers ...

# SPDX-License-Identifier: MIT
---
- name: Make ElasticSearch metrics available
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_from_elasticsearch: yes

  tasks:
    - name: Flush handlers
      meta: flush_handlers

... then I see the following message in the log:

RUNNING HANDLER [performancecopilot_metrics_pcp : restart pmcd] ***************************
skipping: [10.0.137.131]

So it looks like the role sends a notification to restart pmcd, but the handler is skipped for some reason.
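Until the handler condition is fixed, restarting pmcd explicitly works around the problem; a sketch of an extra task that could be appended to the playbook above:

```yaml
- name: Ensure pmcd picks up the newly installed Elasticsearch agent
  service:
    name: pmcd
    state: restarted
```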

Missing installation of BCC agent when "metrics_graph_service: yes" is set

When metrics_graph_service: yes is set in a playbook, the role installs the grafana-pcp package. On the latest releases of RHEL and Fedora, the grafana-pcp package (version 3.x.y) delivers a Grafana dashboard PCP Vector: eBPF/BCC Overview. However, this dashboard requires the pcp-pmda-bcc package to be installed and configured.

The metrics role currently does not install the BCC PMDA, so all charts on the mentioned dashboard are in an error state, unable to get metrics from the BCC agent.
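A workaround sketch until the role installs the BCC PMDA itself (the .NeedInstall path follows the convention described in the other issues here):

```yaml
- name: Ensure the BCC PMDA is installed
  package:
    name: pcp-pmda-bcc
    state: present

- name: Mark the BCC PMDA for registration on the next pmcd start
  file:
    path: /var/lib/pcp/pmdas/bcc/.NeedInstall
    state: touch

- name: Restart pmcd to register the BCC PMDA
  service:
    name: pmcd
    state: restarted
```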

PMCD is not restarted on RHEL platform

I have a playbook like this:

# SPDX-License-Identifier: MIT
---
- name: Ensure that MSSQL is configured
  hosts: all

  roles:
    - role: linux-system-roles.metrics
      vars:
        metrics_from_mssql: yes

  pre_tasks:
    - name: Ensure python3-pyodbc is installed
      package:
        name: python3-pyodbc
        state: present

  tasks:
    - name: Check if MSSQL functionality works
      shell: pmprobe -I pmcd.agent.name | grep -w '"mssql"'

On Fedora and CentOS this works just fine. However on RHEL it fails with the following error message:

TASK [Check if mssql pmda is registered] ************************************************** fatal: [10.0.139.219]: FAILED! => {"changed": true, "cmd": "pmprobe -I pmcd.agent.name | grep -w '\"mssql\"'", "delta": "0:00:00.012001", "end": "2021-01-29 04:59:55.181201", "msg": "non-zero return code", "rc": 1, "start": "2021-01-29 04:59:55.169200", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

... so the registration of the PMDA has not finished. The /var/lib/pcp/pmdas/mssql/.NeedInstall file is created, but pmcd is not restarted.

The reason is IMO somehow related to the pcp-zeroconf package, which starts and enables the pmcd service only on the RHEL platform; on Fedora and CentOS, pmcd is neither started nor enabled after boot.
