
probes's Introduction

Rucio - Scientific Data Management

Rucio is a software framework that provides functionality to organize, manage, and access large volumes of scientific data using customisable policies. The data can be spread across globally distributed locations and across heterogeneous data centers, uniting different storage and network technologies as a single federated entity. Rucio offers advanced features such as distributed data recovery and adaptive replication, and is highly scalable, modular, and extensible. Rucio was originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is continuously extended to support the LHC experiments and other diverse scientific communities.

Documentation

General information, API/REST description and guides can be found in our documentation or on our webpage.

Try it out

We provide a dockerized environment which serves both as a demo environment and a development environment. It includes all the necessary preconfigured components for development against multiple storage and transfer systems.

Developers

For information on how to contribute to Rucio, please refer to and follow our CONTRIBUTING guidelines. We strongly recommend using the dockerized environment for development.

Operators

To learn how to deploy and configure Rucio, consult the documentation available online.

Getting Support

If you are looking for support, please contact us via one of our official channels.

probes's People

Contributors

agbogdan, arisfkiaras, bari12, cserf, davidgcameron, dchristidis, ericvaandering, faluchet, fernandogarzon, gumond, hahahannes, jwackito, mlassnig, nikmagini, panos512, tbeerman, tomasjavurek, vigne, vingar, voetberg, vokac, wguanicedew


probes's Issues

Modify VOMS collector to ban identities

Motivation

One identity can map to multiple accounts, so banning a single account is not enough.
Identities that correspond to people banned in VOMS must be removed from all accounts they map to.

Migrated from JIRA
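
A minimal sketch of the requested behaviour, assuming the banned DNs are already known (e.g. collected from VOMS). The client methods list_accounts and del_identity are part of the Rucio client API, but the example DN and the X509 authtype here are illustrative assumptions:

```python
from rucio.client import Client

client = Client()
banned_dns = ['/DC=ch/DC=cern/OU=Users/CN=banned.user']  # hypothetical DN from VOMS

for dn in banned_dns:
    # One identity can map to several accounts, so remove it from all of them.
    for account in client.list_accounts(identity=dn):
        client.del_identity(account=account['account'], identity=dn, authtype='X509')
```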

Common probes to use common queries

Both

https://github.com/rucio/probes/blob/master/common/check_obsolete_replicas

and

https://github.com/rucio/probes/blob/master/common/check_expired_locked_rules

explicitly reference ATLAS tables that may not exist for all VOs. As these are common probes, this prevents them from running for every VO.

To fix this, either:

  • replace the query with one using the Rucio data model (continuing from #109), or
  • update it to avoid referencing VO-specific tables (e.g. atlas_rucio.rules -> {schema}.rules), as in the sketch below.
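
A minimal sketch of the {schema} idea, using an expired-locked-rules style query as the example; the query text is illustrative, not the actual probe code:

```python
schema = 'atlas_rucio'  # would come from per-VO configuration

# Only the schema prefix changes between VOs; the query itself stays common.
query = """
SELECT COUNT(*) FROM {schema}.rules
WHERE expires_at < sysdate AND locked = 1
""".format(schema=schema)
```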

Probes are hard coded for ATLAS

@dchristidis CMS would like to use some of the ATLAS probes and I've verified that at least one of them works perfectly for us except for this line:

FROM atlas_rucio.requests

Can these probes be parameterized so that we can either supply our own string for "atlas_rucio" or leave it off entirely? (Leaving it off works for us)

If you just want to parameterize one of them and leave the rest for us as we adopt them, that's OK. Whatever is easiest for you.
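
A minimal sketch of the requested parameterization, assuming an optional schema option can be read from the probe's configuration (the section/option names are assumptions; config_get is Rucio's configuration helper):

```python
from rucio.common.config import config_get

# An empty schema means "no prefix", which is what works for CMS.
schema = config_get('common', 'schema', raise_exception=False, default='')
prefix = schema + '.' if schema else ''

query = 'SELECT * FROM {prefix}requests'.format(prefix=prefix)
# ATLAS would configure schema = atlas_rucio; CMS would leave it unset.
```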

check_lost_files: issues

I noticed the following issues while working on improving the look of the reports from the check_lost_files script:

  1. Lost files may be missing from or duplicated in the reports.
    The script can take lost-file information from two sources: the pre-generated dump of lost files on the web
    and, if that fails (a 404 "not found" error sometimes occurs), directly from Rucio. The problem is that these two sources cover two different time intervals.
    The dump gives lost files for the Monday-Sunday interval of the previous week, but the Rucio request gives files from the last 7 days counted from now. If different sources are used in consecutive runs, the two intervals may overlap (causing duplicated info) or leave a gap (causing lost info):
    E.g. the script on 2018-08-08 could not get the dump, so it used the last-7-days interval
    [2018-08-01, 2018-08-08]. In the previous run on 2018-08-01 it used the Mon-Sun dump for
    [2018-07-22, 2018-07-29]. In the next run on 2018-08-15 it used the Mon-Sun dump for
    [2018-08-06, 2018-08-12] => this means that lost files in [2018-07-30, 2018-08-01] are not reported, while lost files in [2018-08-06, 2018-08-08] are reported twice.
    Another problem is that the "from now" selection time is not exact and may vary slightly between script runs, so lost files at the edges of the interval may be missed or duplicated even when the same source is used. For this reason the dump creator on the web should also be reviewed; it may have the same "select from now" issue. (A deterministic interval computation is sketched after this list.)

  2. Select optimization. For some reason, the lost-files info (from both the dump and Rucio) is not the final list used by the script. After receiving the selection result, the script does additional filtering itself: not like 'panda.%' and not like '%_sub%', and it removes duplicates of the same scope:filename (see the next issue). It would be more logical to receive the final lost-files list directly, without extra filtering in Python. Moving this filtering into the SQL select can be done without substantially increasing the query's execution time.

  3. The same files on different RSEs/datasets are ignored. The script considers only the first entry of each "scope:file_name" in the lost-files list. Other entries, from different RSEs or from different datasets, won't be included in the e-mail reports. Since the reports contain datasets and RSEs, or are split by RSE, this looks like a bug.
    E.g. tomorrow's reports will not contain info about the files:
    data17_13TeV:DAOD_SUSY1.19820405._000357.pool.root.1:BNL-OSG2_DATADISK
    mc16_13TeV:EVNT.19802431._001314.pool.root.1:PIC_DATADISK
    etc.
    because they will already contain:
    data17_13TeV:DAOD_SUSY1.19820405._000357.pool.root.1:CERN-PROD_DATADISK
    mc16_13TeV:EVNT.19802431._001314.pool.root.1:UKI-SCOTGRID-GLASGOW_DATADISK
    etc.

  4. Keep a history of lost files. Currently the script does not keep any lost-files history. The request for weekly lost files from Rucio takes about half an hour to complete. If this info is needed and not stored somewhere else, it could make sense for the script to compress and archive it.
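
A minimal sketch of the deterministic interval computation mentioned in issue 1: always report on the previous Monday-Sunday week, whichever source (dump or Rucio) is used:

```python
from datetime import date, timedelta

def previous_week(today=None):
    """Return (monday, sunday) of the week before the one containing `today`."""
    today = today or date.today()
    monday = today - timedelta(days=today.weekday() + 7)
    return monday, monday + timedelta(days=6)

# For the runs mentioned above, both sources would then agree:
print(previous_week(date(2018, 8, 8)))   # (2018-07-30, 2018-08-05)
print(previous_week(date(2018, 8, 15)))  # (2018-08-06, 2018-08-12)
```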

CRIC probe

Implement a simple probe to pull RSE data from CRIC into Rucio.
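
A minimal sketch under the assumption that CRIC exposes an RSE listing as JSON at an endpoint like the one below; the URL and the 'site' field name are assumptions, while the Rucio client call (add_rse_attribute) is part of the real client API:

```python
import requests
from rucio.client import Client

CRIC_URL = 'https://atlas-cric.cern.ch/api/atlas/ddmendpoint/query/?json'  # assumed endpoint

client = Client()
for rse_name, data in requests.get(CRIC_URL, timeout=60).json().items():
    # 'site' is an illustrative field name, not necessarily the real CRIC schema.
    client.add_rse_attribute(rse=rse_name, key='site', value=data.get('site'))
```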

Probes using SQLAlchemy don't work in 1.31

Probes meeting this description generate stack traces with the 1.31 Rucio code. They are fine with Rucio 1.30.8.

I see we pin the version of the Oracle client in the probes build, so that may be the issue, but this needs further investigation.

Probe corrupting ranking value of distances

There seems to be a probe that corrupts the ranking values in the distances table.
The value should be neither negative nor excessively high, yet on ATLAS we are seeing values ranging from -1000 to +1000.
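
For investigation, a minimal sketch of a sanity check over the distances table, assuming direct read access to the database and that the column is named ranking as the issue title suggests (the DSN, table, and column names are assumptions):

```python
from sqlalchemy import create_engine, text

engine = create_engine('oracle://reader:secret@rucio-db')  # placeholder DSN

with engine.connect() as conn:
    # Flag entries outside a plausible range as corruption candidates.
    rows = conn.execute(text(
        "SELECT src_rse_id, dest_rse_id, ranking "
        "FROM distances WHERE ranking < 0 OR ranking > 100"
    ))
    for row in rows:
        print(row)
```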

ATLAS: Fix check_site_status

After the introduction of rucio/rucio#5664, the probe check_site_status needs to be updated for Rucio 1.29.0 to use availability_read, availability_write, and availability_delete instead of availability. For this, the probe also needs to be ported to the API and Python 3.
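
A minimal sketch of the updated probe logic using the Python 3 client API; the parameter names come from the issue text, but whether update_rse accepts them in exactly this form depends on the Rucio version (>= 1.29.0):

```python
from rucio.client import Client

client = Client()
# 'SOME-RSE_DATADISK' is a hypothetical RSE name.
client.update_rse('SOME-RSE_DATADISK', {
    'availability_read': True,
    'availability_write': False,
    'availability_delete': True,
})
```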
