eida / eida-statistics

Aggregated statistics of EIDA nodes

License: GNU General Public License v3.0

Python 96.39% Dockerfile 0.44% Gherkin 3.17%

eida-statistics's Issues

Use a templating system for the documentation

To insert dynamic content (such as the full URL) into the documentation, it would be useful to use a templating system (jinja2 is a popular option).

The URL prefix can be taken from environment variables such as:

EIDASTATS_API_HOST=server.exemple.gr
EIDASTATS_API_PATH=/eidaws/statistics/1
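A minimal sketch of the idea, assuming the two variables above are set in the environment. To stay dependency-free it uses the standard library's string.Template as a stand-in for jinja2 (with jinja2 the placeholder would be written {{ base_url }}). The documentation sentence, the function names and the https scheme are hypothetical.

```python
import os
from string import Template

# Hypothetical documentation snippet; ${base_url} is the dynamic part.
DOC_TEMPLATE = Template("Query the service at ${base_url}/dataselect/public")


def base_url(environ=os.environ):
    """Build the URL prefix from the two environment variables."""
    host = environ.get("EIDASTATS_API_HOST", "localhost")
    path = environ.get("EIDASTATS_API_PATH", "")
    # The https scheme is an assumption; the issue only gives host and path.
    return f"https://{host}{path}"


def render_doc(environ=os.environ):
    """Fill the documentation template with the computed URL prefix."""
    return DOC_TEMPLATE.substitute(base_url=base_url(environ))
```

Passing the environment as an argument keeps the rendering easy to test without touching the real process environment.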

Define the OpenAPI for EIDA Statistics API

Let's prepare an OpenAPI specification of the EIDA Statistics API.

Some references:

WFCatalog OpenAPI specs: https://github.com/EIDA/wfcatalog/blob/master/wfcatalog_swagger.yaml
OpenAPI rendered in Swagger: https://www.orfeus-eu.org/swagger/dist/index.html?url=https://www.orfeus-eu.org/data/eida/webservices/wfcatalog/wfcatalog.yaml

[Aggregator] add compression and send data

The aggregator should compress the data before sending it.
The aggregator should be able to send the aggregation to the central webservice directly.
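Both requirements could be sketched as follows, using only the standard library. The payload shape, the endpoint URL and the use of a Content-Encoding: gzip header are assumptions; the issue does not say how the central webservice expects compressed data.

```python
import gzip
import json
import urllib.request


def build_request(stats, url):
    """Serialize, gzip-compress, and wrap the payload in a POST request."""
    body = gzip.compress(json.dumps(stats).encode("utf-8"))
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",
        },
    )


# Hypothetical aggregation and endpoint, for illustration only.
request = build_request(
    [{"network": "FR", "bytes": 1024, "nb_reqs": 3}],
    "https://server.example/dataselectstats",
)
# urllib.request.urlopen(request) would actually send it.
```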

Permission error on table dataselect_stats is not reported to the client

We should reply with a 500 error in such cases and roll back the transaction.

2023-04-06 14:30:23,068 INFO  [ws_eidastats.helper_functions:134][MainThread] Registering 3557 statistics.
2023-04-06 14:30:23,094 ERROR [ws_eidastats.helper_functions:142][MainThread] Postgresql error 42501 registering statistic
2023-04-06 14:30:23,094 ERROR [ws_eidastats.helper_functions:143][MainThread] ERROR:  permission denied for table dataselect_stats

2023-04-06 14:30:23,094 INFO  [ws_eidastats.helper_functions:144][MainThread] Statistics successfully registered
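The log above shows the success message being emitted right after the error. A minimal sketch of the requested behaviour follows; it uses an in-memory sqlite3 database as a stand-in for PostgreSQL, and a missing table as a stand-in for the permission error. The function name, the two-column table and the return convention are hypothetical.

```python
import sqlite3


def register_statistics(conn, rows):
    """Insert rows; on any database error roll back and report HTTP 500."""
    try:
        conn.executemany(
            "INSERT INTO dataselect_stats (network, bytes) VALUES (?, ?)", rows
        )
        conn.commit()
    except sqlite3.Error as err:
        conn.rollback()  # leave no partial insert behind
        return 500, f"Database error registering statistics: {err}"
    return 200, "Statistics successfully registered"


conn = sqlite3.connect(":memory:")
# The table does not exist yet, which stands in for the permission error.
failure = register_statistics(conn, [("FR", 1024)])
conn.execute("CREATE TABLE dataselect_stats (network TEXT, bytes INTEGER)")
success = register_statistics(conn, [("FR", 1024)])
```

The success message is only reachable when the insert and the commit both went through.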

csv file does not give back network when requested

https://ws.resif.fr/eidaws/statistics/1//dataselect/query?start=2022-01&end=2022-12&datacenter=NOA&network=HP&aggregate_on=location,channel,month,datacenter,network,station,country&format=csv
query.csv

Moreover, format=text does not work; only csv does.

Allow plain dates in start/end parameter

The query /query?start=2022-06-01 should be accepted. Currently it gives:

BAD REQUEST: invalid value of parameter 'end'
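One way to accept both forms is to try each date format in turn. This sketch assumes the statistics are kept at monthly granularity, so the day of a plain date is simply dropped; the function name is hypothetical.

```python
from datetime import datetime


def parse_month(value):
    """Accept 'YYYY-MM' or a plain date 'YYYY-MM-DD'; return 'YYYY-MM'."""
    for fmt in ("%Y-%m", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m")
        except ValueError:
            continue
    raise ValueError(f"invalid date: {value!r}")
```

With this helper, start=2022-06 and start=2022-06-01 resolve to the same month, and anything else still raises an error that can be turned into a 400 reply.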

Inconsistency in client cardinality

(Reported by @vpet98)
I noticed an inconsistency, and I don't know whether it is small enough to be ignored, in the number of clients and HLL objects in the results that the webservice returns.

Try this: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&details=country&format=json
And then the same in node level: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2023-01&country=GR&level=node&details=country&format=json

You would expect the sum of the clients in the results of the second query to be approximately equal to the clients in the first query. But the difference is quite noticeable (78 clients in the first query, 103 clients in total in the second).
It is even worse for countries with more clients (in another example I had 2232 vs 3115 clients).

My SQL query includes this in the select clause: hll_union_agg(dataselect_stats.clients), which should be correct.
Then I use this library: https://github.com/AdRoll/python-hll.
As the library's README indicates, I print the cardinality for each row that the SQL query returns like this: HLL.from_bytes(NumberUtil.from_hex(row.clients[2:], 0, len(row.clients[2:]))).cardinality().

Could you have a quick look at it if there is time?
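One possible explanation, not confirmed in this thread, is that a client who queries several nodes is counted once at EIDA level but once per node at node level, so the per-node counts add up to more than the overall distinct count. The effect can be shown with plain Python sets standing in for the HLL sketches; the node names and client identifiers are made up.

```python
# Hypothetical client identifiers seen by two nodes for the same country.
clients_by_node = {
    "NODE_A": {"c1", "c2", "c3"},
    "NODE_B": {"c3", "c4"},
}

# Per-node counts added together: "c3" is counted once per node.
sum_of_counts = sum(len(clients) for clients in clients_by_node.values())

# Distinct clients over all nodes, which is what a union of the HLL
# sketches estimates.
distinct_overall = len(set().union(*clients_by_node.values()))

print(sum_of_counts, distinct_overall)  # 5 4
```

Under that reading, 103 against 78 would mean that roughly a quarter of the per-node counts are clients seen by more than one node, on top of the HLL estimation error.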

eida_statsman: add an interface to manage network and node policies

Toggle the default policy on a node
- When an operator tries to change the policy on a node, there are two possible behaviours:
  - if the default policy is changed to "open", make sure that all networks are open, and show the operator the list of networks with the resulting restriction
  - otherwise, make sure all networks conform to the default policy. Opening networks has to be done manually
Toggle the policy on a network
List policies for networks (optionally filtered by node)

Starttime mandatory

To be more consistent with other FDSN webservices and to reduce the default size of responses, make starttime mandatory; endtime can remain optional.

Sentry: set DSN by environment variable

Could you remove the DSN from the code and get it from an environment variable?

Also, please look at how to set up the environment (dev, staging, production) so that Sentry can tell them apart.

https://docs.sentry.io/platforms/go/guides/martini/configuration/environments/
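A minimal sketch of reading both values from the environment, assuming the Python SDK (sentry_sdk) is the one in use, since the project is written in Python. The variable names SENTRY_DSN and SENTRY_ENVIRONMENT, the default of "dev" and the helper name are assumptions; dsn and environment are the option names accepted by sentry_sdk.init.

```python
import os


def sentry_settings(environ=os.environ):
    """Collect Sentry options from the environment instead of the code."""
    return {
        # An unset DSN leaves the value at None.
        "dsn": environ.get("SENTRY_DSN"),
        "environment": environ.get("SENTRY_ENVIRONMENT", "dev"),
    }


# With the Python SDK installed, the settings would be applied with:
#     import sentry_sdk
#     sentry_sdk.init(**sentry_settings())
```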

Response content type should be text/json

For now, it says:
Content-Type: text/html; charset=utf-8

All nodes: upgrade eida-statistics-aggregator to 0.6.0

Hello @ALL

I released a new version for the dataselect statistics aggregator.
This release adds identification of temporary networks by their extended identifier. This is important for the statistics because otherwise we mix up statistics from different networks sharing the same short network code.

Could all nodes please upgrade? Depending on your installation method, this should not be much more work than:

pip3 install --upgrade eida-statistics-aggregator

Please note that the minimum Python version is 3.6, but the aggregator can run in its own isolated environment without problems. It has been tested up to Python 3.10.

Please report in this issue when you're done:

Implement a caching system

To avoid query flooding and provide faster replies, implement caching of requests.

See https://docs.python.org/3/library/functools.html and https://realpython.com/lru-cache-python/
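As a minimal sketch of what the functools approach looks like, the function below stands in for the expensive database query; its name, parameters and return value are hypothetical.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=256)
def run_query(start, end, level):
    """Stand-in for the expensive database query (hypothetical)."""
    global calls
    calls += 1
    return f"stats {start}..{end} at level {level}"


run_query("2022-01", "2022-12", "node")
run_query("2022-01", "2022-12", "node")  # served from the cache
print(calls, run_query.cache_info().hits)  # 1 1
```

Note that lru_cache has no expiry: statistics ingested after a result was cached are not reflected until run_query.cache_clear() is called, so ingestion would probably need to clear the cache.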

Rewrite webservice/README

Extra information at the top of CSV format

I really like the extra comment lines you included at the top of the CSV (#40).
Could you please consider including one more piece of information?
For instance, rejected or malformed parameters:

# request_parameters: start=2022-01&end=2022-12&details=month&format=csv
# rejected_parameters: groupby=day
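A sketch of how such a header could be produced from the query string. The set of known parameters and the function name are assumptions, and the sketch only separates unknown parameter names; detecting malformed values would need per-parameter checks.

```python
from urllib.parse import parse_qsl, urlencode

# Assumed set of parameters the endpoint understands.
KNOWN_PARAMETERS = {"start", "end", "details", "level", "format"}


def csv_header(query_string, version="1.0.0"):
    """Build the comment lines that precede the CSV rows."""
    accepted, rejected = [], []
    for key, value in parse_qsl(query_string):
        (accepted if key in KNOWN_PARAMETERS else rejected).append((key, value))
    lines = [
        f"# version: {version}",
        f"# request_parameters: {urlencode(accepted)}",
    ]
    if rejected:
        lines.append(f"# rejected_parameters: {urlencode(rejected)}")
    return "\n".join(lines)
```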

Sort CSV output

CSV output should be sorted by date when details=month or year

Example:
curl -X 'GET' 'https://ws.resif.fr/eidaws/statistics/1/dataselect/public?start=2022-01&end=2022-12&details=month&format=csv'

# version: 1.0.0
# request_parameters: start=2022-01&end=2022-12&details=month&format=csv
date,node,network,station,location,channel,country,bytes,nb_reqs,nb_successful_reqs,clients
2022-09,*,*,*,*,*,*,49249517419520,93309158,61742567,3752
2022-04,*,*,*,*,*,*,52075391539200,70253741,56097249,5135
2022-03,*,*,*,*,*,*,35866232961024,76959640,62862467,6096
2022-07,*,*,*,*,*,*,47809205437440,100682394,86495962,4220
2022-08,*,*,*,*,*,*,41827452808448,199812690,111005715,3361
2022-10,*,*,*,*,*,*,34598181185536,84436994,64883858,4267
2022-06,*,*,*,*,*,*,54756623463168,92399681,75015880,4025
2022-12,*,*,*,*,*,*,75743023855104,115305619,82503762,4524
2022-02,*,*,*,*,*,*,49705000816128,92485574,76534546,4626
2022-05,*,*,*,*,*,*,70791218339072,69100676,53027093,4143
2022-11,*,*,*,*,*,*,31853315838464,122664892,65935181,4714
2022-01,*,*,*,*,*,*,47874364480512,70079038,57161798,3733
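Because the date column is formatted as YYYY-MM (or YYYY), a plain string sort on the first column is already chronological. A minimal sketch with the standard csv module, using three rows from the output above abbreviated to four columns:

```python
import csv
import io

raw = """date,node,network,bytes
2022-09,*,*,49249517419520
2022-04,*,*,52075391539200
2022-01,*,*,47874364480512
"""

reader = csv.reader(io.StringIO(raw))
header = next(reader)
# 'YYYY-MM' and 'YYYY' strings sort chronologically as plain text.
rows = sorted(reader, key=lambda row: row[0])

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

Sorting in the SQL query itself (an ORDER BY on the date column) would achieve the same without post-processing, if the query layer allows it.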

All webservice methods in one Flask application

Currently, the webservices /statistics/1/* and /dataselectstats are written to be executed in separate Flask applications.

I would like to serve both in one single application:

POST /dataselectstats => statistics ingestion
GET /dataselectstats => statistics query
GET /query
GET /health
GET / => documentation

Also, do not declare the statistics/1/ prefix in the routes, as it will be set on the deployment side.

You can reorganize the project to split the routes and the methods as you see fit.

Output of human example links

Hello,

Thanks for this very nice webservice.
While playing with the example links for humans, I came across a question about the CSV content.

The nb_reqs column always shows None. Shouldn't it be at least equal to the nb_successful_reqs column?

Also, the country column always shows *. Maybe this feature is not yet implemented?

Aggregator: duplicated logs should be taken into account

It should be enough to identify them by their creation time.

Group all statistics regarding restricted networks in "Other"

When giving statistics to a user that is not authorized to see stats
AND
When there is more than one level in the result
Show all the restricted statistics summed up in an "Other" network item.

If there is only one restricted network in the result, reply 403 Forbidden.
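The rule could be sketched as below. The row layout, the function name and the use of PermissionError as the signal for a 403 reply are hypothetical, and the sketch reads "only one restricted network in the result" as "the result contains nothing but restricted networks". Only additive columns are summed; client cardinalities cannot simply be added and are left out.

```python
def mask_restricted(rows, restricted):
    """Sum the rows of restricted networks into one 'Other' row.

    Raises PermissionError (mapped to HTTP 403 by the caller) when every
    row belongs to a restricted network.
    """
    visible = [row for row in rows if row["network"] not in restricted]
    hidden = [row for row in rows if row["network"] in restricted]
    if hidden and not visible:
        raise PermissionError("statistics for this network are restricted")
    if hidden:
        visible.append({
            "network": "Other",
            "bytes": sum(row["bytes"] for row in hidden),
            "nb_reqs": sum(row["nb_reqs"] for row in hidden),
        })
    return visible
```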

empty stats for GFZ

Thanks for publishing this interface. When retrieving yearly network statistics for each node I get results for all nodes except GFZ:

https://ws.resif.fr/eidaws/statistics/1/dataselect/query?start=2022-01&end=2022-12&datacenter=GFZ&aggregate_on=month,station,country&format=json

returns an empty result. The same happens with unknown data center names. It would be better to return an error if the data center name is invalid.

I also tried "../submit/.." instead of "../dataselect/..". This doesn't work at all.

A simple method to get nodes and networks

Two public endpoints are missing:

/nodes to list all nodes in json format with their default policy
/networks to list all known networks with their restriction policy

The endpoint _nodes could be deleted.

Review the first specifications

There is a first specification available at https://github.com/EIDA/eida-statistics/blob/main/ingestor_specs.md

@ALL, would you please comment?

It's very basic, and should be straightforward to implement (at least the ingestion part). Thank you.

Internal 500 errors

Based on Sentry issues and https://docs.sqlalchemy.org/en/20/errors.html#error-3o7r, I think we need to increase the size of the QueuePool that SQLAlchemy uses for connections.

I'll commit this to the development branch first, though the fix can be tested more effectively once it goes into production.

Use just one connection to the database backend

Instead of opening one connection to the SQL backend on each request, use the native SQLAlchemy method to interact with the database.

This is usually done with a singleton object that manages the database connection; all the other functions build the SQL statement and pass it to this object.

Add a webservice for getting the statistics

The first task is to define an API in the OpenAPI 3 standard, for instance using the Swagger online tools.

To design a suitable API, you can look at the matrix document. The first two rows define the questions and the granularity level.

The code attached to this project needs better documentation; I'm on it (see issue #11).

The datamodel is specified in the code : https://github.com/EIDA/eida-statistics/tree/main/backend_database

You can use this project to bring up your own empty database if needed.

You can create a directory for the webservice specification and implementation at the root of this project.

Layout of the documentation

Change the title (Swagger UI -> EIDA statistics)
Remove the banner where user can change the opapi.yaml URL

Replace datacenter and network management API with an information method

/nodes/id would show the information about a node, which is basically its default restriction policy.

More information could be added in the future, for instance the latest payload submitted.

Strange distribution of data from some nodes.

Something strange happens with network FR.

FR seems to be distributed through RESIF, ETH and ICGC.

It might be that the ETH logging for FR stops at the beginning of 2022, so this might be a temporary problem, but it would be nice to understand what is happening and whether something needs to be fixed.

A clear bug is that the number of users per year is shown only for ETH.

See result of this query: https://ws.resif.fr/eidaws/statistics/1/dataselect/public?network=FR&start=2021-01&end=2023-12&level=node&format=json

public(2).csv

Change parameter aggregate_on

On /public and /restricted methods, change aggregate_on to:

level

one value in datacenter,network,station,location,channel
if no value is provided, the server responds at EIDA level, all datacenters grouped

details

Will show the details of the query.
Possible values are:

month or year
countries

Multiple values are allowed. If both month and year are specified, reply 400 with a helpful detail message.
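The proposed rules can be sketched as a small validation function. The accepted values are taken from this issue as written (other issues in this list use level=node and details=country); the function name, the return convention and the "eida" label for the default level are hypothetical.

```python
LEVELS = {"datacenter", "network", "station", "location", "channel"}
DETAILS = {"month", "year", "countries"}


def validate(level=None, details=""):
    """Return (status, message_or_params) for the proposed parameters."""
    if level is not None and level not in LEVELS:
        return 400, f"invalid level: {level}"
    chosen = [d for d in details.split(",") if d]
    unknown = [d for d in chosen if d not in DETAILS]
    if unknown:
        return 400, f"invalid details: {','.join(unknown)}"
    if "month" in chosen and "year" in chosen:
        return 400, "details accepts either month or year, not both"
    # level=None means the EIDA level, with all datacenters grouped.
    return 200, {"level": level or "eida", "details": chosen}
```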

Better documentation

Pyramid-openapi3 dependency

New branch with latest version of pyramid-openapi3 dependency at https://github.com/EIDA/eida-statistics/tree/openapi_dependency.

Tested locally and works, hope it works in production as well.

Tell me when to merge in main.

Inefficient caching of FdsnNetExtender.extend()

FdsnNetExtender.extend(self, net, date_string) is decorated with lru_cache(maxsize=1000), but since date_string is different most of the time, the cache is rarely hit. In any case, I can observe URLs like http://www.fdsn.org/ws/networks/1/query?fdsn_code=3E being downloaded hundreds of times. Sometimes this causes an exception, which seems to be the reason for the incomplete statistics at GFZ.

Maybe date_string should be reduced to the year (do two different temporary networks with the same code ever exist in the same year?). Alternatively, I would suggest caching the result of urlopen(request).