Giter VIP home page Giter VIP logo

cht-app-monitoring-data-ingest's Introduction

CHT App Monitoring Data Ingestion

Purpose

Scrapes data from CHT instances and ingests in Postgres. The resulting data powers our App Monitoring dashboards and can answer questions like:

  • How many projects are running cht-core version 2.x or 3.x? How many are on the latest version?
  • Which projects are using declarative config?
  • Which projects use outbound push? How many projects enable the death_reporting transition?
  • How many users have the can_logout permission?
  • How many production deployments happened this month? How many forms changed in production this month?
  • How have data from the monitoring API changed over time?
  • Which projects upgraded this month?

Configuration

Database

You will need an instance of postgres available and a username and password with access to a database that has these tables:

CREATE TABLE public.monitoring_urls (
    id SERIAL PRIMARY KEY,
    url VARCHAR NOT NULL,
    partner_name text,
    enabled BOOLEAN NOT NULL DEFAULT true,
    access_level smallint NOT NULL DEFAULT 1
);

CREATE TABLE public.monitoring_docs (
    id SERIAL PRIMARY KEY,
    url_id INTEGER NOT NULL,
    doctype VARCHAR NOT NULL,
    created TIMESTAMP DEFAULT NOW(),
    doc JSONB NOT NULL
);

Medic App Services teammates should use the postgres/db_schema.sql instead of the above schema.

After you have created the tables above, the list of CHT instances to scrape is controlled via the Postgres table monitoring_urls:

  • To add a new instance, insert a row into the table
  • To disable an instance, set enabled to false
  • To scrape deep metrics for an instance, set access-level to 1, 2, or 3

Credentials

Credentials to write to the monitoring_docs table and read from the monitoring_urls us set with these 5 environment variables. The MONITORING_USERNAME and MONITORING_PASSWORD variables are shared and used to log in to all CHT instances. You therefore need to set the same username and password for all CHT instances you want to monitor using authenticated logins.

If you are only using access_level of 1, you do not need to set MONITORING_USERNAME and MONITORING_PASSWORD as the CHT monitoring API does not require any authentication.

Environment Variable Description Example
PG_PORT Port for Postgres connection 5432
PG_USERNAME Username for Postgres connection postgres
PG_PASSWORD Password for Postgres connection abc123
PG_HOST Host for Postgres connection localhost
PG_DATABASE Database for Postgres connection dwh_impact
MONITORING_USERNAME Username on CHT instances with monitoring permissions app-monitor
MONITORING_PASSWORD Password for CHT MONITORING_USERNAME abc123

Copy the .env.example file to .env and edit it with your deployment's values. It will look something like this:

PG_PORT=5432
PG_USERNAME=postgres
PG_PASSWORD=abc123
PG_HOST=localhost
PG_DATABASE=impact
MONITORING_USERNAME=app-monitor
MONITORING_PASSWORD=app-monitor

You can then use this file for both docker and node based deployments.

Accessing remote Postgres instances

If you need to access a Postgres instance via an SSH tunnel, consider using autossh. This can be installed with apt or snap on Ubuntu and will ensure the tunnel is up at all times to ensure consistent access to Postgres.

If you need to access a remote Postgres instance from inside a docker container over an SSH tunnel, consider binding the tunnel to the docker subnet, which is normally a /16 and is often 172.17.0.1/16. This might look like:

ssh -L 172.17.0.1:5432:localhost:5432 [email protected] 

You would then set PG_HOST to 172.17.0.1 and it will be accessible to all containers. See this blog post for more info. autossh can be used in combination with this technique.

Execution

Both the node and docker versions need to be run on a schedule, as natively the app will run once and quit. Consider using cron.

Node

  1. Ensure you have node and npm installed ( nvm is an easy way).
  2. Confirm node and npm are installed with node -v&&npm -v
  3. Clone this repo and cd into the app-monitoring-data-ingest directory
  4. Install dependencies with npm ci
  5. Set your environment vars with source .env
  6. Running the app with node .

Docker

For a fully docker based deployment, you can set up CHT couch2pg first, and then add a database and schema from above to the Postgres instance.

After you have set up your .env per above, run docker-compose up. All related should be built and downloaded as needed.

Output

As the access-level of the user increases from 1 up to 3, richer metrics are available. The results of the scraped data are stored in Postgres tables:

table doctype description
monitoring_docs settings app_settings.json
monitoring_docs monitoring Result from Monitoring API v1
monitoring_docs analysis See Analysis
monitoring_docs error An error occurred while scraping the instance. Document contains error
monitoring_couchpg - The status of couch2pg's sync for each couch database

Access Level 1

Access Level Access Requirement Output
1 Anonymous Monitoring API Only
2 Offline User Basic (Analysis)[#analysis]
3 Online User Full (Analysis)[#analysis]

Access Level 2

  1. Login as administrator
  2. Create new doc in _users database
{
  "_id": "org.couchdb.user:app-monitoring",
  "name": "app-monitoring",
  "type": "user",
  "roles": ["app_monitoring"],
  "password": "abc123"
}
  1. Create new doc in medic database
{
  "_id": "org.couchdb.user:app-monitoring",
  "name": "app-monitoring",
  "type": "user-settings",
  "roles": ["app_monitoring"]
}

Access Level 3

Access Level 2 but add role mm-online. This gives the user access to all data on the instance.

Analysis

This is the structure of documents in monitoring_docs with doctype='analysis':

Metric Description
url If you don't know what this is, this is the wrong dataset for you
status ok if scraped. Otherwise, the error that prevented scraping
numberOfUsers How many users in the _users database (Access Level 3 required)
numberOfRoles How many roles configured in app_settings
numberOfOfflineUser How many users have a role with offline: true (Access Level 3 required)
isDeclarativeConfig True if the project is using declarative configuration
numberOfConfigurableHierarchyContactTypes Number of configured contact_types in app_settings which are not the defaults (person, clinic, etc)
enabledPurging True if the project has a purge function
dhisDataSourceCount Number of elements in app_settings.dhis_data_sets
uhcEnabled Is app_settings.uhc.visit_count enabled
numberOfAppForms Number of app forms uploaded to the project
numberOfContactForms Number of contact forms uploaded to the project
numberOfTargets The number of configured targets app_settings.tasks.targets.items.length
numberOfTaskSchedulesNonDeclarative The nubmer of configured task schedules (only works for non-declarative projects) settings.tasks.schedules
countOfJsonRegistrations app_settings.registrations.length
countOfJsonPatient_reports app_settings.patient_reports.length
countOfJsonReplications app_settings.replications.length
countOfJsonSchedules app_settings.schedules.length
countOfJsonForms Object.keys(app_settings.forms).length
countOfOutboundPushes Object.keys(app_settings.outbound).length
appSettingRevs Number of revisions to the settings document. This number changes each time medic-conf upload-app-settings has an effect
formRevs Number of revisions to form documents. This number changes once per form each time medic-conf upload-app-forms upload-contact-forms has an effect
permission.* Number of users with this permission (Access Level 3 required)
transitions.* True if the transition is enabled for this project

Dependencies

Monitors projects using cht-couch2pg v3.6.2 and above.

cht-app-monitoring-data-ingest's People

Contributors

eljhkrr avatar freddieptf avatar jdndiaye avatar kennsippell avatar mrjones-plip avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

cht-app-monitoring-data-ingest's Issues

Replication metrics

  • Replication success rate per day (probably a stacked bar graph of successes and failures)
  • Reasons why replications are failing - Probably a table
  • Total number of docs replicated
  • Replication performance - How long is replication taking. Probably a Box Plot
  • Move the "number of users above replication limit" from Vitals to Replication tab
  • Adding widget in superset

Hosting for App Monitoring Data Ingest

Salutations Faithful Site Reliability Engineers.

In pursuit of our OGSM strategy Improve app monitoring for deployed experiences, we are pursuing this tactical plan for app monitoring. To that end, I've written a bit of a backend service https://github.com/medic/app-services-team/pull/330 which stores the result from CHT's MonitoringAPI + this data + pulls out purge logs, and a bit of other data which we want quick access to.

Now I want to host it somewhere. These are some proposed hosting requirements:

  1. Run script daily
  2. New RDBMS account with SSH certificate which does not have passphrase (like klipfolio account)
  3. Process writes to RDBMS
  4. Process needs to be able to make outbound web requests
  5. Easy to update code that is running
  6. (optional) Nice if logs were accessible

Questions:

  1. I'm thinking to host this in docker. Getting that working with SSH tunnel to RDBMS seems a little bit of work, so I just want to confirm before I go too deep. I think our alternative is hosting it on the rdbms machine like couch2pg which is easier but maybe ghetto.
  2. Do you have recommendations on how to authenticate queries made by this tool? Rely on 1password? New monitoring accounts with a shared password? New monitoring accounts with unique passwords?
  3. For the daily schedule, I'd like your input. Would it be best to do this as kubernetes cron? If so, then my docker should just start, run the script, and close. Please confirm or let me know if you have alternatives.
  4. For our klipfolio RDBMS account, I think it either has no private key or it has a private key without a passphase. Should I create that certificate and send it? Or could you do that and manage it when you make the account? Let me know your preference.
  5. For updating the code in docker, I'm thinking to have the docker script just pull down the github code, run npm ci and execute. I'm not up to date on docker best practices, so also welcoming feedback here.

For now - this issue is just a request to talk through some questions and a heads up of a upcoming future request to actually host this.

cc @eljhkrr @freddieptf

get_tasks_by_state should not cross join with stateHistory

The current implementation of get_tasks_by_state counts the number of "task state changes" rather than the number of created task documents.

CROSS JOIN LATERAL json_populate_recordset(null::record, (doc->>''stateHistory'')::json) AS (state text, timestamp bigint)

@eljhkrr Can you confirm that we want to count the number of task documents? Or do we want to count task document states?

Automated Feedback Docs with unique causes are grouped by message "missing"

Feedback docs with info like the below are grouped and displayed with non-descript "missing"

image

  "info": {
    "cause": "Possibly unhandled rejection: {\"status\":404,\"name\":\"not_found\",\"message\":\"missing\",\"error\":true,\"reason\":\"missing\",\"docId\":\"partners\"}",
    "message": "missing"
  },

In some cases the cause more helpful and actually include docId values.
medic/cht-core#6626

Example causes which are all being grouped under this:

  • Possibly unhandled rejection: {"status":404,"name":"not_found","message":"missing","error":true,"reason":"missing","result":{"ok":false,"start_time":"2022-11-03T10:23:24.259Z","docs_read":0,"docs_written":0,"doc_write_failures":0,"errors":[],"status":"aborting","end_time":"2022-11-03T10:24:19.912Z","last_seq":5251}}
  • Possibly unhandled rejection: {"status":404,"name":"not_found","message":"missing","error":true}
  • Possibly unhandled rejection: {"status":404,"name":"not_found","message":"missing","error":true,"reason":"missing","docId":"partners"}

Many instances in app-monitoring-data-ingest have NULL version

I believe that maybe these instances with NULL version are running CHT <3.9 which is before the monitoring API was introduced?

We can run this query to get all instances with and with out a version:

select
	mu.url  as url,
--	mu.partner_name as partner,
	max(md.created) as created,
	max(md.doc#>>'{version,app}') as version
from
	monitoring_urls mu 
left join monitoring_docs md on
	md.url_id = mu.id  
where 
	enabled = true
group by 
	url
order by 
	version 

These are the ones WITH a version, which is nice to see:

url                                                   |created                |version                   |
------------------------------------------------------+-----------------------+--------------------------+
https://msf-goma.app.medicmobile.org                  |2022-10-12 00:01:56.559|3.10.0                    |
https://adp-production.app.medicmobile.org            |2022-10-12 00:01:56.582|3.10.0                    |
https://safaridoctors-ke.app.medicmobile.org          |2022-10-12 00:01:56.577|3.10.0                    |
https://morang-ne.app.medicmobile.org                 |2022-10-12 00:01:56.574|3.10.0                    |
https://bhojpur-ne.app.medicmobile.org                |2022-10-12 00:01:56.631|3.10.2                    |
https://villagehealthworks-burundi.app.medicmobile.org|2022-10-12 00:01:56.625|3.10.3                    |
https://kfn-ilam.app.medicmobile.org                  |2022-10-12 00:01:56.580|3.10.3                    |
https://triggerise.app.medicmobile.org                |2022-10-12 00:01:56.602|3.10.3                    |
https://cmmb-kenya.app.medicmobile.org                |2022-10-12 00:01:56.554|3.10.3                    |
https://clone-brac-ug.dev.medicmobile.org             |2022-10-12 00:01:56.603|3.10.4                    |
https://brac-ug.app.medicmobile.org                   |2022-10-12 00:01:56.610|3.10.5                    |
https://sankhuwashabha-ne.app.medicmobile.org         |2022-10-12 00:01:56.645|3.11.0                    |
https://moh-ug-uncdf.app.medicmobile.org              |2022-10-12 00:01:56.586|3.11.0                    |
https://vhtapp.cphluganda.org                         |2022-10-12 00:01:56.561|3.11.0                    |
https://muso-cdi.app.medicmobile.org                  |2022-10-12 00:01:56.560|3.13.0                    |
https://hth-togo.app.medicmobile.org                  |2022-10-12 00:01:56.659|3.13.0                    |
https://muso-mali.app.medicmobile.org                 |2022-10-12 00:01:56.618|3.14.2                    |
https://pih-malawi.app.medicmobile.org                |2022-10-12 00:01:56.607|3.14.2                    |
https://bhaktapur-ne.app.medicmobile.org              |2022-10-12 00:01:56.563|3.14.2                    |
https://cloned-pih-malawi.dev.medicmobile.org         |2022-10-12 00:01:56.590|3.14.2                    |
https://chis.dohs.gov.np                              |2022-10-12 00:01:56.604|3.14.2                    |
https://safesimbaglung.app.medicmobile.org            |2022-10-12 00:01:56.593|3.14.2                    |
https://moh-zanzibar.app.medicmobile.org              |2022-10-12 00:01:56.557|3.15.0                    |
https://itech-zimbabwe.dev.medicmobile.org            |2022-10-12 00:01:56.583|3.15.0                    |
https://supervisor-moh-mali.app.medicmobile.org       |2022-10-12 00:01:56.622|3.15.0                    |
https://moh-niger-chw.app.medicmobile.org             |2022-10-12 00:01:56.597|3.15.0-FR-bulk-user-upload|
https://zazic-scale-up.app.medicmobile.org            |2022-10-12 00:01:56.628|3.16.0                    |
https://disc-mali.ml                                  |2022-10-12 00:01:56.596|3.16.0                    |
https://ohw-dhading.app.medicmobile.org               |2022-10-12 00:01:56.588|3.16.0                    |
https://dho-sunsari-ne.app.medicmobile.org            |2022-10-12 00:01:56.654|3.16.0                    |
https://itech-aurum.app.medicmobile.org               |2022-10-12 00:01:56.555|3.9.0                     |
https://malariaconsortium.app.medicmobile.org         |2022-10-12 00:01:56.599|3.9.0                     |
https://cht-cop-interop.dev.medicmobile.org           |2022-10-12 00:01:56.600|3.9.0                     |

And these are the instances with NULL for a version:

url                                                   |created                |version                   |
------------------------------------------------------+-----------------------+--------------------------+
https://dho-rasuwa-ne.app.medicmobile.org             |2022-10-12 00:01:56.649|                          |
https://kailali-ne.app.medicmobile.org                |2022-10-12 00:01:56.564|                          |
https://dpho-kanchanpur.app.medicmobile.org           |2022-10-12 00:01:56.615|                          |
https://dho-dadeldhura-ne.app.medicmobile.org         |2022-10-12 00:01:56.647|                          |
https://dho-bajura-ne.app.medicmobile.org             |2022-10-12 00:01:56.637|                          |
https://care-sindhuli.app.medicmobile.org             |2022-10-12 00:01:56.612|                          |
https://dhankuta-ne.app.medicmobile.org               |2022-10-12 00:01:56.633|                          |
https://dho-sindhupalchowk-ne.app.medicmobile.org     |2022-10-12 00:01:56.651|                          |
https://dpho-banke.app.medicmobile.org                |2022-10-12 00:01:56.613|                          |
https://panchthar-ne.app.medicmobile.org              |2022-10-12 00:01:56.639|                          |
https://alerte-covid19-ne.net                         |                       |                          |
https://syangja-ne.app.medicmobile.org                |2022-10-12 00:01:56.642|                          |
https://dpho-pyuthan-ne.app.medicmobile.org           |2022-10-12 00:01:56.656|                          |
https://cht.app.medicmobile.org                       |                       |                          |
https://srhgorkha.app.medicmobile.org                 |2022-10-12 00:01:56.620|                          |
https://dho-baitadi.app.medicmobile.org               |2022-10-12 00:01:56.635|                          |

Commit a sample cron file

Useful so

  1. we are all on the same page of what is running and when
  2. serves as example documentation for how the system should be configured

couch2pg backlog

To visualize couch2pg backlog in app monitoring we need to know each project's couch2pg sequence numbers.

Workflows tab improvements

image
There is too much detail that may hamper effectively identifying live site incidents in progress. Improvements can be made as follows:

  • Split task table into two tables:

    Table 1:
    username
    document count
    completion rate
    cancellation rate

    Table 2:
    task
    document count
    completion rate
    cancellation rate

  • Remove drop down task selector

Query performance

Performance of app monitoring queries needs to be bench-marked and policies formulated on patterns that will improve on performance.

Refresh logs

Conversation from app services repository:

should we UPSERT here instead of INSERT and just maintain a table of matview > most recent refresh? i'm not sure if a full log of all refresh times is valuable.

the benefit would be that app_monitoring_refresh_log is now just a select and we don't need to worry about the table bloating after years

There are no unique indexes on the logs table so an upsert isn't possible, we could however create a new table for app_monitoring logs or see if this approach would work with how impact views are refreshed. @yrimal do we need historical refresh logs for the impact views?

This function only refreshes app_monitoring views, we could make an app_monitoring specific table with the latest refresh timestamp since it's the only information needed when evaluating data freshness

Originally posted by @kennsippell in https://github.com/medic/app-services-team/pull/388#discussion_r968919054

Datasource freshness

How can stale datasources be detected?
Add a widget showing last refresh timestamp - update a table each time materialized views are refreshed

Dockerize the App Monitoring Data Ingest tool

Background:

  • Join the #app-monitoring Slack Channel
  • Read and understand the OGSM App Monitoring project
  • Schedule a meeting with @eljhkrr as needed to ensure you understand it all
  • Understand how the App Monitoring Data Ingest tool fits into that
  • Download the App Monitoring Data Ingest tool and get it running locally
  • Read this issue to understand the hosting requirements. Contribute any thoughts to it. Ping SRE every two weeks or so to ensure they know people care.

Your Assignment:

  • Make a dockerfile or docker-compose.yml file which hosts the Data Ingest tool
  • Get an RDBMS username and password for the monitoring tool which has write access to dwh_impact
  • Your docker container will need to open an SSH tunnel to RDBMS
  • Work with SRE to get the docker container hosted in Kubernetes with a daily cron

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.