
probe-scraper

Scrape Telemetry probe data from Firefox repositories.

This extracts per-version Telemetry probe data for Firefox and other Mozilla products from registry files like Histograms.json and Scalars.yaml. The data allows answering questions like "which Firefox versions is this Telemetry probe in, anyway?". Probes outside of Histograms.json - like the CSS use counters - are also included in the output data.

The data is pulled from two different sources:

  • registry files (Histograms.json, Scalars.yaml, ...) in mozilla-central, for Firefox Telemetry probes
  • the git repositories listed in repositories.yaml, for Glean metrics

Probe Scraper outputs JSON to https://probeinfo.telemetry.mozilla.org. Effectively, this creates a REST API which can be used by downstream tools like mozilla-schema-generator and various data dictionary type applications (see below).

An OpenAPI reference to this API is available:

probeinfo API docs

A web tool to explore the Firefox-related data is available at probes.telemetry.mozilla.org. A similar view for Glean-based data is under development in the Glean Dictionary.

Deprecation

Deprecation is an important step in an application lifecycle. Because of the backwards-compatible nature of our pipeline, we do not remove Glean apps or variants from the repositories.yaml file - instead, we mark them as deprecated.

Marking an App Variant as deprecated

When an app variant is marked as deprecated (see this example from Fenix), the following happens:

Marking an App as deprecated

When an app is marked as deprecated (see this example of Firefox for Fire TV), the following happens:

  • It no longer shows by default in the Glean Dictionary. (Deprecated apps can be viewed by clicking the Show deprecated applications checkbox)

Adding a New Glean Repository

To scrape a git repository for probe definitions, an entry needs to be added in repositories.yaml. The exact format of the entry depends on whether you are adding an application or a library. See below for details.

Adding an application

For a given application, Glean metrics are emitted by the application itself, by any libraries it uses that also use Glean, and by the Glean library proper. Probe scraper therefore needs a way to find all of the dependencies to determine all of the metrics emitted by that application.

Each application should specify a dependencies parameter, which is a list of Glean-using libraries used by the application. Each entry should be a library name as specified by the library's library_names parameter.

For Android applications, if you're not sure what the dependencies of the application are, you can run the following command at the root of the project folder:

$ ./gradlew :app:dependencies

See the full application schema documentation for descriptions of all the available parameters.

Adding a library

Probe scraper also needs a way to map dependencies back to an entry in the repositories.yaml file. Therefore, any libraries defined should also include their build-system-specific library names in the library_names parameter.

See the full library schema documentation for descriptions of all the available parameters.

Developing the probe-scraper

You can choose to develop using the container, or locally. Using the container is slower, since changes trigger a rebuild of the container, but it ensures that your PR passes the CircleCI build and test phases.

Local development

Instead of installing all of these requirements in your global Python environment, you may wish to start by creating and activating a Python virtual environment. The .gitignore expects it to be called ENV or venv:

python -m venv venv
. venv/bin/activate

Install the requirements:

pip install -r requirements.txt
pip install -r test_requirements.txt
python setup.py develop

Run the tests. By default, this does not run tests that require a web connection:

pytest tests/

To run all tests, including those that require a web connection:

pytest tests/ --run-web-tests

To test whether the code conforms to the style rules, you can run:

python -m black --check probe_scraper tests ./*.py
flake8 --max-line-length 100 probe_scraper tests ./*.py
yamllint repositories.yaml .circleci
python -m isort --profile black --check-only probe_scraper tests ./*.py

To render API documentation locally to index.html:

make apidoc

Developing using the container

Run tests in container. This does not run tests that require a web connection:

export COMMAND='pytest tests/'
make run

To run all tests, including those that require a web connection:

make test

To test whether the code conforms to the style rules, you can run:

make lint

Tests with Web Dependencies

Any tests that require a web connection to run should be marked with @pytest.mark.web_dependency.

These will not run by default, but will run on CI.
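
For example, a minimal sketch of such a test (the endpoint and assertion here are purely illustrative):

import pytest
import requests

@pytest.mark.web_dependency
def test_repositories_endpoint_is_reachable():
    # Skipped by default; runs with `pytest tests/ --run-web-tests` and on CI.
    response = requests.get("https://probeinfo.telemetry.mozilla.org/glean/repositories")
    assert response.status_code == 200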

Performing a Dry-Run

Before opening a PR, it's good to test the code you wrote against the production data. You can specify a specific Firefox version to run on by using --firefox-version:

export COMMAND='python -m probe_scraper.runner --firefox-version 65 --dry-run'
make run

or locally via:

python -m probe_scraper.runner --firefox-version 65 --dry-run

Including --dry-run means emails will not be sent.

Additionally, you can test just on Glean repositories:

export COMMAND='python -m probe_scraper.runner --glean --dry-run'
make run

By default that will test against every Glean repository, which might take a while. If you want to test against just one (e.g. a new repository you're adding), you can use the --glean-repo argument to just test the repositories you care about:

export COMMAND='python -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run'
make run

Replace burnham in the example above with your repository and its dependencies.

You can also do the dry-run locally:

python -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run

Module overview

The module is built around the following data flow:

  • scrape registry files from mozilla-central, or clone files from the git repositories listed in repositories.yaml
  • extract probe data from the files
  • transform probe data into output formats
  • save to disk

The code layout consists mainly of:

  • probe_scraper
    • runner.py - the central script, ties the other pieces together
    • scrapers
      • buildhub.py - pull build info from the BuildHub service
      • moz_central_scraper.py - loads probe registry files for multiple versions from mozilla-central
      • git_scraper.py - loads probe registry files from a git repository (no version or channel support yet, just per-commit)
    • parsers/ - extract probe data from the registry files
    • transform_*.py - transform the extracted raw data into output formats
  • tests/ - the unit tests

Accessing the data files

The processed probe data is serialized to the disk in a directory hierarchy starting from the provided output directory. The directory layout resembles a REST-friendly structure.

|-- product
    |-- general
    |-- revisions
    |-- channel (or "all")
        |-- ping type
            |-- probe type (or "all_probes")

For example, all the JSON probe data in the main ping for the Firefox Nightly channel can be accessed with the following path: firefox/nightly/main/all_probes. The probe data for all the channels (same product and ping) can be accessed instead using firefox/all/main/all_probes.

The root directory for the output generated from the scheduled job can be found at https://probeinfo.telemetry.mozilla.org/. All the probe data for Firefox coming from the main ping can be found at https://probeinfo.telemetry.mozilla.org/firefox/all/main/all_probes.
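
Since the output is plain JSON over HTTPS, it can be consumed with any HTTP client; a minimal sketch using Python's requests:

import requests

# All Firefox probes reported in the "main" ping, across all channels.
url = "https://probeinfo.telemetry.mozilla.org/firefox/all/main/all_probes"
probes = requests.get(url).json()
print(f"{len(probes)} probes in the main ping")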

Accessing Glean metrics data

Glean data is generally laid out as follows:

|-- glean
    |-- repositories
    |-- general
    |-- repository-name
        |-- general
        |-- metrics

For example, the data for a repository called fenix would be found at /glean/fenix/metrics. The time the data was last updated for that project can be found at /glean/fenix/general.

A list of available repositories is at /glean/repositories.

probe-scraper's People

Contributors

akkomar, badboy, benwu, brizental, chutten, dependabot[bot], dexterp37, eu9ene, fbertsch, gabrielluong, georgf, github-actions[bot], gleonard-m, haroldwoo, hrdktg, iinh, jhabarsingh, jklukas, lilylme, marlene-m-hirose, mdboom, mreid-moz, perrymcmanis144, relud, scholtzan, sean-rose, travis79, whd, wlach, zzzeid


probe-scraper's Issues

Validate probe data

We need to validate the probe data for correctness, e.g. by manually cross-checking the history of a few probes of each type.

Add instructions for how to perform a dry-run for testing

As seen in #49, added tests may pass even though running against real data results in improper data.

It'd be nice to include in the README instructions for how to run the scraper quickly and locally.

As @fbertsch put it "just head to the probe-scraper repo and run: python probe_scraper/runner.py. If it's taking too long (it will), change MIN_FIREFOX_VERSION in scrapers/moz_central_scraper.py to e.g. 59 so it doesn't pull down as many files."

Proposal for: GCP Migration, CircleCi Migration, and Running Dependency/Probe checks

These 3 issues (GCP, CircleCI, and Dependency/Probe checking) are all related in that they require Docker integration; dependency checking specifically needs GKE integration.

The work can be done in this order; so e.g. we can be building/testing the container on CI, but still running on EMR while we change the deploy to GCP.

Local Testing and CI

For local testing and CI, we will move to a Docker workflow. This will include building a container with all of the dependencies, running tests and lint on that container, and updating CI to build, test, and deploy that container. This should follow the Dockerflow example.

This will require adding:

  • Dockerfile
  • Makefile
  • docker-compose (optional, but nice)
  • pinned requirements
  • circle-ci config
  • Dockerhub creds to circle config

Running on GCP

GCP will also run on that container. We will use the GKE Pod Operator, and use the image that CI deploys. To run this on GCP, we need to add an entrypoint script that runs the probe-scraper locally.

We will need an associated change to telemetry-airflow to update how we're running the job. This file is the one that will be running on the container (with some changes for GCP world, e.g. GCS).

(Note that we may still need to write to s3 for the probe-info-service. I'll cc @jasonthomas here on whether there is a plan to move the probe-info-service to GCS. Once it's there we can write to GCS instead.)

Integrating with Google Kubernetes Engine

In order to check for metrics or dependencies that are present in repositories, we need a development environment to build/run/test the applications. We can do this by running on Dockerhub Images. It may be the case that there is not a stable image for our needs; in that case we may need to build and deploy them ourselves, using the existing infrastructure from "Local Testing and CI". In that case we'll need an additional Dockerfile.

When we have those images available, we can run them in the probe-scraper using the GKE Python Client. We can run in those environments and get a result (whether it is dependencies, probes, pings, etc.).

Histograms expired in minor releases are not listed in the "expired" list

The MEDIA_DECODING_PROCESS_CRASH histogram was removed from Histograms.json by this bug (Firefox 57). Up to revision 0a0a804a5b6f252f12d9808b54ed2a7f6ada27e3, it was correctly reported among the expired histograms for Firefox 57 on the release channel. However, since 3702966a64c80e17d01f613b0a464f92695524fc was scraped, it no longer shows up in that list.

We should verify what is going on and fix this.

Consider dropping Aurora channel

We should check if anyone intends to use historical Aurora channel data.
If not, we could just drop support for that channel.

Update probe parsers in mozilla-central

We made some changes in this repository to the parsers that live in mozilla-central.
Once this repository is more stable, we should merge the updates to m-c.

Add test coverage for "tags" scraping

In scraper.py we use the load_tags and extract_tag_data functions to fetch the data from the Firefox repository and extract the releases information. This needs some test coverage (a sketch follows the list below).

We could:

  • Use the "requests" library to catch the responses and provide fake data to load_tags (an example can be found here)
  • Call extract_tag_data on the fake data
  • Make sure we get the expected data out
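
A minimal sketch of such a test, assuming the responses package for mocking HTTP; the import path, URL, and function signatures below are assumptions, not the actual code:

import responses

from probe_scraper.scrapers import moz_central_scraper  # assumed import path

# Minimal fake json-tags payload; the real response has many more fields.
FAKE_TAGS = {"tags": [{"tag": "FIREFOX_57_0_RELEASE", "node": "0a0a804a5b6f"}]}

@responses.activate
def test_load_and_extract_tags():
    # Serve canned JSON instead of hitting the Firefox repository (URL is illustrative).
    responses.add(
        responses.GET,
        "https://hg.mozilla.org/releases/mozilla-release/json-tags",
        json=FAKE_TAGS,
        status=200,
    )
    tags = moz_central_scraper.load_tags("release")  # assumed signature
    data = moz_central_scraper.extract_tag_data(tags, "release")  # assumed signature
    # Make sure we get the expected releases information out.
    assert data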

Add test coverage for correct handling of historic probe registries

We should add test coverage that tests that the output for old, real Histograms.json, Scalars.yaml, ... is correct.
E.g. taking files from the oldest supported Firefox version, the current newest, and a few at important key points in between (e.g. when properties were added/removed).

Change existing git repo parsing to Glean parsing

We currently have a setup to include the types of files we expect for Desktop: Histograms.json, Scalars.yaml, etc., but for Github repos. The Glean work is replacing that effort, so we need to update the code to parse metrics.yaml files instead.

This work will encompass:

  • Parsing and writing out information on Glean probes to the probe-dictionary
  • Parsing and writing out information on Glean pings to the probe-dictionary

The former will be used for schema creation. The latter will be used for validation and table names in the pipeline.

We can use the Glean Parser library to parse the metrics.yaml files and write out the results. The rationale for letting the Glean Parser serialize is that the scraper then doesn't need to have intimate knowledge of which fields are required and which may be updated; instead we can just add and remove fields from the parser.

Scrape bug numbers from probes that have them

Some probes have bug_numbers fields (in fact, all new ones are required to have them, and owners are encouraged to add them when they are updated). This seems like useful information we can surface in the probe scraper.

Marking this "help wanted" for mentorship. @Dexterp37 and @georgf have offered to mentor this bug.

After this is complete, we can file a bug in https://github.com/mozilla/telemetry-dashboard for augmenting Probe Dictionary to display this information and linkify the bug numbers to make it easy to flow from the dictionary to bugzilla to learn more.

Remove scraping `json-tags` and get all revisions from BuildHub

We currently get revisions from both Buildhub and json-tags, but Buildhub is the more canonical source, and it stores every build we have had (including nightly). We should remove the json-tags scraping and just use that source of revision data.

Use the mozparsers package

In bug 1282098 we exported the python_mozparsers package that contains the parsers we use for parsing probes.

The probe scraper should depend on that package rather than forking the ones in m-c.

This issue is about backporting the changes that were implemented in the probe scraper fork of the parsers upstream and then depending on the exported package.

Refactor this Code Base

This code base is largely about:

  • Scraping information about revisions
  • Reading probe information from those revisions
  • Combining probe information
  • Writing it back out

Given that we recently moved to Python 3, this code base could use a serious uplift by utilizing some of the nice features available. For example, the probe type could be a Dataclass that knows how to compare itself to others, and knows how to serialize itself into the final JSON output.

A revision class could compare itself to other revisions, based on push-date or version. This can be used to decide first and last revisions/versions/dates for probes.

Together, these would simplify transform_probes.py.
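
A minimal sketch of what such classes could look like; the names and fields below are assumptions, not the existing code:

from dataclasses import dataclass, field

@dataclass(order=True)
class Revision:
    # Revisions compare by push date, which decides first/last
    # revisions, versions, and dates for probes.
    push_date: int
    node: str = field(compare=False)
    version: int = field(compare=False)

@dataclass
class Probe:
    name: str
    probe_type: str
    definition: dict

    def to_json(self) -> dict:
        # Serialize into the shape written to the final JSON output.
        return {"name": self.name, "type": self.probe_type, **self.definition}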

All logic should be moved out of runner.py and integrated into appropriate places, and that should exist just to provide the scraper CLI.

Glean probes should use the latest commit in addition to file changes

Using the latest commit would allow us to get an accurate date range for a probe: even if a probe last changed 5 commits ago it may still be available, but currently the date is taken from the last commit on which the metrics.yaml file changed.

This isn't urgent, since we can order Glean probes by their first date.

Scrape and publish Glean `pings.yaml`

This can be used in a few places:

  • Mozilla-schema-generator, which currently takes a backwards approach: it needs all pings before it can get all probes, but it has to derive the pings from the probes
  • bigquery-etl, to easily generate queries on every table (or view across them all)

This should scrape pings.yaml, and publish a history based on the pings present. I would imagine this lives in /glean/$REPO/pings.

Rename "optout" property

In the output data we use the old "optout" terminology for probes.
We should change this to something matching the new semantics, e.g. "collected_on_release".

Currently the main consumer of this is the probe-dictionary, although we should check with the data tools team.
We can migrate easily by first adding the new property, then fix the consumers, then removing the old property.

Handle "historical" glean metrics file

The scraper currently imports the latest version of the Glean parsers. However, if backwards-incompatible format changes are introduced in the parsers, the scraper might break.

We should fix this. Among the possible solutions:

  • do version pinning for the parsers;
  • have two CI environments;
  • include a version in the metrics file.

Deliver JSON files gzip encoded

Loading performance is currently a pain point for the data explorer site.

The bulk of the wait is on probes.json, which is ~8MB and delivered uncompressed from the public S3 bucket.

I can think of two quick fixes here:

  • gzip encode in the S3 bucket (see bottom here, cuts my load times from 6s to 2s)
  • publish them through a CloudFront distribution

@mreid-moz , does that sound right?
Any recommendations over either?
If CF, do i need to file a bug $somewhere to make it happen?
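
For the first option, a rough sketch of what gzip-encoding an object in S3 could look like with boto3 (bucket name and key are placeholders):

import gzip
import json

import boto3

def upload_gzipped_json(data, bucket="example-bucket", key="firefox/all/main/all_probes"):
    body = gzip.compress(json.dumps(data).encode("utf-8"))
    # Content-Encoding lets browsers and CDNs transparently decompress the response.
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
        ContentEncoding="gzip",
    )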

Include "first_added" dates in probe-info-service

Description of Problem

For the Main Ping Processing in GCP, we plan on including every historical probe, so that they can continue to be queried even after they expire.

To do this, we need a total ordering of probes, with new probes appended. For example, if the current set of probes was {SEARCH_COUNTS, GC_MS}, but in tomorrow's nightly we added {PAGELOAD_IS_SSL}, I would want the probe ordering to be:

Today's: [GC_MS, SEARCH_COUNTS]
Tomorrow's: [GC_MS, SEARCH_COUNTS, PAGELOAD_IS_SSL]

first_added field

The easiest way to accomplish this is to include a "first_added" timestamp. If any probes share a "first_added" timestamp, they will be sorted alphanumerically. Some probes (e.g. STYLO_PARALLEL_RESTYLE_FRACTION_WEIGHTED_STYLED) only showed up on release, and others (e.g. GC_MS) only show up on pre-release. As such we can include a "first_added" date for each channel:

"GC_MS": {
   "history": {...},
    "name": "GC_MS",
    "type": "histogram",
    "first_added": {
        "nightly": 1547013850,
        "beta": 1547014850
     }
}

We can achieve this by:

  1. Scrape all revisions from each probe file, e.g. Histograms.json
  2. Iterate over each revision, getting the list of new probes added in that revision
  3. Get the pushdate from json-rev
  4. Join that data with the existing probe-info-service data

The only piece I don't know how to do is: given the revision, get the associated json-rev file.
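
For step 3, a hedged sketch of fetching the pushdate for a revision; the json-rev URL pattern and the shape of the response are assumptions about hg.mozilla.org's JSON API:

import requests

def get_pushdate(revision, repo="https://hg.mozilla.org/mozilla-central"):
    # e.g. https://hg.mozilla.org/mozilla-central/json-rev/<revision>
    resp = requests.get(f"{repo}/json-rev/{revision}")
    resp.raise_for_status()
    # "pushdate" is assumed to be a [unix_timestamp, tz_offset] pair.
    return resp.json()["pushdate"][0]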

Requesting Update of 'shared_telemetry_util.py'

In bug 1363725 (https://bugzilla.mozilla.org/show_bug.cgi?id=1363725), the values all_child and all_childs were changed to all_children. But this caused a conflict with 'shared_telemetry_util.py' bundled with probe-scraper.

So we have just updated the '.py' file for now. After this file is updated on both ends, we will change the values in the .yaml and .json files.

Requesting an update of 'shared_telemetry_util.py' in probe-scraper.
(I found it's the only file that mentions 'all_child'; please tell me if something else also needs to be changed.)

Handle multiple app_ids

Fenix will be releasing multiple versions with different app_ids. Those all need to be included here. It should be straightforward to make that an array rather than a string.

CODE_OF_CONDUCT.md file missing

As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

Break down the output by release channel

In order to make the scraper future proof and to support other internal products we need to break down the output by channel (Nightly, Aurora, Beta, Release, ESR).

We will still need to support the monolithic output mode with all the probes from all the channels to make it easy for the data-explorer to consume it.

Glean metrics not being properly recorded

Recently we added all_metrics to send_in_pings for the Glean error probes. This had the desired effect: on the next run of the probe-scraper, there was a new definition in history for those probes, where send_in_pings was [all_pings].

However, now the old definition has taken precedence: https://probeinfo.telemetry.mozilla.org/glean/glean/metrics

This can be seen by looking at dates/last, where the definition with send_in_pings as [metrics] has a later date. We want the dates/last for the metrics version of the definition to be before the dates/first for the all_pings version of the definition.

Add dependencies for reference browser

The reference browser integrated lib-crash, but the probe-scraper doesn't know about it, so the BQ schema doesn't have those metrics. We need to make a scrapable dependencies file (à la Fenix) and add it here.

If that's too much work, we could consider manually stating the dependencies (for a short-term solution - or it could be long-term if we realize we're not going to be adding many dependencies in the future).

cc @mdboom @travis79

Save update time to general.json

While we save the update date to general.json as "lastUpdate", it would help to know at what time of day the update happened.
We could just change that to be a full ISO date string, including time and timezone info.
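
A minimal sketch of the proposed change (the lastUpdate key comes from the issue; the rest is illustrative):

import json
from datetime import datetime, timezone

general = {"lastUpdate": datetime.now(timezone.utc).isoformat()}
# e.g. {"lastUpdate": "2019-01-09T06:44:10.123456+00:00"}
with open("general.json", "w") as f:
    json.dump(general, f)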

Wiki changes

FYI: The following changes were made to this repository's wiki:

These were made as the result of a recent automated defacement of publicly writable wikis.

Deal with EOL Whitespace in outputted JSON

According to the json.dump docs, the default separators value for Python 2 is (', ', ': '), even with a non-default indent value.

This means before the \n at the end of lines with terminal commas there is a space. EOL Whitespace! Sound the alarm!

This was fixed in Python 3.4, but since the current production job uses python (Python 2) rather than python3, we need to specify separators=(',', ': ') in json.dump, as the docs recommend.
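
A minimal example of the proposed fix:

import json

data = {"name": "GC_MS", "type": "histogram"}
with open("probes.json", "w") as f:
    # Explicit separators drop the trailing space json.dump adds after commas on Python 2.
    json.dump(data, f, indent=2, separators=(",", ": "))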

Print some stats before saving probe data to disk

To make it easier to judge whether things worked correctly, it would be great to print some basic stats before saving the outputs to disk in runner.py (a sketch follows the list below).
E.g.:

  • how many probes of each type we have (histograms, scalars, events, ...)
  • how many versions/revisions we extracted data for
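
A minimal sketch of what this could look like; the shape of the probe data assumed below is illustrative:

from collections import Counter

def print_probe_stats(probes_by_channel):
    # probes_by_channel: {channel: {probe_id: {"type": ..., "history": ...}}}
    for channel, probes in probes_by_channel.items():
        type_counts = Counter(p["type"] for p in probes.values())
        print(f"{channel}: {len(probes)} probes")
        for probe_type, count in sorted(type_counts.items()):
            print(f"  {probe_type}: {count}")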

Move to Circle CI

In general we've been trying to move our repos to Circle CI rather than Travis. We have a paid account on Circle and have had a lot of success on their platform.
