
probe-scraper

Scrape Telemetry probe data from Firefox repositories.

This extracts per-version Telemetry probe data for Firefox and other Mozilla products from registry files like Histograms.json and Scalars.yaml. The data allows answering questions like "which Firefox versions is this Telemetry probe in, anyway?". Probes outside of Histograms.json - like the CSS use counters - are also included in the output data.

The data is pulled from two different sources:

  • registry files (Histograms.json, Scalars.yaml, ...) in mozilla-central, for Firefox Telemetry probes
  • the git repositories listed in repositories.yaml, for Glean metrics

Probe Scraper outputs JSON to https://probeinfo.telemetry.mozilla.org. Effectively, this creates a REST API which can be used by downstream tools like mozilla-schema-generator and various data dictionary type applications (see below).

An OpenAPI reference to this API is available:

probeinfo API docs

A web tool to explore the Firefox-related data is available at probes.telemetry.mozilla.org. A similar view for Glean-based data is under development in the Glean Dictionary.

Deprecation

Deprecation is an important step in an application lifecycle. Because of the backwards-compatible nature of our pipeline, we do not remove Glean apps or variants from the repositories.yaml file - instead, we mark them as deprecated.

Marking an App Variant as deprecated

When an app variant is marked as deprecated (see this example from Fenix), the following happens:

Marking an App as deprecated

When an app is marked as deprecated (see this example of Firefox for Fire TV), the following happens:

  • It no longer shows by default in the Glean Dictionary. (Deprecated apps can be viewed by clicking the Show deprecated applications checkbox)

Adding a New Glean Repository

To scrape a git repository for probe definitions, an entry needs to be added in repositories.yaml. The exact format of the entry depends on whether you are adding an application or a library. See below for details.

Adding an application

For a given application, Glean metrics are emitted by the application itself, by any libraries it uses that also use Glean, and by the Glean library proper. Probe scraper therefore needs a way to find all of the dependencies to determine all of the metrics emitted by that application.

Each application should specify a dependencies parameter, which is a list of Glean-using libraries used by the application. Each entry should be a library name as specified by the library's library_names parameter.

For Android applications, if you're not sure what the dependencies of the application are, you can run the following command at the root of the project folder:

$ ./gradlew :app:dependencies

See the full application schema documentation for descriptions of all the available parameters.

Adding a library

Probe scraper also needs a way to map dependencies back to an entry in the repositories.yaml file. Therefore, any libraries defined should also include their build-system-specific library names in the library_names parameter.

See the full library schema documentation for descriptions of all the available parameters.

Developing the probe-scraper

You can choose to develop using the container, or locally. Using the container is slower, since changes trigger a rebuild of the container, but it ensures that your PR passes the CircleCI build and test phases.

Local development

Instead of installing all of these requirements in your global Python environment, you may wish to start by creating and activating a Python virtual environment. The .gitignore expects it to be called ENV or venv:

python -m venv venv
. venv/bin/activate

Install the requirements:

pip install -r requirements.txt
pip install -r test_requirements.txt
python setup.py develop

Run the tests. By default, this does not run tests that require a web connection:

pytest tests/

To run all tests, including those that require a web connection:

pytest tests/ --run-web-tests

To test whether the code conforms to the style rules, you can run:

python -m black --check probe_scraper tests ./*.py
flake8 --max-line-length 100 probe_scraper tests ./*.py
yamllint repositories.yaml .circleci
python -m isort --profile black --check-only probe_scraper tests ./*.py

To render API documentation locally to index.html:

make apidoc

Developing using the container

Run tests in container. This does not run tests that require a web connection:

export COMMAND='pytest tests/'
make run

To run all tests, including those that require a web connection:

make test

To test whether the code conforms to the style rules, you can run:

make lint

Tests with Web Dependencies

Any tests that require a web connection to run should be marked with @pytest.mark.web_dependency.

These will not run by default, but will run on CI.
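
For example, a minimal sketch of such a test (the endpoint and assertion here are purely illustrative):

import pytest
import requests

@pytest.mark.web_dependency
def test_repositories_endpoint_is_reachable():
    # Skipped by default; runs with `pytest tests/ --run-web-tests` and on CI.
    response = requests.get("https://probeinfo.telemetry.mozilla.org/glean/repositories")
    assert response.status_code == 200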

Performing a Dry-Run

Before opening a PR, it's good to test the code you wrote against the production data. You can specify a specific Firefox version to run on by using --firefox-version:

export COMMAND='python -m probe_scraper.runner --firefox-version 65 --dry-run'
make run

or locally via:

python -m probe_scraper.runner --firefox-version 65 --dry-run

Including --dry-run means emails will not be sent.

Additionally, you can test just on Glean repositories:

export COMMAND='python -m probe_scraper.runner --glean --dry-run'
make run

By default that will test against every Glean repository, which might take a while. If you want to test against just one (e.g. a new repository you're adding), you can use the --glean-repo argument to just test the repositories you care about:

export COMMAND='python -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run'
make run

Replace burnham in the example above with your repository and its dependencies.

You can also do the dry-run locally:

python -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run

Module overview

The module is built around the following data flow:

  • scrape registry files from mozilla-central, or clone files from the git repositories listed in repositories.yaml
  • extract probe data from the files
  • transform probe data into output formats
  • save to disk

The code layout consists mainly of:

  • probe_scraper
    • runner.py - the central script, ties the other pieces together
    • scrapers
      • buildhub.py - pull build info from the BuildHub service
      • moz_central_scraper.py - loads probe registry files for multiple versions from mozilla-central
      • git_scraper.py - loads probe registry files from a git repository (no version or channel support yet, just per-commit)
    • parsers/ - extract probe data from the registry files
    • transform_*.py - transform the extracted raw data into output formats
  • tests/ - the unit tests

Accessing the data files

The processed probe data is serialized to the disk in a directory hierarchy starting from the provided output directory. The directory layout resembles a REST-friendly structure.

|-- product
    |-- general
    |-- revisions
    |-- channel (or "all")
        |-- ping type
            |-- probe type (or "all_probes")

For example, all the JSON probe data in the main ping for the Firefox Nightly channel can be accessed with the following path: firefox/nightly/main/all_probes. The probe data for all the channels (same product and ping) can be accessed instead using firefox/all/main/all_probes.

The root directory for the output generated from the scheduled job can be found at https://probeinfo.telemetry.mozilla.org/. All the probe data for Firefox coming from the main ping can be found at https://probeinfo.telemetry.mozilla.org/firefox/all/main/all_probes.
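
Since the output is plain JSON over HTTPS, it can be consumed with any HTTP client; a minimal sketch using Python's requests:

import requests

# All Firefox probes reported in the "main" ping, across all channels.
url = "https://probeinfo.telemetry.mozilla.org/firefox/all/main/all_probes"
probes = requests.get(url).json()
print(f"{len(probes)} probes in the main ping")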

Accessing Glean metrics data

Glean data is generally laid out as follows:

|-- glean
    |-- repositories
    |-- general
    |-- repository-name
        |-- general
        |-- metrics

For example, the data for a repository called fenix would be found at /glean/fenix/metrics. The time the data was last updated for that project can be found at /glean/fenix/general.

A list of available repositories is at /glean/repositories.

probe-scraper's People

Contributors

akkomar, badboy, benwu, brizental, chutten, dependabot[bot], dexterp37, eu9ene, fbertsch, gabrielluong, georgf, github-actions[bot], gleonard-m, haroldwoo, hrdktg, iinh, jhabarsingh, jklukas, lilylme, marlene-m-hirose, mdboom, mreid-moz, perrymcmanis144, relud, scholtzan, sean-rose, travis79, whd, wlach, zzzeid


probe-scraper's Issues

Validate probe data

We need to validate the probe data for correctness, e.g. by manually cross-checking the history of a few probes of each type.

Add instructions for how to perform a dry-run for testing

As seen in #49, added tests may pass even though running against real data results in improper data.

It'd be nice to include in the README instructions for how to run the scraper quickly and locally.

As @fbertsch put it "just head to the probe-scraper repo and run: python probe_scraper/runner.py. If it's taking too long (it will), change MIN_FIREFOX_VERSION in scrapers/moz_central_scraper.py to e.g. 59 so it doesn't pull down as many files."

Proposal for: GCP Migration, CircleCi Migration, and Running Dependency/Probe checks

These 3 issues (GCP, CircleCI, and Dependency/Probe checking) are all related in that they require Docker integration; dependency checking specifically needs GKE integration.

The work can be done in this order; so e.g. we can be building/testing the container on CI, but still running on EMR while we change the deploy to GCP.

Local Testing and CI

For local testing and CI, we will move to a Docker workflow. This will include building a container with all of the dependencies, running tests and lint on that container, and updating CI to build, test, and deploy that container. This should follow the Dockerflow example.

This will require adding:

  • Dockerfile
  • Makefile
  • docker-compose (optional, but nice)
  • pinned requirements
  • circle-ci config
  • Dockerhub creds to circle config

Running on GCP

GCP will also run on that container. We will use the GKE Pod Operator, and use the image that CI deploys. To run this on GCP, we need to add an entrypoint script that runs the probe-scraper locally.

We will need an associated change to telemetry-airflow to update how we're running the job. This file is the one that will be running on the container (with some changes for GCP world, e.g. GCS).

(Note that we may still need to write to s3 for the probe-info-service. I'll cc @jasonthomas here on whether there is a plan to move the probe-info-service to GCS. Once it's there we can write to GCS instead.)

Integrating with Google Kubernetes Engine

In order to check for metrics or dependencies that are present in repositories, we need a development environment to build/run/test the applications. We can do this by running on Dockerhub Images. It may be the case that there is not a stable image for our needs; in that case we may need to build and deploy them ourselves, using the existing infrastructure from "Local Testing and CI". In that case we'll need an additional Dockerfile.

When we have those images available, we can run them in the probe-scraper using the GKE Python Client. We can run in those environments and get a result (whether it is dependencies, probes, pings, etc.).

Histograms expired in minor releases are not listed in the "expired" list

The MEDIA_DECODING_PROCESS_CRASH histogram was removed from Histograms.json by this bug (Firefox 57). Up to revision 0a0a804a5b6f252f12d9808b54ed2a7f6ada27e3, it was correctly reported among the expired histograms for Firefox 57 on the release channel. However, since 3702966a64c80e17d01f613b0a464f92695524fc was scraped, it no longer shows up in that list.

We should verify what is going on and fix this.

Consider dropping Aurora channel

We should check if anyone intends to use historical Aurora channel data.
If not, we could just drop support for that channel.

Update probe parsers in mozilla-central

We made some changes in this repository to the parsers that live in mozilla-central.
Once this repository is more stable, we should merge the updates to m-c.

Add test coverage for "tags" scraping

In scraper.py we use the load_tags and extract_tag_data functions to fetch the data from the Firefox repository and extract the releases information. This needs some test coverage (a sketch follows the list below).

We could:

  • Use the "requests" library to catch the responses and provide fake data to load_tags (an example can be found here)
  • Call extract_tag_data on the fake data
  • Make sure we get the expected data out
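
A minimal sketch of such a test, assuming the responses package for mocking HTTP; the import path, URL, and function signatures below are assumptions, not the actual code:

import responses

from probe_scraper.scrapers import moz_central_scraper  # assumed import path

# Minimal fake json-tags payload; the real response has many more fields.
FAKE_TAGS = {"tags": [{"tag": "FIREFOX_57_0_RELEASE", "node": "0a0a804a5b6f"}]}

@responses.activate
def test_load_and_extract_tags():
    # Serve canned JSON instead of hitting the Firefox repository (URL is illustrative).
    responses.add(
        responses.GET,
        "https://hg.mozilla.org/releases/mozilla-release/json-tags",
        json=FAKE_TAGS,
        status=200,
    )
    tags = moz_central_scraper.load_tags("release")  # assumed signature
    data = moz_central_scraper.extract_tag_data(tags, "release")  # assumed signature
    # Make sure we get the expected releases information out.
    assert data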

Add test coverage for correct handling of historic probe registries

We should add test coverage that tests that the output for old, real Histograms.json, Scalars.yaml, ... is correct.
E.g. taking files from the oldest supported Firefox version, the current newest, and a few at important key points in between (e.g. when properties were added/removed).

Change existing git repo parsing to Glean parsing

We currently have a setup to include the types of files we expect for Desktop: Histograms.json, Scalars.yaml, etc., but for Github repos. The Glean work is replacing that effort, so we need to update the code to parse metrics.yaml files instead.

This work will encompass:

  • Parsing and writing out information on Glean probes to the probe-dictionary
  • Parsing and writing out information on Glean pings to the probe-dictionary

The former will be used for schema creation. The latter will be used for validation and table names in the pipeline.

We can use the Glean Parser library to parse the metrics.yaml files and write out the results. The rationale for letting the Glean Parser serialize is that the scraper then doesn't need to have intimate knowledge of which fields are required and which may be updated; instead we can just add and remove fields from the parser.

Scrape bug numbers from probes that have them

Some probes have bug_numbers fields (in fact, all new ones are required to have them, and owners are encouraged to add them when they are updated). This seems like useful information we can surface in the probe scraper.

Marking this "help wanted" for mentorship. @Dexterp37 and @georgf have offered to mentor this bug.

After this is complete, we can file a bug in https://github.com/mozilla/telemetry-dashboard for augmenting Probe Dictionary to display this information and linkify the bug numbers to make it easy to flow from the dictionary to bugzilla to learn more.

Remove scraping `json-tags` and get all revisions from BuildHub

We currently get revisions from both Buildhub and json-tags, but Buildhub is the more canonical source, and it stores every build we have had (including nightly). We should remove the json-tags scraping and just use that source of revision data.

Use the mozparsers package

In bug 1282098 we exported the python_mozparsers package that contains the parsers we use for parsing probes.

The probe scraper should depend on that package rather than forking the ones in m-c.

This issue is about backporting the changes that were implemented in the probe scraper fork of the parsers upstream and then depending on the exported package.

Refactor this Code Base

This code base is largely about:

  • Scraping information about revisions
  • Reading probe information from those revisions
  • Combining probe information
  • Writing it back out

Given that we recently moved to Python 3, this code base could use a serious uplift by utilizing some of the nice features available. For example, the probe type could be a Dataclass that knows how to compare itself to others, and knows how to serialize itself into the final JSON output.

A revision class could compare itself to other revisions, based on push-date or version. This can be used to decide first and last revisions/versions/dates for probes.

Together, these would simplify transform_probes.py.
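
A minimal sketch of what such classes could look like; the names and fields below are assumptions, not the existing code:

from dataclasses import dataclass, field

@dataclass(order=True)
class Revision:
    # Revisions compare by push date, which decides first/last
    # revisions, versions, and dates for probes.
    push_date: int
    node: str = field(compare=False)
    version: int = field(compare=False)

@dataclass
class Probe:
    name: str
    probe_type: str
    definition: dict

    def to_json(self) -> dict:
        # Serialize into the shape written to the final JSON output.
        return {"name": self.name, "type": self.probe_type, **self.definition}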

All logic should be moved out of runner.py and integrated into appropriate places, and that should exist just to provide the scraper CLI.

Glean probes should use the latest commit in addition to file changes

Using the latest commit would allow us to get an accurate date range for a probe: even if a probe last changed 5 commits ago it may still be available, but currently the date is taken from the last commit on which the metrics.yaml file changed.

This isn't urgent, since we can order Glean probes by their first date.

Scrape and publish Glean `pings.yaml`

This can be used in a few places:

  • Mozilla-schema-generator, which currently takes a backwards approach: it needs all pings before it can get all probes, but it has to derive the pings from the probes
  • bigquery-etl, to easily generate queries on every table (or view across them all)

This should scrape pings.yaml, and publish a history based on the pings present. I would imagine this lives in /glean/$REPO/pings.

Rename "optout" property

In the output data we use the old "optout" terminology for probes.
We should change this to something matching the new semantics, e.g. "collected_on_release".

Currently the main consumer of this is the probe-dictionary, although we should check with the data tools team.
We can migrate easily by first adding the new property, then fix the consumers, then removing the old property.

Handle "historical" glean metrics file

The scraper currently imports the latest version of the Glean parsers. However, if backwards-incompatible format changes are introduced in the parsers, the scraper might break.

We should fix this. Among the possible solutions:

  • do version pinning for the parsers;
  • have two CI environments;
  • include a version in the metrics file.

Deliver JSON files gzip encoded

Loading performance is currently a pain point for the data explorer site.

The bulk of the wait is on probes.json, which is ~8MB and delivered uncompressed from the public S3 bucket.

I can think of two quick fixes here:

  • gzip encode in the S3 bucket (see bottom here, cuts my load times from 6s to 2s)
  • publish them through a CloudFront distribution

@mreid-moz , does that sound right?
Any recommendations over either?
If CF, do i need to file a bug $somewhere to make it happen?
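
For the first option, a rough sketch of what gzip-encoding an object in S3 could look like with boto3 (bucket name and key are placeholders):

import gzip
import json

import boto3

def upload_gzipped_json(data, bucket="example-bucket", key="firefox/all/main/all_probes"):
    body = gzip.compress(json.dumps(data).encode("utf-8"))
    # Content-Encoding lets browsers and CDNs transparently decompress the response.
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
        ContentEncoding="gzip",
    )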

Include "first_added" dates in probe-info-service

Description of Problem

For the Main Ping Processing in GCP, we plan on including every historical probe, so that they can continue to be queried even after they expire.

To do this, we need a total ordering of probes, with new probes appended. For example, if the current set of probes was {SEARCH_COUNTS, GC_MS}, but in tomorrow's nightly we added {PAGELOAD_IS_SSL}, I would want the probe ordering to be:

Today's: [GC_MS, SEARCH_COUNTS]
Tomorrow's: [GC_MS, SEARCH_COUNTS, PAGELOAD_IS_SSL]

first_added field

The easiest way to accomplish this is to include a "first_added" timestamp. If any probes share a "first_added" timestamp, they will be sorted alphanumerically. Some probes (e.g. STYLO_PARALLEL_RESTYLE_FRACTION_WEIGHTED_STYLED) only showed up on release, and others (e.g. GC_MS) only show up on pre-release. As such we can include a "first_added" date for each channel:

"GC_MS": {
   "history": {...},
    "name": "GC_MS",
    "type": "histogram",
    "first_added": {
        "nightly": 1547013850,
        "beta": 1547014850
     }
}

We can achieve this by:

  1. Scrape all revisions from each probe file, e.g. Histograms.json
  2. Iterate over each revision, getting the list of new probes added in that revision
  3. Get the pushdate from json-rev
  4. Join that data with the existing probe-info-service data

The only piece I don't know how to do is: given the revision, get the associated json-rev file.
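
For step 3, a hedged sketch of fetching the pushdate for a revision; the json-rev URL pattern and the shape of the response are assumptions about hg.mozilla.org's JSON API:

import requests

def get_pushdate(revision, repo="https://hg.mozilla.org/mozilla-central"):
    # e.g. https://hg.mozilla.org/mozilla-central/json-rev/<revision>
    resp = requests.get(f"{repo}/json-rev/{revision}")
    resp.raise_for_status()
    # "pushdate" is assumed to be a [unix_timestamp, tz_offset] pair.
    return resp.json()["pushdate"][0]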

Requesting Update of 'shared_telemetry_util.py'

In bug 1363725 (https://bugzilla.mozilla.org/show_bug.cgi?id=1363725), the values all_child and all_childs were changed to all_children. But this caused a conflict with 'shared_telemetry_util.py' bundled with probe-scraper.

So we have just updated the '.py' file for now. After this file is updated on both ends, we will change the values in the .yaml and .json files.

Requesting an update of 'shared_telemetry_util.py' in probe-scraper.
(I found it's the only file that mentions 'all_child'; please tell me if something else also needs to be changed.)

Handle multiple app_ids

Fenix will be releasing multiple versions with different app_ids. Those all need to be included here. It should be straightforward to make that an array rather than a string.

CODE_OF_CONDUCT.md file missing

As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

Break down the output by release channel

In order to make the scraper future proof and to support other internal products we need to break down the output by channel (Nightly, Aurora, Beta, Release, ESR).

We will still need to support the monolithic output mode with all the probes from all the channels to make it easy for the data-explorer to consume it.

Glean metrics not being properly recorded

Recently we added all_metrics to send_in_pings for the Glean error probes. This had the desired effect: on the next run of the probe-scraper, there was a new definition in history for those probes, where send_in_pings was [all_pings].

However, now the old definition has taken precedence: https://probeinfo.telemetry.mozilla.org/glean/glean/metrics

This can be seen by looking at dates/last, where the definition with send_in_pings as [metrics] has a later date. We want the dates/last for the metrics version of the definition to be before the dates/first for the all_pings version of the definition.

Add dependencies for reference browser

The reference browser integrated lib-crash, but the probe-scraper doesn't know about it, so the BQ schema doesn't have those metrics. We need to make a scrapable dependencies file (à la Fenix) and add it here.

If that's too much work, we could consider manually stating the dependencies (for a short-term solution - or it could be long-term if we realize we're not going to be adding many dependencies in the future).

cc @mdboom @travis79

Save update time to general.json

While we save the update date to general.json as "lastUpdate", it would help to know at what time of day the update happened.
We could just change that to be a full ISO date string, including time and timezone info.
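
A minimal sketch of the proposed change (the lastUpdate key comes from the issue; the rest is illustrative):

import json
from datetime import datetime, timezone

general = {"lastUpdate": datetime.now(timezone.utc).isoformat()}
# e.g. {"lastUpdate": "2019-01-09T06:44:10.123456+00:00"}
with open("general.json", "w") as f:
    json.dump(general, f)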

Wiki changes

FYI: The following changes were made to this repository's wiki:

These were made as the result of a recent automated defacement of publicly writable wikis.

Deal with EOL Whitespace in outputted JSON

According to the json.dump docs, the default separators value for Python 2 is (', ', ': '), even with a non-default indent value.

This means before the \n at the end of lines with terminal commas there is a space. EOL Whitespace! Sound the alarm!

This was fixed in Python 3.4, but since the current production job uses python (Python 2) rather than python3, we need to specify separators=(',', ': ') in json.dump, as the docs recommend.
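
A minimal example of the proposed fix:

import json

data = {"name": "GC_MS", "type": "histogram"}
with open("probes.json", "w") as f:
    # Explicit separators drop the trailing space json.dump adds after commas on Python 2.
    json.dump(data, f, indent=2, separators=(",", ": "))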

Print some stats before saving probe data to disk

To make it easier to judge whether things worked correctly, it would be great to print some basic stats before saving the outputs to disk in runner.py (a sketch follows the list below).
E.g.:

  • how many probes of each type we have (histograms, scalars, events, ...)
  • how many versions/revisions we extracted data for
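
A minimal sketch of what this could look like; the shape of the probe data assumed below is illustrative:

from collections import Counter

def print_probe_stats(probes_by_channel):
    # probes_by_channel: {channel: {probe_id: {"type": ..., "history": ...}}}
    for channel, probes in probes_by_channel.items():
        type_counts = Counter(p["type"] for p in probes.values())
        print(f"{channel}: {len(probes)} probes")
        for probe_type, count in sorted(type_counts.items()):
            print(f"  {probe_type}: {count}")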

Move to Circle CI

In general we've been trying to move our repos to Circle CI rather than Travis. We have a paid account on Circle and have had a lot of success on their platform.
