mozilla / probe-scraper
Scrape and publish Telemetry probe data from Firefox
Home Page: https://mozilla.github.io/probe-scraper/
License: Mozilla Public License 2.0
Recently we added all_metrics to send_in_pings for the glean error probes. This had the desired effect: on the next run of the probe-scraper, there was a new definition in history for those probes, where send_in_pings was [all_pings].
However, now the old definition has taken precedence: https://probeinfo.telemetry.mozilla.org/glean/glean/metrics
This can be seen by looking at dates/last, where the definition with send_in_pings as [metrics] has a later date. We want the dates/last for the metrics version of the definition to be before the dates/first for the all_pings version of the definition.
Loading performance is currently a pain point for the data explorer site.
The bulk of the wait is on probes.json, which is ~8MB and delivered uncompressed from the public S3 bucket.
I can think of two quick fixes here:
@mreid-moz, does that sound right?
Any recommendations on either?
If CF, do I need to file a bug $somewhere to make it happen?
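For illustration, serving probes.json compressed could be as simple as setting Content-Encoding at upload time. A minimal sketch using boto3; the bucket and key names below are placeholders, not the real ones:

import gzip
import json

import boto3

def upload_probes_gzipped(probes, bucket="example-probe-info-bucket",
                          key="firefox/all/main/all_probes"):
    # Gzip the JSON payload and tell S3 (and browsers) about the encoding,
    # so clients decompress it transparently.
    body = gzip.compress(json.dumps(probes).encode("utf-8"))
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
        ContentEncoding="gzip",
    )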
As the probe counts for specific components continue to grow, support for pagination in the probe fetching API will become more valuable.
We need to validate the probe data for correctness, e.g. by manually cross-checking the history of a few probes of each type.
According to the json.dump docs, the default separators value for Python 2 is (', ', ': '), even with a non-default indent value.
This means that before the \n at the end of lines with terminal commas there is a space. EOL whitespace! Sound the alarm!
This was fixed in Python 3.4, but since the current production job uses python instead of python3 (and the docs recommend it anyway), we need to specify separators=(',', ': ') in json.dump.
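For illustration, the workaround looks like this:

import json

data = {"histograms": {"GC_MS": {"kind": "exponential"}}}

with open("output.json", "w") as f:
    # Without explicit separators, Python 2's json.dump emits ", " even at
    # line ends when indent is set, leaving trailing whitespace.
    json.dump(data, f, indent=2, separators=(",", ": "), sort_keys=True)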
We currently get revisions from both Buildhub and json-tags, but the former is a more canonical source and stores every build we have had (including nightly). We should remove the json-tags scraping and just use Buildhub as the source of revision data.
FYI: The following changes were made to this repository's wiki:
defacing spam has been removed
Restricting write access to contributors is strongly encouraged. Please make that change (documentation).
These were made as the result of a recent automated defacement of publicly writable wikis.
The reference browser integrated lib-crash, but the probe-scraper doesn't know about it, so the BQ schema doesn't have those metrics. We need to make a scrapable dependencies file (à la Fenix) and add it here.
If that's too much work, we could consider manually stating the dependencies (for a short-term solution - or it could be long-term if we realize we're not going to be adding many dependencies in the future).
Information about which process a particular histogram is recorded in is currently not extracted by the probe scraper.
The MEDIA_DECODING_PROCESS_CRASH histogram was removed from the histogram.json file by this bug (Firefox 57). Up to revision 0a0a804a5b6f252f12d9808b54ed2a7f6ada27e3, it was correctly reported among the expired histograms for Firefox 57 in the release channel. However, since 3702966a64c80e17d01f613b0a464f92695524fc was scraped, it is no longer showing up in that list.
We should verify what is going on and fix this.
When the probe_scraper sees all_pings, it should instead transform that to a list of all the pings. This way downstream dependencies don't need to know about the Glean keyword.
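A minimal sketch of that transformation; the list of ping names is hypothetical and in practice would come from the scraped ping definitions for the repository:

def expand_send_in_pings(send_in_pings, known_pings):
    # Replace the Glean "all_pings" keyword with the concrete ping list.
    # `known_pings` is assumed to be the set of ping names known for the
    # repository; the names in the usage example are only illustrative.
    if "all_pings" in send_in_pings:
        return sorted(known_pings)
    return send_in_pings

# Hypothetical usage:
print(expand_send_in_pings(["all_pings"], {"baseline", "events", "metrics"}))
# ['baseline', 'events', 'metrics']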
To make it easier to judge whether things worked correctly, it would be great to print some basic stats before saving the outputs to disk in runner.py (see the sketch below for the kind of summary).
related: mozilla/telemetry-dashboard#531
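A minimal sketch of the kind of summary that could be printed; the data structure and variable names are hypothetical, not the scraper's actual internals:

from collections import Counter

def print_summary(probes_by_channel):
    # `probes_by_channel` is a hypothetical intermediate mapping a channel
    # name to its dict of probe definitions.
    for channel, probes in sorted(probes_by_channel.items()):
        type_counts = Counter(p["type"] for p in probes.values())
        counts = ", ".join("%s: %d" % (t, n) for t, n in sorted(type_counts.items()))
        print("%s: %d probes (%s)" % (channel, len(probes), counts))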
While we save the update date to general.json as "lastUpdate", it would help to know at what time of day the update happened.
We could just change that to be a full ISO date string, including time and timezone info.
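For example, a straightforward way to produce such a string with the standard library:

from datetime import datetime, timezone

# Full ISO 8601 string with time and timezone info,
# e.g. "2019-01-09T06:30:12+00:00"
last_update = datetime.now(timezone.utc).isoformat()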
To support mozilla/probe-dictionary#60, we need to add this info to the probe-info-service.
We should have a README file that briefly explains the file format.
The event parser doesn't extract descriptions yet, we need to fix that.
In the output data we use the old "optout" terminology for probes.
We should change this to something matching the new semantics, e.g. "collected_on_release".
Currently the main consumer of this is the probe-dictionary, although we should check with the data tools team.
We can migrate easily by first adding the new property, then fix the consumers, then removing the old property.
Old histogram.json files contain non-numeric values and expressions for fields such as "n_values". We need to properly handle these and convert the literals to useful values.
MDV2 needs this to display those histograms.
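A minimal sketch of converting such literals, assuming the expressions are simple arithmetic like "2**10"; real files may also reference named constants, which would need a lookup table on top of this:

import ast
import operator

# Only allow a small set of arithmetic operators when evaluating
# expressions such as "2**10" or "4 * 250".
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Pow: operator.pow}

def parse_numeric_field(value):
    # Turn an n_values-style field into a number, handling plain ints and
    # simple arithmetic expressions. Illustrative sketch only.
    if isinstance(value, int):
        return value
    node = ast.parse(str(value), mode="eval").body
    return _eval(node)

def _eval(node):
    if isinstance(node, ast.Constant):   # plain number literal (Python 3.8+)
        return node.value
    if isinstance(node, ast.Num):        # older Python versions
        return node.n
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("Unsupported n_values expression")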
In bug 1363725 (https://bugzilla.mozilla.org/show_bug.cgi?id=1363725),
the values all_child and all_childs were changed to all_children.
But it caused a conflict with 'shared_telemetry_util.py' bundled with probe-scraper, so we have just updated the '.py' file for now. After this file is updated on both ends, we will change the values in the .yaml and .json files.
Requesting an update to 'shared_telemetry_util.py' in probe-scraper.
(I found it's the only file that mentions 'all_child'. Please tell me if something else also needs to be changed.)
In bug 1282098 we exported the python_mozparsers package that contains the parsers we use for parsing probes.
The probe scraper should depend on that package rather than forking the ones in m-c.
This issue is about upstreaming the changes that were implemented in the probe-scraper fork of the parsers and then depending on the exported package.
It's easy to make small, unnoticeable mistakes. Validating against a schema would not allow this.
These 3 issues (GCP, CircleCI, and Dependency/Probe checking) are all related in that they require Docker integration; dependency checking specifically needs GKE integration.
The work can be done in this order; so e.g. we can be building/testing the container on CI, but still running on EMR while we change the deploy to GCP.
For local testing and CI, we will move to a Docker workflow. This will include building a container with all of the dependencies, running tests and lint on that container, and updating CI to build, test, and deploy that container. This should follow the Dockerflow example.
This will require adding:
GCP will also run on that container. We will use the GKE Pod Operator, and use the image that CI deploys. To run this on GCP, we need to add an entrypoint script that runs the probe-scraper locally.
We will need an associated change to telemetry-airflow to update how we're running the job. This file is the one that will be running on the container (with some changes for GCP world, e.g. GCS).
(Note that we may still need to write to s3 for the probe-info-service. I'll cc @jasonthomas here on whether there is a plan to move the probe-info-service to GCS. Once it's there we can write to GCS instead.)
In order to check for metrics or dependencies that are present in repositories, we need a development environment to build/run/test the applications. We can do this by running on Docker Hub images. It may be the case that there is not a stable image for our needs; in that case we may need to build and deploy them ourselves, using the existing infrastructure from "Local Testing and CI". In that case we'll need an additional Dockerfile.
When we have those images available, we can run them in the probe-scraper using the GKE Python Client. We can run in those environments and get a result (whether it is dependencies, probes, pings, etc.).
This code base is largely about:
Given that we recently moved to Python 3, this code base could use a serious uplift by utilizing some of the nice features available. For example, the probe type could be a Dataclass that knows how to compare itself to others, and knows how to serialize itself into the final JSON output.
A revision class could compare itself to other revisions, based on push-date or version. This can be used to decide first and last revisions/versions/dates for probes.
Together, these would simplify transform_probes.py.
All logic should be moved out of runner.py and integrated into appropriate places; runner.py should exist just to provide the scraper CLI.
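A rough sketch of what such classes might look like; the field names and shapes here are assumptions for illustration, not the actual schema:

from dataclasses import dataclass, field
from functools import total_ordering

@total_ordering
@dataclass(frozen=True)
class Revision:
    # Orders itself by push date, then version (illustrative fields).
    revision: str
    version: int
    push_date: str  # ISO date string

    def __lt__(self, other):
        return (self.push_date, self.version) < (other.push_date, other.version)

@dataclass
class Probe:
    # Knows how to serialize itself into the final JSON output.
    name: str
    type: str
    history: dict = field(default_factory=dict)

    def to_json(self):
        return {"name": self.name, "type": self.type, "history": self.history}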
In order to make the scraper future proof and to support other internal products we need to break down the output by channel (Nightly, Aurora, Beta, Release, ESR).
We will still need to support the monolithic output mode with all the probes from all the channels to make it easy for the data-explorer to consume it.
In order to mimic a REST API, let's drop the .json extension from these two files as well.
We currently have a setup to include the types of files we expect for Desktop (Histograms.json, Scalars.yaml, etc.), but for GitHub repos. The Glean work is replacing that effort, so we need to update the code to parse metrics.yaml files instead.
This work will encompass:
The former will be used for schema creation. The latter will be used for validation and table names in the pipeline.
We can use the Glean Parser library to parse the metrics.yaml files and write out results. The rationale for letting the Glean Parser serialize is that the scraper then doesn't need to have intimate knowledge of which fields are required and which may be updated; instead we can just add and remove fields in the parser.
(This came out of telemetry-dashboard #543.)
The histogram DEVTOOLS_TOOLBOX_TIME_ACTIVE_SECONDS is showing up as recorded from 59 on release.
This seems to actually come from the source data, i.e. probe-scraper.
Another example where recording ranges are off is the scalar telemetry.discarded.child_events.
This is also a problem in the probe-scraper data.
Per bug 1369041, this probe should be on 56+ and not on 55 for release & beta.
Using the latest commit will allow us to get an accurate date range for a probe, since even if it changed 5 commits ago, it may still be available, but the date will be for the last commit the metrics.yaml file changed on.
This isn't urgent, since we can order glean probes by first.
Some probes have bug_numbers fields (in fact, all new ones are required to have them, and owners are encouraged to add them when they are updated). This seems like useful information we can surface in the probe scraper.
Marking this "help wanted" for mentorship. @Dexterp37 and @georgf have offered to mentor this bug.
After this is complete, we can file a bug in https://github.com/mozilla/telemetry-dashboard for augmenting Probe Dictionary to display this information and linkify the bug numbers to make it easy to flow from the dictionary to bugzilla to learn more.
In scraper.py we use the load_tags and extract_tag_data functions to fetch the data from the Firefox repository and extract the releases information. This needs some test coverage.
We could:
Fenix will be releasing multiple versions with different app_ids. Those all need to be included here. It should be straightforward to make that an array rather than a string.
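A minimal sketch of handling both shapes while repositories are migrated; the normalization helper and the example entries are hypothetical:

def app_ids_for(repo_config):
    # Accept either a single app_id string or a list of app_ids.
    app_id = repo_config.get("app_id", [])
    return app_id if isinstance(app_id, list) else [app_id]

# Hypothetical Fenix-style entries:
print(app_ids_for({"app_id": "org.mozilla.fenix"}))
print(app_ids_for({"app_id": ["org.mozilla.fenix", "org.mozilla.fenix.nightly"]}))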
As discussed here, requesting "probes active in the current latest release in each channel" cannot be done easily.
We should make that easier from the API.
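For context, a rough sketch of the kind of filtering a consumer has to do today, which is why doing it server-side in the API would be friendlier. The history/versions layout here is a simplified approximation of the probe data, not the exact schema:

def active_in_latest_release(probes, latest_version, channel="release"):
    # Return probe names whose most recent definition still covers
    # `latest_version` on the given channel (simplified, illustrative).
    active = []
    for name, probe in probes.items():
        history = probe.get("history", {}).get(channel, [])
        if not history:
            continue
        latest_definition = history[0]  # assumed newest-first
        if int(latest_definition["versions"]["last"]) >= latest_version:
            active.append(name)
    return active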
For the Main Ping Processing in GCP, we plan on including every historical probe, so that they can continue to be queried even after they expire.
To do this, we need a total ordering of probes, with new probes appended. For example, if the current set of probes was {SEARCH_COUNTS, GC_MS}, but in tomorrow's nightly we added {PAGELOAD_IS_SSL}, I would want the probe ordering to be:
Today's: [GC_MS, SEARCH_COUNTS]
Tomorrow's: [GC_MS, SEARCH_COUNTS, PAGELOAD_IS_SSL]
first_added field
The easiest way to accomplish this is to include a "first_added" timestamp. If any probes share a "first_added" timestamp, they will be sorted alphanumerically. Some probes (e.g. STYLO_PARALLEL_RESTYLE_FRACTION_WEIGHTED_STYLED) only showed up on release, and others (e.g. GC_MS) only show up on pre-release. As such we can include a "first_added" date for each channel:
"GC_MS": {
"history": {...},
"name": "GC_MS",
"type": "histogram",
"first_added": {
"nightly": 1547013850,
"beta": 1547014850
}
}
We can achieve this by getting the pushdate from json-rev.
The only piece I don't know how to do is: given the revision, get the associated json-rev file.
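A minimal sketch of deriving the total ordering from that field, assuming the per-channel first_added structure shown above:

def ordered_probes(probes, channel="nightly"):
    # Sort by the per-channel first_added timestamp, breaking ties
    # alphanumerically by probe name. New probes, having later
    # timestamps, therefore always append after existing ones.
    with_channel = [
        (p["first_added"][channel], name)
        for name, p in probes.items()
        if channel in p.get("first_added", {})
    ]
    return [name for _, name in sorted(with_channel)]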
The scraper currently imports the latest version of glean-parsers. However, if non-backwards-compatible format changes are introduced in the parsers, the scraper might break.
We should fix this. Among the possible solutions:
Hello! I'm not sure if this is the right place to document this, and it may be a dupe of #30, but:
Please let me know if this isn't helpful. Thanks!
We should add test coverage that tests that the output for old, real Histograms.json, Scalars.yaml, ... is correct.
E.g. taking files from the oldest supported Firefox version, the current newest, and a few at important key points in between (e.g. when properties were added/removed).
Currently we include all probes for the release channel data, even when they are only ever recorded on prerelease.
It probably makes more sense to just not include prerelease probe data for release.
@Dexterp37 @fbertsch Thoughts?
Currently the probe scraper runs on a new machine every time, making the cache largely unused.
We should check if anyone intends to use historical Aurora channel data.
If not, we could just drop support for that channel.
This can be used in a few places:
This should scrape pings.yaml, and publish a history based on the pings present. I would imagine this lives in /glean/$REPO/pings.
In general we've been trying to move our repos to Circle CI rather than Travis. We have a paid account on Circle and have had a lot of success on their platform.
We should schedule this job to run on mozilla/telemetry-airflow.
As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:
If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].
(Message COC001)
We made some changes in this repository to the parsers that live in mozilla-central.
Once this repository is more stable, we should merge the updates to m-c.
We should enable CI on this repo. At least flake8 coverage!
As seen in #49, added tests may pass even though running against real data results in improper data.
It'd be nice to include in the README instructions for how to run the scraper quickly and locally.
As @fbertsch put it: "just head to the probe-scraper repo and run python probe_scraper/runner.py. If it's taking too long (it will), change MIN_FIREFOX_VERSION in scrapers/moz_central_scraper.py to e.g. 59 so it doesn't pull down as many files."