
probe-scraper's Issues

Glean metrics not being properly recorded

Recently we added all_metrics to send_in_pings for the glean error probes. This had the desired effect: on the next run of the probe-scraper there was a new definition in history for those probes, where send_in_pings was [all_pings].

However, now the old definition has taken precedence: https://probeinfo.telemetry.mozilla.org/glean/glean/metrics

This can be seen by looking at dates/last, where the definition with send_in_pings as [metrics] has a later date. We want the dates/last for the metrics version of the definition to be before the dates/first for the all_pings version of the definition.

Deliver JSON files gzip encoded

Loading performance is currently a pain point for the data explorer site.

The bulk of the wait is on probes.json, which is ~8MB and delivered uncompressed from the public S3 bucket.

I can think of two quick fixes here:

  • gzip encode in the S3 bucket (see bottom here, cuts my load times from 6s to 2s); a boto3 sketch is at the end of this issue
  • publish them through a CloudFront distribution

@mreid-moz, does that sound right?
Any recommendation between the two?
If CF, do I need to file a bug $somewhere to make it happen?
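
For the gzip option, here is a minimal sketch using boto3 (the bucket and key names are placeholders, not the real production values):

import gzip
import json

import boto3

def upload_gzipped_json(data, bucket, key):
    """Serialize `data` to JSON, gzip it, and upload with Content-Encoding set."""
    body = gzip.compress(json.dumps(data).encode("utf-8"))
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
        ContentEncoding="gzip",  # browsers decompress transparently
    )

# Placeholder names, for illustration only.
upload_gzipped_json({"probe": "example"}, "example-probe-info-bucket", "firefox/all/main/all_probes")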

Validate probe data

We need to validate the probe data for correctness, e.g. by manually cross-checking the history of a few probes of each type.

Deal with EOL Whitespace in outputted JSON

According to the json.dump docs, the default separators value for Python 2 is (', ', ': '), even with a non-default indent value.

This means that on lines ending with a comma there is a space before the \n. EOL whitespace! Sound the alarm!

This was fixed in Python 3.4, but since the current production job uses python (2) rather than python3, we need to specify separators=(',', ': ') explicitly in json.dump, as the docs recommend.
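
A minimal illustration of the fix (the data and filename are just for the example):

import json

data = {"EXAMPLE_HISTOGRAM": {"kind": "enumerated", "n_values": 11}}

# Passing separators explicitly avoids the trailing space that Python 2's
# default (', ', ': ') emits before each newline when indent is set.
with open("probes.json", "w") as f:
    json.dump(data, f, indent=2, separators=(",", ": "), sort_keys=True)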

Remove scraping `json-tags` and get all revisions from BuildHub

We currently get revisions from both Buildhub and json-tags, but the former is the more canonical source and also stores every build we have ever had (including nightlies). We should remove the json-tags scraping and use only that source of revision data.

Wiki changes

FYI: The following changes were made to this repository's wiki:

These were made as the result of a recent automated defacement of publicly writable wikis.

Add dependencies for reference browser

The reference browser integrated lib-crash, but the probe-scraper doesn't know about it, so the BQ schema doesn't have those metrics. We need to make a scrapable dependencies file (ala Fenix) and add it here.

If that's too much work, we could consider manually stating the dependencies (for a short-term solution - or it could be long-term if we realize we're not going to be adding many dependencies in the future).

cc @mdboom @travis79

Histograms expired in minor releases are not listed in the "expired" list

The MEDIA_DECODING_PROCESS_CRASH histogram was removed from the Histograms.json file by this bug (Firefox 57). Up to revision 0a0a804a5b6f252f12d9808b54ed2a7f6ada27e3, it was correctly reported among the expired histograms for Firefox 57 on the release channel. However, since 3702966a64c80e17d01f613b0a464f92695524fc was scraped, it is no longer showing up in that list.

We should verify what is going on and fix this.

Print some stats before saving probe data to disk

To make it easier to judge whether things worked correctly, it would be great to print some basic stats before saving the outputs to disk in runner.py (a sketch follows the list below).
E.g.:

  • how many probes of each type we have (histograms, scalars, events, ...)
  • how many versions/revisions we extracted data for
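
A minimal sketch of such a summary, assuming the scraped data is a nested mapping of channel → revision → probe type → probes (the real structure in runner.py may differ):

from collections import Counter

def print_probe_stats(probe_data):
    """Print per-type probe counts and the number of revisions covered."""
    type_counts = Counter()
    revisions = set()
    for channel, by_revision in probe_data.items():
        for revision, by_type in by_revision.items():
            revisions.add(revision)
            for probe_type, probes in by_type.items():
                type_counts[probe_type] += len(probes)
    for probe_type, count in sorted(type_counts.items()):
        print(f"{probe_type}: {count} probes")
    print(f"revisions covered: {len(revisions)}")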

Save update time to general.json

While we save the update date to general.json as "lastUpdate", it would help to know at what time of day the update happened.
We could just change that to be a full ISO date string, including time and timezone info.
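
One way to produce such a value (a sketch; "lastUpdate" is the existing field name):

from datetime import datetime, timezone

# e.g. "2019-03-14T16:20:00+00:00" instead of just the date
general = {"lastUpdate": datetime.now(timezone.utc).isoformat()}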

Rename "optout" property

In the output data we use the old "optout" terminology for probes.
We should change this to something matching the new semantics, e.g. "collected_on_release".

Currently the main consumer of this is the probe-dictionary, although we should check with the data tools team.
We can migrate easily by first adding the new property, then fixing the consumers, then removing the old property.
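
A rough sketch of the first migration step, assuming the in-memory probe record is a plain dict (the real serializer may handle fields differently):

def serialize_probe(probe):
    """Emit both the old and the new property during the migration window."""
    collected_on_release = bool(probe.get("optout"))
    return {
        **probe,
        "optout": collected_on_release,                # old name, removed once consumers migrate
        "collected_on_release": collected_on_release,  # new name
    }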

Requesting Update of 'shared_telemetry_util.py'

In bug 1363725 (https://bugzilla.mozilla.org/show_bug.cgi?id=1363725),
the values all_child and all_childs were changed to all_children.
This caused a conflict with the 'shared_telemetry_util.py' bundled with probe-scraper.

So we have only updated the '.py' file for now. After this file is updated on both ends, we will change
the values in the .yaml and .json files.

Requesting an update of 'shared_telemetry_util.py' in probe-scraper.
(I found it's the only file that mentions 'all_child'; please tell me if something else also needs to be changed.)

Use the mozparsers package

In bug 1282098 we exported the python_mozparsers package that contains the parsers we use for parsing probes.

The probe scraper should depend on that package rather than forking the ones in m-c.

This issue is about backporting the changes that were implemented in the probe scraper fork of the parsers upstream and then depending on the exported package.

Proposal for: GCP Migration, CircleCi Migration, and Running Dependency/Probe checks

These 3 issues (GCP, CircleCI, and Dependency/Probe checking) are all related in that they require Docker integration; dependency checking specifically needs GKE integration.

The work can be done incrementally in that order; e.g. we can be building and testing the container on CI but still running on EMR while we change the deployment to GCP.

Local Testing and CI

For local testing and CI, we will move to a Docker workflow. This will include building a container with all of the dependencies, running tests and lint on that container, and updating CI to build, test, and deploy that container. This should follow the Dockerflow example.

This will require adding:

  • Dockerfile
  • Makefile
  • docker-compose (optional, but nice)
  • pinned requirements
  • circle-ci config
  • Dockerhub creds to circle config

Running on GCP

The GCP job will also run on that container. We will use the GKE Pod Operator and the image that CI deploys. To run this on GCP, we need to add an entrypoint script that runs the probe-scraper inside the container.

We will need an associated change to telemetry-airflow to update how we're running the job. This file is the one that will run on the container (with some changes for the GCP world, e.g. GCS).

(Note that we may still need to write to S3 for the probe-info-service. I'll cc @jasonthomas here on whether there is a plan to move the probe-info-service to GCS. Once it's there we can write to GCS instead.)
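
A hedged sketch of what the telemetry-airflow side might look like with the GKE Pod Operator (Airflow 1.10-style import; the project, cluster, image, and CLI arguments are placeholders, not the real values):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

dag = DAG("probe_scraper", schedule_interval="@daily", start_date=datetime(2019, 1, 1))

probe_scraper = GKEPodOperator(
    task_id="probe_scraper",
    project_id="example-gcp-project",      # placeholder
    location="us-central1-a",              # placeholder
    cluster_name="example-gke-cluster",    # placeholder
    name="probe-scraper",
    namespace="default",
    image="mozilla/probe-scraper:latest",  # the image CI deploys
    arguments=["--out-dir", "/tmp/probe-info"],  # placeholder CLI args
    dag=dag,
)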

Integrating with Google Kubernetes Engine

In order to check for metrics or dependencies that are present in repositories, we need a development environment to build/run/test the applications. We can do this by running on Docker Hub images. There may not be a stable image for our needs; in that case we may need to build and deploy images ourselves, using the existing infrastructure from "Local Testing and CI" plus an additional Dockerfile.

Once those images are available, the probe-scraper can launch them using the GKE Python client, run in those environments, and collect a result (dependencies, probes, pings, etc.). A sketch is below.
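
A rough sketch of launching such a pod with the official kubernetes Python client (assuming cluster credentials are already configured; the image and command are placeholders for whatever build environment we end up using):

from kubernetes import client, config

# Load cluster credentials (kubeconfig for local testing).
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="dependency-check"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="gradle",
                image="example/android-build:latest",        # placeholder image
                command=["./gradlew", ":app:dependencies"],  # placeholder command
            )
        ],
    ),
)

core = client.CoreV1Api()
core.create_namespaced_pod(namespace="default", body=pod)
# Once the pod completes, the output can be read back from its logs:
# core.read_namespaced_pod_log("dependency-check", "default")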

Refactor this Code Base

This code base is largely about:

  • Scraping information about revisions
  • Reading probe information from those revisions
  • Combining probe information
  • Writing it back out

Given that we recently moved to Python 3, this code base could use a serious uplift by utilizing some of the nice features available. For example, the probe type could be a Dataclass that knows how to compare itself to others, and knows how to serialize itself into the final JSON output.

A revision class could compare itself to other revisions, based on push-date or version. This can be used to decide first and last revisions/versions/dates for probes.

Together, these would simplify transform_probes.py.
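
A rough sketch of the shape these classes could take (field names are illustrative and do not match the current output format exactly):

from dataclasses import dataclass, field, asdict

@dataclass(order=True)
class Revision:
    """A scraped revision; ordering by push date lets min()/max() pick first/last."""
    push_date: str
    version: int
    channel: str

@dataclass
class Probe:
    """A probe definition plus the revisions it was seen in."""
    name: str
    probe_type: str
    definition: dict
    revisions: list = field(default_factory=list)

    def to_json(self):
        """Serialize into the final output shape."""
        return {
            "name": self.name,
            "type": self.probe_type,
            "definition": self.definition,
            "first": asdict(min(self.revisions)),
            "last": asdict(max(self.revisions)),
        }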

All logic should be moved out of runner.py and integrated into appropriate places; runner.py should exist just to provide the scraper CLI.

Break down the output by release channel

In order to make the scraper future-proof and to support other internal products, we need to break down the output by channel (Nightly, Aurora, Beta, Release, ESR).

We will still need to support the monolithic output mode with all the probes from all the channels to make it easy for the data-explorer to consume it.

Change existing git repo parsing to Glean parsing

We currently have a setup for GitHub repos that expects the same types of files we use for Desktop: Histograms.json, Scalars.yaml, etc. The Glean work is replacing that effort, so we need to update the code to parse metrics.yaml files instead.

This work will encompass:

  • Parsing and writing out information on Glean probes to the probe-dictionary
  • Parsing and writing out information on Glean pings to the probe-dictionary

The former will be used for schema creation. The latter will be used for validation and table names in the pipeline.

We can use the Glean Parser library to parse the metrics.yaml files and write out the results. The rationale for letting the Glean Parser do the serialization is that the scraper then doesn't need intimate knowledge of which fields are required and which may be updated; instead we can just add and remove fields in the parser.

Glean probes should use the latest commit in addition to file changes

Using the latest commit will allow us to get an accurate date range for a probe: even if the metrics.yaml file last changed 5 commits ago, the probe may still be available, but currently the end date we record is the date of the last commit that touched metrics.yaml.

This isn't urgent, since we can still order Glean probes by their first date.

Scrape bug numbers from probes that have them

Some probes have bug_numbers fields (in fact, all new ones are required to have them, and owners are encouraged to add them when they are updated). This seems like useful information we can surface in the probe scraper.

Marking this "help wanted" for mentorship. @Dexterp37 and @georgf have offered to mentor this bug.

After this is complete, we can file a bug in https://github.com/mozilla/telemetry-dashboard for augmenting Probe Dictionary to display this information and linkify the bug numbers to make it easy to flow from the dictionary to bugzilla to learn more.

Add test coverage for "tags" scraping

In scraper.py we use the load_tags and extract_tag_data functions to fetch data from the Firefox repository and extract the release information. This needs some test coverage.

We could:

  • Mock the HTTP responses made through the "requests" library to provide fake data to load_tags (an example can be found here)
  • Call extract_tag_data on the fake data
  • Make sure we get the expected data out
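
A sketch of such a test using the responses mocking library; the module path, the load_tags/extract_tag_data signatures, the URL, and the fake payload shape are all assumptions about the current scraper code and would need to be checked:

import responses

from probe_scraper.scraper import load_tags, extract_tag_data  # assumed module path

FAKE_TAGS = {
    "tags": [
        {"tag": "FIREFOX_59_0_RELEASE", "node": "deadbeef", "date": [1520000000, 0]},
    ]
}

@responses.activate
def test_load_and_extract_tags():
    # Stub the hg.mozilla.org request that load_tags makes (URL is an assumption).
    responses.add(
        responses.GET,
        "https://hg.mozilla.org/releases/mozilla-release/json-tags",
        json=FAKE_TAGS,
        status=200,
    )
    tags = load_tags("release")               # assumed signature
    data = extract_tag_data(tags, "release")  # assumed signature
    assert data  # assert on the exact expected releases once the real shape is confirmed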

Handle multiple app_ids

Fenix will be releasing multiple versions with different app_ids. Those all need to be included here. It should be straightforward to make that an array rather than a string.

Include "first_added" dates in probe-info-service

Description of Problem

For the Main Ping Processing in GCP, we plan on including every historical probe, so that they can continue to be queried even after they expire.

To do this, we need a total ordering of probes, with new probes appended. For example, if the current set of probes was {SEARCH_COUNTS, GC_MS}, but in tomorrow's nightly we added {PAGELOAD_IS_SSL}, I would want the probe ordering to be:

Today's: [GC_MS, SEARCH_COUNTS]
Tomorrow's: [GC_MS, SEARCH_COUNTS, PAGELOAD_IS_SSL]

first_added field

The easiest way to accomplish this is to include a "first_added" timestamp. If any probes share a "first_added" timestamp, they will be sorted alphanumerically. Some probes (e.g. STYLO_PARALLEL_RESTYLE_FRACTION_WEIGHTED_STYLED) only showed up on release, and others (e.g. GC_MS) only show up on pre-release. As such we can include a "first_added" date for each channel:

"GC_MS": {
   "history": {...},
    "name": "GC_MS",
    "type": "histogram",
    "first_added": {
        "nightly": 1547013850,
        "beta": 1547014850
     }
}

We can achieve this by:

  1. Scrape all revisions from each probe file, e.g. Histograms.json
  2. Iterate over each revision, getting the list of new probes added in that revision
  3. Get the pushdate of that revision from json-rev
  4. Join that data with the existing probe-info-service data

The only piece I don't know how to do is: given the revision, get the associated json-rev file.
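
For step 3, a sketch of one way to do it, assuming hgweb's json-rev endpoint (my understanding is that the response includes a pushdate field of the form [unix_timestamp, tz_offset], but that should be verified):

import requests

JSON_REV_URL = "https://hg.mozilla.org/mozilla-central/json-rev/{rev}"

def get_pushdate(revision):
    """Return the unix timestamp at which `revision` was pushed."""
    resp = requests.get(JSON_REV_URL.format(rev=revision))
    resp.raise_for_status()
    pushdate = resp.json()["pushdate"]  # assumed shape: [unix_timestamp, tz_offset]
    return pushdate[0]

get_pushdate("0a0a804a5b6f252f12d9808b54ed2a7f6ada27e3")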

Handle "historical" glean metrics file

The scraper currently imports the latest version of the Glean parsers. However, if backwards-incompatible format changes are introduced in the parsers, the scraper might break.

We should fix this. Among the possible solutions:

  • do version pinning for the parsers;
  • have two CI environments;
  • include a version in the metrics file.

Add test coverage for correct handling of historic probe registries

We should add test coverage that tests that the output for old, real Histograms.json, Scalars.yaml, ... is correct.
E.g. taking files from the oldest supported Firefox version, the newest one, and a few at key points in between (e.g. when properties were added/removed).

Consider dropping Aurora channel

We should check if anyone intends to use historical Aurora channel data.
If not, we could just drop support for that channel.

Scrape and publish Glean `pings.yaml`

This can be used in a few places:

  • Mozilla-schema-generator, which currently takes a backwards approach: it needs all pings before it can get all probes, but it has to derive the pings from the probes
  • bigquery-etl, to easily generate queries on every table (or view across them all)

This should scrape pings.yaml, and publish a history based on the pings present. I would imagine this lives in /glean/$REPO/pings.

Move to Circle CI

In general we've been trying to move our repos to Circle CI rather than Travis. We have a paid account on Circle and have had a lot of success on their platform.

CODE_OF_CONDUCT.md file missing

As of January 1, 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].

(Message COC001)

Update probe parsers in mozilla-central

We made some changes in this repository to the parsers that live in mozilla-central.
Once this repository is more stable, we should merge the updates to m-c.

Add instructions for how to perform a dry-run for testing

As seen in #49, added tests may pass even though running against real data results in improper data.

It'd be nice to include in the README instructions for how to run the scraper quickly and locally.

As @fbertsch put it "just head to the probe-scraper repo and run: python probe_scraper/runner.py. If it's taking too long (it will), change MIN_FIREFOX_VERSION in scrapers/moz_central_scraper.py to e.g. 59 so it doesn't pull down as many files."
