mozilla / probe-scraper
Scrape and publish Telemetry probe data from Firefox
Home Page: https://mozilla.github.io/probe-scraper/
License: Mozilla Public License 2.0
Recently we added all_metrics to send_in_pings for the glean error probes. This had the desired effect: on the next run of the probe-scraper, there was a new definition in history for those probes, where send_in_pings was [all_pings].
However, now the old definition has taken precedence: https://probeinfo.telemetry.mozilla.org/glean/glean/metrics
This can be seen by looking at dates/last, where the definition with send_in_pings as [metrics] has a later date. We want the dates/last for the metrics version of the definition to be before the dates/first for the all_pings version of the definition.
Loading performance is currently a pain point for the data explorer site.
The bulk of the wait is on probes.json, which is ~8MB and delivered uncompressed from the public S3 bucket.
I can think of two quick fixes here:
@mreid-moz, does that sound right?
Any recommendations on either?
If CF, do I need to file a bug $somewhere to make it happen?
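For illustration, serving probes.json compressed could be as simple as setting Content-Encoding at upload time. A minimal sketch using boto3; the bucket and key names below are placeholders, not the real ones:

import gzip
import json

import boto3

def upload_probes_gzipped(probes, bucket="example-probe-info-bucket",
                          key="firefox/all/main/all_probes"):
    # Gzip the JSON payload and tell S3 (and browsers) about the encoding,
    # so clients decompress it transparently.
    body = gzip.compress(json.dumps(probes).encode("utf-8"))
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
        ContentEncoding="gzip",
    )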
As the probe counts for specific components continue to grow, support for pagination in the probe fetching API will become more valuable.
We need to validate the probe data for correctness, e.g. by manually cross-checking the history of a few probes of each type.
According to the json.dump docs, the default separators value for Python 2 is (', ', ': '), even with a non-default indent value.
This means that before the \n at the end of lines with terminal commas there is a space. EOL whitespace! Sound the alarm!
This was fixed in Python 3.4, but since the current production job uses python instead of python3 (and the docs recommend it anyway), we need to specify separators=(',', ': ') in json.dump.
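For illustration, the workaround looks like this:

import json

data = {"histograms": {"GC_MS": {"kind": "exponential"}}}

with open("output.json", "w") as f:
    # Without explicit separators, Python 2's json.dump emits ", " even at
    # line ends when indent is set, leaving trailing whitespace.
    json.dump(data, f, indent=2, separators=(",", ": "), sort_keys=True)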
We currently get revisions from both Buildhub and json-tags, but the former is a more canonical source and stores every build we have had (including nightly). We should remove the json-tags scraping and just use Buildhub as the source of revision data.
FYI: The following changes were made to this repository's wiki:
defacing spam has been removed
Restricting write access to contributors is strongly encouraged. Please make that change (documentation).
These were made as the result of a recent automated defacement of publicly writable wikis.
The reference browser integrated lib-crash, but the probe-scraper doesn't know about it, so the BQ schema doesn't have those metrics. We need to make a scrapable dependencies file (à la Fenix) and add it here.
If that's too much work, we could consider manually stating the dependencies (for a short-term solution - or it could be long-term if we realize we're not going to be adding many dependencies in the future).
Information about which process a particular histogram is recorded in is currently not extracted by the probe scraper.
The MEDIA_DECODING_PROCESS_CRASH histogram was removed from the histogram.json file by this bug (Firefox 57). Up to revision 0a0a804a5b6f252f12d9808b54ed2a7f6ada27e3, it was correctly reported among the expired histograms for Firefox 57 in the release channel. However, since 3702966a64c80e17d01f613b0a464f92695524fc was scraped, it is no longer showing up in that list.
We should verify what is going on and fix this.
When the probe_scraper sees all_pings, it should instead transform that to a list of all the pings. This way downstream dependencies don't need to know about the Glean keyword.
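A minimal sketch of that transformation; the list of ping names is hypothetical and in practice would come from the scraped ping definitions for the repository:

def expand_send_in_pings(send_in_pings, known_pings):
    # Replace the Glean "all_pings" keyword with the concrete ping list.
    # `known_pings` is assumed to be the set of ping names known for the
    # repository; the names in the usage example are only illustrative.
    if "all_pings" in send_in_pings:
        return sorted(known_pings)
    return send_in_pings

# Hypothetical usage:
print(expand_send_in_pings(["all_pings"], {"baseline", "events", "metrics"}))
# ['baseline', 'events', 'metrics']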
To make it easier to judge whether things worked correctly, it would be great to print some basic stats before saving the outputs to disk in runner.py (see the sketch below for the kind of summary).
related: mozilla/telemetry-dashboard#531
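A minimal sketch of the kind of summary that could be printed; the data structure and variable names are hypothetical, not the scraper's actual internals:

from collections import Counter

def print_summary(probes_by_channel):
    # `probes_by_channel` is a hypothetical intermediate mapping a channel
    # name to its dict of probe definitions.
    for channel, probes in sorted(probes_by_channel.items()):
        type_counts = Counter(p["type"] for p in probes.values())
        counts = ", ".join("%s: %d" % (t, n) for t, n in sorted(type_counts.items()))
        print("%s: %d probes (%s)" % (channel, len(probes), counts))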
While we save the update date to general.json as "lastUpdate", it would help to know at what time of day the update happened.
We could just change that to be a full ISO date string, including time and timezone info.
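For example, a straightforward way to produce such a string with the standard library:

from datetime import datetime, timezone

# Full ISO 8601 string with time and timezone info,
# e.g. "2019-01-09T06:30:12+00:00"
last_update = datetime.now(timezone.utc).isoformat()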
To support mozilla/probe-dictionary#60, we need to add this info to the probe-info-service.
We should have a README file that briefly explains the file format.
The event parser doesn't extract descriptions yet, we need to fix that.
In the output data we use the old "optout" terminology for probes.
We should change this to something matching the new semantics, e.g. "collected_on_release".
Currently the main consumer of this is the probe-dictionary, although we should check with the data tools team.
We can migrate easily by first adding the new property, then fix the consumers, then removing the old property.
Old histogram.json files contain non-numeric values and expressions for fields such as "n_values". We need to properly handle these and convert the literals to useful values.
MDV2 needs this to display those histograms.
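A minimal sketch of converting such literals, assuming the expressions are simple arithmetic like "2**10"; real files may also reference named constants, which would need a lookup table on top of this:

import ast
import operator

# Only allow a small set of arithmetic operators when evaluating
# expressions such as "2**10" or "4 * 250".
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Pow: operator.pow}

def parse_numeric_field(value):
    # Turn an n_values-style field into a number, handling plain ints and
    # simple arithmetic expressions. Illustrative sketch only.
    if isinstance(value, int):
        return value
    node = ast.parse(str(value), mode="eval").body
    return _eval(node)

def _eval(node):
    if isinstance(node, ast.Constant):   # plain number literal (Python 3.8+)
        return node.value
    if isinstance(node, ast.Num):        # older Python versions
        return node.n
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("Unsupported n_values expression")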
In bug 1363725 (https://bugzilla.mozilla.org/show_bug.cgi?id=1363725),
the values all_child and all_childs were changed to all_children.
But it caused a conflict with 'shared_telemetry_util.py' bundled with probe-scraper, so we have just updated the '.py' file for now. After this file is updated on both ends, we will change the values in the .yaml and .json files.
Requesting an update to 'shared_telemetry_util.py' in probe-scraper.
(I found it's the only file that mentions 'all_child'. Please tell me if something else also needs to be changed.)
In bug 1282098 we exported the python_mozparsers package that contains the parsers we use for parsing probes.
The probe scraper should depend on that package rather than forking the ones in m-c.
This issue is about upstreaming the changes that were implemented in the probe-scraper fork of the parsers and then depending on the exported package.
It's easy to make small, unnoticeable mistakes. Validating against a schema would not allow this.
These 3 issues (GCP, CircleCI, and Dependency/Probe checking) are all related in that they require Docker integration; dependency checking specifically needs GKE integration.
The work can be done in this order; so e.g. we can be building/testing the container on CI, but still running on EMR while we change the deploy to GCP.
For local testing and CI, we will move to a Docker workflow. This will include building a container with all of the dependencies, running tests and lint on that container, and updating CI to build, test, and deploy that container. This should follow the Dockerflow example.
This will require adding:
GCP will also run on that container. We will use the GKE Pod Operator, and use the image that CI deploys. To run this on GCP, we need to add an entrypoint script that runs the probe-scraper locally.
We will need an associated change to telemetry-airflow to update how we're running the job. This file is the one that will be running on the container (with some changes for GCP world, e.g. GCS).
(Note that we may still need to write to s3 for the probe-info-service. I'll cc @jasonthomas here on whether there is a plan to move the probe-info-service to GCS. Once it's there we can write to GCS instead.)
In order to check for metrics or dependencies that are present in repositories, we need a development environment to build/run/test the applications. We can do this by running on Docker Hub images. It may be the case that there is not a stable image for our needs; in that case we may need to build and deploy them ourselves, using the existing infrastructure from "Local Testing and CI". In that case we'll need an additional Dockerfile.
When we have those images available, we can run them in the probe-scraper using the GKE Python Client. We can run in those environments and get a result (whether it is dependencies, probes, pings, etc.).
This code base is largely about:
Given that we recently moved to Python 3, this code base could use a serious uplift by utilizing some of the nice features available. For example, the probe type could be a Dataclass that knows how to compare itself to others, and knows how to serialize itself into the final JSON output.
A revision class could compare itself to other revisions, based on push-date or version. This can be used to decide first and last revisions/versions/dates for probes.
Together, these would simplify transform_probes.py.
All logic should be moved out of runner.py and integrated into appropriate places; runner.py should exist just to provide the scraper CLI.
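A rough sketch of what such classes might look like; the field names and shapes here are assumptions for illustration, not the actual schema:

from dataclasses import dataclass, field
from functools import total_ordering

@total_ordering
@dataclass(frozen=True)
class Revision:
    # Orders itself by push date, then version (illustrative fields).
    revision: str
    version: int
    push_date: str  # ISO date string

    def __lt__(self, other):
        return (self.push_date, self.version) < (other.push_date, other.version)

@dataclass
class Probe:
    # Knows how to serialize itself into the final JSON output.
    name: str
    type: str
    history: dict = field(default_factory=dict)

    def to_json(self):
        return {"name": self.name, "type": self.type, "history": self.history}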
In order to make the scraper future proof and to support other internal products we need to break down the output by channel (Nightly, Aurora, Beta, Release, ESR).
We will still need to support the monolithic output mode with all the probes from all the channels to make it easy for the data-explorer to consume it.
In order to mimic a REST API, let's drop the .json extension from these two files as well.
We currently have a setup to include the types of files we expect for Desktop (Histograms.json, Scalars.yaml, etc.), but for GitHub repos. The Glean work is replacing that effort, so we need to update the code to parse metrics.yaml files instead.
This work will encompass:
The former will be used for schema creation. The latter will be used for validation and table names in the pipeline.
We can use the Glean Parser library to parse the metrics.yaml files and write out results. The rationale for letting the Glean Parser serialize is that the scraper then doesn't need to have intimate knowledge of which fields are required and which may be updated; instead we can just add and remove fields in the parser.
(This came out of telemetry-dashboard #543.)
The histogram DEVTOOLS_TOOLBOX_TIME_ACTIVE_SECONDS is showing up as recorded from 59 on release.
This seems to actually come from the source data, i.e. probe-scraper.
Another example where recording ranges are off is the scalar telemetry.discarded.child_events.
This is also a problem in the probe-scraper data.
Per bug 1369041, this probe should be on 56+ and not on 55 for release & beta.
Using the latest commit will allow us to get an accurate date range for a probe, since even if it changed 5 commits ago, it may still be available, but the date will be for the last commit the metrics.yaml file changed on.
This isn't urgent, since we can order glean probes by first.
Some probes have bug_numbers fields (in fact, all new ones are required to have them, and owners are encouraged to add them when they are updated). This seems like useful information we can surface in the probe scraper.
Marking this "help wanted" for mentorship. @Dexterp37 and @georgf have offered to mentor this bug.
After this is complete, we can file a bug in https://github.com/mozilla/telemetry-dashboard for augmenting Probe Dictionary to display this information and linkify the bug numbers to make it easy to flow from the dictionary to bugzilla to learn more.
In scraper.py we use the load_tags and extract_tag_data functions to fetch the data from the Firefox repository and extract the releases information. This needs some test coverage.
We could:
Fenix will be releasing multiple versions with different app_ids. Those all need to be included here. It should be straightforward to make that an array rather than a string.
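A minimal sketch of handling both shapes while repositories are migrated; the normalization helper and the example entries are hypothetical:

def app_ids_for(repo_config):
    # Accept either a single app_id string or a list of app_ids.
    app_id = repo_config.get("app_id", [])
    return app_id if isinstance(app_id, list) else [app_id]

# Hypothetical Fenix-style entries:
print(app_ids_for({"app_id": "org.mozilla.fenix"}))
print(app_ids_for({"app_id": ["org.mozilla.fenix", "org.mozilla.fenix.nightly"]}))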
As discussed here, requesting "probes active in the current latest release in each channel" cannot be done easily.
We should make that easier from the API.
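For context, a rough sketch of the kind of filtering a consumer has to do today, which is why doing it server-side in the API would be friendlier. The history/versions layout here is a simplified approximation of the probe data, not the exact schema:

def active_in_latest_release(probes, latest_version, channel="release"):
    # Return probe names whose most recent definition still covers
    # `latest_version` on the given channel (simplified, illustrative).
    active = []
    for name, probe in probes.items():
        history = probe.get("history", {}).get(channel, [])
        if not history:
            continue
        latest_definition = history[0]  # assumed newest-first
        if int(latest_definition["versions"]["last"]) >= latest_version:
            active.append(name)
    return active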
For the Main Ping Processing in GCP, we plan on including every historical probe, so that they can continue to be queried even after they expire.
To do this, we need a total ordering of probes, with new probes appended. For example, if the current set of probes was {SEARCH_COUNTS, GC_MS}, but in tomorrow's nightly we added {PAGELOAD_IS_SSL}, I would want the probe ordering to be:
Today's: [GC_MS, SEARCH_COUNTS]
Tomorrow's: [GC_MS, SEARCH_COUNTS, PAGELOAD_IS_SSL]
first_added field
The easiest way to accomplish this is to include a "first_added" timestamp. If any probes share a "first_added" timestamp, they will be sorted alphanumerically. Some probes (e.g. STYLO_PARALLEL_RESTYLE_FRACTION_WEIGHTED_STYLED) only showed up on release, and others (e.g. GC_MS) only show up on pre-release. As such we can include a "first_added" date for each channel:
"GC_MS": {
"history": {...},
"name": "GC_MS",
"type": "histogram",
"first_added": {
"nightly": 1547013850,
"beta": 1547014850
}
}
We can achieve this by getting the pushdate from json-rev.
The only piece I don't know how to do is: given the revision, get the associated json-rev file.
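A minimal sketch of deriving the total ordering from that field, assuming the per-channel first_added structure shown above:

def ordered_probes(probes, channel="nightly"):
    # Sort by the per-channel first_added timestamp, breaking ties
    # alphanumerically by probe name. New probes, having later
    # timestamps, therefore always append after existing ones.
    with_channel = [
        (p["first_added"][channel], name)
        for name, p in probes.items()
        if channel in p.get("first_added", {})
    ]
    return [name for _, name in sorted(with_channel)]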
The scraper currently imports the latest version of glean-parsers. However, if non-backwards-compatible format changes are introduced in the parsers, the scraper might break.
We should fix this. Among the possible solutions:
Hello! I'm not sure if this is the right place to document this, and it may be a dupe of #30, but:
Please let me know if this isn't helpful. Thanks!
We should add test coverage that tests that the output for old, real Histograms.json, Scalars.yaml, ... is correct.
E.g. taking files from the oldest supported Firefox version, the current newest, and a few at important key points in between (e.g. when properties were added/removed).
Currently we include all probes for the release channel data, even when they are only ever recorded on prerelease.
It probably makes more sense to just not include prerelease probe data for release.
@Dexterp37 @fbertsch Thoughts?
Currently the probe scraper runs on a new machine every time, making the cache largely unused.
We should check if anyone intends to use historical Aurora channel data.
If not, we could just drop support for that channel.
This can be used in a few places:
This should scrape pings.yaml, and publish a history based on the pings present. I would imagine this lives in /glean/$REPO/pings.
In general we've been trying to move our repos to Circle CI rather than Travis. We have a paid account on Circle and have had a lot of success on their platform.
We should schedule this job to run on mozilla/telemetry-airflow.
As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:
If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].
(Message COC001)
We made some changes in this repository to the parsers that live in mozilla-central.
Once this repository is more stable, we should merge the updates to m-c.
We should enable CI on this repo. At least flake8 coverage!
As seen in #49, added tests may pass even though running against real data results in improper data.
It'd be nice to include in the README instructions for how to run the scraper quickly and locally.
As @fbertsch put it: "just head to the probe-scraper repo and run python probe_scraper/runner.py. If it's taking too long (it will), change MIN_FIREFOX_VERSION in scrapers/moz_central_scraper.py to e.g. 59 so it doesn't pull down as many files."