
httparchive.org's Issues

301 redirect all beta subdomain requests

Return a 301 Moved Permanently response to all requests for beta.httparchive.org. The location should be the same URL minus the beta subdomain.

It's the same content behind both URLs but we may remove the beta subdomain at some point and don't want any old links floating around.
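A minimal sketch of how this could look in the Flask layer (whether it belongs there or in a load balancer/dispatch rule is an open question; the hook below is illustrative):

from flask import Flask, redirect, request

app = Flask(__name__)

@app.before_request
def redirect_beta_subdomain():
    # 301 to the same URL minus the beta subdomain, preserving path and query.
    if request.host.startswith("beta."):
        return redirect(request.url.replace("://beta.", "://", 1), code=301)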

Switch to using Python

Back in 2011 I ported httparchive to Python (using SQLAlchemy and Pyramid) and I think the port could be a good start for the planned reworking of the project.

The existing code is at https://bitbucket.org/charlie_x/python-httparchive/src/

Major differences to the original:

  • separation of model and view code
  • normalised and extended the database for site reports (requests untouched); queries ran much faster than with PHP/MySQL
  • all charts use Google Visualisation (wrote gviz_data_table to do this)
  • mobile and desktop reports are folded together
  • switched to using Postgres because it does the heavy lifting better

In addition I have local code which I use for comparative reports for customers.

A currently poorly maintained version of the site can be seen at http://mamasnewbag.org. I need to push the changes to the schema and update the imported data.

ModuleNotFoundError: No module named 'urlparse'

$ python --version
Python 3.6.4

Running python main.py returns the following error:

$ python main.py
Traceback (most recent call last):
  File "main.py", line 19, in <module>
    from urlparse import urlparse
ModuleNotFoundError: No module named 'urlparse'
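The urlparse module was renamed to urllib.parse in Python 3, so the import in main.py needs to change (or be made compatible with both versions):

# Python 3
from urllib.parse import urlparse

# Or, to keep Python 2 compatibility:
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2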

Allow local building without gcloud

Authenticating with Google Cloud is necessary to call the storage APIs that tell us which dates are available for reports. When building the website locally, it should be possible to skip this requirement and get a truly static/deterministic environment, for example behind a non-default build flag.
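One possible shape for this, assuming a hypothetical environment variable and a checked-in static copy of the dates file (all names here are illustrative, not existing config):

import json
import os

def get_report_dates():
    # Hypothetical flag: skip the Google Cloud round trip and read a static,
    # checked-in dates file for a deterministic local build.
    if os.environ.get("HTTPARCHIVE_STATIC_BUILD"):
        with open("config/dates.json") as f:
            return json.load(f)
    return fetch_dates_from_gcs()  # the existing lookup that requires gcloud auth

def fetch_dates_from_gcs():
    raise NotImplementedError("stand-in for the existing GCS-backed lookup")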

[Loading Speed] Metrics rename

I've been asking touchpoints leveraging Lighthouse for metrics data to rename TTCI and TTFI now that we've done so in Lighthouse:

Time To First Interactive is now First CPU Idle
Time to Consistently Interactive is now just Time to Interactive

(SpeedCurve also just made this change)

Expand desktop corpus to all ~4M desktop CrUX origins

  • extract the desktop origins from CrUX and convert to URLs (by appending a /)
  • load a CSV of the new URLs onto the HA test server (easy to do via the HA CDN)
  • change the desktop crawl capacity to handle 4M URLs
  • run make pushall to apply the config changes to desktop and mobile
  • in mysql run truncate table urlsdev; to clear out the old list of 1.3M URLs
  • run bulktest$ php importurls.php <csv-file> other to load the 4M URLs into the urlsdev table

Query to extract the desktop origins from the latest CrUX release:

SELECT
  DISTINCT CONCAT(origin, '/') AS url
FROM
  `chrome-ux-report.all.201811`
WHERE
  form_factor.name = 'desktop'

Table of desktop URLs: https://bigquery.cloud.google.com/table/httparchive:urls.2018_12_15_desktop

Categorize origins

Group origins by category/vertical, for example news/travel/etc. This will enable category deep dives and comparisons.

DMOZ is no longer operational but a recent data dump is available. We should look for alternate sources.

Slightly related: HTTPArchive/legacy.httparchive.org#75. Alexa is deprecating their top 1M ranking, so finding a rank+category solution would be a bonus.

Migrate and update Getting Started guide

The Getting Started guide written by @paulcalvano lives in the legacy repository. It was written before the new BigQuery UI started rolling out to users, so the workflow may not be applicable to those users anymore.

We should move the documentation out of the legacy repo and into this one where people can more easily find it. We can implement it similar to the FAQ, where it is a markdown file readable on GitHub and also rendered as HTML on the website.

We should also update the docs to handle both the new and old BigQuery UI (until the old UI is deprecated).

Web page metadata

Ensure all pages have metadata like <meta> descriptions, images, etc. Also consider open graph or structured data tags.

Ideally, when sharing a link to a report, a relevant image would be used as opposed to the HTTP Archive logo. For example, if deep linking to a particular metric in a report, the image used would be a screenshot of the metric's timeseries/histogram. At the very least, we could dynamically populate the page's metadata with info specific to the metric.
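A rough sketch of the dynamic-metadata idea in the Flask layer (the route, template, and image paths below are all hypothetical):

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/reports/<report_id>")
def report(report_id):
    metric = request.args.get("metric")
    meta = {
        "title": f"HTTP Archive: {report_id}",
        "description": f"Trends for {metric or report_id} across the HTTP Archive dataset.",
        # Fall back to the logo until per-metric screenshots exist.
        "image": f"/static/img/metrics/{metric}.png" if metric else "/static/img/logo.png",
    }
    # report.html would emit <meta name="description">, og:title, og:image, etc.
    return render_template("report.html", meta=meta)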

not an issue - clarification on har capabilities / edge use case

use case
I open a website
I capture all requests / save as har file website.har

now I have all css / js / png etc and want to "host" this content locally via localhost
I'm imagining a command like

python -m http.server -har website.har

does such a router exist?
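Not as far as I know, but the HAR format makes a small replay server easy to sketch. A rough example, assuming the HAR was captured with response bodies included (content.text, optionally base64-encoded):

import base64
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse

HAR_FILE = "website.har"

with open(HAR_FILE, encoding="utf-8") as f:
    har = json.load(f)

# Map request paths to (mimeType, body bytes).
responses = {}
for entry in har["log"]["entries"]:
    path = urlparse(entry["request"]["url"]).path or "/"
    content = entry["response"].get("content", {})
    text = content.get("text")
    if text is None:
        continue
    body = base64.b64decode(text) if content.get("encoding") == "base64" else text.encode("utf-8")
    responses[path] = (content.get("mimeType", "application/octet-stream"), body)

class HarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = urlparse(self.path).path
        if path not in responses:
            self.send_error(404)
            return
        mime, body = responses[path]
        self.send_response(200)
        self.send_header("Content-Type", mime)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), HarHandler).serve_forever()

Run it next to website.har and the recorded responses are served from http://localhost:8000/ keyed by path (this ignores query strings and multiple hosts, which a real tool would need to handle).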

Reimplement summary tables from raw HAR data

The legacy website uses intermediate crawl data from MySQL tables to generate CSVs containing summary data about pages and requests. As part of the beta migration, we would like to deprecate this preprocessing step and depend directly on the raw HAR data.

In BigQuery, this data is represented in the runs dataset, which has recently been split into summary_pages and summary_requests. These datasets will continue to exist, but will be generated in a BigQuery post-processing step instead, using the HAR tables as input.

A secondary goal of this process is to modernize the summary data. For example, the videoRequests field may not count modern video formats like WebM.

  • Write new queries to replicate summary data
  • Hook queries into post-processing pipeline
  • Unplug CSV -> BigQuery pipeline
  • Remove pre-processing logic
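As a rough starting point, the post-processing step could be a thin wrapper around the BigQuery client that runs each new summary query and materializes the result into a dated table (the file layout, table names, and date below are assumptions):

from google.cloud import bigquery

def generate_summary_table(crawl_date, client_type):
    # Run a summary query over the HAR tables and write the result to a dated table.
    client = bigquery.Client(project="httparchive")
    # Hypothetical query file containing the new summary_pages logic.
    with open("sql/summary_pages.sql") as f:
        query = f.read().replace("YYYY_MM_DD", crawl_date)
    job_config = bigquery.QueryJobConfig(
        destination=f"httparchive.summary_pages.{crawl_date}_{client_type}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.query(query, job_config=job_config).result()  # block until the table is written

generate_summary_table("2018_09_01", "desktop")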

Enable viewing reports through "lenses"

Goal: Provide a mechanism for users to apply one or more "lenses" through which to view the HTTP Archive reports. These lenses effectively filter the set of websites to a subset that match some condition. For example, all websites that run on WordPress.

Strategy:

  1. Write a query that generates a list of URLs/pages to act as the whitelist for the lens.
  2. Alter the report generation script to integrate that whitelist of pages with each metric's timeseries and histogram queries (see the sketch at the end of this issue).
    a. Optionally replace YYYY_MM_DD placeholders in the whitelist query with the report date.
    b. Save the lens results to an identifiable subdirectory on the gs:// CDN.
  3. Build a UI to enable the addition/removal of a lens.
    a. Provide a way to persist the selected lens across page views.
    b. Alter the report UI to make clear when a lens has been applied.
    c. Provide URL shortcuts (eg wordpress.httparchive.org or httparchive.org/wordpress).
  4. Alter the data visualization scripts to pull data from the correct lens on the CDN.

Stretch goals:

  5. Support multiple concurrent lenses for side-by-side comparison:
    a. in the lens selector UI
    b. in the data visualization UI
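A sketch of steps 2 and 2a, assuming the metric queries expose an injection point for the lens filter (the {{ lens_filter }} placeholder and function name are hypothetical):

def build_lensed_query(metric_sql, lens_sql, crawl_date):
    # Substitute the crawl date into both queries (step 2a), then splice the
    # lens whitelist in as a URL filter before the metric's aggregation runs.
    lens_sql = lens_sql.replace("YYYY_MM_DD", crawl_date)
    metric_sql = metric_sql.replace("YYYY_MM_DD", crawl_date)
    return metric_sql.replace("{{ lens_filter }}", f"url IN ({lens_sql})")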

🎓 Graduate to the default version of HTTP Archive

Roadmap

The goal is for beta.httparchive.org to become the new httparchive.org, targeting some time in Q1 2018. The following tasks are launch blocking:

  • automate the report generation after new tables are added to BigQuery
  • resolve all placeholder graphics (eg report images)
  • write descriptions for all reports/metrics
  • rewrite FAQ section
  • fix UX issues with data visualization
  • implement a redirect solution for legacy URLs
  • redirect to HTTPS
  • reorganize GitHub repositories

Report automation

The reports are manually generated with scripts. These scripts should be automated to run whenever a crawl is complete. Some subtasks for this issue:

  • Queries may depend on convenience datasets (eg httparchive.lighthouse) which are manually copied from catch-all datasets (eg httparchive.har). New tables in these datasets must all be created automatically.
  • Use pubsub or similar tools to trigger report jobs after the dataflow job has completed successfully.

Graphics

Concept graphics for the JS report:

[concept graphic for the JavaScript report]

We would need similar graphics for each report. It would also be nice to have a default graphic for new reports until a permanent one could be made.

Written descriptions

For example, "Total KB" should have a description like The sum of transfer size kilobytes of all resources requested by the page..

Reports should describe their contents and maybe even a brief analysis of the overall trends.

Rewrite FAQs

The legacy FAQs are somewhat outdated. The new FAQ page should contain updated information including any new content related to the new reports/metrics/visualizations.

Data viz UX

Some feedback on the charts includes:

  • what the heck is a CDF/PDF?
  • make the tooltips more descriptive
  • unclear what the outlier bin is in the histograms
  • not obvious how to switch between timeseries/histogram modes or that separate modes exist
  • collapse desktop and mobile tables into one with both histograms side by side

Legacy redirects

Not all legacy features will be supported by the beta site at launch. Since the beta site will assume the root domain, it will start receiving requests for legacy URLs. At launch, the legacy site will still be accessible at http://legacy.httparchive.org. Known legacy URLs should be redirected to this subdomain unless the feature is also available on the beta site, in which case the URL should be mapped. Whether to use a permanent or temporary (301 vs 302) redirect depends on whether the feature is expected to be supported by the beta site.

For example, one simple case is the About page, which is http://httparchive.org/about.php. This has a corresponding page on the beta site at https://beta.httparchive.org/about. A more complicated example is http://httparchive.org/viewsite.php?pageid=84263714 which may be supported in the future.

HTTPS Redirects

Any HTTP request should automatically redirect to HTTPS. This is a simple feature but had a subtle infinite redirect bug when I last attempted a fix in the Flask layer.

Related: ensure the Let's Encrypt certificate automatically renews. Same for the https://cdn.httparchive.org certificate.
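A minimal sketch of the redirect with a proxy-header check, assuming the earlier infinite loop came from the app seeing plain HTTP behind the App Engine proxy (which sets X-Forwarded-Proto):

from flask import Flask, redirect, request

app = Flask(__name__)

@app.before_request
def enforce_https():
    # Behind a proxy, request.scheme can report http even for HTTPS requests,
    # so trust the forwarded protocol header to avoid a redirect loop.
    proto = request.headers.get("X-Forwarded-Proto", request.scheme)
    if proto != "https":
        return redirect(request.url.replace("http://", "https://", 1), code=301)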

Reorg GitHub

This code base should become the canonical "httparchive" project on GitHub. The legacy code base should be renamed to "legacy.httparchive.org" or similar. Consider the careful dance of moving code around in such a way that the primary project maintains the same stars/watchers. However, this may screw up the commit history.

Detect JS modularization

Determine how JS is modularized, eg webpack.

To do:

  • Come up with a set of signals given off by the modularization tools
  • Integrate the signal detection
  • Update the pipeline to expose the results (if necessary)
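A rough illustration of the first two bullets, scanning JS response bodies for tool-specific strings (the signal list is a guess and would need vetting against real bundles; where the bodies come from depends on the pipeline):

import re

# Hypothetical fingerprints; each tool's real signals would need verification.
SIGNALS = {
    "webpack": re.compile(r"webpackJsonp|__webpack_require__"),
    "requirejs": re.compile(r"\brequirejs\b|define\.amd"),
}

def detect_modularization(js_bodies):
    # Return the set of bundler/loader names whose signals appear in any JS body.
    detected = set()
    for body in js_bodies:
        for name, pattern in SIGNALS.items():
            if pattern.search(body):
                detected.add(name)
    return detected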

cc @addyosmani

Subdomains: Make beta the new www

Progress on legacy redirects in #13.

  • Add an A record to the DNS for the legacy subdomain to resolve to the legacy web server
  • Add support for the legacy subdomain in the legacy web server's nginx and apache configs
  • Add SSL cert for the legacy subdomain
  • Modify the beta server's 404 page to suggest the same URL on the legacy subdomain
  • Replace the @ and www DNS A records with CNAME records resolving to the new (beta) App Engine server at ghs.googlehosted.com
    • do this 4+ hours prior to launch to ensure that it propagates to DNS caches
  • Update all internal links to avoid 404s
    • eg stats/trends links on the forum

DNS settings:

  • A 216.239.32.21
  • A 216.239.34.21
  • A 216.239.36.21
  • A 216.239.38.21
  • AAAA 2001:4860:4802:32::15
  • AAAA 2001:4860:4802:34::15
  • AAAA 2001:4860:4802:36::15
  • AAAA 2001:4860:4802:38::15
  • CNAME ghs.googlehosted.com (www)

Track JS library usage

Build a new custom timeseries visualization in the State of JS report that tracks usage of popular JS libraries.

I imagine this being a line chart with one line per library. We'd need to override the default Highcharts color scheme to give each line a distinct color. Right now it's set up for desktop and mobile series.

We also need to specially handle jQuery somehow. Its usage is so much higher than everything else's that it would make the other libraries look tiny by comparison. We could enable y-axis zooming, use a logarithmic y-axis, or rely on manually toggling the series on/off and letting the chart autoscale.

Reconfigure mobile testing to use 4G speeds

Data from the Chrome UX Report indicates that the vast majority of connection speeds on mobile devices (phone/tablet) are effectively 4G:

#standardSQL
SELECT
  SUM(IF(effective_connection_type.name = '4G', bin.density, 0)) / SUM(bin.density) AS pct_mobile_4g
FROM 
  `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  form_factor.name IN ('phone', 'tablet')

https://bigquery.cloud.google.com/results/chrome-ux-report:bquijob_35020ffa_160510ab375

Result: 87.95%

Update the WebPageTest configuration for mobile tests to use 4G speeds to more accurately represent real user conditions.

Add doubleclick to CSP whitelist

Console error:

Refused to load the image 'https://stats.g.doubleclick.net/r/collect?[...]'
because it violates the following Content Security Policy directive: 
"img-src 'self' discuss.httparchive.org www.google-analytics.com".

This is a benign analytics endpoint related to Google Analytics.

To fix, add stats.g.doubleclick.net to the img-src policy in the CSP whitelist.
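Assuming nothing else in the directive changes, the updated img-src would read:

img-src 'self' discuss.httparchive.org www.google-analytics.com stats.g.doubleclick.net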

Integrate WebPageTest results

Make it possible for users to enter a WPT ID and see relevant test result data inline with the aggregate HTTP Archive data. For example, show the user's 6747.7 KB page weight in the context of the median desktop page weight of 1698.4 KB.

Subtasks:

  • plumbing
    • get a WPT ID from the user
    • fetch JSON results from WPT
    • parse and display metrics back to the user on HTTP Archive
  • support for simple metric formats like total bytes, page load time, etc
  • an unobtrusive UI for users to add/remove their WPT ID
    • this is not a P0 use case for HA, so we need to be careful not to make it too prominent/distracting
  • support for computed metrics like image bytes, which require a post-processing function over the WPT results
  • integrate WPT data with visualizations, eg a line indicating where the WPT data fits in
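For the plumbing subtasks, a hedged sketch of fetching and parsing a WPT result (jsonResult.php is the standard WPT JSON results endpoint; the exact field names should be verified against a real result):

import json
from urllib.request import urlopen

def fetch_wpt_metrics(wpt_id):
    # Fetch the JSON results for a test and pull out a couple of simple metrics.
    url = f"https://www.webpagetest.org/jsonResult.php?test={wpt_id}"
    with urlopen(url) as resp:
        result = json.load(resp)
    first_view = result["data"]["median"]["firstView"]
    return {
        "total_kb": first_view["bytesIn"] / 1024,
        "load_time_ms": first_view["loadTime"],
    }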

_adult_site is false for everything since 2016 something

SELECT
  url
FROM
  [httparchive:summary_pages.2018_06_15_desktop]
WHERE
  _adult_site IS TRUE

0 results.

SELECT
  url
FROM
  [httparchive:summary_pages.2017_06_15_desktop]
WHERE
  _adult_site IS TRUE

0 results.

SELECT
  url
FROM
  [httparchive:summary_pages.2016_06_15_desktop]
WHERE
  _adult_site IS TRUE

7071 results.

Investigate drop in WordPress representation

Builtwith puts WordPress coverage at ~30% of the web.

After increasing the coverage to 1.3M URLs from CrUX, we ~doubled the number of WordPress sites detected. However, the coverage increase was a factor of ~3x, so the relative percent of WordPress sites has gone down:

date          client   wordpress  total      %
Aug 1, 2018   desktop  201,474    1,275,374  15.80%
Aug 1, 2018   mobile   198,572    1,268,277  15.66%
Jul 15, 2018  desktop  201,277    1,277,631  15.75%
Jul 15, 2018  mobile   198,290    1,272,071  15.59%
Jul 1, 2018   desktop  200,927    1,277,805  15.72%
Jul 1, 2018   mobile   101,401    451,109    22.48%
Jun 15, 2018  desktop  103,768    461,068    22.51%
Jun 15, 2018  mobile   100,978    451,307    22.37%

Prior to the coverage increase, WordPress represented ~22% of the URLs. This is still well below the 30% figure which is widely regarded as the source of truth. After the expansion, WordPress representation dropped to ~16%.

In January 2018 @pmeenan ran an experiment where he tested all 2.9M CrUX URLs and detected WordPress on 785k (27%) of them. Assuming this ratio holds true in the most recent CrUX release that formed the basis of the HA corpus, we're not capturing (or detecting) a significant proportion of these sites.

The methodology behind the corpus change was to intersect the Alexa Top 1M domains (as of March 15, 2017) with origins in the 201805 CrUX release. We should investigate whether this approach introduced bias that excluded a significant proportion of WordPress sites.

Enable unused JS Lighthouse audit

The "Unused JavaScript" Lighthouse audit is not enabled by default.

Adjust the Lighthouse config in our WebPageTest agents to enable this audit.

See the note in the LH docs for more info:

Note: This audit is off by default! It can be enabled in the Node CLI by using the lighthouse:full configuration profile.

@pmeenan can you look into this?

cc @patrickhulce

Enable a grid view of metrics

By default, report pages list all metrics vertically. Each metric has:

  • title
  • description
  • optional "See also" for other reports
  • visualization
  • data table toggle button
  • data table

All of this UI baggage may make it difficult to skim through the report to get an overall sense for how things are going.

The purpose of this feature request is to design a grid view of metrics so that only the necessary information to understand the state of the metrics is shown compactly.

Document contributor workflow

We need to improve our documentation around contributing to the project, both in terms of committing code and participating in the analysis/discussion.

The goal is to create a top-level "Contribute" navigation item on the website that links to a section in the About page with the following info:

  • We use Slack (bit.ly/http-archive-slack) for on-topic banter and administrative conversations. The #general channel is open to the public.
  • We use the forum (discuss.httparchive.org) for sharing queries and analysis related to the BigQuery datasets.
  • We use GitHub (HTTPArchive/httparchive.org) for bug reports and feature requests related to the system infrastructure. We tag issues as Good first issue if it would be a good place for new contributors to start.
  • We use Hangouts (TBD) for monthly meetings to discuss administrative tasks. This meeting is open to the public and meeting notes will be made available.
  • We use Twitter (@HTTPArchive) for promoting topics of general interest to the project.

If a person wants to get more involved with the project, they should understand when to use which channel and how to join.

Visualize WPT results in charts

Passing in a wptid URL param with a WebPageTest ID will enable those individual test results to be displayed inline with the macro data from HTTP Archive.

[screenshot: WPT data point displayed above a report chart]

In addition to showing the data point above the chart, we should embed it in the context of the chart in the form of a visualized line or bar. For timeseries, it should be a horizontal line at the y-intercept corresponding to the WPT value. Similarly for histograms, it should be a vertical line.

Set up automatic backups of the legacy database

Per @pmeenan:

We could (should) set up a cron to back up the database and code to the google storage bucket.

The servers have been failing at a faster rate lately, so starting these backups sooner would ensure that we're protected against a failure of the master DB server.

Pat, would you have time for this?

how to set up a local env without Google Cloud

Hi, I'm working on some enterprise systems that can only be accessed internally. How can I set up a private httparchive instance on Linux or Windows? Ideally it would work with a private WPT instance. Thanks a lot.

Reports not showing up on mobile

I tried using Brave, Firefox, and Safari, and none of them show the reports.

See screenshot:

Looking at the console logs in Safari, the browser can't find the fetch function.

I could add more details if it is necessary.

Authenticate gcloud with service account

Need to resolve this warning:

~/python-venv/lib/python2.7/site-packages/google/auth/_default.py:66: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/.
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
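One way to resolve it, assuming a service account key is provisioned for the app, is to build credentials explicitly instead of relying on the gcloud end-user login (or simply point GOOGLE_APPLICATION_CREDENTIALS at the key file):

from google.cloud import storage
from google.oauth2 import service_account

# Hypothetical path to a provisioned service account key file.
credentials = service_account.Credentials.from_service_account_file(
    "/etc/httparchive/service-account.json")
client = storage.Client(project="httparchive", credentials=credentials)

Alternatively, setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the key file path lets the client libraries pick up the service account automatically.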

Surface CDT's "third party badges" as a dimension

Could we add new columns (name + type) in our requests table to record the 3P badges? Also, could we propagate those tags down to any child requests? E.g. if a.js is tagged as XYZ corp, and fetches b.jpg, the latter should have the same tag.

AFAIK, CDT does this analysis at runtime, so it's not available in the trace? Should it be, or do we need to integrate this into our own analysis pipeline?

/cc @paulirish @rviscomi

Migrate changelog

Similar to #52, move the changelog from legacy to the new repo.

We also need to add recent events to it including the LH outage in June and the corpus increase to 1.3M CrUX URLs.

Automated report generation

Goal

When a crawl is complete, reports on beta.httparchive.org are immediately updated with the latest data.

Overview

Histogram and timeseries queries are defined for each metric. After the crawl completes, these queries need to be rerun against the latest data and their JSON results saved to Cloud Storage (GCS). Timeseries metrics are saved in undated JSON files (eg https://cdn.httparchive.org/reports/bytesJs.json). Histograms are saved in directories corresponding to the crawl date in YYYY_MM_DD format (eg https://cdn.httparchive.org/reports/2018_02_01/bytesJs.json).

The HTTP Archive web server needs to know which dates are available for each report in GCS. It maintains dates.json to track all crawls for which data is available in GCS. Each date option is made available in the UI in a dropdown field to explore historical data. This JSON file is loaded into memory when the server starts up.

[screenshot: date dropdown in the report UI]

Automation

In order to automate the report generation, we need a process that runs the queries, uploads results to GCS, and identifies which dates are available to explore in the UI.

To generate the reports automatically, we will use PubSub to signal when the data is available on BigQuery after the Dataflow pipeline has completed.

I have a few ideas for how we can use subscribers to do the actual BQ querying and backup to GCS:

  1. Use a Compute Engine (GCE) server that either polls PubSub for published events or exposes an HTTP endpoint to receive a push notification. PubSub may even be overkill here since both the Dataflow job to create the BQ tables and report generation could be on the same instance.

  2. Use a Cloud Function (GCF). These are lightweight and don't require provisioning servers. AFAICT these are only available in a Node.js environment and don't permit arbitrary first-party dependencies like a set of queries to run, so we'd need to side-load them through GCS or an even more convoluted node module from npm.

  3. Use the existing App Engine (GAE) server to receive a push notification when the crawl is done and kick off the report generation. This is nice because the entire repository is deployed to GAE, so it can directly access the queries. I don't think we can just directly run the generateReports.sh shell scripts I've already written, but porting the scripts to Python should make them runnable on GAE.

The solution we go with needs to address the following issues:

  • How is the report generation source code deployed?
  • How will it be notified of pubsub events?
  • How will it load any dependencies, like the SQL for each metric?
  • How easily does it enable manual/local testing?
  • How can we monitor the health of the pipeline?
  • How will it put the results in GCS?
  • How will the web app know the dates for which reports are available on GCS?

Using GAE for automated report generation

I'm in favor of the third option to use GAE. I'll try to address each of the above concerns:

  • Deploying the source code is trivial because the web app and generation code are developed in the same code repository and pushed to the same server. No extra deployment steps are needed.

  • GAE is obviously already listening for HTTP(S) requests, so it will use the push model to receive notifications. During prototyping, this had some issues, like securing the endpoint. We want to ensure that only authorized users can trigger report generation.

  • Loading SQL dependencies is trivial because they are included in the deployment. It will be as simple as opening a local file.

  • Testing the pubsub endpoint locally should be possible using the pubsub emulator, but this looks like a lot of overhead. We may want to create a "staging" GAE environment where we can deploy experimental code that is publicly accessible to receive pubsub notifications.

  • We can use the existing GAE Stackdriver logging support to monitor the health of the pubsub endpoint.

  • Pushing the results to GCS would be done using the cloudstorage library for GAE.

  • The web app code already keeps a local copy of dates.json, so it could simply update this with each new crawl. There's also a case to be made that the dates could be inferred from the available directories on GCS, which is a more reliable source of truth than a JSON artifact. Although for these purposes it doesn't really matter either way.
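For concreteness, a bare-bones sketch of what the GAE push endpoint might look like (the route, token check, message payload, and generate_reports helper are all hypothetical):

import base64
import json
import os

from flask import Flask, abort, request

app = Flask(__name__)

def generate_reports(date, client):
    # Placeholder for the generateReports.sh logic ported to Python.
    pass

@app.route("/pubsub/crawl-complete", methods=["POST"])
def crawl_complete():
    # Reject pushes that don't carry the shared verification token.
    if request.args.get("token") != os.environ.get("PUBSUB_VERIFICATION_TOKEN"):
        abort(403)
    envelope = request.get_json()
    payload = json.loads(base64.b64decode(envelope["message"]["data"]))
    # e.g. {"date": "2018_02_01", "client": "desktop"} published by the pipeline.
    generate_reports(payload["date"], payload["client"])
    return ("", 204)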

--

So this is the proposed design to automate report generation using GAE. Please let me know if you have questions about this approach or would like to explore any of the other options.
