httparchive / httparchive.org
The HTTP Archive website hosted on App Engine
Home Page: https://httparchive.org
License: Apache License 2.0
I noticed that the color contrast on nav li:hover is way too low (https://contrast-ratio.com/#white-on-%23bcced1). There might be other places where this is an issue.
I could help with that, but I did not create a pull request since design/branding changes are always controversial.
Currently, if a starting value in a dataset is 0, the relative percentage change of a trend will display as "Infinity%" (because dividing by 0 in JS returns Infinity).
The culprit is the getChange function in timeseries.js not checking if the initial value is 0.
Proposed fix: Either hide the relative change labels in this case, or just insert something like a dash or "n/a".
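A minimal sketch of the guard, written in Python here for consistency with the other code sketches in this document (the real fix would live in the getChange function in timeseries.js):

def get_change(initial, final):
    # A relative change from 0 is undefined, so signal "no value" instead of Infinity%.
    if initial == 0:
        return None  # caller can render a dash or "n/a"
    return (final - initial) / initial * 100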
sudo certbot certonly --manual --preferred-challenges dns
Return a 301 Moved Permanently response to all requests for beta.httparchive.org. The location should be the same URL minus the beta subdomain.
It's the same content behind both URLs but we may remove the beta subdomain at some point and don't want any old links floating around.
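A minimal sketch of how the Flask layer could do this, assuming the beta hostname is visible to the app in the Host header:

from flask import Flask, redirect, request

app = Flask(__name__)

@app.before_request
def drop_beta_subdomain():
    # Permanently redirect beta.httparchive.org/... to the same path on httparchive.org.
    if request.host.startswith('beta.'):
        return redirect(request.url.replace('://beta.', '://', 1), code=301)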
Back in 2011 I ported httparchive to Python (using SQLAlchemy and Pyramid) and I think the port could be a good start for the planned reworking of the project.
The existing code is at https://bitbucket.org/charlie_x/python-httparchive/src/
Major differences from the original:
In addition I have local code which I use for comparative reports for customers.
A currently poorly maintained version of the site can be seen at http://mamasnewbag.org. I need to push the changes to the schema and update the imported data.
$ python --version
Python 3.6.4
Running python main.py returns the following error:
$ python main.py
Traceback (most recent call last):
File "main.py", line 19, in <module>
from urlparse import urlparse
ModuleNotFoundError: No module named 'urlparse'
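This happens because the Python 2 urlparse module was moved to urllib.parse in Python 3. A version-agnostic form of that import would be:

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2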
Authenticating with Google Cloud is necessary to call the storage APIs that tell us which dates are available for reports. When building the website locally, it should be possible to skip this requirement and get a truly static/deterministic environment. This could be gated behind a non-default build flag or similar.
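A minimal sketch, assuming a hypothetical STATIC_BUILD environment flag and a locally checked-in snapshot of dates.json:

import json
import os

def get_report_dates():
    if os.environ.get('STATIC_BUILD'):
        # Skip the Cloud Storage call and use a local snapshot instead (path is illustrative).
        with open('config/dates.json') as f:
            return json.load(f)
    return fetch_dates_from_gcs()  # the existing GCS-backed lookup (name assumed)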
Show stats from Lighthouse a11y audits. This may initially be limited to the passing rate for each audit, unless any of them give a numeric score. We could also show the distribution of numeric scores for the entire a11y audit category.
For example: https://developers.google.com/web/tools/lighthouse/audits/button-name
I've been asking touchpoints leveraging Lighthouse for metrics data to rename TTCI and TTFI now that we've done so in Lighthouse:
Time To First Interactive is now First CPU Idle
Time to Consistently Interactive is now just Time to Interactive
(SpeedCurve also just made this change)
- make pushall to apply the config changes to desktop and mobile
- truncate table urlsdev; to clear out the old list of 1.3M URLs
- bulktest$ php importurls.php <csv-file> other to load the 4M URLs into the urlsdev table
Query to extract the desktop origins from the latest CrUX release:
SELECT
DISTINCT CONCAT(origin, '/') AS url
FROM
`chrome-ux-report.all.201811`
WHERE
form_factor.name = 'desktop'
Table of desktop URLs: https://bigquery.cloud.google.com/table/httparchive:urls.2018_12_15_desktop
Either as % of requests or % of distinct hosts. See https://discuss.httparchive.org/t/http-2-adoption/792/16
Group origins by category/vertical, for example news/travel/etc. This will enable category deep dives and comparisons.
DMOZ is no longer operational but a recent data dump is available. We should look for alternate sources.
Slightly related: HTTPArchive/legacy.httparchive.org#75. Alexa is deprecating their top 1M ranking, so finding a rank+category solution would be a bonus.
The Getting Started guide written by @paulcalvano lives in the legacy repository. It was written before the new BigQuery UI started rolling out to users, so the workflow may not be applicable to those users anymore.
We should move the documentation out of the legacy repo and into this one where people can more easily find it. We can implement it similar to the FAQ, where it is a markdown file readable on GitHub and also rendered as HTML on the website.
We should also update the docs to handle both the new and old BigQuery UI (until the old UI is deprecated).
Ensure all pages have metadata like <meta> descriptions, images, etc. Also consider Open Graph or structured data tags.
Ideally, when sharing a link to a report, a relevant image would be used as opposed to the HTTP Archive logo. For example, if deep linking to a particular metric in a report, the image used would be a screenshot of the metric's timeseries/histogram. At the very least, we could dynamically populate the page's metadata with info specific to the metric.
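A rough sketch of how a Flask route could populate metric-specific metadata; the helper function and screenshot URL scheme are hypothetical:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/reports/<report_id>')
def report(report_id):
    metric = request.args.get('metric')  # present when deep linking to a metric
    meta = {
        'description': get_metric_description(report_id, metric),  # hypothetical helper
        # Hypothetical URL scheme for pre-rendered chart screenshots.
        'image': 'https://cdn.httparchive.org/screenshots/%s/%s.png' % (report_id, metric),
    }
    return render_template('report.html', meta=meta)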
use case
I open a website
I capture all requests / save as har file website.har
now I have all css / js / png etc and want to "host" this content locally via localhost
I'm imagining a command like
python -m http.server -har website.har
does such a router exist?
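The -har flag above doesn't exist in http.server, but a tiny replay server is easy to sketch. The following standalone script serves recorded response bodies by path; it ignores query strings, non-GET methods, and base64-encoded bodies for brevity:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse

with open('website.har', encoding='utf-8') as f:
    har = json.load(f)

# Index the recorded responses by URL path.
responses = {}
for entry in har['log']['entries']:
    path = urlparse(entry['request']['url']).path
    content = entry['response'].get('content', {})
    body = (content.get('text') or '').encode('utf-8')
    responses[path] = (content.get('mimeType', 'application/octet-stream'), body)

class HarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        match = responses.get(urlparse(self.path).path)
        if match is None:
            self.send_error(404)
            return
        mime, body = match
        self.send_response(200)
        self.send_header('Content-Type', mime)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == '__main__':
    HTTPServer(('localhost', 8000), HarHandler).serve_forever()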
The legacy website uses intermediate crawl data from MySQL tables to generate CSVs containing summary data about pages and requests. As part of the beta migration, we would like to deprecate this preprocessing step and depend directly on the raw HAR data.
In BigQuery, this data is represented in the runs dataset, which has recently been split into summary_pages and summary_requests. These datasets will continue to exist, but will be generated in a BigQuery post-processing step instead, using the HAR tables as input.
A secondary goal of this process is to modernize the summary data. For example, the videoRequests field may not count modern video formats like WebM.
Goal: Provide a mechanism for users to apply one or more "lenses" through which to view the HTTP Archive reports. These lenses effectively filter the set of websites to a subset that match some condition. For example, all websites that run on WordPress.
Strategy:
Stretch goals:
The goal is for beta.httparchive.org to become the new httparchive.org, targeting some time in Q1 2018. The following tasks are launch blocking:
The reports are manually generated with scripts. These scripts should be automated to run whenever a crawl is complete. Some subtasks for this issue:
Some report data lives in report-specific datasets (eg httparchive.lighthouse) which are manually copied from catch-all datasets (eg httparchive.har). New tables in these datasets must all be created automatically.
Concept graphics for the JS report:
We would need similar graphics for each report. It would also be nice to have a default graphic for new reports until a permanent one could be made.
For example, "Total KB" should have a description like The sum of transfer size kilobytes of all resources requested by the page.
.
Reports should describe their contents and maybe even a brief analysis of the overall trends.
The legacy FAQs are somewhat outdated. The new FAQ page should contain updated information including any new content related to the new reports/metrics/visualizations.
Some feedback on the charts includes:
Not all legacy features will be supported by the beta site at launch. Since the beta site will assume the root domain, it will start receiving requests from legacy URLs. At launch, the legacy site will still be accessible at http://legacy.httparchive.org. Known legacy URLs should be redirected to this subdomain unless the feature is also available on the beta site, in which case the URL should be mapped. Whether to use a temporary or permanent (301 vs 302) redirect depends on whether the feature is expected to be supported by the beta site.
For example, one simple case is the About page, which is http://httparchive.org/about.php. This has a corresponding page on the beta site at https://beta.httparchive.org/about. A more complicated example is http://httparchive.org/viewsite.php?pageid=84263714, which may be supported in the future.
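A sketch of how this mapping could look in the Flask layer; the table below is illustrative, not the full list of legacy URLs:

from flask import Flask, redirect, request

app = Flask(__name__)

# 301 when the feature exists on the beta site, 302 when it may still be supported later.
LEGACY_ROUTES = {
    '/about.php': ('/about', 301),
    '/viewsite.php': ('http://legacy.httparchive.org/viewsite.php', 302),
}

@app.before_request
def redirect_legacy_urls():
    target = LEGACY_ROUTES.get(request.path)
    if target:
        url, code = target
        # Preserve the query string (eg ?pageid=84263714) when forwarding.
        if request.query_string:
            url += '?' + request.query_string.decode('utf-8')
        return redirect(url, code=code)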
Any HTTP request should automatically redirect to HTTPS. This is a simple feature but had a subtle infinite redirect bug when I last attempted a fix in the Flask layer.
Related: ensure the Let's Encrypt certificate automatically renews. Same for the https://cdn.httparchive.org certificate.
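A minimal sketch of the HTTPS redirect in the Flask layer, assuming TLS is terminated at the App Engine proxy; checking the forwarded protocol header rather than the request scheme is what avoids the infinite redirect loop:

from flask import Flask, redirect, request

app = Flask(__name__)

@app.before_request
def enforce_https():
    # Behind the proxy the original scheme arrives in X-Forwarded-Proto; the request
    # itself always looks like plain HTTP, which is what caused the redirect loop.
    if request.headers.get('X-Forwarded-Proto', 'http') != 'https':
        return redirect(request.url.replace('http://', 'https://', 1), code=301)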
This code base should become the canonical "httparchive" project on GitHub. The legacy code base should be renamed to "legacy.httparchive.org" or similar. Consider the careful dance of moving code around in such a way that the primary project maintains the same stars/watchers. However this may screw up the commit history.
Just noticed it: https://httparchive.org/reports/progressive-web-apps#pwaScores
Also, the JS boot-up chart's last data point is Jul 15 rather than Aug 15.
The percent of vulnerable JS dropped to 0 in the most recent release.
https://httparchive.org/reports/state-of-the-web?start=2018_06_15&end=2018_07_01#pctVuln
Determine how JS is modularized, eg webpack.
To do:
cc @addyosmani
Progress on legacy redirects in #13.
- legacy subdomain to resolve to the legacy web server
- legacy subdomain in the legacy web server's nginx and apache configs
- legacy subdomain
- @ and www DNS A records to CNAME records, resolving to the new (beta) App Engine server at ghs.googlehosted.com
DNS settings:
Build a new custom timeseries visualization in the State of JS report that tracks usage of popular JS libraries.
I imagine this being a line chart with one line per library. We'd need to override the default Highcharts color scheme to give each line a distinct color. Right now it's set up for desktop and mobile series.
We also need to specially handle jQuery somehow. Its usage is so much higher than everything else that it would make the other libraries look tiny by comparison. We can enable y-axis zooming, use a logarithmic y-axis, or rely on manually toggling the series on/off and letting the chart autoscale.
Data from the Chrome UX Report indicates that the vast majority of connection speeds on mobile devices (phone/tablet) are effectively 4G:
#standardSQL
SELECT
SUM(IF(effective_connection_type.name = '4G', bin.density, 0)) / SUM(bin.density) AS pct_mobile_4g
FROM
`chrome-ux-report.chrome_ux_report.201710`,
UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
form_factor.name IN ('phone', 'tablet')
https://bigquery.cloud.google.com/results/chrome-ux-report:bquijob_35020ffa_160510ab375
Result: 87.95%
Update the WebPageTest configuration for mobile tests to use 4G speeds to more accurately represent real user conditions.
Console error:
Refused to load the image 'https://stats.g.doubleclick.net/r/collect?[...]'
because it violates the following Content Security Policy directive:
"img-src 'self' discuss.httparchive.org www.google-analytics.com".
This is a benign analytics endpoint related to Google Analytics.
To fix, add stats.g.doubleclick.net to the img-src policy in the CSP whitelist.
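With that change, the directive would read: img-src 'self' discuss.httparchive.org www.google-analytics.com stats.g.doubleclick.net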
Make it possible for users to enter a WPT ID and see relevant test result data inline with the aggregate HTTP Archive data. For example, show the user's 6747.7 KB page weight in the context of the median desktop page weight of 1698.4 KB.
Subtasks:
SELECT
url
FROM
[httparchive:summary_pages.2018_06_15_desktop]
WHERE
_adult_site IS TRUE
0 results.
SELECT
url
FROM
[httparchive:summary_pages.2017_06_15_desktop]
WHERE
_adult_site IS TRUE
0 results.
SELECT
url
FROM
[httparchive:summary_pages.2016_06_15_desktop]
WHERE
_adult_site IS TRUE
7071 results.
BuiltWith puts WordPress coverage at ~30% of the web.
After increasing the coverage to 1.3M URLs from CrUX, we ~doubled the number of WordPress sites detected. However, the coverage increase was a factor of ~3x, so the relative percent of WordPress sites has gone down:
date | client | wordpress | total | % |
---|---|---|---|---|
Aug 1, 2018 | desktop | 201,474 | 1,275,374 | 15.80% |
Aug 1, 2018 | mobile | 198,572 | 1,268,277 | 15.66% |
Jul 15, 2018 | desktop | 201,277 | 1,277,631 | 15.75% |
Jul 15, 2018 | mobile | 198,290 | 1,272,071 | 15.59% |
Jul 1, 2018 | desktop | 200,927 | 1,277,805 | 15.72% |
Jul 1, 2018 | mobile | 101,401 | 451,109 | 22.48% |
Jun 15, 2018 | desktop | 103,768 | 461,068 | 22.51% |
Jun 15, 2018 | mobile | 100,978 | 451,307 | 22.37% |
Prior to the coverage increase, WordPress represented ~22% of the URLs. This is still well below the 30% figure which is widely regarded as the source of truth. After the expansion, WordPress representation dropped to ~16%.
In January 2018 @pmeenan ran an experiment where he tested all 2.9M CrUX URLs and detected WordPress on 785k (27%) of them. Assuming this ratio holds true in the most recent CrUX release that formed the basis of the HA corpus, we're not capturing (or detecting) a significant proportion of these sites.
The methodology behind the corpus change was to intersect the Alexa Top 1M domains (as of March 15, 2017) with origins in the 201805 CrUX release. We should investigate whether this approach introduced bias that excluded a significant proportion of WordPress sites.
Possible topics:
I followed a link to https://www.httparchive.org/reports/state-of-the-web (actually to http://www.httparchive.org/interesting.php, thanks for redirecting) and the graphs don't load because the CDN only allows CORS from https://httparchive.org/. I don't know whether you'd want to redirect www to the bare domain or add www to the CORS allow header, but I thought I'd let you know.
_cpu.v8.compile appears 0 times in any of the four crawls' HAR payloads in February. It made 45,562,739 appearances the previous month. @pmeenan is this intentional/expected?
This data is currently used in the compileJs metric, so recent crawls have broken charts (fix pending).
The "Unused JavaScript" Lighthouse audit is not enabled by default.
Adjust the Lighthouse config in our WebPageTest agents to enable this audit.
See the note in the LH docs for more info:
Note: This audit is off by default! It can be enabled in the Node CLI by using the lighthouse:full configuration profile.
@pmeenan can you look into this?
By default, report pages list all metrics vertically. Each metric has:
All of this UI baggage may make it difficult to skim through the report to get an overall sense for how things are going.
The purpose of this feature request is to design a grid view of metrics so that only the necessary information to understand the state of the metrics is shown compactly.
We need to improve our documentation around contributing to the project, both in terms of committing code and participating in the analysis/discussion.
The goal is to create a top-level "Contribute" navigation item on the website that links to a section in the About page with the following info:
If a person wants to get more involved with the project, they should understand when to use which channel and how to join.
Passing in a wptid URL param with a WebPageTest ID will enable those individual test results to be displayed inline with the macro data from HTTP Archive.
In addition to showing the data point above the chart, we should embed it in the context of the chart in the form of a visualized line or bar. For timeseries, it should be a horizontal line at the y-intercept corresponding to the WPT value. Similarly for histograms, it should be a vertical line.
See https://developers.google.com/web/updates/2017/04/devtools-release-notes#coverage
@pmeenan is this something we can get directly from WPT if the right bits are fiddled?
Per @pmeenan:
We could (should) set up a cron to back up the database and code to the google storage bucket.
The servers have been failing at a faster rate lately, so starting these backups sooner would ensure that we're protected against a failure of the master DB server.
Pat, would you have time for this?
Hi, I'm working on some enterprise systems that can only be accessed internally. How can I set up a private HTTP Archive instance on Linux or Windows? It should work well alongside a WPT private instance. Thanks a lot!
To accelerate (and track) the adoption of better compression across the web, it would be most helpful if HTTP Archive could track supported compression methods. Specifically, tracking the set of sites which are served with Brotli compression (content-encoding:br) vs. gzip.
https://opensource.googleblog.com/2015/09/introducing-brotli-new-compression.html
https://www.gstatic.com/b/brotlidocs/brotli-2015-09-22.pdf
Thank you!
On pages like https://httparchive.org/reports/state-of-javascript the default time range shown in the big chart is... the past 12 months? But the time selector beneath it is about 9 times wider. Could the UI default to showing 50% of the available data (or, in the case of some newer reports, no less than 2 years)?
wdyt
Need to resolve this warning:
~/python-venv/lib/python2.7/site-packages/google/auth/_default.py:66: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/.
warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
See https://twitter.com/janssenstom/status/1036171871237091329?s=19
Difficult to navigate to subitems.
Could we add new columns (name + type) in our requests table to record the 3P badges? Also, could we propagate those tags down to any child requests? E.g. if a.js is tagged as XYZ corp, and fetches b.jpg, the latter should have the same tag.
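A rough sketch of the propagation step, assuming each request row carries its URL, the URL of the request that initiated it, and an optional vendor tag (all field names are hypothetical):

def propagate_tags(requests):
    # requests: list of dicts with 'url', 'initiator', and optionally 'tag' (vendor name + type).
    tag_by_url = {r['url']: r.get('tag') for r in requests}
    changed = True
    while changed:
        changed = False
        for r in requests:
            if not r.get('tag'):
                inherited = tag_by_url.get(r.get('initiator'))
                if inherited:
                    # Child inherits the tag of the request that fetched it (eg a.js -> b.jpg).
                    r['tag'] = inherited
                    tag_by_url[r['url']] = inherited
                    changed = True
    return requests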
AFAIK, CDT does this analysis at runtime, so it's not available in the trace? Should it be, or do we need to integrate this into our own analysis pipeline?
/cc @paulirish @rviscomi
When a crawl is complete, reports on beta.httparchive.org are immediately updated with the latest data.
Histogram and timeseries queries are defined for each metric. After the crawl completes, these queries need to be rerun against the latest data and their JSON results saved to Cloud Storage (GCS). Timeseries metrics are saved in undated JSON files (eg https://cdn.httparchive.org/reports/bytesJs.json). Histograms are saved in directories corresponding to the crawl date in YYYY_MM_DD format (eg https://cdn.httparchive.org/reports/2018_02_01/bytesJs.json).
The HTTP Archive web server needs to know which dates are available for each report in GCS. It maintains dates.json to track all crawls for which data is available in GCS. Each date option is made available in the UI in a dropdown field to explore historical data. This JSON file is loaded into memory when the server starts up.
In order to automate the report generation, we need a process that runs the queries, uploads results to GCS, and identifies which dates are available to explore in the UI.
To generate the reports automatically, we will use PubSub to signal when the data is available on BigQuery after the Dataflow pipeline has completed.
I have a few ideas for how we can use subscribers to do the actual BQ querying and backup to GCS:
Use a Compute Engine (GCE) server that either polls PubSub for published events or exposes an HTTP endpoint to receive a push notification. PubSub may even be overkill here since both the Dataflow job to create the BQ tables and report generation could be on the same instance.
Use a Cloud Function (GCF). These are lightweight and don't require provisioning servers. AFAICT these are only available in a Node.js environment and don't permit arbitrary first-party dependencies like a set of queries to run, so we'd need to side-load them through GCS or an even more convoluted node module from npm.
Use the existing App Engine (GAE) server to receive a push notification when the crawl is done and kick off the report generation. This is nice because the entire repository is deployed to GAE, so it can directly access the queries. I don't think we can just directly run the generateReports.sh shell scripts I've already written, but porting the scripts to Python should make them runnable on GAE.
The solution we go with needs to address the following issues:
I'm in favor of the third option to use GAE. I'll try to address each of the above concerns:
Deploying the source code is trivial because the web app and generation code are developed in the same code repository and pushed to the same server. No extra deployment steps are needed.
GAE is obviously already listening for HTTP(S) requests, so it will use the push model to receive notifications. During prototyping, this had some issues, like securing the endpoint. We want to ensure that only authorized users can trigger report generation.
Loading SQL dependencies is trivial because they are included in the deployment. It will be as simple as opening a local file.
Testing the pubsub endpoint locally should be possible using the pubsub emulator, but this looks like a lot of overhead. We may want to create a "staging" GAE environment where we can deploy experimental code that is publicly accessible to receive pubsub notifications.
We can use the existing GAE Stackdriver logging support to monitor the health of the pubsub endpoint.
Pushing the results to GCS would be done using the cloudstorage library for GAE.
The web app code already keeps a local copy of dates.json, so it could simply update this with each new crawl. There's also a case to be made that the dates could be inferred from the available directories on GCS, which is a more reliable source of truth than a JSON artifact. Although for these purposes it doesn't really matter either way.
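A rough sketch of what the GAE push endpoint could look like, assuming a Flask route; the helper functions and the crawl_date message attribute are hypothetical, and the real queries are the ones already checked into the repository:

import json

from flask import Flask, abort, request

app = Flask(__name__)

# Hypothetical shared secret to confirm the push came from our PubSub subscription.
PUBSUB_VERIFICATION_TOKEN = 'change-me'

@app.route('/tasks/generate_reports', methods=['POST'])
def handle_crawl_complete():
    if request.args.get('token') != PUBSUB_VERIFICATION_TOKEN:
        abort(403)
    message = json.loads(request.data)['message']
    crawl_date = message['attributes']['crawl_date']  # eg "2018_02_01" (assumed attribute)
    for metric in list_metric_queries():  # hypothetical: enumerate the checked-in SQL files
        results = run_bigquery(metric, crawl_date)  # hypothetical BigQuery wrapper
        write_to_gcs('reports/%s/%s.json' % (crawl_date, metric), results)  # hypothetical GCS writer
    update_dates_json(crawl_date)  # hypothetical: record the new crawl date
    return '', 204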
--
So this is the proposed design to automate report generation using GAE. Please let me know if you have questions about this approach or would like to explore any of the other options.