Comments (15)
Closing this out. We can join with @patrickhulce's third-party-web data on BigQuery to achieve the same effect. See this example query.
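(For anyone landing here later, a join along those lines might look like the sketch below. The `third_party_web` table path and its `domain`/`category` columns are assumptions about how @patrickhulce's dataset might be loaded into BigQuery, not the exact query linked above; the `respBodySize` column name should also be double-checked against the summary table schema.)

```sql
-- Hypothetical sketch: join HTTP Archive requests against a table of
-- known third-party domains. Table paths and column names are assumed.
SELECT
  tp.category,
  COUNT(0) AS requests,
  SUM(r.respBodySize) AS body_bytes
FROM
  `httparchive.summary_requests.2019_01_01_desktop` r
JOIN
  `your-project.third_party_web.domains` tp
ON
  NET.REG_DOMAIN(r.url) = tp.domain
GROUP BY
  category
ORDER BY
  requests DESC
```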
from httparchive.org.
I don't think the attribution itself is in the raw data. The best way would probably be to implement it in the WebPageTest agent and maintain a copy of the lookup table there. The dependency chains could also be followed on the agent (though if the attribution was added to the trace or devtools events then a parallel table wouldn't have to be maintained).
Presumably we'd also want some high-level stats at the page level for breaking out 1st vs 3rd party counts and sizes.
@paulirish any thoughts on moving the tagging upstream in CDT, to make it part of the devtools trace? Seems like a generally useful feature for various downstream consumers, no? :)
@pmeenan agent approach makes sense (assuming we can't move it even further upstream). Ditto for page-level stats, but those we could produce outside of the agent too when we do our aggregations.
A counter argument for moving this logic upstream is if the tables are periodically updated and improved, having the identification logic live in our pipeline allows us to rerun the pipeline and regenerate the classification. @paulirish how often (if at all) is the CDT classification updated?
Can I assist with implementing this? Let me know how I could get started.
My thoughts are:
- We would do this logic inside of WPT, and have the data flow into the tables in BigQuery:
  - pages - add numDomainsThirdParty, reqTotalThirdParty, bytesTotalThirdParty
  - requests - add isThirdParty, thirdPartyCategory
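Until those columns exist, a rough version of the page-level stats could be approximated in BigQuery after the fact. The sketch below treats any request whose registrable domain differs from the page's as third party, which is a much cruder heuristic than a maintained lookup table; column names follow the summary tables but should be verified against the schema.

```sql
-- Naive sketch: per-page third-party counts/bytes, treating "different
-- registrable domain than the page" as third party.
SELECT
  p.url AS page,
  COUNT(DISTINCT IF(NET.REG_DOMAIN(r.url) != NET.REG_DOMAIN(p.url),
                    NET.REG_DOMAIN(r.url), NULL)) AS numDomainsThirdParty,
  COUNTIF(NET.REG_DOMAIN(r.url) != NET.REG_DOMAIN(p.url)) AS reqTotalThirdParty,
  SUM(IF(NET.REG_DOMAIN(r.url) != NET.REG_DOMAIN(p.url),
         r.respBodySize, 0)) AS bytesTotalThirdParty
FROM
  `httparchive.summary_pages.2019_01_01_desktop` p
JOIN
  `httparchive.summary_requests.2019_01_01_desktop` r
ON
  p.pageid = r.pageid
GROUP BY
  page
```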
The logic I've used in the past (prior to Google creating the CDT third-party badges) used a regexp to pattern-match third-party domains. Once I detected that a resource was third party, any subsequent requests whose referrer was that resource would "inherit" the third-party attribute.
I'm not sure if we capture the Chrome resource initiator inside of WPT, but if we do, then we could do an even better job on handling 3rd party detection.
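For illustration, the regexp half of that approach is easy to express in BigQuery (the pattern list here is a tiny, made-up sample, not the full list I used); the referrer-inheritance half would additionally require joining each request back to the request matching its Referer header.

```sql
-- Sketch: flag requests whose hostname matches a (deliberately tiny,
-- illustrative) list of known third-party domains.
SELECT
  url,
  REGEXP_CONTAINS(NET.HOST(url),
    r'(^|\.)(google-analytics\.com|doubleclick\.net|connect\.facebook\.net)$')
    AS isThirdParty
FROM
  `httparchive.summary_requests.2019_01_01_desktop`
LIMIT 100
```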
A question on the Third Party database - can it be configured to detect this resource:
https://d1z2jf7jlzjs58.cloudfront.net/code/ptrack-v1.0.3-engagedtime.js
(This is the first resource in a chain that loads additional parsely.com resources into the page -- these are for the parse.ly audience insights platform.)
Proper detection for parse.ly requires more than just looking at the domain, because there are lots of cloudfront.net resources that are part of customer websites.
/cc @rviscomi
Bumping the priority of this feature. @LeslieMurphy let me know if you're still interested in working on this.
> A question on the Third Party database - can it be configured to detect this resource:
> https://d1z2jf7jlzjs58.cloudfront.net/code/ptrack-v1.0.3-engagedtime.js
> (This is the first resource in a chain that loads additional parsely.com resources into the page -- these are for the parse.ly audience insights platform.)
This is an important point to get right. We want to make sure that we attribute all of the requests to the right parent/initiator, and then add another layer of smarts that tags these initiators against a set of categories like analytics, advertising, social, etc.
This could be done at runtime within WPT, or after the fact based on the dependency tree, but that also means we need high confidence that all the edges are present in the dependency tree. Do we?
As of now we don't have a complete dependency tree. About 28% of requests are missing the "initiator" field in the HAR payload. I opened this thread on the WPT forum to see if this is an upstream issue. cc @pmeenan
Once we get that sorted out, it should be straightforward to follow the chain of initiators from a known third party to all of its dependent requests. We could use a technique similar to @paulcalvano's, where we join a table of known third parties and their host names with the HTTP Archive requests to better understand which requests are third parties, what type of third party they are (ads, analytics, etc), and what they are loading/doing. This wouldn't require any pre-processing of the requests and could be done entirely in BigQuery.
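As a sketch of one hop of that attribution (the known-third-parties table and its `host`/`category` columns are hypothetical placeholders): requests whose `_initiator` is a known third-party URL inherit that third party's category. Following the full chain would mean repeating this self-join per hop, or using an iterative approach.

```sql
-- Sketch: one hop of initiator-chain attribution. The known_third_parties
-- table and its host/category columns are hypothetical placeholders.
SELECT
  child.url AS request,
  tp.category
FROM
  `httparchive.requests.2019_01_01_desktop` child
JOIN
  `httparchive.requests.2019_01_01_desktop` parent
ON
  child.page = parent.page
  AND JSON_EXTRACT_SCALAR(child.payload, '$._initiator') = parent.url
JOIN
  `your-project.known_third_parties.hosts` tp
ON
  NET.HOST(parent.url) = tp.host
```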
As far as I know, I report all of the initiator information that Dev tools collects. One thing we discussed maybe adding is to associate all unknown requests in a sub-frame with the main request for the frame which should help with the attribution for ads.
tldr: The drop in initiator reliability correlates with M70+.
The website I included as an example in the WPT thread is www.usedtrucks.mercedes-benz.co.uk/. In the most recent crawl, only the initial HTML request is annotated with the expected initiator (empty string). The field is omitted entirely from all other requests.
When I manually test the page in Chrome (version 71.0.3578.98), I do see the expected initiator data:
I just retested the page in WPT and now I actually do see initiator fields in the HAR consistently for all 3 runs: https://www.webpagetest.org/result/190109_5H_10eab806c99209dd025fc14b48f8d820/
We've been testing this particular URL in HA since July 2018, so we can see if the percent of requests with an initiator field has changed:
```sql
SELECT
  _TABLE_SUFFIX AS crawl,
  SUM(IF(JSON_EXTRACT(payload, '$._initiator') IS NOT NULL, 1, 0)) / COUNT(0) AS pct_initiators
FROM
  `httparchive.requests.*`
WHERE
  page = 'http://www.usedtrucks.mercedes-benz.co.uk/'
GROUP BY
  crawl
HAVING
  pct_initiators IS NOT NULL
ORDER BY
  crawl
```
Surprisingly, things took a nosedive on October 15:
date | desktop | mobile |
---|---|---|
2018_07_01 | 100.00% | |
2018_07_15 | 100.00% | 98.37% |
2018_08_01 | 100.00% | 100.00% |
2018_08_15 | 100.00% | 100.00% |
2018_09_01 | 100.00% | 100.00% |
2018_09_15 | 100.00% | 100.00% |
2018_10_01 | 100.00% | 100.00% |
2018_10_15 | 0.36% | 0.36% |
2018_11_01 | 0.34% | 0.34% |
2018_11_15 | 0.34% | 0.34% |
2018_12_01 | 0.34% | 0.34% |
2018_12_15 | 0.32% | 0.32% |
And when we look at initiators for all requests on all pages, things get interesting:
date | desktop | mobile |
---|---|---|
2018_07_01 | 99.08% | 99.01% |
2018_07_15 | 99.00% | 99.00% |
2018_08_01 | 99.08% | 99.02% |
2018_08_15 | 99.08% | 99.08% |
2018_09_01 | 99.04% | 99.06% |
2018_09_15 | 99.04% | 99.08% |
2018_10_01 | 99.04% | 99.06% |
2018_10_15 | 59.81% | 54.89% |
2018_11_01 | 42.76% | 44.94% |
2018_11_15 | 42.46% | 44.81% |
2018_12_01 | 41.65% | 43.40% |
2018_12_15 | 72.21% | 60.01% |
Again, things changed globally on October 15. And December 15 was actually much better in terms of coverage than crawls since November.
Looking at the Chrome versions during this timeframe, it seems like we switched from Chrome 69 to 70. So I wonder if there were some unexpected reliability issues in M70+ with the initiator field.
date | 68 | 69 | 70 | 71 |
---|---|---|---|---|
2018_09_01 | 50.64% | 49.36% | ||
2018_09_15 | 100.00% | |||
2018_10_01 | 100.00% | |||
2018_10_15 | 24.33% | 75.67% | ||
2018_11_01 | 0.06% | 0.17% | 99.77% | |
2018_11_15 | 100.00% | |||
2018_12_01 | 87.02% | 12.98% | ||
2018_12_15 | 0.07% | 99.93% |
One thing I can't explain is why the Mercedes example from the December 15 crawl had ~0% initiators in Chrome 71.0.3578.98 but from my ad hoc test today 100% of initiators are present in the exact same browser version.
Ok, let's wait for the 1/1 crawl to complete and see if the initiators are appearing as expected.
Here's an updated table of initiator coverage with 2019_01_01:
date | desktop | mobile |
---|---|---|
2018_09_01 | 99.04% | 99.06% |
2018_09_15 | 99.04% | 99.08% |
2018_10_01 | 99.04% | 99.06% |
2018_10_15 | 59.81% | 54.89% |
2018_11_01 | 42.76% | 44.94% |
2018_11_15 | 42.46% | 44.81% |
2018_12_01 | 41.65% | 43.40% |
2018_12_15 | 72.21% | 60.01% |
2019_01_01 | 83.90% | 83.90% |
(Yes, desktop and mobile actually come out to the same rounded value.)
There's definitely been some improvement, but it's still not quite back to the normal ~99%.
99% seems unrealistically high. In normal testing I see a few URLs per page that have "other" as the initiator in the raw dev tools data. I can include that if it would help, but it amounts to "unknown".
I did JUST take a look and push an improvement for cases where the initiator was a JavaScript call stack that references a script ID but didn't include the script URL. That can happen when a script inserts a script directly into the DOM (or does an eval), so now I monitor all script compilations and walk the call stack for every script ID to see what caused each script to get added, and use that as the initiator in those cases.
We're around 1/3 the way into this month's crawl so there will be a decent bump in coverage and then another small bump in March (assuming nothing in Chrome changes between now and then).
Copying an image from an earlier comment for clarification:
Are you saying the ~99% we saw from May 2017 to October 2018 was anomalous and the ~80% before and after that range is the realistic expectation?
Yep, pretty much (or the extra 19% had an empty, but present, initiator). At a minimum, the main document request usually doesn't have one, and AFAIK neither do iframe src URLs (in addition to a bunch of other edge cases). 80%-ish sounds like a good expectation.