Comments (13)
I'm not seeing any duplicates on the main server the dumps are created on either:
mysql> SELECT url, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY url HAVING n > 1 ORDER BY n DESC;
Empty set (35.76 sec)
587 is the crawl ID for the 2019-04-01 desktop crawl.
Any chance you're not including the protocol (http:// or https://) part of the URL when collecting the site stats? It's expected that both will be present for sites that have traffic on both, since they are different origins.
Looking at domains that have both http:// and https:// results:
mysql> SELECT LEFT(RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1)) ,position('/' IN RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1))) - 1) AS domain, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY domain HAVING n > 1 ORDER BY n DESC;
113500 rows in set (40.32 sec)
That looks pretty close to half of your number, so, assuming you are counting both of them, my guess is that you are calculating stats at the domain level rather than the origin level. We test both the http and https versions of an origin if CrUX shows that both had traffic during the previous month.
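For comparison, the same host-level grouping reads more easily with SUBSTRING_INDEX, assuming every URL has the form protocol://host/...:
-- Sketch: equivalent host extraction via SUBSTRING_INDEX rather than the
-- nested LEFT/RIGHT/position() expression above (URL shape assumed).
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '/', -1) AS domain,
       COUNT(0) AS n
FROM pages
WHERE crawlid = 587
GROUP BY domain
HAVING n > 1
ORDER BY n DESC;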
from legacy.httparchive.org.
If both protocols are in CrUX, that means both had a meaningful amount of traffic during the month. If a site was redirecting from http to https, CrUX would only report the https origin. The content can be completely different if, for example, a site is migrating to a new system and deploying https as part of the migration, and the performance characteristics are likely to be very different.
If you are going to de-dupe them into a single entry, I'd recommend favoring the https:// variant when there are duplicates. For the HTTP Archive it makes sense to just collect both of them so we have a clean and complete dataset that matches CrUX.
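A minimal sketch of that de-dupe, keeping the https:// row whenever both protocols are present (assuming the pages/crawlid/url schema from the queries above):
-- Sketch: keep a row unless it is the http:// twin of an https:// row
-- in the same crawl (schema assumed from the queries above).
SELECT p.url
FROM pages p
WHERE p.crawlid = 587
  AND (p.url LIKE 'https://%'
       OR NOT EXISTS (SELECT 1
                      FROM pages q
                      WHERE q.crawlid = p.crawlid
                        AND q.url = REPLACE(p.url, 'http://', 'https://')));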
from legacy.httparchive.org.
Yes, when the crawl started the mobile URL list had not been updated correctly, so we fixed that on the fly and restarted the crawl. About 40k tests had already been started, and those correspond to the aborted crawl.
from legacy.httparchive.org.
Thanks for the explanation, but in the interests of data consistency those tests should either be removed from the dump or be given a relevant crawl ID.
from legacy.httparchive.org.
Are you able to omit 581 from your end of the pipeline? We're not maintaining any of the legacy systems beyond recovering from data loss or breakages, and this doesn't seem to affect anything critical. I'm not aware of anyone else who depends on the raw legacy results, so if you could work around it for now we should be ok.
from legacy.httparchive.org.
I can work around this one fairly easily because it raises an exception directly when I import the data. A bigger pain is duplicate tests within the same crawl, because these don't show up until I create some reports. The right constraints on the database would stop this from happening, but I can appreciate you not wanting to touch the schema at this stage, particularly as fixing the duplicate tests requires window functions, which I'm not sure MySQL supports.
It would be great if you could give a heads-up at the end of any crawl when such anomalies are expected.
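For what it's worth, window functions did land in MySQL 8.0, so a sketch of flagging the surplus rows might look like this (pageid is assumed to be a unique key, which I haven't verified against the real schema):
-- Sketch (MySQL 8.0+ / Postgres): list the extra rows for each URL that
-- appears more than once within a crawl. pageid is an assumed unique column.
WITH ranked AS (
  SELECT pageid,
         ROW_NUMBER() OVER (PARTITION BY crawlid, url ORDER BY pageid) AS rn
  FROM pages
  WHERE crawlid = 587
)
SELECT pageid
FROM ranked
WHERE rn > 1;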
from legacy.httparchive.org.
Will do! Thanks for your understanding.
from legacy.httparchive.org.
FWIW, I'm still seeing around 120,000 duplicates in every run. I haven't done any further analysis, but I think you should be looking at what's causing this.
from legacy.httparchive.org.
In the desktop crawl for 2019-04-01 there are around 227,000 sites with duplicate tests, so we're getting close to 5%. This will start to affect any derived statistics and is also a waste of resources. Do we have any idea what's causing this? Are some tests being allocated twice?
from legacy.httparchive.org.
Could you share the query you're running to get that count?
summary_pages on BigQuery is created from the MySQL-based CSV dumps and it's not showing any duplicates:
SELECT
  url,
  COUNT(0) AS n
FROM
  `httparchive.summary_pages.2019_04_01_desktop`
GROUP BY
  url
HAVING
  n > 1
ORDER BY
  n DESC
If the duplicates are real, I agree it's worth investigating, at least for the sake of resource conservation.
from legacy.httparchive.org.
My query is always slightly different due to the way I import data into Postgres, but I can provide a list of what I think are duplicate sites.
from legacy.httparchive.org.
Pat, I think you've identified the problem: I do normalise on the domain, which is why these count as duplicates. I'm not sure there is much point in keeping them separate just because both protocols are in the CrUX dataset: is it right to consider these as distinct websites? But the important thing is that we've identified the source of the anomaly, and I can adjust my import script.
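The adjustment amounts to normalising to the origin (protocol plus host) rather than the bare host; a sketch in Postgres, with column names assumed to match the dumps:
-- Sketch (Postgres): group by origin so http:// and https:// rows stay
-- distinct; any remaining duplicates would then be genuine.
SELECT split_part(url, '/', 1) || '//' || split_part(url, '/', 3) AS origin,
       COUNT(*) AS n
FROM pages
GROUP BY origin
HAVING COUNT(*) > 1;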
from legacy.httparchive.org.
I appreciate exactly what you're saying about the protocols, but a cursory check suggests that these are duplicates and that the websites are just not configured to redirect. At some point, as we move towards HTTP/2, the issue may resolve itself; in the meantime, I guess it's an interesting effect in itself.
For my purposes I'm doing just what you suggest and am keeping only the https:// variants.
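In Postgres that can be sketched with DISTINCT ON, keeping one row per host and preferring the https:// variant (host extraction assumed as above):
-- Sketch (Postgres): one row per host, favouring the https:// variant.
SELECT DISTINCT ON (split_part(url, '/', 3)) url
FROM pages
ORDER BY split_part(url, '/', 3),
         (url LIKE 'https://%') DESC;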
from legacy.httparchive.org.