Comments (13)
I'm not seeing any duplicates on the main server the dumps are created on either:
mysql> SELECT url, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY url HAVING n > 1 ORDER BY n DESC;
Empty set (35.76 sec)
587 is the crawl ID for the 2019-04-01 desktop crawl.
Any chance you're not including the protocol (http:// or https://) part of the URL when collecting the site stats? It's expected that both will be present for sites that have traffic on both, since they are different origins.
Looking at domains that have both http:// and https:// results:
mysql> SELECT LEFT(RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1)) ,position('/' IN RIGHT(`url` ,length(`url`) -(position('//' IN `url`) + 1))) - 1) AS domain, COUNT(0) AS n FROM pages WHERE crawlid = 587 GROUP BY domain HAVING n > 1 ORDER BY n DESC;
113500 rows in set (40.32 sec)
That looks pretty close to half of your number, so, assuming you are counting both of them, my guess is that you are calculating stats at the domain level rather than the origin level. We test both the http and https versions of an origin if CrUX shows that both had traffic during the previous month.
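For comparison, the same host-level grouping reads more easily with SUBSTRING_INDEX, assuming every URL has the form protocol://host/...:
-- Sketch: equivalent host extraction via SUBSTRING_INDEX rather than the
-- nested LEFT/RIGHT/position() expression above (URL shape assumed).
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(url, '/', 3), '/', -1) AS domain,
       COUNT(0) AS n
FROM pages
WHERE crawlid = 587
GROUP BY domain
HAVING n > 1
ORDER BY n DESC;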
from legacy.httparchive.org.
If both protocols are in CrUX, that means both had a meaningful amount of traffic during the month. If a site was redirecting from http to https, CrUX would only report the https origin. The content can be completely different if, for example, a site is migrating to a new system and deploying https as part of the migration, and the performance characteristics are likely to be very different.
If you are going to de-dupe them into a single entry, I'd recommend favoring the https:// variant when there are duplicates. For the HTTP Archive it makes sense to just collect both of them so we have a clean and complete dataset that matches CrUX.
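A minimal sketch of that de-dupe, keeping the https:// row whenever both protocols are present (assuming the pages/crawlid/url schema from the queries above):
-- Sketch: keep a row unless it is the http:// twin of an https:// row
-- in the same crawl (schema assumed from the queries above).
SELECT p.url
FROM pages p
WHERE p.crawlid = 587
  AND (p.url LIKE 'https://%'
       OR NOT EXISTS (SELECT 1
                      FROM pages q
                      WHERE q.crawlid = p.crawlid
                        AND q.url = REPLACE(p.url, 'http://', 'https://')));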
from legacy.httparchive.org.
Yes, when the crawl started the mobile URL list had not been updated correctly, so we fixed that on the fly and restarted the crawl. About 40k tests had already been started, and those correspond to the aborted crawl.
from legacy.httparchive.org.
Thanks for the explanation, but in the interests of data consistency those tests should either be removed from the dump or be given a relevant crawl ID.
from legacy.httparchive.org.
Are you able to omit 581 from your end of the pipeline? We're not maintaining any of the legacy systems beyond recovering from data loss or breakages, and this doesn't seem to affect anything critical. I'm not aware of anyone else who depends on the raw legacy results, so if you could work around it for now we should be ok.
from legacy.httparchive.org.
I can work around this one fairly easily because it raises an exception directly when I import the data. A bigger pain is duplicate tests within the same crawl, because these don't show up until I create some reports. The right constraints on the database would stop this from happening, but I can appreciate you not wanting to touch the schema at this stage, particularly as fixing the duplicate tests requires window functions, which I'm not sure MySQL supports.
It would be great if you could give a heads-up at the end of any crawl when such anomalies are expected.
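For what it's worth, window functions did land in MySQL 8.0, so a sketch of flagging the surplus rows might look like this (pageid is assumed to be a unique key, which I haven't verified against the real schema):
-- Sketch (MySQL 8.0+ / Postgres): list the extra rows for each URL that
-- appears more than once within a crawl. pageid is an assumed unique column.
WITH ranked AS (
  SELECT pageid,
         ROW_NUMBER() OVER (PARTITION BY crawlid, url ORDER BY pageid) AS rn
  FROM pages
  WHERE crawlid = 587
)
SELECT pageid
FROM ranked
WHERE rn > 1;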
from legacy.httparchive.org.
Will do! Thanks for your understanding.
from legacy.httparchive.org.
FWIW, I'm still seeing around 120,000 duplicates in every run. I haven't done any further analysis, but I think you should be looking at what's causing this.
from legacy.httparchive.org.
In the desktop crawl for 2019-04-01 there are around 227,000 sites with duplicate tests, so we're getting close to 5%. This will start to affect any derived statistics and is also a waste of resources. Do we have any idea what's causing this? Are some tests being allocated twice?
from legacy.httparchive.org.
Could you share the query you're running to get that count?
summary_pages on BigQuery is created from the MySQL-based CSV dumps and it's not showing any duplicates:
SELECT
  url,
  COUNT(0) AS n
FROM
  `httparchive.summary_pages.2019_04_01_desktop`
GROUP BY
  url
HAVING
  n > 1
ORDER BY
  n DESC
If the duplicates are real, I agree it's worth investigating, at least for the sake of resource conservation.
from legacy.httparchive.org.
My query is always slightly different due to the way I import data into Postgres, but I can provide a list of what I think are duplicate sites.
from legacy.httparchive.org.
Pat, I think you've identified the problem: I do normalise on the domain, which is why these count as duplicates. I'm not sure there is much point in keeping them separate just because both protocols are in the CrUX dataset: is it right to consider these as distinct websites? But the important thing is that we've identified the source of the anomaly, and I can adjust my import script.
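The adjustment amounts to normalising to the origin (protocol plus host) rather than the bare host; a sketch in Postgres, with column names assumed to match the dumps:
-- Sketch (Postgres): group by origin so http:// and https:// rows stay
-- distinct; any remaining duplicates would then be genuine.
SELECT split_part(url, '/', 1) || '//' || split_part(url, '/', 3) AS origin,
       COUNT(*) AS n
FROM pages
GROUP BY origin
HAVING COUNT(*) > 1;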
from legacy.httparchive.org.
I appreciate exactly what you're saying about the protocols, but a cursory check suggests that these are duplicates and that the websites are just not configured to redirect. At some point, as we move towards HTTP/2, the issue may resolve itself; in the meantime, I guess it's an interesting effect in itself.
For my purposes I'm doing just what you suggest and am keeping only the https:// variants.
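In Postgres that can be sketched with DISTINCT ON, keeping one row per host and preferring the https:// variant (host extraction assumed as above):
-- Sketch (Postgres): one row per host, favouring the https:// variant.
SELECT DISTINCT ON (split_part(url, '/', 3)) url
FROM pages
ORDER BY split_part(url, '/', 3),
         (url LIKE 'https://%') DESC;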
from legacy.httparchive.org.