
Comments (9)

gregorywolf commented on August 12, 2024

Rick - I listened to your recent talk at the NYC performance meetup about categorizing URLs. I think this would be a powerful feature! Any thoughts on how to move it forward? I'd volunteer to help with the effort. Hopefully more folks feel the same way and we can proceed before too long.

from httparchive.org.

rviscomi commented on August 12, 2024

Hey Greg, thanks for volunteering! Assigning this to you :)

The next steps for this issue are:

  • survey the landscape of options: are there any other services similar to DMOZ that are regularly updated? What is the URL coverage, and how does it overlap with the Alexa 500K that we're using? Is there room for growth as we expand URL coverage? How correct and granular are the categories?
  • plan and integrate the category info with HTTP Archive's data: what changes need to be made to the Dataflow pipeline and BigQuery schema?
  • analyze the new data and surface interesting reports on the beta site
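
As a rough illustration of the coverage question in the first step, here is a minimal sketch that measures what fraction of a crawl's URL list appears in a candidate category source. The function names and the hostname-level matching rule (lowercase, strip a leading "www.") are assumptions made for this example; they are not part of HTTP Archive's pipeline or of any category service.

```python
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Return a lowercased hostname without a leading 'www.' (assumed matching rule)."""
    # urlsplit only populates the host when a scheme or '//' is present,
    # so bare domains such as 'example.com' get a '//' prefix first.
    if "//" not in url:
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def coverage(crawl_urls, categorized_urls) -> float:
    """Fraction of distinct crawl hostnames that also appear in the category source."""
    crawl = {hostname(u) for u in crawl_urls}
    categorized = {hostname(u) for u in categorized_urls}
    crawl.discard("")  # ignore entries that could not be parsed into a host
    if not crawl:
        return 0.0
    return len(crawl & categorized) / len(crawl)
```

Running the same function against each candidate source would give comparable coverage numbers. Matching on hostnames rather than full URLs is a deliberate simplification and may overstate coverage for sources that categorize individual pages.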


paulcalvano commented on August 12, 2024

I was thinking about this the other day and didn't realize there was an issue open. During my searches, the only thing I was able to find was archived DMOZ data. Here's the dump I found, in case it's useful: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OMV93V


igrigorik commented on August 12, 2024

Sadly, DMOZ is deprecated. I don't think we should hitch our wagon to this particular dataset.


rviscomi commented on August 12, 2024

Yeah, the DMOZ dump could be used as a last resort, but it'd be preferable to find a service that's actively maintained.

Ilya also did some work on joining DMOZ data with Alexa URLs here: https://bigquery.cloud.google.com/table/httparchive:urls.20170315?tab=preview. Of the 1M URLs, only ~170K (17%) have topics/categories.
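
For readers who want to see how a figure like that is derived, here is a small sketch of a left join that keeps every URL and attaches a category where one exists, then reports the categorized share. The schema of the linked table is not shown in this thread, so the plain dict and list used here, the function names, and the exact-string matching are all assumptions for illustration.

```python
def join_categories(urls, categories_by_url):
    """Left join: keep every URL and attach its category, or None when unmatched."""
    return [(url, categories_by_url.get(url)) for url in urls]


def categorized_share(joined) -> float:
    """Share of joined rows that ended up with a category."""
    if not joined:
        return 0.0
    matched = sum(1 for _, category in joined if category is not None)
    return matched / len(joined)
```

With the numbers quoted above, roughly 170,000 matched rows out of 1,000,000 gives 170000 / 1000000 = 0.17, i.e. 17%.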


paulcalvano commented on August 12, 2024

Ah, cool. I'll stop uploading that dataset to BigQuery then. I was about to do the same analysis :)


gregorywolf commented on August 12, 2024

Rick - I'll start poking around and see what I can find. Stay tuned.


rviscomi commented on August 12, 2024

Hey @gregorywolf, have you made any progress on this?


gregorywolf commented on August 12, 2024

