Comments (9)
Rick - I listened to your recent talk at the Performance Meet Up in NYC about categorizing URLs. I think this would be a powerful feature! Any thoughts on how to move it forward? I would volunteer to help with the effort. Hopefully more folks feel the same way and we can get started before too long.
from httparchive.org.
Hey Greg, thanks for volunteering! Assigning this to you :)
The next steps for this issue are:
- survey the landscape of options: Are there any other services similar to DMOZ that are regularly updated? What is their URL coverage, and how does it overlap with the Alexa 500K that we're using? Is there room for growth as we expand URL coverage? How do they compare on category correctness, granularity, etc.?
- plan and integrate the category info with HTTP Archive's data: what changes need to be made to the Dataflow pipeline and BigQuery schema?
- analyze the new data and surface interesting reports on the beta site
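For the coverage question in the first step, a minimal sketch of how the overlap between the crawl list and a category source could be measured. The function names and the hostname normalization rule (lowercase, strip a leading `www.`) are assumptions for illustration, not part of the HTTP Archive pipeline:

```python
from urllib.parse import urlsplit


def normalize_host(url: str) -> str:
    """Reduce a URL to a lowercase hostname without a leading 'www.'."""
    # urlsplit only populates the host when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def category_coverage(crawl_urls, categorized_urls):
    """Return (covered, total, fraction) of crawl hosts that have a category.

    Both arguments are iterables of URL strings; matching is done on the
    normalized hostname only (an assumption, paths are ignored).
    """
    crawl_hosts = {normalize_host(u) for u in crawl_urls}
    crawl_hosts.discard("")  # drop entries that could not be parsed
    known_hosts = {normalize_host(u) for u in categorized_urls}
    covered = len(crawl_hosts & known_hosts)
    total = len(crawl_hosts)
    return covered, total, (covered / total if total else 0.0)
```

Running this against each candidate source with the same crawl list would give comparable coverage figures; category correctness and granularity would still need a manual look.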
I was thinking about this the other day and didn't realize there was an issue open. During my searches the only thing I was able to find was archived DMOZ data. Here's the dump I found in case it's useful - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OMV93V
Sadly, DMOZ is deprecated. I don't think we should hitch our wagon to this particular dataset.
Yeah, the DMOZ dump could be used as a last resort, but it'd be preferable to find a service that's actively maintained.
Ilya also did some work on joining DMOZ data with Alexa URLs here: https://bigquery.cloud.google.com/table/httparchive:urls.20170315?tab=preview. Of the 1M URLs, only ~170K (17%) have topics/categories.
Ah, cool. I'll stop uploading that dataset to BigQuery then. Was about to do the same analysis :)
Rick - I'll start poking around and see what I can find. Stay tuned.
Hey @gregorywolf, have you made any progress on this?