Comments (5)

jeffreyliu avatar jeffreyliu commented on August 17, 2024

Hm, so I'm not clear on what the distinction between this and web_scraping would be? I'm also not entirely clear on what non-archiving scraping means in this context. Could you clarify those points? I think these resources should definitely be included, but perhaps just under the web_scraping section.

Perhaps for each category, there should be an issue for suggesting links to include, and those that we think are good resources should be added via PR?

from research.

ebenp avatar ebenp commented on August 17, 2024

I don't think there should be a distinction between this and web_harvesting.
Sorry, I just looked and realized I meant to mention this location as the place to save these:
https://github.com/datatogether/research/tree/master/web_harvesting

I was thinking these links would be saved in a README, or as a Google Sheet link, inside the web_harvesting folder, if that makes sense.

The only scraping distinction I was thinking of is between links such as these and the scraping we do for Data Together archiving, which uses archivertools and morph.io. I think those examples should be kept out of this research repo.

mhucka avatar mhucka commented on August 17, 2024

You're right, this needs clarification. I hope to get back to this this week.

mhucka avatar mhucka commented on August 17, 2024

@ebenp @jeffreyliu Finally looking at this, I now remember the original idea behind the two directories. One is for cataloging software systems that do web archiving/scraping/etc., and the other is meant to be research on approaches to doing that (i.e., overall approach, algorithms, examples of software that does it, etc.). I struggled with how to name the directories, and clearly failed badly.

What if web_harvesting were renamed to harvesting_approaches or something similar?

Regarding the distinction between scraping and archiving, I might be wrong, but I think there is a difference, because a system to scrape web pages does not necessarily have to archive or store the results. For example, I've written a system that scrapes pages to get info and stores specific bits of info in a custom database, but it doesn't archive the whole page or harvest the page/site in the way that we talk about those things in Archivers & Data Together.
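
To make that distinction a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how archivertools, morph.io, or the system mentioned above actually work: the `scrape` and `archive` functions, the `titles` table, the in-memory `store` dict, and the example URL and page are all hypothetical. The scraping path extracts one field and throws the page away; the archiving path keeps the whole page verbatim.

```python
import hashlib
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(url, html, db):
    """Scraping (hypothetical): store one extracted field, discard the page."""
    parser = TitleParser()
    parser.feed(html)
    title = parser.title.strip()
    db.execute("INSERT INTO titles (url, title) VALUES (?, ?)", (url, title))
    return title


def archive(url, html, store):
    """Archiving (hypothetical): keep the full page, keyed by its SHA-256."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    store[digest] = {"url": url, "body": html}
    return digest


# Made-up page content; a real system would fetch this over HTTP.
PAGE = (
    "<html><head><title>Example Dataset</title></head>"
    "<body><p>rows...</p></body></html>"
)
URL = "https://example.org/data"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE titles (url TEXT, title TEXT)")
store = {}

title = scrape(URL, PAGE, db)       # only the title survives
digest = archive(URL, PAGE, store)  # the whole page survives
```

After running this, the database holds just the extracted title, while `store[digest]["body"]` is the page exactly as received, so it could be re-parsed later for fields nobody thought to extract at scrape time.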

IMHO, the term "harvesting" could mean either scraping or archiving, although looking around, I now see that Wikipedia basically makes "web scraping" synonymous with "web harvesting" and "web data extraction", so I guess it's closer to the meaning of scraping.

ebenp avatar ebenp commented on August 17, 2024

Web harvesting makes sense to me, and I also really like the detail given above about what harvesting and archiving mean in terms of Data Together. Maybe those definitions could end up in the directory README.
