Looking up code syntax I found the following blog post and referenced github repo. I w

Adding in non Archiver scraping web links about research HOT 5 OPEN

ebenp commented on August 17, 2024

Adding in non Archiver scraping web links

from research.

Comments (5)

jeffreyliu commented on August 17, 2024

Hm, so I'm not clear on what the distinction between this and web_scraping would be? I'm also not entirely clear on what non-archiving scraping means in this context. Could you clarify those points? I think these resources should definitely be included, but perhaps just under the web_scraping section.

Perhaps for each category, there should be an issue for suggesting links to include, and those that we think are good resources should be added via PR?

ebenp commented on August 17, 2024

I don't think there should be a distinction between this and web_harvesting.
Sorry, I just looked; I meant to mention this as the location to save them:
https://github.com/datatogether/research/tree/master/web_harvesting

I was thinking these links would go in a README, or a Google Sheet linked from the web_harvesting folder, if that makes sense.

The only scraping distinction I was thinking of is between links such as these and the scraping we do for Data Together archiving, which uses archivertools and morph.io. I think those examples should be kept out of this research repo.

mhucka commented on August 17, 2024

You're right, this needs clarification. I hope to get back to this this week.

mhucka commented on August 17, 2024

@ebenp @jeffreyliu Finally looking at this, I now remember the original idea behind the two directories. One is for cataloging software systems that do web archiving/scraping/etc., and the other is meant to be research on approaches to doing that (i.e., overall approach, algorithms, examples of software that does it, etc.). I struggled with how to name the directories, and clearly failed badly.

What if web_harvesting were renamed to harvesting_approaches or something similar?

Regarding the distinction between scraping and archiving, I might be wrong, but I think there is a difference, because a system that scrapes web pages does not necessarily have to archive or store the results. For example, I've written a system that scrapes pages to get info and store specific bits of info in a custom database, but it doesn't archive the whole page or harvest the page/site in the way that we talk about those things in Archivers & Data Together.
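
To make the distinction concrete, here is a minimal Python sketch. It is not the system mentioned above: the table names, the `scrape` and `archive` helpers, the sample page, and the example.org URL are all hypothetical, chosen only for illustration. The scraping path keeps one extracted field and discards the rest of the page, while the archiving path stores the page exactly as received.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(url, html, db):
    """Scraping (hypothetical helper): store one extracted field only."""
    parser = TitleParser()
    parser.feed(html)
    db.execute(
        "INSERT INTO facts (url, title) VALUES (?, ?)",
        (url, parser.title.strip()),
    )


def archive(url, html, db):
    """Archiving (hypothetical helper): store the whole page unchanged."""
    db.execute("INSERT INTO snapshots (url, body) VALUES (?, ?)", (url, html))


# In-memory database with one table per approach (names are assumptions).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (url TEXT, title TEXT)")
db.execute("CREATE TABLE snapshots (url TEXT, body TEXT)")

# A stand-in page; a real system would fetch this over HTTP.
page = (
    "<html><head><title> Example Dataset </title></head>"
    "<body><p>Rows: 42</p></body></html>"
)
scrape("https://example.org/data", page, db)
archive("https://example.org/data", page, db)
```

After running this, the `facts` table holds only the page title, so the original page cannot be rebuilt from it, whereas the `snapshots` table holds the full HTML.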

IMHO, the term "harvesting" could mean either scraping or archiving, although looking around, I now see that Wikipedia basically makes "web scraping" synonymous with "web harvesting" and "web data extraction", so I guess it's closer to the meaning of scraping.

ebenp commented on August 17, 2024

Web harvesting makes sense to me, and I also really like the detail given above about what harvesting and archiving mean in terms of Data Together. Maybe those definitions could end up in the directory README.
