EconLog Article Scraper

This project provides a toolset for scraping articles by author from the EconLog website. The scraper extracts detailed information for each article, including its HTML content, plain text, word counts, and embedded links. The primary focus of this repository is collecting and saving data for further analysis; some example analyses are in the notebooks folder.

Core Objectives

  1. Efficient Data Collection: Facilitate the automated collection of articles from EconLog, categorized by author, to streamline text analysis.
  2. Rich Data Extraction: Retrieve a set of data points for each article, encompassing the full HTML content, word counts, plain text, and embedded links.
  3. Versatility: While the primary aim is data acquisition, the toolset supports a wide range of secondary analyses, including topic analysis, text mining, and trend identification.

Ideal for:

  1. Sentiment analysis or topic modeling.
  2. Exploring trends, themes, and evolution of discourse within articles.

Disclaimer: No permissions were granted by the organization to scrape or otherwise use their data. Good luck.

Installation

Clone repository: git clone https://github.com/gfbarbieri/econlog_article_scraper.git

Check requirements: put requirements.txt here

How to Use

Example: Obtain a list of published authors.

from scraper import EconlogScraper

# Instantiate EconlogScraper with the default author.
els = EconlogScraper()

# Request all authors published on EconLog.
print(els.request_authors())

Example: Extract an article's text.

from scraper import EconlogScraper

# Define EconLog article.
article_url = 'https://econlib.org/econlog/article-name-here'

# Instantiate EconlogScraper with defaults.
els = EconlogScraper()

# Request article's contents and extract text.
p_tags, _ = els.request_article_content(url=article_url)
full_text = els.extract_article_text(article_content=p_tags)

# Print first 100 characters in the article.
print(full_text[:100])
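
Once an article's text is extracted, word counts are a natural next step. The block below is a minimal sketch of one way to count words in a plain string using only the standard library. The count_words helper and the sample text are hypothetical and are not part of this repository; the project's own word counter in utils may tokenize differently.

```python
import re
from collections import Counter
from typing import Tuple


def count_words(document: str) -> Tuple[int, Counter]:
    """Return the total word count and per-word frequencies.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    # Lowercase the text and treat runs of letters, digits, and apostrophes
    # as words. This is a simplifying assumption about tokenization.
    words = re.findall(r"[a-z0-9']+", document.lower())
    return len(words), Counter(words)


# Stand-in for the full_text string extracted from an article.
sample_text = "Markets clear. Markets adjust, and prices adjust too."

total, freq = count_words(sample_text)
print(total)
print(freq.most_common(2))
```

A Counter is used here because it behaves like a dictionary and can be sorted by frequency with most_common.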

Example: Extract text from all articles.

from scraper import EconlogScraper

# Instantiate EconlogScraper with an author.
els = EconlogScraper(author='author')

# Obtain the HTML container for every article published by the author.
containers = els.request_article_containers()

# Extract the URL from each container, request the content at the article's
# URL, extract text.
article_text = []

for container in containers:
    metadata = els.extract_article_metadata(article_container=container)
    p_tags, _ = els.request_article_content(url=metadata['url'])
    article_text.append(els.extract_article_text(article_content=p_tags))

# Print first article.
print(article_text[0])

Example: Extract all features from all articles and add to metadata.

from scraper import EconlogScraper
from utils import text_utils

# Instantiate EconlogScraper with the default author.
els = EconlogScraper()

# Obtain the HTML container for every article published by the author.
containers = els.request_article_containers()

# Extract article metadata from each container.
metadata = [els.extract_article_metadata(article_container=container) for container in containers]

# Extract all features from each article.
for indx, article in enumerate(metadata):
    p_tags, topics = els.request_article_content(url=article['url'])
    full_text = els.extract_article_text(article_content=p_tags)
    embedded_urls = els.extract_embedded_urls(article_content=p_tags)
    word_count, word_freq = text_utils.word_counter(document=full_text)

    article['text'] = full_text
    article['topics'] = topics
    article['embedded_urls'] = embedded_urls
    article['word_count'] = word_count
    article['word_freq'] = word_freq

# Show example.
print(metadata[0])
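
Since the goal is to save data for further analysis, the enriched metadata can be written to disk. The block below is a minimal sketch that stores one JSON object per line (JSON Lines) using the standard library. The save_articles and load_articles helpers, the output file name, and the placeholder record are all hypothetical; the record only assumes the keys added in the example above, and the real metadata dictionaries may contain other fields.

```python
import json
import tempfile
from pathlib import Path


def save_articles(records, path):
    """Write each article record as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            # default=str converts values that json cannot encode
            # (for example dates) into strings.
            fh.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")


def load_articles(path):
    """Read the records back into a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Placeholder record shaped like the metadata built above (assumed keys).
example_records = [
    {
        "url": "https://econlib.org/econlog/article-name-here",
        "text": "Sample text.",
        "topics": ["sample"],
        "embedded_urls": [],
        "word_count": 2,
        "word_freq": {"sample": 1, "text": 1},
    }
]

# Hypothetical output location in the system's temporary directory.
out_path = Path(tempfile.gettempdir()) / "econlog_articles.jsonl"
save_articles(example_records, out_path)

restored = load_articles(out_path)
print(restored[0]["word_count"])
```

JSON Lines keeps each article on its own line, so a large collection can be appended to or read back one record at a time.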
