cnn-scraper's Introduction

CNN Website Scraper

This repository contains a Python script that uses the Scrapy framework to scrape news headlines, timestamps, and related images from the CNN website. The repository also includes a demonstration video that shows how the scraper works.

Table of Contents

Installation
Usage
Output
Contributing
License

Installation

Clone this repository to your local machine using the following command:

git clone https://github.com/prathamgarg911/CNN-scraper.git

Install the required dependencies using pip:

pip install -r requirements.txt

Usage

Navigate to the project directory:

cd web_cnn

Run the Scrapy spider:

scrapy crawl web_scrape_cnn -O scraped_data.json

This starts the scraping process, saves the data to a scraped_data.json file, and saves the images to a folder named scraped-images. The -O option overwrites scraped_data.json if it already exists.

Output

The scraper will generate a JSON file (scraped_data.json) with the following structure:

[
  {
    "headline": "Sample Headline",
    "timestamp": "Day Month Date, Year",
    "image_link": "https://example.com/image.jpg",
    "local_image": "web_cnn/scraped-images/image1.jpg"
  }
]
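As a quick sanity check on a finished run, the output file can be loaded and inspected with the standard library alone. The sketch below assumes the four field names shown in the sample above; the `load_records` helper and the `EXPECTED_FIELDS` constant are hypothetical names used for illustration and are not part of this repository.

```python
import json

# Field names taken from the sample output above (an assumption about real runs).
EXPECTED_FIELDS = {"headline", "timestamp", "image_link", "local_image"}


def load_records(path="scraped_data.json"):
    """Load scraped records and keep only those that have every expected field."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)

    complete = []
    for index, record in enumerate(records):
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            # Report incomplete records instead of failing outright.
            print(f"record {index} is missing: {sorted(missing)}")
        else:
            complete.append(record)
    return complete
```

Calling `load_records()` from the directory that holds scraped_data.json returns the complete records as a list of dictionaries, so a headline is available as `record["headline"]`.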

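You can also generate a CSV file (scraped_data.csv) with the same fields by passing a .csv filename to -O, for example: scrapy crawl web_scrape_cnn -O scraped_data.csv. Scrapy picks the export format from the file extension.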
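If a JSON export already exists, it can be converted to CSV without running the spider again. This is a minimal sketch using only the standard library; the `json_to_csv` helper and the `FIELDS` list are hypothetical, and the column names are assumed to match the sample JSON structure shown earlier.

```python
import csv
import json

# Column order assumed from the sample JSON structure.
FIELDS = ["headline", "timestamp", "image_link", "local_image"]


def json_to_csv(json_path="scraped_data.json", csv_path="scraped_data.csv"):
    """Convert the scraper's JSON output to CSV and return the number of rows written."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)

    # newline="" lets the csv module control line endings on every platform.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # Missing fields become empty cells; unexpected extra fields are ignored.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Running `json_to_csv()` in the project directory writes scraped_data.csv next to the JSON file.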

Contributing

If you'd like to contribute to this project, feel free to fork the repository and submit a pull request. Make sure to follow the Contributing Guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.
