This repository contains a Python script that uses the Scrapy framework to scrape news headlines, timestamps, and related images from the CNN website. The repository also includes a demonstration video that shows how the scraper works.
- Clone this repository to your local machine using the following command:
git clone https://github.com/prathamgarg911/CNN-scraper.git
- Install the required dependencies using pip:
pip install -r requirements.txt
- Navigate to the project directory:
cd web_cnn
- Run the Scrapy spider:
scrapy crawl web_scrape_cnn -O scraped_data.json
This starts the scraping process, saves the data to a scraped_data.json file, and saves the images to a folder named scraped-images.
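After a run, a quick check that both outputs exist can be handy. The following is a minimal sketch, not part of the repository: the helper name `summarize_outputs` is hypothetical, and it assumes that scraped_data.json and the scraped-images folder sit directly under the directory you pass in, which may not match where your run actually wrote them.

```python
from pathlib import Path


def summarize_outputs(base_dir="."):
    """Report whether the expected JSON file and image folder exist under base_dir.

    Assumption: both outputs live directly under base_dir.
    """
    base = Path(base_dir)
    data_file = base / "scraped_data.json"
    image_dir = base / "scraped-images"

    # Count only regular files in the image folder, if the folder exists.
    image_count = 0
    if image_dir.is_dir():
        image_count = sum(1 for p in image_dir.iterdir() if p.is_file())

    return {
        "data_file_exists": data_file.is_file(),
        "image_dir_exists": image_dir.is_dir(),
        "image_count": image_count,
    }


if __name__ == "__main__":
    print(summarize_outputs("."))
```

Run it from the directory where you ran the spider, or pass a different path if your output ended up elsewhere.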
The scraper will generate a JSON file (scraped_data.json) with the following structure:
[
  {
    "headline": "Sample Headline",
    "timestamp": "Day Month Date, Year",
    "image_link": "https://example.com/image.jpg",
    "local_image": "web_cnn/scraped-images/image1.jpg"
  }
]
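To work with this output in your own code, you can load it with Python's standard json module. The sketch below is illustrative only: `parse_records` and `EXPECTED_KEYS` are hypothetical names, and the sample string simply mirrors the structure shown above rather than real scraped data.

```python
import json

# Keys taken from the structure documented in this README.
EXPECTED_KEYS = {"headline", "timestamp", "image_link", "local_image"}


def parse_records(json_text):
    """Parse the scraper's JSON output and keep only records that have every expected key."""
    records = json.loads(json_text)
    return [
        record
        for record in records
        if isinstance(record, dict) and EXPECTED_KEYS <= record.keys()
    ]


# Placeholder data that mirrors the documented structure.
sample = """
[
  {
    "headline": "Sample Headline",
    "timestamp": "Day Month Date, Year",
    "image_link": "https://example.com/image.jpg",
    "local_image": "web_cnn/scraped-images/image1.jpg"
  }
]
"""

for record in parse_records(sample):
    print(record["headline"], "|", record["timestamp"])
```

To read a real run instead of the sample, pass the contents of scraped_data.json, for example `parse_records(open("scraped_data.json", encoding="utf-8").read())`.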
You can also generate a CSV file (scraped_data.csv) with a similar structure, for example by passing a .csv file name to the -O option instead of a .json one.
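If you already have the JSON output and want a CSV without re-running the spider, the records can be converted with the standard csv module. This is a rough sketch under assumptions: `records_to_csv` and `FIELDS` are hypothetical names, and the column order is chosen here for illustration, so it may differ from what Scrapy itself writes.

```python
import csv
import io

# Column order is an assumption; Scrapy's own CSV export may order columns differently.
FIELDS = ["headline", "timestamp", "image_link", "local_image"]


def records_to_csv(records):
    """Convert a list of record dicts to CSV text with one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        extrasaction="ignore",  # silently skip keys that are not in FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # missing keys are written as empty strings
    return buffer.getvalue()


if __name__ == "__main__":
    example = [
        {
            "headline": "Sample Headline",
            "timestamp": "Day Month Date, Year",
            "image_link": "https://example.com/image.jpg",
            "local_image": "web_cnn/scraped-images/image1.jpg",
        }
    ]
    print(records_to_csv(example))
```

Note that the timestamp contains a comma, so the csv module wraps that field in double quotes in the output.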
If you'd like to contribute to this project, feel free to fork the repository and submit a pull request. Make sure to follow the Contributing Guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.