This is a web-scraping project that uses Python and Scrapy to scrape book data from a website. Scrapy makes it easy to extract large amounts of data, and the ScrapeOps API supplies rotating browser headers.
- Python
- Scrapy
- ScrapeOps API (for rotating browser headers)
- SQLite Database
- Clone the repository
git clone https://github.com/AdityaPatadiya/1000books_webscraping_python_scrapy.git
cd bookscraper
- Create a virtual environment using the following command:
python -m venv venv
- Activate the virtual environment:
# for Windows
.\venv\Scripts\activate
# for macOS/Linux
source venv/bin/activate
- Install the required packages using the following command:
pip install -r requirements.txt
To run the project, use the following commands:
cd bookscraper
scrapy crawl bookspider
The spider scrapes 1,000 books from the website and stores the scraped data in an SQLite database; it also writes the same data to a JSON file.
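The storage step can be sketched roughly as below. This is a minimal illustration of the idea behind pipelines.py, not the project's actual code: the `books` table and the `title`/`price` fields are assumptions for the example.

```python
import sqlite3

# Minimal sketch of an SQLite item pipeline, in the spirit of pipelines.py.
# The table name and the title/price fields are illustrative assumptions.
class SQLitePipeline:
    def __init__(self, db_path="books.db"):
        self.db_path = db_path

    def open_spider(self, spider=None):
        # Called once when the spider starts: open the DB, create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)"
        )

    def process_item(self, item, spider=None):
        # Called for every scraped item: insert it, then pass it along
        self.conn.execute(
            "INSERT INTO books (title, price) VALUES (?, ?)",
            (item["title"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider=None):
        # Called once when the spider finishes
        self.conn.close()
```

In a real Scrapy project the pipeline is enabled via the ITEM_PIPELINES setting in settings.py, and the JSON file is typically produced separately with the FEEDS setting or the `-o` command-line flag.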
The ScrapeOps API provides a pool of realistic browser headers, which helps the spider avoid being blocked by the website.
Warning
Before scraping any website, check and obey its robots.txt file to avoid violating the site's terms of service and potentially engaging in illegal or unethical behavior.
Scrapy supports this out of the box: the ROBOTSTXT_OBEY option in settings.py tells Scrapy to respect each site's robots.txt rules.
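In Scrapy this is a one-line setting. The snippet below is just the relevant fragment as it typically appears in a project's settings.py, not this project's full settings file:

```python
# settings.py (fragment)
# When True, Scrapy downloads each site's robots.txt first and skips
# any request that the robots.txt rules disallow.
ROBOTSTXT_OBEY = True
```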
- bookspider.py: Main spider code for scraping data.
- items.py: Defines the fields to be scraped.
- pipelines.py: Operations on scraped data and storage in SQLite.
- middlewares.py: Applies ScrapeOps API Browser Headers.
- settings.py: Scrapy project settings including ScrapeOps API credentials.
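The header-rotation idea behind middlewares.py can be sketched as follows. The header sets below are made-up examples; the real middleware fetches a list like this from the ScrapeOps browser-headers API (using the API key configured in settings.py) and attaches one set to each outgoing request.

```python
import random

# Hypothetical header sets; in the real middleware this list is fetched
# once from the ScrapeOps Browser Headers API at spider startup.
BROWSER_HEADERS = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Accept-Language": "en-GB,en;q=0.8",
    },
]

def random_browser_headers(headers_list=BROWSER_HEADERS):
    """Pick one header set at random, as the middleware does per request."""
    return random.choice(headers_list)

headers = random_browser_headers()
# Every candidate set carries a User-Agent, so this always holds
assert "User-Agent" in headers
```

In the actual project this logic lives in a Scrapy downloader middleware whose `process_request` method updates `request.headers` with the chosen set.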
Tip
We can also use the Scrapy shell to test the code before running the spider. The Scrapy shell is a command-line tool that lets you try out XPath or CSS selectors and run Python code in the context of a Scrapy project.
First, install the ipython package using the following command:
pip install ipython
then add the following line under the [settings] section of the scrapy.cfg file:
shell = ipython
To run the scrapy shell, use the following command in terminal:
scrapy shell
There you can test your XPath or CSS selectors interactively against the fetched response.
We welcome contributions from the community! Here are some guidelines to help you get started:
- Fork the repository: Click the "Fork" button at the top of this repository to create a copy of the repository under your own GitHub account.
- Clone your fork: Clone your forked repository to your local machine using the following command:
git clone https://github.com/your-username/1000books_webscraping_python_scrapy.git
- Create a new branch: Create a new branch for your feature or bugfix with a descriptive name:
git checkout -b your-branch-name
- Make your changes: Make your changes to the codebase. Ensure that your code follows the project's coding standards and passes all tests.
- Commit your changes: Commit your changes with a clear and descriptive commit message:
git add .
git commit -m "Description of your changes"
- Push to your fork: Push your changes to your forked repository:
git push origin your-branch-name
- Open a Pull Request: Go to the original repository on GitHub and open a pull request. Provide a clear and descriptive title and description for your pull request.
Getting Help
If you need any help, feel free to ask questions in the Discussions section or contact the maintainers.
Thank you for your interest in contributing!