This is a web-scraping project that uses Python and Scrapy to scrape book data from a website. Scrapy makes it easy to extract large amounts of data, and the ScrapeOps API supplies rotating browser headers.
- Python
- Scrapy
- ScrapeOps API (for rotating browser headers)
- SQLite Database
- Clone the repository
git clone https://github.com/AdityaPatadiya/1000books_webscraping_python_scrapy.git
cd bookscraper
- Create a virtual environment using the following command:
python -m venv venv
- Activate the virtual environment:
# for Windows
.\venv\Scripts\activate
# for macOS/Linux
source venv/bin/activate
- Install the required packages using the following command:
pip install -r requirements.txt
To run the project, use the following commands:
cd bookscraper
scrapy crawl bookspider
The spider scrapes 1,000 books from the website and stores the scraped data in an SQLite database; it also writes the same data to a JSON file.
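The storage step can be sketched roughly as below. This is a minimal illustration of the idea behind pipelines.py, not the project's actual code: the `books` table and the `title`/`price` fields are assumptions for the example.

```python
import sqlite3

# Minimal sketch of an SQLite item pipeline, in the spirit of pipelines.py.
# The table name and the title/price fields are illustrative assumptions.
class SQLitePipeline:
    def __init__(self, db_path="books.db"):
        self.db_path = db_path

    def open_spider(self, spider=None):
        # Called once when the spider starts: open the DB, create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)"
        )

    def process_item(self, item, spider=None):
        # Called for every scraped item: insert it, then pass it along
        self.conn.execute(
            "INSERT INTO books (title, price) VALUES (?, ?)",
            (item["title"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider=None):
        # Called once when the spider finishes
        self.conn.close()
```

In a real Scrapy project the pipeline is enabled via the ITEM_PIPELINES setting in settings.py, and the JSON file is typically produced separately with the FEEDS setting or the `-o` command-line flag.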
The ScrapeOps API provides a pool of realistic browser headers, which helps the spider avoid being blocked by the website.
Warning
Before scraping any website, check and obey its robots.txt file to avoid violating the site's terms of service and potentially engaging in illegal or unethical behavior.
Scrapy supports this out of the box: the ROBOTSTXT_OBEY option in settings.py tells Scrapy to respect each site's robots.txt rules.
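In Scrapy this is a one-line setting. The snippet below is just the relevant fragment as it typically appears in a project's settings.py, not this project's full settings file:

```python
# settings.py (fragment)
# When True, Scrapy downloads each site's robots.txt first and skips
# any request that the robots.txt rules disallow.
ROBOTSTXT_OBEY = True
```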
- bookspider.py: Main spider code for scraping data.
- items.py: Defines the fields to be scraped.
- pipelines.py: Operations on scraped data and storage in SQLite.
- middlewares.py: Applies ScrapeOps API Browser Headers.
- settings.py: Scrapy project settings including ScrapeOps API credentials.
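The header-rotation idea behind middlewares.py can be sketched as follows. The header sets below are made-up examples; the real middleware fetches a list like this from the ScrapeOps browser-headers API (using the API key configured in settings.py) and attaches one set to each outgoing request.

```python
import random

# Hypothetical header sets; in the real middleware this list is fetched
# once from the ScrapeOps Browser Headers API at spider startup.
BROWSER_HEADERS = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Accept-Language": "en-GB,en;q=0.8",
    },
]

def random_browser_headers(headers_list=BROWSER_HEADERS):
    """Pick one header set at random, as the middleware does per request."""
    return random.choice(headers_list)

headers = random_browser_headers()
# Every candidate set carries a User-Agent, so this always holds
assert "User-Agent" in headers
```

In the actual project this logic lives in a Scrapy downloader middleware whose `process_request` method updates `request.headers` with the chosen set.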
Tip
We can also use the Scrapy shell to test the code before running the spider. The Scrapy shell is a command-line tool that lets you try out XPath or CSS selectors and run Python code in the context of a Scrapy project.
First, install the ipython package using the following command:
pip install ipython
then add the following line under the [settings] section of the scrapy.cfg file:
shell = ipython
To run the scrapy shell, use the following command in terminal:
scrapy shell
There you can test your XPath or CSS selectors interactively against the fetched response.
We welcome contributions from the community! Here are some guidelines to help you get started:
- Fork the repository: Click the "Fork" button at the top of this repository to create a copy of the repository under your own GitHub account.
- Clone your fork: Clone your forked repository to your local machine using the following command:
git clone https://github.com/your-username/1000books_webscraping_python_scrapy.git
- Create a new branch: Create a new branch for your feature or bugfix with a descriptive name:
git checkout -b your-branch-name
- Make your changes: Make your changes to the codebase. Ensure that your code follows the project's coding standards and passes all tests.
- Commit your changes: Commit your changes with a clear and descriptive commit message:
git add .
git commit -m "Description of your changes"
- Push to your fork: Push your changes to your forked repository:
git push origin your-branch-name
- Open a Pull Request: Go to the original repository on GitHub and open a pull request. Provide a clear and descriptive title and description for your pull request.
Getting Help
If you need any help, feel free to ask questions in the Discussions section or contact the maintainers.
Thank you for your interest in contributing!