1000books_webscraping_python_scrapy

This is a web scraping project that uses Python and Scrapy to scrape a website that lists books. Scrapy makes it straightforward to collect large amounts of data, and the ScrapeOps API is used to rotate browser headers.

Table of Contents

  • Requirements
  • Installation
  • How to run the project
  • What I've done
  • Warning
  • Key Files
  • Contributing

Requirements: -

  1. Python
  2. Scrapy
  3. ScrapeOps API (for rotating browser headers)
  4. SQLite database

Installation: -

  1. Clone the repository:

    git clone https://github.com/AdityaPatadiya/1000books_webscraping_python_scrapy.git

    cd bookscraper

  2. Create a virtual environment using the following command:

    python -m venv venv

  3. Activate the virtual environment:

    # for Windows
    .\venv\Scripts\activate

    # for macOS / Linux
    source venv/bin/activate

  4. Install the required packages using the following command:

    pip install -r requirements.txt

How to run the project: -

To run the project, use the following commands:

cd bookscraper

scrapy crawl bookspider
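
If you only want a quick look at the scraped items, Scrapy's built-in feed exports can also write them straight to a file from the command line (the filename below is just an example):

    scrapy crawl bookspider -O books.json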

What I've done: -

I scraped roughly 1,000 books from the website and stored the scraped data in a SQLite database; the same run also writes the results to a JSON file.
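
The repository's own pipeline may differ, but a minimal sketch of a Scrapy item pipeline that writes books into SQLite could look like the one below. The database filename and the title/price fields are assumptions for illustration, not taken from this project.

    # pipelines.py -- minimal sketch, assuming items carry "title" and "price" fields
    import sqlite3

    from itemadapter import ItemAdapter


    class SQLitePipeline:
        def open_spider(self, spider):
            # One connection per crawl; "books.db" is an assumed filename.
            self.conn = sqlite3.connect("books.db")
            self.cur = self.conn.cursor()
            self.cur.execute(
                "CREATE TABLE IF NOT EXISTS books (title TEXT, price TEXT)"
            )

        def process_item(self, item, spider):
            # Insert one row per scraped item.
            adapter = ItemAdapter(item)
            self.cur.execute(
                "INSERT INTO books (title, price) VALUES (?, ?)",
                (adapter.get("title"), adapter.get("price")),
            )
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()

A pipeline like this is enabled through ITEM_PIPELINES in settings.py; the dotted path below assumes the project module is named bookscraper:

    ITEM_PIPELINES = {"bookscraper.pipelines.SQLitePipeline": 300}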

The ScrapeOps API supplies a pool of realistic browser headers, which helps the spider avoid being blocked by the website.
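
A rough sketch of how this header rotation can be wired up as a downloader middleware is shown below. The endpoint and response shape follow the ScrapeOps fake browser headers API; treat the details (parameter names, the "result" key) as assumptions and check the ScrapeOps docs before relying on them. The sketch also uses the requests library for the one-off header fetch.

    # middlewares.py -- sketch of a header-rotating downloader middleware
    import random

    import requests


    class ScrapeOpsFakeBrowserHeadersMiddleware:
        def __init__(self, api_key):
            # Fetch a batch of realistic browser header sets once, at start-up.
            response = requests.get(
                "https://headers.scrapeops.io/v1/browser-headers",
                params={"api_key": api_key, "num_results": 50},
            )
            # Assumed response shape: {"result": [{<header dict>}, ...]}
            self.headers_list = response.json().get("result", [])

        @classmethod
        def from_crawler(cls, crawler):
            # SCRAPEOPS_API_KEY is assumed to be defined in settings.py.
            return cls(api_key=crawler.settings.get("SCRAPEOPS_API_KEY"))

        def process_request(self, request, spider):
            # Attach a randomly chosen header set to every outgoing request.
            if self.headers_list:
                for key, value in random.choice(self.headers_list).items():
                    request.headers[key] = value

The middleware is then registered via DOWNLOADER_MIDDLEWARES in settings.py (again, the dotted path assumes a bookscraper module):

    DOWNLOADER_MIDDLEWARES = {
        "bookscraper.middlewares.ScrapeOpsFakeBrowserHeadersMiddleware": 400,
    }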

Warning

Before scraping any website, check and obey its robots.txt file so you do not violate the site's terms of service or engage in illegal or unethical behaviour.

Scrapy supports this directly: the ROBOTSTXT_OBEY option in settings.py tells the crawler to respect robots.txt.
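
In a freshly generated Scrapy project this behaviour is already switched on; the relevant line in settings.py is:

    ROBOTSTXT_OBEY = True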

Key Files

Tip

We can also use the Scrapy shell to test our code before running the spider. The Scrapy shell is a command-line tool that lets you try out XPath or CSS selectors and run Python code in the context of a Scrapy project.

First, install the ipython package using the following command:

pip install ipython

and add the following line under the [settings] section of the scrapy.cfg file:

shell = ipython
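
For reference, the resulting scrapy.cfg looks roughly like this; the default line is whatever your generated project already contains (bookscraper.settings is an assumption based on the project name):

    [settings]
    default = bookscraper.settings
    shell = ipython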

To start the Scrapy shell, run the following command in a terminal:

scrapy shell

Here you can test your code by trying out your XPath or CSS selectors interactively.
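
For example, assuming the target is the books.toscrape.com demo catalogue (both the URL and the CSS selectors below are assumptions, not taken from this repository), a quick selector check looks like this:

    scrapy shell "https://books.toscrape.com"
    >>> response.css("article.product_pod h3 a::attr(title)").getall()[:5]
    >>> response.css("p.price_color::text").get()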

Contributing: -

We welcome contributions from the community! Here are some guidelines to help you get started:

How to Contribute

  1. Fork the repository: Click the "Fork" button at the top of this repository to create a copy of the repository under your own GitHub account.

  2. Clone your fork: Clone your forked repository to your local machine using the following command (replace <your-username> with your GitHub username):

    git clone https://github.com/<your-username>/1000books_webscraping_python_scrapy.git
  3. Create a new branch: Create a new branch for your feature or bugfix with a descriptive name:

    git checkout -b your-branch-name
  4. Make your changes: Make your changes to the codebase. Ensure that your code follows the project's coding standards and passes all tests.

  5. Commit your changes: Commit your changes with a clear and descriptive commit message:

    git add .
    git commit -m "Description of your changes"
  6. Push to your fork: Push your changes to your forked repository:

    git push origin your-branch-name
  7. Open a Pull Request: Go to the original repository on GitHub and open a pull request. Provide a clear and descriptive title and description for your pull request.

Getting Help

If you need any help, feel free to ask questions in the Discussions section or contact the maintainers.

Thank you for your interest in contributing!
