pa-0 / marketplace-scraper

This project forked from chenasraf64/marketplace-scraper

This is a web scraper developed to fetch product listings from different marketplaces given a specific search word. When operating on a product, it extracts its title, description, price, and image path. The scraper has been designed to be easily extensible to different marketplaces, starting with eBay.

Python 100.00%

marketplace-scraper

Dependencies

scrapy, requests (third-party packages); os, json (Python standard library)

Configuration and Extensibility

Marketplace Configuration

The scraper is configuration-driven. The configuration file, marketplace_configurations.py, defines a dictionary (search_url_dictionary) with one entry per marketplace:

  • Key: Name of the marketplace.
  • Value: A list of eight elements, in this order:
    • [0]: URL template for search results, e.g., 'https://www.ebay.com/sch/i.html?...'
    • [1]: CSS selector for extracting individual item URLs.
    • [2]: CSS selector for extracting the next page URL.
    • [3]: URL template to fetch a specific item's description.
    • [4]: CSS selector for the item title.
    • [5]: CSS selector for the item price.
    • [6]: CSS selector for the primary image of the item.
    • [7]: CSS selector for the specific item number.

To illustrate, here's a sample configuration for eBay:

search_url_dictionary = {
    'ebay': [
        'https://www.ebay.com/sch/i.html?...',  # Search result URL
        '.s-item__link::attr(href)',  # Item URL
        '.pagination__next::attr(href)',  # Next page URL
        'https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?item={item_id}',  # Item description
        'h1 span::text',  # Item title
        '.x-price-primary span::text',  # Item price
        '.ux-image-carousel-item img::attr(src)',  # Primary image
        '.ux-layout-section__textual-display--itemId .ux-textspans--BOLD::text'  # Item number
    ]
}
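
Because the fields are addressed by position, it can help to give them names when reading the configuration. The following is a minimal sketch, not the project's own code: MarketplaceConfig and its field names are hypothetical, and it assumes the {item_id} placeholder in the description URL is filled with str.format.

```python
from typing import NamedTuple


class MarketplaceConfig(NamedTuple):
    """Hypothetical named view over one eight-element configuration list."""
    search_url: str            # [0]
    item_url_selector: str     # [1]
    next_page_selector: str    # [2]
    description_url: str       # [3]
    title_selector: str        # [4]
    price_selector: str        # [5]
    image_selector: str        # [6]
    item_number_selector: str  # [7]


# Same values as the eBay sample; the search URL is left elided as in the original.
search_url_dictionary = {
    'ebay': [
        'https://www.ebay.com/sch/i.html?...',
        '.s-item__link::attr(href)',
        '.pagination__next::attr(href)',
        'https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?item={item_id}',
        'h1 span::text',
        '.x-price-primary span::text',
        '.ux-image-carousel-item img::attr(src)',
        '.ux-layout-section__textual-display--itemId .ux-textspans--BOLD::text',
    ]
}

# Unpack the positional list into named fields.
config = MarketplaceConfig(*search_url_dictionary['ebay'])

# Assumption: the item number is substituted into the template with str.format.
description_url = config.description_url.format(item_id='123456')

print(config.title_selector)
print(description_url)
```

Unpacking with * raises a TypeError if an entry does not have exactly eight elements, which surfaces a malformed configuration early.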

To support another marketplace, add a new key to search_url_dictionary whose value is an eight-element list in the same order. No other change to the configuration structure is required, which keeps the scraper extensible and maintainable.
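
Since a new entry must follow the eight-position layout exactly, a small check before running the spider can catch mistakes. This is a sketch under stated assumptions: validate_marketplace_entry is a hypothetical helper that is not part of the project, and the 'examplemarket' entry uses placeholder URLs and selectors rather than a real site.

```python
EXPECTED_FIELDS = 8  # indices [0] through [7] of a configuration list


def validate_marketplace_entry(name, entry):
    """Return a list of problems found in one entry (empty if none)."""
    if not isinstance(entry, (list, tuple)):
        return [f"{name}: value must be a list"]
    problems = []
    if len(entry) != EXPECTED_FIELDS:
        problems.append(
            f"{name}: expected {EXPECTED_FIELDS} elements, got {len(entry)}"
        )
    for index, value in enumerate(entry):
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name}: element [{index}] must be a non-empty string")
    return problems


# Hypothetical entry: every URL and selector below is a placeholder.
new_entry = [
    'https://www.example.com/search?...',            # [0] search result URL template
    '.result a::attr(href)',                         # [1] item URL
    '.next-page::attr(href)',                        # [2] next page URL
    'https://www.example.com/item/{item_id}/desc',   # [3] item description URL template
    'h1::text',                                      # [4] item title
    '.price::text',                                  # [5] item price
    '.gallery img::attr(src)',                       # [6] primary image
    '.item-number::text',                            # [7] item number
]

print(validate_marketplace_entry('examplemarket', new_entry))
print(validate_marketplace_entry('broken', ['only-one-element']))
```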

Storage

All scraped data is saved as JSON files in a directory that is created at run time. Each product is written to its own file named [MARKETPLACE_NAME]_[PRODUCT_ID].json (for example, Ebay_123456.json). The directory is created in the working directory from which the scraper is run and is named after the search word and the marketplace name.
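
The storage scheme can be sketched as follows. This is an illustration rather than the project's implementation: save_product and the 'item_number' key are hypothetical, and the directory name format (search word, underscore, marketplace name) is an assumption, since the text above only says the name combines the two. The demo writes into a temporary directory instead of the current working directory.

```python
import json
import os
import tempfile


def save_product(product, marketplace_name, search_word, base_dir):
    """Write one product to base_dir/<dir>/<MARKETPLACE>_<ID>.json and return the path."""
    # Assumed directory name format; the exact format is not documented.
    target_dir = os.path.join(base_dir, f"{search_word}_{marketplace_name}")
    os.makedirs(target_dir, exist_ok=True)
    file_name = f"{marketplace_name}_{product['item_number']}.json"
    file_path = os.path.join(target_dir, file_name)
    with open(file_path, "w", encoding="utf-8") as handle:
        json.dump(product, handle, ensure_ascii=False, indent=2)
    return file_path


with tempfile.TemporaryDirectory() as scratch_dir:
    # Placeholder product data for illustration only.
    product = {
        "item_number": "123456",
        "title": "Example title",
        "description": "Example description",
        "price": "US $10.00",
        "image": "https://www.example.com/image.jpg",
    }
    saved_path = save_product(product, "Ebay", "apple+watch", scratch_dir)
    saved_name = os.path.basename(saved_path)
    with open(saved_path, encoding="utf-8") as handle:
        reloaded = json.load(handle)

print(saved_name)
```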

Note on User Agent

The project's settings.py sets a specific USER_AGENT string tailored for eBay. When targeting other websites, or if requests are being blocked, adjust this User-Agent or add appropriate middlewares.
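
For reference, Scrapy reads the User-Agent from the USER_AGENT setting, so the change amounts to one line in settings.py. The fragment below is only a sketch: the string is a generic browser-like placeholder, not the value the project uses.

```python
# settings.py (fragment)
# Placeholder value for illustration; replace it with a string suited to the target site.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
```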

Running Scraper

  1. In a shell, cd into the marketplace_scraper folder.
  2. Run the following command: scrapy crawl marketplacespider -a marketplace_name={marketplace_name} -a search_word={search_word}

Replace {marketplace_name} and {search_word} with your own values when executing the command. If the search term contains more than one word, replace each space with +. For example, "apple watch" becomes "apple+watch".

Alternatively, execute the main() function and provide the necessary inputs when prompted.
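
The scrapy crawl command can also be assembled programmatically, including the replacement of spaces with +. The sketch below is illustrative only: build_crawl_command is a hypothetical helper, and it does not reproduce what the project's main() function does.

```python
import shlex


def build_crawl_command(marketplace_name, search_word):
    """Return the scrapy command as an argument list, joining search words with '+'."""
    joined_search_word = "+".join(search_word.split())
    return [
        "scrapy", "crawl", "marketplacespider",
        "-a", f"marketplace_name={marketplace_name}",
        "-a", f"search_word={joined_search_word}",
    ]


command = build_crawl_command("ebay", "apple watch")

# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run with cwd set to the marketplace_scraper folder, which avoids shell quoting issues.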
