address-scraping's Introduction

Address Data Scraper

Overview

This script automates scraping address data from a specified website. It uses Selenium WebDriver to navigate the pages and extract the data: district, neighborhood, street, building number, independent section, and geographic coordinates (longitude and latitude).

Features

  • Resumes scraping from the last point by reading the last row of the existing CSV file.
  • Handles dynamic dropdowns for city (il), district (ilce), neighborhood (mahalle), street (CSBM), and building number (BINA NO).
  • Handles reCAPTCHA if it appears during the scraping session.
  • Stores the scraped data into a CSV file, appending new data to it.
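
The resume-and-append behavior listed above can be sketched as follows. This is a minimal illustration, not the script's actual code: it uses the standard-library csv module rather than Pandas, the column names in FIELDNAMES are hypothetical, and the function names are borrowed from the Function Descriptions section.

```python
import csv
import os

# Hypothetical column names that mirror the fields listed in the Overview.
FIELDNAMES = [
    "district", "neighborhood", "street", "building_no",
    "independent_section", "longitude", "latitude",
]


def read_last_row_from_csv(file_path):
    """Return the last data row as a dict, or None if there is nothing to resume from."""
    if not os.path.exists(file_path):
        return None
    last_row = None
    with open(file_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            last_row = row
    return last_row


def append_to_csv(data, file_path):
    """Append a list of row dicts, writing the header only for a new or empty file."""
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    write_header = not os.path.exists(file_path) or os.path.getsize(file_path) == 0
    with open(file_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerows(data)
```

On startup, a scraper built this way would call read_last_row_from_csv("data/scraped_data.csv") and, if a row comes back, re-select that district, neighborhood, street, and building number before continuing.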

Requirements

  • Python 3.x
  • Selenium
  • Pandas
  • Chrome WebDriver (must be compatible with the installed Chrome version)

Installation

  1. Clone this repository or download the script.
  2. Install the required Python packages:
    pip install selenium pandas
    
  3. Download and set up the appropriate version of Chrome WebDriver.

Usage

  1. Ensure the Chrome WebDriver path is correctly set in the script or the system PATH.
  2. Run the script:
    python main.py
    
  3. The script will start scraping data and will automatically handle pagination and dropdowns.
  4. The scraped data will be stored in data/scraped_data.csv.
  5. To stop the script, press 'q' when prompted.

Function Descriptions

  • read_last_row_from_csv(file_path): Reads the last row from the CSV file to determine where to resume scraping.
  • get_current_map_center(): Retrieves the current center coordinates of the map view from the webpage.
  • append_to_csv(data, file_path): Appends the scraped data to the specified CSV file.
  • init_driver(): Initializes the Selenium WebDriver with necessary options.
  • has_options(driver, select_id): Checks if a select element has options available.
  • main(): Main function to control the flow of the script, including handling of reCAPTCHA, navigating dropdowns, and storing data.

Note

  • The script tolerates slow or dynamically loaded page content, specifically in dropdown handling, by waiting for elements to be clickable or present before interacting with them.
  • The script may still require modifications if the website's structure or its CAPTCHA handling changes significantly.

Contribution

Feel free to fork the project, make improvements, or adapt the script for other websites or purposes. Any suggestions are welcome!

address-scraping's People

Contributors

cappittall
