
0999ad / darkwebscraper



License: GNU General Public License v3.0

Python 68.93% HTML 31.07%

darkwebscraper's Introduction

TorScraperPro: Comprehensive Technical Overview

Legal Disclaimer

Educational Use and Compliance with Local Laws

This script is provided for educational purposes only. Users are responsible for ensuring that their use of it complies with applicable local laws and regulations. The author disclaims any responsibility for unethical or illegal use. Users should exercise due diligence and respect the terms of service and data usage policies of the websites they access with this script.

Introduction

TorScraperPro is a Python-based web scraping and content extraction tool. Designed for both privacy and efficiency, it navigates the web, including the Tor network, to extract content from specified websites based on user-defined keywords. It drives headless Chrome through Selenium WebDriver so that dynamically rendered content is captured. A Flask application serves the scraped HTML files, providing live updates and easy access to the collected data.

Features

  • Tor Network Support: Routes requests through the Tor network to help preserve anonymity, enabling the scraping of .onion websites.
  • Keyword-Based Scraping: Focuses on extracting content that contains specified keywords, facilitating targeted data collection.
  • Dynamic Content Rendering: With Selenium WebDriver, the tool can interact with JavaScript-heavy pages, ensuring comprehensive content capture.
  • Live Data Presentation: A local Flask web server serves the scraped HTML files, offering a dashboard for live monitoring and access to scraped data.
  • Automatic ChromeDriver Management: The script employs webdriver_manager to automatically select and use the correct ChromeDriver version, ensuring compatibility with the installed Google Chrome browser.
  • Flask Dashboard: Provides a user-friendly interface for reviewing scraping outcomes, enhancing usability and accessibility.
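
A minimal sketch of how the headless Chrome, automatic ChromeDriver management, and Tor routing features above might fit together. It assumes Selenium 4, the webdriver_manager package, and a local Tor SOCKS proxy on 127.0.0.1:9050. The names chrome_arguments, build_driver, and use_tor are illustrative and are not taken from the script.

```python
def chrome_arguments(use_tor=False, tor_proxy="socks5://127.0.0.1:9050"):
    """Return Chrome command-line switches for a headless session.

    tor_proxy is an assumed default; a system Tor daemon usually listens
    on port 9050, while Tor Browser uses 9150.
    """
    # "--headless=new" needs Chrome 109 or later; older builds use "--headless".
    args = ["--headless=new", "--disable-gpu"]
    if use_tor:
        args.append(f"--proxy-server={tor_proxy}")
    return args


def build_driver(use_tor=False):
    """Start headless Chrome with a ChromeDriver fetched by webdriver_manager."""
    # Imported lazily so chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for arg in chrome_arguments(use_tor):
        options.add_argument(arg)
    # ChromeDriverManager().install() downloads a matching driver and returns its path.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

Calling build_driver(use_tor=True) would return a driver whose traffic goes through the proxy, provided Tor is actually running. Call driver.quit() when finished to release the browser process.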

Dependencies

  • Python 3.6 or later
  • Flask
  • Requests
  • BeautifulSoup4
  • Selenium WebDriver
  • webdriver_manager
  • Tor (optional; needed only for .onion site scraping)

Installation and Setup

  1. Install Python and Dependencies: Ensure Python 3.6+ is installed on your system. Install the required Python packages using pip:

    pip install flask requests beautifulsoup4 selenium webdriver_manager

  2. Google Chrome: Install the latest version of Google Chrome to ensure compatibility with ChromeDriver.

  3. Tor (Optional): For scraping .onion sites, install and configure Tor on your system and make sure it is running.

  4. Clone the Repository: Obtain the TorScraperPro script by cloning its repository to your local machine.
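
Once Tor (step 3) is running, connectivity can be checked before scraping. The sketch below is one way to do it with Requests, and is not necessarily how the script does it. It assumes Tor's SOCKS proxy listens on 127.0.0.1:9050, that SOCKS support is installed (pip install "requests[socks]"), and that the Tor Project's check endpoint returns JSON with IsTor and IP fields. The names tor_proxies and check_tor are hypothetical.

```python
TOR_SOCKS = "socks5h://127.0.0.1:9050"  # assumption: default port of a system Tor daemon


def tor_proxies(socks_url=TOR_SOCKS):
    """Build a Requests-style proxies mapping that sends HTTP and HTTPS through Tor.

    The socks5h scheme asks the proxy to resolve hostnames, which .onion
    addresses require.
    """
    return {"http": socks_url, "https": socks_url}


def check_tor(timeout=30):
    """Return (is_tor, ip) as reported by the Tor Project's check service."""
    import requests  # lazy import; SOCKS support comes from the PySocks extra

    resp = requests.get(
        "https://check.torproject.org/api/ip",  # assumed response: {"IsTor": ..., "IP": ...}
        proxies=tor_proxies(),
        timeout=timeout,
    )
    resp.raise_for_status()
    data = resp.json()
    return bool(data.get("IsTor")), data.get("IP")
```

If check_tor() raises a connection error, Tor is most likely not running or is listening on a different port.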

Running Instructions

Execute TorScraperPro from the command line with the desired parameters:

python TorScraperPro.py -v -d <Depth> -p <Pause>

Options include:

  • -v: Enables verbose logging for detailed operational insights.
  • -d <Depth>: Defines the scraping depth for website traversal.
  • -p <Pause>: Sets a pause duration between requests to mitigate server load and mimic human interaction.
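
For illustration, the three options above could be parsed with argparse roughly as follows. The value types, defaults, and destination names (verbose, depth, pause) are assumptions, and the script itself may define them differently.

```python
import argparse


def build_parser():
    """Build a parser for the -v, -d, and -p options described above."""
    parser = argparse.ArgumentParser(prog="TorScraperPro.py")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="enable verbose logging")
    parser.add_argument("-d", dest="depth", type=int, default=1, metavar="Depth",
                        help="scraping depth for website traversal (assumed default: 1)")
    parser.add_argument("-p", dest="pause", type=float, default=1.0, metavar="Pause",
                        help="seconds to pause between requests (assumed default: 1.0)")
    return parser


# Equivalent to: python TorScraperPro.py -v -d 2 -p 1.5
args = build_parser().parse_args(["-v", "-d", "2", "-p", "1.5"])
```

Here args.verbose is True, args.depth is 2, and args.pause is 1.5.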

Key Functions

  • setup_chrome(): Initializes a headless Chrome browser session for web interaction.
  • check_tor_connection(): Verifies connectivity to the Tor network and logs the Tor-assigned IP address.
  • get_keywords(): Loads keywords from a specified file, guiding the scraping focus.
  • scrape_and_extract(): Core function that navigates to URLs, renders JavaScript, and extracts content based on keyword matches.
  • Flask Web Server: Runs concurrently in a separate thread, serving scraped content through a user-friendly dashboard.
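
The keyword handling behind get_keywords() and scrape_and_extract() is not shown in this overview. The sketch below is one plausible shape for it, assuming a keyword file with one keyword per line and case-insensitive substring matching. The names load_keywords and matching_keywords are hypothetical, chosen so they are not mistaken for the script's own functions.

```python
from pathlib import Path


def load_keywords(path):
    """Read one keyword per line, skipping blank lines and lowercasing each entry."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip().lower() for line in lines if line.strip()]


def matching_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case.

    text would typically be the visible text of a rendered page, for
    example the result of BeautifulSoup(html, "html.parser").get_text().
    """
    haystack = text.lower()
    return [kw for kw in keywords if kw in haystack]
```

A page would then be kept only when matching_keywords() returns a non-empty list. Substring matching is the simplest choice, though it also matches inside longer words; a word-boundary regular expression is stricter.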

Flask Web Server

Initiated at script startup, the Flask app presents a simple yet effective interface for real-time access to the scraped content. It facilitates navigation between current and archived runs, enhancing the review process of collected data.

Current Run page

Screenshot 2024-03-21 at 16 40 27

Previous runs page

Screenshot 2024-03-19 at 16 45 23
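
A rough sketch of a Flask app that serves a directory of scraped HTML files from a background thread, as this section describes. The route layout, the port, and the names list_html_files, create_app, and start_dashboard are assumptions. The real dashboard also separates current and archived runs, which this sketch does not attempt.

```python
import threading
from html import escape
from pathlib import Path
from urllib.parse import quote


def list_html_files(directory):
    """Return the sorted names of the .html files directly inside directory."""
    return sorted(p.name for p in Path(directory).glob("*.html"))


def create_app(directory):
    """Create a Flask app with an index of scraped pages and a route to view each."""
    from flask import Flask, send_from_directory  # lazy import

    root = str(Path(directory).resolve())
    app = Flask(__name__)

    @app.route("/")
    def index():
        # The listing is rebuilt on every request, so new files show up on refresh.
        items = "".join(
            f'<li><a href="/pages/{quote(name)}">{escape(name)}</a></li>'
            for name in list_html_files(root)
        )
        return f"<ul>{items}</ul>"

    @app.route("/pages/<path:name>")
    def page(name):
        # send_from_directory refuses paths that escape the given directory.
        return send_from_directory(root, name)

    return app


def start_dashboard(directory, port=5000):
    """Run the app in a daemon thread so scraping can continue in the main thread."""
    app = create_app(directory)
    thread = threading.Thread(
        target=app.run,
        kwargs={"host": "127.0.0.1", "port": port, "use_reloader": False},
        daemon=True,
    )
    thread.start()
    return thread
```

The reloader is disabled because it expects to run in the main thread. Binding to 127.0.0.1 keeps the dashboard reachable only from the local machine.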

Conclusion

TorScraperPro version 1.0 targets web scraping tasks that call for privacy, through Tor integration, and for handling of dynamic content. Its dashboard and keyword-targeted scraping make it useful for data analysts, researchers, and cybersecurity professionals.

