TorScraperPro: Comprehensive Technical Overview

Legal Disclaimer

Educational Use and Compliance with Local Laws

This script is provided for educational purposes only. Users are responsible for ensuring that their use of the script complies with local legal laws and regulations. The originator of this code disclaims any responsibility for unethical or illegal use of the script. Users should exercise due diligence and respect the terms of service and data usage policies of the websites they interact with using this script.

Introduction

TorScraperPro is an advanced Python-based web scraping and content extraction tool. Designed for both privacy and efficiency, it navigates the web, including the Tor network, to extract content from specified websites based on user-defined keywords. Utilizing headless Chrome integrated with Selenium WebDriver, TorScraperPro effectively captures dynamic content. A Flask application dynamically serves the scraped HTML files, providing live updates and easy access to the data collected.

Features

TOR Network Support: Routes requests through the TOR network to ensure anonymity, enabling the scraping of .onion websites while maintaining user privacy.
Keyword-Based Scraping: Focuses on extracting content that contains specified keywords, facilitating targeted data collection.
Dynamic Content Rendering: With Selenium WebDriver, the tool can interact with JavaScript-heavy pages, ensuring comprehensive content capture.
Live Data Presentation: A local Flask web server serves the scraped HTML files, offering a dashboard for live monitoring and access to scraped data.
Automatic ChromeDriver Management: The script employs webdriver_manager to automatically select and use the correct ChromeDriver version, ensuring compatibility with the installed Google Chrome browser.
Flask Dashboard: Provides a user-friendly interface for reviewing scraping outcomes, enhancing usability and accessibility.

Dependencies

Python 3.6 or later
Flask
Requests
BeautifulSoup4
Selenium WebDriver
webdriver_manager
Tor (Optional for .onion site scraping)

Installation and Setup

Install Python and Dependencies: Ensure Python 3.6+ is installed on your system. Install the required Python packages using pip:
```
pip install flask requests beautifulsoup4 selenium webdriver_manager
```
Google Chrome: Install the latest version of Google Chrome to ensure compatibility with ChromeDriver.
TOR (Optional): For scraping .onion sites, install and configure TOR on your system.
Clone the Repository: Obtain the TorScraperPro script by cloning its repository to your local machine.

Running Instructions

Execute TorScraperPro from the command line with the desired parameters:

python TorScraperPro.py -v -d <Depth> -p <Pause>

Options include:

-v: Enables verbose logging for detailed operational insights.
-d <Depth>: Defines the scraping depth for website traversal.
-p <Pause>: Sets a pause duration between requests to mitigate server load and mimic human interaction.

Key Functions

setup_chrome(): Initializes a headless Chrome browser session for web interaction.
check_tor_connection(): Verifies connectivity to the TOR network and logs the TOR-assigned IP address.
get_keywords(): Loads keywords from a specified file, guiding the scraping focus.
scrape_and_extract(): Core function that navigates to URLs, renders JavaScript, and extracts content based on keyword matches.
Flask Web Server: Runs concurrently in a separate thread, serving scraped content through a user-friendly dashboard.

Flask Web Server

Initiated at script startup, the Flask app presents a simple yet effective interface for real-time access to the scraped content. It facilitates navigation between current and archived runs, enhancing the review process of collected data. Current Run page

Previous runs page

Conclusion

TorScraperPro version 1.0 emerges as a powerful solution for sophisticated web scraping needs, emphasizing privacy through TOR integration and flexibility with dynamic content handling. Its user-friendly dashboard and targeted scraping capabilities make it an invaluable tool for data analysts, researchers, and cybersecurity professionals.

0999ad / darkwebscraper Goto Github PK

darkwebscraper's Introduction

TorScraperPro: Comprehensive Technical Overview

Legal Disclaimer

Educational Use and Compliance with Local Laws

Introduction

Features

Dependencies

Installation and Setup

Running Instructions

Key Functions

Flask Web Server

Conclusion

darkwebscraper's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent