Giter VIP home page Giter VIP logo

smarter-web-scraping-with-python's Introduction

Smater Web Scraping with Python Selenium and Llama2

Generate podcast clips related to daily top submissions on Hacker News via web scraping with Python & Selenium, generative ai with Ollama and LLama2, Transcript generation OpenAI Whisper, iTunes Podcast Search, and more.

Coming soon

Requirements

  • Python 3.10 and up
  • A Bright Data Account (includes $25 credit)
  • ffmpeg (required for transcribing audio with OpenAI Whisper)

A Proxy-based Web Scraping approach

In this repo, we use a web scraping proxy service from Bright Data. Using a proxy service makes our requests more reliable. You can see the actual code for the Selenium-based remote connection here src/helpers/brightdata.py.

With Remote Proxy

our computer -> request -> proxy -> web server -> proxy -> response -> our computer

Without Remote Proxy

our computer -> request -> web server -> response -> our computer

Usage

# from 'src/2 - Connection Sample.ipynb'
from selenium.webdriver import Remote, ChromeOptions

# import this function
from helpers.brightdata import get_sbr_connection

options = ChromeOptions()

# options.headless = True # old method
options.add_argument("--headless=new") # new method

url = 'https://news.ycombinator.com'

with Remote(sbr_connection, options=options) as driver:
    driver.get(url)
    print(driver.page_source)

Getting Started

Clone project

mkdir -p ~/dev/smarter-scraping
cd ~/dev/smarter-scraping
git clone https://github.com/codingforentrepreneurs/Smarter-Web-Scraping-with-Python .

(Optional) Working through the course?

Use the course_start branch with:

mac/linux

git checkout course_start
rm -rf .git 
git init

windows

git checkout course_start
Remove-Item .git -Recurse -Force
git init

Create a Python Virtual Environment

cd ~/dev/smarter-scraping # or where you cloned the repo

mac/linux

python3 -m venv venv

windows

c:\Python311\python.exe -m venv venv

Activate the virtual enviornment

Always activate your environment!

cd ~/dev/smarter-scraping # or where you cloned the repo

mac/linux

source venv/bin/activate

windows

.\venv\Scripts\activate

If done correctly, your command line should start with (venv)

Install requirements

With virtual envionoment activated (e.g. (venv)), run:

(venv) python -m pip install pip --upgrade
(venv) python -m pip install -r requirements.txt

Implement Environment Variables with dotenv

mac/linux

cp sample-env-file .env

windows

Copy-Item .env.sample -Destination .env

Be sure to add your Bright Data proxy information:

  • BRIGHT_DATA_USERNAME
  • BRIGHT_DATA_PASSWORD
  • BRIGHT_DATA_HOST

Add Ollama data too (for Running the OpenAI drop-in replacement Llama2)

  • OPENAI_BASE_URL=http://localhost:11434/v1
  • OPENAI_API_KEY=ollama
  • OPENAI_COMPLETION_MODEL=llama2

Loading Environment Variables

With code that lives inside the src/ directory, you can import the helpers module to load your environment variables.

We created a simple function to extend the incredible python-decouple package (it's in src/helpers/env.py):

import helpers

MY_VAR = helpers.config('MY_VAR', default="Not set", cast=str)

Run Jupyter

Explore the notebooks!

jupyter notebook

smarter-web-scraping-with-python's People

Contributors

codingforentrepreneurs avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

stophobia

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.