prefect-webscraper-example

This repository is a complete tutorial on using Prefect to scrape a website and deploying the scraper to Prefect Cloud for scheduled orchestration.

It mostly follows the tutorial documented here, but is written to run on Prefect Cloud on a schedule.

Following that example, it writes data to a local SQLite table. That doesn't make much sense when our images are ephemeral, since the data disappears along with the container, but it illustrates the pipeline execution. In practice, when orchestrating through Prefect Cloud, we'd likely want to persist the data to a database or repository that resides on a separate, dedicated system.
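As a minimal sketch of the persistence step, the standard-library sqlite3 module can write scraped rows as shown below. The table name `scraped_items`, its `title` and `url` columns, and the `save_rows` helper are hypothetical and do not come from this repository.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[str, str]  # (title, url): hypothetical shape of one scraped record


def save_rows(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Create the table if needed, insert the rows, and return the table's row count."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS scraped_items (title TEXT, url TEXT UNIQUE)"
        )
        # INSERT OR IGNORE skips rows whose url is already stored,
        # so re-running a scheduled scrape does not duplicate records.
        conn.executemany(
            "INSERT OR IGNORE INTO scraped_items (title, url) VALUES (?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM scraped_items").fetchone()[0]


if __name__ == "__main__":
    # An in-memory database mirrors the ephemeral-container situation:
    # everything is lost when the process exits.
    connection = sqlite3.connect(":memory:")
    sample = [
        ("First post", "https://example.com/1"),
        ("Second post", "https://example.com/2"),
    ]
    print(save_rows(connection, sample))
    connection.close()
```

Pointing the connection at an external database instead of a local file is what would let the data outlive the container.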

Installation

You'll need a Python environment with the packages listed in requirements.txt installed.

It's best practice to set up a separate environment for each project. You can do this with Anaconda or pure Python:

Python Virtual Environment

pip install virtualenv
python -m venv prefect-webscraper-example
source prefect-webscraper-example/bin/activate

Conda Virtual Environment

conda create -n prefect-webscraper-example python=3.7
conda activate prefect-webscraper-example

Package installation

To install the packages, use pip, since not all of them are available on the conda channels:

pip install -r requirements.txt
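
After installation, a quick import check can confirm the environment is usable. The sketch below assumes the import names `prefect`, `requests`, `bs4`, and `selenium`, inferred from the libraries this README mentions rather than read from requirements.txt. The `check_modules` helper is hypothetical.

```python
import importlib.util
from typing import Dict, Iterable

# Assumed import names; the authoritative list lives in requirements.txt.
ASSUMED_MODULES = ("prefect", "requests", "bs4", "selenium")


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_modules(ASSUMED_MODULES).items():
        print(f"{name}: {'ok' if available else 'MISSING'}")
```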

Visualization Note

If you want to visualize the DAG, you'll need graphviz installed. This can be done with one command if you're using conda:

conda install graphviz

If you want to use the pure Python approach, refer to the official documentation here:

Examples

BeautifulSoup

The example on Prefect's site leverages the requests library, along with beautifulsoup4. This pattern works for basic websites that don't involve a lot of JavaScript manipulation of the DOM.
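
To illustrate the parsing half of that pattern, here is a minimal sketch that runs BeautifulSoup over an inline HTML string. The string stands in for the text of a `requests.get(...)` response. The markup, the `#articles` selector, and the `extract_links` helper are made up for illustration and are not taken from the example script.

```python
from typing import List, Tuple

from bs4 import BeautifulSoup

# Stand-in for the body of an HTTP response fetched with requests.
SAMPLE_HTML = """
<html><body>
  <ul id="articles">
    <li><a href="/posts/1">First post</a></li>
    <li><a href="/posts/2">Second post</a></li>
  </ul>
</body></html>
"""


def extract_links(html: str) -> List[Tuple[str, str]]:
    """Return (text, href) pairs for every link inside the #articles list."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.select("#articles a[href]")
    ]


if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

Because the HTML is parsed exactly as delivered by the server, anything a page adds later through JavaScript never appears in the result.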

A working example of using BeautifulSoup to parse a website on a schedule in Prefect Cloud is found in:

Selenium

More modern websites that rely heavily on AJAX and JavaScript DOM manipulation require executing the JavaScript and parsing the page as it would load in a traditional browser. Headless versions of popular web browsers make this possible, and Selenium lets you query the rendered page with similar CSS or XPath syntax.

A working example of using Selenium to parse a website on a schedule in Prefect Cloud is found in:

Selenium Drivers

To leverage Selenium on your local machine, you'll need to download the appropriate driver from their website:

In this example, we're using the chromedriver located in the same directory as this code.

When deploying to Prefect Cloud, the reference code builds a base image modeled on the official Selenium Chrome image, then adds the Prefect Flow code to produce the final image that gets orchestrated.

This can be viewed in the Dockerfile.

Project Layout

| Type | Object | Description |
| --- | --- | --- |
| 📁 | docker | Non-source-code files used by the Dockerfile during the build process |
| 📄 | build_docker_base_image.sh | Script to build the base image for the Selenium Chrome driver |
| 📄 | Dockerfile | Dockerfile to build a base image for the Selenium Chrome driver |
| 📄 | example-bs4.py | Example website scraper Prefect Flow, ready for Prefect Cloud, using BeautifulSoup |
| 📄 | example-selenium.py | Example website scraper Prefect Flow, ready for Prefect Cloud, using Selenium |
| 📄 | README.md | The file you're reading now |
| 📄 | requirements.txt | Python packages required for local development of the Prefect Flows in this repository |

Contributors

szelenka-cisco
