
README

Introduction

This is a self-learning project on ETL using Spark, Airflow, MinIO and Redash. The IMDb datasets can be found here and the corresponding documentation can be found here.

After doing some casual browsing as well as data wrangling on the tsv files downloaded from the datasets page, I decided to design the schema as shown below:

The datasets are downloaded from the datasets page and uploaded to MinIO. From there, I read the tsv files into Spark dataframes and process them to fit the schema. The processed data is loaded into a Postgres database, and a BI tool (Redash) pulls data from the database to populate the dashboard. Airflow automates the whole ETL process. A high-level overview of the pipeline is shown below:
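
To make the transform-and-load step more concrete, here is a minimal PySpark sketch. It is illustrative only, not the repo's exact code: the s3a path follows the MinIO layout used later in this README, while the Postgres connection details and the column selection are placeholder assumptions.

# Illustrative sketch of one transform-and-load step, not the repo's exact code.
# The s3a path follows the MinIO layout used later in this README; the Postgres
# connection details and table/column choices are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("imdb-etl-sketch").getOrCreate()

# IMDb tsv files are tab-separated, have a header row, and use "\N" for missing values.
titles = (
    spark.read.option("sep", "\t")
    .option("header", "true")
    .option("nullValue", "\\N")
    .csv("s3a://imdb/2021-06/26/title.basics.tsv")
)

# Light cleaning to fit the target schema, then load into Postgres over JDBC.
(
    titles.select(
        "tconst",
        "titleType",
        "primaryTitle",
        F.col("startYear").cast("int").alias("startYear"),
        F.col("runtimeMinutes").cast("int").alias("runtimeMinutes"),
    )
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/imdb")
    .option("dbtable", "titles")
    .option("user", "postgres")
    .option("password", "postgres")
    .mode("append")
    .save()
)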

Dashboard

A dashboard created using Redash

Steps

  1. Change the values in airflow/config.json and airflow/create_cnx.py
  2. Initialize the Redash database

docker-compose run --rm redash create_db

  3. Start all services using docker-compose

docker-compose up -d

  4. Visit the Airflow web UI at localhost:8082 and log in with

username: airflow

password: airflow

  5. Go to Admin > Variables > Choose File > choose the config.json in the airflow directory > Import Variables
  6. Trigger the DAGs that only need to run ONCE (create_connections and create_tables)
  7. Trigger the upload_imdb_datasets_minio DAG; the imdb_etl DAG will be triggered automatically after the first one has finished (a sketch of this hand-off follows the list)
  8. Visit Redash at localhost and set up a password
  9. Add the data source, create queries using the SQL statements in the dashboard-queries folder, and create a dashboard from those queries
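
One way the hand-off in step 7 can be wired up is with a TriggerDagRunOperator at the end of the upload DAG. The sketch below only illustrates that pattern; the task ids and the stand-in upload task are hypothetical, and the repo's actual DAGs may be structured differently.

# Illustration of step 7: the upload DAG triggers imdb_etl once its own work is done.
# Only the DAG ids come from this README; task ids and the stand-in upload task are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="upload_imdb_datasets_minio",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    upload_datasets = DummyOperator(task_id="upload_datasets")  # stand-in for the real upload tasks

    trigger_etl = TriggerDagRunOperator(
        task_id="trigger_imdb_etl",
        trigger_dag_id="imdb_etl",  # fires only after the upstream upload task succeeds
    )

    upload_datasets >> trigger_etl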

Spark standalone on a local machine for developing with notebooks

  1. Installation
sudo apt update  
sudo apt install default-jdk scala git -y
java -version; javac -version; scala -version; git --version
wget https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xvzf spark-*
sudo mv spark-3.1.2-bin-hadoop3.2 /opt/spark
echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
echo "export SPARK_MASTER_WEBUI_PORT=8080" >> ~/.profile
source ~/.profile
  2. Jar Dependencies
  • Go to the Maven Repository and download the jar files listed below, then move all of them to $SPARK_HOME/jars. You need to delete the original guava jar file in $SPARK_HOME/jars first.

| No. | Jar file | Version |
| --- | --- | --- |
| 1 | hadoop-aws | 3.2.0 |
| 2 | aws-java-sdk-bundle | 1.11.375 |
| 3 | guava | 30.1.1-jre |
| 4 | jets3t | 0.9.4 |
| 5 | postgresql | 42.2.0 |
  3. Start master and worker
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://<machine_hostname>:7077 
  4. In the Jupyter notebooks, run the following lines to set the number of cores and amount of memory to use (a fuller first-cell sketch follows this list)
import os

master = "spark://<machine_hostname>:7077"  
os.environ['PYSPARK_SUBMIT_ARGS'] = f'--master {master} --driver-memory 4g --total-executor-cores 6 --executor-memory 8g --packages org.postgresql:postgresql:42.1.1 pyspark-shell'

NOTE: Visit localhost:8080 (the Spark master web UI) to see the value of your machine hostname

  5. Order of execution for notebooks
1. download_date.ipynb
2. titles.ipynb
3. episodes.ipynb
4. casts.ipynb
5. crew.ipynb
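
Putting the jar dependencies (step 2) and the notebook settings (step 4) together, the first cell of a notebook might look roughly like the sketch below. PYSPARK_SUBMIT_ARGS must be set before the SparkSession is created, and the s3a options let Spark read from MinIO; the hostname, endpoint and credentials are placeholders for your own values.

# Rough first notebook cell: attach to the standalone cluster and point s3a at MinIO.
# The hostname, MinIO endpoint and credentials are placeholders, not values from the repo.
import os

master = "spark://<machine_hostname>:7077"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--master {master} --driver-memory 4g --total-executor-cores 6 "
    "--executor-memory 8g --packages org.postgresql:postgresql:42.1.1 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("imdb-notebooks")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "password")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)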

Great Expectations

  1. Install great_expectations
. venv/bin/activate
pip install great_expectations
  2. Initialize Great Expectations deployment
great_expectations init
  3. Connect to local filesystem datasource (Pandas)
$ great_expectations datasource new

The datasources section in great_expectations.yml will then look something like this:
datasources:
  imdb_pandas:
    data_asset_type:
      module_name: great_expectations.dataset
      class_name: PandasDataset
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: ./data
    module_name: great_expectations.datasource
    class_name: PandasDatasource

  4. Place all the tsv files in the great_expectations/data directory (because I don't know how to use files from S3 yet...)

  5. Create expectation suite for each tsv file

$ great_expectations suite new

Add the following to the `batch_kwargs` dictionary in order to read a tsv file:
"reader_options": {"sep": "\t"}
  6. Write table and column expectations in the Jupyter notebook (see the expectations glossary); a rough example follows below
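
For reference, a cell in the suite notebook might look roughly like this, using the legacy batch_kwargs API that matches the config above. The suite name, file path and the expectations themselves are only examples:

# Rough example of a suite-editing cell (legacy / v2 batch_kwargs API).
# The suite name, file path and expectations are illustrative.
import great_expectations as ge

context = ge.data_context.DataContext()
suite = context.create_expectation_suite("title_basics.warning", overwrite_existing=True)

batch_kwargs = {
    "datasource": "imdb_pandas",
    "path": "data/title.basics.tsv",
    "reader_options": {"sep": "\t"},  # the tsv separator mentioned above
}
batch = context.get_batch(batch_kwargs, suite)

batch.expect_column_values_to_not_be_null("tconst")
batch.expect_column_values_to_be_unique("tconst")
batch.save_expectation_suite(discard_failed_expectations=False)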

Use S3 store for expectations, validations and data_docs site

sudo apt install awscli

Activate the venv, then:
$ pip install boto3==1.17.0 fsspec 's3fs<=0.4'

Set AWS credentials:
$ aws configure
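
Because MinIO is not AWS, boto3 needs the endpoint URL on top of the credentials. A quick sanity check (the endpoint is the same one used in the sync commands below):

# Check that boto3 can reach MinIO with the credentials set via `aws configure`.
import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:9000")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])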

Expectations json files

  1. Create bucket
  2. Copy existing expectation json files to s3 bucket
$ aws s3 --endpoint-url http://localhost:9000 sync expectations s3://great-expectations/expectations
  3. Verify GE is using S3 to store expectations
$ great_expectations suite list

Validations files

  1. Create bucket
  2. Copy validations files to s3 bucket
$ aws s3 --endpoint-url http://localhost:9000 sync uncommitted/validations s3://great-expectations/validations
  3. Verify GE is using S3 to store validations
$ great_expectations store list

MinIO data docs site

  1. Create a bucket called data-docs
  2. Use the MinIO client (mc) to change the bucket policy to download
$ docker run -it --entrypoint=/bin/sh minio/mc

# change it to your IP address 
$ mc alias set minio http://192.168.0.191:9000 admin password --api S3v4
$ mc policy set download minio/data-docs
  3. Build data_docs in the data-docs bucket
great_expectations docs build --site-name s3_site
  4. Visit localhost:82 to view the data docs

The minio_site Nginx service defined in docker-compose.yaml will serve the data_docs site.

data-docs

MinIO datasource

  1. Set the imdb bucket policy to download
mc policy set download minio/imdb
  2. Edit great_expectations.yml
datasources:
  s3_pandas:
    class_name: PandasDatasource
    boto3_options:
      endpoint_url: 'http://localhost:9000'
    batch_kwargs_generators:
      pandas_s3_generator:
        class_name: S3GlobReaderBatchKwargsGenerator
        boto3_options:
          endpoint_url: 'http://localhost:9000'
        reader_options:
          sep: "\t"
        bucket: imdb # Only the bucket name here (i.e., no prefix)
        assets:
          imdb_data:
            prefix: 2021-06/26/ # trailing slash is important
            regex_filter: .*.tsv  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
    module_name: great_expectations.datasource
    data_asset_type:
      class_name: PandasDataset
      module_name: great_expectations.dataset
  3. Open a Jupyter notebook and write expectations (a rough example follows below)
great_expectations suite scaffold s3_pandas
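
The scaffold notebook then pulls a batch through the generator defined above. A rough example using the legacy API is shown below; the exact calls can differ between Great Expectations versions, and the suite name is illustrative:

# Rough example of getting a batch through the S3 generator above (legacy / v2 API).
# The suite name is illustrative; the build_batch_kwargs arguments mirror the yaml config.
import great_expectations as ge

context = ge.data_context.DataContext()
suite = context.create_expectation_suite("imdb_s3.warning", overwrite_existing=True)

batch_kwargs = context.build_batch_kwargs("s3_pandas", "pandas_s3_generator", "imdb_data")
batch = context.get_batch(batch_kwargs, suite)
print(batch.head())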

Todo

  • Refactor code to reduce hardcoding
  • Metrics monitoring
