
de-zoomcamp-project

DOCUMENTATION for the project

Problem description

The aim of the project is to calculate metrics based on financial trading market data. The Deutsche Börse Public Dataset is used.

Metrics are:

  • implied volatility of the traded securities/instruments
  • most traded securities/instruments by traded volume

Source data links: https://registry.opendata.aws/deutsche-boerse-pds/

Overall dataset description: https://github.com/Deutsche-Boerse/dbg-pds

Data dictionary: https://github.com/Deutsche-Boerse/dbg-pds/blob/master/docs/data_dictionary.md

Cloud technologies used

(project architecture diagram)

The source data lives in AWS S3, but the pipeline itself mainly uses:

  • GCP (VM with Apache Airflow, data lake in Google Cloud Storage, BigQuery for DWH, Google Data Studio for visualisations)
  • dbt Cloud for building the reporting tables.

All of the infrastructure is deployed using Terraform. The VM with Apache Airflow is deployed from a machine image prepared in advance (with Docker, docker-compose, and the repo directory). The source code for the deployment is here.

Data ingestion (batch)

The data ingestion pipeline uses the Airflow DAG load_xetra_data.

The DAG has 3 tasks (a minimal sketch follows the list):


  • s3_download downloads data from S3 to the local worker storage (it uses the S3Tools class for reusable methods).

  • upload_to_raw_gcs loads the files from the local Airflow worker storage to the GCS data lake (the deutsche_boerse bucket) using the GCPTools class. Inside the bucket, objects are placed according to a daily partitioning strategy for external tables.


  • remove_files cleans the local storage on the Airflow worker using the FilesTools class.
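
The task structure can be sketched as a plain Airflow DAG. This is a hedged, minimal sketch only: the DAG id, task ids, daily schedule, start date, and catchup behaviour come from this documentation, while the callables are placeholders because the actual methods of S3Tools, GCPTools, and FilesTools are not reproduced here.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def s3_download(ds: str, **_):
    # Download the source files for the execution date from S3 to local
    # worker storage; in the repo this is handled by the S3Tools class.
    ...


def upload_to_raw_gcs(ds: str, **_):
    # Upload the downloaded files to the deutsche_boerse GCS bucket, keyed
    # by date so external tables see daily partitions (GCPTools in the repo).
    ...


def remove_files(ds: str, **_):
    # Clean up the Airflow worker's local storage (FilesTools in the repo).
    ...


with DAG(
    dag_id="load_xetra_data",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=True,  # backfills the year 2022 from the start date
) as dag:
    download = PythonOperator(task_id="s3_download", python_callable=s3_download)
    upload = PythonOperator(task_id="upload_to_raw_gcs", python_callable=upload_to_raw_gcs)
    cleanup = PythonOperator(task_id="remove_files", python_callable=remove_files)

    download >> upload >> cleanup
```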

Batching strategy

The DAG runs daily and takes data for the day that has just ended (yesterday, via the daily schedule and the {{ ds }} macro in Airflow). So, for example, a DAG run started on 2022-03-27 will load data for 2022-03-26 into the data lake. Initially, data was loaded for the whole of 2022 using the catchup parameter. Code for the DAG is here, variables are here.
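
To illustrate the batching, the sketch below shows how the {{ ds }} value maps to the previous day's source prefix and to a daily-partitioned target path. The public bucket name deutsche-boerse-xetra-pds and the xetra/<date>/ layout in the data lake are assumptions for illustration, not values taken from the repo.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config


def list_source_keys(ds: str) -> list[str]:
    # The public dataset is organised with one prefix per trading day,
    # so the {{ ds }} date doubles as the S3 prefix (assumed layout).
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))  # public bucket, no credentials
    response = s3.list_objects_v2(Bucket="deutsche-boerse-xetra-pds", Prefix=f"{ds}/")
    return [obj["Key"] for obj in response.get("Contents", [])]


def gcs_target_path(ds: str, filename: str) -> str:
    # Daily-partitioned object path in the data-lake bucket
    # (assumed layout, not the repo's actual naming).
    return f"xetra/{ds}/{filename}"


if __name__ == "__main__":
    # A run started on 2022-03-27 has ds == "2022-03-26", i.e. yesterday's data.
    print(list_source_keys("2022-03-26")[:3])
    print(gcs_target_path("2022-03-26", "2022-03-26_BINS_XETR08.csv"))  # example filename
```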

Data warehouse

BigQuery is used for the DWH. The initial external table is created using a separate script in the repo.
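
The repo has its own script for that step; as a hedged Python illustration of the same idea, an external table over the daily-partitioned objects in the lake could be created with the BigQuery client library. The project id, dataset and table names, bucket URI, and the assumption that the raw files are CSV are placeholders, not the repo's actual values.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")          # placeholder project id
table_id = "your-gcp-project.deutsche_boerse.xetra_external"  # placeholder dataset.table

external_config = bigquery.ExternalConfig("CSV")  # assumes the lake keeps the raw CSV files
external_config.source_uris = ["gs://your-deutsche-boerse-bucket/xetra/*"]  # placeholder URI pattern
external_config.autodetect = True                 # infer the schema from the files
external_config.options.skip_leading_rows = 1     # each source file carries a header row

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```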

Other objects are created in BigQuery using dbt Cloud.

Transformations

Data transformations are done with dbt Cloud. The source code for the models is here.

The data is rather clean, so an increment covering the last 40 days is loaded into the raw layer from the external table, and the mart layer with the metric calculations is built on top of it. Google Data Studio is connected to the mart layer tables.

The dbt job runs daily and refreshes the last 7 and 30 days of data in the corresponding metrics tables.

The job runs successfully.


All objects are created in the target production dataset.


Dashboard

The dashboard is prepared in Google Data Studio and has 2 tiles (https://datastudio.google.com/s/igJNnmGAikI):

  • Most traded instruments by traded volume, in millions of EUR (for the last 30 days)
  • Instruments' implied volatility (for the last 7 days)


Reproducibility

How to reproduce using the existing repo:

  1. Prepare Google Cloud service accounts and grant permissions.
  2. Prepare a virtual machine in Google Cloud. Install docker, docker-compose, and git there. Clone the repo. Provision the credentials file for a service account. Build the images using docker-compose build. Make a machine image of this VM. The Terraform commands were run from the local machine, but Terraform could also be installed on this VM or a separate one.
  3. Change the user-specific attributes in the code:
  • In the Terraform variables.tf change the project id, region, and name of the bucket. In main.tf change the name of the machine image and the zone.
  • In docker-compose.yaml change the env variables GOOGLE_APPLICATION_CREDENTIALS (the path to the credentials file on the VM), GCP_PROJECT_ID, and GCP_GCS_BUCKET. Pay attention to the volumes (for dags and plugins the repo directories are used, for logs there is a separate directory on the VM, plus a directory for the Google credentials).
  • In the create-dataset and external-table scripts change the project-specific attributes.
  4. As of now, the VM startup script doesn't work, so after the Terraform deployment connect to the VM and run a set of commands: go to the repo directory and start up Airflow (docker-compose up airflow-init, then docker-compose up). Forward port 8080 and check that the DAG was picked up from the repo.
  5. Unpause the DAG and it will start to load data from 2022-01-01.
  6. Go to BigQuery and create the dataset using the provided CREATE SCHEMA script. Additionally, create a dataset for your dbt development environment.
  7. Initialize the dbt project.

Improvements to do:
