data-copier-live's Introduction

Data Copier

A simple pipeline to copy data from one database (MySQL) to another (Postgres).

Setup Docker

Setup MySQL

Setup Postgres

Run Application

  • Building Docker Image - docker build -t data-copier-live .
  • Running Application using Docker with entrypoint.
docker run --name data-copier \
  -v `pwd`:/app \
  -it \
  -e SOURCE_DB_USER=retail_user \
  -e SOURCE_DB_PASS=itversity \
  -e TARGET_DB_USER=retail_user \
  -e TARGET_DB_PASS=itversity \
  --entrypoint python \
  data-copier-live app.py dev departments
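
The command above passes the database credentials as environment variables and the environment name and table name as positional arguments. Here is a minimal sketch of how an entry point might pick them up. The helper name load_settings and the layout of the returned dictionary are hypothetical and only for illustration; the actual app.py in the repository may be organized differently.

```python
def load_settings(argv, environ):
    """Collect run settings from positional arguments and environment variables.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if len(argv) < 3:
        raise SystemExit("usage: app.py <env> <table>")
    env_name, table = argv[1], argv[2]
    # These names match the -e flags passed to docker run.
    required = ["SOURCE_DB_USER", "SOURCE_DB_PASS", "TARGET_DB_USER", "TARGET_DB_PASS"]
    missing = [name for name in required if name not in environ]
    if missing:
        raise SystemExit("missing environment variables: " + ", ".join(missing))
    return {
        "env": env_name,
        "table": table,
        "source": {"user": environ["SOURCE_DB_USER"], "password": environ["SOURCE_DB_PASS"]},
        "target": {"user": environ["TARGET_DB_USER"], "password": environ["TARGET_DB_PASS"]},
    }


# Example call with the same values as the docker run command.
settings = load_settings(
    ["app.py", "dev", "departments"],
    {
        "SOURCE_DB_USER": "retail_user",
        "SOURCE_DB_PASS": "itversity",
        "TARGET_DB_USER": "retail_user",
        "TARGET_DB_PASS": "itversity",
    },
)
print("copying table", settings["table"], "in environment", settings["env"])
```

In a real entry point the two arguments would be sys.argv and os.environ. Failing early on a missing variable gives a clearer error than a failed database login later in the run.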

Setup AirFlow

Let us set up AirFlow and develop the data pipeline.

Using Sqlite

Let us understand how to set up AirFlow for development using sqlite.

  • AirFlow is a Python-based library, so we can install it using pip.
  • Let us create a virtual environment for AirFlow and then set up AirFlow.
  • Create a directory by name airflow - mkdir airflow. Get into the directory by running cd airflow.
  • Here is the command to create the virtual environment - python -m venv airflow-env.
  • Activate the virtual environment - source airflow-env/bin/activate
  • Install AirFlow - pip install apache-airflow.
  • Run airflow initdb to initialize the database and add configuration files (the command is airflow db init in AirFlow 2.x). The database and configuration files are created under AIRFLOW_HOME, which defaults to ~/airflow; set AIRFLOW_HOME to the working directory if it is located elsewhere.
  • By default it uses sqlite database.
  • Run the following commands to start the airflow webserver and scheduler.
airflow webserver -p 8080 -D
airflow scheduler -D

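Here are some of the limitations of this setup.

  • Scalability - with sqlite, AirFlow can only use SequentialExecutor, which runs one task at a time.
  • It is useful only for development and for evaluating AirFlow features.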

Using MySQL

In non-development environments we have to set up AirFlow using a traditional RDBMS database. Let us understand how we can configure AirFlow with a MySQL database and also with LocalExecutor.

  • Make sure to stop all the airflow processes.
cat airflow-scheduler.pid | xargs kill
cat airflow-webserver.pid|xargs kill
ps -ef|grep airflow # if you find any outstanding processes kill using kill command
  • Install mysql-connector-python so that we can use MySQL Database - pip install mysql-connector-python.
  • Make sure the MySQL database is set up.
docker run \
    --name mysql_airflow \
    -e MYSQL_ROOT_PASSWORD=itversity \
    -d \
    -p 4306:3306 \
    mysql
  • Connect to MySQL and create the airflow database along with a user for it - docker exec -it mysql_airflow mysql -u root -p
CREATE DATABASE airflow;
CREATE USER airflow IDENTIFIED BY 'itversity';
GRANT ALL ON airflow.* TO airflow;
FLUSH PRIVILEGES;
  • Set executor to LocalExecutor.
executor = LocalExecutor
  • We can also use other Executors.
  • Update sql_alchemy_conn with MySQL URL.
sql_alchemy_conn = mysql+mysqlconnector://airflow:itversity@localhost:4306/airflow?use_pure=True
  • Make sure the properties related to concurrency are adjusted to lower numbers.
parallelism = 8
dag_concurrency = 4
max_active_runs_per_dag = 4
workers = 4
worker_concurrency = 4
worker_autoscale = 4,2
  • Run airflow initdb to initialize MySQL Database.
  • Start webserver and scheduler in the background.
airflow webserver -p 8080 -D
airflow scheduler -D
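
The sql_alchemy_conn value used above is a SQLAlchemy database URL, so any special characters in the user name or password have to be percent-encoded. AirFlow also reads settings from environment variables named AIRFLOW__{SECTION}__{KEY}, which take precedence over airflow.cfg. Here is a minimal sketch that builds the URL and prints the matching overrides. The helper names mysql_conn_url and airflow_env_name are hypothetical, and the sketch assumes that executor and sql_alchemy_conn belong to the [core] section, as in AirFlow 1.10.

```python
from urllib.parse import quote


def mysql_conn_url(user, password, host, port, database):
    """Build a SQLAlchemy URL for the mysql-connector-python driver."""
    # safe="" percent-encodes characters such as '@' and '/' in the credentials.
    return "mysql+mysqlconnector://{}:{}@{}:{}/{}?use_pure=True".format(
        quote(user, safe=""), quote(password, safe=""), host, port, database
    )


def airflow_env_name(section, key):
    """Name of the environment variable that overrides [section] key in airflow.cfg."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


# Values taken from the MySQL container and user created in the steps above.
url = mysql_conn_url("airflow", "itversity", "localhost", 4306, "airflow")
overrides = {
    airflow_env_name("core", "executor"): "LocalExecutor",
    airflow_env_name("core", "sql_alchemy_conn"): url,
}
for name, value in overrides.items():
    print("export {}='{}'".format(name, value))
```

Using environment variables keeps the password out of airflow.cfg, which is convenient when the file is checked into version control.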

We can switch over to CeleryExecutor by following these steps.

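  • Install Celery support using pip install 'apache-airflow[celery]' (the quotes keep the shell from interpreting the brackets).
  • Change executor to CeleryExecutor. CeleryExecutor also needs a message broker and a result backend, configured through broker_url and result_backend in the [celery] section of airflow.cfg.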
executor = CeleryExecutor
  • Kill all the webserver and scheduler processes.
ps -ef|grep scheduler|awk -F" " '{ print $2 }'|xargs kill -9
cat *pid|xargs kill -9
rm *pid
ps -ef|grep airflow #Kill any remaining sessions
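  • Start the airflow components and validate by visiting the webserver URL on port 8080. With CeleryExecutor, tasks run only after at least one Celery worker is also started (airflow worker in AirFlow 1.10).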
airflow webserver -p 8080 -D
airflow scheduler -D
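
Before opening the browser, it can be handy to confirm that the webserver is accepting connections. Here is a minimal sketch using only the Python standard library. The helper name is_port_open is hypothetical, and localhost with port 8080 is assumed from the webserver command above. It only checks that something is listening on the port, not that AirFlow itself is healthy.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


print("webserver reachable:", is_port_open("localhost", 8080))
```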

Schedule using AirFlow

By this time we should be ready with our application as well as AirFlow. Let us understand how we can integrate both.

  • As we are going to run our application using a Docker container, we will use DockerOperator provided by AirFlow.
  • Add Docker support to AirFlow - pip install 'apache-airflow[docker]'
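
Here is a minimal sketch of a DAG that runs the copier for one table through DockerOperator. Several things here are assumptions rather than part of the original project: the DAG id data_copier, the helper copier_task_args, the daily schedule, the placeholder start date, and the expectation that the data-copier-live image already contains app.py and has no conflicting entrypoint. The import path is the one used by AirFlow 1.10; in AirFlow 2.x the operator lives in airflow.providers.docker.operators.docker.

```python
from datetime import datetime


def copier_task_args(table, env="dev"):
    """Keyword arguments for a DockerOperator task that copies one table.

    Hypothetical helper; image name and credentials mirror the docker run example.
    """
    return {
        "task_id": "copy_{}".format(table),
        "image": "data-copier-live",
        # Assumes the image can run app.py directly from its working directory.
        "command": "python app.py {} {}".format(env, table),
        "environment": {
            "SOURCE_DB_USER": "retail_user",
            "SOURCE_DB_PASS": "itversity",
            "TARGET_DB_USER": "retail_user",
            "TARGET_DB_PASS": "itversity",
        },
    }


try:
    from airflow import DAG
    from airflow.operators.docker_operator import DockerOperator  # AirFlow 1.10 path
except ImportError:
    # Lets the helper above be imported on machines without AirFlow installed.
    DAG = DockerOperator = None

if DAG is not None:
    dag = DAG(
        dag_id="data_copier",
        start_date=datetime(2021, 1, 1),  # placeholder date
        schedule_interval="@daily",
        catchup=False,
    )
    copy_departments = DockerOperator(dag=dag, **copier_task_args("departments"))
```

Saving a file like this in the dags folder under AIRFLOW_HOME should make the DAG visible to the scheduler. Keeping the task arguments in a small function makes it easy to add one task per table later.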
