
airflow-repo-template's Introduction

Airflow Codebase Template

Background

Apache Airflow is the leading orchestration tool for batch workloads. Originally conceived at Facebook and later open-sourced at Airbnb, Airflow lets you define complex directed acyclic graphs (DAGs) by writing simple Python.

Airflow has a number of built-in concepts that make data engineering simple, including DAGs (which describe how to run a workflow) and Operators (which describe what actually gets done). See the Airflow documentation for more detail: https://airflow.apache.org/concepts.html
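To make those two concepts concrete, here is a minimal sketch of a DAG with a single Operator (the dag_id, schedule, and task are purely illustrative and not part of this template):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # The DAG describes *how* the workflow runs: when it starts and how often.
    with DAG(
        dag_id="example_hello",           # hypothetical DAG name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # The Operator describes *what* actually gets done; here, one shell command.
        say_hello = BashOperator(task_id="say_hello", bash_command="echo hello")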

Airflow also comes with its own architecture: a database to persist the state of DAGs and connections, a web server that serves the user interface, and workers that are coordinated by the scheduler. Logs persist both in flat files and in the database, and Airflow can be set up to write remote logs (to S3, for example). Logs are viewable in the UI.

Airflow Architecture

A Note on managing Airflow dependencies

Airflow is tricky to install correctly because it is both an application and a library. Applications freeze their dependencies to ensure stability, while libraries leave their dependencies open for upgrades to take advantage of new features. Airflow is both, so it doesn't freeze dependencies. This means that depending on the day, a simple pip install apache-airflow is not guaranteed to produce a workable version of the Airflow application.

To combat this, Airflow publishes a set of constraint files that pin known-working dependency versions for each Airflow release.

This template installs Airflow using the constraints file at https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt and lets you build a custom Airflow image on top of that constraints file by simply adding extra dependencies to airflow.requirements.txt. Local-only dependencies go in local-requirements.txt.
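Under the hood, the install boils down to a constrained pip install along these lines (the versions shown are hypothetical; the Makefile substitutes the real AIRFLOW_VERSION and PYTHON_VERSION):

    pip install "apache-airflow==2.0.1" \
        --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.1/constraints-3.8.txt"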

Why not just use the official docker-compose file?

Building our own image makes it easier to customize the additional dependencies we install alongside Airflow. The stock Airflow image doesn't allow this kind of low-level control.

This template is particularly useful to Airflow power users who tend to write a lot of custom plugins or functionality that relies on external dependencies.

Why not just extend off of the official Airflow image?

You can do this, but customizing the image yields far more optimization and doesn't add any complexity. To add Airflow extras, simply append them to the AIRFLOW_EXTRAS variable in the Makefile:

    AIRFLOW_EXTRAS := postgres,google

To install any other pip dependencies, simply add them to airflow.requirements.txt.
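For example, airflow.requirements.txt might contain pinned packages like these (hypothetical entries, not part of the template):

    pandas==1.2.3
    requests==2.25.1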

Getting Started

This repository was created with Python 3.8.6, but should work for all versions of Python 3.

DAGs should be developed & tested locally first, before being promoted to a development environment for integration testing. Once DAGs are successful in the lower environments, they can be promoted to production.

Code is contributed either in dags, a directory that houses all Airflow DAG configuration files, or plugins, a directory that houses Python objects that can be used to extend Airflow.
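Putting the pieces together, the repository layout looks roughly like this:

    .
    ├── dags/                       # Airflow DAG definition files
    ├── plugins/                    # Python objects that extend Airflow
    ├── tests/                      # unit tests for DAGs and plugins
    ├── airflow.requirements.txt    # extra dependencies baked into the Airflow image
    ├── local-requirements.txt      # dependencies for the local virtual environment
    ├── docker-compose.yaml
    ├── Dockerfile
    └── Makefile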

Running Airflow locally

This project uses a Makefile to consolidate common commands and make it easy for anyone to get started. To run Airflow locally, simply:

    make start-airflow

This command will build your local Airflow image and start Airflow automatically!

Navigate to http://localhost:8080/ and start writing & testing your DAGs! Log in with the username/password combo admin:admin (you can change this in docker-compose.yaml).

You'll notice in docker-compose.yaml that both DAGs and plugins are mounted as volumes. This means that once Airflow is started, any changes to your code are quickly synced to the webserver and scheduler. You shouldn't have to restart the Airflow instance during development!
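The relevant portion of docker-compose.yaml looks roughly like the sketch below; the exact service names and container paths may differ, so treat it as illustrative and check the file itself for the real mount points:

    services:
      webserver:
        volumes:
          # host dags/ and plugins/ are mounted into the container,
          # so edits appear without rebuilding the image
          - ./dags:/root/airflow/dags        # container path assumed
          - ./plugins:/root/airflow/plugins  # container path assumed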

When you're done, simply:

    make stop-airflow

Testing & Linting

Creating a local virtual environment is entirely optional. You can develop entirely through Docker: Airflow runs inside docker-compose, and the test-docker and lint-docker targets let you run tests and linting without a virtual environment.

However, skipping the virtual environment also means sacrificing any linting/language-server functionality provided by your IDE. To set up your virtual environment:

    make venv

This project is also fully linted with black and pylint, including a handy pylint plugin called pylint-airflow. To run linting:

With your virtual environment:

    make lint

With Docker:

    make lint-docker

Any tests can be placed under tests; we've already included a few unit tests that validate all of your DAGs and plugins to make sure Airflow can import them (a sketch of that kind of test follows the commands below). To run tests:

With your virtual environment:

    make test

Inside Docker:

    make test-docker
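For reference, a DAG-validation test of this kind typically looks something like the following sketch (the bundled tests may differ in the details):

    import pytest
    from airflow.models import DagBag

    @pytest.fixture(scope="session")
    def dag_bag():
        # Parse every DAG file under dags/ without loading Airflow's example DAGs
        return DagBag(dag_folder="dags", include_examples=False)

    def test_no_import_errors(dag_bag):
        # Any DAG that fails to parse shows up in import_errors
        assert dag_bag.import_errors == {}

    def test_dags_were_found(dag_bag):
        # At least one DAG should be discovered
        assert len(dag_bag.dags) > 0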

Cleaning up your local environment

If at any point you simply want to clean up or reset your local environment, you can run the following commands:

Reset your local docker-compose:

    make reset-airflow

Rebuild the local Airflow image for docker-compose (useful if you make changes to the Dockerfile):

    make rebuild-airflow

Clean up Pytest artifacts:

    make clean-pytest

Reset your virtual environment:

    make clean-venv

Start completely from scratch:

    make clean-all

Deployment

Once you've written your DAGs, the next step is to deploy them to your Airflow instance. This is a matter of syncing the dags and plugins directories to their respective destinations.
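For example, a deployment to Google Cloud Composer is usually little more than a pair of sync commands along these lines (the bucket name is hypothetical):

    gsutil rsync -r dags gs://my-composer-bucket/dags
    gsutil rsync -r plugins gs://my-composer-bucket/plugins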

TODO: add some examples and documentation of deployments to different Airflow cloud providers (Astronomer, Cloud Composer, etc.) using different CI technologies (CircleCI, GitHub Actions, etc.)


airflow-repo-template's Issues

Upgrade to Airflow 2.2.5

Hello!

Lately I've been trying to kickstart my personal Airflow instance using your template, but had to upgrade it to 2.2.5. I succeeded by modifying the Dockerfile, which I'm sharing below. Maybe it will be useful if you decide to upgrade the template.

    FROM apache/airflow:2.2.5

    # Install Make pre-requisites
    USER root
    RUN apt-get update \
      && apt-get install -y --no-install-recommends build-essential
    USER airflow

    # Install additional dependencies
    COPY Makefile Makefile
    COPY airflow.requirements.txt airflow.requirements.txt

    RUN make internal-install-airflow
    RUN make internal-install-deps

Personally, I found it a little difficult to understand the code behind the make internal-install-airflow target, so I'm putting additional Airflow modules into the airflow.requirements.txt file, but I'm pretty sure that approach will also work after providing the appropriate Airflow and Python versions in the Makefile.

Run Airflow not from `root`

Hello!

Thank you for this template.

--

I'm trying to run Airflow as a user other than root:

To do this I changed the Dockerfile:

From

    # Install Airflow and any additional dependencies
    WORKDIR /root

To

    # Install Airflow and any additional dependencies
    WORKDIR /app

Then I created an airflow user and gave it full rights to the /app directory.

After that I changed every occurrence of /root in the docker-compose file, so the volumes now point to the /app directory.

--

I expected that when I boot up Airflow my DAGs would be up and running, but instead it seems like the DAGs have not been imported at all.

Could you please point me toward why this might happen? I couldn't find where the DAG folder is specified in the code :(

Upgrading Airflow minor versions

Hi @soggycactus, thanks for providing this useful template!

One question: I'm just starting off and noticed Airflow 2.0.1 is already out. Should I simply adjust AIRFLOW_VERSION in the Makefile to use it, or are there other prerequisites to take care of?

Thanks!

Psycopg2-binary Error: Service 'initdb' failed to build

Hello, I am currently attempting to run make start-airflow and am hitting the following error:

    executor failed running [/bin/sh -c make internal-install-airflow]: exit code: 2
    ERROR: Service 'initdb' failed to build
    make: *** [start-db] Error 1

I've traced this back to the Makefile attempting to pip install psycopg2-binary, which then throws the following error:

    Error: pg_config executable not found.

    pg_config is required to build psycopg2 from source.  Please add the directory
    containing pg_config to the $PATH or specify the full executable path with the
    option:

        python setup.py build_ext --pg-config /path/to/pg_config build ...

    or with the pg_config option in 'setup.cfg'.

    If you prefer to avoid building psycopg2 from source, please install the PyPI
    'psycopg2-binary' package instead.

    For further information please check the 'doc/src/install.rst' file (also at
    <https://www.psycopg.org/docs/install.html>).

It's funny, since I thought the whole point of pip installing psycopg2-binary was to avoid having to deal with these build requirements. When I attempt to install psycopg2-binary on its own in a fresh conda environment, the library installs just fine.

I ended up finding this post on the psycopg2 issues page, where someone said that adding libpq-dev to the list of dependencies in the Dockerfile solves the issue. I tried it and I am now able to run make start-airflow without any errors. I'm not sure whether this is a completely safe fix, but I wanted to point it out in case anyone else runs into it.
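For reference, the change amounts to something like this in the Dockerfile's apt-get step (adapted from the snippet in the upgrade issue above; treat it as a sketch):

    RUN apt-get update \
      && apt-get install -y --no-install-recommends build-essential libpq-dev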

Thanks!
