
airflow-pipeline's People

Contributors

chrissng, guangie88, jghoman, lawliet89, tingweiftw, xkjyeah, xtrntr


airflow-pipeline's Issues

Sample DAGs

Add some sample DAGs that use HDFS, Sqoop and Spark.
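A minimal sketch of what such a DAG could look like, assuming the Airflow 1.9-era API this image ships with; the connection string, table name, and script path are hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical sample DAG: Sqoop-import a table into HDFS, then run a Spark job over it.
dag = DAG(
    dag_id='sample_sqoop_spark',
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
)

# Pull a table from a relational database into HDFS.
sqoop_import = BashOperator(
    task_id='sqoop_import',
    bash_command='sqoop import --connect jdbc:postgresql://db/example '
                 '--table events --target-dir /data/events',
    dag=dag,
)

# Process the imported data with Spark.
spark_job = BashOperator(
    task_id='spark_job',
    bash_command='spark-submit --master yarn /opt/jobs/process_events.py',
    dag=dag,
)

sqoop_import >> spark_job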

jessie-backports sometimes gives hash sum mismatch

Depending on the internet connection, fetching from jessie-backports sometimes fails with a hash sum mismatch (log at the end). A Reddit thread suggests that this could be due to ISP caching: https://www.reddit.com/r/debian/comments/64xk33/jessiebackports_is_giving_a_hash_sum_mismatch/.

Consider using the newer Debian stretch, which doesn't need the jessie-backports source at all; see the example branch https://github.com/datagovsg/airflow-pipeline/tree/debian-upgrade and the sketch after the error log below.

Error Log:

> docker build . -t datagovsg/airflow-pipeline --build-arg SPARK_VERSION=2.1.2 --build-arg HADOOP_VERSION=2.6.5 --build-arg SPARK_PY4J=python/lib/py4j-0.10.4-src.zip

...

Step 3/59 : RUN set -ex     && (echo 'deb http://deb.debian.org/debian jessie-backports main' > /etc/apt/sources.list.d/backports.list)     && apt-get update     && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y --force-yes vim-tiny libsasl2-dev libffi-dev gosu krb5-user     && rm -rf /var/lib/apt/lists/*     && pip install --no-cache-dir "apache-airflow[devel_hadoop, crypto]==1.9.0" psycopg2     && pip install --no-cache-dir sqlalchemy==1.1.17
 ---> Running in 2b9b735eccda
+ echo deb http://deb.debian.org/debian jessie-backports main
+ apt-get update
Get:1 http://security.debian.org jessie/updates InRelease [94.4 kB]
Ign http://deb.debian.org jessie InRelease
Get:2 http://deb.debian.org jessie-updates InRelease [145 kB]
Get:3 http://deb.debian.org jessie-backports InRelease [166 kB]
Get:4 http://deb.debian.org jessie Release.gpg [2434 B]
Get:5 http://deb.debian.org jessie Release [148 kB]
Get:6 http://security.debian.org jessie/updates/main amd64 Packages [622 kB]
Get:7 http://deb.debian.org jessie-updates/main amd64 Packages [23.0 kB]
Get:8 http://deb.debian.org jessie-backports/main amd64 Packages [1170 kB]
Get:9 http://deb.debian.org jessie/main amd64 Packages [9064 kB]
Fetched 11.4 MB in 6s (1663 kB/s)
W: Failed to fetch http://deb.debian.org/debian/dists/jessie-backports/main/binary-amd64/Packages  Hash Sum mismatch

E: Some index files failed to download. They have been ignored, or old ones used instead.
The command '/bin/sh -c set -ex     && (echo 'deb http://deb.debian.org/debian jessie-backports main' > /etc/apt/sources.list.d/backports.list)     && apt-get update     && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y --force-yes vim-tiny libsasl2-dev libffi-dev gosu krb5-user     && rm -rf /var/lib/apt/lists/*     && pip install --no-cache-dir "apache-airflow[devel_hadoop, crypto]==1.9.0" psycopg2     && pip install --no-cache-dir sqlalchemy==1.1.17' returned a non-zero code: 100
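A minimal sketch of the stretch-based fix, keeping the same package set; the base image name is an assumption (the linked debian-upgrade branch may differ), but on stretch gosu is available from the main archive, so the failing backports fetch goes away entirely:

FROM python:2.7-stretch

# No jessie-backports needed on stretch; install everything from main.
RUN set -ex \
    && apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
        vim-tiny libsasl2-dev libffi-dev gosu krb5-user \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir "apache-airflow[devel_hadoop, crypto]==1.9.0" psycopg2 \
    && pip install --no-cache-dir sqlalchemy==1.1.17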

Provide multiple versions of Airflow

Currently there are separate builds for Spark 1.6 and Spark 2.1 (driven by datagovsg's use case), but not for Airflow itself, which is odd given that this is an Airflow repository.

Considering there is at least one major difference between v1.9 and v1.10, such as the handling of the system timezone (https://issues.apache.org/jira/browse/AIRFLOW-288), there should be separate builds for the different Airflow versions.

Ideally these would pair with the various Spark version builds, while keeping the number of component combinations from ballooning.
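A minimal sketch of how this could work, mirroring the existing SPARK_VERSION/HADOOP_VERSION build arguments; AIRFLOW_VERSION is a hypothetical name:

# In the Dockerfile: let the caller pick the Airflow release.
# (Note: 1.10.x additionally needs SLUGIFY_USES_TEXT_UNIDECODE=yes at install time.)
ARG AIRFLOW_VERSION=1.9.0
RUN pip install --no-cache-dir "apache-airflow[devel_hadoop,crypto]==${AIRFLOW_VERSION}"

Builds for the two versions would then differ only in a flag and a tag:

docker build . -t datagovsg/airflow-pipeline:airflow-1.9 --build-arg AIRFLOW_VERSION=1.9.0
docker build . -t datagovsg/airflow-pipeline:airflow-1.10 --build-arg AIRFLOW_VERSION=1.10.1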

Python 3

Is it possible to support both Python 2 and Python 3 at the same time?

Move Docker files back to root

Reasons:

  • It is unconventional to keep them in a subdirectory.
  • Docker cannot COPY files from above the build context (i.e. COPY ../xxx will fail).
  • Building the image needs unnecessary extra flags, so there are more places to make mistakes (see the comparison below).
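For illustration, assuming the Dockerfile currently lives in a hypothetical docker/ subdirectory:

# Dockerfile in a subdirectory: the build needs an explicit -f flag,
# with the repository root still passed as the context.
docker build -f docker/Dockerfile -t datagovsg/airflow-pipeline .

# Dockerfile at the repository root: the conventional invocation suffices.
docker build -t datagovsg/airflow-pipeline .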

Update to Python 3

Since Python 2 will be sunset, any idea on how to update the image to Python 3?
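A minimal, untested sketch of the likely change, assuming a Python 3 base image and an Airflow release that supports Python 3 (the exact base and versions here are assumptions, and the bundled Spark releases may not support every Python 3 minor version):

FROM python:3.6-stretch

# Airflow 1.10.x runs on Python 3; the env var is required by its installer.
RUN SLUGIFY_USES_TEXT_UNIDECODE=yes \
    pip install --no-cache-dir "apache-airflow[devel_hadoop,crypto]==1.10.3" psycopg2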

USER: unbound variable

While running the Docker image, I am receiving the following error:
/entrypoint.sh: line 7: USER: unbound variable

Code of entrypoint.sh:

#!/bin/bash

set -euo pipefail

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
POSTGRES_TIMEOUT=60

echo "Running as: ${USER}"
if [ "${USER}" != "root" ]; then
echo "Changing owner of files in ${AIRFLOW_HOME} to ${USER}"
chown -R "${USER}" ${AIRFLOW_HOME}
fi

set +e

Declaration of the USER environment variable in the Dockerfile:

# Delay creation of user and group

ONBUILD ARG THEUSER=afpuser
ONBUILD ARG THEGROUP=hadoop

ONBUILD ENV USER ${THEUSER}
ONBUILD ENV GROUP ${THEGROUP}
ONBUILD RUN groupadd -r "${GROUP}" && useradd -rmg "${GROUP}" "${USER}"
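Because the user-creation instructions are all ONBUILD, they only take effect when a child image is built FROM this one; running this image directly means USER is never set, and set -u aborts the script at the first reference. A minimal sketch of a defensive fix in entrypoint.sh, defaulting to root when the variable is absent:

#!/bin/bash
set -euo pipefail

# Fall back to root when USER was never exported, e.g. when the image
# is run directly instead of through an ONBUILD child build.
USER="${USER:-root}"

echo "Running as: ${USER}"

Alternatively, the image could be run with -e USER=root to provide the variable explicitly.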

Reduce image size by deleting docs

Here's a list of the biggest subdirectories under / (du output; sizes in KB):

19396 /usr/share/perl
21204 /usr/lib/python2.7/dist-packages
28280 /opt/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
32544 /lib
33452 /usr/include
37848 /usr/bin
40256 /var/lib/dpkg/info
41480 /var/lib/dpkg
41648 /usr/lib/python2.7
49452 /usr/local/lib/python2.7
51416 /var/lib
53928 /var
56680 /usr/local/lib
57504 /usr/local
62036 /usr/lib/gcc/x86_64-linux-gnu
62040 /usr/lib/gcc
91716 /usr/share/locale
95188 /opt/hadoop/share/doc
100252 /usr/lib/jvm/java-7-openjdk-amd64
100260 /usr/lib/jvm
217952 /opt/hadoop/share/hadoop
241476 /usr/lib/x86_64-linux-gnu
310444 /opt/spark-1.6.1/lib
313144 /opt/hadoop/share
318940 /opt/hadoop
322208 /opt/spark-1.6.1
453724 /usr/share/doc/openjdk-7-jre-headless
472720 /usr/lib
534528 /usr/share/doc
669432 /opt
700440 /usr/share
1303788 /usr
2074236 /

Seems like it would be a great idea to save about 500 MB by deleting /usr/share/doc.
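A minimal sketch of that cleanup in the Dockerfile; the dpkg path-exclude file (its name here is illustrative) additionally stops later apt-get installs from re-adding documentation:

# Drop existing documentation and man pages (~500 MB, mostly the JRE docs).
RUN rm -rf /usr/share/doc /usr/share/man

# Keep apt from installing docs in subsequent layers.
RUN printf 'path-exclude /usr/share/doc/*\npath-exclude /usr/share/man/*\n' \
    > /etc/dpkg/dpkg.cfg.d/01_nodoc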
