dsaidgovsg / airflow-pipeline
An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR
License: Apache License 2.0
docker-compose -f docker-compose.test.yml up --build
Add some sample DAGs that use HDFS, Sqoop and Spark
Depending on the internet connection, fetching from jessie-backports sometimes fails with a hash sum mismatch (logs at the end). One guide suggests that this could be due to ISP caching: https://www.reddit.com/r/debian/comments/64xk33/jessiebackports_is_giving_a_hash_sum_mismatch/.
Consider using the newer Debian stretch, which doesn't require the above additional package; see example: https://github.com/datagovsg/airflow-pipeline/tree/debian-upgrade.
Error Log:
> docker build . -t datagovsg/airflow-pipeline --build-arg SPARK_VERSION=2.1.2 --build-arg HADOOP_VERSION=2.6.5 --build-arg SPARK_PY4J=python/lib/py4j-0.10.4-src.zip
...
Step 3/59 : RUN set -ex && (echo 'deb http://deb.debian.org/debian jessie-backports main' > /etc/apt/sources.list.d/backports.list) && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y --force-yes vim-tiny libsasl2-dev libffi-dev gosu krb5-user && rm -rf /var/lib/apt/lists/* && pip install --no-cache-dir "apache-airflow[devel_hadoop, crypto]==1.9.0" psycopg2 && pip install --no-cache-dir sqlalchemy==1.1.17
---> Running in 2b9b735eccda
+ echo deb http://deb.debian.org/debian jessie-backports main
+ apt-get update
Get:1 http://security.debian.org jessie/updates InRelease [94.4 kB]
Ign http://deb.debian.org jessie InRelease
Get:2 http://deb.debian.org jessie-updates InRelease [145 kB]
Get:3 http://deb.debian.org jessie-backports InRelease [166 kB]
Get:4 http://deb.debian.org jessie Release.gpg [2434 B]
Get:5 http://deb.debian.org jessie Release [148 kB]
Get:6 http://security.debian.org jessie/updates/main amd64 Packages [622 kB]
Get:7 http://deb.debian.org jessie-updates/main amd64 Packages [23.0 kB]
Get:8 http://deb.debian.org jessie-backports/main amd64 Packages [1170 kB]
Get:9 http://deb.debian.org jessie/main amd64 Packages [9064 kB]
Fetched 11.4 MB in 6s (1663 kB/s)
W: Failed to fetch http://deb.debian.org/debian/dists/jessie-backports/main/binary-amd64/Packages Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.
The command '/bin/sh -c set -ex && (echo 'deb http://deb.debian.org/debian jessie-backports main' > /etc/apt/sources.list.d/backports.list) && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y --force-yes vim-tiny libsasl2-dev libffi-dev gosu krb5-user && rm -rf /var/lib/apt/lists/* && pip install --no-cache-dir "apache-airflow[devel_hadoop, crypto]==1.9.0" psycopg2 && pip install --no-cache-dir sqlalchemy==1.1.17' returned a non-zero code: 100
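If the mismatch is transient, one possible workaround is to retry the index download a few times before giving up. The following is a minimal sketch under that assumption; `retry` is a hypothetical helper that is not part of this repository, and it will not help if a caching proxy keeps serving the same stale file.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to ATTEMPTS times,
# sleeping DELAY seconds between failed attempts.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n} failed, retrying in ${delay}s: $*" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Possible use inside the image build (not executed here):
#   retry 3 5 sh -c 'rm -rf /var/lib/apt/lists/* && apt-get update'
```

Clearing `/var/lib/apt/lists` before each attempt is meant to avoid reusing a partially downloaded index.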
Currently there are different builds for Spark-1.6 and Spark-2.1 (due to the datagovsg use case), but not for Airflow itself, which is somewhat strange since this is an Airflow repository.
Considering there is at least one major difference between v1.9 and v1.10, such as the setting of the system timezone (https://issues.apache.org/jira/browse/AIRFLOW-288), there should be separate builds for different Airflow versions.
It would be ideal to pair these with the various Spark version builds, while keeping the number of component combinations manageable.
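A small build matrix along these lines could be sketched as a dry run that only prints the commands. This is an illustration under assumptions: `AIRFLOW_VERSION` is a hypothetical build arg that the Dockerfile would need to accept, and the tag naming scheme and the `build_matrix` helper are invented for the example. `SPARK_VERSION` is the build arg used in the repository's existing build command.

```shell
#!/bin/bash
# Dry run: print one docker build command per Airflow/Spark pairing.
# AIRFLOW_VERSION and the tag format are assumptions for illustration.
build_matrix() {
  local airflow spark
  for airflow in 1.9.0 1.10.0; do
    for spark in 1.6.1 2.1.2; do
      echo docker build . \
        -t "datagovsg/airflow-pipeline:airflow-${airflow}_spark-${spark}" \
        --build-arg "AIRFLOW_VERSION=${airflow}" \
        --build-arg "SPARK_VERSION=${spark}"
    done
  done
}

build_matrix
```

Two versions per component gives four images, which keeps the number of combinations small.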
Is it possible to have both python 2 and 3 supported at the same time?
Reasons:
COPY ../xxx will fail)
run scheduler process and web server as independent containers
Since Python 2 will be sunset, any idea on how to update it to Python 3?
While running the Docker image, I receive the following error:
/entrypoint.sh: line 7: USER: unbound variable.
Code of entrypoint.sh:
#!/bin/bash
set -euo pipefail
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
POSTGRES_TIMEOUT=60
echo "Running as: ${USER}"
if [ "${USER}" != "root" ]; then
echo "Changing owner of files in ${AIRFLOW_HOME} to ${USER}"
chown -R "${USER}" ${AIRFLOW_HOME}
fi
set +e
Declaration of the USER environment variable in the Dockerfile:
ONBUILD ARG THEUSER=afpuser
ONBUILD ARG THEGROUP=hadoop
ONBUILD ENV USER ${THEUSER}
ONBUILD ENV GROUP ${THEGROUP}
ONBUILD RUN groupadd -r "${GROUP}" && useradd -rmg "${GROUP}" "${USER}"
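A possible explanation, not confirmed in this report: `ONBUILD` instructions only run when another image is built `FROM` this one, so running the base image directly would leave `USER` unset, and `set -u` then aborts on `${USER}`. Below is a minimal sketch of a guard, assuming that falling back to the effective user name is acceptable; `resolve_user` is a hypothetical helper and not part of the repository's entrypoint.sh.

```shell
#!/bin/bash
set -u

# Hypothetical helper: print USER if it is set and non-empty,
# otherwise the name of the effective user.
resolve_user() {
  printf '%s\n' "${USER:-$(id -un)}"
}

USER="$(resolve_user)"
export USER
echo "Running as: ${USER}"
```

The `${USER:-...}` expansion is safe under `set -u` because it supplies a default instead of dereferencing an unset variable.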
Support Spark 2.0 and Python 3.5
Here's a list of the top biggest subdirs from /:
19396 /usr/share/perl
21204 /usr/lib/python2.7/dist-packages
28280 /opt/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
32544 /lib
33452 /usr/include
37848 /usr/bin
40256 /var/lib/dpkg/info
41480 /var/lib/dpkg
41648 /usr/lib/python2.7
49452 /usr/local/lib/python2.7
51416 /var/lib
53928 /var
56680 /usr/local/lib
57504 /usr/local
62036 /usr/lib/gcc/x86_64-linux-gnu
62040 /usr/lib/gcc
91716 /usr/share/locale
95188 /opt/hadoop/share/doc
100252 /usr/lib/jvm/java-7-openjdk-amd64
100260 /usr/lib/jvm
217952 /opt/hadoop/share/hadoop
241476 /usr/lib/x86_64-linux-gnu
310444 /opt/spark-1.6.1/lib
313144 /opt/hadoop/share
318940 /opt/hadoop
322208 /opt/spark-1.6.1
453724 /usr/share/doc/openjdk-7-jre-headless
472720 /usr/lib
534528 /usr/share/doc
669432 /opt
700440 /usr/share
1303788 /usr
2074236 /
Seems like it would be a great idea to save 500MB by deleting /usr/share/doc
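A listing like the one above can be reproduced with `du` to check how much space a directory takes before and after cleanup. This is a sketch that assumes the sizes are in KiB; `top_dirs` is a hypothetical helper, and the original list may have been produced differently.

```shell
#!/bin/bash
# Hypothetical helper: list the COUNT largest directories under ROOT,
# smallest first, with sizes in KiB.
top_dirs() {
  local root="${1:-/}" count="${2:-10}"
  # -x stays on one filesystem; -k reports sizes in KiB.
  du -xk "$root" 2>/dev/null | sort -n | tail -n "$count"
}

# Possible use inside the container (not executed here):
#   top_dirs / 35
#   top_dirs /usr/share/doc 5
```

Note that in a Docker build, removing `/usr/share/doc` only shrinks the image if the `rm -rf` runs in the same `RUN` instruction that installed the files, since earlier layers keep their contents.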
with Mesos or Celery (so we can have more Spark drivers)