
airflow-data

Steps

  1. Build the Docker image: docker build .
  2. Start Docker Compose with the image hash from step 1: AIRFLOW_IMAGE_NAME={hash from above} docker-compose up
  3. Navigate to http://localhost:8081 and log in with airflow:airflow.
  4. Go to Admin -> Connections and add nessie-default as a Nessie connection type (host = http://nessie:19120/api/v1).
  5. Go to Admin -> Connections and add spark-cluster and spark-cluster-sql (host = spark://spark, port = 7077). Their types are spark and spark_sql respectively.
  6. Go to Admin -> Connections and add aws-nessie as an aws type with user = access_key and pass = secret_key.
  7. Run the example_spark_operator DAG.
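
The connections from steps 4 to 6 can also be written down as data instead of being clicked through in the UI. The sketch below only builds a JSON document keyed by connection id. The connection ids, hosts and port come from the steps above; the "nessie" conn_type string and the placeholder credentials are assumptions. Airflow 2 has an `airflow connections import` command that reads JSON of this shape, but whether it is available depends on the Airflow version in the image, so treat that as an assumption too.

```python
import json

# Connection ids, hosts and ports are taken from the setup steps.
# The "nessie" conn_type string and the credentials are assumptions/placeholders.
CONNECTIONS = {
    "nessie-default": {
        "conn_type": "nessie",  # assumed name of the type registered by the provider
        "host": "http://nessie:19120/api/v1",
    },
    "spark-cluster": {
        "conn_type": "spark",
        "host": "spark://spark",
        "port": 7077,
    },
    "spark-cluster-sql": {
        "conn_type": "spark_sql",
        "host": "spark://spark",
        "port": 7077,
    },
    "aws-nessie": {
        "conn_type": "aws",
        "login": "access_key",     # placeholder: replace with the real access key
        "password": "secret_key",  # placeholder: replace with the real secret key
    },
}


def render_connections(connections=CONNECTIONS):
    """Return the connections as a JSON document keyed by connection id."""
    return json.dumps(connections, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_connections())
```

Redirecting the output to a file gives something that can be kept next to the compose file and reviewed like any other config.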

Airflow provider

  • nessie_provider.hooks.nessie_hook - Defines a Hook in Airflow and exposes a connection in the UI
  • nessie_provider.operators.create - runs pynessie to create a ref
  • nessie_provider.operators.merge - runs pynessie to execute a merge

Example job

See dags/dummy_dag.py

  • Create a branch
  • Run Spark jobs to add two tables to the branch
  • Merge the branch
  • Delete the branch
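
The point of the four steps above is that the new tables stay invisible on the main ref until the merge. The sketch below illustrates that with a toy in-memory catalog. It is not pynessie and not the code in the DAG: the ToyCatalog class, the "etl" branch name and the table names are all hypothetical.

```python
class ToyCatalog:
    """In-memory stand-in for a Nessie-style catalog (hypothetical, not pynessie)."""

    def __init__(self):
        # Each ref maps to the set of table names visible on it.
        self.refs = {"main": set()}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source ref's state.
        self.refs[name] = set(self.refs[source])

    def add_table(self, branch, table):
        # Writes land only on the named branch.
        self.refs[branch].add(table)

    def merge(self, branch, target="main"):
        # Merging publishes everything on the branch to the target in one step.
        self.refs[target] |= self.refs[branch]

    def delete_branch(self, name):
        del self.refs[name]


def run_example(catalog):
    """Run the four steps and return main's tables before and after the merge."""
    catalog.create_branch("etl")
    catalog.add_table("etl", "table_a")  # stands in for the first Spark job
    catalog.add_table("etl", "table_b")  # stands in for the second Spark job
    visible_before_merge = set(catalog.refs["main"])
    catalog.merge("etl")
    catalog.delete_branch("etl")
    return visible_before_merge, set(catalog.refs["main"])
```

Readers of main see either neither table or both, never one without the other, which is why the Spark jobs write to a branch rather than to main directly.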

Still to do

  1. Figure out how to realistically handle the NessieSparkSql job.
  2. Possibly create a sensor and a job that uses it. The sensor could, for example, (a) wait for a table to change or (b) wait for a commit on a branch.
  3. Possibly add operators to expose more Nessie functionality.
  4. Package correctly and push to PyPI: correct names, correct docs, typing, black, testing, etc.
  5. Add packages and env to the Spark SQL operator on the Airflow GitHub.
  6. The SparkSql operator is annoying: it appears you can run only one at a time because it does something funky with a Derby DB. May just want to use spark-submit?
  7. Is there a way to submit SQL via a REST API, or via a PySpark job? Probably with Databricks.
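
For item 2, option (b), the core of a "wait for a commit on a branch" sensor is a polling loop that compares the branch head against a known baseline. The sketch below shows only that loop in plain Python. The get_head callable is hypothetical (it stands for whatever call returns the branch's current commit hash), and in Airflow this comparison would more likely sit inside a sensor's poke method than in a standalone function.

```python
import time


def wait_for_new_commit(get_head, baseline, poke_interval=1.0, timeout=60.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_head() until it differs from baseline, then return the new hash.

    get_head is a hypothetical callable returning the branch's current commit hash.
    Raises TimeoutError if the head has not changed once timeout seconds have passed.
    sleep and clock are injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        head = get_head()
        if head != baseline:
            # The branch moved: a new commit landed since the baseline was taken.
            return head
        if clock() >= deadline:
            raise TimeoutError("no new commit before timeout")
        sleep(poke_interval)
```

Option (a), waiting for a table to change, would follow the same pattern with a per-table value in place of the branch head.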

Contributors

  • rymurr
