
dlt-files-in-repos-demo

This repository contains a demo of using the Files in Repos functionality with Databricks Delta Live Tables (DLT) to perform unit & integration testing of DLT pipelines.

The development workflow

The development workflow is organized as shown in the following image:

DLT development workflow

A more detailed description is available in the blog post Applying software development & DevOps best practices to Delta Live Table pipelines.

Setup instructions

🚧 Work in progress...

โš ๏ธ Setup instructions describe process of performing CI/CD using Azure DevOps (ADO), but similar thing could be implemented with any CI/CD technology.

There are two ways of setting up everything:

  1. using Terraform - this is the easiest way to get everything configured quickly. Just follow the instructions in the terraform/azuredevops/ folder. ⚠️ This doesn't include creation of the release pipeline, as there is neither a REST API nor a Terraform resource for it.
  2. manually - follow the instructions below to create all necessary objects.

Create necessary Databricks Repos checkouts

In this example we're using three checkouts of our sample repository (a REST API sketch for creating them follows the list):

  1. Development: used for actual development of new code, running tests before committing the code, etc.
  2. Staging: used to run tests on commits to branches and/or pull requests. This checkout is updated to the actual branch to which the commit happened. We're using a single checkout just for simplicity; in real life you would need to create such checkouts automatically so that multiple tests can run in parallel.
  3. Production: used to keep the production code - this checkout always stays on the releases branch, and is updated only when a commit happens to that branch and all tests pass.
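
The checkouts can also be created programmatically. Below is a minimal sketch using the Databricks Repos REST API (POST /api/2.0/repos); the workspace URL, token, Git URL, and /Repos paths are placeholders to adjust for your workspace:

```python
# Sketch: create the three Repos checkouts via the Repos REST API.
# HOST, TOKEN, GIT_URL, and the /Repos paths below are placeholders.
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
GIT_URL = "https://github.com/<org>/dlt-files-in-repos-demo"

for path in ("/Repos/Development/dlt-files-in-repos-demo",
             "/Repos/Staging/dlt-files-in-repos-demo",
             "/Repos/Production/dlt-files-in-repos-demo"):
    resp = requests.post(
        f"{HOST}/api/2.0/repos",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"url": GIT_URL, "provider": "gitHub", "path": path},
    )
    resp.raise_for_status()
    print("Created checkout at", path)
```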

Here is an example of repos created with Terraform:

Databricks repos

Create DLT pipelines

We need to create a few DLT pipelines for our work (a REST API sketch for creating one follows the list):

  1. for the main code that is used for development - use only the pipelines/DLT-Pipeline.py notebook from the development repository.
  2. (optional) for an integration test that could be run as part of development - from the development repository, use the main code notebook (pipelines/DLT-Pipeline.py) together with the integration test notebook (tests/integration/DLT-Pipeline-Test.py).
  3. for the integration test running as part of the CI/CD pipeline - similar to the previous item, but use the staging repository.
  4. for the production pipeline - use only the pipelines/DLT-Pipeline.py notebook from the production repository.
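
If you are creating the pipelines manually, here is a minimal sketch using the Delta Live Tables REST API (POST /api/2.0/pipelines); the pipeline name, target schema, and notebook paths are placeholders for illustration:

```python
# Sketch: create the staging integration-test DLT pipeline via the
# Pipelines REST API. HOST, TOKEN, name, target, and paths are placeholders.
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

payload = {
    "name": "DLT Integration Test (staging)",
    "development": True,   # run in development mode
    "continuous": False,   # triggered, not continuous
    "libraries": [
        {"notebook": {"path": "/Repos/Staging/dlt-files-in-repos-demo/pipelines/DLT-Pipeline"}},
        {"notebook": {"path": "/Repos/Staging/dlt-files-in-repos-demo/tests/integration/DLT-Pipeline-Test"}},
    ],
    "target": "dlt_demo_staging",  # database for published tables
}

resp = requests.post(f"{HOST}/api/2.0/pipelines",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=payload)
resp.raise_for_status()
print("Created pipeline:", resp.json()["pipeline_id"])
```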

Here is an example of pipelines created with Terraform:

Databricks DLT pipelines
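
For reference, the pipeline notebooks above follow the standard DLT Python pattern: tables are functions decorated with @dlt.table. A minimal sketch (the table names, source path, and expectation below are illustrative, not the demo's actual code):

```python
# Sketch of a DLT pipeline notebook in the style of pipelines/DLT-Pipeline.py.
# Table names, the source path, and the expectation are illustrative only;
# `spark` is provided by the DLT runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw records ingested from storage (illustrative)")
def raw_data():
    return spark.read.format("json").load("/databricks-datasets/iot/iot_devices.json")

@dlt.table(comment="Cleaned records derived from raw_data (illustrative)")
@dlt.expect_or_drop("valid_device", "device_name IS NOT NULL")
def clean_data():
    return dlt.read("raw_data").withColumn("ingested_at", F.current_timestamp())
```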

Create Databricks cluster

If you decide to run the notebooks with tests located in the tests/unit-notebooks directory, you will need to create a Databricks cluster that will be used by the Nutter library. To speed up the tests, attach the nutter & chispa libraries to the created cluster.
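
A Nutter test notebook is a fixture class: each run_<name> method executes code, and the matching assertion_<name> method checks the result. A minimal sketch (the fixture name and the view it checks are illustrative, not the demo's actual tests):

```python
# Sketch of a Nutter test notebook for tests/unit-notebooks; the fixture
# name and the temp view are illustrative. `spark` is provided by the
# Databricks notebook runtime.
from runtime.nutterfixture import NutterFixture

class SampleFixture(NutterFixture):
    def run_row_count(self):
        # Arrange & act: build a tiny DataFrame and expose it as a temp view
        spark.range(10).createOrReplaceTempView("sample_rows")

    def assertion_row_count(self):
        # Assert: the view contains the expected number of rows
        assert spark.table("sample_rows").count() == 10

result = SampleFixture().execute_tests()
print(result.to_string())
```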

If you don't want to run these tests, comment out the block with displayName "Execute Nutter tests" in azure-pipelines.yml.

Create ADO build pipeline

🚧 Work in progress...

The ADO build pipeline consists of two stages:

  • onPush is executed on a push to any Git branch except the releases branch and version tags. This stage only runs & reports unit test results (both local & notebook-based).
  • onRelease is executed only on commits to the releases branch and, in addition to the unit tests, executes a DLT pipeline with the integration test (see image below).

Stages of ADO build pipeline
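
For illustration, a local unit test of the kind the onPush stage runs could use pytest together with chispa's DataFrame assertions; the transformation under test (add_full_name) is hypothetical, not part of the demo:

```python
# Sketch of a local unit test run by the onPush stage; add_full_name is a
# hypothetical transformation used only for illustration.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from chispa import assert_df_equality

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so the test runs on the CI agent, not a cluster
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def add_full_name(df):
    # Hypothetical transformation: concatenate first and last names
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

def test_add_full_name(spark):
    source = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    expected = spark.createDataFrame(
        [("Ada", "Lovelace", "Ada Lovelace")],
        ["first_name", "last_name", "full_name"],
    )
    assert_df_equality(add_full_name(source), expected)
```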

Create ADO release pipeline

🚧 Work in progress...
