
Azure Databricks CI/CD template

Contents

1. Solution Overview

When building a project in Databricks, you can start from a notebook and implement the business logic in Python or Spark SQL. Before going to production, you need CI/CD pipelines. To reduce the effort of building them, this Git repository serves as a template that includes sample notebooks and unit tests together with CI/CD pipelines defined as Azure DevOps YAML files.

It is the scaffolding for an Azure Databricks project.

To keep it easy to extend, the notebooks and Python code contain only very simple logic, and the unit tests are implemented with pytest and nutter.

This template focuses on the CI/CD pipeline and demonstrates support for two approaches to implementing a Spark application: the notebook job and the Spark Python job. A Python package is also implemented and imported into the notebooks and the Spark Python job as a library.

1.1. Scope

The following list captures the scope of this template:

  1. Sample code for the two Databricks job types, plus a common Python library

    1. Notebook job.
    2. Spark Python job.
    3. A Python package used as a common library, imported by both the notebook and the Spark Python job.
  2. Testing

    1. pytest for the common library and the Spark Python job
    2. nutter for the notebooks
  3. DevOps pipelines that build, test, and deploy the Spark jobs.

Details about how to use this sample can be found in the later sections of this document.

1.2. Architecture

The below diagram illustrates the deployment process flow followed in this template:

architecture

1.3. Technologies used

The following technologies are used to build this template:

  • Azure Databricks
  • Azure DevOps
  • Python
  • pytest
  • nutter

2. How to use this template

This section describes how to use this template.

2.1. Prerequisites

The following are the prerequisites for deploying this template:

  1. GitHub account
  2. Azure DevOps account
  3. Azure account
  4. Azure Databricks workspace

2.2. Infrastructure as Code (IaC)

You need three Databricks workspaces, one each for the 'develop', 'staging', and 'production' environments. You can set up the Azure Databricks services with the IaC samples from here

2.3. Project Structure

|   .gitignore
|   pytest.ini
|   README.md
|   requirements.txt
|   setup.py
|               
+---common
|   |   module_a.py
|   |   __init__.py
|   |   
|   +---tests
|            module_a_test.py
|           
+---conf
|       deployment_notebook.json
|       deployment_notebook_new_cluster.json
|       deployment_spark_python.json
|       deployment_spark_python_new_cluster.json
|       
+---devops
|   |   lib-pipelines.yml
|   |   notebook-pipelines.yml
|   |   spark-python-pipelines.yml
|   |   
|   \---template
|           create-deployment-json.yml
|           deploy-lib-job.yml
|           deploy-notebook-job.yml
|           deploy-spark-python-job.yml
|           test-lib-job.yml
|           test-notebook-job.yml
|           test-spark-python-job.yml
|       
+---notebook_jobs
|   |   main_notebook_a.py
|   |   main_notebook_b.py
|   |   main_notebook_sql.py
|   |   module_b_notebook.py
|   |   
|   \---tests
|          main_notebook_a_test.py
|          main_notebook_b_test.py
|          main_notebook_sql_test.py
|          module_b_notebook_test.py             
|       
+---spark_python_jobs
    |   main.py
    |   __init__.py
    |   
    +---tests
        +---integration
        |      main_test.py
        |      __init__.py
        |           
        \---unit
               main_test.py
               __init__.py

2.4. The Notebook Approach

This approach supports the notebook job type, which is the typical way to build a Databricks application. The template contains four notebooks and four nutter-based testing notebooks.

  • main_notebook_a.py

    This notebook imports the library module "common.module_a" and uses its "add_mount" method (see the sketch after this list).

    from common.module_a import add_mount
  • main_notebook_b.py

    This notebook imports a method declared in the module_b_notebook.py.

    %run ./module_b_notebook
  • module_b_notebook.py

    This notebook declares a method and is used by the notebook main_notebook_b.py.

  • main_notebook_sql.py

    This notebook shows how to use Spark SQL to process data.

  • tests/main_notebook_a_test.py

    This is a nutter-based testing notebook; it runs the notebook under test as below.

    %run ../main_notebook_a

    It then compares the actual result with the expected result:

    class Test1Fixture(NutterFixture):
        def __init__(self):
            self.actual_df = None
            NutterFixture.__init__(self)
            
        def run_test_transform_data(self):
            self.actual_df = transform_data(df)
            
        def assertion_test_transform_data(self):
            assert(self.actual_df.collect() == expected_df.collect())
    
        def after_test_transform_data(self):
            print('done')
  • tests/main_notebook_b_test.py

    This is a nutter-based testing notebook; it runs the notebook under test as below.

    %run ../main_notebook_b

    It then compares the actual result with the expected result:

    class Test1Fixture(NutterFixture):
        def __init__(self):
            self.actual_df = None
            NutterFixture.__init__(self)
            
        def run_test_transform_data(self):
            self.actual_df = transform_data(df)
            
        def assertion_test_transform_data(self):
            assert(self.actual_df.collect() == expected_df.collect())
    
        def after_test_transform_data(self):
            print('done')
  • tests/module_b_notebook_test.py

    This is a nutter-based testing notebook; it runs the notebook under test as below.

    %run ../module_b_notebook

    It then compares the actual result with the expected result:

    class Test1Fixture(NutterFixture):
        def __init__(self):
            self.actual_df = None
            NutterFixture.__init__(self)
            
        def run_test_add_mount(self):
            self.actual_df = add_mount(df, 10)
            
        def assertion_test_add_mount(self):
            assert(self.actual_df.collect() == expected_df.collect())
    
        def after_test_add_mount(self):
            print('done')
  • tests/main_notebook_sql_test.py

    This is a nutter-based testing notebook; it runs the notebook under test as below.

    dbutils.notebook.run('../main_notebook_sql', 600)  
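
For reference, here is a minimal, hypothetical sketch of what common/module_a.py and its add_mount method could look like, assuming add_mount simply adds a constant amount to a numeric column; the actual implementation in the template may differ.

# common/module_a.py -- hypothetical sketch, not the template's actual code
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, lit


def add_mount(df: DataFrame, amount: int) -> DataFrame:
    # Add a constant amount to the 'value' column (the column name is an assumption)
    return df.withColumn("value", col("value") + lit(amount))

Keeping shared logic in a plain Python package like this is what allows it to be unit-tested with pytest outside of a notebook and uploaded to the cluster as a library.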

2.4.1 Repository setup

The bash script below creates a standalone Git repository. First create a project in Azure DevOps and a repository in that project, then replace [your repo url] in the code below with your repository URL.

mkdir [your project name]
cd [your project name]
git clone https://github.com/Azure-Samples/modern-data-warehouse-dataops.git 
cd modern-data-warehouse-dataops
git checkout single-tech/databricks-ops
git archive --format=tar single-tech/databricks-ops:single_tech_samples/databricks/sample4_ci_cd | tar -x -C ../
cd ..
rm -rf modern-data-warehouse-dataops

git init
git remote add origin [your repo url]
git add -A
git commit -m "first commit"
git push -u origin --all
git branch develop master
git branch staging develop
git branch production staging
git push -u origin develop
git push -u origin staging
git push -u origin production

After running the script, open your repository URL to check that the code has been pushed to the repository.

There are three branches in the repository:

  • The develop branch is the code base for development
  • The staging branch is for integration testing
  • The production branch is for production deployment

You can find the document on how to set branch policies.

2.4.2 DevOps pipeline setup

In this repo, there are several YAML files that define the CI/CD pipelines. You need to import each YAML file as a build pipeline.

pipelines

Here is a post that introduces how to import a YAML file from an Azure DevOps repository as an Azure DevOps pipeline.

  • Import ./devops/lib-pipelines.yml as a build pipeline. This pipeline tests the Python library and uploads it to the Databricks cluster as a library.

You need to select the branch to run the pipeline from for each environment.

  • Manually run the pipeline from develop branch to deploy the library to Databricks in develop environment
  • Manually run the pipeline from staging branch to deploy the library to Databricks in staging environment
  • Manually run the pipeline from production branch to deploy the library to Databricks in production environment

deploy-lib

If no library is required in your notebook project, remove the import statement from the notebooks.

  • Create three variable groups with the names below.

    • Databricks-dev-environment

    • Databricks-stg-environment

    • Databricks-prod-environment

      variable-group

    Each variable group has three variables:

    • databricksClusterId_[dev|stg|prod]: the ID of the Databricks cluster.

    • databricksDomain_[dev|stg|prod]: the URL of the Databricks workspace.

    • databricksToken_[dev|stg|prod]: the Databricks access token.

      variables

Here is the document on how to create variable groups, and the document on how to get a Databricks access token.
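
As an illustration of how these values are consumed, the hypothetical snippet below calls the Databricks REST API with the workspace URL and access token to check the target cluster; the environment variable names are assumptions for this example and are not defined by the template.

# Illustrative sketch: verify the cluster referenced by a variable group
import os

import requests

host = os.environ["DATABRICKS_HOST"]              # maps to databricksDomain_[dev|stg|prod]
token = os.environ["DATABRICKS_TOKEN"]            # maps to databricksToken_[dev|stg|prod]
cluster_id = os.environ["DATABRICKS_CLUSTER_ID"]  # maps to databricksClusterId_[dev|stg|prod]

response = requests.get(
    f"{host}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"cluster_id": cluster_id},
)
response.raise_for_status()
print(response.json()["state"])  # e.g. RUNNING or TERMINATED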

2.4.3 Import into DEV databricks workspace

By following this document, you can import the notebooks from the repository into the Databricks workspace.

add-repo

2.4.4 Implement and run tests in DEV databricks workspace

  • Switch to the develop branch.

  • Open one of the notebooks to edit.

  • Open the relevant testing notebook, run it, and check the results (an example of the final test cell is shown below).

  • Commit and push the changes to the develop branch.

    notebook-test
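
To run a nutter fixture interactively, the last cell of the testing notebook typically instantiates the fixture and executes it. A minimal example, based on nutter's documented usage:

# Final cell of a testing notebook: execute the fixture and report the results
result = Test1Fixture().execute_tests()
print(result.to_string())

# When the notebook is executed as a job (e.g. by the nutter CLI from a pipeline),
# pass the results back to the runner instead:
# result.exit(dbutils)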

2.4.5 Run test with pipelines

  • Create a pull request from the develop branch to the staging branch.
  • Complete the merge; this triggers the pipeline to run the tests on the staging Databricks cluster.

Or

  • Manually run the pipeline from staging branch

    notebook-staging

2.4.6 Deployment

  • Create a pull request from the staging branch to the production branch, or directly run the pipeline on the production branch.
  • Completing the merge triggers the pipeline to run the tests and import the notebooks into the production Databricks workspace.

Or

  • Manually run the pipeline from production branch

The pipeline does not create a job from the notebooks; it only imports them into the workspace.

notebook-prod

2.5. The Spark Python Approach

This approach supports the Spark submit job type. With it, you can develop the Spark application in a local IDE and submit it to a Databricks cluster to run.
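
A minimal, hypothetical sketch of what spark_python_jobs/main.py could look like is shown below; the transform logic and column names are assumptions, and the actual file in the template may differ.

# spark_python_jobs/main.py -- hypothetical sketch
from pyspark.sql import DataFrame, SparkSession

from common.module_a import add_mount


def transform_data(df: DataFrame) -> DataFrame:
    # Reuse the common library inside the Spark Python job
    return add_mount(df, 10)


def main():
    spark = SparkSession.builder.appName("spark-python-job").getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    transform_data(df).show()


if __name__ == "__main__":
    main()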

2.5.1 Repository setup

Please follow 2.4.1 Repository setup

2.5.2 DevOps pipeline setup

In this repo, there are several YAML files that define the CI/CD pipelines. You need to import each YAML file as a build pipeline.

2.5.3 Implement and run tests in VSCode

  • Clone the repo into a local folder and open the folder with VSCode
  • Set up local Spark by following this document
  • Open a terminal window and run the command below to set up the project for development.
pip install -r requirements.txt
  • Edit the main.py file in VSCode.

  • In the terminal window, run the commands below to start the tests (a sketch of such a unit test is shown after the output screenshot).

pytest common/tests
pytest spark_python_jobs/tests/unit

pytest-output
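
As referenced above, a unit test under spark_python_jobs/tests/unit might look like the sketch below, which creates a local SparkSession with a pytest fixture so no Databricks cluster is needed; the assertions assume the hypothetical add_mount and transform_data sketched earlier.

# spark_python_jobs/tests/unit/main_test.py -- hypothetical sketch
import pytest
from pyspark.sql import SparkSession

from spark_python_jobs.main import transform_data


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession for unit tests
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_transform_data(spark):
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    result = transform_data(df).collect()
    assert [row["value"] for row in result] == [11, 12]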

2.5.4 Run test with pipelines

  • Commit and push the changes to the develop branch.
  • Create a pull request from the develop branch to the staging branch.
  • Complete the merge; this triggers the pipeline to run the tests on the staging Databricks cluster.

Or

  • Manually run the pipeline from staging branch

    spark-python-staging

2.5.5 Deployment

  • Create a pull request from the staging branch to the production branch, or directly run the pipeline.
  • Complete the merge; this triggers the pipeline to run the tests and create a job in the production Databricks workspace.

Or

  • Manually run the pipeline from production branch

    spark-python-prod
