databricks / mlops-stacks Goto Github PK

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box.

Home Page: https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html

License: Apache License 2.0

Python 95.42% Shell 4.58%

databricks machine-learning mlops

mlops-stacks's Introduction

Databricks MLOps Stacks

NOTE: This feature is in public preview.

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box.

Using Databricks MLOps Stacks, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML resources management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured. More information can be found at https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html.

The default stack in this repo includes three modular components:

Component	Description	Why it's useful
ML Code	Example ML project structure (training and batch inference, etc), with unit tested Python modules and notebooks	Quickly iterate on ML problems, without worrying about refactoring your code into tested modules for productionization later on.
ML Resources as Code	ML pipeline resources (training and batch inference jobs, etc) defined through databricks CLI bundles	Govern, audit, and deploy changes to your ML resources (e.g. "use a larger instance type for automated model retraining") through pull requests, rather than adhoc changes made via UI.
CI/CD(GitHub Actions or Azure DevOps)	GitHub Actions or Azure DevOps workflows to test and deploy ML code and resources	Ship ML code faster and with confidence: ensure all production changes are performed through automation and that only tested code is deployed to prod

See the FAQ for questions on common use cases.

ML pipeline structure and development loops

An ML solution comprises data, code, and models. These resources need to be developed, validated (staging), and deployed (production). In this repository, we use the notion of dev, staging, and prod to represent the execution environments of each stage.

An instantiated project from MLOps Stacks contains an ML pipeline with CI/CD workflows to test and deploy automated model training and batch inference jobs across your dev, staging, and prod Databricks workspaces.

Data scientists can iterate on ML code and file pull requests (PRs). This will trigger unit tests and integration tests in an isolated staging Databricks workspace. Model training and batch inference jobs in staging will immediately update to run the latest code when a PR is merged into main. After merging a PR into main, you can cut a new release branch as part of your regularly scheduled release process to promote ML code changes to production.

Develop ML pipelines

mlops_stacks_01_ml_dev.mov

Create a PR and CI

mlops_stacks_02_create_pr.mov

Merge the PR and deploy to Staging

mlops_stacks_03_merge_PR.mov

mlops_stacks_04_deploy_to_staging.mov

Deploy to Prod

mlops_stacks_05_release.mov

See this page for detailed description and diagrams of the ML pipeline structure defined in the default stack.

Using MLOps Stacks

Prerequisites

Python 3.8+
Databricks CLI >= v0.221.0

Databricks CLI contains Databricks asset bundle templates for the purpose of project creation.

Please follow the instruction to install and set up databricks CLI. Releases of databricks CLI can be found in the releases section of databricks/cli repository.

Databricks asset bundles and Databricks asset bundle templates are in public preview.

Start a new project

To create a new project, run:

databricks bundle init mlops-stacks

This will prompt for parameters for initialization. Some of these parameters are required to get started:

input_setup_cicd_and_project : If both CI/CD and the project should be set up, or only one of them.
- CICD_and_Project - Setup both CI/CD and project, the default option.
- Project_Only - Setup project only, easiest for Data Scientists to get started with.
- CICD_Only - Setup CI/CD only, likely for monorepo setups or setting up CI/CD on an already initialized project. We expect Data Scientists to specify Project_Only to get started in a development capacity, and when ready to move the project to Staging/Production, CI/CD can be set up. We expect that step to be done by Machine Learning Engineers (MLEs) who can specify CICD_Only during initialization and use the provided workflow to setup CI/CD for one or more projects.
input_root_dir: name of the root directory. When initializing with CICD_and_Project, this field will automatically be set to input_project_name.
input_cloud: Cloud provider you use with Databricks (AWS, Azure, or GCP).

Others must be correctly specified for CI/CD to work:

input_cicd_platform : CI/CD platform of choice (GitHub Actions or GitHub Actions for GitHub Enterprise Servers or Azure DevOps)
input_databricks_staging_workspace_host: URL of staging Databricks workspace, used to run CI tests on PRs and preview config changes before they're deployed to production. We encourage granting data scientists working on the current ML project non-admin (read) access to this workspace, to enable them to view and debug CI test results
input_databricks_prod_workspace_host: URL of production Databricks workspace. We encourage granting data scientists working on the current ML project non-admin (read) access to this workspace, to enable them to view production job status and see job logs to debug failures.
input_default_branch: Name of the default branch, where the prod and staging ML resources are deployed from and the latest ML code is staged.
input_release_branch: Name of the release branch. The production jobs (model training, batch inference) defined in this repo pull ML code from this branch.

Or used for project initialization:

input_project_name: name of the current project
input_read_user_group: User group name to give READ permissions to for project resources (ML jobs, integration test job runs, and machine learning resources). A group with this name must exist in both the staging and prod workspaces. Defaults to "users", which grants read permission to all users in the staging/prod workspaces. You can specify a custom group name e.g. to restrict read permissions to members of the team working on the current ML project.
input_include_models_in_unity_catalog: If selected, models will be registered to Unity Catalog. Models will be registered under a three-level namespace of <catalog>.<schema_name>.<model_name>, according the the target environment in which the model registration code is executed. Thus, if model registration code runs in the prod environment, the model will be registered to the prod catalog under the namespace <prod>.<schema>.<model_name>. This assumes that the respective catalogs exist in Unity Catalog (e.g. dev, staging and prod catalogs). Target environment names, and catalogs to be used are defined in the Databricks bundles files, and can be updated as needed.
input_schema_name: If using Models in Unity Catalog, specify the name of the schema under which the models should be registered, but we recommend keeping the name the same as the project name. We default to using the same schema_name across catalogs, thus this schema must exist in each catalog used. For example, the training pipeline when executed in the staging environment will register the model to staging.<schema_name>.<model_name>, whereas the same pipeline executed in the prod environment will register the mode to prod.<schema_name>.<model_name>. Also, be sure that the service principals in each respective environment have the right permissions to access this schema, which would be USE_CATALOG, USE_SCHEMA, MODIFY, CREATE_MODEL, and CREATE_TABLE.
input_unity_catalog_read_user_group: If using Models in Unity Catalog, define the name of the user group to grant EXECUTE (read & use model) privileges for the registered model. Defaults to "account users".
input_include_feature_store: If selected, will provide Databricks Feature Store stack components including: project structure and sample feature Python modules, feature engineering notebooks, ML resource configs to provision and manage Feature Store jobs, and automated integration tests covering feature engineering and training.
input_include_mlflow_recipes: If selected, will provide MLflow Recipes stack components, dividing the training pipeline into configurable steps and profiles.

See the generated README.md for next steps!

Customize MLOps Stacks

Your organization can use the default stack as is or customize it as needed, e.g. to add/remove components or adapt individual components to fit your organization's best practices. See the stack customization guide for more details.

FAQ

Do I need separate dev/staging/prod workspaces to use MLOps Stacks?

We recommend using separate dev/staging/prod Databricks workspaces for stronger isolation between environments. For example, Databricks REST API rate limits are applied per-workspace, so if using Databricks Model Serving, using separate workspaces can help prevent high load in staging from DOSing your production model serving endpoints.

However, you can create a single workspace stack, by supplying the same workspace URL for input_databricks_staging_workspace_host and input_databricks_prod_workspace_host. If you go this route, we recommend using different service principals to manage staging vs prod resources, to ensure that CI workloads run in staging cannot interfere with production resources.

I have an existing ML project. Can I productionize it using MLOps Stacks?

Yes. Currently, you can instantiate a new project and copy relevant components into your existing project to productionize it. MLOps Stacks is modularized, so you can e.g. copy just the GitHub Actions workflows under .github or ML resource configs under {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/resources and {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/databricks.yml into your existing project.

Can I adopt individual components of MLOps Stacks?

For this use case, we recommend instantiating via Databricks asset bundle templates and copying the relevant subdirectories. For example, all ML resource configs are defined under {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/resources and {{.input_root_dir}}/{{template `project_name_alphanumeric_underscore` .}}/databricks.yml, while CI/CD is defined e.g. under .github if using GitHub Actions, or under .azure if using Azure DevOps.

Can I customize my MLOps Stack?

Yes. We provide the default stack in this repo as a production-friendly starting point for MLOps. However, in many cases you may need to customize the stack to match your organization's best practices. See the stack customization guide for details on how to do this.

Does the MLOps Stacks cover data (ETL) pipelines?

Since MLOps Stacks is based on databricks CLI bundles, it's not limited only to ML workflows and resources - it works for resources across the Databricks Lakehouse. For instance, while the existing ML code samples contain feature engineering, training, model validation, deployment and batch inference workflows, you can use it for Delta Live Tables pipelines as well.

How can I provide feedback?

Please provide feedback (bug reports, feature requests, etc) via GitHub issues.

Contributing

We welcome community contributions. For substantial changes, we ask that you first file a GitHub issue to facilitate discussion, before opening a pull request.

MLOps Stacks is implemented as a Databricks asset bundle template that generates new projects given user-supplied parameters. Parametrized project code can be found under the {{.input_root_dir}} directory.

Installing development requirements

To run tests, install actionlint, databricks CLI, npm, and act, then install the Python dependencies listed in dev-requirements.txt:

pip install -r dev-requirements.txt

Running the tests

NOTE: This section is for open-source developers contributing to the default stack in this repo. If you are working on an ML project using the stack (e.g. if you ran databricks bundle init to start a new project), see the README.md within your generated project directory for detailed instructions on how to make and test changes.

Run unit tests:

pytest tests

Run all tests (unit and slower integration tests):

pytest tests --large

Run integration tests only:

pytest tests --large-only

Previewing changes

When making changes to MLOps Stacks, it can be convenient to see how those changes affect a generated new ML project. To do this, you can create an example project from your local checkout of the repo, and inspect its contents/run tests within the project.

We provide example project configs for Azure (using both GitHub and Azure DevOps), AWS (using GitHub), and GCP (using GitHub) under tests/example-project-configs. To create an example Azure project, using Azure DevOps as the CI/CD platform, run the following from the desired parent directory of the example project:

# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/azure/azure-devops.json"

To create an example AWS project, using GitHub Actions for CI/CD, run:

# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/aws/aws-github.json"

To create an example GCP project, using GitHub Actions for CI/CD, run:

# Note: update MLOPS_STACKS_PATH to the path to your local checkout of the MLOps Stacks repo
MLOPS_STACKS_PATH=~/mlops-stacks
databricks bundle init "$MLOPS_STACKS_PATH" --config-file "$MLOPS_STACKS_PATH/tests/example-project-configs/gcp/gcp-github.json"

mlops-stacks's People

Contributors

Stargazers

Watchers

Forkers

smurching vadim yinxi-db david-tempelmann shubhm13 alinetleitao niall-turbitt isabella232 debu-projects radhakrishnang douglasmendes mingyu89 kaija tatari-tv nuwan-db ahlag djfliu-uplift sk-nagarro bigdatavik fflory emoneyadvisor andreasjaeck sjuratov huarteto michael-chi bnugent2 paulofbmarcon aowen5000 torlar daynesorvisto chssunil ulc0 ralphliang wirelessfuture jinlinhe-db mzdbc brycelin alexxx-db beam-phattra philsalm wsilveira-usp nikie trungducnguyenvn multiconsult-group sanjeev-db sanjeev5079 spiderreda hemachandrand shuxiangzhang suvojyoti gauravm999 krishnamenon22 natasha-databricks mondalankur mabreuortega bronifty menhswu steviebuchicago danilodatabricks sathya-reddy-m esenthil2018 tagar ego jimdowling qili86 julio-pimentel pranavmasariya lxizquierdom-lk brunoscaglione aureliesalmon mniels17 hjertensgaard rasheed19 shahed9338 archanaashetty pietern bokawdg boskalis-python taocao seragentp johnsonphil ilialat dangz90 arpitjasa-db johnnypan0513 christopher-danz shaotong-db shumonster timarif valeriogarsi01 sumitsahaykoantek nlauchande morningcloud rt-databricks bhanuprakashsamoju eric-golinko-db xpacifica prvnmali2017 saurabhshukla-db peleja

mlops-stacks's Issues

Feature Store component with Azure DevOps CI/CD is not supported yet

Creating formal issue to push support for feature store with ADO CI/CD

Model validation with feature store should be supported.

Mode validation uses mlflow.evaluate. Currently it doesn't support evaluating models registered with databricks feature store.

Set repo as template repository

Hello Databricks,

Given the below:

We provide the default stack in this repo as a production-friendly starting point for MLOps.

Would it be possible/more-appropriate to set this repository as a template repository?

This would mean that users could easily create private "custom stacks" (e.g. for organizations) as per your Stack Customization Guide without the need to do a complex private fork procedures.

I look forward to hearing back!

Troubleshoot Azure DevOps guide: Issues arise when there's an existing resource group in the same subscription

Problem: I configured my resource group to be mlops-stack-ado but there already exists another resource group in the same subscription.

Fix: I manually changed all resource group names and storage account names to avoid the clash, but Terraform is not happy. I eventually fixed the clashes by running terraform init -reconfigure and terraform init -migrate-state in:

databricks-config/staging
databricks-config/prod
.mlops-setup-scripts/cicd

In hindsight, it probably would have been easier to just start the project from scratch, supplying different resource group and storage account names. Hope this helps someone out there!

Model Serving not working

Model Serving is getting failed.
Reason: Requires publishing the features to an online store which is missing

Was not able to create online store as mentioned in the documentation( though we don't have specific documentation for GCP )
https://docs.databricks.com/en/machine-learning/feature-store/online-tables.html

Was not able to see the menu for online store

Please let us know if there is specific steps that needs to be followed.

@puviarasu17

Failed to read recipe configuration

Getting "recipe.yaml" file not find error while triggering CI pipeline. While this file exists in repo and if we run the notebook manually in Databricks it works without any error.

Error message:
MlflowException: Failed to read recipe configuration. Please verify that the recipe.yaml configuration file and the YAML configuration file for the selected profile are syntactically correct and that the specified profile provides all required values for template substitutions defined in recipe.yaml.

Plans to use Bicep for ML Resource Config as Code

Hi team, are you planning to provide Bicep templates for ML Resource Config as Code? You are already doing that for Terraform, would be great to have an equivalent in Bicep.

Broken: databricks bundle init https://github.com/databricks/mlops-stack

Hi,

Executing this command:

databricks bundle init https://github.com/databricks/mlops-stack

yields:

Error: default value "{{if eq .input_cloud `azure`}}https://adb-xxxx.xx.azuredatabricks.net{{else if eq .input_cloud `aws`}}https://your-staging-workspace.

Looking at the commit history this seems to be related:

1159734#diff-a8a271588ec48ba602da95509392039ce200fd9deea23c61111ef867725aff19R39

Can you please fix / provide a workaround?

Thanks

New update with mlflow experiments causing _pickle.PicklingError

Since the update to mlflow integration with hyperopt where names are automatically assigned to experiments (such as smiling-worm-674), I began getting the following error consistently when running a previously working mlflow experiment with SparkTrials().

ERROR:hyperopt-spark:trial task 0 failed, exception is
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 405.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 405.0 (TID 1472) (10.143.252.81 executor 0):
org.apache.spark.api.python.PythonException: '_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range'

However, my experiment is not doing any pickling and my code is not referenced in the full traceback, so I am not exactly sure what the issue is. I can confirm that the experiment works when using hyperopt.Trials() rather than hyperopt.SparkTrials(). Apologies for such a lengthy issue, and sorry if the issue is some simple mistake on my end!

Here is the full traceback:

Full Traceback

Traceback (most recent call last):
 File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
   return Pickler.dump(self, obj)
 File "/databricks/python/lib/python3.9/site-packages/patsy/origin.py", line 117, in __getstate__
   raise NotImplementedError
NotImplementedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/databricks/spark/python/pyspark/serializers.py", line 527, in dumps
   return cloudpickle.dumps(obj, pickle_protocol)
 File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
   cp.dump(obj)
 File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 604, in dump
   if "recursion" in e.args[0]:
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/databricks/spark/python/pyspark/worker.py", line 876, in main
   process()
 File "/databricks/spark/python/pyspark/worker.py", line 868, in process
   serializer.dump_stream(out_iter, outfile)
 File "/databricks/spark/python/pyspark/serializers.py", line 329, in dump_stream
   bytes = self.serializer.dumps(vs)
 File "/databricks/spark/python/pyspark/serializers.py", line 537, in dumps
   raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range

   at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:692)
   at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:902)
   at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:884)
   at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:645)
   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
   at scala.collection.Iterator.foreach(Iterator.scala:943)
   at scala.collection.Iterator.foreach$(Iterator.scala:943)
   at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
   at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
   at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
   at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
   at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
   at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1029)
   at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
   at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
   at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
   at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
   at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
   at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
   at org.apache.spark.scheduler.Task.run(Task.scala:96)
   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:889)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1692)
   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:892)
   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:747)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3257)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3189)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3180)
   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3180)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1414)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1414)
   at scala.Option.foreach(Option.scala:407)
   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1414)
   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3466)
   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3407)
   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3395)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1166)
   at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2702)
   at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1027)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:411)
   at org.apache.spark.rdd.RDD.collect(RDD.scala:1025)
   at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:282)
   at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
   at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
   at py4j.Gateway.invoke(Gateway.java:306)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
   at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
   at java.lang.Thread.run(Thread.java:748)

The following is the code that is being run in the experiments:

mlflow.start_run()

spark_trials = SparkTrials(parallelism=16)

with mlflow.start_run(run_name='test_experiment'):
  best_result = fmin(
    fn=objective, 
    space=space,
    algo=tpe.suggest,
    max_evals=1024,
    trials=spark_trials)

Hyperopt Optimization Function

def objective(args):
    
    # Initialize model pipeline
    pipe = Pipeline(steps=[
        ('selection', args['selection'])
    ])
    
    pipe.set_params(**args['params']) # Model parameters will be set here
    pipe.fit(X, y)
    penalty = pipe['selection'].penalty_
    try:
        residual = np.sum(pipe['selection']._resid) / len(pipe['selection']._resid)
    except AttributeError:
        residual = -10000
    r2 = r2_score(y, pipe.predict(X))
    score = 1 - r2
    mean_square = mean_squared_error(y, pipe.predict(X))
    mlflow.log_metric('avg_residual', residual)
    mlflow.log_metric('mean_squared_error', mean_square)
    mlflow.log_metric('penalty', penalty)
    mlflow.log_metric('r2', r2)

    print(f"Model Name: {args['selection']}: ", score)
          
    # Since we have to minimize the score, we return 1- score.
    return {'loss': score, 'status': STATUS_OK}

Here are the parameters and parameter space:

Params and Parameter Space

params = {
  'selection__fixed': hp.choice('selection.fixed', fixed_arrs),
  'selection__random': hp.choice('selection.random', random_arrs),
  'selection__intercept': hp.choice('selection.intercept', (0, 1)),
  'selection__cov': hp.choice('selection.cov', (0, 1))
  }

space = hp.choice('regressors', [
    {
    'selection':LMEBaseRegressor(group=['panel'],
                                 dependent=dependent,
                                 media=media_cols),
    'params': params
    }
  ]
)

And finally here is the regressor I am using (including because its a custom class built ontop of sklearn):

LMEBaseRegressor Class

class LMEBaseRegressor(BaseEstimator, RegressorMixin):
    """Implementation of an LME Regression for scikit."""

    def __init__(self, random=None, fixed=None,
                 group=['panel'], dependent=None,
                 intercept=0, cov=0, media=None):
        self.random = random
        self.fixed = fixed
        self.group = group
        self.dependent = dependent
        self.intercept = intercept
        self.cov = cov
        self.media = media

    def fit(self, X, y):
        """Fit the model with LME."""
        str_dep = self.dependent[0]
        str_fixed = ' + '.join(self.fixed)
        str_random = ' + '.join(self.random)
        data = pd.concat([X, y], axis=1)
        self.penalty_ = 0
        print(f"{str_dep} ~ {self.intercept} + {str_fixed}")
        print(f"{self.cov} + {str_random}")
        try:
            mixed = smf.mixedlm(f"{str_dep} ~ {self.intercept} + {str_fixed}",
                                data,
                                re_formula=f"~ {self.cov} + {str_random}",
                                groups=data['panel'],
                                use_sqrt=True)\
                .fit(method=['lbfgs'])
            self._model = mixed
            self._resid = mixed.resid
            self.coef_ = mixed.params[0:len(self.fixed)]                    
        
        except(ValueError):
            print("Cannot predict random effects from singular covariance structure.")
            self.penalty_ = 100

        except(np.linalg.LinAlgError):
            print("Linear Algebra Error: recheck base model fit or try using fewer variables.")
            self.penalty_  = 100
        return self

    def predict(self, X):
        """Take the coefficients provided from fit and multiply them by X."""
        if self.penalty_ != 0:
            return np.ones(len(X)) * -100 * self.penalty_
        return self._model.predict(X)

Unable to deploy ML Resources

Hi Team,

I'm following the instrutions to setup a demo MLOPs project using the sample code. For referece, I have successfully completed the steps detailed in

ML quickstart
MLOps setup guide

I have a wroking repo with the CICD workflow setup in Github. However, as I started to provision databricks resources following the instruction from ML resource config guide. I got the following error from the Terraform CI check.

I could run az login locally fine. Any idea what could be wrong here?

Thanks in advance for any help or tips.

Broken link in documentation

The README.md at the top level contains a broken link:

dlt-meta: an example stack with CI/CD for Delta Live Tables pipelines.

The dlt-meta project does not exist, or is not public. Can this link be adjusted?

Videos not rendering on README page

Try using video src instead.

Compute cluster[Shared] for Service Principle to execute ML related workflows from GitHub Actions

We are using the below cluster configuration in our template project created from mlops-stacks with Feature store and Unity Catalog options enabled. When we run, we are getting the below exception in feature-engineering-workflow-asset.yml when Feature store is trying to create table in Unity catalog.

Note: We have the expected 'test' catalog in our metastore and the service principal has the right access.

Cluster Configuration in template project created from mlops-stacks:

new_cluster: &new_cluster
  new_cluster:
    num_workers: 1
    spark_version: 13.3.x-gpu-ml-scala2.12
    node_type_id: g2-standard-4
    custom_tags:
      clusterSource: mlops-stack

Exception in GitHub actions:
ValueError: Catalog 'test' does not exist in the metastore.

For an exploration of this issue, we tried the same notebook in an all-purpose cluster with shared access. We get the same exception. Also, we are getting the below exception when we try the sql query: SELECT CURRENT_METASTORE();

Exception in notebook attached to all-purpose shared cluster for the above sql:
AnalysisException: [OPERATION_REQUIRES_UNITY_CATALOG] Operation CURRENT_METASTORE requires Unity Catalog enabled.

Setting the spark config spark.databricks.unityCatalog.enabled to true is not working.

Can you please suggest the correct compute config we should be using for mlops-stacks with unity catalog and feature store enabled?

r.inspect() exception with Release 2.0.1

In Databricks 12.2LTS, when running r.inspect() within training/notebooks/Train get the following exception:
Uncaught TypeError: Cannot read properties of null (reading 'querySelectorAll')

When updating requirements.txt from mlflow==2.0.1 to mlflow==2.2.2 was able to get r.inspect() to render the pipeline graphic.

`bundle init` asks for schema to use when registering a model in Unity Catalog when selected to not use Unity Catalog for Model Registry

databricks bundle init mlops-stacks prompts to set schema and privileges in Unity Catalog when I selected that I don't want to use Unity Catalog:

# other steps

Whether to use the Model Registry with Unity Catalog: no

Name of schema to use when registering a model in Unity Catalog. 
Note that this schema must already exist, and we recommend keeping the name the same as the project name as well as giving the service principals the right access. Default [my-mlops-project]: 

User group name to give EXECUTE privileges to models in Unity Catalog. A group with this name must exist in the Unity Catalog that the staging and prod workspaces can access. Default [account users]: 

Whether to include Feature Store: no

Whether to include MLflow Recipes: no

✨ Your MLOps Stack has been created in the 'my-mlops-project-no-unity' directory!

Please refer to the README.md of your project for further instructions on getting started.

Side question: will this stack work if I don't have Unity Catalog enabled at all in my workspace?

Generate AAD Token stage of integration_test does not fail with 404 error

I hadn't run both of the bootstrap.py files, which led to my integration tests failing. In the Generate AAD Token step, I got "curl: (22) The requested URL returned error: 404 Not Found" but this step did not fail. The expected behavior would be to fail this step if no AAD token is created

databricks bundle init mlops-stacks // Error: variable "input_cloud" not defined

Hello!
Databricks CLI v0.211.0.
Python 3.10.0

When I trigger the initialization of the project with databricks bundle init mlops-stacks

I received the following message:

Welcome to MLOps Stacks. For detailed information on project generation, see the README at https://github.com/databricks/mlops-stacks/blob/main/README.md.
Error: variable "input_cloud" not defined

comments not matching commands in the CI workflow

these lines

should be addressed. comments say deploy in Staging workspace but the command is to deploy to Test workspace. the comment "Deploy Bundle to Test Deployment Target in Staging Workspace" is also very confusing.

The same issue for the GitHub Action workflow files here and here

Adding Data Monitoring for Databricks

Hi
Firstly, thanks a lot for this detailed implementation. It definitely looks very interesting and also quite extensive. I was just curious are there any plans for integrating Data monitoring for this template. I am interested in understanding how this can be done at a template level and would love some insights from you regarding this.

az login support for job creation

when using az login instead of DATABRICK_TOKEN.
databricks bundle deploy is able to upload code correctly on workspace but when comes to resource creation it fails with a terraform error like this :

Uploaded bundle files at /Users/25f671ac-9475-4c44-9856-524089e85c8f/.bundle/first-proj/staging/files!

Starting resource deployment
Error: terraform apply: exit status 1

Error: cannot create job: default auth: cannot configure default credentials. Config: host=https://adb-2827397339834432.12.azuredatabricks.net/, azure_use_msi=true, azure_client_id=25f671ac-9475-4c44-9856-524089e85c8f, azure_tenant_id=6e51e1ad-c54b-4b39-b598-0ffe9ae68fef. Env: DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, ARM_TENANT_ID

  with databricks_job.batch_inference_job,
  on bundle.tf.json line 44, in resource.databricks_job.batch_inference_job:
  44:       },


Error: cannot create job: default auth: cannot configure default credentials. Config: host=https://adb-2827397339834432.12.azuredatabricks.net/, azure_use_msi=true, azure_client_id=25f671ac-9475-4c44-9856-524089e85c8f, azure_tenant_id=6e51e1ad-c54b-4b39-b598-0ffe9ae68fef. Env: DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, ARM_TENANT_ID

  with databricks_job.model_training_job,
  on bundle.tf.json line 122, in resource.databricks_job.model_training_job:
 122:       },


Error: cannot create job: default auth: cannot configure default credentials. Config: host=https://adb-2827397339834432.12.azuredatabricks.net/, azure_use_msi=true, azure_client_id=25f671ac-9475-4c44-9856-524089e85c8f, azure_tenant_id=6e51e1ad-c54b-4b39-b598-0ffe9ae68fef. Env: DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, ARM_TENANT_ID

  with databricks_job.write_feature_table_job,
  on bundle.tf.json line 178, in resource.databricks_job.write_feature_table_job:
 178:       }


Error: cannot create mlflow experiment: default auth: cannot configure default credentials. Config: host=https://adb-2827397339834432.12.azuredatabricks.net/, azure_use_msi=true, azure_client_id=25f671ac-9475-4c44-9856-524089e85c8f, azure_tenant_id=6e51e1ad-c54b-4b39-b598-0ffe9ae68fef. Env: DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, ARM_TENANT_ID

  with databricks_mlflow_experiment.experiment,
  on bundle.tf.json line 183, in resource.databricks_mlflow_experiment.experiment:
 183:       }


Error: cannot create mlflow model: default auth: cannot configure default credentials. Config: host=https://adb-2827397339834432.12.azuredatabricks.net/, azure_use_msi=true, azure_client_id=25f671ac-9475-4c44-9856-524089e85c8f, azure_tenant_id=6e51e1ad-c54b-4b39-b598-0ffe9ae68fef. Env: DATABRICKS_HOST, ARM_USE_MSI, ARM_CLIENT_ID, ARM_TENANT_ID

  with databricks_mlflow_model.model,
  on bundle.tf.json line 189, in resource.databricks_mlflow_model.model:
 189:       }```

Command line command described in ml-developer-guide-fs.md fails

The command pip install -I -r requirements.txt -r test-requirements.txt described in ml-developer-guide-fs.md does not work because the requirements.txt and the test-requirements.txt are created in different folders.

Databricks group name conflict

When using the same workspace to mimic staging and prod, .mlops-setup-scripts/cicd/main-azure.tf attempts to create 2 groups with the same name, causing an error.

Fix is to make display_name unique for prod and staging.

Test2

description

Need an additional Step added for Preprocessing Data after Ingest

Need a step that allows users to preprocess the data after ingest.

Example use case: I needed to sort my training data based on some date column prior to splitting.

Can't register a new version of a model

Hi there,

I'm currently trying to implement this stack at my workplace and facing an issue that I'd like to understand if I'm doing something wrong or if it's a configuration error.

Since we only have 2 workspaces, one for prod and the other for QA and development, with a shared Unity catalog, my idea was to configure the Unity catalog with the name "ml-ops." Then, in the schema, use the model's name, in this case, "prometheus," and within the registry of each schema for each model, name the models as follows: prod-prometheus-model, staging-prometheus-model, and dev-prometheus-model.

For this, I made the following modifications in the code of the following files:

ml-artifacts-asset.yml

resources:
  registered_models:
    model:
      name: ${bundle.target}-${var.model_name}
      catalog_name: ml-ops
      schema_name: prometheus-model
      <<: *grants
      depends_on:
        - resources.jobs.model_training_job.id
        - resources.jobs.batch_inference_job.id

databricks.yml

bundle:
  name: ${bundle.target}-${var.model_name}

variables:
  experiment_name:
    description: Experiment name for model training.
    default: /Users/${workspace.current_user.userName}/${bundle.target}-prometheus-experiment
  model_name:
    description: Model name for model training.
    default: prometheus-model

model-workflow-asset.yml

resources:
  jobs:
    model_training_job:
      name: ${bundle.target}-${var.model_name}-model-training-job
      job_clusters:
        - job_cluster_key: model_training_job_cluster
          <<: *new_cluster
      tasks:
        - task_key: Train
          job_cluster_key: model_training_job_cluster
          notebook_task:
            notebook_path: ../training/notebooks/Train.py
            base_parameters:
              env: ${bundle.target}
              # TODO: Update training_data_path
              training_data_path: /databricks-datasets/nyctaxi-with-zipcodes/subsampled
              experiment_name: ${var.experiment_name}
              model_name: ml-ops.${var.model_name}.${bundle.target}-${var.model_name}
              # git source information of the current ML asset deployment. It will be persisted as part of the workflow run
              git_source_info: url:${bundle.git.origin_url}; branch:${bundle.git.branch}; commit:${bundle.git.commit}

However, after the first deployment, when the CI/CD pipeline runs:

databricks bundle deploy -t staging

I get the following error:

Updating deployment state...
Error: terraform apply: exit status 1

Error: cannot create registered model: Function or Model 'ml-ops.prometheus-model.staging-prometheus-model' already exists

  with databricks_registered_model.model,
  on bundle.tf.json line 188, in resource.databricks_registered_model.model:
 188:       }

Besides that, everything runs perfectly, and I can serve the model without a trouble. Also, I'm using the demo model. I haven't implemented our own yet.

I'm not sure if I'm doing something wrong. Any guidance would be appreciated.

Error at deploying new model

Hi, it's me again!

I'm currently facing an issue where whenever I change the model name, it gives me this error:

What I'm changing:

bundle:
  name: age

variables:
  experiment_name:
    description: Experiment name for the model training.
    default: /Users/${workspace.current_user.userName}/${bundle.target}-${var.model_name}-experiment
  model_name:
    description: Model name for the model training.
    default: age-model

The error

Deploying resources...
Updating deployment state...
Error: terraform apply: exit status 1

Error: unknown is not fully supported yet

  with databricks_grants.registered_model_model,
  on bundle.tf.json line 25, in resource.databricks_grants.registered_model_model:
  25:       }

If I use my old model who is already deploy with the bundle, it works perfectly. I'm using the other repo as a base, theorically it should work the same but with another name am I right?

GCP Support

In the template creation, I could only see AWS/Azure options. What should we do for Databricks on GCP?
Please advise. Thanks.

This is a test

Can you please check out this test issue I have with the stack?

Inconsistent formatting for strings

Line 140 has f strings with no replacement, then some parts of file uses concatenation others f-strngs. Replaced with f-strings for consistency.

#60

monorepo doesn't work as expected

It's not possible to create more than one project in the same repository.

Reproducibility steps:

Create a project project-1 with databricks bundle init mlops-stacks
For the input_root_dir variable, I chose the default value of my-mlops-project. It worked ok.
Create a second project, project-2, keeping the same value for input_root_dir (my-mlops-project)
That shows an error because it's trying to write on already existing files:

Error: failed to initialize template, one or more files already exist: my-mlops-project.gitignore

I am using the last available version of databricks-cli (0.210.1)

Test PagerDuty Integration

Troubleshoot AzureOps Guide: grpc: error while marshaling: string field contains invalid UTF-8

Error: Error: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 after running python .mlops-setup-scripts/terraform/bootstrap.py and being prompted for region.

Fix: this seems to be a problem with Azure CLI and how it interacts with Terraform. Please see this Terraform GitHub issue for fixes.

Error 429 in CI/CD pipeline in ML Code Tests for my-mlops-project

I set up CI/CD configuration and try to run a pipeline.
However, it fails with the following error at the time of deployment.

Run databricks bundle deploy -e test --log-level DEBUG

time=2023-05-25T06:18:49.717Z level=DEBUG source=client.go:255 msg="POST /api/2.0/workspace-files/import-file/Users/***/.bundle/my-mlops-project/test/files/training/steps/transform.py?overwrite=true\n[non-JSON document of 2272 bytes]. \"\"\"\nThis module defines the following routines used by the 'transform' step of the regression re... (2176 more bytes)\n< HTTP/2.0 429 Too Many Requests (Error: Current request has to be retried)\n" mutator=deploy mutator=deferred mutator=files.Upload sdk=true
time=2023-05-25T06:18:49.738Z level=DEBUG source=client.go:255 msg="POST /api/2.0/workspace-files/import-file/Users/***/.bundle/my-mlops-project/test/files/validation/README.md?overwrite=true\n[non-JSON document of 188 bytes]. # Model Validation\nTo enable model validation as part of scheduled databricks workflow, please r... (92 more bytes)\n< HTTP/2.0 429 Too Many Requests (Error: Current request has to be retried)\n" mutator=deploy mutator=deferred mutator=files.Upload sdk=true
time=2023-05-25T06:18:49.740Z level=DEBUG source=client.go:255 msg="POST /api/2.0/workspace-files/import-file/Users/***/.bundle/my-mlops-project/test/files/utils.py?overwrite=true\n[non-JSON document of 949 bytes]. \"\"\"\nThis module contains utils shared between different notebooks\n\"\"\"\nimport json\nimport mlflow\n... (853 more bytes)\n< HTTP/2.0 429 Too Many Requests (Error: Current request has to be retried)\n" mutator=deploy mutator=deferred mutator=files.Upload sdk=true
time=2023-05-25T06:18:49.755Z level=DEBUG source=client.go:255 msg="POST /api/2.0/workspace-files/import-file/Users/***/.bundle/my-mlops-project/test/files/validation/notebooks/ModelValidation.py?overwrite=true\n[non-JSON document of 12405 bytes]. # Databricks notebook source\n###################################################################... (12309 more bytes)\n< HTTP/2.0 429 Too Many Requests (Error: Current request has to be retried)\n" mutator=deploy mutator=deferred mutator=files.Upload sdk=true
time=2023-05-25T06:18:49.777Z level=DEBUG source=client.go:255 msg="POST /api/2.0/workspace-files/import-file/Users/***/.bundle/my-mlops-project/test/files/validation/validation.py?overwrite=true\n[non-JSON document of 2053 bytes]. import numpy as np\nfrom mlflow.models import make_metric, MetricThreshold\n\n# Custom metrics to b... (1957 more bytes)\n< HTTP/2.0 429 Too Many Requests (Error: Current request has to be retried)\n" mutator=deploy mutator=deferred mutator=files.Upload sdk=true
time=2023-05-25T06:18:50.160Z level=DEBUG source=client.go:255 msg="POST /api/2.0/workspace-files/import-file/Users/***/.bundle/my-mlops-project/test/files/training/data/sample.parquet?overwrite=true\n[non-JSON document of 232389 bytes]. PAR1\x15\x04\x15\xc0\xe0\t\x15\xac\xde\bL\x15\x88\x9c\x01\x15\x04\x12\x00\x00\xa0\xf0\x04\xb0@t[\xbc\xad+\x05\x00@\xa7\xa1\xf5\xaa+\x05\x00\x80&\x94%!\x1f(MISSING)+\x05\x00\xc0\x96#^\x97+\x05\x00\x00c\x14gm,\x05\x00\xc0?1\f\x9c\x01(l\xaa\xde\x05\x14,\x05\x00@\xe8\x86\x1d\x11,\x05\x00\x80\xe1kY\xdb... (232293 more bytes)\n< HTTP/2.0 200 OK\n" mutator=deploy mutator=deferred mutator=files.Upload sdk=true
time=2023-05-25T06:18:51.492Z level=DEBUG source=client.go:255 msg="POST /api/2.0/workspace-files/import-file/Users/***/.bundle/my-mlops-project/test/files/utils.py?overwrite=true\n[non-JSON document of 949 bytes]. \"\"\"\nThis module contains utils shared between different notebooks\n\"\"\"\nimport json\nimport mlflow\n... (853 more bytes)\n< HTTP/2.0 200 OK\n" mutator=deploy mutator=deferred mutator=files.Upload sdk=true
...
Error: terraform apply: exit status 1

Error: failed to read schema for databricks_permissions.job_model_training_job in registry.terraform.io/databricks/databricks: failed to instantiate provider "registry.terraform.io/databricks/databricks" to obtain schema: Unrecognized remote plugin message: 

This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.


time=2023-05-25T06:18:54.787Z level=ERROR source=root.go:96 msg="failed execution" exit_code=1 error="terraform apply: exit status 1\n\nError: failed to read schema for databricks_permissions.job_model_training_job in registry.terraform.io/databricks/databricks: failed to instantiate provider \"registry.terraform.io/databricks/databricks\" to obtain schema: Unrecognized remote plugin message: \n\nThis usually means that the plugin is either invalid or simply\nneeds to be recompiled to support the latest protocol.\n\n"

Define schemas as a resource

How should we define schemas as a resource? I had some difficulties with creating a schema. Does DAB support schemas as a resource type?

resources:
  schemas:
    schema:
      name: my-mlops-project
      catalog_name: ${bundle.target}
      comment: Schema for the "example-project" ML Project for ${bundle.target} deployment target.

Order of selecting configuration parameters may be improved

Seems like the order of menu items for select_cloud() and select_cicd_platform() should be reversed.
aws is alphabetically first before azure and if these are reversed the user can choose 1 -- 1 or 2 -- 2 for a more "native" pairing.

Select cloud:
1 - azure
2 - aws
Choose from 1, 2 [1]:
Select cicd_platform:
1 - GitHub Actions
2 - Azure DevOps
Choose from 1, 2 [1]:

Jenkins example and python package intake

Hello,

Thanks for creating this stack. I have two questions as below:

Is there an example with Jenkins as the CI/CD backend?
If I already have a project as a python package, how can I add it to this stack?
Should we consider a python package structure instead of scripts inside this stack?

Thank you!

Not possible to generate the bundle MlOps template

Hello, i was looking for some guidelines for using Databricks bundle for MlOps use cases. I came across this documentation :

Unfortunately, I'm not able to generate the template with

databricks bundle init mlops-stacks

I've got the following error:

Error: open mlops-stacks/databricks_template_schema.json: no such file or directory

If anyone can help me, I'd appreciated
Thank you :)

packaging of project

Hey, since for example many teams within organizations does not have access to public github, will u package this repo in order to initialize project structure and everything

Add Google Cloud Support

Description

Today the MLOps Stack supports Azure and AWS, I'd like to add support to Google Cloud Platform.

Features

Users generate ML project from the MLOps Stack project templates which integrated with Databricks workspaces running on Google Cloud. Initial features including:

GitHub Actions as the CI/CD platform.
Ability to generate ML project from the cookiecutter template. Which allows the user to configure:
- ML Project
- CI/CD platform
- Choose Google Cloud as the cloud provider
- Enable Feature Store
- Production / Staging workspaces
- ML experiment parent directory
- Github branches
Integration between GitHub Actions and Google Cloud.
- Authentication through Workload Identity Pool.
- Setup and configure required Google Cloud permissions / roles through Terraform.
Add required configuration and setup instructions to the readme.

Databricks CLI v0.212.0 breaks MLOps Stacks bundle validation

After upgrading to the latest version of the Databricks CLI (v0.212.0), we are experiencing issues with MLOps Stacks bundle validation. Specifically, attempting to validate a bundle now returns the following error:

Error: failed to load /databricks-resources/feature-tables-workflow-resource.yml: error unmarshaling JSON: json: cannot unmarshal object into Go struct field Root.permissions of type []resources.Permission

The error occurs in other workflow yaml files too where there are permissions set. This was not an issue in previous versions of the CLI, and we suspect that the update has caused some sort of compatibility problem.

Steps to Reproduce:

Install the latest version of the Databricks CLI (v0.212.0)
Attempt to validate an MLOps Stacks bundle
Observe the above error message

Expected Result:
MLOps Stacks bundle should validate without issue.

Actual Result:
The validation process fails with the error message described above.

Environment:

Databricks CLI v0.212.0
MLOps Stacks bundle validation

Bug: need to lowercase the project name when passing into azurerm_storage_account

Error when bootstrapping Terraform backend setup with ./mlops-setup-scripts/terraform module where the package name contains capital letters. azurerm_storage_account attribute name must only consist of lowercase letters. We should lowercase the project name passed here

Forbidden access to public github

Hey, my organization is using github enterprise, access to public github is denied. I am sure this is case for a lot of companies, can u make it installable from pypi, so i dont have to download zipped repo.

databricks bundle do not use a proxy

I'm using a proxy in github action runner by setting these env variables

http_proxy
https_proxy
no_proxy
HTTP_PROXY
HTTPS_PROXY
NO_PROXY

I believe that "databricks bundle" command do not use the proxy
I get this error
Error: error downloading Terraform: Get "https://releases.hashicorp.com/terraform/index.json": Forbidden
when running databricks bundle validate -e test