databricks-demos / dbdemos Goto Github PK

View Code? Open in Web Editor NEW

250.0 250.0 77.0 183.59 MB

Demos to implement your Databricks Lakehouse

License: Other

Shell 0.03% Python 11.80% HTML 88.16%

dbdemos's People

Contributors

Stargazers

Watchers

Forkers

a0x8o clementla-db bigdatavik bachwehbi kenniy-db juanlamadrid20 carrossoni sjuratov aowen5000 cchalc 0xbadidea caiomsouza qixwang pavithrarao91 srghat ronguerrero kalirajg04 rodrigoparadav vikram9503 rkj13780 tariqmusa jcarlosneto linxtong zhangabner skumarrm anasanchezss9 sp496 ikennanwosu andrewgreiner jenny-zhu-databricks deepikakiran vinayemmadi noahsa-db db-caioishizaka zosbhai tcotten parijatbhi-db vinod221 myereddy cwm3 caitlinstrong abhiagar2019 sqltj noahsommerfeld parthasarathi0317 pawaritl pierredoescare bazzazzadeh godstaker sanyaade-projects moacybarros l16by atanu-c pradeepsingh87 flaviomalavazi anil2799 ksurampudi domd7 boydzd cgripaldo shaik73 nskselva brahmawritings shodix aktaseren ievsantillan shafaypro annihaltor99 inyunch davesee contactvictor cleancoding109 marcelo-hiltner vladsagot nhat416 jacksonisaac

dbdemos's Issues

dbdemos-uc-04-audit-log

I want to enable Unity Catalog diagnostic settings in my Azure Databricks Service but I can't access the monitoring menu even when I am the owner of the resource.
Do you have any idea of what could be causing this?

Autoloader Demo: stop_all_streams() failing as 'stream_count' is not defined

#The function below references 'stream_count' which is not defined. Likely should be = len(streams)

# Function to stop all streaming queries 
def stop_all_streams(start_with = ""):
  streams = get_active_streams(start_with)
  if len(streams) > 0:
    print(f"Stopping {stream_count} streams")
    for s in streams:
        try:
            s.stop()
        except:
            pass
    print(f"All stream stopped (starting with: {start_with}.")

DLT CDC demo reference a DLT pipeline in e2 demo environment

CLUSTER_CREATION_ERROR when installing demos on Workspaces with the Compliance Security Profile enabled

When installing a dbdemo in a HIPAA-compliant environment, the following error is thrown, and the cluster is not created.

WARN: couldn't create the cluster for the demo: {'error_code': 'INVALID_PARAMETER_VALUE', 'message': 'Workspace restricted to instance types that encrypt in transit. Please specify one such driver node type'}

It happened when I tried to install the delta-lake demo on a HIPAA-compliant workspace on AWS that enforces Nitro instance types.

Code:

%pip install dbdemos

import dbdemos
dbdemos.list_demos()

dbdemos.install('delta-lake')

The notebooks are created without problems, and the demo runs smoothly in a user-created cluster.

Unable to locate notebook _resources/00-init in the git repo for Dolly chatbot

I'm trying to run the chat bot as mentioned in the demos. However, I encountered an issue with the langchain pipeline not supporting summarization yet. As a temporary fix, the companion notebook mentions the addition of the following code snippet:

hf_summary = HuggingFacePipeline_WithSummarization(pipeline=pipe_summary)

The notebook refers to a file named _resources/00-init, but I am unable to find it in the git repository. Could you please guide me on where to find the notebook or provide an alternative solution to address the issue?

Thank you for your assistance!

Make batch summarizer pipeline in llm-dolly-chatbot run on GPU

Existing code in 02-Data-preparation notebook of llm-dolly-chatbot demo has two issues:

@pandas_udf("string")
def summarize(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model for summarization
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    def summarize_txt(text):
      if len(text) > 400:
        return summarizer(text)
      return text

    for serie in iterator:
        # get a summary for each row
        yield serie.apply(summarize_txt)

It doesn't utilize the GPU
The summarization pipeline returns values of type List[Dict], which fails in the write operation

Recommend replacing with

@pandas_udf("string")
def summarize(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model for summarization
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=0)
    def summarize_txt(text):
      if len(text) > 400:
        return summarizer(text)[0]['summary_text']
      return text

    for serie in iterator:
        # get a summary for each row
        yield serie.apply(summarize_txt)

This results in GPU utilization around 40% - probably low because we're using a batch size of 1, but definitely faster than using no GPU. The entire job runs in 16 minutes on the g5.4xlarge cluster created by dbdemos

uc-04-system-tables: can't view _enable_system_tables notebook, 404

on the page https://www.dbdemos.ai/demo-notebooks.html?demoName=uc-04-system-tables, when I click the link for the _enable_system_tables notebook, it takes me to https://www.dbdemos.ai/minisite/uc-04-system-tables/$./_enable_system_tables and gives a 404

installation fails on standard non premium workspace (dlt/dbsql not available)

create_cluster bugs

latest release added a bug in create_cluster with the new display output

IoT Platform demo job task register_ml_model fails trying to display confusion matrix

Running on Azure Databricks Premium I am able to import the lakehouse-iot-platform demo and related assets successfully.

The dbdemos_lakehouse_iot_turbine_init job run starts automatically as it should, but fails on the register_ml_model task with the following error
FileNotFoundError: [Errno 2] No such file or directory: '/local_disk0/tmp/d5ae264a/confusion_matrix.png'

The job runs by default on a job cluster. Swapping the job to run on the automatically created all-purpose cluster fixes the problem. As an alternative workaround commenting out displaying the confusion matrix also works.

Notebooks links are not working in chatbot demo

https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot
Notebook links are giving 404 error in above tutorial.

dbdemos_lakehouse_churn_init_<your_name> failed with Notebook Exception - "com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAILED"

while running the lakehouse churn job, the job got failed with notebook exception -
com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAILED

From DLT pipeline the error code is pasted below for reference.

Update 0507a1 is FAILED.

java.lang.RuntimeException: Failed to execute python command for notebook '/Users/[email protected]/databricks_demo/lakehouse-retail-c360/01-Data-ingestion/01.2-DLT-churn-Python-UDF' with id RunnableCommandId(4793992086333503899) and error AnsiResult(---------------------------------------------------------------------------
RestException Traceback (most recent call last)
File :6
2 import mlflow
3 # Stage/version
4 # Model name |
5 # | |
----> 6 predict_churn_udf = mlflow.pyfunc.spark_udf(spark, "models:/dbdemos_customer_churn/Production")
7 spark.udf.register("predict_churn", predict_churn_udf)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/pyfunc/init.py:996, in spark_udf(spark, model_uri, result_type, env_manager)
989 if not any(isinstance(elem_type, x) for x in supported_types):
990 raise MlflowException(
991 message="Invalid result_type '{}'. Result type can only be one of or an array of one "
992 "of the following types: {}".format(str(elem_type), str(supported_types)),
993 error_code=INVALID_PARAMETER_VALUE,
994 )
--> 996 local_model_path = _download_artifact_from_uri(
997 artifact_uri=model_uri,
998 output_path=_create_model_downloading_tmp_dir(should_use_nfs),
999 )
1001 if env_manager == _EnvManager.LOCAL:
1002 # Assume spark executor python environment is the same with spark driver side.
1003 _warn_dependency_requirement_mismatches(local_model_path)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/tracking/artifact_utils.py:100, in _download_artifact_from_uri(artifact_uri, output_path)
94 """
95 :param artifact_uri: The absolute URI of the artifact to download.
96 :param output_path: The local filesystem path to which to download the artifact. If unspecified,
97 a local output path will be created.
98 """
99 root_uri, artifact_path = _get_root_uri_and_artifact_path(artifact_uri)
--> 100 return get_artifact_repository(artifact_uri=root_uri).download_artifacts(
101 artifact_path=artifact_path, dst_path=output_path
102 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repository_registry.py:114, in get_artifact_repository(artifact_uri)
104 def get_artifact_repository(artifact_uri):
105 """Get an artifact repository from the registry based on the scheme of artifact_uri
106
107 :param artifact_uri: The artifact store URI. This URI is used to select which artifact
(...)
112 requirements.
113 """
--> 114 return _artifact_repository_registry.get_artifact_repository(artifact_uri)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repository_registry.py:72, in ArtifactRepositoryRegistry.get_artifact_repository(self, artifact_uri)
65 if repository is None:
66 raise MlflowException(
67 "Could not find a registered artifact repository for: {}. "
68 "Currently registered schemes are: {}".format(
69 artifact_uri, list(self._registry.keys())
70 )
71 )
---> 72 return repository(artifact_uri)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/models_artifact_repo.py:42, in ModelsArtifactRepository.init(self, artifact_uri)
37 self.repo = UnityCatalogModelsArtifactRepository(
38 artifact_uri=artifact_uri, registry_uri=registry_uri
39 )
40 elif is_using_databricks_registry(artifact_uri):
41 # Use the DatabricksModelsArtifactRepository if a databricks profile is being used.
---> 42 self.repo = DatabricksModelsArtifactRepository(artifact_uri)
43 else:
44 uri = ModelsArtifactRepository.get_underlying_uri(artifact_uri)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/databricks_models_artifact_repo.py:63, in DatabricksModelsArtifactRepository.init(self, artifact_uri)
59 self.databricks_profile_uri = (
60 get_databricks_profile_uri_from_artifact_uri(artifact_uri) or mlflow.get_registry_uri()
61 )
62 client = MlflowClient(registry_uri=self.databricks_profile_uri)
---> 63 self.model_name, self.model_version = get_model_name_and_version(client, artifact_uri)
64 # Use an isolated thread pool executor for chunk uploads/downloads to avoid a deadlock
65 # caused by waiting for a chunk-upload/download task within a file-upload/download task.
66 # See https://superfastpython.com/threadpoolexecutor-deadlock/#Deadlock_1_Submit_and_Wait_for_a_Task_Within_a_Task
67 # for more details
68 self.chunk_thread_pool = self._create_thread_pool()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/utils/models.py:94, in get_model_name_and_version(client, models_uri)
92 if model_alias is not None:
93 return model_name, client.get_model_version_by_alias(model_name, model_alias).version
---> 94 return model_name, str(_get_latest_model_version(client, model_name, model_stage))

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/utils/models.py:32, in _get_latest_model_version(client, name, stage)
27 def _get_latest_model_version(client, name, stage):
28 """
29 Returns the latest version of the stage if stage is not None. Otherwise return the latest of all
30 versions.
31 """
---> 32 latest = client.get_latest_versions(name, None if stage is None else [stage])
33 if len(latest) == 0:
34 stage_str = "" if stage is None else f" and stage '{stage}'"

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/tracking/client.py:2425, in MlflowClient.get_latest_versions(self, name, stages)
2353 def get_latest_versions(self, name: str, stages: List[str] = None) -> List[ModelVersion]:
2354 """
2355 Latest version models for each requests stage. If no stages provided, returns the
2356 latest version for each stage.
(...)
2423 current_stage: None
2424 """
-> 2425 return self._get_registry_client().get_latest_versions(name, stages)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/tracking/_model_registry/client.py:140, in ModelRegistryClient.get_latest_versions(self, name, stages)
130 def get_latest_versions(self, name, stages=None):
131 """
132 Latest version models for each requests stage. If no stages provided, returns the
133 latest version for each stage.
(...)
138 :return: List of :py:class:mlflow.entities.model_registry.ModelVersion objects.
139 """
--> 140 return self.store.get_latest_versions(name, stages)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/model_registry/rest_store.py:169, in RestStore.get_latest_versions(self, name, stages)
159 """
160 Latest version models for each requested stage. If no stages argument is provided,
161 returns the latest version for each stage.
(...)
166 :return: List of :py:class:mlflow.entities.model_registry.ModelVersion objects.
167 """
168 req_body = message_to_json(GetLatestVersions(name=name, stages=stages))
--> 169 response_proto = self._call_endpoint(GetLatestVersions, req_body, call_all_endpoints=True)
170 return [
171 ModelVersion.from_proto(model_version)
172 for model_version in response_proto.model_versions
173 ]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/model_registry/base_rest_store.py:39, in BaseRestStore._call_endpoint(self, api, json_body, call_all_endpoints, extra_headers)
37 if call_all_endpoints:
38 endpoints = self._get_all_endpoints_from_method(api)
---> 39 return call_endpoints(
40 self.get_host_creds(), endpoints, json_body, response_proto, extra_headers
41 )
42 else:
43 endpoint, method = self._get_endpoint_from_method(api)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:217, in call_endpoints(host_creds, endpoints, json_body, response_proto, extra_headers)
215 except RestException as e:
216 if e.error_code != ErrorCode.Name(ENDPOINT_NOT_FOUND) or i == len(endpoints) - 1:
--> 217 raise e

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:212, in call_endpoints(host_creds, endpoints, json_body, response_proto, extra_headers)
210 for i, (endpoint, method) in enumerate(endpoints):
211 try:
--> 212 return call_endpoint(
213 host_creds, endpoint, method, json_body, response_proto, extra_headers
214 )
215 except RestException as e:
216 if e.error_code != ErrorCode.Name(ENDPOINT_NOT_FOUND) or i == len(endpoints) - 1:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:201, in call_endpoint(host_creds, endpoint, method, json_body, response_proto, extra_headers)
199 call_kwargs["json"] = json_body
200 response = http_request(**call_kwargs)
--> 201 response = verify_rest_response(response, endpoint)
202 js_dict = json.loads(response.text)
203 parse_dict(js_dict=js_dict, message=response_proto)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:133, in verify_rest_response(response, endpoint)
131 if response.status_code != 200:
132 if _can_parse_as_json_object(response.text):
--> 133 raise RestException(json.loads(response.text))
134 else:
135 base_msg = "API request to endpoint {} failed with error code {} != 200".format(
136 endpoint,
137 response.status_code,
138 )

RestException: RESOURCE_DOES_NOT_EXIST: RegisteredModel 'dbdemos_customer_churn' does not exist. It might have been deleted.,None,Map(),Map(),List(),List(),Map())
at com.databricks.pipelines.execution.core.languages.PythonRepl.runCmd(PythonRepl.scala:335)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$8(PipelineGraphLoader.scala:597)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$8$adapted(PipelineGraphLoader.scala:595)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$7(PipelineGraphLoader.scala:595)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$7$adapted(PipelineGraphLoader.scala:572)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
at com.databricks.pipelines.execution.service.PipelineRunnable$.loadPythonGraph(PipelineGraphLoader.scala:572)
at com.databricks.pipelines.execution.service.PipelineGraphLoader.loadGraph(PipelineGraphLoader.scala:324)
at com.databricks.pipelines.execution.service.PipelineGraphLoader.loadGraph(PipelineGraphLoader.scala:205)
at com.databricks.pipelines.execution.service.DLTComputeRunnableContext.loadGraph(DLTComputeRunnableContext.scala:96)
at com.databricks.pipelines.execution.core.UpdateExecution.$anonfun$initializeAndLoadGraph$1(UpdateExecution.scala:364)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.$anonfun$recordPipelinesOperation$3(DeltaPipelinesUsageLogging.scala:118)
at com.databricks.pipelines.common.monitoring.OperationStatusReporter.executeWithPeriodicReporting(OperationStatusReporter.scala:120)
at com.databricks.pipelines.common.monitoring.OperationStatusReporter$.executeWithPeriodicReporting(OperationStatusReporter.scala:160)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.$anonfun$recordPipelinesOperation$6(DeltaPipelinesUsageLogging.scala:137)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:555)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:650)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:671)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:412)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:158)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:410)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:407)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.withAttributionContext(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:455)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:440)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.withAttributionTags(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:645)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:564)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.recordOperationWithResultTags(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:555)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:525)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.recordOperation(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.recordOperation0(DeltaPipelinesUsageLogging.scala:60)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.$anonfun$recordPipelinesOperation$1(DeltaPipelinesUsageLogging.scala:130)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.recordPipelinesOperation(DeltaPipelinesUsageLogging.scala:107)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.recordPipelinesOperation$(DeltaPipelinesUsageLogging.scala:102)
at com.databricks.pipelines.execution.core.UpdateExecution.recordPipelinesOperation(UpdateExecution.scala:55)
at com.databricks.pipelines.execution.core.UpdateExecution.executeStage(UpdateExecution.scala:257)
at com.databricks.pipelines.execution.core.UpdateExecution.initializeAndLoadGraph(UpdateExecution.scala:360)
at com.databricks.pipelines.execution.core.UpdateExecution.executeUpdate(UpdateExecution.scala:344)
at com.databricks.pipelines.execution.core.UpdateExecution.$anonfun$start$3(UpdateExecution.scala:126)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.WorkloadAttributionContextUtils$.runWithDLTWorkloadTags(WorkloadAttributionContextUtils_DBR_12_Minus.scala:6)
at com.databricks.pipelines.execution.core.UpdateExecution.$anonfun$start$1(UpdateExecution.scala:122)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.UCContextCompanion$OptionUCContextHelper.runWithNewUCSIfAvailable(BaseUCContext.scala:283)
at com.databricks.pipelines.execution.core.UpdateExecution.start(UpdateExecution.scala:119)
at com.databricks.pipelines.execution.service.ExecutionBackend$$anon$2.$anonfun$run$2(ExecutionBackend.scala:670)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.CommandContextUtils$.withCommandContext(CommandContextUtils.scala:47)
at com.databricks.pipelines.execution.service.ExecutionBackend$$anon$2.run(ExecutionBackend.scala:670)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:114)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.IdentityClaim$.withClaim(IdentityClaim.scala:48)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$4(SparkThreadLocalForwardingThreadPoolExecutor.scala:77)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:41)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:76)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:62)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:111)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:114)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

@QuentinAmbard - Please let me know if you could help with the mentioned issue, or with any questions if I can.

UC lineage demo images are outdated

Need to update the lineage to the latest UI version

typo in uc-1-acl

Notebook: 00-UC-Table-ACL
Cmd 15:

Current
COPY INTO uc_acl.customers FROM '/demos/uc/users' FILEFORMAT=JSON

Fix
COPY INTO uc_acl.customers FROM '/dbdemos/uc/users' FILEFORMAT=JSON

uc-04-system-tables: resources don't get placed if install fails

I'm trying to deploy to a workspace where we have a shared UC, and no one is supposed to write to the main catalog. When I run the install command, I get an error that says that I don't have permissions to the main catalog. I would like to modify the scripts to deploy to my own catalog, but that doesn't cause the install to succeed. Instead, new sql scripts are placed and run.

I would also be okay with just running the notebooks, but that's not possible because they reference a resources folder that isn't being placed. If the resources were placed before trying to run the install, I would be able to hack this up myself.

Is it possible for me to just copy the resources from somewhere?

Uninstall demo following use?

Is there a recommended way of removing the artifacts created once the demo is complete?

02-Data-preparation note cell "Review our raw Q&A dataset" FAILS , can't figure out workaround!

The following line (3) fails :

raw_gardening = spark.read.format("xml").option("rowTag", "row").load(f"{gardening_raw_path}/Posts.xml")

with error:
org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: xml. Please find packages at https://spark.apache.org/third-party-projects.html`.`

load_in_8bit Dolly demo for smaller GPUs code incorrect

In the Dolly demo, notebook 03-Q&A-prompt-engineering-for-dolly, there is sample code provided (originally commented out) that looks like:

# Note: if you use dolly 12B or smaller model but a GPU with less than 24GB RAM, use 8bit. This requires %pip install bitsandbytes
# instruct_pipeline = pipeline(model=model_name, load_in_8bit=True, trust_remote_code=True, device_map="auto")

However, the correct way to pass the load_in_8bit param according to the Databricks Dolly Docs is as:
instruct_pipeline = pipeline(model=model_name, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto", return_full_text=True, max_new_tokens=256, top_p=0.95, top_k=50, model_kwargs={'load_in_8bit': True})

05_job_staging_validation - last cell has typo

The first if statement should have in that body "Archive" instead of "Staging"

Data file creation issue writing to dbfs for demo: "Unit Testing Delta Live Table (DLT) for production-grade pipelines"

"FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/demos/retail/customers/test/spend_csv/spend.csv'"

Solution that worked for me was to use the following instead of "with open" to write files to dbfs

dbutils.fs.put('/demos/retail/customers/test/spend_csv/spend.csv', spend_csv)
dbutils.fs.put('/demos/retail/customers/test/users_json/users.json', users_json)

Provision Demo with SPN instead of PAT Token

Currently PAT tokens are need to exist in order for the provisioning of the demos to function properly. At Scale we will need an SPN authorization. Most enterprise systems are moving away from PAT auth.

pandas on spark looking for wrong resource notebook

After running dbdemos.install('pandas-on-spark', path='./', overwrite = True) it looks like the initialization is pointing to a missing notebook:

dbdemos-uc-05-upgrade: Unable to access s3a bucket

Is it normal that I can't access it? I have trouble accessing all the aws buckets in several demos

I receive the following error:

Missing image for Spark Streaming Advanced tile

dbdemos/dbdemos/dbdemos.py

Line 147 in a489a7c

  <img class="dbdemo_logo" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/resources/{demo.name}.png" /> 

Refers to https://github.com/QuentinAmbard/databricks-demo/raw/main/resources/spark-streaming-sessionization.png but that path returns a 404

Chatbot keeps answering questions from out of scope data

Dolly model keeps on answering questions that are not part of the input documents. For example, I can ask: why do Canadians play hockey? where is Toronto? where are brothels in Toronto? It keeps answering all of these questions instead of saying I do not know. Here is the prompt from the notebook.

template = """You are a chatbot having a conversation with a human. Your are asked to answer gardening questions and help cultivating plants.
Given the following extracted parts of a long document and a question, answer the user question. If you don't know, say that you do not know.

{context}

{chat_history}

{human_input}

Response:
"""

I did try to change it a bit but still does not work. How to fix this?

template = """You are a chatbot having a conversation with a human. Your are asked to answer gardening questions and help cultivating plants.
Answer from the following documents otherwise say I do not know.

{context}

{chat_history}

{human_input}

Response:
"""

installation fails on community edition

dlt-loans issues

In the last session “Tracking data quality” in 02-DLT-Loan-pipeline-PYTHON notebook, the reference link seems wrong:

The how to access your DLT metrics link should be referring to "$./03-Log-Analysis" instead of "$./02-Log-Analysis".
The data quality dashboard example link sent to a dashboard that’s been trashed

How can I solve this error and could I ask for more tutorial?

Error at "bundler.start_and_wait_bundle_jobs(force_execution = False)" step

uc-04-system-tables - initialization job failure

Unable to view the dashboard due to initialization job failure

magic command UDF and DLT issue

in the lakehouse-iot-platform wind turbine demo
A DLT pipeline was created and points to two notebooks both having magic commands.
the magic command tied to a UDFs to install ML model.
Are there detailed steps to install the model outside the DLT pipeline and set the needed parameters?
Also there are %pip libraries to install in the '01.2' notebook - looking for guidance

See Notebooks, '01.1-DLT-Wind-Turbine-SQL' and '01.2-DLT-Wind-Turbine-SQL-UDF'

Expected behavior : As expected, the DLT pipeline fails on the magic commands and when commented out the code is using a UDF that was not registered due to the magic command.

Current Workaround in progress: Using SQL UDF. No workaround for libraries needed for ML model

Thanks in advance.

Deploying langchain pipeline in MLFlow not working

I have followed the "Build your Chat Bot with Dolly" demo and successfully logged an MLFlow model.

I am facing issues when loading the model again, following the demo I use:

def load_model_and_answer(similar_docs, question): 
  # Note: this will load the model once more in memory
  # Load the langchain pipeline & run inferences
  model_uri = 'runs:/2dc17c4b80994a3b9d300d199f6978a1/model'

  chain = mlflow.pyfunc.load_model(model_uri)
  chain.predict({"input_documents": similar_docs, "human_input": question})

To call the logged model I do the following:

hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

chroma_db = Chroma(collection_name="documentation_docs", embedding_function=hf_embed, persist_directory= gardening_vector_db_path)

question = "Why does this not work?"
similar_docs = chroma_db.similarity_search(question, k=1)

load_model_and_answer(similar_docs = similar_docs, question = question)

Failing with:

MlflowException: Incompatible input types for column input_documents. Can not safely convert float64 to <U0.

Any suggestions for solving this?

Requirements:
mlflow==2.3
configparser==5.2.0
gunicorn==20.1.0
langchain==0.0.199
numpy==1.24.3
pyyaml==6.0
requests==2.28.1
sqlalchemy==2.0.16
tornado==6.1

Model serving in DB on GCP for lakehouse-iot

Getting an error while calling model serving endpoint from the UI

Unrecognized content type parameters: format. IMPORTANT: The MLflow Model scoring protocol has changed in MLflow version 2.0. If you are seeing this error, you are likely using an outdated scoring request format. To resolve the error, either update your request format or adjust your MLflow Model's requirements file to specify an older version of MLflow (for example, change the 'mlflow' requirement specifier to 'mlflow==1.30.0'). If you are making a request using the MLflow client (e.g. via mlflow.pyfunc.spark_udf()), upgrade your MLflow client to a version >= 2.0 in order to use the new request format. For more information about the updated MLflow Model scoring protocol in MLflow 2.0, see https://mlflow.org/docs/latest/models.html#deploy-mlflow-models.

Tried calling the real-time endpoint from the notebook. Get: TypeError: Object of type Timestamp is not JSON serializable

llm-dolly-chatbot demo is broken at Chroma step

I'm simply running each cell in order for the demo notebooks that come with "llm-dolly-chatbot" and running into an Index error related to Chroma.

Cell 10 on Notebook 3 yields error: Index not found, please create an instance before querying

This is the code that generates the error:

def get_similar_docs(question, similar_doc_count):
  return db.similarity_search(question, k=similar_doc_count)

# Let's test it with blackberries:
for doc in get_similar_docs("how to grow blackberry?", 2):
  print(doc.page_content)

Please fix, thanks!

DLT Pipeline failed during Initialization

DLT Pipeline failed during Initialization. Below is the error.

java.lang.RuntimeException: Failed to execute python command for notebook '/Users/[email protected]/lakehouse-iot-platform/01-Data-ingestion/01.2-DLT-Wind-Turbine-SQL-UDF' with id RunnableCommandId(5811638086382625936) and error AnsiResult(---------------------------------------------------------------------------
RestException Traceback (most recent call last)
File :6
2 import mlflow
3 # Stage/version
4 # Model name |
5 # | |
----> 6 predict_churn_udf = mlflow.pyfunc.spark_udf(spark, "models:/dbdemos_turbine_maintenance/Production", "string") #use env_manager='virtualenv' to load the model env (instead of pip install the libs)
7 spark.udf.register("predict_maintenance", predict_churn_udf)

Kindly let me know what is the issue and how to fix it.

Use of Unity Catalog in 'delta-lake' demo

Hello there!

I just installed the 'delta-lake' demo in our (unity catalog enabled) workspace.

When running %run ./_resources/00-setup $reset_all_data=$reset_all_data everything worked, but not as expected, as i got a "Unity Catalog seems not to be enabled[...]" type message and well, unity catalog was not used.

I looked into the code and then found out that the access mode of the created cluster (custom) prevents Unity Catalog from being used. Maybe that is mentioned somewehere and I didn't read properly, but just in case a little heads-up. After setting the created cluster to single-user, everything worked fine and a catalog was created.

Cheers,
Thomas

PS: Everything else worked like a charm with your demos so far!

After I bundled and packaged my demo how can I install and run my demo?

fsi fraud demo doesn't setup proper permission

we need to change the database to all user to have the demo available for multiple

endpoint creation fails if another endpoint exist with the same name

Couldn't create endpoint. Creation response: {'error_code': 'INVALID_PARAMETER_VALUE', 'message': 'SQL warehouse with name dbdemos-shared-endpoint already exists'}
ERROR: couldn't create or get a SQL endpoint for dbdemos. Do you have permission? Trying to import the dashboard without endoint (import will pick the first available if any)
Unauthorized call. Check your PAT token {"message": "You don't have permission to this resource."}

Fix:

grant access to all users ?
fallback to another endpoint having the user name

02-Data-preparation note cell "Extract the dataset using sh command" FAILS

Usign databricks dbdemos-lakehouse-c360 follwing error from cell:

rm: cannot remove '/dbfs/dbdemos/product/llm/gardening/raw': No such file or directory

shell script is not in the expected directory (/tmp/gardening) when executing

cp -f Posts.xml /dbfs/dbdemos/product/llm/gardening/raw

I fixed it by adding new cell immediately after with following commands:

%sh
pwd
mkdir -p /dbfs/dbdemos/product/llm/gardening/raw
cp -f /tmp/gardening/Posts.xml /dbfs/dbdemos/product/llm/gardening/raw

Installing demo lakehouse-retail-churn under /Users/[email protected]... KeyError: 'cluster_id'

%pip install https://github.com/databricks-demos/dbdemos/raw/main/release/dbdemos-0.1-py3-none-any.whl --force

followed by

import dbdemos
dbdemos.list_demos()

dbdemos.install('lakehouse-retail-churn')

results in ...

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<command-4346789072647810> in <cell line: 1>()
----> 1 dbdemos.install('lakehouse-retail-churn')

/local_disk0/.ephemeral_nfs/envs/pythonEnv-35024614-92d6-4db2-880d-1375798b9e17/lib/python3.9/site-packages/dbdemos/dbdemos.py in install(demo_name, path, overwrite, username, pat_token, workspace_url, skip_dashboards, cloud)
    185 def install(demo_name, path = None, overwrite = False, username = None, pat_token = None, workspace_url = None, skip_dashboards = False, cloud = "AWS"):
    186     installer = Installer(username, pat_token, workspace_url, cloud)
--> 187     installer.install_demo(demo_name, path, overwrite, skip_dashboards = skip_dashboards)
    188 
    189 

/local_disk0/.ephemeral_nfs/envs/pythonEnv-35024614-92d6-4db2-880d-1375798b9e17/lib/python3.9/site-packages/dbdemos/installer.py in install_demo(self, demo_name, install_path, overwrite, update_cluster_if_exists, skip_dashboards)
    156         self.tracker.track_install(demo_conf.category, demo_name)
    157         self.get_current_username()
--> 158         cluster_id, cluster_name = self.load_demo_cluster(demo_name, demo_conf, update_cluster_if_exists)
    159         pipeline_ids = self.load_demo_pipelines(demo_name, demo_conf)
    160         dashboards = [] if skip_dashboards else self.install_dashboards(demo_conf, install_path)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-35024614-92d6-4db2-880d-1375798b9e17/lib/python3.9/site-packages/dbdemos/installer.py in load_demo_cluster(self, demo_name, demo_conf, update_cluster_if_exists)
    563         if existing_cluster == None:
    564             cluster = self.db.post("2.0/clusters/create", json = cluster_conf)
--> 565             cluster_conf["cluster_id"] = cluster["cluster_id"]
    566         else:
    567             cluster_conf["cluster_id"] = existing_cluster["cluster_id"]

KeyError: 'cluster_id'

DLT Dataset (Lending Tree)?

I tried to use the full DLT demo, but ran into a few problems.

Had to manually alter the cluster it tried to create to use a cluster policy since required by my org. (Resolved, but would be nice if aspects like that were easily configurable with args)
Failed to load the lending tree dataset csv. I couldn't find it anywhere in the workspace samples or in the repo. I looked on Kaggle, but there are too many potential matches. No idea which of the lending tree data sets it is expecting.

automl run should disable cache when exp is deleted

TODO: test if the run still exists when getting it from cache

uc-04-system-tables uses hardcoded UUID

Every time I attempt to install the demo uc-04-system-tables I receive the following exception:

Installation error: '49187002-01a8-4abf-b140-7795957779b3'
Couldn't create or update a dashboard.
If this is a permission error, we recommend you to search the existing dashboard and delete it manually.
You can skip the dashboard installation with skip_dashboards = True:
dbdemos.install('uc-04-system-tables', skip_dashboards = True)

Of course skip_dashboards=True does allow the rest of the demo to install successfully, but the dashboards are the most important part of this demo. I tried deleting the entire folder for the dashboards multiple times.

It seems that the ID 49187002-01a8-4abf-b140-7795957779b3 is hard-coded somewhere in the demo?

Tail stacktrace:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-fe1d2c16-4416-444d-be18-03dc0e27f7a5/lib/python3.10/site-packages/dbsqlclone/utils/load_dashboard.py:234, in duplicate_dashboard.<locals>.load_widget(widget)
    232     query_id = widget["visualization"]["query"]["id"]
    233     visualization_id = widget["visualization"]["id"]
--> 234     visualization_id_clone = dashboard_state["queries"][query_id]["visualizations"][visualization_id]
    235 data = {
    236     "dashboard_id": new_dashboard["id"],
    237     "visualization_id": visualization_id_clone,
   (...)
    240     "width": widget["width"]
    241 }
    242 time.sleep(sleep_between_queries)

KeyError: '49187002-01a8-4abf-b140-7795957779b3'

I viewed the dashboard JSON from within the site-packages bundles and confirmed the 4918... UUID appears as the ID of one of the queries:

...
          "id": "49187002-01a8-4abf-b140-7795957779b3",
          "name": "System Tables - top workspaces",
...

Dolly demo: "Notebook not found" error

I'd like to work through the dolly-llm-chatbot demo, but I'm stuck on the second code block:

Executing this...
%run ./_resources/00-init $catalog=hive_metastore $db=dbdemos_llm

...results in this:

Notebook not found: Users/<me>/<my-notebook>/_resources/00-init. Notebooks can be specified via a relative path (./Notebook or ../folder/Notebook) or via an absolute path (/Abs/Path/to/Notebook). Make sure you are specifying the path correctly.

I'd prefer to not install any additional packages/etc. I just want to experiment with the demo code itself.

How do I get past this error?

Thank you

Wrong hyperlink text

Hi @QuentinAmbard

Great work on this dbdemos.ai beauty.

I was using dbdemos.install("dlt-loans")
And as you can see in the below picture for the python version of the notebook the hyperlink is named as SQL like below

02-DLT-Loan-pipeline-PYTHON: DLT pipeline definition (SQL)

The typo in the Text needs to be fixed it should be

02-DLT-Loan-pipeline-PYTHON: DLT pipeline definition (PYTHON)

Incorrect URL redirection - Page Not Found (error 404)

Hi @QuentinAmbard

I was going through the Build your Chat Bot with Dolly demo and I encountered an issue when trying to access the notebooks hyperlinked in the tutorial (image below is one of the three notebooks mentioned in the tutorial).

The links redirect us to:
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/$./02-Data-preparation
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/$./03-Q&A-prompt-engineering-for-dolly
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/$./04-chat-bot-prompt-engineering-dolly

The $./ should be removed as the correct URLs are:
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/02-Data-preparation
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/03-Q&A-prompt-engineering-for-dolly
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/04-chat-bot-prompt-engineering-dolly

uc-04-system-tables: Dashboards not getting created

I have premium databricks workspace and is able to install uc-04-system-tables demo without dashboard by below command.
dbdemos.install('uc-04-system-tables', use_current_cluster = True, overwrite= True, skip_dashboards = True)

I have to install it outside of main catalog (Not accessible), so had skipped dashboard creation, how can I create dashboard after installation? Is it possible to have queries + dashboard available as resources? which I can import?

Demo installation stops when I run below command...
dbdemos.install('uc-04-system-tables', use_current_cluster = True, overwrite= False, skip_dashboards = False)