databricks-demos / dbdemos
Demos to implement your Databricks Lakehouse
License: Other
The function below references 'stream_count', which is not defined. It should likely be len(streams).
# Function to stop all streaming queries
def stop_all_streams(start_with = ""):
    streams = get_active_streams(start_with)
    if len(streams) > 0:
        print(f"Stopping {stream_count} streams")
        for s in streams:
            try:
                s.stop()
            except:
                pass
        print(f"All stream stopped (starting with: {start_with}.")
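A minimal sketch of how the reported function could be corrected, assuming the only intended change is to replace the undefined 'stream_count' with len(streams). This is not the maintainers' fix. FakeStream and the injected get_active_streams callable are hypothetical stand-ins so the example runs outside Databricks, and the sketch also reports stop failures instead of silently ignoring them, which goes beyond the original report.

```python
class FakeStream:
    """Hypothetical stand-in for a streaming query; only stop() is modeled."""
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.stopped = False

    def stop(self):
        if self.fail:
            raise RuntimeError(f"cannot stop {self.name}")
        self.stopped = True


def stop_all_streams(get_active_streams, start_with=""):
    """Stop every stream returned by get_active_streams(start_with).

    Returns the number of streams that were stopped successfully.
    """
    streams = get_active_streams(start_with)
    stopped = 0
    if len(streams) > 0:
        # Use len(streams) instead of the undefined 'stream_count'.
        print(f"Stopping {len(streams)} streams")
        for s in streams:
            try:
                s.stop()
                stopped += 1
            except Exception as e:
                # Report the failure instead of silently swallowing it.
                print(f"Could not stop stream {s.name}: {e}")
        print(f"Stopped {stopped} of {len(streams)} streams (starting with: {start_with}).")
    return stopped


# Usage with two fake streams, one of which refuses to stop.
demo_streams = [FakeStream("dbdemos_a"), FakeStream("dbdemos_b", fail=True)]
stopped_count = stop_all_streams(
    lambda prefix: [s for s in demo_streams if s.name.startswith(prefix)],
    "dbdemos",
)
```

In the demo itself, the callable would be the existing get_active_streams helper; passing it in as a parameter here only serves to keep the sketch self-contained.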
When installing a dbdemo in a HIPAA-compliant environment, the following error is thrown, and the cluster is not created.
WARN: couldn't create the cluster for the demo: {'error_code': 'INVALID_PARAMETER_VALUE', 'message': 'Workspace restricted to instance types that encrypt in transit. Please specify one such driver node type'}
It happened when I tried to install the delta-lake demo on a HIPAA-compliant workspace on AWS that enforces Nitro instance types.
Code:
%pip install dbdemos
import dbdemos
dbdemos.list_demos()
dbdemos.install('delta-lake')
The notebooks are created without problems, and the demo runs smoothly in a user-created cluster.
I'm trying to run the chat bot as mentioned in the demos. However, I encountered an issue with the langchain pipeline not supporting summarization yet. As a temporary fix, the companion notebook mentions the addition of the following code snippet:
hf_summary = HuggingFacePipeline_WithSummarization(pipeline=pipe_summary)
The notebook refers to a file named _resources/00-init, but I am unable to find it in the git repository. Could you please guide me on where to find the notebook or provide an alternative solution to address the issue?
Thank you for your assistance!
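Existing code in the 02-Data-preparation notebook of the llm-dolly-chatbot demo has two issues: the summarization pipeline is created without a device, so it does not use the GPU, and summarize_txt returns the raw pipeline output instead of the summary string: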
@pandas_udf("string")
def summarize(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model for summarization
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    def summarize_txt(text):
        if len(text) > 400:
            return summarizer(text)
        return text
    for serie in iterator:
        # get a summary for each row
        yield serie.apply(summarize_txt)
Recommend replacing with
@pandas_udf("string")
def summarize(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model for summarization
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=0)
    def summarize_txt(text):
        if len(text) > 400:
            return summarizer(text)[0]['summary_text']
        return text
    for serie in iterator:
        # get a summary for each row
        yield serie.apply(summarize_txt)
This results in GPU utilization of around 40%, probably low because we're using a batch size of 1, but definitely faster than using no GPU. The entire job runs in 16 minutes on the g5.4xlarge cluster created by dbdemos.
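To illustrate why the [0]['summary_text'] indexing matters, here is a small sketch that runs without a GPU or the transformers library. It assumes the summarization pipeline returns a list with one dict per input text, keyed by 'summary_text', which is the shape the recommended code above relies on. fake_summarizer and make_summarize_txt are hypothetical helpers, not part of the demo.

```python
def fake_summarizer(text):
    """Hypothetical stub that mimics the assumed output shape of a
    summarization pipeline: a list with one dict per input text."""
    return [{"summary_text": text[:40] + "..."}]


def make_summarize_txt(summarizer, min_length=400):
    """Build the per-row function used inside the pandas UDF."""
    def summarize_txt(text):
        # Short texts are passed through unchanged, as in the original UDF.
        if len(text) > min_length:
            # Unwrap the list and dict so the result is a plain string,
            # which is what a pandas_udf("string") column expects.
            return summarizer(text)[0]["summary_text"]
        return text
    return summarize_txt


summarize_txt = make_summarize_txt(fake_summarizer)
short_result = summarize_txt("Water tomatoes weekly.")
long_result = summarize_txt("x" * 500)
```

Without the unwrapping step, long rows would yield a list of dicts while short rows yield strings, so the column would no longer be uniformly string-typed.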
On the page https://www.dbdemos.ai/demo-notebooks.html?demoName=uc-04-system-tables, clicking the link for the _enable_system_tables notebook takes me to https://www.dbdemos.ai/minisite/uc-04-system-tables/$./_enable_system_tables, which returns a 404.
The latest release introduced a bug in create_cluster with the new display output.
Running on Azure Databricks Premium I am able to import the lakehouse-iot-platform demo and related assets successfully.
The dbdemos_lakehouse_iot_turbine_init job run starts automatically as it should, but fails on the register_ml_model task with the following error
FileNotFoundError: [Errno 2] No such file or directory: '/local_disk0/tmp/d5ae264a/confusion_matrix.png'
The job runs by default on a job cluster. Swapping the job to run on the automatically created all-purpose cluster fixes the problem. As an alternative workaround, commenting out the display of the confusion matrix also works.
https://www.dbdemos.ai/demo-notebooks.html?demoName=llm-dolly-chatbot
The notebook links in the tutorial above return 404 errors.
While running the lakehouse churn job, the job failed with a notebook exception:
com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAILED
The error from the DLT pipeline is pasted below for reference.
Update 0507a1 is FAILED.
java.lang.RuntimeException: Failed to execute python command for notebook '/Users/[email protected]/databricks_demo/lakehouse-retail-c360/01-Data-ingestion/01.2-DLT-churn-Python-UDF' with id RunnableCommandId(4793992086333503899) and error AnsiResult(---------------------------------------------------------------------------
RestException Traceback (most recent call last)
File :6
2 import mlflow
3 # Stage/version
4 # Model name |
5 # | |
----> 6 predict_churn_udf = mlflow.pyfunc.spark_udf(spark, "models:/dbdemos_customer_churn/Production")
7 spark.udf.register("predict_churn", predict_churn_udf)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/pyfunc/__init__.py:996, in spark_udf(spark, model_uri, result_type, env_manager)
989 if not any(isinstance(elem_type, x) for x in supported_types):
990 raise MlflowException(
991 message="Invalid result_type '{}'. Result type can only be one of or an array of one "
992 "of the following types: {}".format(str(elem_type), str(supported_types)),
993 error_code=INVALID_PARAMETER_VALUE,
994 )
--> 996 local_model_path = _download_artifact_from_uri(
997 artifact_uri=model_uri,
998 output_path=_create_model_downloading_tmp_dir(should_use_nfs),
999 )
1001 if env_manager == _EnvManager.LOCAL:
1002 # Assume spark executor python environment is the same with spark driver side.
1003 _warn_dependency_requirement_mismatches(local_model_path)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/tracking/artifact_utils.py:100, in _download_artifact_from_uri(artifact_uri, output_path)
94 """
95 :param artifact_uri: The absolute URI of the artifact to download.
96 :param output_path: The local filesystem path to which to download the artifact. If unspecified,
97 a local output path will be created.
98 """
99 root_uri, artifact_path = _get_root_uri_and_artifact_path(artifact_uri)
--> 100 return get_artifact_repository(artifact_uri=root_uri).download_artifacts(
101 artifact_path=artifact_path, dst_path=output_path
102 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repository_registry.py:114, in get_artifact_repository(artifact_uri)
104 def get_artifact_repository(artifact_uri):
105 """Get an artifact repository from the registry based on the scheme of artifact_uri
106
107 :param artifact_uri: The artifact store URI. This URI is used to select which artifact
(...)
112 requirements.
113 """
--> 114 return _artifact_repository_registry.get_artifact_repository(artifact_uri)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/artifact_repository_registry.py:72, in ArtifactRepositoryRegistry.get_artifact_repository(self, artifact_uri)
65 if repository is None:
66 raise MlflowException(
67 "Could not find a registered artifact repository for: {}. "
68 "Currently registered schemes are: {}".format(
69 artifact_uri, list(self._registry.keys())
70 )
71 )
---> 72 return repository(artifact_uri)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/models_artifact_repo.py:42, in ModelsArtifactRepository.__init__(self, artifact_uri)
37 self.repo = UnityCatalogModelsArtifactRepository(
38 artifact_uri=artifact_uri, registry_uri=registry_uri
39 )
40 elif is_using_databricks_registry(artifact_uri):
41 # Use the DatabricksModelsArtifactRepository if a databricks profile is being used.
---> 42 self.repo = DatabricksModelsArtifactRepository(artifact_uri)
43 else:
44 uri = ModelsArtifactRepository.get_underlying_uri(artifact_uri)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/databricks_models_artifact_repo.py:63, in DatabricksModelsArtifactRepository.__init__(self, artifact_uri)
59 self.databricks_profile_uri = (
60 get_databricks_profile_uri_from_artifact_uri(artifact_uri) or mlflow.get_registry_uri()
61 )
62 client = MlflowClient(registry_uri=self.databricks_profile_uri)
---> 63 self.model_name, self.model_version = get_model_name_and_version(client, artifact_uri)
64 # Use an isolated thread pool executor for chunk uploads/downloads to avoid a deadlock
65 # caused by waiting for a chunk-upload/download task within a file-upload/download task.
66 # See https://superfastpython.com/threadpoolexecutor-deadlock/#Deadlock_1_Submit_and_Wait_for_a_Task_Within_a_Task
67 # for more details
68 self.chunk_thread_pool = self._create_thread_pool()
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/utils/models.py:94, in get_model_name_and_version(client, models_uri)
92 if model_alias is not None:
93 return model_name, client.get_model_version_by_alias(model_name, model_alias).version
---> 94 return model_name, str(_get_latest_model_version(client, model_name, model_stage))
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/artifact/utils/models.py:32, in _get_latest_model_version(client, name, stage)
27 def _get_latest_model_version(client, name, stage):
28 """
29 Returns the latest version of the stage if stage is not None. Otherwise return the latest of all
30 versions.
31 """
---> 32 latest = client.get_latest_versions(name, None if stage is None else [stage])
33 if len(latest) == 0:
34 stage_str = "" if stage is None else f" and stage '{stage}'"
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/tracking/client.py:2425, in MlflowClient.get_latest_versions(self, name, stages)
2353 def get_latest_versions(self, name: str, stages: List[str] = None) -> List[ModelVersion]:
2354 """
2355 Latest version models for each requests stage. If no stages
provided, returns the
2356 latest version for each stage.
(...)
2423 current_stage: None
2424 """
-> 2425 return self._get_registry_client().get_latest_versions(name, stages)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/tracking/_model_registry/client.py:140, in ModelRegistryClient.get_latest_versions(self, name, stages)
130 def get_latest_versions(self, name, stages=None):
131 """
132 Latest version models for each requests stage. If no stages
provided, returns the
133 latest version for each stage.
(...)
138 :return: List of :py:class:mlflow.entities.model_registry.ModelVersion
objects.
139 """
--> 140 return self.store.get_latest_versions(name, stages)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/model_registry/rest_store.py:169, in RestStore.get_latest_versions(self, name, stages)
159 """
160 Latest version models for each requested stage. If no stages
argument is provided,
161 returns the latest version for each stage.
(...)
166 :return: List of :py:class:mlflow.entities.model_registry.ModelVersion
objects.
167 """
168 req_body = message_to_json(GetLatestVersions(name=name, stages=stages))
--> 169 response_proto = self._call_endpoint(GetLatestVersions, req_body, call_all_endpoints=True)
170 return [
171 ModelVersion.from_proto(model_version)
172 for model_version in response_proto.model_versions
173 ]
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/store/model_registry/base_rest_store.py:39, in BaseRestStore._call_endpoint(self, api, json_body, call_all_endpoints, extra_headers)
37 if call_all_endpoints:
38 endpoints = self._get_all_endpoints_from_method(api)
---> 39 return call_endpoints(
40 self.get_host_creds(), endpoints, json_body, response_proto, extra_headers
41 )
42 else:
43 endpoint, method = self._get_endpoint_from_method(api)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:217, in call_endpoints(host_creds, endpoints, json_body, response_proto, extra_headers)
215 except RestException as e:
216 if e.error_code != ErrorCode.Name(ENDPOINT_NOT_FOUND) or i == len(endpoints) - 1:
--> 217 raise e
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:212, in call_endpoints(host_creds, endpoints, json_body, response_proto, extra_headers)
210 for i, (endpoint, method) in enumerate(endpoints):
211 try:
--> 212 return call_endpoint(
213 host_creds, endpoint, method, json_body, response_proto, extra_headers
214 )
215 except RestException as e:
216 if e.error_code != ErrorCode.Name(ENDPOINT_NOT_FOUND) or i == len(endpoints) - 1:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:201, in call_endpoint(host_creds, endpoint, method, json_body, response_proto, extra_headers)
199 call_kwargs["json"] = json_body
200 response = http_request(**call_kwargs)
--> 201 response = verify_rest_response(response, endpoint)
202 js_dict = json.loads(response.text)
203 parse_dict(js_dict=js_dict, message=response_proto)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-1810d376-b464-4920-b311-abfa3f12b7f8/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:133, in verify_rest_response(response, endpoint)
131 if response.status_code != 200:
132 if _can_parse_as_json_object(response.text):
--> 133 raise RestException(json.loads(response.text))
134 else:
135 base_msg = "API request to endpoint {} failed with error code {} != 200".format(
136 endpoint,
137 response.status_code,
138 )
RestException: RESOURCE_DOES_NOT_EXIST: RegisteredModel 'dbdemos_customer_churn' does not exist. It might have been deleted.,None,Map(),Map(),List(),List(),Map())
at com.databricks.pipelines.execution.core.languages.PythonRepl.runCmd(PythonRepl.scala:335)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$8(PipelineGraphLoader.scala:597)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$8$adapted(PipelineGraphLoader.scala:595)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$7(PipelineGraphLoader.scala:595)
at com.databricks.pipelines.execution.service.PipelineRunnable$.$anonfun$loadPythonGraph$7$adapted(PipelineGraphLoader.scala:572)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
at com.databricks.pipelines.execution.service.PipelineRunnable$.loadPythonGraph(PipelineGraphLoader.scala:572)
at com.databricks.pipelines.execution.service.PipelineGraphLoader.loadGraph(PipelineGraphLoader.scala:324)
at com.databricks.pipelines.execution.service.PipelineGraphLoader.loadGraph(PipelineGraphLoader.scala:205)
at com.databricks.pipelines.execution.service.DLTComputeRunnableContext.loadGraph(DLTComputeRunnableContext.scala:96)
at com.databricks.pipelines.execution.core.UpdateExecution.$anonfun$initializeAndLoadGraph$1(UpdateExecution.scala:364)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.$anonfun$recordPipelinesOperation$3(DeltaPipelinesUsageLogging.scala:118)
at com.databricks.pipelines.common.monitoring.OperationStatusReporter.executeWithPeriodicReporting(OperationStatusReporter.scala:120)
at com.databricks.pipelines.common.monitoring.OperationStatusReporter$.executeWithPeriodicReporting(OperationStatusReporter.scala:160)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.$anonfun$recordPipelinesOperation$6(DeltaPipelinesUsageLogging.scala:137)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:555)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:650)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:671)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:412)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:158)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:410)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:407)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.withAttributionContext(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:455)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:440)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.withAttributionTags(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:645)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:564)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.recordOperationWithResultTags(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:555)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:525)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.recordOperation(DeltaPipelinesUsageLogging.scala:25)
at com.databricks.pipelines.execution.core.monitoring.PublicLogging.recordOperation0(DeltaPipelinesUsageLogging.scala:60)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.$anonfun$recordPipelinesOperation$1(DeltaPipelinesUsageLogging.scala:130)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.recordPipelinesOperation(DeltaPipelinesUsageLogging.scala:107)
at com.databricks.pipelines.execution.core.monitoring.DeltaPipelinesUsageLogging.recordPipelinesOperation$(DeltaPipelinesUsageLogging.scala:102)
at com.databricks.pipelines.execution.core.UpdateExecution.recordPipelinesOperation(UpdateExecution.scala:55)
at com.databricks.pipelines.execution.core.UpdateExecution.executeStage(UpdateExecution.scala:257)
at com.databricks.pipelines.execution.core.UpdateExecution.initializeAndLoadGraph(UpdateExecution.scala:360)
at com.databricks.pipelines.execution.core.UpdateExecution.executeUpdate(UpdateExecution.scala:344)
at com.databricks.pipelines.execution.core.UpdateExecution.$anonfun$start$3(UpdateExecution.scala:126)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.WorkloadAttributionContextUtils$.runWithDLTWorkloadTags(WorkloadAttributionContextUtils_DBR_12_Minus.scala:6)
at com.databricks.pipelines.execution.core.UpdateExecution.$anonfun$start$1(UpdateExecution.scala:122)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.UCContextCompanion$OptionUCContextHelper.runWithNewUCSIfAvailable(BaseUCContext.scala:283)
at com.databricks.pipelines.execution.core.UpdateExecution.start(UpdateExecution.scala:119)
at com.databricks.pipelines.execution.service.ExecutionBackend$$anon$2.$anonfun$run$2(ExecutionBackend.scala:670)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.pipelines.execution.core.CommandContextUtils$.withCommandContext(CommandContextUtils.scala:47)
at com.databricks.pipelines.execution.service.ExecutionBackend$$anon$2.run(ExecutionBackend.scala:670)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:114)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.IdentityClaim$.withClaim(IdentityClaim.scala:48)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$4(SparkThreadLocalForwardingThreadPoolExecutor.scala:77)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:41)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:76)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:62)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:111)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:114)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
@QuentinAmbard - Please let me know if you can help with this issue, or if I can answer any questions.
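The traceback ends in RESOURCE_DOES_NOT_EXIST for 'dbdemos_customer_churn', which suggests that the notebook that trains and registers the model had not run (or had failed) before the pipeline started. The sketch below shows one way to check for the model before calling mlflow.pyfunc.spark_udf. It is an illustration under assumptions, not the demo's code: FakeRegistryClient and model_stage_available are hypothetical. With the real library, client would be an mlflow.MlflowClient(), whose get_latest_versions(name, stages) call appears in the traceback above.

```python
class FakeRegistryClient:
    """Hypothetical stand-in for mlflow.MlflowClient, for illustration only."""
    def __init__(self, models):
        self._models = models  # {model name: {stage: version}}

    def get_latest_versions(self, name, stages=None):
        if name not in self._models:
            # Mimics the error text shown in the traceback.
            raise Exception(
                f"RESOURCE_DOES_NOT_EXIST: RegisteredModel '{name}' does not exist."
            )
        return [
            version
            for stage, version in self._models[name].items()
            if stages is None or stage in stages
        ]


def model_stage_available(client, name, stage):
    """Return True if the registry has at least one version of `name` in `stage`."""
    try:
        return len(client.get_latest_versions(name, [stage])) > 0
    except Exception as e:
        # Treat a missing model as "not available"; re-raise anything else.
        if "RESOURCE_DOES_NOT_EXIST" in str(e):
            return False
        raise


empty_registry = FakeRegistryClient({})
ready_registry = FakeRegistryClient({"dbdemos_customer_churn": {"Production": "1"}})
missing = model_stage_available(empty_registry, "dbdemos_customer_churn", "Production")
present = model_stage_available(ready_registry, "dbdemos_customer_churn", "Production")
```

If the check returns False, running the demo's model training and registration notebook first, and then restarting the pipeline, is the likely next step.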
Need to update the lineage to the latest UI version
Notebook: 00-UC-Table-ACL
Cmd 15:
Current
COPY INTO uc_acl.customers FROM '/demos/uc/users' FILEFORMAT=JSON
Fix
COPY INTO uc_acl.customers FROM '/dbdemos/uc/users' FILEFORMAT=JSON
I'm trying to deploy to a workspace where we have a shared UC, and no one is supposed to write to the main catalog. When I run the install command, I get an error that says that I don't have permissions to the main catalog. I would like to modify the scripts to deploy to my own catalog, but doing so doesn't make the install succeed. Instead, new SQL scripts are placed and run.
I would also be okay with just running the notebooks, but that's not possible because they reference a resources folder that isn't being placed. If the resources were placed before trying to run the install, I would be able to hack this up myself.
Is it possible for me to just copy the resources from somewhere?
Is there a recommended way of removing the artifacts created once the demo is complete?
The following line (3) fails:
raw_gardening = spark.read.format("xml").option("rowTag", "row").load(f"{gardening_raw_path}/Posts.xml")
with error:
org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: xml. Please find packages at `https://spark.apache.org/third-party-projects.html`.
In the Dolly demo, notebook 03-Q&A-prompt-engineering-for-dolly, there is sample code provided (originally commented out) that looks like:
# Note: if you use dolly 12B or smaller model but a GPU with less than 24GB RAM, use 8bit. This requires %pip install bitsandbytes
# instruct_pipeline = pipeline(model=model_name, load_in_8bit=True, trust_remote_code=True, device_map="auto")
However, according to the Databricks Dolly docs, the correct way to pass the load_in_8bit param is:
instruct_pipeline = pipeline(model=model_name, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto", return_full_text=True, max_new_tokens=256, top_p=0.95, top_k=50, model_kwargs={'load_in_8bit': True})
The body of the first if statement should use "Archive" instead of "Staging".
"FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/demos/retail/customers/test/spend_csv/spend.csv'"
Solution that worked for me was to use the following instead of "with open" to write files to dbfs
Currently, PAT tokens need to exist for the provisioning of the demos to function properly. At scale, we will need SPN authorization, as most enterprise systems are moving away from PAT auth.
Line 147 in a489a7c
Refers to https://github.com/QuentinAmbard/databricks-demo/raw/main/resources/spark-streaming-sessionization.png
but that path returns a 404
Dolly model keeps on answering questions that are not part of the input documents. For example, I can ask: why do Canadians play hockey? where is Toronto? where are brothels in Toronto? It keeps answering all of these questions instead of saying I do not know. Here is the prompt from the notebook.
template = """You are a chatbot having a conversation with a human. Your are asked to answer gardening questions and help cultivating plants.
Given the following extracted parts of a long document and a question, answer the user question. If you don't know, say that you do not know.
{context}
{chat_history}
{human_input}
Response:
"""
I tried changing it a bit, but it still does not work. How can I fix this?
template = """You are a chatbot having a conversation with a human. Your are asked to answer gardening questions and help cultivating plants.
Answer from the following documents otherwise say I do not know.
{context}
{chat_history}
{human_input}
Response:
"""
In the last section, “Tracking data quality”, of the 02-DLT-Loan-pipeline-PYTHON notebook, the reference link seems wrong:
In the lakehouse-iot-platform wind turbine demo, a DLT pipeline is created that points to two notebooks, both of which contain magic commands.
The magic commands are tied to a UDF that installs the ML model.
Are there detailed steps to install the model outside the DLT pipeline and set the needed parameters?
There are also %pip libraries to install in the '01.2' notebook, for which I'm looking for guidance.
See the notebooks '01.1-DLT-Wind-Turbine-SQL' and '01.2-DLT-Wind-Turbine-SQL-UDF'.
Expected behavior: as expected, the DLT pipeline fails on the magic commands, and when they are commented out, the code uses a UDF that was never registered because the magic command did not run.
Current workaround in progress: using a SQL UDF. There is no workaround yet for the libraries needed by the ML model.
Thanks in advance.
I have followed the "Build your Chat Bot with Dolly" demo and successfully logged an MLFlow model.
I am facing issues when loading the model again, following the demo I use:
def load_model_and_answer(similar_docs, question):
    # Note: this will load the model once more in memory
    # Load the langchain pipeline & run inferences
    model_uri = 'runs:/2dc17c4b80994a3b9d300d199f6978a1/model'
    chain = mlflow.pyfunc.load_model(model_uri)
    chain.predict({"input_documents": similar_docs, "human_input": question})
To call the logged model I do the following:
hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
chroma_db = Chroma(collection_name="documentation_docs", embedding_function=hf_embed, persist_directory= gardening_vector_db_path)
question = "Why does this not work?"
similar_docs = chroma_db.similarity_search(question, k=1)
load_model_and_answer(similar_docs = similar_docs, question = question)
Failing with:
MlflowException: Incompatible input types for column input_documents. Can not safely convert float64 to <U0.
Any suggestions for solving this?
Requirements:
mlflow==2.3
configparser==5.2.0
gunicorn==20.1.0
langchain==0.0.199
numpy==1.24.3
pyyaml==6.0
requests==2.28.1
sqlalchemy==2.0.16
tornado==6.1
Getting an error while calling model serving endpoint from the UI
Unrecognized content type parameters: format. IMPORTANT: The MLflow Model scoring protocol has changed in MLflow version 2.0. If you are seeing this error, you are likely using an outdated scoring request format. To resolve the error, either update your request format or adjust your MLflow Model's requirements file to specify an older version of MLflow (for example, change the 'mlflow' requirement specifier to 'mlflow==1.30.0'). If you are making a request using the MLflow client (e.g. via mlflow.pyfunc.spark_udf()), upgrade your MLflow client to a version >= 2.0 in order to use the new request format. For more information about the updated MLflow Model scoring protocol in MLflow 2.0, see https://mlflow.org/docs/latest/models.html#deploy-mlflow-models.
I also tried calling the real-time endpoint from the notebook and got: TypeError: Object of type Timestamp is not JSON serializable
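Both errors point at the request body. A rough sketch of building one follows, assuming the endpoint accepts the MLflow 2.x "dataframe_records" format and that timestamps can be sent as ISO 8601 strings. Whether the served model accepts string timestamps is an assumption to verify against its signature. The column names are made up, and to_jsonable and build_scoring_payload are hypothetical helpers.

```python
import json
from datetime import date, datetime


def to_jsonable(value):
    """json.dumps fallback: render datetime-like values as ISO 8601 strings.

    pandas Timestamp objects subclass datetime, so they take this path too.
    """
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def build_scoring_payload(records):
    """Wrap a list of row dicts in the MLflow 2.x 'dataframe_records' envelope."""
    return json.dumps({"dataframe_records": records}, default=to_jsonable)


# Hypothetical row; the real demo's feature columns will differ.
rows = [{"customer_id": "c1", "last_activity": datetime(2023, 6, 1, 12, 30)}]
payload = build_scoring_payload(rows)
print(payload)
```

The default= hook is only called for values json cannot encode itself, so plain strings and numbers pass through untouched.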
I'm simply running each cell in order for the demo notebooks that come with "llm-dolly-chatbot" and running into an Index error related to Chroma.
Cell 10 on Notebook 3 yields error: Index not found, please create an instance before querying
This is the code that generates the error:
def get_similar_docs(question, similar_doc_count):
    return db.similarity_search(question, k=similar_doc_count)
# Let's test it with blackberries:
for doc in get_similar_docs("how to grow blackberry?", 2):
    print(doc.page_content)
Please fix, thanks!
DLT Pipeline failed during Initialization. Below is the error.
java.lang.RuntimeException: Failed to execute python command for notebook '/Users/[email protected]/lakehouse-iot-platform/01-Data-ingestion/01.2-DLT-Wind-Turbine-SQL-UDF' with id RunnableCommandId(5811638086382625936) and error AnsiResult(---------------------------------------------------------------------------
RestException Traceback (most recent call last)
File :6
2 import mlflow
3 # Stage/version
4 # Model name |
5 # | |
----> 6 predict_churn_udf = mlflow.pyfunc.spark_udf(spark, "models:/dbdemos_turbine_maintenance/Production", "string") #use env_manager='virtualenv' to load the model env (instead of pip install the libs)
7 spark.udf.register("predict_maintenance", predict_churn_udf)
Kindly let me know what the issue is and how to fix it.
Hello there!
I just installed the 'delta-lake' demo in our (unity catalog enabled) workspace.
When running %run ./_resources/00-setup $reset_all_data=$reset_all_data, everything worked, but not as expected: I got a "Unity Catalog seems not to be enabled[...]" type message, and Unity Catalog was not used.
I looked into the code and found that the access mode of the created cluster (custom) prevents Unity Catalog from being used. Maybe that is mentioned somewhere and I didn't read properly, but just in case, here is a little heads-up. After setting the created cluster to single-user, everything worked fine and a catalog was created.
Cheers,
Thomas
PS: Everything else worked like a charm with your demos so far!
we need to change the database to all user to have the demo available for multiple
Couldn't create endpoint. Creation response: {'error_code': 'INVALID_PARAMETER_VALUE', 'message': 'SQL warehouse with name dbdemos-shared-endpoint already exists'}
ERROR: couldn't create or get a SQL endpoint for dbdemos. Do you have permission? Trying to import the dashboard without endoint (import will pick the first available if any)
Unauthorized call. Check your PAT token {"message": "You don't have permission to this resource."}
Fix:
Using Databricks dbdemos-lakehouse-c360, I get the following error from a cell:
rm: cannot remove '/dbfs/dbdemos/product/llm/gardening/raw': No such file or directory
The shell script is not in the expected directory (/tmp/gardening) when executing
cp -f Posts.xml /dbfs/dbdemos/product/llm/gardening/raw
I fixed it by adding a new cell immediately after it with the following commands:
%sh
pwd
mkdir -p /dbfs/dbdemos/product/llm/gardening/raw
cp -f /tmp/gardening/Posts.xml /dbfs/dbdemos/product/llm/gardening/raw
%pip install https://github.com/databricks-demos/dbdemos/raw/main/release/dbdemos-0.1-py3-none-any.whl --force
followed by
import dbdemos
dbdemos.list_demos()
dbdemos.install('lakehouse-retail-churn')
results in ...
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<command-4346789072647810> in <cell line: 1>()
----> 1 dbdemos.install('lakehouse-retail-churn')
/local_disk0/.ephemeral_nfs/envs/pythonEnv-35024614-92d6-4db2-880d-1375798b9e17/lib/python3.9/site-packages/dbdemos/dbdemos.py in install(demo_name, path, overwrite, username, pat_token, workspace_url, skip_dashboards, cloud)
185 def install(demo_name, path = None, overwrite = False, username = None, pat_token = None, workspace_url = None, skip_dashboards = False, cloud = "AWS"):
186 installer = Installer(username, pat_token, workspace_url, cloud)
--> 187 installer.install_demo(demo_name, path, overwrite, skip_dashboards = skip_dashboards)
188
189
/local_disk0/.ephemeral_nfs/envs/pythonEnv-35024614-92d6-4db2-880d-1375798b9e17/lib/python3.9/site-packages/dbdemos/installer.py in install_demo(self, demo_name, install_path, overwrite, update_cluster_if_exists, skip_dashboards)
156 self.tracker.track_install(demo_conf.category, demo_name)
157 self.get_current_username()
--> 158 cluster_id, cluster_name = self.load_demo_cluster(demo_name, demo_conf, update_cluster_if_exists)
159 pipeline_ids = self.load_demo_pipelines(demo_name, demo_conf)
160 dashboards = [] if skip_dashboards else self.install_dashboards(demo_conf, install_path)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-35024614-92d6-4db2-880d-1375798b9e17/lib/python3.9/site-packages/dbdemos/installer.py in load_demo_cluster(self, demo_name, demo_conf, update_cluster_if_exists)
563 if existing_cluster == None:
564 cluster = self.db.post("2.0/clusters/create", json = cluster_conf)
--> 565 cluster_conf["cluster_id"] = cluster["cluster_id"]
566 else:
567 cluster_conf["cluster_id"] = existing_cluster["cluster_id"]
KeyError: 'cluster_id'
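The traceback suggests that the response from 2.0/clusters/create did not contain a cluster_id, most likely because the API returned an error payload instead. A minimal sketch of a more defensive lookup follows. The helper `extract_cluster_id` is hypothetical and is not the dbdemos implementation, and the `error_code`/`message` keys are assumed from the error responses quoted elsewhere on this page.

```python
def extract_cluster_id(response: dict) -> str:
    """Return the cluster id, or raise a readable error if creation failed."""
    if "cluster_id" not in response:
        # Assumed error shape: {'error_code': ..., 'message': ...}
        code = response.get("error_code", "UNKNOWN")
        message = response.get("message", "no message returned")
        raise RuntimeError(f"Cluster creation failed ({code}): {message}")
    return response["cluster_id"]


# Illustrative payload, not captured from a real workspace.
try:
    extract_cluster_id({"error_code": "INVALID_PARAMETER_VALUE", "message": "example"})
except RuntimeError as err:
    print(err)
```

Surfacing the API message this way would show the actual cause (for example a policy or instance-type restriction) instead of a bare KeyError.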
I tried to use the full DLT demo, but ran into a few problems.
I had to manually alter the cluster it tried to create so that it uses a cluster policy, since my org requires one. (Resolved, but it would be nice if aspects like that were easily configurable with args.)
It failed to load the lending tree dataset CSV. I couldn't find it anywhere in the workspace samples or in the repo. I looked on Kaggle, but there are too many potential matches, and I have no idea which of the lending tree data sets it is expecting.
TODO: test if the run still exists when getting it from cache
Every time I attempt to install the demo uc-04-system-tables, I receive the following exception:
Installation error: '49187002-01a8-4abf-b140-7795957779b3'
Couldn't create or update a dashboard.
If this is a permission error, we recommend you to search the existing dashboard and delete it manually.
You can skip the dashboard installation with skip_dashboards = True:
dbdemos.install('uc-04-system-tables', skip_dashboards = True)
Of course, skip_dashboards=True does allow the rest of the demo to install successfully, but the dashboards are the most important part of this demo. I tried deleting the entire folder for the dashboards multiple times.
It seems that the ID 49187002-01a8-4abf-b140-7795957779b3 is hard-coded somewhere in the demo?
Tail stacktrace:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-fe1d2c16-4416-444d-be18-03dc0e27f7a5/lib/python3.10/site-packages/dbsqlclone/utils/load_dashboard.py:234, in duplicate_dashboard.<locals>.load_widget(widget)
232 query_id = widget["visualization"]["query"]["id"]
233 visualization_id = widget["visualization"]["id"]
--> 234 visualization_id_clone = dashboard_state["queries"][query_id]["visualizations"][visualization_id]
235 data = {
236 "dashboard_id": new_dashboard["id"],
237 "visualization_id": visualization_id_clone,
(...)
240 "width": widget["width"]
241 }
242 time.sleep(sleep_between_queries)
KeyError: '49187002-01a8-4abf-b140-7795957779b3'
I viewed the dashboard JSON from within the site-packages bundles and confirmed the 4918... UUID appears as the ID of one of the queries:
...
"id": "49187002-01a8-4abf-b140-7795957779b3",
"name": "System Tables - top workspaces",
...
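To narrow down which widgets trigger this KeyError, one could compare the widgets' query IDs against the state that was built while cloning the queries. The sketch below assumes only the dictionary shapes visible in the traceback (widget["visualization"]["query"]["id"] and dashboard_state["queries"][query_id]["visualizations"]); the function name `find_unmapped_widgets` and the sample data are hypothetical and not part of dbsqlclone.

```python
def find_unmapped_widgets(widgets: list, dashboard_state: dict) -> list:
    """List (query_id, visualization_id) pairs that have no cloned counterpart."""
    missing = []
    for widget in widgets:
        visualization = widget.get("visualization")
        if not visualization:
            continue  # assumption: widgets without a visualization need no mapping
        query_id = visualization["query"]["id"]
        viz_id = visualization["id"]
        cloned = dashboard_state["queries"].get(query_id, {}).get("visualizations", {})
        if viz_id not in cloned:
            missing.append((query_id, viz_id))
    return missing


# Illustrative data: query "q2" was never cloned, so its widget is reported.
sample_widgets = [
    {"visualization": {"id": "v1", "query": {"id": "q1"}}},
    {"visualization": {"id": "v2", "query": {"id": "q2"}}},
    {"text": "header"},
]
sample_state = {"queries": {"q1": {"visualizations": {"v1": "v1-clone"}}}}
print(find_unmapped_widgets(sample_widgets, sample_state))  # → [('q2', 'v2')]
```

A non-empty result would point to a query that failed to clone (for instance because of permissions on the catalog), rather than to a hard-coded ID.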
I'd like to work through the dolly-llm-chatbot demo, but I'm stuck on the second code block:
Executing this...
%run ./_resources/00-init $catalog=hive_metastore $db=dbdemos_llm
...results in this:
Notebook not found: Users/<me>/<my-notebook>/_resources/00-init. Notebooks can be specified via a relative path (./Notebook or ../folder/Notebook) or via an absolute path (/Abs/Path/to/Notebook). Make sure you are specifying the path correctly.
I'd prefer to not install any additional packages/etc. I just want to experiment with the demo code itself.
How do I get past this error?
Thank you
Great work on this dbdemos.ai beauty.
I was using dbdemos.install("dlt-loans")
As you can see in the picture below, the hyperlink for the Python version of the notebook is labeled as SQL:
02-DLT-Loan-pipeline-PYTHON: DLT pipeline definition (SQL)
The typo in the text needs to be fixed. It should be:
02-DLT-Loan-pipeline-PYTHON: DLT pipeline definition (PYTHON)
I was going through the Build your Chat Bot with Dolly demo and I encountered an issue when trying to access the notebooks hyperlinked in the tutorial (image below is one of the three notebooks mentioned in the tutorial).
The links redirect us to:
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/$./02-Data-preparation
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/$./03-Q&A-prompt-engineering-for-dolly
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/$./04-chat-bot-prompt-engineering-dolly
The $./ should be removed, as the correct URLs are:
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/02-Data-preparation
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/03-Q&A-prompt-engineering-for-dolly
https://www.dbdemos.ai/minisite/llm-dolly-chatbot/04-chat-bot-prompt-engineering-dolly
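The correction above amounts to dropping the stray $./ path segment from each link. A minimal sketch of that rewrite follows; the helper name `fix_minisite_link` is hypothetical and this is not how the site actually generates its links.

```python
def fix_minisite_link(url: str) -> str:
    """Remove the stray '$./' segment from a minisite notebook link."""
    return url.replace("/$./", "/")


broken = "https://www.dbdemos.ai/minisite/llm-dolly-chatbot/$./02-Data-preparation"
print(fix_minisite_link(broken))
# → https://www.dbdemos.ai/minisite/llm-dolly-chatbot/02-Data-preparation
```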
I have a premium Databricks workspace and am able to install the uc-04-system-tables demo without dashboards using the command below.
dbdemos.install('uc-04-system-tables', use_current_cluster = True, overwrite= True, skip_dashboards = True)
I have to install it outside of the main catalog (which is not accessible), so I skipped dashboard creation. How can I create the dashboards after installation? Is it possible to have the queries and dashboards available as resources that I can import?
Demo installation stops when I run below command...
dbdemos.install('uc-04-system-tables', use_current_cluster = True, overwrite= False, skip_dashboards = False)
The DQ dashboard link refers to an internal Databricks workspace, which of course cannot be opened: https://e2-demo-field-eng.cloud.databricks.com/sql/dashboards/9c87ac15-b19b-4d05-a305-11affd1d8f12-dlt---retail-data-quality-stats?o=1444828305810485