cloud-data-quality's Introduction

Cloud Data Quality Engine

Introduction

CloudDQ is a cloud-native, declarative, and scalable Data Quality validation Command-Line Interface (CLI) application for Google BigQuery. It allows users to define and schedule custom Data Quality checks across their BigQuery tables. Validation results are written to a BigQuery table of the user's choice, where they can be surfaced in dashboards or consumed programmatically to monitor data quality across dashboards and data pipelines.

Key properties:

  • Declarative rules configuration and support for CI/CD
  • In-place validation, without extracting data, of both BigQuery-native tables and structured data in GCS (via BigQuery external tables), benefiting from BigQuery performance and scalability while minimising the security risk surface
  • Validation results endpoints designed for programmatic consumption (persisted BigQuery storage and Cloud Logging sink), allowing custom integrations such as with BI reporting and metadata management tooling.

CloudDQ takes as input Data Quality validation tests defined using declarative YAML configurations. Data Quality Rules can be defined using custom BigQuery SQL logic with parametrization to support complex business rules and reuse. For each Rule Binding definition in the YAML configs, CloudDQ creates a corresponding SQL view in BigQuery. CloudDQ then executes the view using BigQuery SQL Jobs and collects the data quality validation outputs into a summary table for reporting and visualization.
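
For illustration, a minimal configuration might look like the sketch below. The rule, filter, binding, and entity names are illustrative, and TEST_TABLE is assumed to be declared in an entities: section (as in the examples further down this page); treat this as a sketch of the config shape rather than a verified configuration.

rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL

row_filters:
  NONE:
    filter_sql_expr: |-
      True

rule_bindings:
  T1_DQ_1_VALUE_NOT_NULL:
    entity_id: TEST_TABLE
    column_id: VALUE
    row_filter_id: NONE
    rule_ids:
      - NOT_NULL_SIMPLE

Running the CLI against a config of this shape would create one SQL view for the T1_DQ_1_VALUE_NOT_NULL rule binding and collect its validation outputs into the summary table, as described above.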

Data Quality validation execution consumes BigQuery slots. Slots can be provisioned on demand for each run, in which case you pay for the data scanned by the queries. For production usage, we recommend dedicated slot reservations to benefit from predictable flat-rate pricing.

We recommend deploying CloudDQ with the Dataplex Data Quality Task, which provides a managed, serverless deployment, automatic upgrades, and native support for task scheduling.

  • For a high-level overview of CloudDQ's purpose, an explanation of the concepts and how it works, and how to consume the outputs, please see the Overview.
  • For tutorials on how to use CloudDQ, example use cases, and deployment best practices, see the User Manual.
  • We also provide a Reference Guide with the configuration spec and the library reference.
  • For more advanced rules covering more specific requirements, please refer to the Advanced Rules User Manual.

Contributions

We welcome all community contributions, whether by opening GitHub Issues, updating documentation, or updating the code directly. Please consult the contribution guide for details on how to contribute.

Before opening a pull request to suggest a feature change, please open a GitHub Issue to discuss the use case and feature proposal with the project maintainers.

Feedback / Questions

For any feedback or questions, please feel free to get in touch at clouddq at google.com.

License

CloudDQ is licensed under the Apache License version 2.0. This is not an official Google product.

cloud-data-quality's People

Contributors

amandeepsinghcs, ant-laz, charleskubicek, dependabot[bot], hejnal, jaybana, pbalm, shourya116, shuuji3, thinhha

cloud-data-quality's Issues

Storing configs in gcs bucket

Hi, I wanted to ask whether it would be possible to store the Data Quality configs in a Cloud Storage bucket instead of the local "configs" folder, for example by defining rule_binding_config_path as "gs://bucket_name/path/to/configs".

Thank you.

Broken impersonate account flow

Hi team,

I found a permission issue when using an impersonated service account. The reason is that the code does not pass the gcp_impersonation_credentials here when we generate the connection config again after initialising the dbt runner.

Allow multiple row_filters and params to row_filters

Is there any reason not to allow multiple row_filters, or parameters for row_filters, like rule_ids?
While using CloudDQ, I have run into situations where this would be useful.

If parameters were allowed, row filters could be reused more generally: one row_filter could be used across many entities.
Sometimes we also want to filter on a column other than the one used in the rule binding.
I suggest modifying the configs as below, so that the email row_filter can be used with any entity and any given column.

row_filters:
  NONE:
    filter_sql_expr: |-
      True

  DATA_TYPE_EMAIL:
    params: |-
      - column
    filter_sql_expr: |-
      $column = 'email'

rule_bindings:
  T2_DQ_1_EMAIL:
    entity_id: TEST_TABLE
    column_id: VALUE
    row_filter_ids:
      - DATA_TYPE_EMAIL:
           column: contact_type
    rule_ids:
      - NOT_NULL_SIMPLE
      - REGEX_VALID_EMAIL
      - CUSTOM_SQL_LENGTH:
          upper_bound: 30
      - NOT_BLANK
    metadata:
      brand: one

Is there any way to do this today? If not, could I add it?

refactor code to allow cross-platform support

Right now the code only runs on BigQuery.

To add cross-platform support, at minimum we need to:

  • add a flag to the CLI to indicate the target platform
  • use this flag to perform query dry-run to validate SQL
  • check that the valid dbt configs for the target platform are provided
  • filter out rule bindings that do not target an entity in the corresponding platform
  • (ideally) refactor the query construction into a builder class, then pass that into a query engine class to execute, rather than passing raw SQL around. This builder class can customize the SQL to the target platform.
  • Allow specifying where the DQ summary results should be written to (e.g. validate data in GCS but write validation results to BigQuery). This requires decoupling the steps to generate the validation results and write them to the target sink.
  • Ensure all SQL and Jinja are cross-platform compatible.
  • Add automated integration test suites for running the CLI against different platforms

column_id is null in dq_summary for Rule Type CUSTOM_SQL_STATEMENT

In a Rule Type CUSTOM_SQL_STATEMENT, the column_id is always null in dq_summary.
But in the documentation:

"The rule binding sets the value of the column parameter through the column_id field and the custom parameters are set explicitly by listing them on each rule reference."

add more default rule types

We should be able to implement all of dbt's schema tests as CUSTOM_SQL_STATEMENT. There may be other rules we can implement from https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html#dataset.
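
For instance, dbt's unique schema test might be approximated as a CUSTOM_SQL_STATEMENT rule along the following lines. This is only a sketch: the rule name is made up, and the layout (a custom_sql_statement param selecting failing rows from the data alias) is assumed to mirror the custom SQL examples elsewhere on this page rather than a verified rule definition.

rules:
  VALUE_IS_UNIQUE:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_statement: |-
        -- assumption: rows returned by this statement are counted as failures
        select $column as duplicated_value
        from data
        group by $column
        having count(*) > 1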

We should consider how to allow rules that reference multiple tables using references to tables (with overrides). The current alternative is to parameterize the table names.

If it makes sense, we may want to implement the rule directly in https://github.com/GoogleCloudPlatform/cloud-data-quality/blob/main/clouddq/classes/rule_type.py

Enhancement: Better diagnostics for errors.

While studying CloudDQ, I created a configuration file and used it in an execution. I obviously made a mistake in my config file, which resulted in the error: ERROR malformed JSON. The issue I am reporting here is about "serviceability": based on the information presented in this error output, how can I narrow down my issue? Is there an opportunity for the project to offer better diagnostics?

python3 clouddq_executable_v0.5.2_debian_10_python3.9.zip \
    ALL \
    test1.yaml \
    --gcp_project_id="${GOOGLE_CLOUD_PROJECT}" \
    --gcp_bq_dataset_id="${CLOUDDQ_BIGQUERY_DATASET}" \
    --gcp_region_id="${CLOUDDQ_BIGQUERY_REGION}" \
    --print_sql_queries \
    --target_bigquery_summary_table="${CLOUDDQ_TARGET_BIGQUERY_TABLE}"
kolban@cloudshell:~/clouddq (kolban-dataplex)$ ./run
Your active configuration is: [cloudshell-11175]
2022-03-05 16:12:46 cs-158005325041-default clouddq.integration.gcp_credentials[1748] INFO Successfully created GCP Client.
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Starting CloudDQ run with configs:
{"clouddq_run_configs": {"rule_binding_ids": "ALL", "rule_binding_config_path": "test1.yaml", "dbt_path": null, "dbt_profiles_dir": null, "environment_target": "dev", "gcp_project_id": "kolban-dataplex", "gcp_region_id": "us-central1", "gcp_bq_dataset_id": "dq_output", "gcp_service_account_key_path": null, "gcp_impersonation_credentials": null, "metadata": "{}", "dry_run": false, "progress_watermark": true, "target_bigquery_summary_table": "kolban-dataplex.dq_output.target", "debug": false, "print_sql_queries": true, "skip_sql_validation": false, "summary_to_stdout": false, "enable_experimental_bigquery_entity_uris": false, "enable_experimental_dataplex_gcs_validation": false, "bigquery_client": null, "gcp_credentials": {"credentials": "<google.auth.compute_engine.credentials.Credentials object at 0x7fa1709d60a0>", "project_id": "kolban-dataplex", "user_id": "[email protected]"}}}
2022-03-05 16:12:46 cs-158005325041-default clouddq.runners.dbt.dbt_connection_configs[1748] INFO Using Application-Default Credentials (ADC) to authenticate to GCP...
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Writing rule_binding views and intermediate summary results to BigQuery dq_summary_table_name: `kolban-dataplex.dq_output.dq_summary`. 
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Using dq_summary_dataset: kolban-dataplex.dq_output
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Using target_bigquery_summary_table: `kolban-dataplex.dq_output.target`. 
2022-03-05 16:12:47 cs-158005325041-default clouddq[1748] INFO Preparing SQL for rule bindings: ['SALES_WIDGET', 'SALES_QUANTITY']
2022-03-05 16:12:47 cs-158005325041-default clouddq[1748] ERROR malformed JSON
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/main.py", line 486, in main
    configs_cache.get_entities_configs_from_rule_bindings(
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/classes/dq_configs_cache.py", line 354, in get_entities_configs_from_rule_bindings
    for record in self._cache_db.query(query):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 410, in query
    cursor = self.execute(sql, params or tuple())
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 422, in execute
    return self.conn.execute(sql, parameters)
sqlite3.OperationalError: malformed JSON
"malformed JSON"
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/main.py", line 486, in main
    configs_cache.get_entities_configs_from_rule_bindings(
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/classes/dq_configs_cache.py", line 354, in get_entities_configs_from_rule_bindings
    for record in self._cache_db.query(query):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 410, in query
    cursor = self.execute(sql, params or tuple())
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 422, in execute
    return self.conn.execute(sql, parameters)
sqlite3.OperationalError: malformed JSON


malformed JSON
Waiting up to 5 seconds.
Sent all pending logs.

Update the rule_binding value using the input parameters

I have a reporting/snapshot date column and would like to apply it as a filter when running some validations. The value for this column needs to be passed as an input to the main function. Is this possible?
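
For what it's worth, with the configs shown elsewhere on this page, the closest approximation appears to be a dedicated row filter with the snapshot date written inline; passing the value in from the CLI is precisely the part that is not shown anywhere here. The column, filter, and binding names below are hypothetical.

row_filters:
  SNAPSHOT_DATE_FILTER:
    filter_sql_expr: |-
      -- assumption: the reporting/snapshot column is named snapshot_date
      snapshot_date = DATE '2024-01-01'

rule_bindings:
  MY_SNAPSHOT_BINDING:
    entity_id: MY_TABLE
    column_id: VALUE
    row_filter_id: SNAPSHOT_DATE_FILTER
    rule_ids:
      - NOT_NULL_SIMPLE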

Unable to run clouddq - FileNotFoundError

I was able to set up my local environment based on the instructions below:
https://github.com/GoogleCloudPlatform/cloud-data-quality/blob/main/docs/getting-started-with-default-configs.md

However, running the clouddq command failed:

python3 clouddq \
    T2_DQ_1_EMAIL \
    configs \
    --metadata='{"test":"test"}' \
    --dbt_profiles_dir=. \
    --dbt_path=. \
    --environment_target=dev

FileNotFoundError: [Errno 2] No such file or directory: '~/dev/mms/cloud-data-quality/env/lib/python3.8/site-packages/dbt_project.yml'

uname -a
Darwin mac.local 20.3.0 Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64 x86_64
python3 --version
Python 3.8.6
pip --version
pip 21.1.2 from ~/dev/mms/cloud-data-quality/env/lib/python3.8/site-packages/pip (python 3.8)

Run row-level checks on individual groups of a column

Hi.
I come from another framework where it is possible to perform group-level checks. Consider a volume check declared as a custom SQL statement grouped by a column. What I'd like to see is a scan result for each individual group of the column, equivalent to running row-level checks on each individual group.

Is this something that can be achieved? I have been sifting through the documentation without success, as well as trying some custom statements when setting up tasks in the Dataplex console.

BR
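
One rough approximation, assuming a CUSTOM_SQL_STATEMENT rule whose returned rows are counted as failures (mirroring the custom SQL examples elsewhere on this page), is to emit one row per failing group. The column name group_col and the min_rows argument below are hypothetical.

rules:
  GROUP_VOLUME_CHECK:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_arguments:
        - min_rows
      custom_sql_statement: |-
        -- one returned row per group whose volume falls below the threshold
        select group_col as failing_group
        from data
        group by group_col
        having count(*) < $min_rows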

Views in BigQuery

Hi, maybe this is not the place to ask, but I'll ask anyway. When I run CloudDQ, in the BigQuery dataset where it creates the dq_summary table it also creates a lot of views. Is it necessary to keep them permanently? I have a lot of rule_bindings, so these views take up a lot of space, which is annoying. Could these views perhaps be temporary?

Allow environment variable substitution in yaml files

It would be useful if environment variable substitution were allowed in yaml files. A usage example would be referring to a different table depending on the environment we are in:

entities:
  MY_TABLE:
    source_database: BIGQUERY
    table_name: ${GOOGLE_CLOUD_PROJECT}.mydataset.mytable
...

# or

rule_bindings:
  MY_BINDING:
    entity_uri: bigquery://projects/${GOOGLE_CLOUD_PROJECT}/datasets/mydataset/tables/mytable
...

Implement tagging for rule binding and tags execution

It would be really helpful if we could tag the rule bindings, and then execute the DQ checks for those tagged rule bindings only, similar to how it's done in dbt. Something like:

rule_bindings:
    my_rule_binding:
        tags: ["weekly"]
...

and then:

python3 clouddq_executable.zip tag:weekly ...

Would this be something easy to implement?

Consistency Rules not working on Dataplex

Hello!

I'm using Data Quality Tasks on Dataplex with a consistency rule (using a SQL CASE WHEN), and I get an error. Looking for an example of a rule of this dimension (consistency), I can't find one.

This is my .yaml file:

rules:
  CONSISTENCY_MINSS:
    rule_type: CUSTOM_SQL_EXPR
    dimension: consistency
    params:
      custom_sql_arguments:
        - ref_data_dataset
        - ref_data_table_id
      custom_sql_expr: |-
        case
          when ($column IN (0,10,100)) or ($column is null) then true
          else false end
        from data a
        inner join $ref_data_project.$ref_data_dataset.$ref_data_table_id b
        on a.loc = b.loc

rule_dimensions:
  - consistency

row_filters:
  _PARTITIONTIME:
    filter_sql_expr: |-
      DATE(_PARTITIONTIME) = '2024-03-03'

rule_bindings:
  MY-RULE-BINDING-CONSISTENCY-MINSS:
    entity_uri: bigquery://projects/my-gcp-project-ENV/locations/US/datasets/my_gcp_dataset_ENV/tables/my_bigquery_tablename_ENV
    column_id: column_name
    row_filter_id: _PARTITIONTIME
    rule_ids:
      - CONSISTENCY_MINSS:
          - ref_data_project: df-datalake-transformed-ENV
          - ref_data_dataset: my_gcp_dataset_ENV
          - ref_data_table_id: my_gcp_table_ENV
    metadata:
      manual_column_id: minss

This is the error in the output on Dataproc:

metadata_registry_defaults:
  dataplex:
    projects:
    locations:
    lakes:
    zones:

2024-03-12 18:35:10 gdpic-srvls-batch-9e830eba-2731-47cc-beeb-3fdf1ea8b20f-m clouddq[80] ERROR Failed to resolve Rule Binding ID 'MY-RULE-BINDING-CONSISTENCY-MINSS' with error:
Failed to resolve rule_id 'CONSISTENCY_MINSS' in rule_binding_id 'MY-RULE-BINDING-CONSISTENCY-MINSS' with error:
'list' object has no attribute 'get'
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/dq_rule_binding.py", line 213, in resolve_rule_sql_expr
    rule.resolve_sql_expr()
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/dq_rule.py", line 122, in resolve_sql_expr
    self.rule_sql_expr = self.rule_type.to_sql(self.params).safe_substitute()
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/rule_type.py", line 187, in to_sql
    return to_sql_custom_sql_expr(params)
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/rule_type.py", line 96, in to_sql_custom_sql_expr
    if params.get("rule_binding_arguments", {}).get(argument, None) is None:
AttributeError: 'list' object has no attribute 'get'

use more standardised schema validation

CloudDQ currently parses yaml into Python dictionaries and validates the expected fields in the Python classes directly.

This was done to 1) minimize dependencies and 2) allow more meaningful error messages. The downside is that we need to write a lot more code to do the same thing, and we can't take advantage of native schema-validation capabilities in JSON Schema or cross-platform support in protobuf.

We may want to consider using pydantic to manage the schema definitions instead.

clouddq-as-dataproc-workflow-composer-dag.md

Hi Team,

I tried to execute the CloudDQ utility using a Dataproc workflow template, but got an error during template execution.

Clouddq version = 1.0.7
Dataproc os version = Debian 11
Executable file version = 1.0.7_debian11_python3.9

Got the error below:
cannot import name 'WKBWriter' from 'shapely.geos'

Note: I did this in a GCP trial account with a custom service account.

add 'failed_values' column to dq_summary table

When using CloudDQ, I felt the need to check the actual failing values in dq_summary.
So I would like your opinion on adding a 'failed_values' column to dq_summary.

We only need a few lines to add failed_values, as shown in the ARRAY(...) AS failed_values expression below.

main.sql

~
        null_count,
        null_percentage,
        ARRAY(
            SELECT column_value 
            FROM {{ ref(entity_dq_statistics_model) }} 
            WHERE column_value IS NOT NULL AND simple_rule_row_is_valid is False LIMIT 500
         ) AS failed_values
    FROM
        {{ ref(entity_dq_statistics_model) }}
~

When following USERMANUAL.md I get this error: TypeError: __init__() got an unexpected keyword argument 'unbound_message'

I am using Cloud Shell, and when I try to run the first command with CloudDQ I get this:

$ python3 clouddq_executable.zip --help
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/clouddq/clouddq/main.py", line 32, in <module>
    from clouddq.integration.bigquery.dq_target_table_utils import TargetTable
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/clouddq/clouddq/integration/bigquery/dq_target_table_utils.py", line 24, in <module>
    from clouddq.log import JsonEncoderDatetime
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/clouddq/clouddq/log.py", line 23, in <module>
    from google.cloud.logging.handlers import CloudLoggingHandler
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging/__init__.py", line 18, in <module>
    from google.cloud.logging_v2 import __version__
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/__init__.py", line 25, in <module>
    from google.cloud.logging_v2.client import Client
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/client.py", line 37, in <module>
    from google.cloud.logging_v2.handlers import CloudLoggingHandler
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/handlers/__init__.py", line 17, in <module>
    from google.cloud.logging_v2.handlers.app_engine import AppEngineHandler
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/handlers/app_engine.py", line 24, in <module>
    from google.cloud.logging_v2.handlers._helpers import get_request_data
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/handlers/_helpers.py", line 22, in <module>
    import flask
  File "/usr/local/lib/python3.9/dist-packages/flask/__init__.py", line 4, in <module>
    from . import json as json
  File "/usr/local/lib/python3.9/dist-packages/flask/json/__init__.py", line 8, in <module>
    from ..globals import current_app
  File "/usr/local/lib/python3.9/dist-packages/flask/globals.py", line 56, in <module>
    app_ctx: "AppContext" = LocalProxy(  # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'

I tried with versions 1.0.0 and 1.0.7.

Cannot locate dbt_extractor

To have a setup that can run on my Mac, I'm building a Debian 11-based Dockerfile containing the clouddq_executable.
However, when executing the zip file, I'm encountering the following error:

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/clouddq/clouddq/main.py", line 39, in <module>
    from clouddq.runners.dbt.dbt_runner import DbtRunner
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/clouddq/clouddq/runners/dbt/dbt_runner.py", line 26, in <module>
    from clouddq.runners.dbt.dbt_utils import run_dbt
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/clouddq/clouddq/runners/dbt/dbt_utils.py", line 25, in <module>
    from dbt.main import main as dbt
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/main.py", line 18, in <module>
    import dbt.task.build as build_task
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/build.py", line 1, in <module>
    from .run import RunTask, ModelRunner as run_model_runner
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/run.py", line 8, in <module>
    from .compile import CompileRunner, CompileTask
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/compile.py", line 3, in <module>
    from .runnable import GraphRunnableTask
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/runnable.py", line 54, in <module>
    from dbt.parser.manifest import ManifestLoader
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/parser/__init__.py", line 8, in <module>
    from .models import ModelParser  # noqa
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/parser/models.py", line 17, in <module>
    from dbt_extractor import ExtractionError, py_extract_from_source  # type: ignore
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_extractor/dbt_extractor/__init__.py", line 1, in <module>
    from .dbt_extractor import *
ImportError: /tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_extractor/dbt_extractor/dbt_extractor.abi3.so: cannot open shared object file: No such file or directory

My Dockerfile looks as follows:

ARG PYTHON_VERSION=3.9.7
ARG TARGET_PYTHON_INTERPRETER=3.9
FROM python:${PYTHON_VERSION}-slim-bullseye

ARG TARGET_OS=debian_11
ARG TARGET_PYTHON_INTERPRETER
ARG CLOUDDQ_RELEASE_VERSION=1.0.3

# Replace http with https
RUN sed -i 's/http/https/g' /etc/apt/sources.list
RUN apt-get update -y && apt-get upgrade -y && apt-get -y install wget

RUN wget -O \
    clouddq_executable.zip \
    https://github.com/GoogleCloudPlatform/cloud-data-quality/releases/download/v${CLOUDDQ_RELEASE_VERSION}/clouddq_executable_v${CLOUDDQ_RELEASE_VERSION}_${TARGET_OS}_python${TARGET_PYTHON_INTERPRETER}.zip

ENTRYPOINT [ "python3", "clouddq_executable.zip", "--help" ]

Any help or pointers on why the dbt_extractor.abi3.so module can't seem to be located are greatly appreciated.
