cloud-data-quality's Introduction

Cloud Data Quality Engine

Introduction

CloudDQ is a cloud-native, declarative, and scalable Data Quality validation Command-Line Interface (CLI) application for Google BigQuery. It allows users to define and schedule custom Data Quality checks across their BigQuery tables. Validation results are written to a BigQuery table of the user's choice, where they can be surfaced in dashboards or consumed programmatically to monitor data quality across dashboards and data pipelines.

Key properties:

  • Declarative rules configuration and support for CI/CD
  • In-place validation, without extracting data, of both BigQuery-native tables and structured data in GCS (via BigQuery external tables), benefiting from BigQuery performance and scalability while minimising the security risk surface
  • Validation results endpoints designed for programmatic consumption (persisted BigQuery storage and Cloud Logging sink), allowing custom integrations such as with BI reporting and metadata management tooling.

CloudDQ takes as input Data Quality validation tests defined using declarative YAML configurations. Data Quality Rules can be defined using custom BigQuery SQL logic with parametrization to support complex business rules and reuse. For each Rule Binding definition in the YAML configs, CloudDQ creates a corresponding SQL view in BigQuery. CloudDQ then executes the view using BigQuery SQL Jobs and collects the data quality validation outputs into a summary table for reporting and visualization.
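
For illustration, a minimal configuration might look like the sketch below. The rule, filter, binding, and entity names are illustrative, and TEST_TABLE is assumed to be declared in an entities: section (as in the examples further down this page); treat this as a sketch of the config shape rather than a verified configuration.

rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL

row_filters:
  NONE:
    filter_sql_expr: |-
      True

rule_bindings:
  T1_DQ_1_VALUE_NOT_NULL:
    entity_id: TEST_TABLE
    column_id: VALUE
    row_filter_id: NONE
    rule_ids:
      - NOT_NULL_SIMPLE

Running the CLI against a config of this shape would create one SQL view for the T1_DQ_1_VALUE_NOT_NULL rule binding and collect its validation outputs into the summary table, as described above.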

Data Quality validation execution consumes BigQuery slots. Slots can be provisioned on demand for each run, in which case you pay for the data scanned by the queries. For production usage, we recommend dedicated slot reservations to benefit from predictable flat-rate pricing.

We recommend deploying CloudDQ with the Dataplex Data Quality Task, which provides a managed, serverless deployment, automatic upgrades, and native support for task scheduling.

  • For a high-level overview of CloudDQ's purpose, an explanation of the concepts and how it works, and how to consume the outputs, please see the Overview.
  • For tutorials on how to use CloudDQ, example use cases, and deployment best practices, see the User Manual.
  • We also provide a Reference Guide with the configuration spec and the library reference.
  • For more advanced rules covering more specific requirements, please refer to the Advanced Rules User Manual.

Contributions

We welcome all community contributions, whether by opening GitHub Issues, updating documentation, or updating the code directly. Please consult the contribution guide for details on how to contribute.

Before opening a pull request to suggest a feature change, please open a GitHub Issue to discuss the use case and feature proposal with the project maintainers.

Feedback / Questions

For any feedback or questions, please feel free to get in touch at clouddq at google.com.

License

CloudDQ is licensed under the Apache License version 2.0. This is not an official Google product.

cloud-data-quality's People

Contributors

amandeepsinghcs, ant-laz, charleskubicek, dependabot[bot], hejnal, jaybana, pbalm, shourya116, shuuji3, thinhha

cloud-data-quality's Issues

Storing configs in gcs bucket

Hi, I wanted to ask whether it would be possible to store the Data Quality configs in a Cloud Storage bucket instead of the local "configs" folder, for example by defining rule_binding_config_path as "gs://bucket_name/path/to/configs".

Thank you.

Broken impersonate account flow

Hi team,

I found a permission issue when using an impersonated service account. The reason is that the code does not pass the gcp_impersonation_credentials here when we generate the connection config again after initialising the dbt runner.

Allow multiple row_filters and params to row_filters

Is there any reason not to allow multiple row_filters, or parameters for row_filters, like rule_ids?
While using CloudDQ, I have run into situations where this would be useful.

If parameters were allowed, row filters could be reused more generally: one row_filter could be used across many entities.
Sometimes we also want to filter on a column other than the one used in the rule binding.
I suggest modifying the configs as below, so that the email row_filter can be used with any entity and any given column.

row_filters:
  NONE:
    filter_sql_expr: |-
      True

  DATA_TYPE_EMAIL:
    params: |-
      - column
    filter_sql_expr: |-
      $column = 'email'

rule_bindings:
  T2_DQ_1_EMAIL:
    entity_id: TEST_TABLE
    column_id: VALUE
    row_filter_ids:
      - DATA_TYPE_EMAIL:
           column: contact_type
    rule_ids:
      - NOT_NULL_SIMPLE
      - REGEX_VALID_EMAIL
      - CUSTOM_SQL_LENGTH:
          upper_bound: 30
      - NOT_BLANK
    metadata:
      brand: one

Is there any way to do this today? If not, could I add it?

refactor code to allow cross-platform support

Right now the code only runs on BigQuery.

To add cross-platform support, at minimum we need to:

  • add a flag to the CLI to indicate the target platform
  • use this flag to perform query dry-run to validate SQL
  • check that the valid dbt configs for the target platform are provided
  • filter out rule bindings that do not target an entity in the corresponding platform
  • (ideally) refactor the query construction into a builder class, then pass that into a query engine class to execute, rather than passing raw SQL around. This builder class can customize the SQL to the target platform.
  • Allow specifying where the DQ summary results should be written to (e.g. validate data in GCS but write validation results to BigQuery). This requires decoupling the steps to generate the validation results and write them to the target sink.
  • Ensure all SQL and Jinja are cross-platform compatible.
  • Add automated integration test suites for running the CLI against different platforms

column_id is null in dq_summary for Rule Type CUSTOM_SQL_STATEMENT

In a Rule Type CUSTOM_SQL_STATEMENT, the column_id is always null in dq_summary.
But in the documentation:

"The rule binding sets the value of the column parameter through the column_id field and the custom parameters are set explicitly by listing them on each rule reference."

add more default rule types

We should be able to implement all of dbt's schema tests as CUSTOM_SQL_STATEMENT. There may be other rules we can implement from https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html#dataset.
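
For instance, dbt's unique schema test might be approximated as a CUSTOM_SQL_STATEMENT rule along the following lines. This is only a sketch: the rule name is made up, and the layout (a custom_sql_statement param selecting failing rows from the data alias) is assumed to mirror the custom SQL examples elsewhere on this page rather than a verified rule definition.

rules:
  VALUE_IS_UNIQUE:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_statement: |-
        -- assumption: rows returned by this statement are counted as failures
        select $column as duplicated_value
        from data
        group by $column
        having count(*) > 1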

We should consider how to allow rules that reference multiple tables using references to tables (with overrides). The current alternative is to parameterize the table names.

If it makes sense, we may want to implement the rule directly in https://github.com/GoogleCloudPlatform/cloud-data-quality/blob/main/clouddq/classes/rule_type.py

Enhancement: Better diagnostics for errors.

While studying CloudDQ, I created a configuration file and used it in an execution. I obviously made a mistake in my config file, which resulted in the error: ERROR malformed JSON. The issue I am reporting here is about "serviceability": based on the information presented in this error output, how can I narrow down my issue? Is there an opportunity for the project to offer better diagnostics?

python3 clouddq_executable_v0.5.2_debian_10_python3.9.zip \
    ALL \
    test1.yaml \
    --gcp_project_id="${GOOGLE_CLOUD_PROJECT}" \
    --gcp_bq_dataset_id="${CLOUDDQ_BIGQUERY_DATASET}" \
    --gcp_region_id="${CLOUDDQ_BIGQUERY_REGION}" \
    --print_sql_queries \
    --target_bigquery_summary_table="${CLOUDDQ_TARGET_BIGQUERY_TABLE}"
kolban@cloudshell:~/clouddq (kolban-dataplex)$ ./run
Your active configuration is: [cloudshell-11175]
2022-03-05 16:12:46 cs-158005325041-default clouddq.integration.gcp_credentials[1748] INFO Successfully created GCP Client.
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Starting CloudDQ run with configs:
{"clouddq_run_configs": {"rule_binding_ids": "ALL", "rule_binding_config_path": "test1.yaml", "dbt_path": null, "dbt_profiles_dir": null, "environment_target": "dev", "gcp_project_id": "kolban-dataplex", "gcp_region_id": "us-central1", "gcp_bq_dataset_id": "dq_output", "gcp_service_account_key_path": null, "gcp_impersonation_credentials": null, "metadata": "{}", "dry_run": false, "progress_watermark": true, "target_bigquery_summary_table": "kolban-dataplex.dq_output.target", "debug": false, "print_sql_queries": true, "skip_sql_validation": false, "summary_to_stdout": false, "enable_experimental_bigquery_entity_uris": false, "enable_experimental_dataplex_gcs_validation": false, "bigquery_client": null, "gcp_credentials": {"credentials": "<google.auth.compute_engine.credentials.Credentials object at 0x7fa1709d60a0>", "project_id": "kolban-dataplex", "user_id": "[email protected]"}}}
2022-03-05 16:12:46 cs-158005325041-default clouddq.runners.dbt.dbt_connection_configs[1748] INFO Using Application-Default Credentials (ADC) to authenticate to GCP...
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Writing rule_binding views and intermediate summary results to BigQuery dq_summary_table_name: `kolban-dataplex.dq_output.dq_summary`. 
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Using dq_summary_dataset: kolban-dataplex.dq_output
2022-03-05 16:12:46 cs-158005325041-default clouddq[1748] INFO Using target_bigquery_summary_table: `kolban-dataplex.dq_output.target`. 
2022-03-05 16:12:47 cs-158005325041-default clouddq[1748] INFO Preparing SQL for rule bindings: ['SALES_WIDGET', 'SALES_QUANTITY']
2022-03-05 16:12:47 cs-158005325041-default clouddq[1748] ERROR malformed JSON
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/main.py", line 486, in main
    configs_cache.get_entities_configs_from_rule_bindings(
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/classes/dq_configs_cache.py", line 354, in get_entities_configs_from_rule_bindings
    for record in self._cache_db.query(query):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 410, in query
    cursor = self.execute(sql, params or tuple())
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 422, in execute
    return self.conn.execute(sql, parameters)
sqlite3.OperationalError: malformed JSON
"malformed JSON"
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/main.py", line 486, in main
    configs_cache.get_entities_configs_from_rule_bindings(
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/clouddq/clouddq/classes/dq_configs_cache.py", line 354, in get_entities_configs_from_rule_bindings
    for record in self._cache_db.query(query):
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 410, in query
    cursor = self.execute(sql, params or tuple())
  File "/tmp/Bazel.runfiles_b3f80ukr/runfiles/py_deps/pypi__sqlite_utils/sqlite_utils/db.py", line 422, in execute
    return self.conn.execute(sql, parameters)
sqlite3.OperationalError: malformed JSON


malformed JSON
Waiting up to 5 seconds.
Sent all pending logs.

Update the rule_binding value using the input parameters

I have a reporting/snapshot date column and would like to apply it as a filter when running some validations. The value for this column needs to be passed as an input to the main function. Is this possible?
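
For what it's worth, with the configs shown elsewhere on this page, the closest approximation appears to be a dedicated row filter with the snapshot date written inline; passing the value in from the CLI is precisely the part that is not shown anywhere here. The column, filter, and binding names below are hypothetical.

row_filters:
  SNAPSHOT_DATE_FILTER:
    filter_sql_expr: |-
      -- assumption: the reporting/snapshot column is named snapshot_date
      snapshot_date = DATE '2024-01-01'

rule_bindings:
  MY_SNAPSHOT_BINDING:
    entity_id: MY_TABLE
    column_id: VALUE
    row_filter_id: SNAPSHOT_DATE_FILTER
    rule_ids:
      - NOT_NULL_SIMPLE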

Unable to run clouddq - FileNotFoundError

I was able to set up my local environment based on the instructions below:
https://github.com/GoogleCloudPlatform/cloud-data-quality/blob/main/docs/getting-started-with-default-configs.md

However, running the clouddq command failed:

python3 clouddq \
    T2_DQ_1_EMAIL \
    configs \
    --metadata='{"test":"test"}' \
    --dbt_profiles_dir=. \
    --dbt_path=. \
    --environment_target=dev

FileNotFoundError: [Errno 2] No such file or directory: '~/dev/mms/cloud-data-quality/env/lib/python3.8/site-packages/dbt_project.yml'

uname -a
Darwin mac.local 20.3.0 Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64 x86_64
python3 --version
Python 3.8.6
pip --version
pip 21.1.2 from ~/dev/mms/cloud-data-quality/env/lib/python3.8/site-packages/pip (python 3.8)

Run row-level checks on individual groups of a column

Hi.
I come from another framework where it is possible to perform group-level checks. Consider a volume check declared as a custom SQL statement grouped by a column. What I'd like to see is a scan result for each individual group of the column, equivalent to running row-level checks on each individual group.

Is this something that can be achieved? I have been sifting through the documentation without success, as well as trying some custom statements when setting up tasks in the Dataplex console.

BR
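
One rough approximation, assuming a CUSTOM_SQL_STATEMENT rule whose returned rows are counted as failures (mirroring the custom SQL examples elsewhere on this page), is to emit one row per failing group. The column name group_col and the min_rows argument below are hypothetical.

rules:
  GROUP_VOLUME_CHECK:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_arguments:
        - min_rows
      custom_sql_statement: |-
        -- one returned row per group whose volume falls below the threshold
        select group_col as failing_group
        from data
        group by group_col
        having count(*) < $min_rows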

Views in BigQuery

Hi, maybe this is not the place to ask, but I'll ask anyway. When I run CloudDQ, in the BigQuery dataset where it creates the dq_summary table it also creates a lot of views. Is it necessary to keep them permanently? I have a lot of rule_bindings, so these views take up a lot of space, which is annoying. Could these views perhaps be temporary?

Allow environment variable substitution in yaml files

It would be useful if environment variable substitution were allowed in yaml files. A usage example would be referring to a different table depending on the environment we are in:

entities:
  MY_TABLE:
    source_database: BIGQUERY
    table_name: ${GOOGLE_CLOUD_PROJECT}.mydataset.mytable
...

# or

rule_bindings:
  MY_BINDING:
    entity_uri: bigquery://projects/${GOOGLE_CLOUD_PROJECT}/datasets/mydataset/tables/mytable
...

Implement tagging for rule binding and tags execution

It would be really helpful if we could tag the rule bindings, and then execute the DQ checks for those tagged rule bindings only, similar to how it's done in dbt. Something like:

rule_bindings:
    my_rule_binding:
        tags: ["weekly"]
...

and then:

python3 clouddq_executable.zip tag:weekly ...

Would this be something easy to implement?

Consistency Rules not working on Dataplex

Hello!

I'm using Data Quality Tasks on Dataplex with a consistency rule (using a SQL CASE WHEN), and I get an error. Looking for an example of a rule of this dimension (consistency), I can't find one.

This is my .yaml file:

rules:
  CONSISTENCY_MINSS:
    rule_type: CUSTOM_SQL_EXPR
    dimension: consistency
    params:
      custom_sql_arguments:
        - ref_data_dataset
        - ref_data_table_id
      custom_sql_expr: |-
        case
          when ($column IN (0,10,100)) or ($column is null) then true
          else false end
        from data a
        inner join $ref_data_project.$ref_data_dataset.$ref_data_table_id b
        on a.loc = b.loc

rule_dimensions:
  - consistency

row_filters:
  _PARTITIONTIME:
    filter_sql_expr: |-
      DATE(_PARTITIONTIME) = '2024-03-03'

rule_bindings:
  MY-RULE-BINDING-CONSISTENCY-MINSS:
    entity_uri: bigquery://projects/my-gcp-project-ENV/locations/US/datasets/my_gcp_dataset_ENV/tables/my_bigquery_tablename_ENV
    column_id: column_name
    row_filter_id: _PARTITIONTIME
    rule_ids:
      - CONSISTENCY_MINSS:
          - ref_data_project: df-datalake-transformed-ENV
          - ref_data_dataset: my_gcp_dataset_ENV
          - ref_data_table_id: my_gcp_table_ENV
    metadata:
      manual_column_id: minss

This is the error in the output on Dataproc:

metadata_registry_defaults:
  dataplex:
    projects:
    locations:
    lakes:
    zones:

2024-03-12 18:35:10 gdpic-srvls-batch-9e830eba-2731-47cc-beeb-3fdf1ea8b20f-m clouddq[80] ERROR Failed to resolve Rule Binding ID 'MY-RULE-BINDING-CONSISTENCY-MINSS' with error:
Failed to resolve rule_id 'CONSISTENCY_MINSS' in rule_binding_id 'MY-RULE-BINDING-CONSISTENCY-MINSS' with error:
'list' object has no attribute 'get'
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/dq_rule_binding.py", line 213, in resolve_rule_sql_expr
    rule.resolve_sql_expr()
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/dq_rule.py", line 122, in resolve_sql_expr
    self.rule_sql_expr = self.rule_type.to_sql(self.params).safe_substitute()
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/rule_type.py", line 187, in to_sql
    return to_sql_custom_sql_expr(params)
  File "/tmp/Bazel.runfiles_xzqp1c9m/runfiles/clouddq/clouddq/classes/rule_type.py", line 96, in to_sql_custom_sql_expr
    if params.get("rule_binding_arguments", {}).get(argument, None) is None:
AttributeError: 'list' object has no attribute 'get'

use more standardised schema validation

CloudDQ currently parses yaml into Python dictionaries and validates the expected fields in the Python classes directly.

This was done to 1) minimize dependencies and 2) allow more meaningful error messages. The downside is that we need to write a lot more code to do the same thing, and we can't take advantage of native schema-validation capabilities in JSON Schema or cross-platform support in protobuf.

We may want to consider using pydantic to manage the schema definitions instead.

clouddq-as-dataproc-workflow-composer-dag.md

Hi Team,

I tried to execute the CloudDQ utility using a Dataproc workflow template, but got an error during template execution.

Clouddq version = 1.0.7
Dataproc os version = Debian 11
Executable file version = 1.0.7_debian11_python3.9

Got the error below:
cannot import name 'WKBWriter' from 'shapely.geos'

Note: I did this in a GCP trial account with a custom service account.

add 'failed_values' column to dq_summary table

When using CloudDQ, I felt the need to check the actual failing values in dq_summary.
So I would like your opinion on adding a 'failed_values' column to dq_summary.

We only need a few lines to add failed_values, as shown in the ARRAY(...) AS failed_values expression below.

main.sql

~
        null_count,
        null_percentage,
        ARRAY(
            SELECT column_value 
            FROM {{ ref(entity_dq_statistics_model) }} 
            WHERE column_value IS NOT NULL AND simple_rule_row_is_valid is False LIMIT 500
         ) AS failed_values
    FROM
        {{ ref(entity_dq_statistics_model) }}
~

When following USERMANUAL.md I get this error: TypeError: __init__() got an unexpected keyword argument 'unbound_message'

I am using Cloud Shell, and when I try to run the first command with CloudDQ I get this:

$ python3 clouddq_executable.zip --help
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/clouddq/clouddq/main.py", line 32, in <module>
    from clouddq.integration.bigquery.dq_target_table_utils import TargetTable
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/clouddq/clouddq/integration/bigquery/dq_target_table_utils.py", line 24, in <module>
    from clouddq.log import JsonEncoderDatetime
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/clouddq/clouddq/log.py", line 23, in <module>
    from google.cloud.logging.handlers import CloudLoggingHandler
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging/__init__.py", line 18, in <module>
    from google.cloud.logging_v2 import __version__
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/__init__.py", line 25, in <module>
    from google.cloud.logging_v2.client import Client
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/client.py", line 37, in <module>
    from google.cloud.logging_v2.handlers import CloudLoggingHandler
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/handlers/__init__.py", line 17, in <module>
    from google.cloud.logging_v2.handlers.app_engine import AppEngineHandler
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/handlers/app_engine.py", line 24, in <module>
    from google.cloud.logging_v2.handlers._helpers import get_request_data
  File "/tmp/Bazel.runfiles__m9_ww0t/runfiles/py_deps/pypi__google_cloud_logging/google/cloud/logging_v2/handlers/_helpers.py", line 22, in <module>
    import flask
  File "/usr/local/lib/python3.9/dist-packages/flask/__init__.py", line 4, in <module>
    from . import json as json
  File "/usr/local/lib/python3.9/dist-packages/flask/json/__init__.py", line 8, in <module>
    from ..globals import current_app
  File "/usr/local/lib/python3.9/dist-packages/flask/globals.py", line 56, in <module>
    app_ctx: "AppContext" = LocalProxy(  # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'

I tried with versions 1.0.0 and 1.0.7.

Cannot locate dbt_extractor

To have a setup that can run on my Mac, I'm building a Debian 11-based Dockerfile containing the clouddq_executable.
However, when executing the zip file, I'm encountering the following error:

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/clouddq/clouddq/main.py", line 39, in <module>
    from clouddq.runners.dbt.dbt_runner import DbtRunner
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/clouddq/clouddq/runners/dbt/dbt_runner.py", line 26, in <module>
    from clouddq.runners.dbt.dbt_utils import run_dbt
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/clouddq/clouddq/runners/dbt/dbt_utils.py", line 25, in <module>
    from dbt.main import main as dbt
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/main.py", line 18, in <module>
    import dbt.task.build as build_task
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/build.py", line 1, in <module>
    from .run import RunTask, ModelRunner as run_model_runner
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/run.py", line 8, in <module>
    from .compile import CompileRunner, CompileTask
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/compile.py", line 3, in <module>
    from .runnable import GraphRunnableTask
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/task/runnable.py", line 54, in <module>
    from dbt.parser.manifest import ManifestLoader
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/parser/__init__.py", line 8, in <module>
    from .models import ModelParser  # noqa
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_core/dbt/parser/models.py", line 17, in <module>
    from dbt_extractor import ExtractionError, py_extract_from_source  # type: ignore
  File "/tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_extractor/dbt_extractor/__init__.py", line 1, in <module>
    from .dbt_extractor import *
ImportError: /tmp/Bazel.runfiles_24wvewkh/runfiles/py_deps/pypi__dbt_extractor/dbt_extractor/dbt_extractor.abi3.so: cannot open shared object file: No such file or directory

My Dockerfile looks as follows:

ARG PYTHON_VERSION=3.9.7
ARG TARGET_PYTHON_INTERPRETER=3.9
FROM python:${PYTHON_VERSION}-slim-bullseye

ARG TARGET_OS=debian_11
ARG TARGET_PYTHON_INTERPRETER
ARG CLOUDDQ_RELEASE_VERSION=1.0.3

# Replace http with https
RUN sed -i 's/http/https/g' /etc/apt/sources.list
RUN apt-get update -y && apt-get upgrade -y && apt-get -y install wget

RUN wget -O \
    clouddq_executable.zip \
    https://github.com/GoogleCloudPlatform/cloud-data-quality/releases/download/v${CLOUDDQ_RELEASE_VERSION}/clouddq_executable_v${CLOUDDQ_RELEASE_VERSION}_${TARGET_OS}_python${TARGET_PYTHON_INTERPRETER}.zip

ENTRYPOINT [ "python3", "clouddq_executable.zip", "--help" ]

Any help or pointers on why the dbt_extractor.abi3.so module can't seem to be located are greatly appreciated.
