
data-diff's Introduction


data-diff: Compare datasets fast, within or across SQL databases



Join our live virtual lab series to learn how to set it up!

What's a Data Diff?

A data diff is the value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.

There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.

data-diff OSS & Datafold Cloud

data-diff is an open source utility for running stateless diffs, designed as a great single-player experience.

Scale up with Datafold Cloud to make data diffing a company-wide practice: it supercharges the data-diff CLI experience (e.g., data-diff --dbt --cloud) and lets you run diffs in your CI process and within the Datafold UI. This includes column-level lineage with BI tool integrations, CI testing, faster cross-database diffing, and diff history.

Use Cases

Data Development Testing

When developing SQL code, data-diff helps you validate and preview changes by comparing data between development/staging environments and production. Here's how it works:

  1. Make a change to your SQL code
  2. Run the SQL code to create a new dataset
  3. Compare this dataset with its production version or other iterations (see the sketch below)
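
For step 3, here is a minimal sketch using data-diff's Python API (connect_to_table and diff_tables are the functions referenced later in this README; the connection URI, table names, and exact signatures are illustrative and vary by version):

# Compare a development table against its production counterpart in the same
# Postgres database, using "id" as the primary key column (all names made up).
from data_diff import connect_to_table, diff_tables

dev_table = connect_to_table("postgresql://user:password@localhost:5432/analytics", "dev_orders", "id")
prod_table = connect_to_table("postgresql://user:password@localhost:5432/analytics", "prod_orders", "id")

# Each difference is a (sign, row) tuple: "-" means the row is only in the
# first table, "+" means it is only in the second.
for sign, row in diff_tables(dev_table, prod_table):
    print(sign, row)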

Data Migration & Replication Testing

data-diff is a powerful tool for comparing data when you're moving it between systems. Use it to ensure data accuracy and identify discrepancies during tasks like:

  • Migrating to a new data warehouse (e.g., Oracle -> Snowflake)
  • Validating SQL transformations from legacy solutions (e.g., stored procedures) to new transformation frameworks (e.g., dbt)
  • Continuously replicating data from an OLTP database to an OLAP data warehouse (e.g., MySQL -> Redshift)

dbt Integration


data-diff integrates with dbt Core to seamlessly compare local development to production datasets.

Learn more about how data-diff works with dbt:

Getting Started

⚡ Validating dbt model changes between dev and prod

Looking to use data-diff in dbt development?

Development testing with Datafold enables you to see the impact of dbt code changes on data as you write the code, whether in your IDE or CLI.

Head over to our data-diff + dbt documentation to get started with a development testing workflow!

🔀 Compare data tables between databases

  1. Install data-diff with adapters

To compare data between databases, install data-diff with specific database adapters. For example, install it for PostgreSQL and Snowflake like this:

pip install data-diff 'data-diff[postgresql,snowflake]' -U

Additionally, you can install all supported open source database adapters as follows.

pip install data-diff 'data-diff[all-dbs]' -U

  2. Run data-diff with connection URIs

Then, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:

data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  <table> \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  <TABLE> \
  -k <primary key column> \
  -c <columns to compare> \
  -w <filter condition>

  3. Set up your configuration

You can use a TOML configuration file to run your data-diff job. In this example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:

## DATABASE CONNECTION ##
[database.duckdb_connection]
  driver = "duckdb"
  # filepath = "datafold_demo.duckdb" # local duckdb file example
  # filepath = "md:" # default motherduck connection example
  filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection

[database.snowflake_connection]
  driver = "snowflake"
  database = "DEV"
  user = "sung"
  password = "${SNOWFLAKE_PASSWORD}" # or "<PASSWORD_STRING>"
  # the info below is only required for snowflake
  account = "${ACCOUNT}" # by33919
  schema = "DEVELOPMENT"
  warehouse = "DEMO"
  role = "DEMO_ROLE"

## RUN PARAMETERS ##
[run.default]
  verbose = true

## EXAMPLE DATA DIFF JOB ##
[run.demo_xdb_diff]
  # Source 1 ("left")
  1.database = "duckdb_connection"
  1.table = "development.raw_orders"

  # Source 2 ("right")
  2.database = "snowflake_connection"
  2.table = "RAW_ORDERS" # note that snowflake table names are case-sensitive

  verbose = false

  4. Run your data-diff job

Make sure to export the relevant environment variables as needed. For example, here we compare data based on the configuration above:

# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>

# run the configured data-diff job
data-diff --conf datadiff.toml \
  --run demo_xdb_diff \
  -k "id" \
  -c status

# output example
- 1, completed
+ 1, returned

  5. Review the output

After running your data-diff job, review the output to identify and analyze differences in your data.

Check out the documentation for the full command reference.

Supported databases

| Database | Status | Connection string |
| --- | --- | --- |
| PostgreSQL >=10 | 🟢 | postgresql://<user>:<password>@<host>:5432/<database> |
| MySQL | 🟢 | mysql://<user>:<password>@<hostname>:3306/<database> |
| Snowflake | 🟢 | "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]" |
| BigQuery | 🟢 | bigquery://<project>/<dataset> |
| Redshift | 🟢 | redshift://<username>:<password>@<hostname>:5439/<database> |
| DuckDB | 🟢 | duckdb://<filepath> |
| MotherDuck | 🟢 | duckdb://<filepath> |
| Microsoft SQL Server* | 🟢 | mssql://<user>:<password>@<host>/<database>/<schema> |
| Oracle | 🟡 | oracle://<username>:<password>@<hostname>/<service_or_sid> |
| Presto | 🟡 | presto://<username>:<password>@<hostname>:8080/<database> |
| Databricks | 🟡 | databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema> |
| Trino | 🟡 | trino://<username>:<password>@<hostname>:8080/<database> |
| ClickHouse | 🟡 | clickhouse://<username>:<password>@<hostname>:9000/<database> |
| Vertica | 🟡 | vertica://<username>:<password>@<hostname>:5433/<database> |

*MS SQL Server support is limited, with known performance issues that are addressed in Datafold Cloud.

  • 🟢: Implemented and thoroughly tested.
  • 🟡: Implemented, but not thoroughly tested yet.

Your database not listed here?


How it works

data-diff efficiently compares data using two modes:

joindiff: Ideal for comparing data within the same database, utilizing outer joins for efficient row comparisons. It relies on the database engine for computation and has consistent performance.

hashdiff: Recommended for comparing datasets across different databases or large tables with minimal differences. It uses hashing and binary search, capable of diffing data across distinct database engines.

More about joindiff and hashdiff:

joindiff

  • Recommended for comparing data within the same database
  • Uses the outer join operation to diff the rows as efficiently as possible within the same database
  • Fully relies on the underlying database engine for computation
  • Requires both datasets to be queryable with a single SQL query
  • Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset

hashdiff

  • Recommended for comparing datasets across different databases
  • Can also be helpful in diffing very large tables with few expected differences within the same database
  • Employs a divide-and-conquer algorithm based on hashing and binary search
  • Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
  • Time complexity approximates COUNT(*) operation when there are few differences
  • Performance degrades when datasets have a large number of differences
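
The following is illustrative pseudocode of the hashdiff idea only (a toy in-memory version, not data-diff's actual implementation): split the primary-key range into segments, compare one checksum per segment, and only recurse into, and eventually download, the segments whose checksums disagree.

import hashlib

def checksum(table, lo, hi):
    rows = [str((k, table[k])) for k in sorted(table) if lo <= k < hi]
    return hashlib.md5("".join(rows).encode()).hexdigest()

def hashdiff(t1, t2, lo, hi, factor=4, threshold=8):
    if checksum(t1, lo, hi) == checksum(t2, lo, hi):
        return []                                  # identical segment: one checksum each, no rows downloaded
    if hi - lo <= threshold:                       # small segment: "download" the rows and compare directly
        keys = {k for k in list(t1) + list(t2) if lo <= k < hi}
        out = []
        for k in sorted(keys):
            if t1.get(k) != t2.get(k):
                if k in t1:
                    out.append(("-", (k, t1[k])))
                if k in t2:
                    out.append(("+", (k, t2[k])))
        return out
    step = (hi - lo) // factor + 1                 # otherwise split into `factor` sub-segments and recurse
    return [d for s in range(lo, hi, step)
            for d in hashdiff(t1, t2, s, min(s + step, hi), factor, threshold)]

prod = {i: f"row {i}" for i in range(1, 101)}
dev = {**prod, 42: "row 42 (changed)"}
print(hashdiff(prod, dev, 1, 101))   # -> [('-', (42, 'row 42')), ('+', (42, 'row 42 (changed)'))]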

For detailed algorithm and performance insights, explore here, or head to our docs to learn more about how Datafold diffs data.

Contributors

We thank everyone who contributed so far!

We'd love to see your face here: Contributing Instructions


License

This project is licensed under the terms of the MIT License.


data-diff's Issues

Add support for Microsoft SQL Server 2016+

Hi All,

Very interested in this project, looks fantastic to me!
I see a significant and probably quite common use case here where developers and the business will want to be reassured of a successful migration from an old legacy system to a cloud platform such as Snowflake.

As a consultant, I see a huge number of clients on MS SQL Server, and this is the source of the data for their migration from on-prem to the cloud.

I'm excited to use this tool, but unfortunately it does not support most of the use cases I would want to use it for, due to the lack of MS SQL support. To this end, I would like to suggest support for this platform.

I'm also more than happy to contribute to this, and I'm wondering if anyone can point me in the right direction for contributing. I see the developer environment guide in the README, but what I mean is more specific information about creating a new adapter / adding support for a new database.

Thanks!

snowflake.connector.errors.ForbiddenError: 250001 (08001): Failed to connect to DB.

I keep getting the following error:

snowflake.connector.errors.ForbiddenError: 250001 (08001): Failed to connect to DB. Verify the account name is correct: fi45185.snowflakecomputing.com:443. HTTP 403: Forbidden

I've run this with Poetry and printed out the credentials I was passing in, and they all look correct. I've tried signing in with multiple different users and passwords and I am still getting the same error. I can't figure out what is going wrong. My account is fi45185.east-us-2.azure.snowflakecomputing.com; I am only inputting fi45185 as the account.

Support for NoSQL/document-based databases?

We do a lot of data ingestion/syncing from Firestore and MongoDB into BigQuery. Any plans in the roadmap for these kinds of DBs? I assume in order to support other database types, they would need to support running hashing algorithms (md5, sha) natively.

Type Error

I'm running data-diff on a Linux OS, specifically via Google Colab.

Here's my command, which is constructed following the example in your Example Command and Output section:

python -m data_diff "snowflake://$USER_1:$PASSWORD_1@$SNOWFLAKE_ACCOUNT/$DATAWAREHOUSE?database=$DATABASE&schema=$SCHEMA1" "same_table" "snowflake://$USER_2:$PASSWORD_2@$SNOWFLAKE_ACCOUNT/$DATAWAREHOUSE?database=$DATABASE&schema=$SCHEMA2" "same_table"

Here's the error message. Please help; I'm trying to set up a workflow in GitHub Actions, but I've run into a problem at this very first step.

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/data_diff/__main__.py", line 159, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/data_diff/__main__.py", line 132, in main
    diff_iter = differ.diff_tables(table1, table2)
  File "/usr/local/lib/python3.7/dist-packages/data_diff/diff_tables.py", line 208, in diff_tables
    end_key = max(maxs) + 1
TypeError: can only concatenate str (not "int") to str

BigQuery help URI

I know it's not fully tested yet, but it'd be nice to know what the URI format should look like for a BigQuery dataset on the README.

Materialize diff to database

We might want to add the ability to use the database drivers as an output target for the diff, so we can create diff_{src_table_name}_{dst_table_name} with all the differing rows. This would probably be especially useful for larger diffs, compared to just writing output, if you can't integrate with the Python API.

Color-coded JSON output from the CLI


The current output isn't easily parseable. It would be easier to parse if it were JSONL, with each line being a JSON object/array.

Something like:

["+", 312879, 3829, "hello world"]

Add support for Teradata

Hello guys!

First of all, let me thank you for this incredible package you've achieved. I find myself with this recurrent problem of trying to compare two "mirror tables" or tables that are being migrated.

I was going to ask if you've considered adding Teradata to the supported engines.
Do you need help with it? I'm not a professional coder, but it would be great to contribute to this project.

Greetings from Argentina!

Contributing guidelines

Most repositories contain either:

  1. A Contributing section in the README, or
  2. A Contributing.md file

It would be beneficial to have one here as well.

Unsupported type warnings

If there's a type we don't support, e.g. polygon, we should print a warning that we are coercing it to a string and that it may not work across databases.

Support Browser-based SSO in Snowflake connector

It would be great to support Browser-based SSO to connect to Snowflake.

When federated authentication is enabled for an account, Snowflake does not recommend maintaining user passwords in Snowflake, so using SSO is the only way to connect and authenticate with a Snowflake client.

Currently, if I try to connect with an external authenticator, I get the following error:

data-diff \
    "snowflake://<user>@<account>/<database>/<schema>?warehouse=<warehouse>&role=<role>&authenticator='externalbrowser'" \
    <table> \
    "snowflake://<user>@<account>/<database>/<schema>?warehouse=<warehouse>&role=<role>&authenticator='externalbrowser'" \
    <table>
snowflake.connector.errors.ProgrammingError: 251006: Password is empty

In other tools like dbt, this feature is supported by providing the authenticator option: https://docs.getdbt.com/reference/warehouse-profiles/snowflake-profile#sso-authentication

Oracle, BigQuery, Redshift on CI

Currently we test those locally, since they run in the cloud and, in Oracle and Redshift's case, require a bastion host to be accessible from our local machines. BigQuery also has its own config. We'd love to get those running in CI (#135). We have Snowflake, MySQL, Presto, and Postgres as of #135.

@cfernhout are you able to take this on? ❤️

--stats gives weird info message

I get the message:

Diff-Total: 3 changed rows out of None

Looking at the code, it looks like None should be the count of records in table 1. My test case is a really small table (only 3 rows), so I'm wondering if that's what's throwing things off.

Add benchmarking scripts for % of changed rows

Currently the benchmarking introduced in #135 checks two tables that are equal.

We'd love to add some tests where we delete/change an increasing % of rows, starting at just 1 row, to see how it behaves.

In particular, it would be useful to know at what threshold just pulling down all the rows would be faster.

We should generate some graphs to show where checksumming and downloading cross over in performance. That would be gold for the README.

Allow source and target to have different column names

You might have a column called money in one database and amount in another. Today there's no way to give the columns different names across the two databases. In the Python API, perhaps it's sufficient to just match them by position in the column tuple.

For the CLI, maybe we could use: -c amount:money.

Multiple columns key

I have a database with multiple columns as a primary key. Is there any way to specify multiple columns as a key?

Remove Preql scripts; rely on database_types.py instead

@erezsh wdyt about removing all the Preql scripts and just relying on database_types.py instead? We can add instructions on running a single test to seed your local database if you want to just run it locally.

I think this would cut down on the amount of testing code we have to maintain

Config Files

We should likely create a config file format to avoid having to create wrapper-scripts with the right bisection factor, threshold, concurrency, URIs, tables, etc.

Failing datetime rounding tests between Oracle <-> Snowflake

I ran all the tests with unittest-parallel with all adapters, and it looks like a few snuck through:

  • test_types_oracle_ts_y_tz_to_snowflake_ts3_50
  • test_types_oracle_ts9_y_tz_to_snowflake_ts3_50
  • test_types_oracle_ts6_y_tz_to_snowflake_ts3_50


Can reproduce with:

LOG_LEVEL=debug poetry run python3 -m unittest -k 'test_types_oracle_ts_y_tz_to_snowflake_ts3_50'

Specify schema name for Oracle

Trying to test a comparison to an Oracle database, but I cannot figure out how to inform data-diff which schema the table is in. It only seems to support specifying a table name and thus only tables belonging to the user I am logged in as.

Our DBAs do not let users log in using the owner of our production schemas (for good reason).

Let the differ choose a shared hashing algorithm

Right now we only support md5 for hashing columns.

However, for some databases that might not be feasible. For example, in MSSQL it's too slow, and it sounds like Spanner only supports sha1.

I think the solution should be to allow several implementations in subclasses of Database, and have the differ choose the best one that's shared between them (or throw an error if there isn't one).

See issues #51 and #99.
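
A rough sketch of that idea follows (class and attribute names are illustrative, not data-diff's actual API): each database subclass declares the hash functions it can compute, and the differ picks one that both sides share, or fails loudly.

class Database:
    SUPPORTED_HASHES = ()          # hash functions this engine can compute natively

class PostgreSQL(Database):
    SUPPORTED_HASHES = ("md5", "sha1")

class Spanner(Database):
    SUPPORTED_HASHES = ("sha1",)

def choose_hash(db1, db2):
    # keep db1's preference order so the "best" shared algorithm wins
    shared = [h for h in db1.SUPPORTED_HASHES if h in db2.SUPPORTED_HASHES]
    if not shared:
        raise ValueError("No hashing algorithm is supported by both databases")
    return shared[0]

print(choose_hash(PostgreSQL(), Spanner()))   # -> "sha1"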

MySQL driver querying information_schema with a table_schema value of the user instead of the database

The first queries that data-diff performs get the columns from the specified table; however, they use the user as the TABLE_SCHEMA instead of the database.

data-diff --debug mysql://root:root_password@mysql5:3306/coolname Transaction mysql://user:user_password/@mysql:3306/coolname Transaction
[11:33:45] DEBUG - Running SQL (MySQL): SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM information_schema.columns WHERE table_name = 'Transaction' AND table_schema = 'root'
[11:33:45] DEBUG - Running SQL (MySQL): SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM information_schema.columns WHERE table_name = 'Transaction' AND table_schema = 'user'

In MySQL's information schema, TABLE_SCHEMA is the database, not the user:
https://dev.mysql.com/doc/mysql-infoschema-excerpt/8.0/en/information-schema-columns-table.html

MySQL [(none)]> select TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, engine from information_schema.tables where table_name = 'Transaction'\G
*************************** 1. row ***************************
TABLE_SCHEMA: coolname
  TABLE_NAME: Transaction
  TABLE_TYPE: BASE TABLE
      ENGINE: InnoDB
1 row in set (0.003 sec)

This causes data-diff to fail to find the table and columns, so data-diff cannot function:

RuntimeError: MySQL: Table 'Transaction' does not exist, or has no columns

Versions:

root@22b19e3a116f:/# python3 --version
Python 3.7.3
root@22b19e3a116f:/# pip3 list
Package                Version
---------------------- ---------
asn1crypto             0.24.0
awscli                 1.16.113
botocore               1.12.103
certifi                2018.8.24
chardet                3.0.4
click                  8.1.3
colorama               0.4.5
commonmark             0.9.1
cryptography           2.6.1
data-diff              0.2.0
docutils               0.14
dsnparse               0.1.15
entrypoints            0.3
idna                   2.6
importlib-metadata     4.11.4
jmespath               0.9.4
keyring                17.1.1
keyrings.alt           3.1.1
mysql-connector-python 8.0.29
pip                    18.1
protobuf               4.21.1
pyasn1                 0.4.2
pycrypto               2.6.1
Pygments               2.12.0
PyGObject              3.30.4
python-dateutil        2.7.3
pyxdg                  0.25
PyYAML                 3.13
requests               2.21.0
rich                   10.16.2
roman                  2.0.0
rsa                    4.0
runtype                0.2.6
s3transfer             0.2.0
SecretStorage          2.3.1
setuptools             40.8.0
six                    1.12.0
typing-extensions      4.2.0
urllib3                1.24.1
wheel                  0.32.3
zipp                   3.8.0

TypeError: [TableDiffer] Attribute 'bisection_factor' expected value of type int. Instead got '32'

It works without --bisection-factor, but when I try to use it, e.g. data-diff db1 table1 db2 table2 --stats --bisection-factor=32, I get this error:

Traceback (most recent call last):
  File "/Users/dlyons/.pyenv/versions/3.9.4/bin/data-diff", line 8, in <module>
    sys.exit(main())
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/data_diff/__main__.py", line 87, in main
    return _main(**kw)
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/data_diff/__main__.py", line 177, in _main
    differ = TableDiffer(

  File "<string>", line 9, in __init__
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/runtype/dataclass.py", line 227, in __post_init__
    _post_init(self, config=config, should_cast=check_types == 'cast', sampler=sampler)
  File "/Users/dlyons/.pyenv/versions/3.9.4/lib/python3.9/site-packages/runtype/dataclass.py", line 119, in _post_init
    raise TypeError(msg)
TypeError: [TableDiffer] Attribute 'bisection_factor' expected value of type int. Instead got '32'

Documentation for how to write a driver

We need a section of the README we can point people to. Many things are documented inline, but a starting point would be nice, covering e.g.:

  • Threading
  • How to support types (rounding)
  • Writing tests (important)

Support Databases that don't have md5 or fast md5 (MSSQL)

I was browsing the list of supported databases in your README and noticed MSSQL is not listed, although it can be found in databases.py. Just wondering if this was intentional, and what the status of that connector is at the moment.

Cheers

key_column is overridden to use id

Hi,

I am currently trying to use the data_diff module programmatically in Python, as stated in the README.md. Somehow, even though I already set the key_column to the column I declare when calling the function, it forces me to use id as the key_column.

Here's the TableSegment output after I call the connect_to_table function:
TableSegment(database=<data_diff.databases.postgresql.PostgreSQL object at 0x1079f4c70>, table_path=('table_name',), key_column='date_key', update_column=None, extra_columns=(), min_key=None, max_key=None, min_update=None, max_update=None, case_sensitive=True, _schema=None)

But when I use the diff_tables function, it throws this error:
ValueError: Column 'id' not found in schema for table TableSegment(database=<data_diff.databases.postgresql.PostgreSQL object at 0x1079f4c70>, table_path=('table_name',), key_column='id', update_column=None, extra_columns=(), min_key=None, max_key=None, min_update=None, max_update=None, case_sensitive=True, _schema={})

Ideally, the key_column should align with the TableSegment I passed to the function, right?
The version I'm using is 0.2.3.

Thanks!

Start bisecting after getting min/max from first database

Some databases are awfully slow at getting min(id) and max(id) for a column when a WHERE clause is added to the mix. For example, this query in Snowflake takes 50s on a dataset with millions of rows:

Running SQL (Snowflake): SELECT min(id), max(id) FROM TRANSFERS WHERE ('2022-06-04T14:27:44.096619' <= created_at) AND (created_at < '2022-06-14T13:57:44.096658')

In Postgres, the same query takes a few millis.

It's going to be rare that they're different, so instead we can just use the fast one and start bisecting.

When the second, slower database returns its min and max, we compare them with the faster one's. If they're not the same, we'll warn and could restart the bisection. Alternatively, we can just extend the bisection to the new extremes if the min/max range has grown, so it's very graceful.

E.g. it should look like this:

Thread1, Time 00:00:00: Postgres: select min(id), max(id) from table
Thread2, Time 00:00:00: Snowflake: select min(id), max(id) from table
Thread1 Time 00:00:01: Postgres returns min=1, max=1000
... start bisecting ...
Thread2, Time 00:00:10: Snowflake returns min=1, max=1000
... continues bisecting because min + max are the same...

This should improve performance substantially on some platforms.
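
An illustrative sketch of that flow (not data-diff's code; query_min_max and bisect are hypothetical placeholders for the real queries and diffing logic):

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def query_min_max(db_name):
    # placeholder for: SELECT min(id), max(id) FROM table WHERE ...
    return (1, 1000)

def bisect(min_key, max_key):
    # placeholder for the actual segment-by-segment diffing
    print(f"bisecting keys {min_key}..{max_key}")

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(query_min_max, db) for db in ("postgres", "snowflake")]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)

    fast_min, fast_max = done.pop().result()
    bisect(fast_min, fast_max)                    # start immediately, don't wait for the slow database

    (slow,) = pending or done                     # whichever future is still outstanding (or finished later)
    slow_min, slow_max = slow.result()
    if (slow_min, slow_max) != (fast_min, fast_max):
        print("warning: min/max differ between databases; extending the diffed key range")
        bisect(min(fast_min, slow_min), max(fast_max, slow_max))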

Non-integer IDs

I have several tables that use GUIDs for their primary key. Can data-diff test these tables or only tables where the primary key is an integer?

Warn on missing indexes and/or problematic EXPLAINs

For row-based databases, if indexes are missing on the queried columns, the queries will be substantially slower on large tables (unless you're querying all columns).

Today, you can run with --interactive to see an EXPLAIN before each query. In --debug you can see each query, to run the EXPLAIN yourself.

However, we don't do anything beyond that. We should log a warning if we suspect we might be issuing queries that aren't optimal due to a lack of indexes. Either we can check in the schema that we have all the necessary indexes, or we can try to parse the EXPLAIN. I'm in favour of the schema; it's a little less real, but it means less maintenance per driver 😅

Of course, compound indexes will be better, but a bunch of individual indexes may be good enough on many databases that can scan multiple in parallel (an assumption worth verifying with at least Postgres and MySQL).
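
A rough sketch of the schema-based check (not actual data-diff behavior; the list of existing indexes would come from the engine's catalog, e.g. pg_indexes or information_schema, and is just an input here):

import logging

def warn_on_missing_indexes(existing_indexes, diff_columns):
    # existing_indexes: list of indexed column lists, e.g. [["id"], ["account_id", "created_at"]]
    indexed = {col for index in existing_indexes for col in index}
    missing = [col for col in diff_columns if col not in indexed]
    if missing:
        logging.warning(
            "No index found on column(s) %s; queries may be slow on large tables",
            ", ".join(missing),
        )

# Example: "id" is indexed but "updated_at" is not
warn_on_missing_indexes([["id"], ["account_id", "created_at"]], ["id", "updated_at"])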

Pass arbitrary WHERE

Users may have more complex --where conditions they want to pass, beyond what is offered by --min-age and --max-age.

Add support for DuckDB

DuckDB is an in-process database. You typically create it as a session, then discard it once you're done (though that's not the only way to use it).

https://duckdb.org

It's awesome for a few reasons that apply to data-diff. Namely, you can direct-query raw csv/txt/parquet files as though they were tables (e.g. select posting_date, count(*) as r_count from '/Users/me/data.csv' group by posting_date).
We use this ability to load PROD vs UAT files from our system to compare output. Being able to pass this across to data-diff would be incredible.

Whilst just being able to reference csv files in data-diff might be another option, doing this via DuckDB would allow you to perform some basic transformations on the way, such as renaming fields, selecting a reduced range, etc.
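
For illustration, this is roughly what that looks like with DuckDB's Python API (the file path, column names, and transformation are made up):

import duckdb

con = duckdb.connect()   # in-process, in-memory session

# Direct-query the raw CSV as if it were a table, renaming a field and
# restricting the date range on the way.
rows = con.execute(
    """
    SELECT order_id AS id, posting_date, amount
    FROM 'prod_extract.csv'
    WHERE posting_date >= DATE '2022-01-01'
    """
).fetchall()
print(rows[:5])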

Support Google Cloud Spanner

Motivation

I transfer Spanner tables to BigQuery using data engineering tools. I would like to make sure the data in BigQuery is exactly the same as in Spanner.

Dump indexes, schema, etc. with --debug

It would be helpful for debugging if we printed as much as we know about the two schemas we're querying (types, names, indexes, and so on) at the start of the --debug output. Maybe even --verbose.

Good lay-up for #78

Make data-diff faster when there are lots of differences

Today, one of the caveats of data-diff is that it's going to be significantly slower if you have a lot of differences, because we'll be checksumming so many segments repeatedly as we try to pin down the differing rows. I'm not exactly sure what the best solution is, but it likely entails a threshold of differences in earlier segments that causes us to increase the --bisection-threshold.

README should clarify that DB_URI arguments need to be surrounded in quotes when connecting to Snowflake

I figured I'd share a 🤦 experience I just had during my setup with Snowflake. I was having trouble getting data-diff to accept my DB1_URI and DB2_URI arguments. I eventually realized they had to be in quotes. One would not have known that from the README, which offers a postgres example that doesn't put the arguments in quotes (I assume quotes are not needed when using postgres?).

Here is what I encountered (I intentionally omitted arguments for brevity and RCA while debugging).

Without quotes:

% data-diff snowflake://me:mypw@myaccount/mydb/myschema?warehouse=mywh&role=myrole 
[1] 12578
zsh: no matches found: snowflake://me:mypw@myaccount/mydb/myschema?warehouse=mywh
% 
[1]  + exit 1     data-diff snowflake://me:mypw@myaccount/mydb/myschema?warehouse=mywh

(huh? did I enter my snowflake info wrong? scratches head)

With quotes:

% data-diff "snowflake://me:mypw@myaccount/mydb/myschema?warehouse=mywh&role=myrole"
Usage: data-diff [OPTIONS] DB1_URI TABLE1_NAME DB2_URI
                 TABLE2_NAME
Try 'data-diff --help' for help.

Error: Missing argument 'TABLE1_NAME'.

(nice! the error message helpfully tells me that I should enter the other arguments) 🆒

I think it would help a lot if, right under the postgres example, there was a similar example for Snowflake (and, imo, for each tool that is up and running), especially since the syntax evidently varies.

I don't think every example needs to be as verbose as the postgres example, which includes output, but capturing any variation in syntax seems crucial. I don't think it's reasonable to expect users to know that the syntax for arguments differs.

The Snowflake example would look like this to make it equivalent to the postgres example:

$ data-diff \
    "snowflake://<user>:<password>@<myaccount>/<database>/<schema>?warehouse=<warehouse>&role=<role>" RATING \
    "snowflake://<user>:<password>@<myaccount>/<database>/<schema>?warehouse=<warehouse>&role=<role>" RATING_DEL1 \
    --bisection-threshold 100000 \
    --bisection-factor 6 \
    --update-column timestamp \
    --verbose
# --bisection-threshold and --bisection-factor are shown for readability; try the defaults first

A couple of other notes:

  • I suggest adding carets (<>) to the postgres example, as I have in the Snowflake example, to make it clearer what should be subbed out by the user. It's not really necessary for clarity in the postgres example, but the Snowflake example gets pretty confusing without carets since the URI includes the word warehouse; and if there are going to be carets in the Snowflake example, there should also be carets in the postgres (and all other) examples. 🥕
  • Note that I've made table names all caps in my Snowflake example. Snowflake makes all table names uppercase unless hard-coded otherwise. I would argue it's kind of the Snowflake user's responsibility to understand this, BUT if there's going to be an example that includes table names, they might as well be uppercase since lowercase will not work 99% of the time.

Add --all-mutual-columns

In some cases, you might want to checksum everything, and instead of having to type out all the columns yourself, we should be able to get them from the schema.

--all-mutual-tables could be a nice addition at some point too...

Dump the config for debugging but with secrets redacted

As a follow-up to a discussion at #143 (comment).

Why? It is helpful for debugging when users share their logs along with their configs. However, configs contain database passwords, which should not be printed.

Problem 1: Printing the configs with passwords to stdout/stderr is considered a security leak in some organizations, requiring the leaked passwords to be changed and a security incident to be reported, e.g. if data-diff is executed in a remote environment (Docker containers, Kubernetes Jobs/Pods, etc.).

Problem 2: With the realistic database tests from #112, the passwords should have been defined as GitHub Actions secrets. GitHub automatically redacts the secrets if/when it can detect them, but still recommends that we do not print them in the first place.

Goal:

  • Print/log the used config with all secrets redacted, both from the config keys and from the URI.

Notes:

A malicious contributor can simply print() them (with modifications to work around GitHub's redacting), but GitHub usually requires clicking a button before running the Actions for external (non-trusted) contributors, which gives at least some protection.
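
A hedged sketch of the redaction itself (not data-diff's actual logging code): mask known secret keys in the config and mask the password component of any connection URI before printing.

import re

SECRET_KEYS = {"password", "motherduck_token", "access_token"}

def redact_uri(uri):
    # user:password@host  ->  user:**********@host
    return re.sub(r"(://[^:/@]+:)[^@]+(@)", r"\1**********\2", uri)

def redact_config(config):
    redacted = {}
    for key, value in config.items():
        if key.lower() in SECRET_KEYS:
            redacted[key] = "**********"
        elif isinstance(value, dict):
            redacted[key] = redact_config(value)
        elif isinstance(value, str):
            redacted[key] = redact_uri(value)
        else:
            redacted[key] = value
    return redacted

print(redact_config({
    "database1": "postgresql://user:s3cret@localhost:5432/db",
    "snowflake_connection": {"user": "sung", "password": "hunter2"},
}))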

Add support for DB2

Hi, do you have a plan in your roadmap to add support for DB2 databases?

Thanks, Carlos.

CSV insertions for database for better benchmarking

Currently, if you run the benchmarking scripts (see README and #135), they're very slow against the cloud databases. It would be better to use CSV imports for the cloud databases (Redshift, BigQuery, Oracle, Snowflake) by doing something similar to dev/_bq_import_csv.py in _insert_to_table.

It will work today for 100M rows, but it'll be very slow...

Snowflake/BigQuery parallelism

The drivers don't natively support multi-threading, so we need to create our own thread pool so that we can issue multiple checksum queries concurrently. This should improve their performance quite a bit.

#43 (comment)
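
An illustrative sketch of the thread-pool idea (run_checksum is a hypothetical stand-in for executing one checksum query through the driver):

from concurrent.futures import ThreadPoolExecutor

def run_checksum(segment):
    # placeholder for: SELECT <aggregate of md5(...)> FROM table WHERE id BETWEEN lo AND hi
    lo, hi = segment
    return f"checksum({lo}..{hi})"

segments = [(1, 1000), (1001, 2000), (2001, 3000), (3001, 4000)]

# Issue the checksum queries concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    checksums = list(pool.map(run_checksum, segments))

print(checksums)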

Types: Failing tests for Oracle timestamps w/ time zone against Snowflake precision=3

@erezsh I've uncovered a new set of incompatibility bugs. Unfortunately they don't show up in CI because it doesn't run Oracle yet.

On the test-snowflake-prod branch, these tests fail with rounding errors:

  • test_types_oracle_ts6_y_tz_snowflake_ts3_50
  • test_types_oracle_ts9_y_tz_snowflake_ts3_50
  • test_types_oracle_ts_y_tz_snowflake_ts3_50

Run with poetry run python -m unittest -k test_types_oracle_ts_y_tz_snowflake_ts3_50 with that branch checked out.

Sample error:

======================================================================
test_types_oracle_ts_y_tz_snowflake_ts3_50 (tests.test_database_types.TestDiffCrossDatabaseTables)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/data-diff-yPhKwkIo-py3.10/lib/python3.10/site-packages/parameterized/parameterized.py", line 533, in standalo
ne_func
    return func(*(a + p.args), **p.kwargs)
  File "/root/data-diff/tests/test_database_types.py", line 611, in test_types
    self.assertEqual(expected, diff)
AssertionError: Lists differ: [] != [('-', ('6', '2022-06-01 15:10:05.010000')[2287 chars]0'))]

Second list contains 52 additional elements.
First extra element 0:
('-', ('6', '2022-06-01 15:10:05.010000'))

- []
+ [('-', ('6', '2022-06-01 15:10:05.010000')),
+  ('+', ('6', '2022-06-01 15:10:05.009000')),
+  ('-', ('7', '2000-01-01 00:00:03.001000')),
+  ('+', ('7', '2000-01-01 00:00:03.000000')),
+  ('-', ('9', '2000-01-01 00:00:09.002000')),
+  ('+', ('9', '2000-01-01 00:00:09.001000')),
+  ('-', ('11', '2000-01-01 00:00:15.003000')),
+  ('+', ('11', '2000-01-01 00:00:15.002000')),
+  ('-', ('13', '2000-01-01 00:00:21.004000')),
+  ('+', ('13', '2000-01-01 00:00:21.003000')),
+  ('-', ('14', '2000-01-01 00:00:24.005000')),
