Comments (2)
I confirmed in some simplified testing (see the code block below for the test apparatus) that null values defeat the equality checks in the WHERE NOT EXISTS subquery, so records identical to existing ones (but containing nulls) are still returned as "new". Following this finding, I wrote a little string-formatter function in Python
import pandas as pd

def format_cols_for_where_not_exists_dbt(df: pd.DataFrame) -> None:
    for col in df.columns:
        print(f"    AND ((new.{col} = old.{col}) OR")
        print(f"         (new.{col} IS NULL AND old.{col} IS NULL))")
and updated all of my dbt SQL files with the null-catching conditions (cached in the commit with this example). Running it with the homicide and shooting data (~55k records) was quick enough to be unremarkable, but attempting it again on a table with 2.5M rows took over 30 minutes before I interrupted it.
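Side note on the generated conditions: Postgres also has a built-in null-safe comparison, IS NOT DISTINCT FROM, which treats two NULLs as equal. A hypothetical variant of the formatter could emit one condition per column instead of the OR pair (whether it helps on the 2.5M-row table would need benchmarking; IS NOT DISTINCT FROM may not use indexes the way plain = does):

```python
def format_cols_null_safe_dbt(columns) -> None:
    # Hypothetical variant of the formatter above: emits Postgres's
    # null-safe IS NOT DISTINCT FROM instead of the two-clause OR.
    for col in columns:
        print(f"    AND new.{col} IS NOT DISTINCT FROM old.{col}")

format_cols_null_safe_dbt(["product_no", "name", "price"])
```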
Maybe the answer is another row_number() over (partition by <all_col_names>) as rn, then select * from <that_cte> where rn = 1 pattern, but I want to see the performance and behavior of EXCEPT first before giving up on a select * from data_raw_table union all select [records not identical to a row in data_raw_table] approach.
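One point in EXCEPT's favor: SQL set operations compare NULLs as equal when eliminating duplicates, so EXCEPT doesn't have the NULL problem described above. A minimal check, using stdlib sqlite3 as a stand-in for Postgres (both follow the standard here; table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE new_pull (product_no INT, name TEXT, price REAL);
    CREATE TABLE existing (product_no INT, name TEXT, price REAL);
    INSERT INTO new_pull VALUES (2, 'HAM', NULL), (4, 'Pancake', 3.56);
    INSERT INTO existing VALUES (2, 'HAM', NULL);
""")
# EXCEPT drops the (2, 'HAM', NULL) row even though NULL = NULL is not
# true in a WHERE clause; only the genuinely new record survives.
rows = conn.execute(
    "SELECT * FROM new_pull EXCEPT SELECT * FROM existing"
).fetchall()
print(rows)
```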
DROP TABLE IF EXISTS data_raw.products1;
DROP TABLE IF EXISTS data_raw.products2;
CREATE TABLE IF NOT EXISTS data_raw.products1 (
    product_no integer,
    name text,
    price numeric
);
CREATE TABLE IF NOT EXISTS data_raw.products2 (LIKE data_raw.products1);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (2, 'HAM', NULL);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (2, 'HAM', NULL);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (3, 'Potato', 0.79);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (4, 'Pancake', 3.56);
WITH new_or_updated_records_in_current_pull AS (
    SELECT product_no, name, price
    FROM data_raw.products1 new
    WHERE NOT EXISTS (
        SELECT
        FROM data_raw.products2 old
        WHERE (new.product_no = old.product_no OR
               (new.product_no IS NULL AND old.product_no IS NULL)
        ) AND (new.name = old.name OR
               (new.name IS NULL AND old.name IS NULL)
        ) AND (new.price = old.price OR
               (new.price IS NULL AND old.price IS NULL)
        )
    )
),
new_or_updated_records_in_current_pull_nulls_not_handled AS (
    SELECT product_no, name, price
    FROM data_raw.products1 new
    WHERE NOT EXISTS (
        SELECT
        FROM data_raw.products2 old
        WHERE new.product_no = old.product_no AND
              new.name = old.name AND
              new.price = old.price
    )
),
data_raw_table_with_new_and_updated_records AS (
    SELECT *
    FROM data_raw.products2
    UNION ALL
    SELECT *
    FROM new_or_updated_records_in_current_pull
)
SELECT *
FROM data_raw_table_with_new_and_updated_records
product_no | name      | price
-----------+-----------+------
         1 | "Cheese"  |  9.99
         2 | "HAM"     |
         4 | "Pancake" |  3.56
         3 | "Potato"  |  0.79
-- Replacing the final CTE with the nulls-not-handled variant:
data_raw_table_with_new_and_updated_records_nulls_not_handled AS (
    SELECT *
    FROM data_raw.products2
    UNION ALL
    SELECT *
    FROM new_or_updated_records_in_current_pull_nulls_not_handled
)
SELECT *
FROM data_raw_table_with_new_and_updated_records_nulls_not_handled
product_no | name      | price
-----------+-----------+------
         1 | "Cheese"  |  9.99
         2 | "HAM"     |
         4 | "Pancake" |  3.56
         2 | "HAM"     |
         3 | "Potato"  |  0.79

Note the duplicated "HAM" row: its NULL price defeats the equality check, so it is re-inserted as if it were new.
from analytics_data_where_house.
Ultimately, I gave up on an EXCEPT pattern pretty quickly: I'd have to exclude the source_data_updated and ingestion_check_time columns to identify distinct records via EXCEPT, I could only think of cumbersome strategies for adding those columns back on correctly, and I thought of a better, much simpler strategy. I just added a column (named retention_policy, to indicate priority in ordering) to both the persistent data_raw table (assigned value: 1) and the "temp_" table (assigned value: 2), and added a row_number() over (partition by <columns> order by retention_policy) as rn deduplication step after unioning the existing records and temp records. This ensures that if the new pull has a record identical to one in the persisting table (excluding the columns I added), the prior record will be retained, while new/updated records won't fall into any existing partition group (so rn = 1).
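The retention_policy deduplication described above can be sketched end to end with stdlib sqlite3 standing in for Postgres (window-function syntax is the same; table and column names are illustrative, not the actual warehouse schema):

```python
import sqlite3

# Requires SQLite >= 3.25 for window functions (bundled with Python 3.8+).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unioned (product_no INT, name TEXT, price REAL, retention_policy INT);
    -- existing records in the persistent table (retention_policy = 1)
    INSERT INTO unioned VALUES (1, 'Cheese', 9.99, 1), (2, 'HAM', NULL, 1);
    -- new pull via the "temp_" table (retention_policy = 2):
    -- one duplicate of an existing record, one genuinely new record
    INSERT INTO unioned VALUES (2, 'HAM', NULL, 2), (4, 'Pancake', 3.56, 2);
""")
# PARTITION BY groups NULLs together (like GROUP BY), so the duplicate
# HAM row lands in the same partition as the existing one and gets rn = 2;
# ordering by retention_policy means the prior record wins.
rows = conn.execute("""
    SELECT product_no, name, price, retention_policy
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY product_no, name, price
                   ORDER BY retention_policy
               ) AS rn
        FROM unioned
    )
    WHERE rn = 1
    ORDER BY product_no
""").fetchall()
for row in rows:
    print(row)
```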
Also, here's the function for formatting table column names for copy+paste into the top of one of these dbt scripts.
import textwrap
import pandas as pd

def format_cols_as_jinja_variable_for_dbt_script(df: pd.DataFrame) -> None:
    metadata_columns = ["source_data_updated", "ingestion_check_time"]
    source_cols_str = '"' + '", "'.join([c for c in df.columns if c not in metadata_columns]) + '"'
    print("{% set source_cols = [")
    lines = textwrap.wrap(source_cols_str, width=96, break_long_words=False)
    for line in lines:
        print(f"    {line}")
    print("] %}")
    print('{% set metadata_cols = ["source_data_updated", "ingestion_check_time"] %}')
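For reference, here is what the formatter prints for a small column set. Only df.columns is read, so this self-contained demo substitutes a SimpleNamespace for a pandas DataFrame (the column names are illustrative):

```python
import textwrap
from types import SimpleNamespace

def format_cols_as_jinja_variable_for_dbt_script(df) -> None:
    # repeated from above so this demo runs standalone;
    # any object with a .columns list of strings works
    metadata_columns = ["source_data_updated", "ingestion_check_time"]
    source_cols_str = '"' + '", "'.join([c for c in df.columns if c not in metadata_columns]) + '"'
    print("{% set source_cols = [")
    for line in textwrap.wrap(source_cols_str, width=96, break_long_words=False):
        print(f"    {line}")
    print("] %}")
    print('{% set metadata_cols = ["source_data_updated", "ingestion_check_time"] %}')

fake_df = SimpleNamespace(
    columns=["product_no", "name", "price", "source_data_updated", "ingestion_check_time"]
)
format_cols_as_jinja_variable_for_dbt_script(fake_df)
```

This prints a `{% set source_cols = [...] %}` block listing only the source columns, followed by the fixed `metadata_cols` line, ready to paste into the top of a dbt script.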