
Comments (2)

MattTriano commented on June 11, 2024

I confirmed in some simplified testing (see the code block below for the test apparatus) that null values cause the WHERE NOT EXISTS subquery to return records identical to existing ones (when those records contain nulls), because NULL = NULL evaluates to NULL rather than true. Following this finding, I wrote a little string-formatter function in Python

import pandas as pd

def format_cols_for_where_not_exists_dbt(df: pd.DataFrame) -> None:
    """Print a null-safe equality condition pair for each column in df."""
    for col in df.columns:
        print(f"          AND ((new.{col} = old.{col}) OR")
        print(f"               (new.{col} IS NULL AND old.{col} IS NULL))")

and updated all of my dbt SQL files with the null-catching conditions (cached in the commit with this example). Running it with the homicide and shooting data (~55k records) was quick enough to be unremarkable, but attempting it again on a table with 2.5M rows took over 30 minutes before I interrupted it.

Maybe the answer is another row_number() over (partition by <all_col_names>) as rn, then select * from <that_cte> where rn = 1 pattern, but I want to see the performance and behavior of EXCEPT first before giving up on a select * from the data_raw table UNION ALL [records not identical to a row in the data_raw table] approach.
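As a quick check of EXCEPT's null behavior (a sketch using Python's bundled sqlite3; SQLite and Postgres share the same set-operation semantics here), EXCEPT compares rows with DISTINCT-style logic, so matching nulls do not produce spurious "new" records:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products1 (product_no INTEGER, name TEXT, price NUMERIC);
    CREATE TABLE products2 (product_no INTEGER, name TEXT, price NUMERIC);
    -- the HAM row (with a NULL price) exists identically in both tables
    INSERT INTO products1 VALUES (2, 'HAM', NULL), (3, 'Potato', 0.79);
    INSERT INTO products2 VALUES (2, 'HAM', NULL);
""")
new_rows = conn.execute(
    "SELECT * FROM products1 EXCEPT SELECT * FROM products2"
).fetchall()
# EXCEPT treats the matching NULLs as equal, so only the Potato row is "new"
print(new_rows)
```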

DROP TABLE IF EXISTS data_raw.products1;
DROP TABLE IF EXISTS data_raw.products2;
CREATE TABLE IF NOT EXISTS data_raw.products1 (
    product_no integer,
    name text,
    price numeric
);
CREATE TABLE IF NOT EXISTS data_raw.products2 (LIKE data_raw.products1);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (2, 'HAM', NULL);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (2, 'HAM', NULL);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (3, 'Potato', 0.79);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (4, 'Pancake', 3.56);

WITH new_or_updated_records_in_current_pull AS (
  SELECT product_no, name, price
  FROM data_raw.products1 new
  WHERE NOT EXISTS (
    SELECT 1
    FROM data_raw.products2 old
    WHERE (new.product_no = old.product_no OR
           (new.product_no IS NULL AND old.product_no IS NULL))
      AND (new.name = old.name OR
           (new.name IS NULL AND old.name IS NULL))
      AND (new.price = old.price OR
           (new.price IS NULL AND old.price IS NULL))
  )
),
new_or_updated_records_in_current_pull_nulls_not_handled AS (
  SELECT product_no, name, price
  FROM data_raw.products1 new
  WHERE NOT EXISTS (
    SELECT 1
    FROM data_raw.products2 old
    WHERE new.product_no = old.product_no
      AND new.name = old.name
      AND new.price = old.price
  )
),
data_raw_table_with_new_and_updated_records AS (
  SELECT * 
  FROM data_raw.products2
    UNION ALL
  SELECT *
  FROM new_or_updated_records_in_current_pull
)

SELECT * 
FROM data_raw_table_with_new_and_updated_records

1  "Cheese"   9.99
2  "HAM"      NULL
4  "Pancake"  3.56
3  "Potato"   0.79

-- reusing the WITH clause above, but selecting from the nulls-not-handled CTE:
data_raw_table_with_new_and_updated_records_nulls_not_handled AS (
  SELECT * 
  FROM data_raw.products2
    UNION ALL
  SELECT *
  FROM new_or_updated_records_in_current_pull_nulls_not_handled
)

SELECT * 
FROM data_raw_table_with_new_and_updated_records_nulls_not_handled

1  "Cheese"   9.99
2  "HAM"      NULL
4  "Pancake"  3.56
2  "HAM"      NULL
3  "Potato"   0.79

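The duplicated HAM record in the nulls-not-handled output traces back to NULL = NULL evaluating to NULL rather than true, so NOT EXISTS never finds a match for that row. A minimal illustration via Python's bundled sqlite3 (SQLite follows the same three-valued logic as Postgres here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NULL = NULL evaluates to NULL, which is not true, so WHERE rejects the row.
eq = conn.execute("SELECT COUNT(*) FROM (SELECT 1) WHERE NULL = NULL").fetchone()[0]
# NULL IS NULL evaluates to true, so the explicit check keeps the row.
is_null = conn.execute("SELECT COUNT(*) FROM (SELECT 1) WHERE NULL IS NULL").fetchone()[0]
print(eq, is_null)  # 0 1
```

In Postgres specifically, each `(a = b OR (a IS NULL AND b IS NULL))` pair can also be collapsed into a single null-safe `a IS NOT DISTINCT FROM b` comparison.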
from analytics_data_where_house.

MattTriano commented on June 11, 2024

Ultimately, I gave up on the EXCEPT pattern pretty quickly: I'd have to exclude the source_data_updated and ingestion_check_time columns to identify distinct records via EXCEPT, I could only think of cumbersome strategies for adding those columns back on correctly, and I thought of a much simpler strategy. I just added a column (named retention_policy, to indicate priority in ordering) to both the persistent data_raw table (assigned value 1) and the "temp_" table (assigned value 2), then added a row_number() over (partition by <columns> order by retention_policy) as rn deduplication step after unioning the existing records and temp records. This ensures that if the new pull has a record identical to one in the persisting table (ignoring the columns I added), the prior record will be retained, while new/updated records won't fall into any existing partition group (so rn = 1 for them as well).
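A sketch of that retention_policy deduplication (again via Python's bundled sqlite3, which needs SQLite ≥ 3.25 for window functions; the table and values here are illustrative, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unioned (product_no INTEGER, name TEXT, price NUMERIC,
                          retention_policy INTEGER);
    -- rows already in the persistent data_raw table get priority 1
    INSERT INTO unioned VALUES (1, 'Cheese', 9.99, 1), (2, 'HAM', NULL, 1);
    -- rows from the fresh "temp_" pull get priority 2; HAM duplicates a record
    INSERT INTO unioned VALUES (2, 'HAM', NULL, 2), (4, 'Pancake', 3.56, 2);
""")
kept = conn.execute("""
    WITH numbered AS (
      SELECT *, ROW_NUMBER() OVER (
        PARTITION BY product_no, name, price
        ORDER BY retention_policy
      ) AS rn
      FROM unioned
    )
    SELECT product_no, name, price, retention_policy FROM numbered WHERE rn = 1
""").fetchall()
print(sorted(kept))
```

Because PARTITION BY groups nulls together (unlike =), the duplicated HAM row collapses without any per-column null-handling boilerplate, and ORDER BY retention_policy guarantees the pre-existing copy is the one that survives.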

Also, here's the function for formatting table column names for copy-pasting into the top of one of these dbt scripts.

import textwrap

import pandas as pd

def format_cols_as_jinja_variable_for_dbt_script(df: pd.DataFrame) -> None:
    """Print the table's columns as jinja set-variables for the top of a dbt script."""
    metadata_columns = ["source_data_updated", "ingestion_check_time"]
    source_cols_str = '"' + '", "'.join([c for c in df.columns if c not in metadata_columns]) + '"'
    print("{% set source_cols = [")
    for line in textwrap.wrap(source_cols_str, width=96, break_long_words=False):
        print(f"    {line}")
    print("] %}")
    print('{% set metadata_cols = ["source_data_updated", "ingestion_check_time"] %}')
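For reference, on a frame with the three product columns plus the two metadata columns (repeating the function so the snippet runs standalone), it prints a paste-ready jinja block:

```python
import textwrap

import pandas as pd

def format_cols_as_jinja_variable_for_dbt_script(df: pd.DataFrame) -> None:
    # Emit the non-metadata columns as a jinja set-variable, wrapped at 96 chars.
    metadata_columns = ["source_data_updated", "ingestion_check_time"]
    source_cols_str = '"' + '", "'.join([c for c in df.columns if c not in metadata_columns]) + '"'
    print("{% set source_cols = [")
    for line in textwrap.wrap(source_cols_str, width=96, break_long_words=False):
        print(f"    {line}")
    print("] %}")
    print('{% set metadata_cols = ["source_data_updated", "ingestion_check_time"] %}')

df = pd.DataFrame(columns=["product_no", "name", "price",
                           "source_data_updated", "ingestion_check_time"])
format_cols_as_jinja_variable_for_dbt_script(df)
```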

