Comments (2)
I confirmed in some simplified testing (see the code block below for the test apparatus) that null values defeat the equality checks in the WHERE NOT EXISTS subquery, so records identical to existing ones (but containing nulls) are still returned as "new". Following this finding, I wrote a little string-formatter function in Python
import pandas as pd

def format_cols_for_where_not_exists_dbt(df: pd.DataFrame) -> None:
    for col in df.columns:
        print(f"    AND ((new.{col} = old.{col}) OR")
        print(f"         (new.{col} IS NULL AND old.{col} IS NULL))")
and updated all of my dbt SQL files with the null-catching conditions (cached in the commit with this example). Running it with the homicide and shooting data (~55k records) was quick enough to be unremarkable, but attempting it again on a table with 2.5M rows took over 30 minutes before I interrupted it.
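Side note on the generated conditions: Postgres also has a built-in null-safe comparison, IS NOT DISTINCT FROM, which treats two NULLs as equal. A hypothetical variant of the formatter could emit one condition per column instead of the OR pair (whether it helps on the 2.5M-row table would need benchmarking; IS NOT DISTINCT FROM may not use indexes the way plain = does):

```python
def format_cols_null_safe_dbt(columns) -> None:
    # Hypothetical variant of the formatter above: emits Postgres's
    # null-safe IS NOT DISTINCT FROM instead of the two-clause OR.
    for col in columns:
        print(f"    AND new.{col} IS NOT DISTINCT FROM old.{col}")

format_cols_null_safe_dbt(["product_no", "name", "price"])
```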
Maybe the answer is another row_number() over (partition by <all_col_names>) as rn, then select * from <that_cte> where rn = 1 pattern, but I want to see the performance and behavior of EXCEPT first before giving up on a select * from data_raw_table union all select [records not identical to a row in data_raw_table] approach.
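One point in EXCEPT's favor: SQL set operations compare NULLs as equal when eliminating duplicates, so EXCEPT doesn't have the NULL problem described above. A minimal check, using stdlib sqlite3 as a stand-in for Postgres (both follow the standard here; table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE new_pull (product_no INT, name TEXT, price REAL);
    CREATE TABLE existing (product_no INT, name TEXT, price REAL);
    INSERT INTO new_pull VALUES (2, 'HAM', NULL), (4, 'Pancake', 3.56);
    INSERT INTO existing VALUES (2, 'HAM', NULL);
""")
# EXCEPT drops the (2, 'HAM', NULL) row even though NULL = NULL is not
# true in a WHERE clause; only the genuinely new record survives.
rows = conn.execute(
    "SELECT * FROM new_pull EXCEPT SELECT * FROM existing"
).fetchall()
print(rows)
```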
DROP TABLE IF EXISTS data_raw.products1;
DROP TABLE IF EXISTS data_raw.products2;
CREATE TABLE IF NOT EXISTS data_raw.products1 (
    product_no integer,
    name text,
    price numeric
);
CREATE TABLE IF NOT EXISTS data_raw.products2 (LIKE data_raw.products1);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (2, 'HAM', NULL);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (2, 'HAM', NULL);
INSERT INTO data_raw.products1 (product_no, name, price) VALUES (3, 'Potato', 0.79);
INSERT INTO data_raw.products2 (product_no, name, price) VALUES (4, 'Pancake', 3.56);
WITH new_or_updated_records_in_current_pull AS (
    SELECT product_no, name, price
    FROM data_raw.products1 new
    WHERE NOT EXISTS (
        SELECT
        FROM data_raw.products2 old
        WHERE (new.product_no = old.product_no OR
               (new.product_no IS NULL AND old.product_no IS NULL)
        ) AND (new.name = old.name OR
               (new.name IS NULL AND old.name IS NULL)
        ) AND (new.price = old.price OR
               (new.price IS NULL AND old.price IS NULL)
        )
    )
),
new_or_updated_records_in_current_pull_nulls_not_handled AS (
    SELECT product_no, name, price
    FROM data_raw.products1 new
    WHERE NOT EXISTS (
        SELECT
        FROM data_raw.products2 old
        WHERE new.product_no = old.product_no AND
              new.name = old.name AND
              new.price = old.price
    )
),
data_raw_table_with_new_and_updated_records AS (
    SELECT *
    FROM data_raw.products2
    UNION ALL
    SELECT *
    FROM new_or_updated_records_in_current_pull
)
SELECT *
FROM data_raw_table_with_new_and_updated_records
product_no | name      | price
-----------+-----------+------
         1 | "Cheese"  |  9.99
         2 | "HAM"     |
         4 | "Pancake" |  3.56
         3 | "Potato"  |  0.79
-- Replacing the final CTE with the nulls-not-handled variant:
data_raw_table_with_new_and_updated_records_nulls_not_handled AS (
    SELECT *
    FROM data_raw.products2
    UNION ALL
    SELECT *
    FROM new_or_updated_records_in_current_pull_nulls_not_handled
)
SELECT *
FROM data_raw_table_with_new_and_updated_records_nulls_not_handled
product_no | name      | price
-----------+-----------+------
         1 | "Cheese"  |  9.99
         2 | "HAM"     |
         4 | "Pancake" |  3.56
         2 | "HAM"     |
         3 | "Potato"  |  0.79

Note the duplicated "HAM" row: its NULL price defeats the equality check, so it is re-inserted as if it were new.
from analytics_data_where_house.
Ultimately, I gave up on an EXCEPT pattern pretty quickly: I'd have to exclude the source_data_updated and ingestion_check_time columns to identify distinct records via EXCEPT, I could only think of cumbersome strategies for adding those columns back on correctly, and I thought of a better, much simpler strategy. I just added a column (named retention_policy, to indicate priority in ordering) to both the persistent data_raw table (assigned value: 1) and the "temp_" table (assigned value: 2), and added a row_number() over (partition by <columns> order by retention_policy) as rn deduplication step after unioning the existing records and temp records. This ensures that if the new pull has a record identical to one in the persisting table (excluding the columns I added), the prior record will be retained, while new/updated records won't fall into any existing partition group (so rn = 1).
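The retention_policy deduplication described above can be sketched end to end with stdlib sqlite3 standing in for Postgres (window-function syntax is the same; table and column names are illustrative, not the actual warehouse schema):

```python
import sqlite3

# Requires SQLite >= 3.25 for window functions (bundled with Python 3.8+).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unioned (product_no INT, name TEXT, price REAL, retention_policy INT);
    -- existing records in the persistent table (retention_policy = 1)
    INSERT INTO unioned VALUES (1, 'Cheese', 9.99, 1), (2, 'HAM', NULL, 1);
    -- new pull via the "temp_" table (retention_policy = 2):
    -- one duplicate of an existing record, one genuinely new record
    INSERT INTO unioned VALUES (2, 'HAM', NULL, 2), (4, 'Pancake', 3.56, 2);
""")
# PARTITION BY groups NULLs together (like GROUP BY), so the duplicate
# HAM row lands in the same partition as the existing one and gets rn = 2;
# ordering by retention_policy means the prior record wins.
rows = conn.execute("""
    SELECT product_no, name, price, retention_policy
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY product_no, name, price
                   ORDER BY retention_policy
               ) AS rn
        FROM unioned
    )
    WHERE rn = 1
    ORDER BY product_no
""").fetchall()
for row in rows:
    print(row)
```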
Also, here's the function for formatting table column names for copy+paste into the top of one of these dbt scripts.
import textwrap
import pandas as pd

def format_cols_as_jinja_variable_for_dbt_script(df: pd.DataFrame) -> None:
    metadata_columns = ["source_data_updated", "ingestion_check_time"]
    source_cols_str = '"' + '", "'.join([c for c in df.columns if c not in metadata_columns]) + '"'
    print("{% set source_cols = [")
    lines = textwrap.wrap(source_cols_str, width=96, break_long_words=False)
    for line in lines:
        print(f"    {line}")
    print("] %}")
    print('{% set metadata_cols = ["source_data_updated", "ingestion_check_time"] %}')
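For reference, here is what the formatter prints for a small column set. Only df.columns is read, so this self-contained demo substitutes a SimpleNamespace for a pandas DataFrame (the column names are illustrative):

```python
import textwrap
from types import SimpleNamespace

def format_cols_as_jinja_variable_for_dbt_script(df) -> None:
    # repeated from above so this demo runs standalone;
    # any object with a .columns list of strings works
    metadata_columns = ["source_data_updated", "ingestion_check_time"]
    source_cols_str = '"' + '", "'.join([c for c in df.columns if c not in metadata_columns]) + '"'
    print("{% set source_cols = [")
    for line in textwrap.wrap(source_cols_str, width=96, break_long_words=False):
        print(f"    {line}")
    print("] %}")
    print('{% set metadata_cols = ["source_data_updated", "ingestion_check_time"] %}')

fake_df = SimpleNamespace(
    columns=["product_no", "name", "price", "source_data_updated", "ingestion_check_time"]
)
format_cols_as_jinja_variable_for_dbt_script(fake_df)
```

This prints a `{% set source_cols = [...] %}` block listing only the source columns, followed by the fixed `metadata_cols` line, ready to paste into the top of a dbt script.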