Comments (8)
Acceptance Criteria
- My seeds and fixtures should treat empty values the same
col_a,col_b,col_c,col_d
1,blue,doug,hat
2,red,,shoe
,,
becomes NULL- there is no current way to specify an empty string using
csv
format
from dbt-core.
Thanks for trying out unit testing and reporting this @leesiongchan !
How dbt seeds treat empty values
dbt-labs/docs.getdbt.com#4867 describes how dbt seeds handle empty strings by converting them to SQL NULL
values.
It also describes that there are two values that you can't load directly with CSV seeds:
- An empty / blank string (
""
/''
)- A string with the value
"null"
It totally makes sense that you'd expect CSV-formatted unit test fixtures to behave the same way as dbt seeds.
But we'll need to take another look at this and choose between a couple options for moving forward.
Our options
What's your take on this @graciegoheen? We'd need to choose between one of these two options:
-
Update the behavior of blanks in CSV-formatted unit test fixtures to align with CSV-formatted seeds, accepting that it won't be possible to represent blank strings. Users that need to represent blank strings for unit tests will need to use
dict
rather thancsv
for theformat
. -
Keep current behavior of blanks in CSV-formatted unit test fixtures, accepting that it is different than CSV-formatted seeds. We'd want to add appropriate caveats in our documentation.
To me the 1st option has more congruence and the 2nd option has more dissonance. What do you think, Grace?
Reprex
seeds/my_seed.csv
animal,legs
duck,2
snake,""
tests/fixtures/my_model_fixture.csv
animal,legs
duck,2
snake,
models/animals.sql
select 'duck' as animal, 2 as legs union all
select 'snake' as animal, null as legs
models/_unit.yml
unit_tests:
- name: test_a_9881
description: https://github.com/dbt-labs/dbt-core/issues/9881
model: animals
given: []
expect:
# This doesn't work as expected
format: csv
fixture: my_model_fixture
- name: test_b_9881
description: https://github.com/dbt-labs/dbt-core/issues/9881
model: animals
given: []
expect:
# But this works as expected
rows:
- {animal: duck, legs: 2}
- {animal: snake, legs: null}
Run this command to see how seeds work with a blank value (a blank string ""
converts to a SQL NULL
):
dbt show -s seeds/my_seed.csv --output json
Run this command and get an unexpected failure:
dbt build -s my_seed animals --full-refresh
from dbt-core.
So I think this is going to be dependent on the warehouse you're using. For example, Snowflake is totally fine with try_cast('null' as NUMBER(38,0))
, so this actually works fine for Snowflake (defining a numeric column as null in a fixture file).
In postgres (which is what @dbeatty10 is using), cast('null' as integer)
does not work but cast(null as integer)
does.
Things get funky as well when trying to cast to a string, because try_cast('null' as text)
will result in just the actual word "null" rather than NULL
.
@dbeatty10 did a deep dive on what you can and can't cast NULL
to in Snowflake here.
We've had a tradeoff for seeds for awhile where:
- you cannot define an empty string in a seed
- you can define
NULL
I understand it's confusing for us to decide the opposite for fixtures:
- you can define an empty string
- you cannot define
NULL
That being said, you can define NULL
using the other formats available (currently dict
and coming soon sql
). I believe this is because we're wrapping the values inputted in the csv fixture in single quotes, but we don't do that for the other formats.
At minimum, I think we have a few things to do here:
- add a test to core to cover this case
- update our docs on data types for unit testing with an example of how you'd supply an intended null in each of the warehouses
In addition, we should decide how we want to handle this. A few options include:
- allowing seeds and fixtures to support both empty strings and explicit
NULL
s
col_a,col_b,col_c,col_d
1,blue,doug,hat
2,null,,shoe
- making fixtures consistent with seeds (not support empty strings, but support NULLs)
- documenting that fixtures don't support NULLs (must use alternative
format
for this case) - something else?
I'm going to move this over to technical refinement to get input from our engineers! Thanks for the discussion.
from dbt-core.
Hello! @leesiongchan - thanks for opening up this request. I'd like to know more about your use case here. We've designed unit tests such that you only have to supply input mock data for the columns that are relevant to your unit test.
If we imagine a unit test that only needs col_a
and col_c
β¦
Instead of:
# tests/fixtures/my_fixture.csv
col_a,col_b,col_c
1,,doug
2,,grace
You should do:
# tests/fixtures/my_fixture.csv
col_a,col_c
1,doug
2,grace
Now, that wonβt be relevant if you do need all columns but want to test 1 case where col_b
is null
.
In that case, could you explicitly call out the null
like so:
# tests/fixtures/my_fixture.csv
col_a,col_b,col_c
1,blue,doug
2,null,grace
Seeds behave differently than fixtures (though understand the comparison since they're both csv files). I am also hesitant to convert empty values to null
for fixtures, if it would prevent someone from using an empty string in their mock input/output data for a unit test.
from dbt-core.
@graciegoheen how does this behave for you when you add it as a fixture?
animal,legs
duck,2
snake,null
Here's the error message I get when I use the project files described in #9881 (comment):
16:15:10 Runtime Error in unit_test test_a_9881 (models/_unit.yml)
An error occurred during execution of unit test 'test_a_9881'. There may be an error in the unit test definition: check the data types.
Database Error
invalid input syntax for type integer: "null"
LINE 29: cast('null' as integer)
But maybe I have a bjorked local environment somehow?
from dbt-core.
In that case, could you explicitly call out the
null
like so:# tests/fixtures/my_fixture.csv col_a,col_b,col_c 1,blue,doug 2,null,grace
If this is supported perhaps it is the best solution for both worlds (seeding & unit test fixture)? In our case we need all columns because our update/delete event consists only partial of the data.
from dbt-core.
dbt-labs/dbt-snowflake#894 (comment) has the deep dive of casting NULL
(without quotes) across all known standard data types for the following databases:
- snowflake
- bigquery
- redshift
- postgres
- databricks
- spark
from dbt-core.
Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#5476
from dbt-core.
Related Issues (20)
- [Epic] Snapshots as a First Class Citizen
- [Feature] `--empty` flag should be exposed in `flags` variable HOT 4
- [Bug] Encountered an error while generating catalog: 'NoneType' object has no attribute 'lower' HOT 3
- [Feature] Faster, if unsafe, dbt-compile please! (perhaps without connectors) HOT 2
- [Bug] Unit test raising error `'<macro name>' is undefined. This can happen when calling a macro that does not exist.`
- [Feature] better performance logging for `docs generate` command HOT 1
- [Feature] Make `warn_unenforced` and `warn_unsupported` constraint options on the project level HOT 4
- [Bug] `flags.warn_error_options.silence` should be settable `dbt_project.yml`
- [Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data
- [Bug]Jobs discovering and attempting parse potentially premature manifest - v12 HOT 2
- [Feature] When a test fails, `dbt retry` should also rebuild the model(s) against which the test failed instead of re-running the same doomed test against the original model
- [Bug] Project level `store_failures` overriding test level config with dbt>=1.7 HOT 1
- [Feature] Support deferral of semantic models HOT 4
- [Bug] unit tests' comparisons are sometimes sensitive to the order of records returned HOT 4
- [Feature] Flag for adding the updating timestamp as a column in seeds HOT 1
- [Flaky Test] test_build_assert_equal
- [Flaky Test] test_build_assert_equal
- [Flaky Test] test_hooks_on_snapshots
- [Feature] Allow adapters (or users?) to extend/reimplement `materialization_enforces_constraints`
- [Bug] dbt 1.8 does not work correctly with sqlfluff
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbt-core.