
datools's People

Contributors

dependabot[bot], marcua, mohmmadhd


datools's Issues

Talk to Eugene Wu once candidate generation works

Notes for conversation

  • No need to read code --- I can explain it all
  • So far, I've implemented two things
    • column statistics --- given a table, identifies range- and set-valued columns. For range-valued columns, identifies percentile bucket boundaries (e.g., 3 values representing [start_of_first, start_of_second, start_of_third] bucket values). For set-valued columns, identifies the (e.g., 100) most popular values.
    • Based on the column statistics, a candidate predicate generator that, for each column statistic, generates predicates. For range-valued columns, this means predicates of the form column >= start_of_first AND column < start_of_second. For set-valued columns, this means predicates of the form column = popular_value.
  • With those primitives, we can implement the fun part! Some thoughts
    • I liked your suggestion to implement https://homes.cs.washington.edu/~suciu/main_explanation.pdf ("the UW paper") instead of https://dspace.mit.edu/bitstream/handle/1721.1/89076/scorpion-vldb13.pdf?sequence=1&isAllowed=y ("the Scorpion paper"), because I can then push as much of the heavy lifting into the DB without relying on external libraries for decision trees, etc. One hitch in implementing the UW paper is that it relies heavily on cubes, which several DBs don't natively implement (especially SQLite and DuckDB, which I'm targeting for tests before supporting a broader set of databases). This isn't a huge problem: while I await a conversation with you, I'll implement a wrapper that emulates grouping sets/cubes by way of a bunch of UNION ALLs of GROUP BY combinations.
    • One thing to align on early: what's the API?
      • The UW paper says the user provides a list of queries with individual aggregates that can be arithmetically combined, along with a high/low direction indicator:
        [image]
      • The Scorpion paper says the user provides an annotation over the aggregate query that separates the query into a hold-out set and a set of outliers that are annotated with high/low direction indicators:
        [image]
    • Once we agree on the API, I also need help deciphering the UW algorithm for a single table. Namely
      • What's the difference between the two metrics \mu_{aggr} and \mu_{interv}?
      • Can we work through implementing the cube from the UW paper on the sensor example in the Scorpion paper since I've got that "sample dataset" implemented in the tests?
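The grouping sets/cubes wrapper mentioned above could be sketched roughly like this (a hypothetical helper for illustration, not the datools implementation):

```python
from itertools import combinations

def cube_as_union_all(table, group_columns, aggregate):
    """Emulate CUBE(group_columns) for engines without native grouping
    sets by UNION ALL-ing one GROUP BY per subset of the columns.
    Hypothetical sketch, not the datools implementation."""
    queries = []
    for size in range(len(group_columns) + 1):
        for subset in combinations(group_columns, size):
            # columns outside this grouping set are padded with NULL
            select_list = [
                column if column in subset else f'NULL AS {column}'
                for column in group_columns
            ]
            group_by = f" GROUP BY {', '.join(subset)}" if subset else ''
            queries.append(
                f"SELECT {', '.join(select_list)}, {aggregate}"
                f' FROM {table}{group_by}')
    return '\nUNION ALL\n'.join(queries)
```

For two grouping columns this produces four GROUP BY queries (the empty grouping, each single column, and the pair), matching the 2^n grouping sets of a cube.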

Add mypy

Solution sketch

  • pip install mypy (add to requirements)
  • mypy datools (add to Makefile)
  • requires some cleanup to fix the errors mypy reports
  • add to CI build
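A minimal sketch of the wiring (target and file names are assumptions, not the project's actual layout):

```
# requirements.txt (dev dependencies)
mypy

# Makefile
lint-types:
	mypy datools

# CI build: invoke `make lint-types` as a step
```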

Augment unit tests with tests on larger datasets

It's hard to test things like range-valued attributes in test_diff.py with such a small amount of data. Introducing a test dataset (it can even be the sensor one in our examples folder) would help us stress test the library a bit more.

dbt diff() macro

I'm interested in prototyping the following proposal.

Goal

Enable the DIFF operator within a dbt project.

Implementation proposal

  • Create a suite of dbt macros that mimic the SQL generated within datools/explanations.py
  • diff() macro as the main interface
  • Result is a relation like this:
    [image]

Potential syntax

with
this_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-28' and '2018-09-04'

),

last_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-21' and '2018-08-28'

)

{{ datools.diff(this_week, last_week, on=["app_version", "device_type", "os"], compare_by="risk_ratio", threshold="2.0", support="0.05", max_order="1") }}
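The call above implies a macro signature along these lines (a hypothetical skeleton; the argument names mirror the proposed call, and the body is left to the implementation):

```
{% macro diff(test_relation,
              control_relation,
              on,
              compare_by="risk_ratio",
              threshold="2.0",
              support="0.05",
              max_order="1") %}
    {# Generate the DIFF SQL here, mimicking datools/explanations.py #}
{% endmacro %}
```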

Examples of dbt macros

These two dbt packages contain macros that might be useful for inspiration.

Better handle range-valued columns by integrating table_statistics into DIFF

Use this: https://github.com/marcua/datools/blob/main/datools/table_statistics.py

That functionality is wrapped by the currently unused _single_column_candidate_predicates, which could be used by diff to generate range-based candidates.

  • Might need to modify the statistics code to work on arbitrary queries, not just tables.
  • Thinking about this harder, we're likely going to want more explicit direction: should each column be treated as set-valued or range-valued? Otherwise, range-valued columns will probably cause unhelpful explanations when also treated as set-valued.
  • Rewrite _range_valued_statistics and _set_valued_statistics to be public APIs and take a query instead of a table name.
    • engine instead of connection as argument
    • query instead of table as argument
  • Create a peer to on_columns called on_range_columns and return explanations for both types of column after transforming ranges to sets by bucketing.
  • test_diffs.py works with the new API, but doesn't return any of the buckets as explanations. Is that by design given the data?
  • Update the Intel Sensor example created in #20 to transform range-valued attributes.
    • Make sure the example has the same results before you add range-valued columns
    • Make sure that after bucketing range-valued columns, the results are more sensible than treating everything as a set-valued attribute. (I did, and it's not more sensible; I wrote up some hypotheses in the notebook.)
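The "transforming ranges to sets by bucketing" step amounts to something like the following (an illustrative sketch; in datools the real logic would live behind _single_column_candidate_predicates):

```python
def range_predicates(column, bucket_boundaries):
    """Turn ascending percentile bucket boundaries into candidate
    predicates of the form `column >= lo AND column < hi`.
    Illustrative sketch only, not the datools implementation."""
    predicates = []
    for low, high in zip(bucket_boundaries, bucket_boundaries[1:]):
        predicates.append(f'{column} >= {low} AND {column} < {high}')
    # the last bucket is open-ended on the right
    predicates.append(f'{column} >= {bucket_boundaries[-1]}')
    return predicates
```

Each predicate then behaves like a set-valued candidate (one "value" per bucket), so the existing DIFF machinery can score it unchanged.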

Candidate column generation for enums (char/int) and ranges (%tiles for int/float)

    # TODO(marcua): column_values[column] seems to have # rows as
    # output, rather than # buckets.

See the range values below:

[(Column('id', INTEGER(), table=<sensor_readings>, primary_key=True, nullable=False),
  [SetValuedStatistics(distinct_values: 9),
   RangeValuedStatistics(bucket_maximums: [1, 2, 3, 4, 5, 6, 7, 8, 9])]),
 (Column('sensor_id', VARCHAR(), table=<sensor_readings>, nullable=False),
  [SetValuedStatistics(distinct_values: 3)]),
 (Column('created_at', DATETIME(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: ['2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000'])]),
 (Column('voltage', FLOAT(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: [2.3, 2.3, 2.63, 2.64, 2.65, 2.7, 2.7, 2.7, 2.7])]),
 (Column('humidity', FLOAT(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: [0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.5, 0.5, 0.5])]),
 (Column('temperature', FLOAT(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: [34.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0, 80.0, 100.0])])]
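One possible fix for the rows-vs-buckets TODO: compute one boundary per percentile bucket rather than per row, then collapse duplicate boundaries produced by skewed data (a hypothetical helper, not the datools code; the second test below uses the temperature values from the output above):

```python
def bucket_boundaries(values, num_buckets=3):
    """Collapse a column's values into at most num_buckets percentile
    boundaries (right edge of each bucket), deduplicated.
    Hypothetical sketch of a fix for the rows-vs-buckets TODO."""
    ordered = sorted(values)
    boundaries = []
    for i in range(1, num_buckets + 1):
        # right edge of the i-th of num_buckets equal-count buckets
        index = min(len(ordered) - 1, (i * len(ordered)) // num_buckets - 1)
        boundaries.append(ordered[index])
    # skewed data can repeat a boundary; keep only distinct edges
    deduped = []
    for boundary in boundaries:
        if not deduped or boundary != deduped[-1]:
            deduped.append(boundary)
    return deduped
```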
