
datools's People

Contributors

dependabot[bot], marcua, mohmmadhd


datools's Issues

Talk to Eugene Wu once candidate generation works

Notes for conversation

  • No need to read code --- I can explain it all
  • So far, I've implemented two things
    • column statistics --- given a table, identifies range- and set-valued columns. For range-valued columns, identifies percentile bucket boundaries (e.g., 3 values representing [start_of_first, start_of_second, start_of_third] bucket values). For set-valued columns, identifies the (e.g., 100) most popular values.
    • Based on the column statistics, a candidate predicate generator that, for each column statistic, generates predicates. For range-valued columns, this means predicates of the form column >= start_of_first AND column < start_of_second. For set-valued columns, this means predicates of the form column = popular_value.
  • With those primitives, we can implement the fun part! Some thoughts
    • I liked your suggestion to implement https://homes.cs.washington.edu/~suciu/main_explanation.pdf ("the UW paper") instead of https://dspace.mit.edu/bitstream/handle/1721.1/89076/scorpion-vldb13.pdf?sequence=1&isAllowed=y ("the Scorpion paper"), because I can then push as much of the heavy lifting into the DB without relying on external libraries for decision trees, etc. One hitch in implementing the UW paper is that it relies heavily on cubes, which several DBs don't natively implement (especially SQLite and DuckDB, which I'm targeting for tests before supporting a broader set of databases). This isn't a huge problem: while I await a conversation with you, I'll implement a wrapper that emulates grouping sets/cubes by way of a bunch of UNION ALLs of GROUP BY combinations.
    • One thing to align on early: what's the API?
      • The UW paper says the user provides a list of queries with individual aggregates that can be arithmetically combined, along with a high/low direction indicator:
        [image]
      • The Scorpion paper says the user provides an annotation over the aggregate query that separates the query into a hold-out set and a set of outliers that are annotated with high/low direction indicators:
        [image]
    • Once we agree on the API, I also need help deciphering the UW algorithm for a single table. Namely
      • What's the difference between the two metrics \mu_{aggr} and \mu_{interv}?
      • Can we work through implementing the cube from the UW paper on the sensor example in the Scorpion paper since I've got that "sample dataset" implemented in the tests?
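The grouping sets/cubes wrapper mentioned above could be sketched roughly like this (a hypothetical helper for illustration, not the datools implementation):

```python
from itertools import combinations

def cube_as_union_all(table, group_columns, aggregate):
    """Emulate CUBE(group_columns) for engines without native grouping
    sets by UNION ALL-ing one GROUP BY per subset of the columns.
    Hypothetical sketch, not the datools implementation."""
    queries = []
    for size in range(len(group_columns) + 1):
        for subset in combinations(group_columns, size):
            # columns outside this grouping set are padded with NULL
            select_list = [
                column if column in subset else f'NULL AS {column}'
                for column in group_columns
            ]
            group_by = f" GROUP BY {', '.join(subset)}" if subset else ''
            queries.append(
                f"SELECT {', '.join(select_list)}, {aggregate}"
                f' FROM {table}{group_by}')
    return '\nUNION ALL\n'.join(queries)
```

For two grouping columns this produces four GROUP BY queries (the empty grouping, each single column, and the pair), matching the 2^n grouping sets of a cube.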

Add mypy

Solution sketch

  • pip install mypy (add to requirements)
  • mypy datools (add to Makefile)
  • requires some cleanup to fix the errors mypy reports
  • add to CI build
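A minimal sketch of the wiring (target and file names are assumptions, not the project's actual layout):

```
# requirements.txt (dev dependencies)
mypy

# Makefile
lint-types:
	mypy datools

# CI build: invoke `make lint-types` as a step
```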

Augment unit tests with tests on larger datasets

It's hard to test things like range-valued attributes in test_diff.py with such a small amount of data. Introducing a test dataset (it can even be the sensor one in our examples folder) would help us stress test the library a bit more.

dbt diff() macro

I'm interested in prototyping the following proposal.

Goal

Enable the DIFF operator within a dbt project.

Implementation proposal

  • Create a suite of dbt macros that mimic the SQL generated within datools/explanations.py
  • diff() macro as the main interface
  • Result is a relation like this:
    [image]

Potential syntax

with
this_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-28' and '2018-09-04'

),

last_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-21' and '2018-08-28'

)

{{ datools.diff(this_week, last_week, on=["app_version", "device_type", "os"], compare_by="risk_ratio", threshold="2.0", support="0.05", max_order="1") }}
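The call above implies a macro signature along these lines (a hypothetical skeleton; the argument names mirror the proposed call, and the body is left to the implementation):

```
{% macro diff(test_relation,
              control_relation,
              on,
              compare_by="risk_ratio",
              threshold="2.0",
              support="0.05",
              max_order="1") %}
    {# Generate the DIFF SQL here, mimicking datools/explanations.py #}
{% endmacro %}
```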

Examples of dbt macros

These two dbt packages contain macros that might be useful for inspiration.

Better handle range-valued columns by integrating table_statistics into DIFF

Use this: https://github.com/marcua/datools/blob/main/datools/table_statistics.py

That functionality is wrapped by the currently unused _single_column_candidate_predicates, which could be used by diff to generate range-based candidates.

  • Might need to modify the statistics code to work on arbitrary queries, not just tables.
  • Thinking about this harder, we're likely going to want more explicit direction: should each column be treated as set-valued or range-valued? Otherwise, range-valued columns will probably cause unhelpful explanations when also treated as set-valued.
  • Rewrite _range_valued_statistics and _set_valued_statistics to be public APIs and take a query instead of a table name.
    • engine instead of connection as argument
    • query instead of table as argument
  • Create a peer to on_columns called on_range_columns and return explanations for both types of column after transforming ranges to sets by bucketing.
  • test_diffs.py works with the new API, but doesn't return any of the buckets as explanations. Is that by design given the data?
  • Update the Intel Sensor example created in #20 to transform range-valued attributes.
    • Make sure the example has the same results before you add range-valued columns
    • Make sure that after bucketing range-valued columns, the results are more sensible than treating everything as a set-valued attribute. (I did, and it's not more sensible; I wrote up some hypotheses in the notebook.)
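The "transforming ranges to sets by bucketing" step amounts to something like the following (an illustrative sketch; in datools the real logic would live behind _single_column_candidate_predicates):

```python
def range_predicates(column, bucket_boundaries):
    """Turn ascending percentile bucket boundaries into candidate
    predicates of the form `column >= lo AND column < hi`.
    Illustrative sketch only, not the datools implementation."""
    predicates = []
    for low, high in zip(bucket_boundaries, bucket_boundaries[1:]):
        predicates.append(f'{column} >= {low} AND {column} < {high}')
    # the last bucket is open-ended on the right
    predicates.append(f'{column} >= {bucket_boundaries[-1]}')
    return predicates
```

Each predicate then behaves like a set-valued candidate (one "value" per bucket), so the existing DIFF machinery can score it unchanged.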

Candidate column generation for enums (char/int) and ranges (%tiles for int/float)

    # TODO(marcua): column_values[column] seems to have # rows as
    # output, rather than # buckets.

See the range values below:

[(Column('id', INTEGER(), table=<sensor_readings>, primary_key=True, nullable=False),
  [SetValuedStatistics(distinct_values: 9),
   RangeValuedStatistics(bucket_maximums: [1, 2, 3, 4, 5, 6, 7, 8, 9])]),
 (Column('sensor_id', VARCHAR(), table=<sensor_readings>, nullable=False),
  [SetValuedStatistics(distinct_values: 3)]),
 (Column('created_at', DATETIME(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: ['2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000'])]),
 (Column('voltage', FLOAT(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: [2.3, 2.3, 2.63, 2.64, 2.65, 2.7, 2.7, 2.7, 2.7])]),
 (Column('humidity', FLOAT(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: [0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.5, 0.5, 0.5])]),
 (Column('temperature', FLOAT(), table=<sensor_readings>, nullable=False),
  [RangeValuedStatistics(bucket_maximums: [34.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0, 80.0, 100.0])])]
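One possible fix for the rows-vs-buckets TODO: compute one boundary per percentile bucket rather than per row, then collapse duplicate boundaries produced by skewed data (a hypothetical helper, not the datools code; the second test below uses the temperature values from the output above):

```python
def bucket_boundaries(values, num_buckets=3):
    """Collapse a column's values into at most num_buckets percentile
    boundaries (right edge of each bucket), deduplicated.
    Hypothetical sketch of a fix for the rows-vs-buckets TODO."""
    ordered = sorted(values)
    boundaries = []
    for i in range(1, num_buckets + 1):
        # right edge of the i-th of num_buckets equal-count buckets
        index = min(len(ordered) - 1, (i * len(ordered)) // num_buckets - 1)
        boundaries.append(ordered[index])
    # skewed data can repeat a boundary; keep only distinct edges
    deduped = []
    for boundary in boundaries:
        if not deduped or boundary != deduped[-1]:
            deduped.append(boundary)
    return deduped
```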
