marcua / datools
License: Other
column >= start_of_first AND column < start_of_second. For set-valued columns, this means predicates of the form column = popular_value.
\mu_{aggr} and \mu_{interv}?
Solution sketch
It's hard to test things like range-valued attributes in test_diff.py
with such a small amount of data. Introducing a test dataset (it can even be the sensor one in our examples
folder) would help us stress test the library a bit more.
I'm interested in prototyping the following proposal.
Enable the DIFF operator within a dbt project.
diff() macro as the main interface
with
this_week as (
select *
from {{ ref("logs") }}
where
crash = true
and timestamp between '2018-08-28' and '2018-09-04'
),
last_week as (
select *
from {{ ref("logs") }}
where
crash = true
and timestamp between '2018-08-21' and '2018-08-28'
)
{{ datools.diff(this_week, last_week, on=["app_version", "device_type", "os"], compare_by="risk_ratio", threshold="2.0", support="0.05", max_order="1") }}
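To make the `compare_by="risk_ratio"`, `threshold`, and `support` arguments concrete, here is a minimal pure-Python sketch of what an order-1 DIFF over two sets of rows might compute. It assumes the risk ratio and support definitions from the DIFF paper; the function name `diff_order_one` and the toy rows are hypothetical and are not part of datools or of the proposed macro.

```python
from collections import Counter


def diff_order_one(test_rows, control_rows, on, threshold=2.0, support=0.05):
    """Hypothetical helper: single-column explanations ranked by risk ratio.

    Assumed definitions (following the DIFF paper):
      support    = matching test rows / all test rows
      risk ratio = P(test | value matches) / P(test | value does not match)
    """
    n_test, n_control = len(test_rows), len(control_rows)
    explanations = []
    for column in on:
        test_counts = Counter(row[column] for row in test_rows)
        control_counts = Counter(row[column] for row in control_rows)
        for value, in_test in test_counts.items():
            in_control = control_counts.get(value, 0)
            out_test = n_test - in_test
            out_control = n_control - in_control
            value_support = in_test / n_test
            exposed = in_test / (in_test + in_control)
            out_total = out_test + out_control
            unexposed = out_test / out_total if out_total else 0.0
            # If no test rows fall outside the value, the ratio is unbounded.
            ratio = exposed / unexposed if unexposed > 0 else float("inf")
            if ratio >= threshold and value_support >= support:
                explanations.append((column, value, ratio, value_support))
    # Highest risk ratio first.
    return sorted(explanations, key=lambda e: e[2], reverse=True)


# Toy data: crashes this week skew toward app_version "v2".
this_week = [{"app_version": "v2"}] * 8 + [{"app_version": "v1"}] * 2
last_week = [{"app_version": "v2"}] * 2 + [{"app_version": "v1"}] * 8
explanations = diff_order_one(this_week, last_week, on=["app_version"])
```

In a dbt macro the same counts would presumably come from grouped SQL aggregates rather than Python loops; the sketch only pins down what the thresholds filter on.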
These two dbt packages contain macros that might be useful for inspiration:
For example:
As discussed on this thread and expanded on in the DIFF paper
Use this: https://github.com/marcua/datools/blob/main/datools/table_statistics.py
That functionality is wrapped by the currently unused
datools/datools/explanation/algorithms.py
Line 32 in 18cdbc4
diff to generate range-based candidates.
_range_valued_statistics and _set_valued_statistics to be public APIs and take a query instead of a table name.
on_columns called on_range_columns and return explanations for both types of column after transforming ranges to sets by bucketing.
Current (reasonable, I think) hack: 1bab9af
# TODO(marcua): column_values[column] seems to have # rows as
# output, rather than # buckets.
see the range values below
[(Column('id', INTEGER(), table=<sensor_readings>, primary_key=True, nullable=False), [SetValuedStatistics(distinct_values: 9), RangeValuedStatistics(bucket_maximums: [1, 2, 3, 4, 5, 6, 7, 8, 9])]), (Column('sensor_id', VARCHAR(), table=<sensor_readings>, nullable=False), [SetValuedStatistics(distinct_values: 3)]), (Column('created_at', DATETIME(), table=<sensor_readings>, nullable=False), [RangeValuedStatistics(bucket_maximums: ['2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 11:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 12:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000', '2021-05-05 13:00:00.000000'])]), (Column('voltage', FLOAT(), table=<sensor_readings>, nullable=False), [RangeValuedStatistics(bucket_maximums: [2.3, 2.3, 2.63, 2.64, 2.65, 2.7, 2.7, 2.7, 2.7])]), (Column('humidity', FLOAT(), table=<sensor_readings>, nullable=False), [RangeValuedStatistics(bucket_maximums: [0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.5, 0.5, 0.5])]), (Column('temperature', FLOAT(), table=<sensor_readings>, nullable=False), [RangeValuedStatistics(bucket_maximums: [34.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0, 80.0, 100.0])])]
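The repeated entries in `bucket_maximums` above (for example, `35.0` appearing six times for `temperature`) illustrate the TODO: one value per row rather than one per bucket. A minimal sketch of one way to collapse them into distinct boundaries and emit range predicates of the form `column >= start AND column < end` follows. The helper name `range_predicates` is hypothetical, and treating bucket maximums as half-open boundaries is an assumption, not what datools does.

```python
def range_predicates(column, bucket_maximums):
    """Hypothetical helper: build half-open range predicates from boundaries.

    Duplicate boundaries are collapsed first, so the number of predicates
    tracks the number of distinct buckets rather than the number of rows.
    """
    boundaries = sorted(set(bucket_maximums))
    # Pair each boundary with the next one: [b0, b1), [b1, b2), ...
    # Note: with a strict upper bound, the largest value itself is excluded.
    return [
        f"{column} >= {low} AND {column} < {high}"
        for low, high in zip(boundaries, boundaries[1:])
    ]


# Temperature maximums copied from the output above.
temperature_maximums = [34.0, 35.0, 35.0, 35.0, 35.0, 35.0, 35.0, 80.0, 100.0]
predicates = range_predicates("temperature", temperature_maximums)
```

Whether the final bucket should use an inclusive upper bound is a design choice this sketch leaves open.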