
MetricFlow's Introduction

metricflow logo

Build and maintain all of your metric logic in code.


Welcome to MetricFlow

See our latest updates in the MetricFlow Changelog!

MetricFlow is a semantic layer that makes it easy to organize metric definitions. It takes those definitions and generates legible and reusable SQL. This makes it easy to get consistent metrics output broken down by attributes (dimensions) of interest.

The name comes from the approach taken to generate metrics: a query is compiled into a query plan, called a dataflow, that constructs metrics. The plan is then optimized and rendered to engine-specific SQL.



MetricFlow provides a set of abstractions that help you construct complicated logic and dynamically generate queries to handle:

  • Multi-hop joins between fact and dimension sources
  • Complex metric types such as ratio, expression, and cumulative
  • Metric aggregation to different time granularities
  • And so much more

To get up and running with your own metrics, you should rely on MetricFlow’s documentation.

Licensing

MetricFlow is distributed under a Business Source License (BUSL-1.1). For details on our additional use grant, change license, and change date please refer to our licensing agreement.

Getting Started

Install MetricFlow

MetricFlow can be installed from PyPI for use as a Python library with the following command:

pip install dbt-metricflow

MetricFlow currently serves as a query compilation and SQL rendering library, built to work in conjunction with a dbt project. As such, using MetricFlow requires a working dbt project and a dbt adapter. We provide the dbt-metricflow bundle for this purpose. You may choose to install other adapters as optional extras from dbt-metricflow.

You may need to install Postgres or Graphviz; you can do so by following the install instructions for each. Mac users may prefer to use Homebrew: brew install postgresql or brew install graphviz.

Tutorial

The best way to get started is to follow the tutorial steps, which you can access by running:

mf tutorial

Note: this must be run from a dbt project root directory.

Resources

Contributing and Code of Conduct

This project will be a place where people can easily contribute high-quality updates in a supportive environment.

Please read our code of conduct before diving in.

To get started on direct contributions, head on over to our contributor guide.

License

MetricFlow is source-available software.

Versions 0 through 0.140.0 were covered by the Affero GPL license. Versions 0.150.0 and greater are covered by the BSL license.

MetricFlow is built by dbt Labs.


MetricFlow's Issues

Re-write validation tests to rely on concrete model files or fresh UserDefinedModel objects

Describe the bug
In #17 we encountered an issue where the model validation tests failed non-deterministically: an in-place mutation of the simple model we use for most of our tests resulted in the wrong exception being thrown. The short-term issue was addressed in #18, but that was a quick hack patch rather than a holistic solution.

Steps To Reproduce

  1. Find all of the locations in test/model/validations where we use copy.deepcopy
  2. Observe that these are making local in-memory copies of the simple model and then mutating those copies to cause validation issues

Expected behavior
These should be pointing at a set of model files (or, alternatively, a constructed UserDefinedModel object) that fails validation in the specific manner expected by the test. We will decide on a path here and update this issue accordingly.
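As a rough sketch of the second option, a per-test fixture could hand each test its own freshly constructed model, so in-place mutations can't leak across tests. Here load_simple_model is a hypothetical helper standing in for whichever parsing/construction path we choose:

import pytest

def load_simple_model():
    # Hypothetical helper: re-parse the simple model's YAML files (or build
    # a fresh UserDefinedModel) from scratch on every call.
    raise NotImplementedError

@pytest.fixture
def simple_model():
    # A brand-new object per test, so mutating it in one test cannot
    # change the behavior of any other test.
    return load_simple_model()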

Environment (please complete the following information):

  • MetricFlow Version [e.g. 1.0.0]

Update model parser to allow parsing context to be available to all model objects

Describe the Feature
We currently populate some parsing context metadata in top-level model fields and one nested model field. The latter is accomplished via an improperly typed recursive callback, and much of our parsing context handling is confusing and difficult to follow. More importantly, once #107 is complete it will stop working altogether for nested model objects.

This issue is to improve our parsing logic to preserve this functionality and allow for a more natural mechanism for including parsing context metadata in arbitrarily nested model objects.

Would you like to contribute?
I am building this feature

Make JSON schema generator check action required

With #211 we have a generated json schema in place to support certain IDE integrations for model editing, and #216 adds a github action to ensure that schema remains up to date.

Ideally that action would be required on every PR, but because any action with a "path" conditional is marked pending unless the path condition is met, those actions will never be considered successful. This means we can't use them in branch protection.

The work-around is to make the workflow run two jobs on every PR, one of which uses something like the dorny/paths-filter action and the other of which only runs if the path condition check passes. This will let us add branch protection.

Improve model parsing to allow for input flexibility

In order to ship improvements that require potential breaking changes to the model config structure we need our parser to be a bit more flexible with respect to input.

Currently, the code is fairly rigid with respect to how parsing is applied at a model field level, and extending it requires a lot of ugly branch conditionals. This issue is to refine and restructure the code to make adding this and other improvements more straightforward.

Simplify Time Constraint SQL

Describe the bug
Time Constraints are rendered in SQL as follows:

WHERE (
  subq_166.ds >= CAST('2022-02-26' AS TIMESTAMP)
) AND (
  subq_166.ds <= CAST('2022-03-28' AS TIMESTAMP)
)

Change them to:

WHERE
  subq_166.ds BETWEEN CAST('2022-02-26' AS TIMESTAMP) AND CAST('2022-03-28' AS TIMESTAMP)

Allow aggregating measures by a custom time dimension

Describe the Feature

Prior to this, all data sources had to have a primary time dimension with the same name, but there are lots of use cases for:

  1. Allowing data sources to have different time dimension names (like maybe one has a primary time of "day" and another has a primary time of "hour")
  2. Allowing measures within a data source to have different primary times (e.g., an expected_revenue measure might be based on billing_date and revenue might be based on payment_received_date, and people might want those in the same data source)

Support DuckDB as a SQL Client

Describe the Feature
DuckDB is basically the SQLite equivalent for OLAP databases. This would be great if you want an embeddable metrics layer in your product: you wouldn't need to manage an external server or a separate installation, you'd get support for denormalized tables, etc.

Would you like to contribute?
Perhaps.

Anything Else?
https://duckdb.org/
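For context on the embeddable angle, a minimal sketch of an in-process DuckDB connection (nothing here is MetricFlow-specific):

import duckdb

# Runs entirely in-process; ":memory:" avoids even a database file.
con = duckdb.connect(":memory:")
con.execute("CREATE TABLE orders AS SELECT 1 AS user_id, 42.0 AS amount")
print(con.execute("SELECT SUM(amount) FROM orders").fetchall())  # [(42.0,)]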

Add support for Hive

Describe the Feature
Apache Hive is still the standard data warehouse in the open-source Hadoop ecosystem. Can we support interaction with Hive?

Would you like to contribute?
yes

Anything Else?
no

Add validation to protect against raw inferred config commits

Describe the Feature
With #194 we have the ability to generate an inferred data source config from Snowflake schemas. When this is production-ready and extended to all warehouse types (and other inference input types, like data catalogues or whatever) we will want a way to automatically prevent users from committing any config that contains mechanically valid but potentially semantically incorrect inferred values.

We can track examples of such cases here for now, since we don't currently mark them in a way automatic validation can evaluate.

For example, we default time dimension granularity to "day" but we don't actually have signal on the true granularity in the warehouse.

Provide an API for warehouse client configurations based on in memory objects

Describe the Feature
Today MetricFlow depends entirely on the config file for initializing warehouse clients with proper connection configuration and credentials, and the APIs for initializing SqlClient instances based on non-config sources of data (e.g., an external credential service) are not formally supported.

This task is to provide a standard API for initializing a MetricFlow Python client based on non-filesystem inputs for data warehouse information. This could be combined with #133 into a simple factory method that allows for, essentially, a dataclass form of the metricflow config which callers could populate on their own.
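A minimal sketch of what that dataclass form might look like; the names are hypothetical suggestions, not an agreed-upon interface:

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class WarehouseConnectionConfig:
    # Hypothetical in-memory equivalent of the config file's connection fields.
    dialect: str
    host: str
    user: str
    password: str  # e.g. fetched from an external credential service
    database: Optional[str] = None

def sql_client_from_config(config: WarehouseConnectionConfig):
    # Hypothetical factory: initialize a SqlClient without touching the filesystem.
    ...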

Define constraints for each measure in a metric

Describe the Feature
Sometimes users want to take two measures and define a ratio or expr type metric that applies constraints on the measures independently, like to compute a ratio metric of paid to total users. So you get something like:

  • numerator: count_users
    • where account_type = 'paid'
  • denominator: count_users

To do this requires us to update the measures lists for metric inputs, and the numerator and denominator entries, to accept structs instead of string names. As this is a breaking change for the model we will build out #107 and accept both input formats in the short term. If we determine a migration of end-user configs is warranted we can push out validation warnings and remove support for the current input format at a later time.

Add support for MySQL

Describe the Feature
MetricFlow should be able to issue queries against MySQL. This requires:

  1. Adding a MySQL SqlClient (with its associated attributes struct),
  2. (If necessary) updating the SQL rendering to handle any dialect-specific weirdness
  3. Ensuring the test suite passes against a MySQL target warehouse

Support Trino

Describe the Feature
MetricFlow should be able to issue queries against Trino. This requires:

  1. Adding a Trino SqlClient (with its associated attributes struct),
  2. (If necessary) updating the SQL rendering to handle any dialect-specific syntaxes
  3. Ensuring the test suite passes against a Trino target warehouse

Would you like to contribute?
This is a great first issue!

Anything Else?
Add any other context or screenshots about the feature request here.

Build a Python interface

Describe the Feature
Leverage the current implementation of the CLI to make it more extensible and allow for ease of usage in other parts of the codebase. This can be followed by creating a simple Python API that can be used to perform most MetricFlow commands.

Add support for Databricks

Describe the Feature
MetricFlow should be able to issue queries against Databricks. This requires:

  1. Adding a Databricks SqlClient (with its associated attributes struct),
  2. (If necessary) updating the SQL rendering to handle any dialect-specific weirdness
  3. Ensuring the test suite passes against a Databricks target warehouse

Would you like to contribute?
Are you interested in building this feature?

Anything Else?
Add any other context or screenshots about the feature request here.

Model validation fails with strange pickling exception

Describe the bug
Seeing errors like this in unit test runs locally on both Python 3.8 and Python 3.9, but it does not repro in CI:

AttributeError: Can't get attribute '_Pydantic_DataSourceReference_140501776576224' on <module 'pydantic.dataclasses'

The culprit is the concurrent futures handling in model validation:

metricflow/test/model/validations/test_identifiers.py:35:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
metricflow/model/model_validator.py:81: in validate_model
    issues.extend(future.result())
../../miniconda3/envs/mf3.8/lib/python3.8/concurrent/futures/_base.py:437: in result
    return self.__get_result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = None

    def __get_result(self):
        if self._exception:
            try:
>               raise self._exception
E               concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

../../miniconda3/envs/mf3.8/lib/python3.8/concurrent/futures/_base.py:389: BrokenProcessPool
__________________________________________________________________________________________________________________________ test_invalid_composite_identifiers __________________________________________________________________________________________________________________________
concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
  File "/Users/tom/miniconda3/envs/mf3.8/lib/python3.8/concurrent/futures/process.py", line 368, in _queue_management_worker
    result_item = result_reader.recv()
  File "/Users/tom/miniconda3/envs/mf3.8/lib/python3.8/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute '_Pydantic_DataSourceElementReference_140502096930096' on <module 'pydantic.dataclasses' from '/Users/tom/miniconda3/envs/mf3.8/lib/python3.8/site-packages/pydantic/dataclasses.cpython-38-darwin.so'>
'''
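For context, this failure mode is characteristic of dynamically created classes: pickle serializes instances by module-qualified class name, and the worker process has to be able to look that name up (the generated _Pydantic_* names appear to embed a memory address, which won't match across processes). A standalone sketch of the general mechanism, not MetricFlow code:

import pickle

def make_class():
    class Dynamic:  # created at call time, not importable by a stable module path
        pass
    return Dynamic

obj = make_class()()
pickle.dumps(obj)  # AttributeError: Can't pickle local object 'make_class.<locals>.Dynamic'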

Create a MetricFlow client by passing the absolute path of the config file

Describe the Feature
It would be useful to be able to create a MetricFlow client by passing the path of the configuration file.
If the path of the config file is not provided, then fallback on the default config file.

Right now we can do:

mf_client = MetricFlowClient.from_config() # config file must be in ~/.metricflow/config.yaml

It would be useful to be able to do also:

path = "/path/to/my/config/file.yaml"
mf_client = MetricFlowClient.from_config(path)
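A minimal sketch of the requested signature, keeping the current default location as the fallback (names illustrative):

import os
from typing import Optional

DEFAULT_CONFIG_PATH = os.path.expanduser("~/.metricflow/config.yaml")

class MetricFlowClient:
    @classmethod
    def from_config(cls, config_path: Optional[str] = None) -> "MetricFlowClient":
        # Fall back to the default config file when no explicit path is given.
        path = config_path or DEFAULT_CONFIG_PATH
        ...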

Would you like to contribute?
I can try 😅

Make GroupBy Alias handling an engine attribute

Describe the Feature

In DataflowToSqlQueryPlanConverter.convert_to_sql_query_plan we currently check to see if the engine is BigQuery and then do some special handling around group by aliases. This should probably be moved to be a property of the SQL engine, and handled more generally instead of as a special case check against BigQuery.

Today:

https://github.com/transform-data/metricflow/blob/main/metricflow/plan_conversion/dataflow_to_sql.py#L1302-L1304

In the future, perhaps this can be an attribute of this Protocol, and updated for BigQuery:

https://github.com/transform-data/metricflow/blob/main/metricflow/protocols/sql_client.py#L132
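A minimal sketch of the protocol-attribute version; the property name is hypothetical:

from typing import Protocol

class SqlClient(Protocol):
    @property
    def uses_group_by_column_aliases(self) -> bool:
        # Hypothetical engine attribute: True for engines (currently just
        # BigQuery) that need the special group-by alias rendering.
        ...

# The converter would then branch on the attribute, not on the engine identity:
# if sql_client.uses_group_by_column_aliases: ...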

Would you like to contribute?
Are you interested in building this feature?

Anything Else?
Add any other context or screenshots about the feature request here.

Setup fails because of `grpcio >= 1.48.0` on M1 Pro

Describe the bug
When going through CONTRIBUTING.md, grpcio 1.48.0 installation fails on M1 Pro.

We have already faced this problem on other Python projects, and it is usually fixed with:

python -m pip install --no-binary :all: grpcio --ignore-installed

But this downloads from PyPI, and the version there is 1.47.0 whereas the version in the lock file is 1.48.0.

Steps To Reproduce
Steps to reproduce the behavior:

make venv
source venv/bin/activate
pip install poetry
make install  # fails when installing grpcio

Expected behavior
I'd expect make install to install dependencies successfully.

Environment (please complete the following information):

  • Apple M1 Pro
  • Python 3.9.13 (venv)
  • MetricFlow v0.111.0

Enable Pydantic plugin for MyPy

We don't use the pydantic plugin for mypy, and it turns out this allows for a lot of shady stuff, up to and including passing the wrong BaseModel subclass into a stock Python dataclass with mypy pretending that's ok.

This issue is to enable the pydantic plugin. Note, however, that we were (inadvertently) doing some shady stuff that will need fixing, or at least a type annotation override. The most common issue is the explicit Any, which would be good to clean up, but we have a couple of uses of generics which seemed sound but are probably inappropriate in some subtle fashion, and those should probably be addressed as part of this effort.

Add a custom metadata construct to the user defined model and expose it for consumption

Describe the Feature
Sometimes end users need to annotate their metrics or data sources or what have you with some bits of custom metadata that are not accounted for in the model. These are things that would be company specific - like if you work at a company that's subject to certain types of audit requirements on your data collection and storage, maybe you'd like to annotate your model with audit levels or last audit date.

These are not necessarily appropriate to include in Metricflow proper, but they'd be useful for end users with these types of needs.

This task is to spec out what such a construct might look like (a bare string? a map? an open struct with a flexible schema to-be-defined by the caller?) and to provide a way of including it in the model, parsing it into something consumable by end users, and providing an API for it.

Semantic conversions between objects and specs leads to circular import dependencies

Describe the bug
While attempting to update #153 I encountered some circular dependencies that could not be expediently resolved. The time has come to clean this up.

We have a couple of situations where we need to move from a model object in metricflow.model.objects to a spec in metricflow.specs via a call to a semantic container in metricflow.model.semantics.semantic_containers.

Right now the existing logic (converting WhereClauseConstraint to SpecWhereClauseConstraint) lives in the query parser, which is a bit silly but neatly sidesteps the circular import dependency, since the query parser was a natural collector of these imports and the DataFlowPlanBuilder didn't depend on anything in the query parser itself. However, #153 adds a conversion call into MetricSemantics.measures_for_metric(), which makes this problem unavoidable.

Splitting up MetricSemantics doesn't make a lot of sense - getting input measures for a metric is the kind of thing MetricSemantics is supposed to do. Naively forking the conversion logic into a separate module isn't going to work either, because that requires a dependency on a semantic container, which in turn requires a dependency on that converter.

Proposal

The Metricflow package essentially consists of:

  • Primitive model/core query handling objects
    • Model objects (metricflow.model.objects)
    • Specs/instances/references (metricflow.specs, metricflow.instances, metricflow.references)
  • Semantic containers for those data objects (metricflow.model.semantics.semantic_containers)
  • Resolvers/converters related to semantic container operation
    • The WhereClauseConstraint -> SpecWhereClauseConstraint converter
    • LinkableInstanceSpecResolver
  • Query Parser (metricflow.query.query_parser)
  • Plan objects
    • DataflowPlanNode (metricflow.dataflow.dataflow_plan)
    • SqlQueryPlanNode (metricflow.sql.sql_plan)
    • etc.
  • Plan builders (DataflowPlanBuilder, mainly, metricflow.dataflow.builder.dataflow_plan_builder)
  • Various converter and resolver objects for handling transformations between object types
    • Visitors for plan conversion
    • InstanceSetConverters
    • NodeEvaluator
    • NodeProcessor
    • ColumnResolver

Some notes on how these fit together:

  1. The primitive objects need to be loaded into the semantic containers, which are in turn passed around all over the place.
  2. The semantic containers and specs/instances/references are needed for plan building and occasionally plan conversion.
  3. The query parser requires both of the above

This implies:

  1. Model objects and Specs/Instances/References should not depend on each other. It's ok for instances to depend on specs - they're the same class of thing - but they should generally not depend on anything in UserDefinedModel
  2. Model objects and specs/instances/references cannot depend directly on the semantic containers, the query parser, or plan builders/converters

So we should have something like:

[ Model Objects ]      [ Specs/Instances/References ]
[ Semantic containers ]
[ Model Object -> Spec Converter ]
[ Plan objects ]
[ Plan Builders and Converters ]
[ Query Parser ]
[ Core Query APIs ] 
[ CLI  entrypoints ] [ API entrypoints ] 

Items in brackets can depend on anything in the bracket within reason. Items in the list can only depend on things above them (within reason).

We can use Protocols to resolve cases where, e.g., the Object -> Spec converter requires a semantic container that also requires the model/spec objects (like we need a DataSourceSemantics call to do a conversion involving a Measure).

ALL TIME CONSTRAINT should not be rendered as a WHERE clause

Describe the bug
When a query does not specify a time constraint we define a TimeRangeConstraint using ALL_TIME_BEGIN and ALL_TIME_END, and then construct a WHERE clause with this constraint. So we end up with something like:

SELECT <columns>
FROM <data_source>
WHERE primary_time >= CAST('1700-01-01' AS TIMESTAMP) AND primary_time <= CAST('2300-12-31' AS TIMESTAMP)

This is silly. We should just not have any primary_time filters in the rendered SQL.

Steps To Reproduce
Generate the SQL query output for any query with no time constraint applied on the primary time dimension.

Enable additional auth methods for bq connections

Describe the Feature
Allow users to authorize with user account JSON credentials or OAuth, as opposed to solely supporting service account credential file authorization.

Would you like to contribute?
Potentially (definitely would be happy to help with testing and gathering requirements but not sure I have the bandwidth to write the pr myself)

Anything Else?
Currently, the error you get when you try to use a user JSON credential file complains about various required fields not being present in the file. Also, it's not really intuitive that the field for the credential file location in the config.yml file is called dwh_password. The comment after it clarifies, but the name is not useful for new adopters.

Switch from time.time to time.monotonic for latency computations

Describe the bug
We have a number of timing calls that use time.time() comparisons to measure and log the duration of operations.

This function depends on the system clock, which means NTP adjustments or other changes to the system clock could cause incorrect output results.

Wherever we do time comparisons (i.e., reporting the difference between time2 and time1) the time.monotonic() function is more appropriate, as it is not affected by system clock changes. time.monotonic() uses an arbitrary reference point, so if we have places where we need to report a meaningful single-point timestamp then we should continue to use time.time() for that purpose.
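A minimal sketch of the pattern:

import time

start = time.monotonic()          # unaffected by NTP or manual clock changes
time.sleep(0.25)                  # stand-in for the operation being timed
duration = time.monotonic() - start
print(f"operation took {duration:.3f}s")

logged_at = time.time()           # keep time.time() for single-point wall-clock timestamps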

MetricFlow Engine Tests persistently fail on forks

Describe the bug
Apparently these tests run once per day on forks of the repo and then fail. This might only be an issue for certain accounts.

Steps To Reproduce
Steps to reproduce the behavior:

  1. fork the repo
  2. wait and see if that action triggers on a schedule?

Expected behavior
Either the actions should be able to succeed, or they should not run. Given the nature of these tests, probably the latter.

Validation methods should not return Optional collections of errors

Describe the bug
We have validation methods that return objects containing Optional collections of issues and things of that nature. In particular, the ModelBuildResult has an optional "issues" tuple which gets populated with a tuple of validation errors whenever validation runs. Making this optional leads to callsite shenanigans like this:

errors = ModelValidator.validate_model.issues or tuple()

It's a small thing but it has cascading effects, where we need to do None type checks or coalesces all over the place.

Expected behavior
It'd be more natural to have the return type be a required collection (i.e., a Tuple rather than an Optional[Tuple]).

This will require some restructuring, since the ModelValidator.validate_model function returns a ModelBuildResult, and the ModelBuildResult class is currently overloaded to return a model built by parsing input YAML, which might or might not have been validated. Using None as an implicit indicator of whether or not validation has occurred is not desirable. We should either have an explicit flag or, better still, we should divide up ModelBuildResult into ModelParsingResult and ModelValidationResult.
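A minimal sketch of the proposed split; the field details are illustrative:

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ModelParsingResult:
    model: "UserDefinedModel"  # the model built from the input YAML

@dataclass(frozen=True)
class ModelValidationResult:
    # Never Optional: an empty tuple means "validated, no issues found".
    issues: Tuple["ValidationIssue", ...] = ()

# Callsites can then read result.issues directly, with no `or tuple()` coalescing.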

Environment (please complete the following information):

  • MetricFlow Version [e.g. 1.0.0]

README is way too complex

It took two experienced developers 10 minutes of close reading to figure out what you actually do. The first line of the readme should accurately convey to the layman what the product does in its simplest form.

Example:

Welcome to MetricFlow

We create SQL queries that you can reuse.

Make SqlTable escaped keyword aware

Describe the Feature
Update our table name rendering to properly handle escape characters for sql_table entries involving reserved keywords. This involves figuring out how to make the SqlTable class process and emit properly escaped table names in a generally robust fashion. Note this can be difficult as different engines have different escape characters, and escaped table names can be quite literally anything.

Today the SqlTable class does simple split/join operations on the . character for differentiating between database, schema, and table sections in a table identifier string.

The issue is that any table identifier containing a reserved keyword can be escaped in either of the following ways in SQL:

-- schema is a reserved keyword
SELECT *
FROM "my_database.schema.my_table"

Or:

-- schema is a reserved keyword
SELECT *
FROM my_database."schema".my_table

However, the same is not true of our YAML input. If a user does this:

sql_table: "my_database.schema.my_table"

We will, at some point, render a broken FROM clause like FROM schema.my_table"

Since warehouse validation should catch these kind of issues we might not need to deal with this at this level, but if there's a simple way to robustly manage this scenario it might be nice to have it in place.
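For reference, a minimal sketch of a quote-aware split, assuming double-quote escaping only; real engines differ (backticks on BigQuery/MySQL, brackets on SQL Server), so this is illustrative rather than a complete solution:

from typing import List

def split_table_identifier(identifier: str) -> List[str]:
    """Split db.schema.table on '.', but never inside a quoted section."""
    parts: List[str] = []
    current: List[str] = []
    in_quotes = False
    for ch in identifier:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "." and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

# split_table_identifier('my_database."schema".my_table')
# -> ['my_database', '"schema"', 'my_table']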

Aggregation time dimension not set error is confusing

If users forget to set an aggregation time dimension on a measure OR forget to set a primary time dimension in their data source, they hit this inscrutable validation error:

AssertionError: Aggregation time dimension for {name} should have been set during model transformation

Set up automation for CLA signing

Describe the Feature
Developers have no self-service way to sign our CLA, which is a pre-condition for contribution. This Issue is for setting up some automation, probably using https://github.com/contributor-assistant

Would you like to contribute?
Are you interested in building this feature?

Anything Else?
Add any other context or screenshots about the feature request here.

Remove blanket "ignore missing imports" setting for mypy

The current configuration for running mypy in the Metricflow .pre-commit-config.yaml uses the --ignore-missing-imports flag. This is generally undesirable.

This issue is to remove that and replace it with a combination of typed versions of those packages, type stubs, and more limited scope ignore missing import directives (either inline on the individual imports or else as part of the mypy.ini).

For more information on these approaches, please refer to https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports

Add support for versioned dimensions (SCD Type II data sources)

Describe the Feature
A number of users have asked for the ability to join against the valid dimension value in a versioned data source built on top of an SCD (Slowly Changing Dimension) Type II warehouse table.

The initial request is to join against a table with the following basic structure:

entity_key | dimension_1 | .... | dimension_x | valid_from | valid_to

where valid_from and valid_to are timestamp fields, and valid_to is NULL if the row represents the currently valid set of dimension values.

On the query side, the repeated request is to be able to easily query against this data source and use the measure aggregation time dimension values to select the relevant dimension values for the aggregation date range. In other words, a join condition of the form:

SELECT metric_time, dimension_1, SUM(1) AS num_events
FROM events a
LEFT OUTER JOIN scd_dimensions b
ON 
  a.entity_key = b.entity_key 
  AND a.metric_time >= b.valid_from 
  AND (a.metric_time < b.valid_to OR b.valid_to IS NULL)
GROUP BY 1, 2

This is useful for joining against dimensions which change gradually over time when you need the value from some point in history. For example, if you want to look at historical trends for sales by customer__country you probably want the customer__country of the moment the sale occurred, not whatever that value happens to be on the day you run the query. Just because someone moves from France to Germany doesn't mean that sale should be considered part of your German market.

Once initial support is in place for this base feature specification we can add support for different schemas (e.g., a single date partition column indicating that all values are valid for that date) and different query patterns (e.g., select dimension values from the start of a cumulative time window instead of the end, or from the endpoint of a time constraint range on the query instead of the time of the event). We will open separate issues for these extensions with example use cases.

Add support for Postgres

Describe the Feature
MetricFlow should be able to issue queries against Postgres. This requires:

  1. Adding a Postgres SqlClient (with its associated attributes struct),
  2. (If necessary) updating the SQL rendering to handle any dialect-specific weirdness
  3. Ensuring the test suite passes against a Postgres target warehouse

Validation error message on ratio metrics is confusing

Describe the bug
If someone puts the "measures" property in the type_params for a ratio metric, they get a generic "additional properties are not allowed" message, which makes it very difficult to detect the issue since "measures" is a valid property in some cases (such as with expr metrics).

See https://metricflow.slack.com/archives/C0386PW7R98/p1651761752427959 for a real-world example.

Steps To Reproduce
Steps to reproduce the behavior:

  1. Add a ratio metric to the config
  2. Include numerator, denominator, and measures in the type params
  3. Run the validation

Expected behavior
I'd expect the error message to reference the ratio metric type, at the very least. Ideally we'd be able to say which metric was incorrectly defined. This is harder to do when the issue is in the base YAML parser/PyDantic object initializer (as it is in this case - the error is coming from the JSONSchema validation layer) but it's worth exploring.

Environment (please complete the following information):

  • MetricFlow Version [0.93.0]

CLI mf infer: Inference solver could not determine a type for this column

Describe the bug
When running mf infer against a Snowflake database we get lots of FIXMEs like:

# FIXME: company_name
#   - Inference solver could not determine a type for this column

Expected behavior
I'd expect these columns to appear as dimensions.

Environment (please complete the following information):

  • MetricFlow CLI Version 0.111.0

Fix custom jsonschema validator for package upgrade consistency

Our jsonschema version is pinned to a release nearly a year out of date. Since the Draft7Validator we rely on should be pretty stable, updating that dependency isn't a burning issue.

However, we are in the process of adding a custom validator (see #159) that's based on their old code flow - they updated to use yield from everywhere, which is a little different from what we have (but perhaps not meaningfully so in this case, it's hard to tell).

We'd like to be up to date (and in a more maintainable state) going forward.

First, we need to decide whether to proceed with the custom validator or to replace it with a shim that updates the "patternProperties" attribute of all our schemas to add the pattern the custom validator ignores. The advantage of "patternProperties" is that it's outside the validation stack and fully supported as part of the json-schema standard. It's also more flexible - maybe there's a schema where we do NOT want to allow those properties. The disadvantage is we need to update every schema to have it, otherwise our validation might break on things like our internal __parsing_context__ key.

If we keep the validator, we need to update the package and update the validator internals to be more consistent with the rest of them. This includes switching from create to extend to initialize the validator class (if not already done) and updating the internals of the custom validator to match the yield from and other updates in the upstream validator.

If we drop the validator, we can add a function wrapper to finalize our schemas and append the pattern to patternProperties. We can either do this on schema initialization, here:

https://github.com/transform-data/metricflow/blob/main/metricflow/model/parsing/schemas.py#L66

and also everywhere else we initialize a schema. Or we can do this in bulk by updating every schema by reference prior to schema use here:

https://github.com/transform-data/metricflow/blob/main/metricflow/model/parsing/schemas.py#L297-L314

Note we'll need to do this for both schemas.py and schemas_internal.py.

Build Metrics for Experimentation

Describe the Feature

MetricFlow currently builds generalized datasets for analytical applications (i.e., metrics by dimensions). Experimentation applications require a specific set of steps to denormalize and aggregate metrics in a way that enables statistical testing. The process currently configurable through MetricFlow's APIs does not support the two-step aggregation required to create statistical tests.

We should add a plan builder that allows users to use MetricFlow to construct metrics for experimentation applications. Users of MetricFlow should not have to reinvent the wheel every time they want to run a product experiment. This is a logical extension of the metric framework in place today.

Story / Assumptions

  • Users of MetricFlow have an experiment assignment or feature flagging system in place that exports assignment/exposure logs to a supported DW.
  • Users would like to use their existing metric configurations to build metrics for experimentation applications
  • Users of MetricFlow can call an API to generate SQL and construct metrics for a list of metrics and a specific experiment. This API can either return data to the users' querying interface or write a dataset back to the data warehouse.

Open/New Questions

  • How do we capture experiment assignment configuration?
  • Should we allow for an entity_expr for tables that have a subject_id that could be multiple entities?
  • When do we load data to support CUPED or more complicated statistical tests?
  • What do the inputs look like from common experimentation assignment tools? Do they conform or can they be transformed into a conformed input?
  • What do people need to be able to do with the API?

Lots more to do to spec the plan builder and any new node types

Experiment Assignment Configuration

Expected input from experiment assignment:

CREATE TABLE events.experiment_assignments (
  subject_id BIGINT
, subject_entity STRING
, experiment_name STRING
, variant_name STRING
, first_assigned_at TIMESTAMP
);

which would be configured in MetricFlow as follows:

name: experiment_assignments
sql_table: events.experiment_assignments
experiment_assignments: True – NEW

identifiers:
  - name: subject
    expr: subject_id
    entity_expr: subject_entity – NEW
    type: foreign

measures:
  - name: assignments
    agg: sum
    expr: 1

dimensions:
  - name: experiment_name
    type: categorical
  - name: variant_name
    type: categorical
  - name: first_assigned_at
    type: time
    type_params:
      is_primary: True
      time_granularity: second

Generated

Consider a metric

name: transactions
sql_table: fact.transactions

identifiers:
  - name: user
    expr: user_id
    type: foreign

measures:
  - name: transactions
    agg: sum
    expr: 1
    create_metric: True

To support the construction of metrics for the experiment application we would need to generate two steps of aggregation. The query below could be broken into a non-aggregated step with a timestamp for more complicated statistical tests.

First, MetricFlow's bread and butter: construct a metric at the granularity of the experiment's entity (i.e., subject_id):

-- Step 1
CREATE TABLE metricflow.user_assignments_transactions
SELECT
  ua.subject_id
  , ua.experiment_name
  , ua.variant_name
  , SUM(1) AS transactions
FROM metricflow.user_assignments ua
LEFT JOIN fact.transactions e
ON ua.subject_id = e.user_id AND e.created_at BETWEEN ua.first_assigned_at AND {{ analysis_date }}
GROUP BY 1,2,3

Second, get the first and second moments at the granularity of the experiment and variant name:

-- Step 2
CREATE TABLE metricflow.metric_variant_aggregates
SELECT
  experiment_name
  , variant_name
  , 'transactions' as metric_name
  , COUNT(subject_id) AS assignments
  , SUM(transactions) / COUNT(subject_id) AS metric_mean
  , SQRT(VAR(transactions)) AS metric_std
FROM metricflow.user_assignments_transactions
GROUP BY 1,2

Finally, the appropriate statistical tests for each variant name compared to the control.

-- Step 3
CREATE TABLE metricflow.metric_comparison
SELECT
  c.experiment_name
  , c.variant_name AS variant_name_control
  , t.variant_name AS variant_name_treatment
  , c.metric_name
  , t.metric_mean - c.metric_mean AS metric_diff
  , t.metric_mean / c.metric_mean - 1 AS metric_pct
  , UDFS.TTEST_IND_FROM_STATS(
      t.metric_mean, t.metric_std, t.assignments,
      c.metric_mean, c.metric_std, c.assignments
    ) AS pvalue
FROM (
  SELECT *
  FROM metricflow.metric_variant_aggregates
  WHERE variant_name = {{ control_name }}
) c
JOIN (
  SELECT *
  FROM metricflow.metric_variant_aggregates
  WHERE variant_name != {{ control_name }}
) t
ON c.experiment_name = t.experiment_name
  AND c.metric_name = t.metric_name

Outputs

API

The API to generate experimental datasets could look something like this:

mf experiment_query
  --metrics x,y,z
  --dimensions a,b,c
  --experiment_name button_color
  --control_name control
  --analysis_date

Output Schemas

Load from Step 2:

CREATE TABLE metricflow.experiment_metric_values (
  experiment_name STRING
, variant_name STRING
, dimension_name STRING
, dimension_value STRING
, assignments BIGINT
, metric_mean DOUBLE
, metric_std DOUBLE
, ts__day DATE -- analysis_date
) PARTITIONED BY (experiment_name STRING, ts__day DATE);

Load from Step 3:

CREATE TABLE metricflow.experiment_event_source_summary (
  experiment_name STRING
, variant_name_control STRING
, variant_name_treatment STRING
, metric_diff DOUBLE
, metric_pct DOUBLE
, pvalue DOUBLE
, ts__day DATE -- analysis_date
) PARTITIONED BY (experiment_name STRING, ts__day DATE);

Dimensional Analysis

Dimensional analysis could be achieved by injecting a dimension name and value into steps 1 and 2 and performing all aggregations at that dimension's name and value. The output table could look something like this:

CREATE TABLE metricflow.experiment_metric_values (
  experiment_name STRING
, variant_name STRING
, dimension_name STRING
, dimension_value STRING
, assignments BIGINT
, metric_mean DOUBLE
, metric_std DOUBLE
, ts__day DATE -- analysis_date
) PARTITIONED BY (experiment_name STRING, ts__day DATE);

Would you like to contribute?
Totally. I’m just here to add value.

Anything Else?
I want to thank @danfrankj and @askeys for providing some initial thoughts that helped me sort through how we could do this.

Enable debug or verbose mode in CLI

Describe the Feature
The CLI only allows one logging and output level, which is fine for most normal operation, but if an end user hits an issue, a more verbose mode for communicating error states and other logging info would be nice.

Ideally verbose mode would do the following:

  1. Include DEBUG or above log entries (default is INFO), or possibly allow for a log level override
  2. Print complete exception traces in all cases where an exception is encountered
  3. (Optional) print verbose output to stderr instead of the log file (this is pretty standard behavior but we may need to separate "verbose" from "log level")

Allow CLI to specify optimization level

Describe the Feature
MetricFlow runs certain optimization passes against the Dataflow and SQL Query plans prior to rendering SQL to do things like remove redundant joins and collapse subqueries. Currently all of these are always on, but it would be nice to have a flag for the CLI that can disable some or all of the optimizers. This would aid in development and testing for new optimizations, end-user debugging for queries returning unexpected results, and query performance testing against specific engines.

Metrics accepting multiple measures can fail if no group by elements requested

Describe the bug
When we have a metric based on multiple measures originating from different data sources, such as a ratio metric with a numerator and denominator computed from different tables or with different constraints, queries against that metric will fail if no group by element is requested.

As a practical matter this should be quite rare for the following reasons:

  1. The CLI does not allow a query with an empty dimension list
  2. The dominant majority of queries for metrics will group them by metric_time (or some other relevant time dimension), since Metricflow's initial metric offering is fundamentally built around measures computed across time

Steps To Reproduce
Steps to reproduce the behavior:

  1. Add a test case to the configured test case config or the dataflow to sql plan conversion tests for bookings_per_listing with no dimensions selected
  2. Observe the rendered SQL generates an empty ON statement in the join
  3. Presumably the query will also fail; we haven't tried running it

Expected behavior
We'd expect the join to be written correctly for this case, or else for such queries to be disallowed at the API level and to fail with a helpful error message.

Version: Metricflow 0.100.2

Making installation of libraries optional

Describe the Feature
If I am not using Snowflake, why should it be installed?

Would you like to contribute?
Are you interested in building this feature?
Sure, I can raise the PR over the weekend.

Anything Else?
Add any other context or screenshots about the feature request here.

Provide an API for generating a model object from a set of in-memory model "files"

Describe the Feature
The MetricFlow Python API currently has two initializer mechanisms - a factory method that reads a config off a file system and initializes the SqlClient and UserDefinedModel objects, and a bare initializer that essentially hangs the caller out to dry since parsing files into a UserDefinedModel is not formally supported.

At the moment, the YAML model parser depends almost entirely on a local filesystem path containing all relevant YAML files.

This Issue is open for providing a better, fully supported API that will allow for the following things:

  1. Developer-friendly mechanisms for initializing a MetricFlow client based on in-memory representations of the YAML files we currently expect.
  2. Eventual extension to provide standard access from other cloud services (e.g., for reading from Google Cloud Storage or AWS s3 buckets, remote repos, URL paths, etc.)

The specific public-facing API has not yet been designed. Whoever claims this Issue is welcome to provide some suggestions and we will happily give guidance around what we're most interested in supporting.

Anything Else?
This depends on #132
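A minimal sketch of what mechanism (1) might look like; the function name and shape are hypothetical suggestions, not a committed design:

from typing import Mapping

def parse_model_from_strings(yaml_files: Mapping[str, str]) -> "UserDefinedModel":
    # Hypothetical API: build a model from {"file name": "yaml contents"}
    # pairs, with no local filesystem involved.
    ...

# model = parse_model_from_strings({
#     "transactions.yaml": "data_source:\n  name: transactions\n  ...",
# })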

Add real world examples to README and other documentation

Describe the Feature
In our discussion on #8 the topic of real world examples came up - having some clear, simple examples of the utility of MetricFlow to point to in the README would be helpful for newcomers, provided we can succinctly describe them.

Validate unique names test fails in CI

Describe the bug
See #16 for the failure. This seems to happen a lot in CI and not at all locally.

Steps To Reproduce

  1. Make a trivial change
  2. Put up a PR
  3. Wait for failure (this is apparently not deterministic)

Expected behavior
This test should pass

Environment (please complete the following information):

  • MetricFlow Version [e.g. 1.0.0]
