dbt-labs / dbt-semantic-interfaces
The shared semantic layer definitions that dbt-core and MetricFlow use.
License: Apache License 2.0
As requested by @marcodamore - it would be useful to have a configuration object in the semantic manifest to hold configuration options like:
{
  "project_config": {
    "dsi_version_number": 123,
    "timespine": {
      "location": "transform.metricflow_time_spine",
      "column_name": "date_day",
      "grain": "daily"
    }
  }
}
call_parameter_sets is currently a property on the Pydantic implementation of the WhereFilter. Since this property might be more broadly useful, it should be moved into the protocol definition.
Right now, having Dimension be agnostic of its type forces us to do validation throughout the codebase. Breaking it up into two different classes makes it a lot easier for us to break up the code!
We recently discovered that the named parameter etype of traceback.format_exception_only was dropped in Python 3.10+. Currently we use the etype parameter in two places:
If either of these code paths is hit when running Python 3.10+, then an exception like the following gets raised:
File "/Users/quigleymalcolm/Developer/dbt-labs/dbt-semantic-interfaces/dbt_semantic_interfaces/validations/validator_helpers.py", line 365, in wrapper
generate_exception_issue(
File "/Users/quigleymalcolm/Developer/dbt-labs/dbt-semantic-interfaces/dbt_semantic_interfaces/validations/validator_helpers.py", line 343, in generate_exception_issue
f"{''.join(traceback.format_exception_only(etype=type(e), value=e))}",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: format_exception_only() got an unexpected keyword argument 'etype'
This is extra problematic because the only way to work around the issue is to fix whatever is wrong with the config that is trying to raise a validation issue, and this exception ends up swallowing the originally discovered problem.
We could continue passing the type of the exception, as it's still accepted as a positional argument; it just has to be unnamed and first in the function call. However, as of Python 3.5 the exception type is inferred from the passed-in exception value. Since we only support Python 3.8+, it seems like the best path forward is to just drop the etype keyword in the two linked calls.
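A minimal sketch of the cross-version-safe call; generate_exception_issue is the real call site, while format_error here is just a hypothetical stand-in:

```python
import traceback


def format_error(e: Exception) -> str:
    # The keyword form (etype=...) raises TypeError on Python 3.10+, where
    # the parameter was removed. Passing the type positionally works on
    # 3.8 through 3.10+, and since 3.5 the type is inferred from the value
    # anyway, so the first argument is effectively ignored.
    return "".join(traceback.format_exception_only(type(e), e))
```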
We want a standard definition of what a valid SemanticModel implementation should have. This allows for a shared definition, understood in both MetricFlow and dbt-core, of what anything implementing the SemanticModel protocol will make available, without MetricFlow and dbt-core needing to import each other. This should use the Protocol type, which exists in Python 3.8+.
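For illustration, a minimal sketch of how typing.Protocol enables this structural sharing; the attribute surface shown here is truncated and hypothetical, not the real protocol:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SemanticModel(Protocol):
    # Truncated attribute surface, for illustration only.
    name: str
    description: str


# Any class with matching attributes satisfies the protocol structurally,
# without dbt-core and MetricFlow importing each other's classes.
class CoreSemanticModel:
    def __init__(self) -> None:
        self.name = "orders"
        self.description = "Order facts"


assert isinstance(CoreSemanticModel(), SemanticModel)
```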
This will be an open source repository. As such, we should have a contributing guide. It can likely be much the same as dbt-core's, but pared down, as this repository should be smaller and have fewer dependencies.
As above so below. Let's do some renaming!
In the new world of dbt-core x MetricFlow, Identifiers are becoming Entities. Additionally, some of the properties of the object are changing. The resulting object should have the following properties:
Property Name | Type | Description |
---|---|---|
name | str | Name of the entity |
type | enum | Type of the entity |
description | str | Description of the entity |
role | str | Role of the entity |
entities | List[str] | List of composite sub-entities |
expr | str | Expression of the entity |
The above properties were pulled from dbt-labs/dbt-core#7456
Metrics can have many WhereFilters associated with them. Specifically:
Metric.filter
Metric.type_params.measure.filter
Metric.type_params.denominator.filter
Metric.type_params.numerator.filter
Metric.type_params.input_metrics[x].filter
The where_sql_template of a WhereFilter is highly structured. Technically, with #110, call_parameter_sets guarantees the structure of a WhereFilter's where_sql_template, but only if call_parameter_sets is actually called. The best way to pseudo-guarantee this happens is to add a SemanticManifestValidationRule to the default rules of the SemanticManifestValidator. It's only a pseudo-guarantee because, from DSI's perspective, it's not guaranteed that a SemanticManifest has been run through the SemanticManifestValidator; however, that is best practice and what people should do.
THERE EXISTS a SemanticManifestValidationRule THAT checks call_parameter_sets of all filters of all metrics on a SemanticManifest AND the new rule is added to the default rules of the SemanticManifestValidator
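A sketch of what such a rule could look like. The class name, the traversal helper, and the plain-string issues are all assumptions for illustration; a real rule would return proper validation issue objects and walk every filter location listed above:

```python
from typing import Any, List


def _filters_of(metric: Any) -> List[Any]:
    # Hypothetical traversal: the real rule would also walk
    # type_params.measure, numerator, denominator, and input_metrics.
    f = getattr(metric, "filter", None)
    return [f] if f is not None else []


class WhereFiltersAreParseableRule:
    """Force call_parameter_sets to be evaluated for every metric filter,
    so malformed where_sql_templates surface at validation time."""

    @staticmethod
    def validate_manifest(manifest: Any) -> List[str]:
        issues: List[str] = []
        for metric in manifest.metrics:
            for where_filter in _filters_of(metric):
                try:
                    where_filter.call_parameter_sets  # property access triggers parsing
                except Exception as e:
                    issues.append(f"metric {metric.name}: {e}")
        return issues
```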
This should be a protocol that is simply the composition of the SemanticModel and Metric protocols. Something like:
from typing import List, Protocol
from dbt_semantic_interfaces.protocols import Metric, SemanticModel

class SemanticManifest(Protocol):
    semantic_models: List[SemanticModel]
    metrics: List[Metric]
Most of our open source repositories at dbt Labs have GitHub workflow release actions (core, dbt-snowflake, dbt-redshift, dbt-bigquery). These workflows handle compiling the changie docs, bumping the version, building the distributions, and pushing the distributions to PyPI and GitHub. We should have this in place for the 0.1.0 release of DSI.
We don't have PR templates. We should!
Continuing this rodeo wild-west world
Everyone who contributes here
Yarp
As soon as we've got a working version, we should publish dbt-semantic-interfaces on PyPI so that both dbt-core and MetricFlow can install it without needing to reference the GitHub link.
Having them reference the hard-coded git link works for now, but that seems like a temporary workaround at best.
The developers of MetricFlow & dbt-core!
Yarp
Currently the SemanticManifestValidator (currently named ModelValidator) expects a UserConfiguredModel, which is a concrete object. Initially we thought we should move to it expecting the SemanticManifest protocol. However, in dbt-core we'll be writing nodes which extend the protocol definition. If we want to be able to write validation rules that can operate on the extensions and guarantee type safety, then we need to take it a step further. Thus, the SemanticManifestValidator should instead operate on a generic bound by the SemanticManifest protocol.
Something like...
from typing import Generic, TypeVar

from dbt_semantic_interfaces.protocols import SemanticManifest

T = TypeVar("T", bound=SemanticManifest)

class SemanticManifestValidator(Generic[T]):
    ...
    def validate(self, semantic_manifest: T) -> ValidationResults:
        ...
This is an alternative to, or potential addition to, #56. We want some way of defining metrics/measures without showing them.
We want a standard definition of what a valid Entity implementation should have. This allows for a shared definition, understood in both MetricFlow and dbt-core, of what anything implementing the Entity protocol will make available, without MetricFlow and dbt-core needing to import each other. This should use the Protocol type, which exists in Python 3.8+.
In the new world of dbt-core x MetricFlow, the properties of metrics are changing slightly. Metric objects should have the following properties:
Property Name | Type | Description |
---|---|---|
name | str | Name of the metric |
type | enum | Metric type |
description | str | Description of the metric |
type_params | TypeParams | Type parameters for the metric. These parameters change based on the type |
filter | str | WHERE clause constraint applied to the metric |
The above properties were pulled from dbt-labs/dbt-core#7456
The requirement for this functionality was discussed during a standup and it resolves some issues we would have around knowing what version of DSI a semantic manifest is using.
Not doing this
The metricflow developers
Currently the SemanticManifestTransformer (currently named ModelTransformer) expects and returns a UserConfiguredModel, which is a concrete object. Initially we thought we should move to it expecting and returning the SemanticManifest protocol. However, in dbt-core we'll be writing nodes which extend the protocol definition. Additionally, we want to be able to hand in raw-ish parsings and transform them into the final objects, meaning the input and return types should be different. If we want to be able to write transformation rules that can operate on the raw parsed objects, return the extended classes, and guarantee type safety, then we need to take it a step further. Thus, the SemanticManifestTransformer should instead expect a generic and return a generic bound by the SemanticManifest protocol.
Something like...
from typing import Sequence, Tuple, TypeVar

from dbt_semantic_interfaces.protocols import SemanticManifest
# (imports of SemanticManifestTransformRule and DEFAULT_RULES omitted)

T = TypeVar("T", bound=SemanticManifest)
U = TypeVar("U")

class SemanticManifestTransformer:
    ...
    @staticmethod
    def transform(
        raw_semantic_manifest: U,
        ordered_rule_sequences: Tuple[Sequence[SemanticManifestTransformRule], ...] = DEFAULT_RULES,
    ) -> T:
        ...
This entails getting rid of two metric types, because their functionality can be accomplished with derived. We want to commit to simple metrics being the building blocks and derived metrics serving as the mechanism to create more complicated metrics.
Measure proxy isn't going to make as much sense in the new world of simple metrics & derived metrics. So let's rename it to simple!
The version of mypy should be upgraded from 1.1.1 to the latest version, 1.3.0, to be consistent with what's used in metricflow.
mypy fails in metricflow with: error: Skipping analyzing "dbt_semantic_interfaces.test_utils": module is installed, but missing library stubs or py.typed marker [import]
Expected behavior: mypy does type checking without skipping the imports.
Steps to reproduce: take a dependency on dbt_semantic_interfaces with follow-imports enabled, and run mypy.
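Assuming the standard PEP 561 approach applies here (this is an assumption, not confirmed as the repo's chosen fix), shipping an empty py.typed marker inside the package, and including it in the package data, tells mypy the library carries inline types:

```shell
# PEP 561: an empty marker file named py.typed inside the package directory.
# It must also be shipped in the wheel/sdist as package data.
mkdir -p dbt_semantic_interfaces
touch dbt_semantic_interfaces/py.typed
```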
Say you want to define a derived metric that is a ratio of two measures. In the old world you'd use a ratio metric type, but that will no longer exist. So you have two options:
metric:
  name: derived_metric
  type: derived
  type_params:
    expr: "metric_a / metric_b"
    metrics:
      - name: metric_a
        type: simple
        type_params:
          - measure: column_a
      - name: metric_b
        type: simple
        type_params:
          - measure: column_b
In the above example, metric_a and metric_b would NOT show up in list-metrics, as they are only defined inline to derived_metric, similar to how dbt uses ephemeral models today.
Rename the 'schema_name' field on node_relation to simply 'schema'
Right now everything in dbt land is represented as a node in the DAG. These relationships are established with things like ref('some_model').
Once we introduce the semantic constructs (metric & semantic model), we'll want those relationships to hold true, i.e. I should be able to see that model A feeds into semantic model B and metric C.
Hiding?
Anyone who uses the semantic layer
Yarp
With this being a new repository, we should establish some good hygiene; as part of that, we should set up pre-commit hooks similar to the pre-commit hooks in dbt-core.
@QMalcolm can you offer more context here on what this issue entails?
In the before times, pre-March 2023, we referred to SemanticManifests as Models. We've taken an initial pass at getting a fair number of these moved over, but many remain (Examples: A, B, C, etc.).
Getting these corrected is slowly becoming harder due to Hyrum's Law. The incorrect usage of the word model shows up in class names, function names, function attributes, variables within functions, and comments. Some of these are easier to fix than others. Uses of model in comments and in variables instantiated within a function have no outside exposure and can be changed without worry. Uses of model in function names beginning with a _ and in parameters of a function starting with a _ are safe to change because they are "private". For public function names, parameters of public functions, class names, and public attributes of classes, we'll first have to investigate MetricFlow and dbt-core to see whether these names are depended on. At this time I'm not too worried about third-party exposure, given that we're in RC of our first version release.
Oh, and this is extra hard because pydantic's HashableBaseModel and BaseModel, SemanticModel, and the like have valid uses of the word model, and thus a simple find-and-replace cannot be used.
Do measures always create metrics? In a world where all complicated metrics are defined as derived metrics on top of simple metrics, this might be true.
This issue is complicated by the fact that create_metric = true has implications in parsing for core.
Additionally, we have the create_metric_display_name property. If we want to retain it, then we should make create_metric a dict to support it as a class. That way someone can't add the display value without create_metric.
Currently the WhereFilterParser uses jinja2.Template when parsing the FilterCallParameterSets from a str. This is not considered best practice, because jinja2.Template does not use a SandboxedEnvironment by default. Generally, DSI doesn't make assumptions about the implementing architecture of projects using DSI, and depending on your usage, more security checks should be done on the implementation side. However, it's a pretty small change to begin using a SandboxedEnvironment for jinja rendering, to provide people utilizing DSI some additional peace of mind.
THE WhereFilterParser USES a SandboxedEnvironment WHEN performing jinja parsing/rendering
In the new world of dbt-core x MetricFlow, the properties of measures are changing slightly. Measure objects should have the following properties:
Property Name | Type | Description |
---|---|---|
name | str | Name of the measure |
agg | enum | Aggregation type |
description | str | Description of the measure |
expr | str | Expression of the measure |
create_metric | Bool | Boolean flag that creates a metric from the measure if True |
agg_params | Optional[AggregationParameters] | The aggregation parameters |
non_additive_dimension | ? | Non-additive dimension parameters |
agg_time_dimension | str | The time dimension to aggregate the measure by |
The above properties were pulled from dbt-labs/dbt-core#7456
Describe the Feature
I want to be able to retrieve a display_name from the domain objects. Currently, the Metric and Dimension domain objects do not have display_name as an attribute. It is available in the schema, but isn't defined at the domain level.
Would you like to contribute?
Certainly, but it is a 2-LOC change.
Anything Else?
@nhandel I know there are some changes coming; maybe this is something that could be added?
Describe the bug
When we run our unit tests everything grinds to a halt on the semantic validator tests
This is potentially caused by our test construction - most of these tests do something like:
with pytest.raises(...):
    # run all validations
That means we run every validation on every test input model in every test case, even though most, if not all, of these test cases are targeted at highly specific validation rules.
A good start here would be to migrate these test cases to only run the specific rule we care to test.
Another possible optimization - both for runtime and readability - is to use smaller, more targeted models defined local to the test case, so converting these cases away from model fixtures onto the local model shims would be great as well. See https://github.com/transform-data/metricflow/blob/main/metricflow/test/model/validations/test_validity_param_definitions.py#L51-L71 for an example of a locally defined model with a specific failure state written into it.
Steps To Reproduce
Steps to reproduce the behavior: run make test on the validations path.
Expected behavior
These should be much faster.
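A sketch of the shape this migration might take; the rule, the model shim, and the assertion are all hypothetical stand-ins, not the real DSI test suite:

```python
# Hypothetical single rule under test, standing in for a real
# SemanticManifestValidationRule.
class UniqueMeasureNamesRule:
    @staticmethod
    def validate(model: dict) -> list:
        names = [m["name"] for m in model["measures"]]
        return [] if len(names) == len(set(names)) else ["duplicate measure name"]


def test_duplicate_measure_names_flagged() -> None:
    # Small, local model shim with the failure state written into it,
    # instead of a shared fixture run through every validation rule.
    model = {"measures": [{"name": "revenue"}, {"name": "revenue"}]}
    assert UniqueMeasureNamesRule.validate(model) == ["duplicate measure name"]
```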
The create_metric key on a measure shouldn't be part of the protocol. It's an ergonomics field that is only ever used at parse time; it's never used by MetricFlow. It can be left on the PydanticMeasure implementation, but should be removed from the protocol.
We want to have a good change log for this repository, starting as soon as possible. dbt-core uses Changie, and we want to follow that pattern. Here is dbt-core's changie config. Additionally, PRs should require a changie entry to be present.
This repository / package is intended to be where the semantic interface protocols live, such that MetricFlow and dbt-core (and other projects) have a shared, importable, understood set of protocols. That's a tall order. It all starts, though, with the current definitions as they are in MetricFlow, specifically everything that lives in the model directory. The only carve-outs are the files & directories for data warehouse validations and for dbt-metrics -> MetricFlow conversion. Data warehouse validations will remain part of MetricFlow, and conversions from dbt-metrics to MetricFlow will no longer be needed.
In the new world of dbt-core x MetricFlow, the properties of dimensions are changing slightly. Dimension objects should have the following properties:
Property Name | Type | Description |
---|---|---|
name | str | Name of the dimension |
type | enum | Type of the dimension |
description | str | Description of the dimension |
expr | str | Expression of the dimension |
type_params | TypeParameters | Parameters needed for the given type |
The above properties were pulled from dbt-labs/dbt-core#7456
Currently, pyproject.toml specifies pinned versions for many dependencies. The pinned dependencies make it more difficult to use this project alongside other projects, as there will be dependency conflicts. Consequently, the pins should be relaxed where possible.
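For example (the package name and version range here are illustrative, not the actual constraints the project should adopt), a hard pin can usually become a compatible range:

```toml
[project]
dependencies = [
    # before: "jsonschema==3.2.0"  (hard pin; conflicts easily)
    # after: a range that still excludes known-breaking majors
    "jsonschema>=3.0,<4",
]
```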
As noted in dbt-labs/dbt-core#7456, for the integration of dbt-core and MetricFlow we are creating a third repo to put the shared semantic interface definitions in. This is that repo. What that entails is protocol definitions for the semantic interfaces, default object implementations, default parsing / transformations / validations, and associated tests.
In the new world of dbt-core x MetricFlow, Models have become SemanticModels and the properties have changed slightly. Semantic Model objects should have the following properties:
Property Name | Type | Description |
---|---|---|
name | str | Name of the semantic model |
description | str | Description of the semantic model |
entities | List[Entity] | Entities for the semantic model |
dimensions | List[Dimension] | Dimensions for the semantic model |
measures | List[Measure] | Measures for the semantic model |
data_path | DataPath | Path object to where the data should be |
Relatedly, the DataPath object should have the following for this first pass:
Property Name | Type |
---|---|
name | str |
schema | str |
database | Optional[str] |
The above properties were pulled from dbt-labs/dbt-core#7456
We want a standard definition of what a valid Metric implementation should have. This allows for a shared definition, understood in both MetricFlow and dbt-core, of what anything implementing the Metric protocol will make available, without MetricFlow and dbt-core needing to import each other. This should use the Protocol type, which exists in Python 3.8+.
Two facts:
1. SemanticModel objects can have the same primary Entity.
2. Dimension names are not required to be unique across SemanticModels in the SemanticManifest.
Together, statements 1 and 2 mean that it is possible for SemanticModels to have the same primary Entity and to have Dimensions of the same name. This is problematic because if a dimension <primary_entity>__<dimension_name> is specified in a WhereFilter, it can be ambiguous which dimension is actually being referenced.
Disallow SemanticModels with the same primary Entity from having duplicate Dimensions.
A validation rule exists which disallows SemanticModels with the same primary Entity from having duplicate Dimensions.
This is a follow-up to #49. In reality, this is more of a semantic model property than a dimension property.
Currently the WhereFilter protocol only has one property, where_sql_template, which is defined as a str. However, in reality this string is a highly structured object, and we know we are going to make changes to this structure over time. This is problematic because the protocol only captures that it is a string and nothing more; thus any changes to the structure wouldn't actually be associated with any DSI version, which could lead to some really funky situations. To my knowledge we can't set the protocol property to match a specific string pattern (though if we could, that would be cool). The alternative is to take the structure out of the string, i.e. have a more structured protocol definition for WhereFilter:
from typing import List, Optional, Protocol
# (TimeGranularity is DSI's time-grain enum; import omitted)

class DimensionInput(Protocol):
    dimension: str
    primary_entity: str
    entity_path: Optional[List[str]]

class TimeDimensionInput(Protocol):
    dimension: str
    primary_entity: str
    granularity: TimeGranularity
    entity_path: Optional[List[str]]

class EntityInput(Protocol):
    entity: str
    entity_path: Optional[List[str]]

class WhereFilter(Protocol):
    where_sql_template: str
    input_dimensions: List[DimensionInput]
    input_time_dimensions: List[TimeDimensionInput]
    input_entities: List[EntityInput]
An example WhereFilter object:

WhereFilter(
    where_sql_template="{{ country }} = 'US' AND {{ ds }} >= '2023-07-01' AND {{ user }} == 'SOME_USER_ID'",
    input_dimensions=[
        DimensionInput(
            dimension='country',
            entity_path=['user']
        )
    ],
    input_time_dimensions=[
        TimeDimensionInput(
            dimension='ds',
            entity_path=['user', 'transaction'],
            granularity=TimeGranularity.MONTH
        )
    ],
    input_entities=[EntityInput(entity='user')]
)
The same WhereFilter serialized:

{
    "where_sql_template": "{{ country }} = 'US' AND {{ ds }} >= '2023-07-01' AND {{ user }} == 'SOME_USER_ID'",
    "input_dimensions": [
        {
            "dimension": "country",
            "entity_path": ["user"]
        }
    ],
    "input_time_dimensions": [
        {
            "dimension": "ds",
            "entity_path": ["user", "transaction"],
            "granularity": "month"
        }
    ],
    "input_entities": [{"entity": "user"}]
}
We don't want the user to have to specify their filters in this verbose structured manner. That would suck. However, the user-facing spec and the protocol can be divorced, and in core they actually are. My view, more and more, has become that the user spec compiles down to the protocol definitions. The user-facing spec in YAML would continue to be:
metric:
  - name: 'my metric name'
    ...
    filter: "{{ dimension(name='country', entity_path=['user']) }} = 'US' AND {{ time_dimension('ds', 'month', entity_path=['user', 'transaction']) }} >= '2023-07-01' AND {{ entity('user') }} == 'SOME_USER_ID'"
This example would then compile down to the structured protocol definition example given above. This lifts the string specification into implementations of the protocols, and the agreed protocol definition would be structured (and likely change less frequently).
We'll lift call_parameter_sets into a generic jinja where-filter compiler.
Core has unparsed nodes and parsed nodes; it'll be fairly straightforward to handle this in core, and core will have its own ticketed work for doing so. It'll likely just reuse the generic jinja where-filter compiler produced by DSI.
We want a standard definition of what a valid Measure implementation should have. This allows for a shared definition, understood in both MetricFlow and dbt-core, of what anything implementing the Measure protocol will make available, without MetricFlow and dbt-core needing to import each other. This should use the Protocol type, which exists in Python 3.8+.
We want a standard definition of what a valid Dimension implementation should have. This allows for a shared definition, understood in both MetricFlow and dbt-core, of what anything implementing the Dimension protocol will make available, without MetricFlow and dbt-core needing to import each other. This should use the Protocol type, which exists in Python 3.8+.