Giter VIP home page Giter VIP logo

mara-schema's People

Contributors

ice1e0 avatar jankatins avatar kirilpanev avatar leo-schick avatar martin-loetzsch avatar notsaman avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

mara-schema's Issues

Redesign `data_set_sql_query` use SQLAlchemy Query API

Currently the function data_set_sql_query does not support other db engines and is mainly raw SQL coded. This should be IMO be changed.

Planned changes

  • add a new function with a db parameter using functools.singledispatch
  • use the SQLAlchemy Query API to generate the SQL code
  • mark the current data_set_sql_query function as deprecated

How to model measure as an attribute (and bug for attributes of Type.ID)

Metabase allows measure columns (so something with an order, like revenue per hour) to be used in Groupings by binning it. Unfortunately, mara-schema (together with mara-metabase) currently does not allow this, as measure columns are hidden in each sync. To work around this, I can precompute the bins and add this as an Type.ENUM Attribute, but this removes the flexibility metabase gives me to chose my own bins during the analysis as it's converted to a string.

One could currently use Type.ID for this, as, despite the docs saying so, it is not converted to TEXT:

Type.ID: A numeric ID that is converted to text in a flattened table so that it can be filtered

but in the code: cast_to_text=attribute.type == Type.ENUM

My first guess is that the the docs correct and the code incorrect here and this inconsistency should be fixed.

If so, I would like to request a Type.BINNED or Type.QUANTITY or so which is not converted to TEXT during SQL generation for metabase.

Any other idea?

(Example: in our case it's for "measures" like "revenue per hour" which I precompute and put into a column: other examples could be "xth order" or simply "number if items in order")

Handle constants

We currentlx want to implement some workaround which manifests as a constant factor in a metric: basically "Overhead = [Composed Metric] * factor". Unfortunately this isn't really possible as the parser cannot handle the constant factor in a formula, so I have to either implement the overhead formula in SQL (meaning I have to reimplement the compose metric :-() or add a column (and add it as simple measure) which represents the factor (bad as we will in the next step replace the constant with some dynamic thingy...).

For now I will do the first, but having the ability to specify a constant factor in a composed metric would be awesome...

Support different naming conventions

Currently the package has several assumptions which are based on a snake case (e.g. my_foreign_table_reference_fk) naming convention. You might work in a data warehouse where you use a different naming convention e.g. Pascal case (e.g. MyForeignTableReferenceFK).

With PR #23 I added an option to patch the foreign key naming behavior. But it would be nice to have this as an option in the package itself to define which naming convention you use.

This is ticket is related to #11

Bad assumption with prefixed attributes: attributes only have the first char as uppercase

We decided on using Uppercase names for entities, so we have attributes like Business Region. If such an Attribute is added with a Prefix, we get Prefix business Region because mara_schema.attributes.Attribute.prefixed_name(...) will lowercase the first char in a name if and all path elements if it is prefixed: https://github.com/mara/mara-schema/blob/master/mara_schema/attribute.py#L54-L55

It would be nice if this could be configured/patchable. I'm currently patching the whole prefixed_name() function but because this is a method and not a function, the normal @patch() doesn't work :-(

# Overwrite the lowercasing of entity names after a prefix
# Can't use patch, but simply assigning works :-)
def _prefixed_name(self, path: t.Tuple['EntityLink'] = None) -> str:
    """Generate a meaningful business name by concatenating the prefix of entity link instances and original
    name of attribute. """
    from mara_schema.attribute import normalize_name

    if path:
        prefix = ' '.join([entity_link.prefix for entity_link in path])
        return normalize_name(prefix + ' ' + self.name)
    else:
        return normalize_name(self.name)

mara_schema.attribute.Attribute.prefixed_name = _prefixed_name

(BTW: seeing this, if would be nice if there could be a best practise guide for namings, so that this doesn't show up later. Especially because metabase has a bug (metabase/metabase#7923 (comment)) which does not allow for renaming only the casing of an already ingested column, which totally screws up the schema sync in mara-metabase in such cases :-()

Term discussion: metric vs measure

in the schema definition the terms 'simple metric' and 'composed metric' are used (e.g. see code file metric.py).

Shouldn't this be renamed:

  • 'simple metric' to measure and
  • 'composed metric' to metric?

I think this article gives a good explanation how to distinguish between both terms:

Definition of measure

In a data context, measures are the numbers or values that can be summed and/or averaged, such as sales, leads, distances, durations, temperatures, and weight. [...]

Definition of metric:

A metric is a quantifiable measure that is used to track and assess the status of a specific process. If you’re confused because we haven’t yet covered “measure,” get this: according to the Oxford dictionary, the word measure is derived from the Latin word "metiri." In others words, their meanings are almost identical—which is why you may find them used interchangeably. That said, here is the difference: a measure is a fundamental or unit-specific term—a metric can literally be derived from one or more measures. [...]

 

Sample

The measures and metrics analyzed in this repo about the New York Citi bike trips can be distinguished as following:

Measures

  • incoming trips — Number of trips whose endpoint is this station.
  • outgoing trips — Number of trips whose starting point is this station.
  • bikes outbound — Number of bikes which start their day here and get ridden out.
  • outbound trips — Number of trips taken by bikes which start their day here.
  • bikes inbound — Number of bikes which end their day here.
  • inbound trips — Number of trips taken by bikes which end their day here.

Metrics

  • all trips — Number of trips that end or being here (incoming trips + outgoing trips)
  • delta bikes — Difference in number of bikes between start and end of the day (bikes inbound - bikes outbound).
  • delta trips — Difference in number of trips taken by bikes which start their day here (incoming trips +
    outgoing trips).

pip package mara-schema does not have folder mara_schema/ui

The folder mara_schema/ui is only installed when usind dev. mode (pip install -e), not when installing without -e.

Tests

mkdir test
cd test
python3 -m venv .venv
source .venv/bin/activate
pip install wheel
pip install mara-schema
ls -l .venv/lib/python3.7/site-packages/mara_schema/
# returns:
# total 37
# -rwxrwxrwx 1 ice1e0 ice1e0  3105 Oct 22 18:35 attribute.py
# -rwxrwxrwx 1 ice1e0 ice1e0   230 Oct 22 18:35 config.py
# -rwxrwxrwx 1 ice1e0 ice1e0 10616 Oct 22 18:35 data_set.py
# -rwxrwxrwx 1 ice1e0 ice1e0  6800 Oct 22 18:35 entity.py
# drwxrwxrwx 1 ice1e0 ice1e0   512 Oct 22 18:35 example
# -rwxrwxrwx 1 ice1e0 ice1e0   505 Oct 22 18:35 __init__.py
# -rwxrwxrwx 1 ice1e0 ice1e0  3908 Oct 22 18:35 metric.py
# drwxrwxrwx 1 ice1e0 ice1e0   512 Oct 22 18:35 __pycache__
# -rwxrwxrwx 1 ice1e0 ice1e0  8119 Oct 22 18:35 sql_generation.py

It does not work with

[..]
pip install git+https://github.com/mara/mara-schema.git#egg=mara-schema
ls -l .venv/lib/python3.7/site-packages/mara_schema/
# total 41
# -rwxrwxrwx 1 ice1e0 ice1e0  3192 Oct 22 18:46 attribute.py
# -rwxrwxrwx 1 ice1e0 ice1e0   238 Oct 22 18:46 config.py
# -rwxrwxrwx 1 ice1e0 ice1e0 10840 Oct 22 18:46 data_set.py
# -rwxrwxrwx 1 ice1e0 ice1e0  6937 Oct 22 18:46 entity.py
# drwxrwxrwx 1 ice1e0 ice1e0   512 Oct 22 18:46 example
# -rwxrwxrwx 1 ice1e0 ice1e0   536 Oct 22 18:46 __init__.py
# -rwxrwxrwx 1 ice1e0 ice1e0  4004 Oct 22 18:46 metric.py
# drwxrwxrwx 1 ice1e0 ice1e0   512 Oct 22 18:46 __pycache__
# -rwxrwxrwx 1 ice1e0 ice1e0  8283 Oct 22 18:46 sql_generation.py

either, only with

[..]
pip install -e git+https://github.com/mara/mara-schema.git#egg=mara-schema
ls -l .venv/src/mara-schema/mara_schema/
# returns:
# total 41
# -rwxrwxrwx 1 ice1e0 ice1e0  3192 Oct 22 18:42 attribute.py
# -rwxrwxrwx 1 ice1e0 ice1e0   238 Oct 22 18:42 config.py
# -rwxrwxrwx 1 ice1e0 ice1e0 10840 Oct 22 18:42 data_set.py
# -rwxrwxrwx 1 ice1e0 ice1e0  6937 Oct 22 18:42 entity.py
# drwxrwxrwx 1 ice1e0 ice1e0   512 Oct 22 18:42 example
# -rwxrwxrwx 1 ice1e0 ice1e0   536 Oct 22 18:42 __init__.py
# -rwxrwxrwx 1 ice1e0 ice1e0  4004 Oct 22 18:42 metric.py
# -rwxrwxrwx 1 ice1e0 ice1e0  8283 Oct 22 18:42 sql_generation.py
# drwxrwxrwx 1 ice1e0 ice1e0   512 Oct 22 18:42 ui

Support for composite business keys

I use several times composite business keys in addition to a surrogate keys. Sample dimensional table:

CREATE TABLE vendor
(
    id INTEGER IDENTITY(1,1) NOT NULL,
    company_num VARCHAR(5) NOT NULL,
    account_num VARCHAR(20) NOT NULL,

    name TEXT,
    street TEXT,

    PRIMARY KEY (id),
    UNIQUE (company_num, account_num)
)

My fact table often only includes the business key but not the surrogate key:

CREATE TABLE purchase_transaction
(
    company_num VARCHAR(5) NOT NULL,
    vendor_order_account_num VARCHAR(20) NOT NULL,
    vendor_invoice_account_num VARCHAR(20) NOT NULL,

    invoice_date DATE,

    product_num VARCHAR(20),
    net_line_amount_acy DECIMAL(32,5),
    net_line_discount_acy DECIMAL(32,5)
)

The mara_schema module currently only supports relationships where the foreign key is related to the primary key (which is unique). So I have to manually add a vendor_order_account_fk to purchase_transaction and use an UPDATE SQL statement to set the foreign key based on the business key connection.

It would be great when mara_schema would support the following use cases:

  1. use the business key instead of the primary key and/or
  2. generate SQL statements to update a foreign key related based on the primary key

Option 1: Using the business key in relations

Required changes:

  1. The entity would need the option to define a business key (bk_column_names in Entity.__init__), `Entity.
  2. The Entity.link_entity method would need a new parameter bk_columns.
  3. The mara_schema.sql_generation.data_set_sql_query would need to use the business key in the LEFT JOIN when the primary key is not defined.

Option 2: Gelerating an update statement

Required changes:

  1. Add a function to generate an SQL statement setting a foreign key property based on the business key. It should generate something similar to:
UPDATE purchase_transaction
SET purchase_transaction.vendor_order_account_fk = vendor.id
FROM purchase_transaction
INNER JOIN vendor ON
    vendor.company_num = purchase_transaction.company_num AND
    vendor.account_num = purchase_transaction.vendor_order_account_num
WHERE purchase_transaction.vendor_order_account_fk IS NULL
  1. add a function to generate the SQL statements for all relations for multiple/all defined entities
  2. (option) maybe adding the option to add the foreign key column to the fact table as well.
ALTER TABLE purchase_transaction ADD COLUMN IF NOT EXISTS vendor_order_account_fk INTEGER;

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.