
mojap-metadata's Introduction

mojap-metadata

This Python package allows users to read and alter our metadata schemas (using the metadata module), as well as convert our metadata schemas to and from the schema definitions used by other tools (these converters are defined in the converters module as Converter classes).

Metadata

Converters

Converter Systems

Installation

Make sure you are using a recent version of pip (>=20.0.0):

pip install git+https://github.com/moj-analytical-services/mojap-metadata

To install additional dependencies that will be used by the converters (e.g. the etl-manager and arrow extras):

pip install 'mojap-metadata[etl-manager,arrow] @ git+https://github.com/moj-analytical-services/mojap-metadata'

Metadata

This module creates a class called Metadata which allows you to interact with our agnostic metadata schemas. The Metadata class deals with parsing, manipulating and validating metadata json schemas.

The Schema

Our metadata schemas are used to define a table. The idea of these schemas is to define the contents of a table with a generic metadata schema. If you want to use this schema to interact with Oracle, PyArrow or AWS Glue, for example, then you can create a Converter class to take the metadata and convert it to a schema that works with that tool (or vice versa).

When adding a parameter to the metadata config, the first thing to do is check whether it already exists in JSON Schema. For example enum, pattern and type are parameters in our column types but come from JSON Schema naming definitions.

An example of a basic metadata schema:

{
    "$schema" : "$schema": "https://moj-analytical-services.github.io/metadata_schema/mojap_metadata/v1.0.0.json",
    "name": "employees",
    "description": "table containing employee information",
    "file_format": "parquet",
    "columns": [
        {
            "name": "employee_id",
            "type": "int64",
            "type_desc": "integer",
            "description": "an ID for each employee",
            "minimum": 1000,
            "maximum": 9999
        },
        {
            "name": "employee_name",
            "type": "string",
            "type_string": "string",
            "description": "name of the employee"
        },
        {
            "name": "employee_dob",
            "type": "date64",
            "type_desc": "date",
            "description": "date of birth for the employee in ISO format",
            "pattern": "^\\d{4}-([0]\\d|1[0-2])-([0-2]\\d|3[01])$"
        }
    ]
}

Schema Properties

  • name: String that can be whatever you want to name the table. Best to avoid spaces, as most systems do not like them, but the schema will let you use them.

  • file_format: String denoting the file format.

  • columns: List of objects where each object describes a column in your table. Each column object must have at least a name and a (type or type_category).

    • name: String denoting the name of the column.
    • type: String specifying the type the data is in. We use data types from the Apache Arrow project. We use their type names as they seem to comprehensively cover most of the data types we deal with. Note: in our naming convention for types we allow bool (which is equivalent to bool_) and list (which is equivalent to list_).
    • type_category: These group different sets of type properties into a single superset. These are: integer, float, string, timestamp, bool, list, struct. For example we class int8, int16, int32, int64, uint8, uint16, uint32, uint64 as integer. This allows users to give more generic types if the data is not coming from a system or output with strict types (i.e. data exported from Excel or an unknown origin). The Metadata class has default type values for each given type_category. See the default_type_category_lookup attribute of the Metadata class for said defaults. This field is required if type is not set.
    • description: Description of the column.
    • enum: List of what values that column can take. (Same as the standardised json schema keyword).
    • pattern: Regex pattern that the value has to match (for string type_categories only). (Same as the standardised json schema keyword).
    • minLength / maxLength: The minimum and maximum length of the string (for string type_categories only). (Same as the standardised json schema keyword).
    • minimum / maximum: The minimum and maximum value a numerical type can take (for integer and float type_categories only).
  • partitions: List of what columns in your dataset are partitions.

  • table_location: the location of the table. This is a string that can represent a file path, directory, url, etc.

  • database_name: the name of the database this table belongs to.
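For example, a minimal sketch of a schema using partitions, table_location and database_name (the bucket, path and database names here are illustrative):

from mojap_metadata import Metadata

meta = Metadata.from_dict({
    "name": "employees",
    "file_format": "parquet",
    "columns": [
        {"name": "employee_id", "type": "int64"},
        {"name": "business_area", "type": "string"},
    ],
    "partitions": ["business_area"],
    "table_location": "s3://my-bucket/my_db/employees/",  # illustrative path
    "database_name": "my_db",
})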

Additional Schema Parameters

We allow users to add additional parameters to the table schema object or any of the columns in the schema. If there are specific parameters / tags you want to add to your schema, it should still pass validation (as long as the additional parameters do not share a name with ones already used in the schema).
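For example, a minimal sketch with made-up extra tags (the tag names here are purely illustrative, not part of the schema):

from mojap_metadata import Metadata

meta = Metadata.from_dict({
    "name": "test",
    "columns": [
        {"name": "c1", "type": "int64", "owner": "data-eng"},  # custom column tag
    ],
    "data_retention_days": 90,  # custom table-level parameter
})
meta.validate()  # still passes, as the extra names don't clash with schema ones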

Usage

from mojap_metadata import Metadata

# Generate basic Metadata Table from dict
meta1 = Metadata(name="test", columns=[{"name": "c1", "type": "int64"}, {"name": "c2", "type": "string"}])

print(meta1.name) # test
print(meta1.columns[0]) # {"name": "c1", "type": "int64"}
print(meta1.description) # ""

# Generate meta from dict
d = {
    "name": "test",
    "columns": [
        {"name": "c1", "type": "int64"},
        {"name": "c2", "type": "string"}
    ]
}
meta2 = Metadata.from_dict(d)

# Read / write to json
meta3 = Metadata.from_json("path/to/metadata_schema.json")
meta3.name = "new_table"
meta3.to_json("path/to/new_metadata_schema.json")

Added Class methods and properties

The Metadata class has some methods and properties that are not part of the schema but help organise and manage the schema.

Column Methods

The class has multiple methods to alter the columns list.

  • column_names: Get a list of column names
  • update_column: If column with name matches replace it otherwise add it to the end
  • remove_column: Remove column that matches the given name. Note if a name in the partitions property matches that name then it is also removed.
    meta = Metadata(columns=[
        {"name": "a", "type": "int8"},
        {"name": "b", "type": "string"},
        {"name": "c", "type": "date32"},
    ])
    meta.column_names # ["a", "b", "c"]
 
    meta.update_column({"name": "a", "type": "int64"})
    meta.columns[0]["type"] # "int64"

    meta.update_column({"name": "d", "type": "string"})
    meta.column_names # ["a", "b", "c", "d"]

    meta.remove_column("d")
    assert meta.column_names == ["a", "b", "c"]

The Metadata class is a subclass of MutableMapping, where keys are column names and values are column metadata.

  • A metadata column can be accessed using the column name as a key.
  • A new or existing column can be updated using the column name as a key. The key must match the column name. Calls update_column.
  • A column can be deleted. Calls remove_column.
  • Columns of metadata can be iterated over.
  • The length of metadata is defined as the number of columns.
# Access a specific column
meta["c1"] # {"name": "c1", "type": "int64"}

# Add a new column (key must match name)
meta["c3"] = {"name": "c3", "type": "bool"}

# Delete a column
del meta["c3"]

# Iterate over all columns
for col in meta:
    print(f"column name:{col["name"]}, column type:{col["type"]}")

# Get the number of columns
len(meta) # 3

force_partition_order Property

By default this is set to None, but it can be set to "start" or "end". When set to None the Metadata class will not track column order relative to partitions.

Note: For Athena we normally set partitions at the end.

meta = Metadata(columns=[
        {"name": "a", "type": "int8"},
        {"name": "b", "type": "string"},
        {"name": "c", "type": "date32"},
    ])

meta.partitions = ["b"]
meta.column_names # ["a", "b", "c"]

If set to "start" or "end" then any changes to partitions will affect the column order.

meta.force_partition_order = "start"
meta.column_names # ["b", "a" ,"c"]

Generating Metadata objects


Converters

A Converter takes a Metadata object and generates something else from it (or converts something else into a Metadata object). Most of the time your converter will convert our schema into another system's schema.

How to use the Converters

For example the ArrowConverter takes our schemas and converts them to a pyarrow schema:

from mojap_metadata import Metadata
from mojap_metadata.converters.arrow_converter import ArrowConverter

d = {
    "name": "test",
    "columns": [
        {"name": "c1", "type": "int64"},
        {"name": "c2", "type": "string"},
        {"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
    ],
    "file_format": "jsonl"
}
meta = Metadata.from_dict(d)

ac = ArrowConverter()
arrow_schema = ac.generate_from_meta(meta)

print(arrow_schema) # Could use this schema to read in data as arrow dataframe and cast it to the correct types

Another use for the arrow converter is to convert it back from an Arrow schema to our metadata. This is especially useful if you have nested data types that would be difficult to write out the full STRUCT / LIST. Instead you can let Arrow do that for you and then pass the agnostic metadata object into something like the Glue Converter to generate a schema for AWS Glue.

import pyarrow as pa
import pandas as pd

from mojap_metadata.converters.arrow_converter import ArrowConverter

data = {
    "a": [0,1],
    "b": [
        {"cat": {"meow": True}, "dog": ["bone", "fish"]},
        {"cat": {"meow": True}, "dog": ["bone", "fish"]},
    ]
}
df = pd.DataFrame(data)
arrow_df = pa.Table.from_pandas(df)
ac = ArrowConverter()
meta = ac.generate_to_meta(arrow_df.schema)

print(meta.columns) # [{'name': 'a', 'type': 'int64'}, {'name': 'b', 'type': 'struct<cat:struct<meow:bool>, dog:list<string>>'}]

All converter classes are subclasses of mojap_metadata.converters.BaseConverter. The BaseConverter has no actual functionality but is a boilerplate class that ensures standardised attributes for all added Converters. These are:

  • generate_from_meta: (function) takes a Metadata object and returns whatever the converter produces.

  • generate_to_meta: (function) takes Any object (normally a schema for another system or package) and returns our Metadata object (i.e. the reverse of generate_from_meta).

  • options: (Data Class) the options for the converter. The base options have a suppress_warnings parameter, but that doesn't mean all converters use it. To get a better understanding of setting options, see the GlueConverter class or tests/test_glue_converter.py for how they are set.
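As a concrete illustration, a minimal converter might look like the sketch below. This is hypothetical, not package code: the toy schema shape and option names are made up, and the exact BaseConverter constructor signature may differ.

from dataclasses import dataclass
from mojap_metadata import Metadata
from mojap_metadata.converters import BaseConverter

@dataclass
class DummyConverterOptions:
    suppress_warnings: bool = False  # mirrors the base options parameter

class DummyConverter(BaseConverter):
    """Hypothetical converter producing a toy dict-based 'schema'."""

    def __init__(self, options: DummyConverterOptions = None):
        super().__init__()
        self.options = options if options else DummyConverterOptions()

    def generate_from_meta(self, metadata: Metadata) -> dict:
        # Metadata -> toy target schema
        return {
            "table": metadata.name,
            "fields": [(c["name"], c["type"]) for c in metadata.columns],
        }

    def generate_to_meta(self, schema: dict) -> Metadata:
        # Toy target schema -> Metadata (the reverse of generate_from_meta)
        cols = [{"name": n, "type": t} for n, t in schema["fields"]]
        return Metadata(name=schema["table"], columns=cols)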

Further Usage

See the mojap-aws-tools repo which utilises the converters a lot in different tutorials.

Contributing and Design Considerations

Each new converter (if not expanding on existing converters) should be added as a new submodule within the parent converters module. This is especially true if the new converter has additional package dependencies. By design the standard install of this package is fairly lightweight. However, if you need the ArrowConverter you need to install the additional package dependencies for the arrow converter:

pip install 'mojap-metadata[arrow] @ git+https://github.com/moj-analytical-services/mojap-metadata'

This means we can continually add converters (as submodules) and add optional package dependencies (see pyproject.toml) without making the default install any heavier. mojap_metadata would only error if someone tries to import a converter subclass without having the additional dependencies installed.

Converter systems

Glue Converter

The GlueConverter takes our schemas and converts them to a dictionary that can be passed to the glue_client to deploy a schema on AWS Glue.

import boto3
from mojap_metadata import Metadata
from mojap_metadata.converters.glue_converter import GlueConverter

d = {
    "name": "test",
    "columns": [
        {"name": "c1", "type": "int64"},
        {"name": "c2", "type": "string"},
        {"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
    ],
    "file_format": "jsonl"
}
meta = Metadata.from_dict(d)

gc = GlueConverter()
boto_dict = gc.generate_from_meta(meta, database_name="test_db", table_location="s3://bucket/test_db/test/")

print(boto_dict) 

glue_client = boto3.client("glue")
glue_client.create_table(**boto_dict) # Would deploy glue schema based on our metadata

Included alongside GlueConverter is GlueTable, which can overlay a metadata object, dictionary, or path to a metadata file. It has one method:

  • generate_from_meta: generates a glue table from the metadata object, dict, or string path, takes the following arguments:
    • metadata: the metadata object, dict, or string path that is to be overlaid
    • table_location: a kwarg, the location of the table data. This can also be a property of the metadata object, dict, or file
    • database_name: a kwarg, the name of the glue database to put the table. This can also be a property of the metadata object, dict, or file
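For example, a minimal sketch (the database name and S3 location are illustrative):

from mojap_metadata import Metadata
from mojap_metadata.converters.glue_converter import GlueTable

meta = Metadata.from_json("path/to/metadata_schema.json")

gt = GlueTable()
gt.generate_from_meta(
    meta,  # a Metadata object, dict or path to a metadata file all work
    database_name="test_db",
    table_location="s3://bucket/test_db/test/",
)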

SQLAlchemy Converter

Uses the Inspector class to extract metadata from database dialects supported by SQLAlchemy.

See SQLAlchemy Converter for more details.
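A rough sketch of the intended workflow; the connection URL is illustrative and the constructor and method signatures below are assumptions, so check the converter's documentation linked above before relying on them:

import sqlalchemy as sa
from mojap_metadata.converters.sqlalchemy_converter import SQLAlchemyConverter

engine = sa.create_engine("postgresql://user:pass@host:5432/mydb")  # illustrative URL

sc = SQLAlchemyConverter(engine)  # assumed constructor
meta = sc.generate_to_meta(table="employees", schema="public")  # assumed signature
print(meta.column_names)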

Postgres Converter

The Postgres Converter provides the following functionality:

  1. Connection to a postgres database
  2. Extraction of the metadata from the tables
  3. Conversion of the extracted output into Metadata objects

  • get_object_meta: (function) takes the table name and schema name, extracts the metadata from the postgres database and converts it into a Metadata object.

  • generate_to_meta: (function) takes the database connection and returns a list of Metadata objects for all tables in the non-system schemas from the connection.
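Based on the functions described above, usage looks roughly like the sketch below. The connection details are illustrative and the exact call signatures are assumptions:

import sqlalchemy as sa
from mojap_metadata.converters.postgres_converter import PostgresConverter

engine = sa.create_engine("postgresql://user:pass@host:5432/mydb")  # illustrative

pc = PostgresConverter()
with engine.connect() as conn:
    # Metadata for a single table (assumed signature)
    meta = pc.get_object_meta(conn, table="employees", schema="public")
    # Metadata objects for every table in all non-system schemas (assumed signature)
    all_meta = pc.generate_to_meta(connection=conn)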

NOTE: the sqlalchemy converter is more robust and should be the default method for most databases, but the postgres converter is retained for compatibility.


mojap-metadata's Issues

Add new type_desc parameter

This is the higher level group of datatypes outlined in the readme.

  • Should be able to infer type_desc from type property.
  • Should be able to infer type from type_desc, i.e. if an old etl-manager schema has type: int then this package should be able to infer both properties by setting type_desc to integer and type: int64, where int64 is the class's default type for that type_desc
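A hypothetical sketch of the two inference directions (the lookup is a small illustrative subset, not the class's actual defaults):

# Hypothetical helpers, not package code
DEFAULT_TYPE_FOR_DESC = {"integer": "int64", "float": "float64", "string": "string"}

def infer_type_desc(type_name: str) -> str:
    """e.g. 'int8' or 'uint64' -> 'integer'"""
    if type_name.lstrip("u").startswith("int"):
        return "integer"
    if type_name.startswith("float"):
        return "float"
    return "string"

def infer_type(type_desc: str) -> str:
    """e.g. old etl-manager 'integer' -> the assumed default 'int64'"""
    return DEFAULT_TYPE_FOR_DESC[type_desc]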

Complex column types

I really dislike having to define arrays and structs like this.

array<struct<num:int,desc:character>>
struct<num:int,desc:character>

However, I can't really think of a better way to define them.

Type missing from schema

I have been validating the 1.3.0 schema with a json validator for an R project. It says that the schema is missing '"type": "object"' in several places, always before a properties field. This occurs directly within "additional_col_properties" and also within any of the "allOf" fields. The following appears to work. I only raise this because the schema will not validate without the corrections.

{
"$schema": "http://json-schema.org/draft-07/schema",
"definitions": {
"additional_col_properties": {
"type": "object",
"required": [
"name"
],
"properties": {
"name": {
"type": "string",
"title": "The column name"
},
"description": {
"type": "string",
"title": "A description of this field",
"default": ""
},
"nullable": {
"type": "boolean",
"title": "Specifies if column is nullable (can have missing values) or not (cannot have missing values)"
},
"sensitive": {
"type": "boolean",
"title": "Specifies if the column contains personal or special category data"
},
"alias": {
"type": "string",
"title": "If this column contains a standard ID or other linking field but has a non-standard name, the alias can provide the standard name",
"examples": [
"nomis_id"
]
},
"foreign_key": {
"type": "array",
"title": "The list of foreign key relationships this column has, if any",
"default": [],
"examples": [
[
{
"table": "",
"columns": []
}
]
],
"items": {
"anyOf": [
{
"type": "object",
"default": {},
"examples": [
{
"table": "",
"columns": []
}
],
"required": [
"table",
"columns"
],
"properties": {
"table": {
"type": "string",
"title": "The table name",
"default": "",
"examples": [
""
]
},
"columns": {
"type": "array",
"title": "The list of column names",
"default": []
}
},
"additionalProperties": true
}
]
}
},
"unique": {
"type": "boolean",
"title": "Specifies if the values in the column must be unique",
"default": false
},
"enum": {
"type": "array",
"title": "An array of valid values that can exist in this column. Note NULL/None is not required, please use nullable property to define if column is nullable.",
"examples": [
[
"Y",
"N"
],
[
0,
1,
2,
3,
4
],
[
"England",
"Northern Ireland",
"Scotland",
"Wales"
]
]
},
"pattern": {
"type": "string",
"title": "regex pattern that can be used to validate data in this column"
},
"minimum": {
"type": [
"null",
"number"
],
"title": "The minumum value a numerical type can take",
"default": null
},
"maximum": {
"type": [
"null",
"number"
],
"title": "The maximum value a numerical type can take",
"default": null
},
"minLength": {
"type": [
"null",
"number"
],
"title": "The minimum length a string type can have",
"default": null
},
"maxLength": {
"type": [
"null",
"number"
],
"title": "The maximum length a string type can have",
"default": null
}
},
"additionalProperties": true
},
"null_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^null$"
},
"type_category": {
"enum": [
"null"
]
}
}
}
]
},
"int_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^u?int(8|16|32|64)$"
},
"type_category": {
"enum": [
"integer"
]
}
}
}
]
},
"float_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^float(16|32|64)$|^decimal128\(\d+,\d+\)$"
},
"type_category": {
"enum": [
"float"
]
}
}
}
]
},
"string_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^string$|^large_string$|^utf8$|^large_utf8$"
},
"type_category": {
"enum": [
"string"
]
}
}
}
]
},
"timestamp_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^time32\((s|ms)\)$|^time64\((us|ns)\)$|^date(32|64)$|^timestamp\((s|ms|us|ns)\)$"
},
"type_category": {
"enum": [
"timestamp"
]
}
}
}
]
},
"binary_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^binary(\([0-9]+\))?$|^large_binary$"
},
"type_category": {
"enum": [
"binary"
]
}
}
}
]
},
"boolean_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^bool$|^bool_$"
},
"type_category": {
"enum": [
"boolean"
]
}
}
}
]
},
"list_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^large_list<.+>$|^list_<.+>$|^list<.+>$"
},
"type_category": {
"enum": [
"list"
]
}
}
}
]
},
"struct_types": {
"allOf": [
{
"$ref": "#/definitions/additional_col_properties"
},
{
"type": "object",
"properties": {
"type": {
"type": "string",
"pattern": "^map_<.+>$|^struct<.+>$"
},
"type_category": {
"enum": [
"struct"
]
}
}
}
]
}
},
"type": "object",
"title": "Metadata",
"description": "MoJ Data Catalogue Metadata",
"required": [
"name",
"columns"
],
"properties": {
"name": {
"type": "string",
"title": "The name of the table in the database"
},
"description": {
"type": "string",
"title": "A description of what this table contains"
},
"file_format": {
"type": "string",
"title": "The format the data is stored in"
},
"sensitive": {
"type": "boolean",
"title": "Specifies whether the table contains any sensitive data",
"default": false
},
"primary_key": {
"type": "array",
"uniqueItems": true,
"items": {
"type": "string"
},
"title": "Columns that form the primary key of the table",
"default": []
},
"partitions": {
"type": "array",
"uniqueItems": true,
"items": {
"type": "string"
},
"title": "Columns used to partition the table",
"default": []
},
"database_name": {
"type": "string",
"title": "Name of the database the table belongs to."
},
"table_location": {
"type": "string",
"title": "Path to your data can be a folder path, filepath or URL."
},
"columns": {
"type": "array",
"title": "The columns in the table. An array of objects",
"items": {
"oneOf": [
{
"$ref": "#/definitions/null_types"
},
{
"$ref": "#/definitions/int_types"
},
{
"$ref": "#/definitions/float_types"
},
{
"$ref": "#/definitions/string_types"
},
{
"$ref": "#/definitions/timestamp_types"
},
{
"$ref": "#/definitions/binary_types"
},
{
"$ref": "#/definitions/boolean_types"
},
{
"$ref": "#/definitions/list_types"
},
{
"$ref": "#/definitions/struct_types"
}
]
}
}
},
"additionalProperties": true
}

from_dict: raises error for unneeded attributes

md = Metadata.from_dict(
{
    "name": "test",
    "file_format": "csv",
    "columns": [
        {"name": "my_int", "type": "int64"},
        {"name": "my_double", "type": "float64"},
        {"name": "my_date", "type": "date64"},
        {"name": "my_timestamp", "type": "timestamp(s)"}
    ]
    
})

Raises a TypeError when trying to validate primary_key attribute:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-d026769c6f4b> in <module>
----> 1 md = Metadata.from_dict(
      2 {
      3     "name": "test",
      4     "file_format": "csv",
      5     "columns": [

~/Documents/projects/mojap-metadata/mojap_metadata/metadata/metadata.py in from_dict(cls, d)
     23         m = cls()
     24         m._data = deepcopy(d)
---> 25         m.validate()
     26         return m
     27 

~/Documents/projects/mojap-metadata/mojap_metadata/metadata/metadata.py in validate(self)
     74     def validate(self):
     75         jsonschema.validate(instance=self._data, schema=self._schema)
---> 76         self._validate_list_attribute(attribute="primary_key", columns=self.primary_key)
     77         self._validate_list_attribute(attribute="partitions", columns=self.partitions)
     78 

~/Documents/projects/mojap-metadata/mojap_metadata/metadata/metadata.py in _validate_list_attribute(self, attribute, columns)
     79     def _validate_list_attribute(self, attribute: str, columns: list) -> None:
     80         if not isinstance(columns, list):
---> 81             raise TypeError(f"'{attribute}' must be of type 'list'")
     82         if not all([isinstance(column, str) for column in columns]):
     83             raise TypeError(f"'{attribute}' must be a list of strings")

TypeError: 'primary_key' must be of type 'list'

date32 fields (days since 1970-01-01) are converted into timestamp fields

The current handling of date type columns by the metadata schema means that fields specified as date32 (days since 1970-01-01) are converted into timestamp fields, when they should be date fields. No extra precision is gained by adding the timestamp, and it may cause a mismatch if the desired output is a truncation of the timestamp to date.

A proposed solution would be to create a table_schema_2.json or equivalent with the addition of a date type_category for date32 fields. This should then become the default for tables going forward.

`glue_table_properties` only accepts and returns values as strings

this is an extension of changes discussed here and implemented here.

current situation

The current implementation allows a user to provide an additional key (glue_table_properties), which must itself be a dict of key-value pairs, to the Metadata object. When this is passed to GlueTable().generate_from_meta with update_table_properties set to True, the key-value pairs are added to the table properties in the glue data catalog.

i.e.

>>> meta = {
... "name": "some_table",
... ...,
... "glue_table_properties": {"key_1": "value_1", "key_2": "value_2"}
... }

>>> _ = GlueTable().generate_from_meta(
...     metadata=meta,
...     database_name="some_database",
...     table_location="s3://some/location",
...     update_table_properties=True
... )

The same is true in reverse when using GlueTable().generate_to_meta with get_table_properties set to True: a metadata object is returned with the table properties from the glue catalog available within the glue_table_properties key.

i.e.

>>> meta = GlueTable().generate_to_meta(
...     database="some_database",
...     table="some_table",
...     get_table_properties=True
... )

>>> meta
{
"name": "some_table",
...,
"glue_table_properties": {"key_1": "value_1", "key_2": "value_2"}
}

why we need more than strings

The current implementation requires the values in glue_table_properties to be strings, however there are many cases where different or more complex data-types could be required. For example primary keys are often stored as lists, and indeed this is a requirement of model-generation within create-a-derived-table, which implements its own extraction of table properties and transforms certain values to lists as required (see here).

If we were to migrate model-generation to make use of generate_to_meta (which I'd very much like to do!), then we would ideally need a way for mojap-metadata to handle those data-type transformations internally, rather than having to add something custom within our own code.

Similarly when using generate_from_meta to add the table-properties to the glue-catalog it would provide a simpler user experience if the glue_table_properties dictionary could be passed with these more complex types with mojap-metadata handling the conversion to string internally before writing to the glue-catalog.

proposal

Would probably need to discuss implementation to come up with the simplest, most user-friendly approach. One possible suggestion is to provide an optional schema for glue_table_properties which mojap-metadata can then use to do any conversions it needs and raise meaningful exceptions if those transformations fail.

i.e.

>>> meta = {
... "name": "some_table",
... ...,
... "glue_table_properties": {
... "key_1": "a string",
... "key_2": 10,
... "key_3": ["a", "list"],
... "key_4": True}
... }

>>> glue_table_properties_schema = {
... "key_1":  str,
... "key_2": int,
... "key_3": list,
... "key_4": bool
... }

# internally this converts the values in `glue_table_properties` into strings based on the
# `glue_table_properties_schema` schema before writing to glue
>>> _ = GlueTable().generate_from_meta(
...     metadata=meta,
...     database_name="some_database",
...     table_location="s3://some/location",
...     update_table_properties=True,
...     table_properties_schema=glue_table_properties_schema
... )

# internally this converts the strings returned from glue back to the datatypes as defined in
# `glue_table_properties_schema` schema
>>> meta = GlueTable().generate_to_meta(
...     database="some_database",
...     table="some_table",
...     get_table_properties=True,
...     table_properties_schema=glue_table_properties_schema
... )

>>> meta
{
"name": "some_table",
...,
"glue_table_properties": {
    "key_1": "a string",
    "key_2": 10,
    "key_3": ["a", "list"],
    "key_4": True
}
}

It seems like there is a fair amount of functionality in metadata.py which handles data-type transformations, which it may be possible to use/extend for these purposes.

Binary types

  • Should the Arrow binary types sit under a binary data type category?
  • Should we add a content encoding property for strings, to specify whether they contain binary data encoded in plain text, and which encoding is used?
  • Do we need to support binary types at all at this point?
  • Other considerations?

GlueTable generate_to_meta fails on varchar(x)

It appears that if a table is created via CTAS query and has some string columns added as part of that query, the glue type becomes varchar(x) (where x is the maximum string length) rather than just varchar. This doesn't play well when trying to get the metadata from GlueTable's generate_to_meta

Improvements to table schema loading

I think this might be better off being attributed to the class? So that it is only loaded once? Also worth thinking about pointing that to a web address in the future so we can always pull the latest version of the schema? This latter point might be worth adding as an issue.

Originally posted by @isichei in #15 (comment)

Support for "fuzzy" record linking with column aliases or URIs

Re this issue in the metadata_schema repo:
@RobinL "Extending the schema to allow 'fuzzy' relationships between disparate tables - standard names": moj-analytical-services/metadata_schema#2

I would like to be able to automatically detect fields that may be useful for fuzzy record linking.

For instance, we may be able to join two datasets from completely different source databases on fields like dob, first_name etc.

However, in general these fields will have different names.

This may involve

  • Extending the schema to allow such fields to be identified
  • Agreeing on a standard alias for these fields so that the column name in the dataset can be translated into the standardised version e.g. dob and birthdate may be standardised to date_of_birth

We've added an alias property for columns, so if the column has a non-standard name, the alias can provide the standard name. But we don't yet know who might maintain a definitive list of those, and where.

An idea in the readme we haven't done anything with yet, that might (?) facilitate linking is specifying some kind of URI for columns, like <repo>/metadata/<folder>/<table_name>/<column_name> or <databasename>:<tablename>:<columnname>

Should store all data in object as a dict. Then getters and setters just change dict values.

Ideally you should be able to pass a dict to the object and have it store the entire contents of that dict. This would mean that reading in a config, altering some parameters and then writing it out would be "lossless" if you wanted to just add additional unsupported properties to your config.

This would require you to change the init, getters and setters but not that drastically.

Originally posted by @isichei in #15 (comment)

It would be useful for the class to provide a list of valid types

Similar to this method, where the json schema is read in and the type_category fields are extracted. There should probably be a tuple in the metadata module (or a class method on Metadata) that just provides a list of all allowable type strings. As this is not simple to extract from the json schema, it would be useful to users / for error messages.
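A rough sketch of pulling the type regexes out of the json schema (assuming the definitions/allOf layout shown in the schema quoted elsewhere in this readme; the file path is illustrative). A hand-maintained tuple of literal type names would still be friendlier for error messages:

import json

with open("table_schema.json") as f:  # illustrative path
    schema = json.load(f)

type_patterns = []
for name, definition in schema["definitions"].items():
    for part in definition.get("allOf", []):
        pattern = part.get("properties", {}).get("type", {}).get("pattern")
        if pattern:
            type_patterns.append((name, pattern))

print(type_patterns)  # e.g. [('int_types', '^u?int(8|16|32|64)$'), ...]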

Should support all column data types listed in arrow

[Arrow's agnostic datatypes](https://arrow.apache.org/docs/python/api/datatypes.html)

Need to consider how to deal with datetimes, i.e. timestamp[D] and timestamp[x] should be supported, so validation might need to be more complex than enums.

Arrowconverter.generate_from_meta doesn't include type_category

As title. arrow_pd_parser.caster assumes that each field has a type_category, indicating it should be a mandatory field, but it isn't produced by default. Either the caster should get type_category from type, or it should be produced as part of this output.

Tests failing

  • Update dependencies pyproject.toml
  • Lint if needed

Converters/Data Types: What specifically is the 'glue' format and can we use existing code from Arrow to convert data types

  • What's its relation to hive DDL?
  • What's its relation to Spark types?
  • What's its relation to Athena types?

If we have a full understanding of ☝️ it might reduce the amount of code we need to write converters - e.g. it might help us identify pre-existing code we can reuse

Related: is there a difference between data type conversion (i.e. being able to convert the data types from 'agnostic metadata' to 'pandas types' to 'glue types') and metadata manipulation? Should these be separate modules? Are there official versions of these converters 'somewhere upstream'? E.g. look in the metadata stored in parquet files (which is different if you write out from Spark than if you write out from pandas) - https://github.com/moj-analytical-services/etl_manager/blob/a99646b37b257d855b1cc1114532bbe1952d7618/etl_manager/meta.py#L1024

Does Apache Arrow have any of these converters in its codebase?

FK and unique constraints need to sit at table-level

These properties need to sit outside of the column property or they won't work for composite column constraints.

"foreign_key": {
"type": "array",
"title": "The list of foreign key relationships this column has, if any",
"default": [],
"examples": [
[
{
"table": "",
"columns": []
}
]
],
"items": {
"anyOf": [
{
"type": "object",
"default": {},
"examples": [
{
"table": "",
"columns": []
}
],
"required": [
"table",
"columns"
],
"properties": {
"table": {
"type": "string",
"title": "The table name",
"default": "",
"examples": [
""
]
},
"columns": {
"type": "array",
"title": "The list of column names",
"default": []
}
},
"additionalProperties": true
}
]
}
},
"unique": {
"type": "boolean",
"title": "Specifies if the values in the column must be unique",
"default": false
},

Have publicly available schema at url

I think this might be better off being attributed to the class? So that it is only loaded once? Also worth thinking about pointing that to a web address in the future so we can always pull the latest version of the schema? This latter point might be worth adding as an issue.

Originally posted by @isichei in #15 (comment)

metadata should have a database class like etl manager

Atm there is only a metadata class. Just wondering if we need a database class (or a grouped table class). This might be useful for things like looking at pk and fk relationships for example.

I would be up for not calling it a database class. A lot of our data is just groups of tables rather than extracts from RDBs, so it would be good to ensure it is generalised enough to work for both.

Should default type mappings be in options rather than the Converter class?

Currently the default mappings in the GlueConverter and (WIP) EtlManagerConverter are both defined in the Converter class. Worth wondering if these should be editable by being defined in the options for each Converter.

The other option is to allow the user to pass a custom function into the generate_from_meta/generate_to_meta functions. The downside is that the user cannot store the state of that additional function as a config. Not sure how often this would be needed though.

Data type naming conversion discussion

Questions

In this PR #27 I've made the time and timestamp datatypes more explicit to match Arrow. Some questions / discussion over this PR are split by headings below.

Should we drop the date type category?

Currently this will raise a validation error as dates are categorised as datetime. The schema states that date is a viable type_category, but I feel like we should group times, timestamps and dates together, as they are quite similar.

Metadata(columns = [{"name": "test", "type": "date64", "type_category": "date"}]) # raises validation error

Should we use square bracket notation instead of round?

Currently, to define units in types that require them (e.g. timestamp), we use round brackets, timestamp(s), rather than timestamp[s]. The string representation in Arrow uses square brackets so I think we should also use square brackets?

This is a typo

Think this is me copy and pasting stuff in to check tmpdir was working. Should be deleted

# content of test_tmpdir.py
def test_create_file(tmpdir):
    p = tmpdir.mkdir("sub").join("hello.txt")
    p.write("content")
    assert p.read() == "content"

Make updates to existing metadata class

  • Should support yaml and json
  • When reading and writing a metadata config, this class should ensure that it is lossless. I.e. if there is an unexpected property in the string (e.g. primary-key) then this should be stored in the metadata object and written back out when exporting the schema from the object to JSON/YAML

Major Release V2 - Proposed Changes

[1] Change the columns property of the Metadata class to an OrderedDict.

Something like:

from collections import OrderedDict
from copy import deepcopy
from typing import List

class ColumnOrderedDict(OrderedDict):
    """
    An ordered dict with added methods to add columns.
    """

    def add_col(self, col: dict):
        self[col["name"]] = deepcopy(col)

    def add_list_of_cols(self, columns: List[dict]):
        for c in columns:
            self.add_col(c)
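Usage would then look something like:

cols = ColumnOrderedDict()
cols.add_list_of_cols([
    {"name": "a", "type": "int8"},
    {"name": "b", "type": "string"},
])
cols["a"]   # {"name": "a", "type": "int8"}
list(cols)  # ["a", "b"] - insertion order is preserved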

[2] Remove underscores from bool_ and list_ type names

So bool_ -> bool and list_ -> list. I think by sticking too closely to the arrow type names we are tying ourselves into a naming convention that doesn't make sense for us, so this would be a good change for consistency.

Should glue table creation be out of scope for the glue_converter module?

In the readme I state that we should still have etl_manager, just with this package running in its innards.

I am struggling to know what would be considered out of scope or not though. If in the pandas_converter example you could cast a pandas dataframe to one that matches the metadata, is there much difference in pushing a table ddl to glue with a glue client?
