
enceladus's Introduction

Copyright 2018 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Enceladus

What is Enceladus?

Enceladus is a Dynamic Conformance Engine which allows data from different formats to be standardized to parquet and conformed to group-accepted common reference data (e.g. country designations that are DE in one source system and Deutschland in another can both be conformed to Germany).

The project consists of four main components:

REST API

The REST API exposes the Enceladus endpoints for creating, reading, updating and deleting the models, as well as other functionalities. The main models used are:

  • Runs: Although they cannot be defined by users, Runs provide an important overview of the Standardization & Conformance jobs that have been carried out.
  • Schemas: Specifies the schema towards which the dataset will be standardized
  • Datasets: Specifies where the dataset will be read from on HDFS (RAW), the conformance rules that will be applied to it, and where it will land on HDFS once it is conformed (PUBLISH)
  • Mapping Tables: Specifies where tables with master reference data can be found (parquet on HDFS), which are used when applying Mapping conformance rules (e.g. the dataset uses Germany, which maps to the master reference DE in the mapping table)
  • Dataset Property Definitions: Datasets may be accompanied by properties, but these are not free-form - they are bound by system-wide property definitions.

The REST API exposes a Swagger Documentation UI which documents HTTP exposed endpoints. It can be found at REST_API_HOST/swagger-ui.html
To switch between the latest and all (latest + legacy) endpoints, use the definition selector in the top right corner of the Swagger UI.

Menas

This is the user-facing web client, used to specify the standardization schema, and define the steps required to conform a dataset.
The Menas web client is built on top of the REST API and calls it to retrieve the entities it needs.

Standardization

This is a Spark job which reads an input dataset in any of the supported formats and produces a parquet dataset with the Menas-specified schema as output.

Conformance

This is a Spark job which applies the Menas-specified conformance rules to the standardized dataset.

Standardization and Conformance

This is a Spark job which executes both Standardization and Conformance together in a single job.

How to build

Build requirements:

  • Maven 3.5.4+
  • Java 8

Each module provides configuration file templates with reasonable default values. Make a copy of the *.properties.template and *.conf.template files in each module's src/resources directory, removing the .template extension. Ensure the properties there fit your environment.
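
For example, a minimal shell sequence to prepare one module's configuration (the module name is a placeholder and the loop is just an illustration):

cd <module>/src/resources
for f in *.properties.template *.conf.template; do
  cp "$f" "${f%.template}"
done
# then edit the resulting *.properties and *.conf files to match your environment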

Build commands:

  • Without tests: mvn clean package -Dskip.unit.tests
  • With unit tests: mvn clean package
  • With integration tests: mvn clean package -Pintegration

Test coverage:

  • Test coverage: mvn clean verify -Pcode-coverage

The coverage reports are written in each module's target directory.

How to run

REST API requirements:

Deploying REST API

Simply copy the rest-api.war file produced when building the project into Tomcat's webapps directory. Another possible method is building the Docker image based on the existing Dockerfile and deploying it as a container.
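
For example, a Docker-based deployment could look like the sketch below; the image name, tag and port mapping are illustrative assumptions, not project conventions:

# build the image from the directory containing the Dockerfile
docker build -t enceladus-rest-api:latest .
# run the container, mapping an assumed application port
docker run -d -p 8080:8080 enceladus-rest-api:latest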

Deploying Menas

There are several ways of deploying Menas:

  • Tomcat deployment: copy the menas.war file produced when building the project into Tomcat's webapps directory. The "apiUrl" value in package.json should be set either before building, or after building by modifying the artifact in place
  • Docker deployment: build the Docker image based on the existing Dockerfile and deploy it as a container. The API_URL environment variable should be provided when running the container (see the sketch after this list)
  • CDN deployment: copy the built contents of the dist directory to your preferred CDN server. The "apiUrl" value in package.json in the dist directory should be set
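
As referenced above, a minimal sketch of the Docker deployment; the image name, tag and port mapping are illustrative, only the API_URL variable comes from the description above:

docker build -t menas:latest .
docker run -d -p 8080:8080 -e API_URL=<rest_api_url> menas:latest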

Speed up initial loading time of REST API

  • Enable HTTP compression
  • Configure spring.resources.cache.cachecontrol.max-age in the REST API's application.properties to enable caching of static resources (an illustrative snippet follows)
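
A sketch of what this could look like in the REST API's application.properties; the compression properties and the specific max-age value are illustrative assumptions:

# enable HTTP compression of responses (illustrative)
server.compression.enabled=true
server.compression.mime-types=application/json,application/xml,text/html,text/css,application/javascript
# let browsers cache static resources (value is illustrative)
spring.resources.cache.cachecontrol.max-age=365d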

Standardization and Conformance requirements:

  • REST API Credentials File in your home directory or on HDFS, containing the login and password used to authenticate to the REST API, e.g.:

username=user
password=changeme
  • REST API Keytab File in your home directory or on HDFS
    • Used with Kerberos authentication, see link for details on creating keytab files
  • Directory structure for the RAW dataset should follow the convention of <path_to_dataset_in_menas>/<year>/<month>/<day>/v<dataset_version>. This date is specified with the --report-date option when running the Standardization and Conformance jobs. An illustrative layout is shown after this list.
  • An _INFO file must be present along with the RAW data on HDFS as per the above directory structure. This is a file tracking control measures via Atum; an example can be found here.
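
An illustrative RAW directory layout for a dataset registered in Menas, report date 2019-07-23 and dataset version 1 (file names are examples only):

<path_to_dataset_in_menas>/2019/07/23/v1/
    data.csv    # raw data in any of the supported formats
    _INFO       # Atum control measurements file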

Running Standardization

<spark home>/spark-submit \
--num-executors <num> \
--executor-memory <num>G \
--master yarn \
--deploy-mode <client/cluster> \
--driver-cores <num> \
--driver-memory <num>G \
--conf "spark.driver.extraJavaOptions=-Denceladus.rest.uri=<rest_api_uri:port> -Dstandardized.hdfs.path=<path_for_standardized_output>-{0}-{1}-{2}-{3} -Dhdp.version=<hadoop_version>" \
--class za.co.absa.enceladus.standardization.StandardizationJob \
<spark-jobs_<build_version>.jar> \
--rest-api-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>
  • Here row-tag is an option specific to the raw-format of type XML. For more options for other formats, please see our WIKI.
  • In case the REST API is configured for in-memory authentication (e.g. in dev environments), replace --rest-api-auth-keytab with --rest-api-credentials-file

Running Conformance

<spark home>/spark-submit \
--num-executors <num> \
--executor-memory <num>G \
--master yarn \
--deploy-mode <client/cluster> \
--driver-cores <num> \
--driver-memory <num>G \
--conf 'spark.ui.port=29000' \
--conf "spark.driver.extraJavaOptions=-Denceladus.rest.uri=<rest_api_uri:port> -Dstandardized.hdfs.path=<path_of_standardized_input>-{0}-{1}-{2}-{3} -Dconformance.mappingtable.pattern=reportDate={0}-{1}-{2} -Dhdp.version=<hadoop_version>" \
--packages za.co.absa:enceladus-parent:<version>,za.co.absa:enceladus-conformance:<version> \
--class za.co.absa.enceladus.conformance.DynamicConformanceJob \
<spark-jobs_<build_version>.jar> \
--rest-api-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version>

Running Standardization and Conformance together

<spark home>/spark-submit \
--num-executors <num> \
--executor-memory <num>G \
--master yarn \
--deploy-mode <client/cluster> \
--driver-cores <num> \
--driver-memory <num>G \
--conf "spark.driver.extraJavaOptions=-Denceladus.rest.uri=<rest_api_uri:port> -Dstandardized.hdfs.path=<path_for_standardized_output>-{0}-{1}-{2}-{3} -Dhdp.version=<hadoop_version>" \
--class za.co.absa.enceladus.standardization_conformance.StandardizationAndConformanceJob \
<spark-jobs_<build_version>.jar> \
--rest-api-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>
  • In case the REST API is configured for in-memory authentication (e.g. in dev environments), replace --rest-api-auth-keytab with --rest-api-credentials-file

Helper scripts for running Standardization, Conformance or both together

The scripts in the scripts folder can be used to simplify the command lines for running Standardization and Conformance jobs.

Steps to configure the scripts are as follows (Linux); an illustrative command sequence is shown after the list:

  • Copy all the scripts in scripts/bash directory to a location in your environment.
  • Copy enceladus_env.template.sh to enceladus_env.sh.
  • Change enceladus_env.sh according to your environment settings.
  • Use run_standardization.sh and run_conformance.sh scripts instead of directly invoking spark-submit to run your jobs.
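
An illustrative command sequence for the steps above (the target directory is hypothetical):

mkdir -p ~/enceladus/scripts
cp scripts/bash/* ~/enceladus/scripts/
cd ~/enceladus/scripts
cp enceladus_env.template.sh enceladus_env.sh
vi enceladus_env.sh    # adjust the settings to your environment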

Similar scripts exist for Windows in directory scripts/cmd.

The syntax for running Standardization and Conformance is similar to running them using spark-submit. The only difference is that you don't have to provide environment-specific settings. Several resource options, like driver memory and driver cores also have default values and can be omitted. The number of executors is still a mandatory parameter.

The basic command to run Standardization becomes:

<path to scripts>/run_standardization.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--rest-api-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>

The basic command to run Conformance becomes:

<path to scripts>/run_conformance.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--rest-api-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version>

The basic command to run Standardization and Conformance combined becomes:

<path to scripts>/run_standardization_conformance.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--rest-api-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>

Similarly for Windows:

<path to scripts>/run_standardization.cmd ^
--num-executors <num> ^
--deploy-mode <client/cluster> ^
--rest-api-auth-keytab <path_to_keytab_file> ^
--dataset-name <dataset_name> ^
--dataset-version <dataset_version> ^
--report-date <date> ^
--report-version <data_run_version> ^
--raw-format <data_format> ^
--row-tag <tag>

Etc...

The list of options for configuring Spark deployment mode in Yarn and resource specification:

Option Description
--deploy-mode cluster/client Specifies a Spark Application deployment mode when Spark runs on Yarn. Can be either client or cluster.
--num-executors n Specifies the number of executors to use.
--executor-memory mem Specifies an amount of memory to request for each executor. See memory specification syntax in Spark. Examples: 4g, 8g.
--executor-cores n Specifies the number of cores to request for each executor (default=1).
--driver-cores n Specifies a number of CPU cores to allocate for the driver process.
--driver-memory mem Specifies an amount of memory to request for the driver process. See memory specification syntax in Spark. Examples: 4g, 8g.
--persist-storage-level level Advanced. Specifies the storage level to use for persisting intermediate results. Can be one of NONE, DISK_ONLY, MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK (default), MEMORY_AND_DISK_SER, etc. See more here.
--conf-spark-executor-memoryOverhead mem Advanced. The amount of off-heap memory to be allocated per executor, in MiB unless otherwise specified. Sets spark.executor.memoryOverhead Spark configuration parameter. See the detailed description here. See memory specification syntax in Spark. Examples: 4g, 8g.
--conf-spark-memory-fraction value Advanced. Fraction of (heap space - 300MB) used for execution and storage (default=0.6). Sets spark.memory.fraction Spark configuration parameter. See the detailed description here.

For more information on these options see the official documentation on running Spark on Yarn: https://spark.apache.org/docs/latest/running-on-yarn.html

The list of all options for running Standardization, Conformance and the combined Standardization And Conformance jobs:

Option Description
--rest-api-auth-keytab filename A keytab file used for Kerberized authentication to REST API. Cannot be used together with --rest-api-credentials-file.
--rest-api-credentials-file filename A credentials file containing a login and a password used to authenticate to REST API. Cannot be used together with --rest-api-auth-keytab.
--dataset-name name A dataset name to be standardized or conformed.
--dataset-version version A version of a dataset to be standardized or conformed.
--report-date YYYY-mm-dd A date specifying the day for which the raw data was landed.
--report-version version A version of the data for a particular day.
--std-hdfs-path path A path pattern specifying where to put standardized data. The following tokens are expanded in the pattern: {0} - dataset name, {1} - dataset version, {2} - report date, {3} - report version. An example is shown below.
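
For example, assuming a hypothetical base path, the --std-hdfs-path pattern expands as follows for dataset mydataset, dataset version 1, report date 2019-07-23 and report version 2:

--std-hdfs-path /bigdata/std/std-{0}-{1}-{2}-{3}
# expands to: /bigdata/std/std-mydataset-1-2019-07-23-2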

The list of additional options available for running Standardization:

Option Description
--raw-format format A format for input data. Can be one of parquet, json, csv, xml, cobol, fixed-width.
--charset charset Specifies a charset to use for csv, json or xml. Default is UTF-8.
--cobol-encoding encoding Specifies the encoding of a mainframe file (ascii or ebcdic). Code page can be specified using --charset option.
--cobol-is-text true/false Specifies whether the mainframe file is an ASCII text file.
--cobol-trimming-policy policy Specifies the way leading and trailing spaces should be handled. Can be none (do not trim spaces), left, right, both (default).
--copybook string Path to a copybook for COBOL data format
--csv-escape character Specifies a character to be used for escaping other characters. By default '\' (backslash) is used. *
--csv-quote character Specifies a character to be used as a quote for creating fields that might contain delimiter character. By default " is used. *
--debug-set-raw-path path Override the path of the raw data (used for testing purposes).
--delimiter character Specifies a delimiter character to use for CSV format. By default , is used. *
--empty-values-as-nulls true/false If true, empty values are treated as nulls.
--folder-prefix prefix Adds a folder prefix before the date tokens.
--header true/false Indicates whether the input CSV data has a header as the first row of each file.
--is-xcom true/false If true a mainframe input file is expected to have XCOM RDW headers.
--null-value string Defines how null values are represented in the csv and fixed-width file formats.
--row-tag tag A row tag if the input format is xml.
--strict-schema-check true/false If true, processing ends the moment a row not adhering to the schema is encountered; if false (default), processing continues and an entry is added to errCol.
--trimValues true/false Indicates whether string fields of fixed-width text data should be trimmed.

Most of these options are format specific. For details see the documentation.

* Can also be specified as a Unicode value in the following ways: U+00A1, u00a1 or just the code 00A1. In case an empty string needs to be specified, the keyword none can be used.
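
For example (the values are illustrative):

--delimiter U+00A1    # delimiter given as a Unicode code point
--csv-quote u00a1     # the same character in the u00a1 form
--csv-escape none     # no escape character (empty string)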

The list of additional options available for running Conformance:

Option Description
--mapping-table-pattern pattern A pattern used to look up the mapping table for the specified date.
The list of possible substitutions: {0} - year, {1} - month, {2} - day of month. By default the pattern is reportDate={0}-{1}-{2}. Special symbols in the pattern need to be escaped. For example, an empty pattern can be specified as \'\' (single quotes are escaped using a backslash character). An example is shown after this table.
--experimental-mapping-rule true/false If true, the experimental optimized mapping rule implementation is used. The default value is build-specific and is set in 'application.properties'.
--catalyst-workaround true/false Turns on (true) or off (false) the workaround for a Catalyst optimizer issue. It is true by default. Turn this off only if you encounter timing freeze issues when running Conformance.
--autoclean-std-folder true/false If true, the standardized folder will be cleaned automatically after successful execution of a Conformance job.
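
As referenced above, an illustrative use of --mapping-table-pattern (the date is an example):

--mapping-table-pattern 'reportDate={0}-{1}-{2}'
# for report date 2019-07-23 the mapping table is looked up under reportDate=2019-07-23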

All the additional options valid for both Standardization and Conformance can also be specified when running the combined StandardizationAndConformance job.

How to measure code coverage

mvn clean verify -Pcode-coverage

If a module contains measurable code, the code coverage report will be generated at the following path:

{local-path}\enceladus\{module}\target\jacoco

Plugins

Standardization and Conformance support plugins that allow executing additional actions at certain times of the computation. To learn how plugins work, when and how their logic is executed, please refer to the documentation.

Built-in Plugins

The purpose of this module is to provide plugins with additional but relatively elementary functionality, and also to serve as an example of how plugins are written (see the detailed description).

Examples

A module containing examples of the project usage.

How to contribute

Please see our Contribution Guidelines.

Extras

  • For Menas migration, there is a useful script available in scripts/migration/migrate_menas.py (dependencies.txt provided; to install missing ones, run pip install -r scripts/migration/requirements.txt)

enceladus's Issues

Refactor Validation Utils

We have two separate ValidationExceptions:

  • za.co.absa.enceladus.conformance.interpreter.rules.ValidationException (only used in Standardization)
  • za.co.absa.enceladus.utils.validation.ValidationException (only used in Conformance)
    Conformance also dips into za.co.absa.enceladus.utils.validation.ValidationUtils before throwing its own ValidationException, which blurs the line between which ValidationException is responsible for what.

Good design dictates a clear separation between the two or a unification.

There is also some code duplication in conformance rules, specifically aimed at validation, that would best be extracted to a common location.

Copybook handling

Provide a mechanism to convert a native copybook to a Spark Struct so it can be uploaded and managed directly in Menas.
A copybook contains too much information to easily store in a Spark Struct schema; maybe we should rather store it in Mongo as an uploaded attachment.

Database Integration Tests

We need database integration tests to check all database interactions Menas can perform.
This involves for each test case:

  1. Fixtures being imported into a live DB instance (containerized or not)
  2. Running the test
  3. Cleaning up the DB state for the next test

Testing

We need to prioritise stability for version 1.0.0.
To do this, we need to make up for missing unit, integration and system tests.

Better dataset declaration / runtime parameter design

This involves reducing the parameters submitted to the Spark jobs for both Standardization and Conformance. Instead, these will be provided via Menas.
This should greatly improve the platform's integration capabilities, e.g. with scheduling tools like Oozie.

Utils Unit Tests

We need to improve coverage of our unit tests in the utils module.

View updates even from canceled edit window

If a user clicks edit on the Basic Info of any of the models (Schemas, Datasets, Mapping Tables), changes the description, and then clicks cancel, the change is still shown in the current window. It almost seems as if the update went through.

This goes away after a window refresh and does not send any call to the database.

Unit tests of handling arrays in XML

The unit tests should look like this:

  • For a given xml dataset and a schema check if the resulting Spark dataset is as expected
  • Tests should cover:
    • empty arrays
    • optional arrays
    • single value arrays
    • arrays of primitive values
    • arrays of struct values
    • array as a single attribute of a struct field

Add scaladocs

We lack scaladocs on most APIs; adding them would really help to improve the documentation.

Table grouping

I would like to suggest that there be a different folder structure or way of searching on Menas.

I think it would be good to have a folder per source system, with the tables that belong to that source inside it, instead of just a flat list of tables with no indication of which source system they come from.

Add performance summary to UI

It would be helpful to add some performance metrics to Run screen, such as:

  • input data size (in GB)
  • output data size (in GB)
  • Time it took to do standardization
  • Time it took to do dynamic conformance
  • Total processing time
  • For each checkpoint it would be helpful to show the elapsed time as a difference between start and end processing time, if available (for source and raw it may not be available).
  • Maybe the number of records (already shown as control measurements) could be part of the performance summary as well.

The input data size and output data size should be provided by the backend. All other information can be derived from control checkpoints, I guess.
There is a generic key/map pair in the _INFO file; I guess we can use that to provide information about the size of the raw, standardized and published datasets. Possibly it should also include information about the Standardization and Conformance elapsed times.

The raw folder size, std folder size and conformed folder size can be saved in the metadata fields of the _INFO file and the Run object during std/dynconf job execution.

Allow entity Import/Export

Background

Currently, entity definitions in Menas cannot be exported. The only thing you can export is a file uploaded for a Schema entity.

Feature

It would be useful if users could export and import any entity definition as a json.

An open question is how to handle conflicts (cases where a dataset definition with the same name and version already exists). This conflict resolution should be drawn up in a shared doc and once resolved posted here.

This task is similar in nature to #594, but meant for manual single item import/export, while #594 is meant for direct promotion via HTTP and in bulk.

Expected task

  1. API
  2. UI adjustment

Use RESTful http methods in Menas API

Currently, disable calls to the new REST API use HTTP GET requests, but GET should be idempotent; instead we should use HTTP DELETE.
POST should be used to create an entity.
PUT should be used to update an entity.

Authorization and permissions

User and admin roles need to be differentiated. User Read/Write permissions need to be restricted to either their own entities or there could be some sort of group restrictions.

Timestamp in Unix Epoch

Source may contain Timestamp data as milliseconds since the Unix Epoch:

1528797510650 = 2018/06/12 09:58:30.650
As this is not a standard SimpleDateFormat pattern, there would need to be a custom Standardisation-specific string, e.g.: UNIXEPOCH

Optimize the array type conformance by merging conformance rules

Array transformations are expensive because we explode and collapse arrays for each conformance rule operating on array columns. We could improve performance by merging rules that operate on the same array columns, reducing this to a single explosion-collapse.

Similarity Schema Checks

We can implement a set of routines that, given 2 schemas, provide the schema difference in order to track schema changes between versions. That may include:

  • The list of new fields
  • The list of deleted fields
  • The list of changed properties of fields (type, nullability, etc)

But this will require some UI work as well to visualize the difference. Probably the easiest way is to display the above lists as text boxes.

Runs view

The dataset should have a runs page showing run metadata, as per old Menas.
Reminder: add thousands separator.

Model versioning

With the new UI, all changes, even micro changes, incur multiple versions. This is both good and bad, as we get a full rollback on any change to a very granular point.
The proposed idea is to have a separate versioning system for tracking changes in Menas (when creating and updating definitions), compared to the version of the dataset we run when doing spark-submit; otherwise we will have too many versions due to the micro-changes Menas is going to introduce.

Warning on outdated entity references

We could put a warning on views that have entity references pointing entities with old versions.
Example:

  • Schema A with latest version 5
  • Dataset X references Schema A (version 4)
  • Show a warning next to Schema A (version 4) on the Dataset X view: "Schema A has a newer version 5"

Add additional metadata for datasets

Allow users to capture the following information for dataset level definition:

  • Business Description
  • Frequency

Frequency could definitely feed into our discussion about integration with Oozie and other schedulers.

Currently we recommend that users populate this as additionalInfo metadata in the _INFO file, but it should live in the Dataset definition rather than in "run" metadata (the _INFO file).

Uber JAR Standardisation and Conformance

Provide 1 JAR artifact for both Conformance and Standardisation.
This will simplify configuration files and also reduce the risk of users running mismatched STD and CONF versions.

Old data lingers even if all items in view are deleted

After the user deletes all the items in the view from the left column, the last item is still shown in the main view, and the user can still go through all of its tabs.

This should be disabled or nulled, same as in the left column where it just says No Data.

Handle schema insertion where schema has 0 fields

Currently, Conformance allows schemas with 0 fields to be inserted and used; this is only picked up when running Enceladus. It is usually an error caused by schema definition generation or manual manipulation.

Capture outbound schema back into Menas

The post-Conformance result schema should be published back into Menas so we have both the inbound and outbound data schemas.
Maybe this could turn Menas into a central schema store and allow harvesting from other tools.

Schema editor

Users should be able to construct and validate their Schemas through the Menas UI.

Drop (Do not persist) fields that are not registered in schema

Do not write attributes that are not registered in the schema to the parquet file; this is usually a symptom of the source having attributes while the schema in the schema repository has not been updated.

We want to use this to ensure owners keep up with schema changes and to give data owners control over the distribution of attributes.

Enceladus and Menas model compatibility

Currently we have no way of ensuring that the models used in Enceladus (std and conformance) are versions compatible with the ones stored in Menas. We can add a header to the HTTP requests that says which model version Enceladus is using; Menas can then check compatibility and respond accordingly.

If the header is not specified, Menas should skip the compatibility check to avoid breaking compatibility for people using Enceladus as a library.
