opendatadiscovery-specification's Issues

Is there any idea for RPC Data?

For a large existing system, it is difficult to apply data discovery at the database level. We would like to do the same thing at the RPC level, treating Response.Field the way ODD treats Table.Column. Has ODD considered this kind of data source? Is there any prior experience with it?

DataQualityTest Categorization

Goal

The goal is to categorize tests into two distinct categories:

  1. Assertion Tests: These are tests that run at specific points in time and primarily validate specific conditions or behaviors.
  2. Anomaly Detection Tests: These tests are designed to identify anomalies or deviations in data, and their outcomes are influenced by the temporal aspects and lifetime of the data.

Decisions

  1. Because the type property is already used to specify the expectation type by name, we decided to introduce a new property, category.

Option 1: Categorizing Anomaly Detection Subtypes

In this option, we establish and classify common subtypes for anomaly detection.

Pros:

  • Simplifies the process of grouping tests by their subtypes because all possible values are predefined.

Cons:

  • Requires specification changes to incorporate new subtypes, which may involve additional administrative effort.

Specification:

...
DataQualityTestExpectationCategory:
    type: string
    enum:
      - ASSERTION
      - VOLUME_ANOMALY
      - FRESHNESS_ANOMALY
      - COLUMN_VALUES_ANOMALY
      - SCHEMA_CHANGE

DataQualityTestExpectation:
    type: object
    properties:
      type:
        type: string
        example: "expect_table_row_count_to_be_between"
      category:
        $ref: '#/components/schemas/DataQualityTestExpectationCategory'
    additionalProperties:
      type: string
...

Code Example:

test_anomaly = DataQualityTestExpectation(
    type="volume_anomalies",
    category=DataQualityTestExpectationCategory.VOLUME_ANOMALY
)

test_assertion = DataQualityTestExpectation(
    type="expect_table_row_count_to_be_between",
    category=DataQualityTestExpectationCategory.ASSERTION
)
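The grouping benefit listed under Pros can be sketched as follows. This is only an illustration, not the ODD implementation: the dataclass and enum are minimal stand-ins for the schema proposed above, and group_by_category is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class DataQualityTestExpectationCategory(str, Enum):
    # Values copied from the Option 1 enum above.
    ASSERTION = "ASSERTION"
    VOLUME_ANOMALY = "VOLUME_ANOMALY"
    FRESHNESS_ANOMALY = "FRESHNESS_ANOMALY"
    COLUMN_VALUES_ANOMALY = "COLUMN_VALUES_ANOMALY"
    SCHEMA_CHANGE = "SCHEMA_CHANGE"


@dataclass
class DataQualityTestExpectation:
    # Stand-in for the generated model; only the two relevant fields.
    type: str
    category: DataQualityTestExpectationCategory


def group_by_category(expectations):
    """Hypothetical helper: collect expectation type names per category."""
    groups = defaultdict(list)
    for expectation in expectations:
        groups[expectation.category].append(expectation.type)
    return dict(groups)


expectations = [
    DataQualityTestExpectation(
        "volume_anomalies", DataQualityTestExpectationCategory.VOLUME_ANOMALY
    ),
    DataQualityTestExpectation(
        "expect_table_row_count_to_be_between",
        DataQualityTestExpectationCategory.ASSERTION,
    ),
]
grouped = group_by_category(expectations)
```

Because every subtype is an enum member, a consumer can group on category alone without inspecting the free-form type string.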

Option 2: Simplifying the Categorization

In this approach, we define only the two main categories, ASSERTION and ANOMALY_DETECTION. The specific kind of test within a category is determined by the DataQualityTestExpectation.type property.

Pros:

  • Offers flexibility as any value can be assigned to the DataQualityTestExpectation.type property, allowing for custom categorization.
  • Streamlines the process and avoids the need to create new subtypes for anomaly detection.

Cons:

  • May make it challenging to group anomaly tests by their subtypes since the categorization is solely dependent on the DataQualityTestExpectation.type property.

Specification:

...
DataQualityTestExpectationCategory:
    type: string
    enum:
      - ASSERTION
      - ANOMALY_DETECTION

DataQualityTestExpectation:
    type: object
    properties:
      type:
        type: string
        example: "expect_table_row_count_to_be_between"
      category:
        $ref: '#/components/schemas/DataQualityTestExpectationCategory'
    additionalProperties:
      type: string
...

Code Example:

test_anomaly = DataQualityTestExpectation(
    type="volume_anomalies",
    category=DataQualityTestExpectationCategory.ANOMALY_DETECTION
)

test_assertion = DataQualityTestExpectation(
    type="expect_table_row_count_to_be_between",
    category=DataQualityTestExpectationCategory.ASSERTION
)
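To make the grouping drawback concrete, here is a minimal sketch under stated assumptions: the classes are stand-ins for the Option 2 schema, count_anomaly_types is a hypothetical helper, and "row_count_anomaly" is an invented type name representing a second tool that reports the same kind of check under a different name.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class DataQualityTestExpectationCategory(str, Enum):
    # Values copied from the Option 2 enum above.
    ASSERTION = "ASSERTION"
    ANOMALY_DETECTION = "ANOMALY_DETECTION"


@dataclass
class DataQualityTestExpectation:
    # Stand-in for the generated model; only the two relevant fields.
    type: str
    category: DataQualityTestExpectationCategory


def count_anomaly_types(expectations):
    """Hypothetical helper: count anomaly tests per free-form type name."""
    counts = defaultdict(int)
    for expectation in expectations:
        if expectation.category is DataQualityTestExpectationCategory.ANOMALY_DETECTION:
            counts[expectation.type] += 1
    return dict(counts)


anomaly = DataQualityTestExpectationCategory.ANOMALY_DETECTION
tests = [
    DataQualityTestExpectation("volume_anomalies", anomaly),
    # Invented name: another tool describing a volume check differently.
    DataQualityTestExpectation("row_count_anomaly", anomaly),
    DataQualityTestExpectation(
        "expect_table_row_count_to_be_between",
        DataQualityTestExpectationCategory.ASSERTION,
    ),
]
anomaly_counts = count_anomaly_types(tests)
```

Both anomaly tests concern row volume, yet they land in separate groups because the only available key is the type string. Merging them would require a naming convention or a mapping maintained outside the specification.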

Add part to ingest Relationships between Data Entities

We need to prepare the specification for ingesting relationships between data entities:

  1. We assume that a relationship can exist only between two data entities (self-references included).
  2. We support only two types of relationship for now: between relations (tables, files, views, etc.) and between graph nodes.
  3. We assume the following attributes for each type:
    3.1 Between relations: 1) cardinality (One-to-Zero-One-or-More, One-to-One-or-More, One-to-Zero-or-One, One-to-Exactly-1); 2) identifying/non-identifying; 3) ODDRNs of the data entities at the beginning and end; 4) ODDRNs of the dataset fields at the beginning and end (there may be several, since foreign keys can be composite);
    3.2 Between graph nodes: 1) relationship type (name); 2) direction; 3) list of attributes (key-value); 4) ODDRNs of the data entities at the beginning and end.
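The attribute lists above could be modelled roughly as follows. This is a sketch, not a proposed schema: every class and field name is hypothetical, the ODDRN strings are placeholders rather than real ODDRN paths, and pairing composite key fields by position (with equal-length lists) is an assumption the issue does not state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Cardinality(str, Enum):
    # The four cardinalities listed in item 3.1.
    ONE_TO_ZERO_ONE_OR_MORE = "ONE_TO_ZERO_ONE_OR_MORE"
    ONE_TO_ONE_OR_MORE = "ONE_TO_ONE_OR_MORE"
    ONE_TO_ZERO_OR_ONE = "ONE_TO_ZERO_OR_ONE"
    ONE_TO_EXACTLY_ONE = "ONE_TO_EXACTLY_ONE"


@dataclass
class RelationRelationship:
    """Relationship between two relations (item 3.1)."""
    cardinality: Cardinality
    is_identifying: bool
    source_oddrn: str
    target_oddrn: str
    source_field_oddrns: List[str]  # more than one entry for composite keys
    target_field_oddrns: List[str]

    def __post_init__(self):
        # Assumption: fields are paired by position, so lengths must match.
        if not self.source_field_oddrns:
            raise ValueError("at least one field pair is required")
        if len(self.source_field_oddrns) != len(self.target_field_oddrns):
            raise ValueError("source and target field lists differ in length")


@dataclass
class GraphRelationship:
    """Relationship between two graph nodes (item 3.2)."""
    relationship_type: str
    is_directed: bool
    source_oddrn: str
    target_oddrn: str
    attributes: Dict[str, str] = field(default_factory=dict)


customers = "//example/tables/customers"  # placeholder ODDRNs
orders = "//example/tables/orders"
composite_fk = RelationRelationship(
    cardinality=Cardinality.ONE_TO_ZERO_ONE_OR_MORE,
    is_identifying=False,
    source_oddrn=customers,
    target_oddrn=orders,
    source_field_oddrns=[f"{customers}/columns/region", f"{customers}/columns/id"],
    target_field_oddrns=[
        f"{orders}/columns/customer_region",
        f"{orders}/columns/customer_id",
    ],
)

person = "//example/nodes/person"
# Source and target may be the same entity (self-reference, item 1).
follows = GraphRelationship("FOLLOWS", True, person, person, {"since": "2020"})
```

Keeping the two relationship kinds as separate types mirrors item 2, where each kind carries its own attribute list; both share only the pair of entity ODDRNs.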
