datastrato / gravitino

World's most powerful data catalog service, providing a high-performance, geo-distributed and federated metadata lake.

Home Page: https://datastrato.ai/docs/

License: Apache License 2.0

Java 92.27% Shell 0.69% Dockerfile 0.17% JavaScript 3.48% CSS 0.01% TypeScript 0.50% Python 2.88%

datalake lakehouse metadata federated-query stratosphere metalake skycomputing data-catalog ai-catalog model-catalog

gravitino's Introduction

Gravitino

Introduction

Gravitino is a high-performance, geo-distributed, and federated metadata lake. It manages the metadata directly in different sources, types, and regions. It also provides users with unified metadata access for data and AI assets.

Gravitino aims to provide several key features:

Single Source of Truth for multi-regional data with geo-distributed architecture support.
Unified Data and AI asset management for both users and engines.
Security in one place, centralizing the security for different sources.
Built-in data management and data access management.

Contributing to Gravitino

Gravitino is open source software available under the Apache 2.0 license. For information on how to contribute to Gravitino please see the Contribution guidelines.

Online documentation

You can find the latest Gravitino documentation in the doc folder. This README file only contains basic setup instructions.

Building Gravitino

You can build Gravitino using Gradle. Currently, you can build Gravitino on Linux and macOS; Windows isn't supported.

To build Gravitino, please run:

./gradlew clean build -x test

If you want to build a distribution package, please run:

./gradlew compileDistribution -x test

Or, to build a compressed distribution package, run:

./gradlew assembleDistribution -x test

The distribution directory contains the generated binary distribution package.

For the details of building and testing Gravitino, please see How to build Gravitino.

Quick start

Configure and start the Gravitino server

If you already have a binary distribution package, go to the directory of the decompressed package.

Before starting the Gravitino server, please configure the Gravitino server configuration file. The configuration file, gravitino.conf, is in the conf directory and follows the standard property file format. You can modify the configuration within this file.
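As a rough illustration of what "standard property file format" means, the sketch below parses a few key/value lines with Java's built-in java.util.Properties. The key names (example.server.host, example.server.port) are invented for this example and are not actual Gravitino settings; consult the documentation for the real keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Minimal sketch: parses property-file text such as the contents of a
// .conf file written in the standard Java properties format.
// The keys used with this helper are hypothetical, not real Gravitino keys.
final class ConfSketch {
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      // Properties.load handles "key = value", "key=value" and "#" comments.
      props.load(new StringReader(text));
    } catch (IOException e) {
      // Reading from an in-memory string should not fail; rethrow unchecked.
      throw new UncheckedIOException(e);
    }
    return props;
  }
}
```

Whitespace around the separator is ignored, so "key = value" and "key=value" are read the same way.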

To start the Gravitino server, please run:

./bin/gravitino.sh start

To stop the Gravitino server, please run:

./bin/gravitino.sh stop

Using Trino with Gravitino

Gravitino provides a Trino connector to access the metadata in Gravitino. To use Trino with Gravitino, please follow the trino-gravitino-connector doc.

Development guide

License

Gravitino is licensed under the Apache License, Version 2.0. See the LICENSE file for details.

Apache®, Apache Hadoop®, Apache Hive™, Apache Iceberg™, Apache Kafka®, Apache Spark™, Apache Submarine™, Apache Thrift™ and Apache Zeppelin™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

gravitino's People

Contributors

justinmclean tongwei1105 qqqttt123 southernriver diqiu50 fanng1 haiyang1987 stenicholas zhangjianhrm yuqi1129 pan3793 zhangruichao meitianjinbu wolfboys hddong yanniszhou summaryzb bin-albin luciferyang jeffery-zhangj yuqi421127 mchades jerryshao zhaomin1423 turbofei zixi0825 huangxiaopingrd josh0yeh clearvive danhuawang yanghua xunliu jiaoqingbo microbearz hzxiongyinke lisancao xiaoyuyao nliver ch3yne yxac cxzl25 bwbwchen brandboat 724thomas hiirrxnn shaofengshi austin362667 henriquepaes1 sophietech88 coolderli zhoukangcn shameellamba yijhenlin yao-mr kevinztw lanznx liujinhui1994 bknbkn huyuanfeng2018 188xuhe unknowntpo lleohao-opensource-group xnge xloya puchengy evalaiyc98 rohankumardubey frankyang0529 teoteo520 arthursxl8 cottage14 winkerdu liuxiaocs7 zhangbutao chenyulin0719 xiacongling nirmalraki liun03 neusoftrhl codehysteria28 zhuzilong2013 nancy64-bit kevinjqliu orol116 shihes nk1506 lambaaryan011 raghits zivali surajbora59 henrybear327 caicancai huangzhaobo99 lw-yang peter19960903 hanwxx ajayl83 caican00 wangtaohz cbwleft

gravitino's Issues

[Improvement] Pick a better name for the project

What would you like to be improved?

Usually, a successful project should have an outstanding name from the beginning:

It should be easy to discover (for example, through a web search) and to remember, so it should usually be a single word (not a combination of words) and not a rarely used one.
It should be meaningful and, ideally, hint at what the project intends to do.
It shouldn't conflict with well-known projects such as hadoop, spark, flink, kubernetes, etc.

How should we improve?

No response

Implement open source EDW connector (CK or Doris, TBD)

Leave detailed comments to be added by assignee.
Also detailed design doc is required.

[Subtask] Maintain the tables in the hive Using HiveCatalog

Describe the subtask

This issue tracks the work of maintaining the tables in Hive using the namespace SPI.

Parent issue

[Subtask] Design and implement client Metalake APIs

Describe the subtask

This issue tracks the work of implementing a client interface to manipulate lakehouses.

Parent issue

#63

[Improvement] Add https support for Jetty Server

What would you like to be improved?

Currently, we have only implemented the HTTP ServerConnector for the embedded Jetty server. It would be better to add HTTPS support for security and completeness.

How should we improve?

Simply adding a Jetty HTTPS ServerConnector should be enough.

[EPIC] Geo-distributed architecture support for Graviton

Describe the proposal

This issue tracks the work of designing and implementing a geo-distributed HA architecture for Graviton. The owner of this epic should produce a detailed design and break it down into subtasks.

Task list

No response

[EPIC] Extending to add more heterogeneous catalog support

Describe the proposal

This issue tracks the work of extending the catalog to support more heterogeneous catalogs, such as a file catalog, an ML catalog, and so on. The epic should have a detailed design and be broken down into subtasks.

Task list

[EPIC] Design and implement hive metadata catalog

Describe the proposal

The Hive metadata connector is the core system for the Hive data source in Graviton. It describes how the connector is created and run.

This umbrella issue tracks the whole design and implementation of Graviton's Hive metadata connector (not data connector) system, including:

Maintain databases in Hive using the namespace SPI.
Maintain tables in Hive using the namespace SPI.
More items to be added.

Task list

Implement Hive Metastore connector

Leave detailed comments to be added by assignee.
Note: this is an umbrella issue and needs to be broken into sub-issues.
A detailed design doc is also required.

[Subtask] Add Catalog REST operations for Graviton Server

Describe the subtask

This issue tracks the work of implementing Catalog REST operations for Graviton Server.

Parent issue

[Subtask] Implement the on-wire protocol for Schema System

Describe the subtask

This task aims to implement the on-wire protocol for the schema system. As specified in the design doc, we will use the JSON format for the on-wire protocol and implement JSON serialization/deserialization for the schema system.

Parent issue

[Subtask] Update the RFC-1 to adapt the new design of schema system

Describe the subtask

This task tracks the work of updating the RFC-1 schema spec to adapt to the new design.

Parent issue

#43

[EPIC] Graviton Query Engine Adaption

Describe the proposal

This epic tracks the work of adapting the Graviton catalog's capabilities to specific query engines. With this, a query engine could use Graviton to manage the metadata from the underlying systems.

Task list

[SUBTASK] Metadata and Type Spec design

This issue tracks the design of metadata spec, including memory structure, on-wire protocol and storage layout.

This is the subtask of parent issue #3

[EPIC] Design and implement metadata storage interface

This epic tracks the work of designing the metadata entity storage interface and delivering one implementation.

The work items are:

[FEATURE] Adding new interfaces to manipulate the metadata entities

Describe the feature

This issue aims to design and implement the new interfaces for metadata manipulation, including GET, UPDATE, CREATE and DELETE of metadata.

The issue and the related PR focus only on defining the interfaces, not on specific implementations.
The interfaces are used internally and will not be exposed to end users.

Motivation

Defining the metadata manipulation interfaces decouples the interface from any specific implementation. The interfaces will be used by Graviton Server.

Describe the solution

No response

Additional context

No response

[Subtask] Metadata and Type Spec Implementation

Describe the subtask

This issue tracks the work of implementing metadata and type spec (specifically memory structure of metadata schema).

Parent issue

[Subtask] Design and implement client `Catalog` manipulation interface

Describe the subtask

This issue tracks the work of implementing a client Catalog manipulation interface for Graviton.

Parent issue

#57

[Improvement] Refactor the gradle build file to unify all the dependencies

What would you like to be improved?

Currently, dependencies are defined across the build files, which makes dependency management quite messy. We should refactor the current build files to manage all the dependencies in a single place.

How should we improve?

One solution is to introduce a version lock mechanism like Iceberg's. However, the Palantir version lock has many limitations; one of the big problems is that it supports only one version per dependency, which makes it hard to support multiple versions in one project.

Another solution is to introduce a Spring plugin to manage dependency versions in one place.

[Subtask] Implement hive connector

Describe the subtask

No response

Parent issue

[EPIC] Redesign Graviton's Schema and Connector System

Describe the proposal

Graviton's current schema system follows a "lakehouse/zone/table" logical structure, while the underlying physical tables follow a different structure, "metasource/db/table". It is quite confusing for users to understand these two metadata structures, and it makes the metadata hard to manage. So we decided to redesign the schema system to make it simpler and easier to understand.

The detailed reasoning can be found here (https://docs.google.com/document/d/1Wrd9HHJF2wLDhsvKix1VVKWLi_RdDhw-X3rqg-4lwtw/edit#heading=h.4adypg9kzjky).

Task list

[Subtask] Add metadata connector plugin framework

Describe the subtask

This task aims to implement the Metadata Connector plugin framework. As specified in the design doc, we will implement a plug-in factory, and every metadata connector type will be created and run through it.
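The plug-in factory idea can be sketched roughly as a registry that maps a connector type name to a supplier of connector instances. This is only an illustration under assumed names: ConnectorSketch and ConnectorFactorySketch are hypothetical and do not reflect the actual design doc or Graviton code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical connector contract; the real SPI is not shown in this issue.
interface ConnectorSketch {
  String type();
}

// Hypothetical factory: each connector type registers a supplier, and all
// connector instances are created through the factory.
final class ConnectorFactorySketch {
  private final Map<String, Supplier<? extends ConnectorSketch>> registry =
      new ConcurrentHashMap<>();

  void register(String type, Supplier<? extends ConnectorSketch> supplier) {
    registry.put(type, supplier);
  }

  ConnectorSketch create(String type) {
    Supplier<? extends ConnectorSketch> supplier = registry.get(type);
    if (supplier == null) {
      throw new IllegalArgumentException("Unknown connector type: " + type);
    }
    return supplier.get();
  }
}
```

Keeping creation behind a single factory means callers depend only on the type name, so new connector types can be added without changing calling code.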

Parent issue

[Improvement] Refactor `Lakehouse` terminology to `Metalake` in Graviton

What would you like to be improved?

Currently, we are using the term Lakehouse to represent the top-level repository for metadata management. In fact, we only manage the metadata, so Lakehouse is not very accurate. Instead, I propose Metalake to represent the metadata lake.

How should we improve?

No response

[Subtask] Client API interface design for Graviton

Describe the subtask

This issue tracks the work of designing the client API interface for Graviton. The client API is mainly used by users to interact with Graviton to manipulate metadata. It is also the common client that will later be adapted to query engines like Spark and Trino.

Parent issue

#53

[EPIC] Design and implement REST API for metadata CRUD

This epic issue aims to design and implement the REST API framework for Graviton.

The REST API design follows Microsoft's REST API design specs (https://learn.microsoft.com/en-us/azure/architecture/best-practices/api-design#define-api-operations-in-terms-of-http-methods).
Besides, we use MediaType's versioning mechanism to control the API version.
We use Jersey for the REST API implementation. We chose Jersey because it is much more lightweight than Spring Boot, and we don't want things to get out of control with the many injection mechanisms introduced by Spring Boot.
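To illustrate media-type-based versioning in general terms, the sketch below extracts an API version from a vendor media type string. The media type pattern application/vnd.example.v<N>+json and the class name are assumptions made for this example; the issue does not state the actual media type used.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of media type versioning: the API version is carried in the
// Accept / Content-Type value instead of the URL path.
// The "vnd.example" vendor prefix is hypothetical.
final class MediaTypeVersion {
  private static final Pattern VERSIONED =
      Pattern.compile("application/vnd\\.example\\.v(\\d{1,4})\\+json");

  // Returns the version number, or empty if the value is not a
  // versioned media type of the expected shape.
  static OptionalInt parse(String headerValue) {
    if (headerValue == null) {
      return OptionalInt.empty();
    }
    Matcher m = VERSIONED.matcher(headerValue.trim());
    return m.matches()
        ? OptionalInt.of(Integer.parseInt(m.group(1)))
        : OptionalInt.empty();
  }
}
```

A server could use the parsed version to pick a handler, and reject requests whose media type does not match.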

Subtasks are:

[Subtask] Maintain the namespaces in the hive Using HiveCatalog

Describe the subtask

This issue tracks the work of maintaining the databases in Hive using the namespace SPI.

Parent issue

#58

[Subtask] Introduce Jersey test framework

Describe the subtask

This issue tracks the work of introducing the Jersey test framework for REST APIs, so that we can verify them in unit tests.

Parent issue

We might want to consider a name change from Graviton

AWS has Graviton (https://aws.amazon.com/ec2/graviton/) and has registered the "AWS Graviton" trademark in multiple countries. Class 42 trademarks include software and hardware. These people also exist: https://www.gravitonusa.com. This might become an issue upon entry into the ASF Incubator due to those existing names. Do we want to keep the name or come up with something else?

[Improvement] Remove unnecessary transitive dependencies when including Hive dependencies

What would you like to be improved?

While implementing HiveCatalog, we unexpectedly introduced many unnecessary transitive dependencies, which can easily cause conflicts and increase the package size. We should clean all of them up.

How should we improve?

Exclude all the unnecessary dependencies.

[bug] Missing `BINARY` type in schema.proto

The Substrait simple type BINARY is not found in schema.proto. Do we need to add it?

[Subtask] Metadata Connector Spec design

Describe the subtask

This issue tracks the design of the Metadata Connector, including the part below:

Connector plugin run mode

Parent issue

[EPIC] Design and implement generic connector interface

The metadata connector system is the core system in Graviton that describes how a connector is created and run, including the SPI (service provider interface), the field type mapping of different data sources, and the memory structure.

This umbrella issue tracks the whole design and implementation of Graviton's metadata connector (not data connector) system, including:

Connector factory design & implementation.
Connector extract metadata job schedule design & implementation.
Connector plug-in framework.
Unified metadata type definition.
Type converter design & implementation.
Metadata multiple version design & implementation.
Typical connector design & implementation.

Subtasks are listed here:

[Improvement] Add cors filter for Jetty server

What would you like to be improved?

To improve security and avoid attacks on our embedded server, we should add CORS filter support for our Jetty server.

How should we improve?

Implementing a CORS filter and registering it with Jetty should be enough.

[Subtask] Implement Zone CREATE/GET/UPDATE/DELETE REST APIs

Describe the subtask

This issue aims to track the work of implementing REST APIs for "Zone" CREATE/GET/UPDATE/DELETE operations.

Parent issue

[EPIC] Implement a service architecture for unified catalog

Leave detailed comments to be added by assignee.

[Subtask] Design metadata storage interface

Describe the subtask

This issue tracks the work of designing a metadata storage interface for Graviton.

The interface should support storing entities to the underlying system.
The interface should be generic enough to support different underlying storage (like kv store, relational DB, or others).
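The two requirements above might translate into something like the following sketch: a small generic interface plus one interchangeable backend. All names here (EntityStoreSketch, InMemoryEntityStore) are hypothetical, and the in-memory map merely stands in for a KV store or relational database.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical storage abstraction: callers see only these operations,
// not the underlying storage technology.
interface EntityStoreSketch<K, V> {
  void put(K key, V entity);

  Optional<V> get(K key);

  boolean delete(K key);
}

// One possible backend. A KV-store or relational backend would implement
// the same interface without changes to callers.
final class InMemoryEntityStore<K, V> implements EntityStoreSketch<K, V> {
  private final Map<K, V> data = new ConcurrentHashMap<>();

  @Override
  public void put(K key, V entity) {
    data.put(key, entity);
  }

  @Override
  public Optional<V> get(K key) {
    return Optional.ofNullable(data.get(key));
  }

  @Override
  public boolean delete(K key) {
    // Returns true only if an entity was actually removed.
    return data.remove(key) != null;
  }
}
```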

Parent issue

Have contributors sign ICLAs

To donate a project to the ASF we either need a Software grant and ICLA from contributors who don't work for Datastrato or ICLAs from everyone. Setting up a manual or automated ICLA system would make the donation process easier.

Suggested ICLA text based on v2.2 of the ASF's ICLA documents - I changed Foundation to company and removed mention of non-profit and CCLAs. I also simplified the form a little and added a GitHub id as sometimes matching GitHub names to real names also causes issues. We can start with a manual process first.

Datastrato Individual Contributor License Agreement

This document is based on the Apache ICLA V2.2.

Thank you for your interest in Datastrato (the "Company"). To clarify the intellectual property license granted with Contributions from any person or entity, the Company must have on file a signed Individual Contributor License Agreement ("ICLA") from each Contributor, indicating agreement with the license terms below. This agreement is for your protection as a Contributor as well as the protection of the Company and its users. It does not change your rights to use your own Contributions for any other purpose.

Please complete and sign this Agreement, and then email a copy to [email protected].

Read this document carefully before signing and keep a copy for your records.
GitHub Name: ____________________________________________________
Full Name: ______________________________________________________
Postal Address: _________________________________________________
_________________________________________________
Country: _________________________________________________
E-Mail: ______________________________________________________

You accept and agree to the following terms and conditions for Your Contributions (present and future) that you submit to the Company. In return, the Company shall not use Your Contributions in a way that is contrary to open source in effect at the time of the Contribution. Except for the license granted herein to the Company and recipients of software distributed by the Company, You reserve all right, title, and interest in and to Your Contributions.

1. Definitions.
"You" (or "Your") shall mean the copyright owner or legal entity authorized by the copyright owner that is making this Agreement with the Company. For legal entities, the entity making a contribution and all other entities that control, are controlled by, or are under common control with that entity are considered to be a single Contributor. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "Contribution" shall mean any original work of authorship, including any modifications or additions to an existing work, that is intentionally submitted by You to the Company for inclusion in, or documentation of, any of the products owned or managed by the Company (the "Work"). For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Company or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Company for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by You as "Not a Contribution."
2. Grant of Copyright License. Subject to the terms and conditions of this Agreement, You hereby grant to the Company and to recipients of software distributed by the Company a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.
3. Grant of Patent License. Subject to the terms and conditions of this Agreement, You hereby grant to the Company and to recipients of software distributed by the Company a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution(s) alone or by combination of Your Contribution(s) with the Work to which such Contribution(s) was submitted. If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that your Contribution, or the Work to which you have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this Agreement for that Contribution or Work shall terminate as of the date such litigation is filed.
4. You represent that you are legally entitled to grant the above license. If your employer(s) has rights to intellectual property that you create that includes your Contributions, you represent that you have received permission to make Contributions on behalf of that employer, or that your employer has waived such rights for your Contributions to the Company.
5. You represent that each of Your Contributions is Your original creation (see section 7 for submissions on behalf of others). You represent that Your Contribution submissions include complete details of any third-party license or other restriction (including, but not limited to, related patents and trademarks) of which you are personally aware and which are associated with any part of Your Contributions.
6. You are not expected to provide support for Your Contributions, except to the extent You desire to provide support. You may provide support for free, for a fee, or not at all. Unless required by applicable law or agreed to in writing, You provide Your Contributions on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.
7. Should You wish to submit work that is not Your original creation, You may submit it to the Company separately from any Contribution, identifying the complete details of its source and of any license or other restriction (including, but not limited to, related patents, trademarks, and license agreements) of which you are personally aware, and conspicuously marking the work as "Submitted on behalf of a third-party: [named here]".
8. You agree to notify the Company of any facts or circumstances of which you become aware that would make these representations inaccurate in any respect.

Please sign: __________________________________ Date: ________________

[Subtask] REST client implementation for Graviton client

Describe the subtask

This issue tracks the work of implementing a REST client for the Graviton client. With this REST client, the Graviton client can issue requests to the Graviton server to manipulate metadata.

Parent issue

#57

[Subtask] Implement the hive client in the HiveCatalog

Describe the subtask

This issue tracks the work of implementing Hive client creation in HiveCatalog.

The Hive client is created by reflection and can support Hive 2 and Hive 3.
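Creating a client by reflection generally means resolving the implementation class by name at runtime, so the calling code does not link against one specific library version. The sketch below shows that general technique with plain Java reflection; ReflectiveLoader is a hypothetical helper, and the issue does not show how HiveCatalog actually does this.

```java
// Sketch of reflective instantiation: the concrete class is chosen by name
// at runtime, so different implementations (for example, one per supported
// library version) can be loaded without a compile-time dependency.
final class ReflectiveLoader {
  static <T> T newInstance(String className, Class<T> expectedType) {
    try {
      Class<?> cls = Class.forName(className);
      // Requires an accessible no-argument constructor.
      Object instance = cls.getDeclaredConstructor().newInstance();
      return expectedType.cast(instance);
    } catch (ReflectiveOperationException e) {
      // Covers missing class, missing constructor, and constructor failures.
      throw new IllegalStateException("Cannot instantiate " + className, e);
    }
  }
}
```

For example, ReflectiveLoader.newInstance("java.util.ArrayList", java.util.List.class) returns a new ArrayList; a class name that cannot be resolved results in an IllegalStateException.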

Parent issue

[Epic] choose and implement one underlying storage (prototype)

Describe the subtask

This issue is a follow-up to #48: implement one underlying storage for Graviton.

A simple candidate would be RocksDB, or we could choose FDB or another distributed storage for performance and fault tolerance.

Parent issue

Task list

[Subtask] Implement Tenant CREATE/GET/DELETE related REST interface

Describe the subtask

This subtask aims to create the Tenant operation (including CREATE, GET, and DELETE) REST APIs for Graviton.

Parent issue

[Subtask] Add Metadata Schema System protobuf Serde support

Describe the subtask

This task aims to add protobuf serialization and deserialization support for the schema system. With this, the system can persist schemas to storage or communicate between servers using gRPC.

Parent issue

[Epic] Client implementation for Graviton

Describe the subtask

This issue tracks the work of implementing the common client for Graviton. With this client, users could manipulate metadata from Graviton to the underlying systems.

Parent issue

#53

Subtasks

[Subtask] Design and implement entity serialization and deserialization interface for storage system

Describe the subtask

This issue aims to track the work of designing entity serde interface for Graviton storage system.

The serde interface and the implementation will be used by the EntityStore module to serialize and deserialize the entity when interacting with the underlying storage system.
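A serde interface of this kind could look roughly like the sketch below: one method converts an entity to bytes for storage and the other restores it. The names EntitySerDeSketch and Utf8StringSerDe are hypothetical, and a plain string stands in for a real entity; the actual interface used by the EntityStore module is not shown in this issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical serde contract between entities and the storage layer.
interface EntitySerDeSketch<T> {
  byte[] serialize(T entity);

  T deserialize(byte[] bytes);
}

// Toy implementation: a String plays the role of the entity and is
// encoded as UTF-8. A real implementation might use protobuf or JSON.
final class Utf8StringSerDe implements EntitySerDeSketch<String> {
  @Override
  public byte[] serialize(String entity) {
    return entity.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public String deserialize(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```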

Parent issue

[Subtask] Refactor the schema system

Describe the subtask

This issue tracks the refactoring of the schema system:

Simplify the current "tenant/lakehouse/zone/table" logical structure.
Reorganize the project structure.
Abstract the common entities and make them inheritable.

Parent issue

#43

[FEATURE] Add Config System for Unified Catalog

Describe the feature

This task aims to add a config system for the project.

Motivation

The config system is a cornerstone of the project, so we should introduce a good one.

Describe the solution

No response

Additional context

No response

[EPIC] Add service packaging and launching process and scripts

Describe the proposal

This epic tracks the work of achieving the packaging and launching process for Graviton. This includes:

Define and implement the process to package the project.
Define and implement the process to publish the project.
Add scripts and folders for launching GravitonServer.
Add docker files for GravitonServer (nice to have).

Task list

MVN Central Repository

https://issues.sonatype.org/browse/OSSRH-93806

test for new project

[Subtask] Implement Lakehouse CREATE/GET/UPDATE/DELETE REST APIs

Describe the subtask

This issue aims to track the work of implementing REST APIs for "Lakehouse" CREATE/GET/UPDATE/DELETE operations.

Parent issue

[EPIC] Design and implement the basic metadata spec

Metadata schema system is the core system in Unified Catalog to describe the entities, including memory structure, on-wire format and storage layout.

This umbrella issue tracks the whole design and implementation of Unified Catalog's schema system, including:

Metadata entity design & implementation.
Entity serialization and deserialization protocol design & implementation.
Entity storage layout design & implementation.

Subtasks are listed here:

Metadata and Type Spec design (lakehouse, zone, table, column and other basic entities), including memory structure, on-wire and storage layout design. #11
Metadata and Type Spec implementation (memory structure). #12
Metadata on-wire protocol implementation. #16
Metadata storage layout implementation. #21