datastrato / gravitino

World's most powerful data catalog service, providing a high-performance, geo-distributed, and federated metadata lake.

Home Page: https://datastrato.ai/docs/

License: Apache License 2.0

Java 92.27% Shell 0.69% Dockerfile 0.17% JavaScript 3.48% CSS 0.01% TypeScript 0.50% Python 2.88%
datalake lakehouse metadata federated-query stratosphere metalake skycomputing data-catalog ai-catalog model-catalog

gravitino's Introduction

Gravitino

Introduction

Gravitino is a high-performance, geo-distributed, and federated metadata lake. It manages the metadata directly in different sources, types, and regions. It also provides users with unified metadata access for data and AI assets.

Gravitino Architecture

Gravitino aims to provide several key features:

  • Single Source of Truth for multi-regional data with geo-distributed architecture support.
  • Unified Data and AI asset management for both users and engines.
  • Security in one place, centralizing the security for different sources.
  • Built-in data management and data access management.

Contributing to Gravitino

Gravitino is open source software available under the Apache 2.0 license. For information on how to contribute to Gravitino please see the Contribution guidelines.

Online documentation

You can find the latest Gravitino documentation in the doc folder. This README file only contains basic setup instructions.

Building Gravitino

You can build Gravitino using Gradle. Currently, you can build Gravitino on Linux and macOS; Windows isn't supported.

To build Gravitino, please run:

./gradlew clean build -x test

To build a distribution package, please run:

./gradlew compileDistribution -x test

To build a compressed distribution package, please run:

./gradlew assembleDistribution -x test

The distribution directory contains the generated binary distribution package.

For the details of building and testing Gravitino, please see How to build Gravitino.

Quick start

Configure and start the Gravitino server

If you already have a binary distribution package, go to the directory of the decompressed package.

Before starting the Gravitino server, please configure the Gravitino server configuration file. The configuration file, gravitino.conf, is in the conf directory and follows the standard property file format. You can modify the configuration within this file.
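
Because gravitino.conf follows the standard property file format, it can be read with java.util.Properties. The sketch below only illustrates the file format: the key example.server.port and its value are hypothetical and are not taken from the Gravitino documentation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Minimal sketch: parse text written in the standard Java property file format.
// The key used in main() is hypothetical, not a documented Gravitino setting.
class ConfSketch {
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      // Reading from an in-memory string should not fail; rethrow unchecked.
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) {
    String conf = "# comment lines start with '#'\n"
        + "example.server.port = 8090\n";
    Properties props = parse(conf);
    // Values are always strings; callers convert them as needed.
    int port = Integer.parseInt(props.getProperty("example.server.port"));
    System.out.println(port);
  }
}
```

Whitespace around the separator is ignored, so "key = value" and "key=value" load identically.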

To start the Gravitino server, please run:

./bin/gravitino.sh start

To stop the Gravitino server, please run:

./bin/gravitino.sh stop

Using Trino with Gravitino

Gravitino provides a Trino connector to access the metadata in Gravitino. To use Trino with Gravitino, please follow the trino-gravitino-connector doc.

Development guide

  1. How to build Gravitino
  2. How to test Gravitino
  3. How to publish Docker images

License

Gravitino is licensed under the Apache License, Version 2.0. See the LICENSE file for details.

Apache®, Apache Hadoop®, Apache Hive™, Apache Iceberg™, Apache Kafka®, Apache Spark™, Apache Submarine™, Apache Thrift™ and Apache Zeppelin™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

gravitino's People

Contributors

ajayl83, caican00, ch3yne, charliecheng630, clearvive, coolderli, diqiu50, fanng1, frankyang0529, henrybear327, hiirrxnn, jerryshao, justinmclean, lauraxia123, lw-yang, mchades, noidname01, pan3793, qqqttt123, shaofengshi, stenicholas, unknowntpo, xiaozcy, xloya, xunliu, yijhenlin, yuqi1129, yxac, zhoukangcn, zivali

gravitino's Issues

[Improvement] Pick a better name for the project

What would you like to be improved?

Usually, a successful project should have an outstanding name from the beginning:

  1. It should be easy to discover (for example, through a web search) and easy to remember, so it is usually a single word (not a combination of words) and not a word that is rarely used.
  2. It should be meaningful and should usually hint at what the project intends to do.
  3. It should not conflict with well-known projects, such as Hadoop, Spark, Flink, Kubernetes, etc.

How should we improve?

No response

[Improvement] Add https support for Jetty Server

What would you like to be improved?

Currently, we have only implemented the HTTP ServerConnector for the embedded Jetty server. It would be better to add HTTPS support for security and completeness.

How should we improve?

Simply adding a Jetty HTTPS ServerConnector should be enough.

[EPIC] Geo-distributed architecture support for Graviton

Describe the proposal

This issue tracks the work of designing and implementing a geo-distributed HA architecture for Graviton. The owner of this epic should produce a detailed design and break it down into subtasks.

Task list

No response

[EPIC] Extending to add more heterogeneous catalog support

Describe the proposal

This issue tracks the work of extending the catalog capability to support more heterogeneous catalogs, such as a file catalog, an ML catalog, and so on. The epic should have a detailed design and be broken down into subtasks.

Task list

[EPIC] Design and implement hive metadata catalog

Describe the proposal

The Hive metadata connector is the core system for the Hive data source in Graviton; it describes how the connector is created and run.

This umbrella issue tracks the whole design and implementation of Graviton's Hive metadata connector (not data connector) system, including:

  1. Maintaining databases in Hive using the namespace SPI.
  2. Maintaining tables in Hive using the namespace SPI.
  3. Further items to be added.

Task list

Implement Hive Metastore connector

Detailed comments are to be added by the assignee.
Note: this is an umbrella issue and needs to be broken into sub-issues.
A detailed design doc is also required.

[Subtask] Implement the on-wire protocol for Schema System

Describe the subtask

This task aims to implement the on-wire protocol for the schema system. As specified in the design doc, we will use the JSON format for the on-wire protocol and implement JSON serialization/deserialization for the schema system.

Parent issue

#3

[EPIC] Graviton Query Engine Adaption

Describe the proposal

This epic tracks the work of adapting the Graviton catalog's capabilities to specific query engines. With this, a query engine can use Graviton to manage the metadata of the underlying systems.

Task list

[FEATURE] Adding new interfaces to manipulate the metadata entities

Describe the feature

This issue aims to design and implement the new interfaces for metadata manipulation, including GET, UPDATE, CREATE, and DELETE of metadata.

  1. The issue and the related PR focus only on defining the interfaces, not on specific implementations.
  2. The interfaces are used internally and will not be exposed to end users.
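
As a rough illustration of what such an internal interface could look like, the sketch below declares the four operations on a generic entity type and pairs them with a throwaway in-memory implementation. The names MetadataOps and InMemoryMetadataOps, and the choice to key entities by a string name, are assumptions made for this example and do not come from the Graviton code base.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

// Hypothetical internal interface covering GET, CREATE, UPDATE and DELETE.
interface MetadataOps<E> {
  Optional<E> get(String name);

  void create(String name, E entity); // throws if the name already exists

  E update(String name, UnaryOperator<E> change); // throws if the name is missing

  boolean delete(String name); // true if an entity was removed
}

// One possible implementation, kept in memory purely for illustration.
// Entities are assumed to be non-null.
class InMemoryMetadataOps<E> implements MetadataOps<E> {
  private final Map<String, E> entities = new HashMap<>();

  @Override
  public Optional<E> get(String name) {
    return Optional.ofNullable(entities.get(name));
  }

  @Override
  public void create(String name, E entity) {
    if (entities.putIfAbsent(name, entity) != null) {
      throw new IllegalStateException("already exists: " + name);
    }
  }

  @Override
  public E update(String name, UnaryOperator<E> change) {
    E current = entities.get(name);
    if (current == null) {
      throw new IllegalStateException("not found: " + name);
    }
    E updated = change.apply(current);
    entities.put(name, updated);
    return updated;
  }

  @Override
  public boolean delete(String name) {
    return entities.remove(name) != null;
  }
}
```

Callers depend only on MetadataOps, so the in-memory class can be swapped for another implementation without touching them.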

Motivation

Defining metadata manipulation interfaces decouples the interface from any specific implementation. These interfaces will be used by the Graviton Server.

Describe the solution

No response

Additional context

No response

[Improvement] Refactor the gradle build file to unify all the dependencies

What would you like to be improved?

Currently, dependencies are defined across multiple build files, which makes dependency management quite messy. We should refactor the current build files to manage all the dependencies in a single place.

How should we improve?

One solution is to introduce a version lock mechanism like the one Iceberg uses. However, the Palantir version lock has many limitations; one of the big problems is that it supports only one version per dependency, so it is hard to extend it to support multiple versions in one project.

Another solution is to introduce a Spring plugin to manage dependency versions in one place.

[EPIC] Redesign Graviton's Schema and Connector System

Describe the proposal

Graviton's current schema system follows a "lakehouse/zone/table" logical structure, while the physical tables it maps to follow a different structure, "metasource/db/table". It is quite confusing for users to understand these two metadata structures, and the metadata is also hard to manage. So we have decided to redesign the schema system to make it simpler and easier to understand.

The detailed reasoning can be found here (https://docs.google.com/document/d/1Wrd9HHJF2wLDhsvKix1VVKWLi_RdDhw-X3rqg-4lwtw/edit#heading=h.4adypg9kzjky).

Task list

[Subtask] Add metadata connector plugin framework

Describe the subtask

This task aims to implement the metadata connector plugin framework. As specified in the design doc, we will implement a plug-in factory through which every metadata connector type is created and run.
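
A plug-in factory of this kind can be sketched as a registry that maps a connector type name to a constructor. The example below is a minimal sketch under that assumption; MetadataConnector, ConnectorFactory, and the type name "hive" are hypothetical names chosen for illustration, not the design described in the design doc.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical connector contract; a real one would expose metadata operations.
interface MetadataConnector {
  String describe();
}

// Plug-in factory: connector types register a constructor under a type name,
// and callers create instances by name without knowing the concrete class.
class ConnectorFactory {
  private final Map<String, Supplier<MetadataConnector>> registry =
      new ConcurrentHashMap<>();

  void register(String type, Supplier<MetadataConnector> constructor) {
    if (registry.putIfAbsent(type, constructor) != null) {
      throw new IllegalArgumentException("duplicate connector type: " + type);
    }
  }

  MetadataConnector create(String type) {
    Supplier<MetadataConnector> constructor = registry.get(type);
    if (constructor == null) {
      throw new IllegalArgumentException("unknown connector type: " + type);
    }
    return constructor.get();
  }
}
```

A connector would be registered once, for example with factory.register("hive", HiveConnector::new) where HiveConnector is a hypothetical class, and then created by name. In Java, java.util.ServiceLoader is a common way to discover such registrations on the classpath.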

Parent issue

#7

[Improvement] Refactor `Lakehouse` terminology to `Metalake` in Graviton

What would you like to be improved?

Currently, we use the term Lakehouse to represent the top-level repository for metadata management. In fact, we only manage the metadata, so Lakehouse is not an accurate term; instead, I propose Metalake to represent the metadata lake.

How should we improve?

No response

[Subtask] Client API interface design for Graviton

Describe the subtask

This issue tracks the work of designing the client API interface for Graviton. The client API is mainly used by users to interact with Graviton to manipulate metadata. It is also the common client that will later be adapted to query engines such as Spark and Trino.

Parent issue

#53

[EPIC] Design and implement REST API for metadata CRUD

This epic aims to design and implement the REST API framework for Graviton.

  1. The REST API design follows Microsoft's REST API design specs (https://learn.microsoft.com/en-us/azure/architecture/best-practices/api-design#define-api-operations-in-terms-of-http-methods).
  2. In addition, we use a media-type versioning mechanism to control the API version.
  3. We use Jersey for the REST API implementation. We chose Jersey because it is much more lightweight than Spring Boot, and we don't want things to get out of control with the many injection mechanisms that Spring Boot introduces.
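
With media-type versioning, the API version travels in the Accept or Content-Type header instead of the URL path. The sketch below shows one way a server could extract the version from such a header value. The vendor media type application/vnd.example.v1+json is a placeholder: the issue does not state the actual media type string.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of media-type versioning. The vendor name "example" is a placeholder,
// not the media type used by the project.
class MediaTypeVersion {
  private static final Pattern VERSIONED =
      Pattern.compile("application/vnd\\.example\\.v(\\d+)\\+json");

  // Returns the version number, or empty if the media type is not versioned.
  static OptionalInt parse(String mediaType) {
    Matcher m = VERSIONED.matcher(mediaType.trim());
    return m.matches()
        ? OptionalInt.of(Integer.parseInt(m.group(1)))
        : OptionalInt.empty();
  }
}
```

A request handler could use the parsed number to pick the matching response format, and reject requests whose media type carries no recognized version.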

Subtasks are:

[EPIC] Design and implement generic connector interface

The metadata connector system is the core system in Graviton that describes how a connector is created and run, including the SPI (service provider interface), the field type mapping for different data sources, and the memory structure.

This umbrella issue tracks the whole design and implementation of Graviton's metadata connector (not data connector) system, including:

  1. Connector factory design & implementation.
  2. Connector metadata-extraction job scheduling design & implementation.
  3. Connector plug-in framework.
  4. Unified metadata type definition.
  5. Type converter design & implementation.
  6. Metadata multi-version design & implementation.
  7. Typical connector design & implementation.

Subtasks are listed here:

[Improvement] Add cors filter for Jetty server

What would you like to be improved?

To improve security and avoid attacks on our embedded server, we should add CORS filter support to our Jetty server.

How should we improve?

Implementing a CORS filter and registering it with Jetty should be enough.

[Subtask] Design metadata storage interface

Describe the subtask

This issue tracks the work of designing a metadata storage interface for Graviton.

  • The interface should support storing entities to the underlying system.
  • The interface should be generic enough to support different underlying storage (like kv store, relational DB, or others).
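
One way to meet both requirements is to split the design into a narrow backend contract and an entity-facing store that handles serialization. The sketch below assumes that split; StorageBackend, InMemoryBackend, and EntityStore are hypothetical names, and the in-memory map merely stands in for a KV store or relational database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical backend contract: anything that can map a key to bytes
// (a KV store, a relational table, or others) can sit behind it.
interface StorageBackend {
  void put(String key, byte[] value);

  byte[] get(String key); // null when the key is absent

  boolean delete(String key);
}

// In-memory backend used only to keep the sketch runnable.
class InMemoryBackend implements StorageBackend {
  private final Map<String, byte[]> data = new HashMap<>();

  @Override
  public void put(String key, byte[] value) {
    data.put(key, value.clone());
  }

  @Override
  public byte[] get(String key) {
    byte[] value = data.get(key);
    return value == null ? null : value.clone();
  }

  @Override
  public boolean delete(String key) {
    return data.remove(key) != null;
  }
}

// Entity-facing store: serializes entities and delegates to the backend.
class EntityStore<E> {
  private final StorageBackend backend;
  private final Function<E, byte[]> serializer;
  private final Function<byte[], E> deserializer;

  EntityStore(StorageBackend backend,
              Function<E, byte[]> serializer,
              Function<byte[], E> deserializer) {
    this.backend = backend;
    this.serializer = serializer;
    this.deserializer = deserializer;
  }

  void store(String key, E entity) {
    backend.put(key, serializer.apply(entity));
  }

  Optional<E> load(String key) {
    byte[] raw = backend.get(key);
    return raw == null ? Optional.empty() : Optional.of(deserializer.apply(raw));
  }

  boolean remove(String key) {
    return backend.delete(key);
  }
}
```

Because EntityStore only sees the StorageBackend interface, changing the underlying storage means supplying a different backend, not changing entity code.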

Parent issue

#4

Have contributors sign ICLAs

To donate a project to the ASF we either need a Software grant and ICLA from contributors who don't work for Datastrato or ICLAs from everyone. Setting up a manual or automated ICLA system would make the donation process easier.

Suggested ICLA text based on v2.2 of the ASF's ICLA documents - I changed Foundation to company and removed mention of non-profit and CCLAs. I also simplified the form a little and added a GitHub id as sometimes matching GitHub names to real names also causes issues. We can start with a manual process first.

Datastrato Individual Contributor License Agreement

This document is based on the Apache ICLA V2.2.

Thank you for your interest in Datastrato (the "Company"). To clarify the intellectual property license granted with Contributions from any person or entity, the Company must have on file a signed Individual Contributor License Agreement ("ICLA") from each Contributor, indicating agreement with the license terms below. This agreement is for your protection as a Contributor as well as the protection of the Company and its users. It does not change your rights to use your own Contributions for any other purpose.

Please complete and sign this Agreement, and then email a copy to [email protected].

Read this document carefully before signing and keep a copy for your records.
GitHub Name: ____________________________________________________
Full Name: ______________________________________________________
Postal Address: _________________________________________________
_________________________________________________
Country: _________________________________________________
E-Mail: ______________________________________________________

You accept and agree to the following terms and conditions for Your Contributions (present and future) that you submit to the Company. In return, the Company shall not use Your Contributions in a way that is contrary to open source in effect at the time of the Contribution. Except for the license granted herein to the Company and recipients of software distributed by the Company, You reserve all right, title, and interest in and to Your Contributions.

  1. Definitions.
    "You" (or "Your") shall mean the copyright owner or legal entity authorized by the copyright owner that is making this Agreement with the Company. For legal entities, the entity making a contribution and all other entities that control, are controlled by, or are under common control with that entity are considered to be a single Contributor. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "Contribution" shall mean any original work of authorship, including any modifications or additions to an existing work, that is intentionally submitted by You to the Company for inclusion in, or documentation of, any of the products owned or managed by the Company (the "Work"). For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Company or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Company for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by You as "Not a Contribution."

  2. Grant of Copyright License. Subject to the terms and conditions of this Agreement, You hereby grant to the Company and to recipients of software distributed by the Company a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.

  3. Grant of Patent License. Subject to the terms and conditions of this Agreement, You hereby grant to the Company and to recipients of software distributed by the Company a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution(s) alone or by combination of Your Contribution(s) with the Work to which such Contribution(s) was submitted. If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that your Contribution, or the Work to which you have contributed, constitutes direct or contributory patent infringement, then any patent licenses granted to that entity under this Agreement for that Contribution or Work shall terminate as of the date such litigation is filed.

  4. You represent that you are legally entitled to grant the above license. If your employer(s) has rights to intellectual property that you create that includes your Contributions, you represent that you have received permission to make Contributions on behalf of that employer, or that your employer has waived such rights for your Contributions to the Company.

  5. You represent that each of Your Contributions is Your original creation (see section 7 for submissions on behalf of others). You represent that Your Contribution submissions include complete details of any third-party license or other restriction (including, but not limited to, related patents and trademarks) of which you are personally aware and which are associated with any part of Your Contributions.

  6. You are not expected to provide support for Your Contributions, except to the extent You desire to provide support. You may provide support for free, for a fee, or not at all. Unless required by applicable law or agreed to in writing, You provide Your Contributions on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE.

  7. Should You wish to submit work that is not Your original creation, You may submit it to the Company separately from any Contribution, identifying the complete details of its source and of any license or other restriction (including, but not limited to, related patents, trademarks, and license agreements) of which you are personally aware, and conspicuously marking the work as "Submitted on behalf of a third-party: [named here]".

  8. You agree to notify the Company of any facts or circumstances of which you become aware that would make these representations inaccurate in any respect.

Please sign: __________________________________ Date: ________________

[Subtask] Refactor the schema system

Describe the subtask

This issue tracks the refactoring of the schema system:

  1. Simplify the current "tenant/lakehouse/zone/table" logical structure.
  2. Reorganize the project structure.
  3. Abstract the common entities and make them inheritable.

Parent issue

#43

[FEATURE] Add Config System for Unified Catalog

Describe the feature

This task aims to add a config system for the project.

Motivation

The config system is a cornerstone of the project, so we should introduce a good one.

Describe the solution

No response

Additional context

No response

[EPIC] Add service packaging and launching process and scripts

Describe the proposal

This epic tracks the work of defining the packaging and launching process for Graviton. This includes:

  1. Define and implement the process to package the project.
  2. Define and implement the process to publish the project.
  3. Add scripts and folders for launching GravitonServer.
  4. Add docker files for GravitonServer (nice to have).

Task list

MVN Central Repository

[EPIC] Design and implement the basic metadata spec

The metadata schema system is the core system in Unified Catalog that describes the entities, including memory structure, on-wire format, and storage layout.

This umbrella issue tracks the whole design and implementation of Unified Catalog's schema system, including:

  1. Metadata entity design & implementation.
  2. Entity serialization and deserialization protocol design & implementation.
  3. Entity storage layout design & implementation.

Subtasks are listed here:

  • Metadata and Type Spec design (lakehouse, zone, table, column, and other basic entities), including memory structure, on-wire, and storage layout design. #11
  • Metadata and Type Spec implementation (memory structure). #12
  • Metadata on-wire protocol implementation. #16
  • Metadata storage layout implementation. #21
