projectnessie / nessie
Nessie: Transactional Catalog for Data Lakes with Git-like semantics
Home Page: https://projectnessie.org
License: Apache License 2.0
To support tools that can't work directly with Delta Lake and Nessie, enable manifest exports to be written automatically on commits to identified branches.
We will use https://github.com/projectnessie/projectnessie.github.io/ for storage of the resulting assets.
Currently we can only read/write Iceberg from the Nessie writer.
We should look at bridging the V1-based Delta writer to the V2 Nessie writer, and look at how Hive support might work.
While testing, I realized that the Dynamo backing store initialization within Quarkus was broken. We need to get integration tests in place that actually test the stores in the server, not just independently.
Currently Delta accepts only filesystem paths or Hive table names as valid Delta tables. This makes it hard to pass branch or hash information to the NessieLogStore. We have to modify Nessie to accept table names in the format <tablename>@<branch>#<hash>.
This likely means we have to investigate how the DeltaTables are cached, as they are currently cached by path only.
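A minimal sketch of what parsing the proposed <tablename>@<branch>#<hash> identifier could look like. The exact grammar (which characters are allowed in each part, whether branch and hash are optional) is an assumption, not a decided format:

```python
import re

# Hypothetical parser for the proposed identifier format
# <tablename>@<branch>#<hash>; branch and hash are treated as optional.
_TABLE_REF = re.compile(
    r"^(?P<table>[^@#]+)"          # table name or path
    r"(?:@(?P<branch>[^#]+))?"     # optional branch
    r"(?:#(?P<hash>[0-9a-f]+))?$"  # optional commit hash
)

def parse_table_ref(ref: str):
    """Split a Delta table identifier into (table, branch, hash)."""
    m = _TABLE_REF.match(ref)
    if m is None:
        raise ValueError(f"invalid table reference: {ref!r}")
    return m.group("table"), m.group("branch"), m.group("hash")
```

A plain path would parse as a table with no branch or hash, which keeps backward compatibility with the existing path-only caching.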
When a new GitHub release is created, automatically publish Maven artifacts to GitHub and eventually to OSSRH repositories for general consumption.
Perf tests should be re-run using the new algorithm/endpoints to prove the speed-up.
Note this may require tweaking the tracing settings on the client.
We plan to implement GC in a few months. However, we should make sure that any basic object changes are done before then, so we avoid a situation where GC can't be run until objects are rewritten. Writing up the design now should prevent that from happening.
There are multiple use cases where being able to filter by object key type is useful. For example, if I want to list all Hive tables within a database that also contains Delta Lake tables, I should be able to filter that without having to read the current value for each key. My current thinking is this could be optimized by an extension whereby we save a 2- or 4-byte attribute along with each key and value. This would be used for storing things like object type, to allow efficient getKeys()-like operations for subsets of objects. The API changes would be independent of #81, but the efficient implementation would depend on that issue/change.
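A sketch of the idea above: store a small type tag next to each key so a filtered listing never has to load values. The tag values and the in-memory store layout here are illustrative assumptions, not the DynamoVersionStore schema:

```python
from enum import IntEnum

class ObjectType(IntEnum):  # small enough to fit a 2-byte attribute
    ICEBERG_TABLE = 1
    DELTA_TABLE = 2
    HIVE_TABLE = 3
    VIEW = 4

# key -> (type_tag, value); only the tag is touched for filtered listings
store = {
    ("db1", "orders"): (ObjectType.HIVE_TABLE, "..."),
    ("db1", "events"): (ObjectType.DELTA_TABLE, "..."),
    ("db1", "daily"):  (ObjectType.VIEW, "..."),
}

def get_keys(store, object_type=None):
    """List keys, optionally filtered by type, without reading values."""
    return [key for key, (tag, _value) in store.items()
            if object_type is None or tag == object_type]
```

The point of the tag is that a "list Hive tables in db1" query only inspects the fixed-size attribute, never the (potentially large) serialized value.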
The list of required packages is present in at least two places, setup.py and requirements.txt, both being somewhat authoritative.
We should check whether we can keep them in sync somehow.
See comment by @rymurr in #165 (comment) for context
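One common way to avoid the duplication is to have setup.py parse requirements.txt instead of repeating the list. A minimal sketch (the file layout is an assumption about this repo, not its actual setup):

```python
from pathlib import Path

def read_requirements(path: str = "requirements.txt"):
    """Return non-comment, non-empty lines from a pip requirements file."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]

# In setup.py:
# setup(..., install_requires=read_requirements())
```

This keeps requirements.txt as the single source of truth, at the cost of setup.py only working from a source checkout that includes the file.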
getKeys() is an extremely expensive operation in the Dynamo version store. Just to get the keys (not the values) will likely require:
1x L1 retrieval
151x L2 retrievals
199x L3 retrievals
This is ~30,050 records we have to get from DynamoDB. If we assume we use BatchGetItem, we can retrieve 100 records at a time. This still equates to ~300 simultaneous Dynamo requests (and substantial read units consumed).
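As a sanity check, the request count follows directly from DynamoDB's 100-item BatchGetItem limit (the record total is taken from the issue text):

```python
import math

records = 30_050  # total records cited above
batch_size = 100  # BatchGetItem returns at most 100 items per call
requests = math.ceil(records / batch_size)
# 301 batched requests, i.e. the ~300 figure cited above
```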
Two possible paths for improvement:
We'll also have to think through the garbage collection dynamics of such a system.
Hi,
Thanks for the great work on nessie, really looking forward to this project.
Hoping you can help clarify the intended setup for running Nessie on localhost...
$ docker run -p 19120:19120 projectnessie/nessie
...
2020-10-06 01:29:26,472 INFO [io.quarkus] (main) nessie-quarkus 0.1-SNAPSHOT native (powered by Quarkus 1.8.1.Final) started in 0.030s. Listening on: http://0.0.0.0:19120
...
$ pip install pynessie
...
Successfully installed...
...
$ nessie create-branch my_branch
pynessie.error.NessiePermissionException: Not permissioned to view entity at : 403 Client Error: Forbidden for url: http://localhost:19120/api/v1/trees/branch/my_branch
Is the "Forbidden" error a result of not having logged into the UI at localhost? http://localhost:19120 redirects to http://localhost:19120/login but it is unclear where these credentials are actually set?
Does something need to be defined in ~/.config/nessie/config.yaml before this will work?
Thanks again.
Add a native image build (Linux) to the project.
Iceberg snapshots map well to Nessie commits. We should evaluate the creation of a tool which will scan one or more Iceberg tables and import the snapshots from all tables, in approximate time order, into a Nessie database. This will allow new operations on data that existed before Nessie.
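The core of such a tool is merging per-table snapshot histories into one approximately time-ordered stream. A hedged sketch of that merge step, with made-up snapshot tuples rather than real Iceberg/Nessie API calls:

```python
def time_ordered_commits(tables):
    """Merge snapshots from several tables into one time-ordered stream.

    tables: {table_name: [(timestamp_ms, snapshot_id), ...]}
    Returns (timestamp_ms, table_name, snapshot_id) tuples sorted by time,
    i.e. the order in which commits would be replayed into Nessie.
    """
    events = [(ts, name, snap)
              for name, snaps in tables.items()
              for ts, snap in snaps]
    return sorted(events)
```

Ordering is only approximate across tables because snapshot timestamps from different writers are not globally consistent, which matches the "approximate time order" caveat above.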
Come up with an updated information architecture for the site.
https://github.com/projectnessie/projectnessie.github.io holds the output of mkdocs build run on /site/.
It looks like with Maven 3.6.2 you can grab the environment variables in the toolchain file. We need to figure out how to configure the toolchain with an action used in the build workflow, and then trigger the tests based on that.
I think we should probably adopt some naming rules for branches and tags. Hive only allows the following for database names: [a-zA-Z_0-9]+
I'm not sure whether Git enforces this, but it feels like it would be best to keep things readable. (I'm not sure if that actually means constraining to Latin characters, but at least avoiding the null character seems like a good idea...)
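A validation helper for the Hive-compatible character set mentioned above could be as simple as the following; the exact rule Nessie ends up adopting may of course differ:

```python
import re

# Hive-compatible name rule discussed above: letters, digits, underscore.
VALID_REF_NAME = re.compile(r"^[a-zA-Z0-9_]+$")

def is_valid_ref_name(name: str) -> bool:
    """True if a branch/tag name uses only [a-zA-Z0-9_] characters."""
    return bool(VALID_REF_NAME.match(name))
```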
Update the Hive Catalog to be able to route to the Nessie Catalog if the Hive pointer is a Nessie pointer.
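The routing itself reduces to a pointer-type check. A hypothetical sketch, where the "nessie:" scheme is purely an assumption about how a Nessie pointer might be recognized (the real detection mechanism is undecided):

```python
def resolve_catalog(table_pointer: str) -> str:
    """Route a Hive table pointer to the catalog that should handle it.

    Assumption: Nessie-managed tables are marked with a "nessie:" scheme;
    everything else falls through to the plain Hive catalog.
    """
    return "nessie" if table_pointer.startswith("nessie:") else "hive"
```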
Currently test coverage is low in Python and relies on mocking.
I would like to use something like pytest-vcr to record actual REST responses and test against those.
The Spark3 catalog, especially the Nessie one, is a bit squishy at the moment and needs a lot of work: the nessie-spark3 module.
The AWS SDK supports two HTTP client types: the Apache one and the url-connection one. The Apache one is the default but has a slow startup time. We should make sure the url-connection one is used in the native images.
https://docs.aws.amazon.com/sdk-for-java/v2/developer-guide/client-configuration-starttime.html
REST APIs are very complicated right now:
We should:
Right now, serialized values are restricted to a relatively small size in DynamoVersionStore (or they should be). We need to enhance the store interface to support an extended value that is stream-based and can be read independently of the base value. This should provide a relatively clean interface for working with larger objects.
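An illustrative sketch of the two-part shape this could take: a small inline portion stored with the base value, plus extended chunks that can be streamed and read separately. The names and the size threshold are assumptions, not the store interface:

```python
INLINE_LIMIT = 256 * 1024  # bytes kept inline with the base value (assumed)

def split_value(data: bytes, chunk_size: int = INLINE_LIMIT):
    """Return (inline_part, [extended_chunks]) for a serialized value."""
    inline, rest = data[:chunk_size], data[chunk_size:]
    chunks = [rest[i:i + chunk_size] for i in range(0, len(rest), chunk_size)]
    return inline, chunks

def join_value(inline: bytes, chunks) -> bytes:
    """Reassemble the original value from its parts."""
    return inline + b"".join(chunks)
```

Small values round-trip with an empty chunk list, so callers that never exceed the inline limit are unaffected.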
Make JSON output optional, e.g. via a --json flag.
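A minimal sketch of an opt-in --json flag, using argparse for illustration (the real pynessie CLI may wire this up differently, and the branch listing here is placeholder data):

```python
import argparse
import json

def render_branches(branches, as_json: bool) -> str:
    """Format a branch listing as JSON or as plain lines."""
    return json.dumps(branches) if as_json else "\n".join(branches)

def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="nessie")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable JSON output")
    args = parser.parse_args(argv)
    return render_branches(["main", "dev"], args.json)  # placeholder data
```

Keeping the human-readable format as the default means existing scripts that scrape the plain output keep working.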