nessie's People

Contributors

adutra, ajantha-bhat, c-thiel, dalaro, dependabot-preview[bot], dependabot[bot], dimas-b, harshm-dev, highstead, jacques-n, kamalbhavyam, kerossin, kylep-dremio, laurentgo, maheshsapkal, naren-dremio, nastra, nk1506, omarsmak, pratyakshsharma, pyup-bot, renovate[bot], ryantse, rymurr, snazy, stevelorddremio, tiwalter, tomekl007, vladimiryushkevich, xn137

nessie's Issues

Spark2 Reader/Writer Support

Currently we can only read/write to Iceberg from the Nessie writer.

We should look at bridging the V1-based Delta writer to the V2 Nessie writer, and look at how Hive support might work.

Modify Delta to accept custom table names

Currently Delta accepts only paths on a filesystem or Hive table names as valid Delta tables. This makes it hard to pass branch or hash information to the NessieLogStore. We have to modify Nessie to accept table names in the format <tablename>@<branch>#<hash>.

This likely means we have to investigate how the DeltaTables are cached, as they are currently cached by path only.
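
As a rough illustration of the proposed format, the sketch below parses an identifier of the form <tablename>@<branch>#<hash> into its parts. The class, method names, and regular expression are hypothetical and only show the shape of the change, not Nessie's or Delta's actual implementation.

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for identifiers of the form <tablename>@<branch>#<hash>.
// The branch and hash parts are optional, so a plain table name still parses.
public final class TableIdentifier {

  private static final Pattern PATTERN =
      Pattern.compile("([^@#]+)(?:@([^#]+))?(?:#([0-9a-fA-F]+))?");

  private final String name;
  private final Optional<String> branch;
  private final Optional<String> hash;

  private TableIdentifier(String name, Optional<String> branch, Optional<String> hash) {
    this.name = name;
    this.branch = branch;
    this.hash = hash;
  }

  public static TableIdentifier parse(String identifier) {
    Matcher m = PATTERN.matcher(identifier);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a valid table identifier: " + identifier);
    }
    return new TableIdentifier(m.group(1), Optional.ofNullable(m.group(2)), Optional.ofNullable(m.group(3)));
  }

  public String name() { return name; }

  public Optional<String> branch() { return branch; }

  public Optional<String> hash() { return hash; }
}

For example, TableIdentifier.parse("sales@dev#1a2b3c") would yield the name "sales", branch "dev" and hash "1a2b3c", while TableIdentifier.parse("sales") leaves branch and hash empty.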

Re-run perf tests

Perf tests should be re-run using new algorithm/endpoints to prove speed-up.

Note that this may require tweaking the tracing settings on the client.

Update VersionStore to store a small value associated with each key

There are multiple use cases where being able to filter by object key type is useful. For example, if I want to list all Hive tables within a database that also contains Delta Lake tables, I should be able to apply that filter without having to read the current value for each key. My current thinking is this could be optimized by an extension whereby we save a 2- or 4-byte attribute along with each key and value. This would be used for storing things like the object type, allowing efficient getKeys()-like operations over subsets of keys. The API changes would be independent of #81, but the efficient implementation would depend on that issue/change.
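
Purely as a sketch of what such an extension could look like, the interface below attaches a small payload to each key returned by a getKeys()-style call. The interface, wrapper class, and payload codes are hypothetical and are not the actual VersionStore API.

import java.util.stream.Stream;

// Hypothetical key-listing extension: each key carries a small fixed-size payload
// (for example an object-type code) stored next to it, so callers can filter by
// type without loading the value for every key.
interface KeyListing<KEY, REF> {

  // A key plus the small attribute stored alongside it.
  final class KeyWithPayload<K> {
    final K key;
    final short payload; // 2 bytes; e.g. 1 = Iceberg table, 2 = Delta Lake table, 3 = Hive view

    KeyWithPayload(K key, short payload) {
      this.key = key;
      this.payload = payload;
    }
  }

  // Stream all keys on a ref together with their payloads.
  Stream<KeyWithPayload<KEY>> getKeys(REF ref);
}

A caller listing only Hive tables would then do something like getKeys(branch).filter(k -> k.payload == HIVE_TABLE), where HIVE_TABLE is a hypothetical type code, without ever touching the stored values.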

Implement getKeys() for DynamoDB versioned in reasonably efficient manner

getKeys() is an extremely expensive operation in the DynamoDB VersionStore. Just getting the keys (not the values) will likely require:

1x L1 retrieval
151x L2 retrievals
199x L3 retrievals per L2

That is ~30,050 records we have to get from DynamoDB (151 × 199 ≈ 30,050 L3 reads alone). If we assume we use BatchGetItem, we can retrieve 100 records at a time, which still equates to roughly 300 simultaneous DynamoDB requests (and substantial read units consumed).

Two possible paths for improvement:

  • If we relax the requirement so that getKeys() is eventually consistent (or possibly only mostly correct), we should be able to create eventually consistent L2 key lists: occasionally create an eventually consistent index of values for each L1 and then patch it with subsequent changes.
  • We can explore doing a snapshot + incremental version of this operation where every n L1s we create a list for each L2. Then we look up all deltas between the snapshot and the current state.

We'll also have to think through the garbage collection dynamics of such a system.
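
A very rough sketch of the snapshot-plus-incremental idea follows; every type and method name here is hypothetical, only the read path is shown (materializing the periodic snapshots and garbage-collecting them is omitted), and Java records are used for brevity.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical snapshot + incremental key listing: a complete key list is materialized
// every N commits; a read loads the latest snapshot at or before the requested commit
// and replays only the key additions/removals of the commits made after it.
final class IncrementalKeyList {

  record KeySnapshot(String commitHash, Set<String> keys) {}

  record KeyDelta(Set<String> added, Set<String> removed) {}

  interface SnapshotStore {
    // Most recent materialized key list at or before the given commit.
    KeySnapshot latestSnapshotBefore(String commitHash);

    // Key-level deltas for the commits after the snapshot, up to and including the commit.
    List<KeyDelta> deltasBetween(String snapshotHash, String commitHash);
  }

  static Set<String> keysAt(SnapshotStore store, String commitHash) {
    KeySnapshot snapshot = store.latestSnapshotBefore(commitHash);
    Set<String> keys = new HashSet<>(snapshot.keys());
    for (KeyDelta delta : store.deltasBetween(snapshot.commitHash(), commitHash)) {
      keys.removeAll(delta.removed());
      keys.addAll(delta.added());
    }
    return keys;
  }
}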

How to use nessie from localhost?

Hi,

Thanks for the great work on nessie, really looking forward to this project.

Hoping you can help clarify the intent of using nessie on localhost....

$ docker run -p 19120:19120 projectnessie/nessie
...
2020-10-06 01:29:26,472 INFO  [io.quarkus] (main) nessie-quarkus 0.1-SNAPSHOT native (powered by Quarkus 1.8.1.Final) started in 0.030s. Listening on: http://0.0.0.0:19120
...
$ pip install pynessie
...
Successfully installed...
...

$ nessie create-branch my_branch
pynessie.error.NessiePermissionException: Not permissioned to view entity at : 403 Client Error: Forbidden for url: http://localhost:19120/api/v1/trees/branch/my_branch

Is the "Forbidden" error a result of not having logged into the UI at localhost? http://localhost:19120 redirects to http://localhost:19120/login but it is unclear where these credentials are actually set?

Does something need to be defined in ~/.config/nessie/config.yaml before this will work?

Thanks again.

Create a tool that imports Iceberg snapshots

Iceberg snapshots map well to Nessie commits. We should evaluate the creation of a tool which scans one or more Iceberg tables and imports the snapshots from all tables, in approximate time order, into a Nessie database. This would allow Nessie operations over data that existed before Nessie.
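
A rough sketch of the scanning half of such a tool is shown below, using the Iceberg Java API (Table.snapshots() and Snapshot.timestampMillis()). The class, record, and method names are illustrative, and turning each collected snapshot into a Nessie commit is deliberately left out.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

// Sketch: gather the snapshots of all source tables and order them by commit
// timestamp so they can be replayed as Nessie commits in approximate time order.
public final class SnapshotScanner {

  // A snapshot paired with the name of the table it came from.
  public record TableSnapshot(String tableName, Snapshot snapshot) {}

  public static List<TableSnapshot> snapshotsInTimeOrder(Map<String, Table> tables) {
    List<TableSnapshot> all = new ArrayList<>();
    tables.forEach((name, table) ->
        table.snapshots().forEach(s -> all.add(new TableSnapshot(name, s))));
    // Iceberg snapshots carry the commit time in milliseconds since the epoch.
    all.sort((a, b) -> Long.compare(a.snapshot().timestampMillis(), b.snapshot().timestampMillis()));
    return all;
  }
}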

Evaluate validation rules for legal branch and tag names.

I think we should probably adopt some naming rules for branches and tags. For Hive, they only allow the following for database names: [a-zA-Z_0-9]+

I'm not sure whether Git enforces this, but it feels like it would be best to keep things readable. (I'm not sure if that actually means constraining names to Latin characters, but at least avoiding the null character seems like a good idea...)
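
If we do adopt the Hive-style character set, validation is essentially a one-liner. The snippet below just restates the rule quoted above as a sketch; the class name and the final set of allowed characters are not decided.

import java.util.regex.Pattern;

// Sketch: validate branch/tag names against the same character class Hive uses
// for database names (letters, digits and underscore).
final class RefNameValidation {

  private static final Pattern LEGAL_REF_NAME = Pattern.compile("[a-zA-Z0-9_]+");

  static boolean isLegalRefName(String name) {
    return name != null && LEGAL_REF_NAME.matcher(name).matches();
  }
}

So isLegalRefName("my_branch") is true, while isLegalRefName("my branch") and isLegalRefName("") are false.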

Extend test coverage in python

Currently test coverage is low in Python and relies on mocking.

I would like to use something like pytest-vcr to record actual REST responses and test against those.

Our Spark3 Catalog needs work

The Spark3 catalog, especially the Nessie one, is a bit squishy at the moment and needs a lot of work:

  • change ref in SQL statement
  • change ref in Catalog with a property change (see the sketch after this list)
  • default catalog vs. session catalog, etc.
  • delegate catalog
  • see the disabled tests in the nessie-spark3 module
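
For the "change ref with a property change" item, usage might eventually look like the sketch below. The catalog name nessie and the property key spark.sql.catalog.nessie.ref are assumptions made for illustration, not a confirmed configuration surface.

import org.apache.spark.sql.SparkSession;

public final class SwitchRefExample {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("nessie-ref-switch")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical: repoint a catalog registered as "nessie" at a different Nessie ref.
    // The exact property key is an assumption for this sketch.
    spark.conf().set("spark.sql.catalog.nessie.ref", "my_branch");

    // Reads and writes through that catalog would then resolve against my_branch.
    spark.stop();
  }
}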

Split REST APIs and Merge server and client JAX-RS definitions

REST APIs are very complicated right now:

  • two definitions (one in the client and one in services)
  • all in one file
  • the mix of OpenAPI and JAX-RS annotations makes it hard to read

We should:

  • split the API into multiple files
  • separate JAX-RS and OpenAPI annotations for readability
  • have only one definition of the REST APIs shared across client and server (a minimal sketch follows this list)
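
One way to get down to a single definition is a plain JAX-RS annotated interface that the server resource classes implement and the client consumes through a REST-client proxy. The interface, paths, and return types below are illustrative only and use String placeholders instead of real model types.

import java.util.List;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Illustrative shared API definition: the contract lives in exactly one place,
// the server implements the interface, and the client builds a proxy for it.
// OpenAPI annotations could live on the implementation (or a separate mixin)
// so the interface itself stays readable.
@Path("trees")
@Produces(MediaType.APPLICATION_JSON)
public interface TreeApi {

  @GET
  List<String> getAllReferences();

  @GET
  @Path("branch/{branchName}")
  String getBranch(@PathParam("branchName") String branchName);
}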
