projectnessie / nessie
Nessie: Transactional Catalog for Data Lakes with Git-like semantics
Home Page: https://projectnessie.org
License: Apache License 2.0
To support tools that can't work directly with Delta Lake and Nessie, enable manifest exports to be written automatically on commits to identified branches.
We will use https://github.com/projectnessie/projectnessie.github.io/ for storage of the resulting assets.
Currently we can only read/write Iceberg from the Nessie writer.
We should look at bridging the V1-based Delta writer to the V2 Nessie writer, and look at how Hive support might work.
While testing, I realized that the Dynamo backing store initialization within Quarkus was broken. We need to get integration tests in place that actually test the stores in the server, not just independently.
Currently Delta accepts only filesystem paths or Hive table names as valid Delta tables. This makes it hard to pass branch or hash information to the NessieLogStore. We have to modify Nessie to accept table names in the format <tablename>@<branch>#<hash>.
This likely means we have to investigate how the DeltaTables are cached, as they are currently cached by path only.
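A minimal sketch of what parsing the proposed <tablename>@<branch>#<hash> identifier could look like. The exact grammar (which characters are allowed in each part, whether branch and hash are optional) is an assumption, not a decided format:

```python
import re

# Hypothetical parser for the proposed identifier format
# <tablename>@<branch>#<hash>; branch and hash are treated as optional.
_TABLE_REF = re.compile(
    r"^(?P<table>[^@#]+)"          # table name or path
    r"(?:@(?P<branch>[^#]+))?"     # optional branch
    r"(?:#(?P<hash>[0-9a-f]+))?$"  # optional commit hash
)

def parse_table_ref(ref: str):
    """Split a Delta table identifier into (table, branch, hash)."""
    m = _TABLE_REF.match(ref)
    if m is None:
        raise ValueError(f"invalid table reference: {ref!r}")
    return m.group("table"), m.group("branch"), m.group("hash")
```

A plain path would parse as a table with no branch or hash, which keeps backward compatibility with the existing path-only caching.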
When a new GitHub release is created, automatically publish Maven artifacts to GitHub and eventually to OSSRH repositories for general consumption.
Perf tests should be re-run using the new algorithm/endpoints to prove the speed-up.
Note this may require tweaking the tracing settings on the client.
We plan to implement GC in a few months. However, we should make sure that any basic object changes are done before then, so we avoid a situation where GC can't be run until objects are rewritten. Writing up the design now should prevent that from happening.
There are multiple use cases where being able to filter by object key type is useful. For example, if I want to list all Hive tables within a database that also contains Delta Lake tables, I should be able to filter that without having to read the current value for each key. My current thinking is this could be optimized by an extension whereby we save a 2- or 4-byte attribute along with each key and value. This would be used for storing things like object type, to allow efficient getKeys()-like operations for subsets of objects. The API changes would be independent of #81, but the efficient implementation would depend on that issue/change.
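A sketch of the idea above: store a small type tag next to each key so a filtered listing never has to load values. The tag values and the in-memory store layout here are illustrative assumptions, not the DynamoVersionStore schema:

```python
from enum import IntEnum

class ObjectType(IntEnum):  # small enough to fit a 2-byte attribute
    ICEBERG_TABLE = 1
    DELTA_TABLE = 2
    HIVE_TABLE = 3
    VIEW = 4

# key -> (type_tag, value); only the tag is touched for filtered listings
store = {
    ("db1", "orders"): (ObjectType.HIVE_TABLE, "..."),
    ("db1", "events"): (ObjectType.DELTA_TABLE, "..."),
    ("db1", "daily"):  (ObjectType.VIEW, "..."),
}

def get_keys(store, object_type=None):
    """List keys, optionally filtered by type, without reading values."""
    return [key for key, (tag, _value) in store.items()
            if object_type is None or tag == object_type]
```

The point of the tag is that a "list Hive tables in db1" query only inspects the fixed-size attribute, never the (potentially large) serialized value.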
The list of required packages is present in at least two places, setup.py and requirements.txt, both being somewhat authoritative.
We should check whether we can keep them in sync somehow.
See comment by @rymurr in #165 (comment) for context
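One common way to avoid the duplication is to have setup.py parse requirements.txt instead of repeating the list. A minimal sketch (the file layout is an assumption about this repo, not its actual setup):

```python
from pathlib import Path

def read_requirements(path: str = "requirements.txt"):
    """Return non-comment, non-empty lines from a pip requirements file."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]

# In setup.py:
# setup(..., install_requires=read_requirements())
```

This keeps requirements.txt as the single source of truth, at the cost of setup.py only working from a source checkout that includes the file.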
getKeys() is an extremely expensive operation in the Dynamo version store. Just to get the keys (not the values) will likely require:
1x L1 retrieval
151x L2 retrievals
199x L3 retrievals
This is ~30,050 records we have to get from DynamoDB. If we assume we use BatchGetItem, we can retrieve 100 records at a time. This still equates to ~300 simultaneous Dynamo requests (and substantial read units consumed).
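As a sanity check, the request count follows directly from DynamoDB's 100-item BatchGetItem limit (the record total is taken from the issue text):

```python
import math

records = 30_050  # total records cited above
batch_size = 100  # BatchGetItem returns at most 100 items per call
requests = math.ceil(records / batch_size)
# 301 batched requests, i.e. the ~300 figure cited above
```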
Two possible paths for improvement:
We'll also have to think through the garbage collection dynamics of such a system.
Hi,
Thanks for the great work on nessie, really looking forward to this project.
Hoping you can help clarify the intended setup for running Nessie on localhost...
$ docker run -p 19120:19120 projectnessie/nessie
...
2020-10-06 01:29:26,472 INFO [io.quarkus] (main) nessie-quarkus 0.1-SNAPSHOT native (powered by Quarkus 1.8.1.Final) started in 0.030s. Listening on: http://0.0.0.0:19120
...
$ pip install pynessie
...
Successfully installed...
...
$ nessie create-branch my_branch
pynessie.error.NessiePermissionException: Not permissioned to view entity at : 403 Client Error: Forbidden for url: http://localhost:19120/api/v1/trees/branch/my_branch
Is the "Forbidden" error a result of not having logged into the UI at localhost? http://localhost:19120 redirects to http://localhost:19120/login but it is unclear where these credentials are actually set?
Does something need to be defined in ~/.config/nessie/config.yaml before this will work?
Thanks again.
Add a native image build (Linux) to the project.
Iceberg snapshots map well to Nessie commits. We should evaluate the creation of a tool which will scan one or more Iceberg tables and import the snapshots from all tables, in approximate time order, into a Nessie database. This will allow new operations on data that existed before Nessie.
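The core of such a tool is merging per-table snapshot histories into one approximately time-ordered stream. A hedged sketch of that merge step, with made-up snapshot tuples rather than real Iceberg/Nessie API calls:

```python
def time_ordered_commits(tables):
    """Merge snapshots from several tables into one time-ordered stream.

    tables: {table_name: [(timestamp_ms, snapshot_id), ...]}
    Returns (timestamp_ms, table_name, snapshot_id) tuples sorted by time,
    i.e. the order in which commits would be replayed into Nessie.
    """
    events = [(ts, name, snap)
              for name, snaps in tables.items()
              for ts, snap in snaps]
    return sorted(events)
```

Ordering is only approximate across tables because snapshot timestamps from different writers are not globally consistent, which matches the "approximate time order" caveat above.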
Come up with an updated information architecture for the site.
https://github.com/projectnessie/projectnessie.github.io holds the output of mkdocs build run on /site/.
It looks like with Maven 3.6.2 you can grab the environment variables in the toolchain file. We need to figure out how to configure the toolchain with an action used in the build workflow, and then trigger the tests based on that.
I think we should probably adopt some naming rules for branches and tags. Hive only allows the following for database names: [a-zA-Z_0-9]+
I'm not sure whether Git enforces this, but it feels like it would be best to keep things readable. (I'm not sure if that actually means constraining to Latin characters, but at least avoiding the null character seems like a good idea...)
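A validation helper for the Hive-compatible character set mentioned above could be as simple as the following; the exact rule Nessie ends up adopting may of course differ:

```python
import re

# Hive-compatible name rule discussed above: letters, digits, underscore.
VALID_REF_NAME = re.compile(r"^[a-zA-Z0-9_]+$")

def is_valid_ref_name(name: str) -> bool:
    """True if a branch/tag name uses only [a-zA-Z0-9_] characters."""
    return bool(VALID_REF_NAME.match(name))
```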
Update the Hive Catalog to be able to route to the Nessie Catalog if the Hive pointer is a Nessie pointer.
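The routing itself reduces to a pointer-type check. A hypothetical sketch, where the "nessie:" scheme is purely an assumption about how a Nessie pointer might be recognized (the real detection mechanism is undecided):

```python
def resolve_catalog(table_pointer: str) -> str:
    """Route a Hive table pointer to the catalog that should handle it.

    Assumption: Nessie-managed tables are marked with a "nessie:" scheme;
    everything else falls through to the plain Hive catalog.
    """
    return "nessie" if table_pointer.startswith("nessie:") else "hive"
```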
Currently test coverage is low in Python and relies on mocking.
I would like to use something like pytest-vcr to record actual REST responses and test against those.
The Spark3 catalog, especially the Nessie one, is a bit squishy at the moment and needs a lot of work: the nessie-spark3 module.
The AWS SDK supports two HTTP client types: the Apache one and the url-connection one. The Apache one is the default but has a slow startup time. We should make sure the url-connection one is used in the native images.
https://docs.aws.amazon.com/sdk-for-java/v2/developer-guide/client-configuration-starttime.html
REST APIs are very complicated right now:
We should:
Right now, serialized values are restricted to a relatively small size in DynamoVersionStore (or they should be). We need to enhance the store interface to support an extended value that is stream-based and can be read independently of the base value. This should provide a relatively clean interface for working with larger objects.
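An illustrative sketch of the two-part shape this could take: a small inline portion stored with the base value, plus extended chunks that can be streamed and read separately. The names and the size threshold are assumptions, not the store interface:

```python
INLINE_LIMIT = 256 * 1024  # bytes kept inline with the base value (assumed)

def split_value(data: bytes, chunk_size: int = INLINE_LIMIT):
    """Return (inline_part, [extended_chunks]) for a serialized value."""
    inline, rest = data[:chunk_size], data[chunk_size:]
    chunks = [rest[i:i + chunk_size] for i in range(0, len(rest), chunk_size)]
    return inline, chunks

def join_value(inline: bytes, chunks) -> bytes:
    """Reassemble the original value from its parts."""
    return inline + b"".join(chunks)
```

Small values round-trip with an empty chunk list, so callers that never exceed the inline limit are unaffected.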
Make JSON output optional, e.g. via a --json flag.
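A minimal sketch of an opt-in --json flag, using argparse for illustration (the real pynessie CLI may wire this up differently, and the branch listing here is placeholder data):

```python
import argparse
import json

def render_branches(branches, as_json: bool) -> str:
    """Format a branch listing as JSON or as plain lines."""
    return json.dumps(branches) if as_json else "\n".join(branches)

def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="nessie")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable JSON output")
    args = parser.parse_args(argv)
    return render_branches(["main", "dev"], args.json)  # placeholder data
```

Keeping the human-readable format as the default means existing scripts that scrape the plain output keep working.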