Warning: This repository is ARCHIVED.
The Rust Object Store crate has moved to https://github.com/apache/arrow-rs/tree/master/object_store
License: Other
#45 left out the OAuth2 part. We should set up an emulated OAuth2 endpoint to actually test these code paths as well.
There is no link to the GitHub repository, which makes it hard to find the source code.
Problem
The S3 implementation currently has a SemaphoreClient that limits concurrent requests. This functionality is useful beyond just S3, it confuses the S3 implementation, and it dates from a time when ObjectStore was not object-safe.
Proposal
I would like a LimitStore that wraps a <T: ObjectStore>, much like ThrottledStore, and provides a configurable concurrency limit.
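The wrapper could look something like the following sketch. This is not the actual implementation: the real store would implement the async ObjectStore trait and use tokio::sync::Semaphore, but a std-only counting semaphore shows the same acquire/release pattern around every request.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Minimal counting semaphore on std primitives; a real async
// implementation would use tokio::sync::Semaphore instead.
struct Semaphore {
    permits: Mutex<usize>,
    cvar: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cvar: Condvar::new() }
    }

    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.cvar.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cvar.notify_one();
    }
}

// Hypothetical LimitStore: wraps any inner store and caps in-flight calls.
struct LimitStore<T> {
    inner: T,
    semaphore: Arc<Semaphore>,
}

impl<T> LimitStore<T> {
    fn new(inner: T, max_requests: usize) -> Self {
        LimitStore { inner, semaphore: Arc::new(Semaphore::new(max_requests)) }
    }

    // Every request method (get, put, list, ...) would follow this
    // acquire-call-release pattern.
    fn with_permit<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        self.semaphore.acquire();
        let result = f(&self.inner);
        self.semaphore.release();
        result
    }
}
```

Because the wrapper is generic over `T: ObjectStore`, it composes with any backend, exactly like ThrottledStore does today.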
$ cargo test
...
---- local::tests::test_list_root stdout ----
thread 'local::tests::test_list_root' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidPath { source: Canonicalize { path: "/.VolumeIcon.icns", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } } }', src/local.rs:692:20
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
local::tests::test_list_root
test result: FAILED. 29 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.30s
The same commit works on Linux.
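One possible mitigation, sketched below under the assumption that the panic comes from canonicalizing directory entries that cannot be resolved (such as macOS metadata files like /.VolumeIcon.icns, or files deleted between readdir and stat): skip NotFound errors during listing instead of failing the whole listing. The function name is illustrative, not the crate's actual code.

```rust
use std::{fs, io, path::{Path, PathBuf}};

// Hypothetical fix sketch: when listing a directory, skip entries that
// cannot be canonicalized (raced deletes, unresolvable metadata files)
// instead of propagating the error and failing the entire listing.
fn list_dir(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        match entry.path().canonicalize() {
            Ok(p) => out.push(p),
            // The entry vanished (or never resolved); ignore it.
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```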
Currently GoogleCloudStorage does not support ranged get requests, despite them being supported by the API (https://cloud.google.com/storage/docs/json_api/v1/objects/get).
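Ranged gets in the JSON API work via the standard HTTP Range header. A hypothetical helper (not part of the crate) that converts a Rust byte range into a header value might look like this; note that HTTP byte ranges are inclusive of the end byte while Rust ranges are exclusive:

```rust
/// Hypothetical helper: format an HTTP Range header value for a ranged GET.
/// HTTP byte ranges are inclusive of the last byte; Rust ranges are not.
fn range_header(range: std::ops::Range<usize>) -> String {
    assert!(!range.is_empty(), "empty ranges cannot be expressed");
    format!("bytes={}-{}", range.start, range.end - 1)
}
```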
The cloud-storage crate doesn't yet support rewrite_object with a precondition, which we need for this. So this might be somewhat blocked on #18.
TL;DR: I would like to propose donating this project to the Apache Arrow project (https://arrow.apache.org/).
Object stores are increasingly important for analytic systems as more data is located in them. @yjshen donated an object store abstraction to Arrow DataFusion to allow DataFusion to read from local files, S3, HDFS, and others. In apache/datafusion#2489 the DataFusion community is proposing migrating from that original object store abstraction, part of the DataFusion project (part of Apache Arrow), to the code in this crate.
The code in this crate was originally developed by InfluxData, largely by @carols10cents, for InfluxDB IOx. @tustvold has since extracted the code and released it as its own crate. Upon consideration, as described above, we believe that for the long-term health of both this code and the arrow-rs and arrow-datafusion projects, moving it to be an official part of Arrow would be beneficial, and we would like to donate it to the community.
There is additional background in apache/datafusion#2677 (comment).
This ticket hopefully can serve as a discussion on the form this donation can take. Some options:
ListResult has next_token: Option (the token passed to the API for the next page of list results). However, ObjectStore has no method that accepts such a token, so it is unclear why this is part of the API.
fn list: list all the objects with the given prefix.
fn list_with_delimiter: list objects with the given prefix and an implementation-specific delimiter.
-> Should all such objects be returned, or only some? (And how many? Leave it to the underlying implementation?)
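For reference, the delimiter semantics most stores follow (e.g. S3's ListObjects) return only objects directly under the prefix, collapsing deeper paths into "common prefixes". A self-contained illustration, not the crate's code:

```rust
use std::collections::BTreeSet;

// Illustration of typical delimiter semantics (as in S3's ListObjects):
// objects directly under the prefix are returned as-is, while keys that
// contain the delimiter after the prefix collapse into common prefixes.
fn list_with_delimiter(
    keys: &[&str],
    prefix: &str,
    delim: char,
) -> (Vec<String>, Vec<String>) {
    let mut objects = Vec::new();
    let mut prefixes = BTreeSet::new();
    for key in keys {
        if let Some(rest) = key.strip_prefix(prefix) {
            match rest.find(delim) {
                // Deeper path: report only the common prefix, once.
                Some(i) => {
                    prefixes.insert(format!("{prefix}{}", &rest[..=i]));
                }
                // Directly under the prefix: report the object itself.
                None => objects.push(key.to_string()),
            }
        }
    }
    (objects, prefixes.into_iter().collect())
}
```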
Users expect that a correctly configured AWS credentials or profile file will cause systems to assume the configured roles. Unfortunately, Rusoto does not do this out of the box (issue, pr).
One possible way to mitigate this is to migrate to aws-sdk-rust, but that does not seem compatible with #18.
copy_if_not_exists is not implementable on S3 without some external lock. Hence delta-rs created dynamodb-lock. We should aim to implement this method behind an optional feature that includes dynamodb-lock. How to configure the lock client is an open question.
main was green but now fails with:
failures:
---- aws::tests::s3_test stdout ----
thread 'aws::tests::s3_test' panicked at 'assertion failed: `(left == right)`
left: `[Path { raw: "a/b/c/foo.file" }]`,
right: `[Path { raw: "a/b%2Fc/foo.file" }]`', src/lib.rs:411:9
stack backtrace:
The test is either flaky (with a low chance of passing) or depends on some external state that changed after the commit containing the root cause was merged.
ObjectStores might want to be extended with additional traits that use the underlying client. For example, in delta-rs we would want to extend the client to implement a version of rename that fails if the destination already exists (see delta-io/delta-rs#610).
AmazonS3 does expose a .client() method (Lines 533 to 535 in 5f488c5).
MicrosoftAzure only has a private container_client field (Lines 73 to 74 in 5f488c5).
GoogleCloudStorage only has a private client field (Lines 103 to 104 in 5f488c5).
They will each return a different type, but maybe we should implement a client() method for each, like the one S3 has?
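The shape of such an accessor is trivial; the sketch below uses a placeholder ContainerClient type (an assumption, standing in for the Azure SDK's real client type) purely to show the pattern:

```rust
// Placeholder for the Azure SDK's container client type (assumption,
// standing in for the real azure_storage_blobs type).
struct ContainerClient {
    url: String,
}

struct MicrosoftAzure {
    container_client: ContainerClient, // currently private
}

impl MicrosoftAzure {
    /// Mirrors AmazonS3::client(): hand back a reference to the raw
    /// client so downstream crates (e.g. delta-rs) can build
    /// store-specific extensions on top of it.
    pub fn client(&self) -> &ContainerClient {
        &self.container_client
    }
}
```

The same three-line accessor would work for GoogleCloudStorage; only the returned type differs per store.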
Problem
Currently this crate uses upstream crates to provide interaction with object storage. This comes with a few downsides:
Proposal
This crate does not intend to cover more than the basic APIs of each store, which boil down to just a couple of different request types. I would therefore like to propose:
This will allow:
It's up to whoever picks this up where to start, but I would suggest starting with GCS as:
If that goes well, we can then look to move onto the others
Additional Context
One idea I was exploring in datafusion-contrib/datafusion-objectstore-s3#54 was implementing the AsyncWrite trait as an abstraction over multi-part upload. Does that seem like an agreeable addition to this crate?
Multi-part uploads are helpful when uploading large files. For example, you can write Parquet files one row group at a time, uploading each row group's data as a part (though more likely there is some buffering in between to get good part sizes). This is the approach taken by the Arrow C++ S3 FileSystem. In fact, we could even upload parts in parallel for better throughput in some scenarios (something AWS recommends).
It seems that GCS supports this through their S3-compatible API (docs), and Azure Blob Storage has some notion of "block blobs" that might be applicable (docs).
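The buffering core of the idea can be sketched independently of any SDK. The following is a synchronous std::io::Write version (the real proposal would implement tokio's AsyncWrite and call the S3 multi-part API); the PartUploader trait and Recorder type are assumptions introduced for illustration:

```rust
use std::io::{self, Write};

/// Assumed interface for shipping one part of a multi-part upload.
trait PartUploader {
    fn upload_part(&mut self, part_number: usize, data: &[u8]);
    fn complete(&mut self);
}

/// Sketch: buffers written bytes and uploads a part each time the buffer
/// reaches `part_size`; `finish` flushes the remainder and completes the
/// upload. A real version would implement tokio's AsyncWrite.
struct MultiPartWriter<U: PartUploader> {
    uploader: U,
    buffer: Vec<u8>,
    part_size: usize,
    next_part: usize,
}

impl<U: PartUploader> MultiPartWriter<U> {
    fn new(uploader: U, part_size: usize) -> Self {
        Self { uploader, buffer: Vec::new(), part_size, next_part: 1 }
    }

    fn finish(mut self) -> U {
        if !self.buffer.is_empty() {
            let data = std::mem::take(&mut self.buffer);
            self.uploader.upload_part(self.next_part, &data);
        }
        self.uploader.complete();
        self.uploader
    }
}

impl<U: PartUploader> Write for MultiPartWriter<U> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        // Ship full-size parts as soon as they are available.
        while self.buffer.len() >= self.part_size {
            let rest = self.buffer.split_off(self.part_size);
            let part = std::mem::replace(&mut self.buffer, rest);
            self.uploader.upload_part(self.next_part, &part);
            self.next_part += 1;
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Test double: records part sizes instead of talking to a store.
struct Recorder {
    parts: Vec<(usize, usize)>,
    completed: bool,
}

impl PartUploader for Recorder {
    fn upload_part(&mut self, n: usize, data: &[u8]) {
        self.parts.push((n, data.len()));
    }
    fn complete(&mut self) {
        self.completed = true;
    }
}
```

Parallel part uploads would then be a matter of handing each buffered part to a task pool rather than uploading inline.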