object_store_rs's People

Contributors

alamb, crepererum, domodwyer, kodiakhq[bot], mkmik, pauldix, roeap, timvw, tustvold, wjones127

object_store_rs's Issues

OAuth emulation

#45 left out the OAuth2 part. We should set up an emulated OAuth2 endpoint so that those code paths are actually tested as well.

Extract Connection Limiting Logic out Of S3

Problem

The S3 implementation currently has a SemaphoreClient that limits concurrent requests. This functionality is useful beyond just S3, it confuses the S3 implementation, and it dates from a time when ObjectStore was not object-safe.

Proposal

I would like a LimitStore that wraps a <T: ObjectStore> much like ThrottledStore and provides a configurable concurrency limit.
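To make the proposal concrete, here is a minimal sketch of what such a wrapper could look like. The `LimitStore` name comes from the proposal above, but everything else is illustrative: the semaphore is a tiny blocking one built on `Mutex` + `Condvar` so the example is self-contained, whereas a real async implementation would more likely use something like `tokio::sync::Semaphore` (assumption), and `with_permit` stands in for wrapping each `ObjectStore` method.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Minimal counting semaphore built on Mutex + Condvar, purely to keep
// this sketch dependency-free; an async implementation would use a
// non-blocking semaphore instead.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

// Hypothetical LimitStore: wraps any inner store and takes a permit
// around each delegated operation, capping concurrency at `limit`.
pub struct LimitStore<T> {
    inner: T,
    sem: Arc<Semaphore>,
}

impl<T> LimitStore<T> {
    pub fn new(inner: T, limit: usize) -> Self {
        LimitStore { inner, sem: Arc::new(Semaphore::new(limit)) }
    }

    // Every ObjectStore method would be wrapped like this.
    pub fn with_permit<R>(&self, op: impl FnOnce(&T) -> R) -> R {
        self.sem.acquire();
        let result = op(&self.inner);
        self.sem.release();
        result
    }
}

fn main() {
    // `()` stands in for an inner ObjectStore implementation.
    let store = LimitStore::new((), 2);
    let answer = store.with_permit(|_| 40 + 2);
    println!("{answer}"); // prints 42
}
```

The key design point is the same as for ThrottledStore: the limit lives in a generic wrapper, so every backend gets it for free and the S3 code no longer needs its own SemaphoreClient.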

Test fails on main on macos

$ cargo test
...
---- local::tests::test_list_root stdout ----
thread 'local::tests::test_list_root' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidPath { source: Canonicalize { path: "/.VolumeIcon.icns", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } } }', src/local.rs:692:20
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    local::tests::test_list_root

test result: FAILED. 29 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.30s

The same commit works on Linux.

Propose donating object_store_rs to Apache Arrow project

TL;DR: I would like to propose donating this project to the Apache Arrow project: https://arrow.apache.org/

Rationale

  1. A common, high-quality object store abstraction for communicating with various remote object stores is useful for a range of projects and use cases.
  2. A library with a common API to access remote object stores is directly aligned with the Arrow mission of providing building blocks for modern high performance analytics systems
  3. The clear governance of Apache Arrow offers the best chance to build a unified and strong community around this crate, hopefully both increasing its adoption and attracting community contributions for its long term evolution and maintenance

Background

Object stores are increasingly important for analytic systems as more data is located in them; @yjshen donated an object store abstraction to Arrow DataFusion to allow DataFusion to read from local files, S3, HDFS, and others. In apache/datafusion#2489 the DataFusion community is proposing migrating from this original object store abstraction, part of the DataFusion project (itself part of Apache Arrow), to the code in this crate.

Provenance

The code in this crate was originally developed by InfluxData, largely by @carols10cents, for InfluxDB IOx. @tustvold has since extracted the code and released it as its own crate. As described above, for the long-term health of both this code and the arrow-rs and arrow-datafusion projects, moving it to be an official part of Arrow would be beneficial, and we would like to donate it to the community.

There is additional background here apache/datafusion#2677 (comment)

This ticket can hopefully serve as a discussion of the form this donation could take. Some options:

  1. Move code into the arrow-datafusion repository
  2. Move code into the arrow-rs repository
  3. Move code to an apache/arrow-object-store-rs repository
  4. Move code to datafusion-contrib

Remove next_token from ListResult

ListResult has next_token: Option (the token passed to the API for the next page of list results), but ObjectStore has no method that accepts such a token, so it is unclear why this is part of the API.

fn list: List all the objects with the given prefix

fn list_with_delimiter: List objects with the given prefix and an implementation specific delimiter.
-> Should all such objects be returned? Or only some? (And how many? Leave it to underlying impl?)
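For illustration, here is a sketch of what ListResult could look like with next_token removed, so that pagination becomes an internal detail of each implementation rather than something surfaced to callers. The Path and ObjectMeta stubs below are hypothetical simplifications; the real crate defines richer versions of both.

```rust
// Stub types for illustration only; the real crate's Path and
// ObjectMeta carry more information (timestamps, validation, etc.).
#[derive(Debug)]
pub struct Path(pub String);

#[derive(Debug)]
pub struct ObjectMeta {
    pub location: Path,
    pub size: usize,
}

// ListResult as it might look with next_token removed: each
// implementation keeps calling its backend internally until the
// listing is exhausted, and callers never see paging tokens.
pub struct ListResult {
    pub common_prefixes: Vec<Path>,
    pub objects: Vec<ObjectMeta>,
}

fn main() {
    let result = ListResult {
        common_prefixes: vec![Path("data/2022".to_string())],
        objects: vec![ObjectMeta {
            location: Path("data/root.parquet".to_string()),
            size: 1024,
        }],
    };
    println!(
        "{} prefixes, {} objects",
        result.common_prefixes.len(),
        result.objects.len()
    );
}
```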

Support assuming roles directly when using AWS S3

Users expect that a correctly configured AWS credentials or profile file will cause systems to assume the configured roles. Unfortunately Rusoto does not do this out of the box (issue, pr).

One possible way to mitigate this is to migrate to aws-sdk-rust, but that does not seem compatible with #18 .

Implement copy_if_not_exists for AWS

copy_if_not_exists is not implementable on S3 without some external lock; hence delta-rs created dynamodb-lock. We should aim to implement this method behind an optional feature that includes dynamodb-lock. How to configure the lock client is an open question.
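To show why the external lock is needed, here is a concept sketch. S3 has no atomic "copy unless the destination exists", so the existence check and the copy must happen under mutual exclusion. The in-memory FakeStore below is purely illustrative, and a process-local Mutex stands in for the distributed lock client (e.g. dynamodb-lock); none of these names come from the crate.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// In-memory stand-in for a bucket. The Mutex plays the role of the
// external lock client: holding it makes "check, then copy" atomic.
struct FakeStore {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl FakeStore {
    fn copy_if_not_exists(&self, from: &str, to: &str) -> Result<(), String> {
        // Acquire the lock before checking; without it, two writers
        // could both pass the check and both copy.
        let mut objects = self.objects.lock().unwrap();
        if objects.contains_key(to) {
            return Err(format!("destination {to} already exists"));
        }
        let data = objects
            .get(from)
            .cloned()
            .ok_or_else(|| format!("source {from} not found"))?;
        objects.insert(to.to_string(), data);
        Ok(())
    }
}

fn main() {
    let store = FakeStore { objects: Mutex::new(HashMap::new()) };
    store.objects.lock().unwrap().insert("a".to_string(), vec![1, 2, 3]);
    assert!(store.copy_if_not_exists("a", "b").is_ok());
    assert!(store.copy_if_not_exists("a", "b").is_err()); // second copy fails
    println!("ok");
}
```

With dynamodb-lock the same shape applies, except the critical section is guarded by a lease in DynamoDB shared across processes rather than an in-process Mutex.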

Flaky test aws::tests::s3_test

main was green but now fails with:

failures:

---- aws::tests::s3_test stdout ----
thread 'aws::tests::s3_test' panicked at 'assertion failed: `(left == right)`
  left: `[Path { raw: "a/b/c/foo.file" }]`,
 right: `[Path { raw: "a/b%2Fc/foo.file" }]`', src/lib.rs:411:9
stack backtrace:

The test is either flaky (with a low chance of passing) or depends on some external state that changed after the offending commit was merged.

https://app.circleci.com/pipelines/github/influxdata/object_store_rs/169/workflows/18a945aa-947b-4cd3-aeaf-80aa66530648/jobs/1009

https://app.circleci.com/pipelines/github/influxdata/object_store_rs/171/workflows/b5e5dfed-3272-4074-b9ae-8e95be60d12d/jobs/999

Ensure all ObjectStores have publicly available client

ObjectStores might want to be extended with additional traits that use the underlying client. For example, in delta-rs we would want to extend the client to implement a version of rename that fails if the destination already exists (see delta-io/delta-rs#610).

AmazonS3 does expose a method .client():

object_store_rs/src/aws.rs

Lines 533 to 535 in 5f488c5

impl AmazonS3 {
    /// Get a client according to the current connection limit.
    async fn client(&self) -> SemaphoreClient {

MicrosoftAzure only has a private field container_client:

pub struct MicrosoftAzure {
    container_client: Arc<ContainerClient>,

GoogleCloudStorage only has a private field client:

object_store_rs/src/gcp.rs

Lines 103 to 104 in 5f488c5

pub struct GoogleCloudStorage {
    client: Client,

They will each return a different type, but perhaps we should implement a client() method for each of them, like the one S3 has?
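For Azure, for example, such an accessor might look like the sketch below. The ContainerClient stub is hypothetical here (the real type comes from the Azure SDK), and the method body simply hands out a clone of the shared Arc, mirroring what AmazonS3::client() does for S3.

```rust
use std::sync::Arc;

// Stub standing in for the real azure_storage ContainerClient.
pub struct ContainerClient;

pub struct MicrosoftAzure {
    container_client: Arc<ContainerClient>,
}

impl MicrosoftAzure {
    /// Hypothetical accessor mirroring AmazonS3::client(): exposes the
    /// underlying client so extension traits can build on it.
    pub fn client(&self) -> Arc<ContainerClient> {
        Arc::clone(&self.container_client)
    }
}

fn main() {
    let store = MicrosoftAzure {
        container_client: Arc::new(ContainerClient),
    };
    let client = store.client();
    // The shared client can now back extension traits such as a
    // fail-if-exists rename; both the store and the caller hold a
    // reference to the same client.
    println!("refcount: {}", Arc::strong_count(&client)); // 2
}
```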

Move Away From SDKs

Problem

Currently this crate uses upstream crates to provide interaction with object storage. This comes with a few downsides:

  • Missing features, e.g. range support, conditionals, etc...
  • Inconsistent error handling, e.g. it can be next to impossible to get the HTTP status code
  • Extreme dependency bloat

Proposal

This crate does not intend to cover more than the basic APIs of each store, which boil down to just a couple of different request types. I would therefore like to propose:

  • Move to using a reqwest client directly
  • Use serde to serialize payloads
  • Use ring to handle signatures/etc...
  • Use rustls to handle TLS

This will allow:

  • Consistent error handling, retries, etc...
  • Smaller dependency footprint
  • New features without waiting on upstreams
  • Simpler codebase

It's up to whoever picks this up where to start, but I would suggest starting with GCS as:

  • It will unlock range requests
  • The authentication logic is simpler than say AWS
  • The JSON API is relatively straightforward

If that goes well, we can then look to move on to the others.
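As a taste of what "talking to the GCS JSON API directly" involves, the sketch below builds the object URL by hand. The JSON API addresses objects as `storage/v1/b/BUCKET/o/OBJECT` with the object name percent-encoded as a single path segment, so `/` inside object names survives as `%2F`. A real client would then issue the request with reqwest and sign it; the encoder here is a deliberately minimal std-only stand-in, not the encoding a production client should ship.

```rust
// Build the GCS JSON API URL for an object, percent-encoding every
// byte of the object name that is not an unreserved character.
// Assumption: the bucket name is already a valid URL segment.
fn gcs_object_url(bucket: &str, object: &str) -> String {
    let encoded: String = object
        .bytes()
        .map(|b| match b {
            // RFC 3986 unreserved characters pass through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                (b as char).to_string()
            }
            // Everything else, including '/', is percent-encoded.
            _ => format!("%{b:02X}"),
        })
        .collect();
    format!("https://storage.googleapis.com/storage/v1/b/{bucket}/o/{encoded}")
}

fn main() {
    println!("{}", gcs_object_url("my-bucket", "a/b/c.parquet"));
    // https://storage.googleapis.com/storage/v1/b/my-bucket/o/a%2Fb%2Fc.parquet
}
```

Once the URL and auth token are in hand, a GET with a `Range` header gives exactly the range-request support the SDKs are missing.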

Additional Context

#15 (comment)

AsyncWrite over multi-part upload

One idea I was exploring in datafusion-contrib/datafusion-objectstore-s3#54 was implementing the AsyncWrite trait as an abstraction over multi-part upload. Does that seem like an agreeable addition to this crate?

Multi-part uploads are helpful when uploading large files. For example, you can write Parquet files one row group at a time, uploading each row group's data as a part (though more likely there is some buffering in between to get good part sizes). This is the approach taken in the Arrow C++ S3 FileSystem. We could even upload parts in parallel for better throughput in some scenarios (something AWS recommends).

It seems that GCS supports this through their S3-compatible API (docs) and Azure Blob store has some notion of "block blobs" that might be applicable (docs).
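The core of such an adapter is the buffering between writes and parts, which can be sketched without any async machinery. The PartBuffer below is hypothetical: `write` corresponds to what `poll_write` would do and `finish` to `poll_shutdown`, and a real implementation would start an UploadPart request for each completed part (potentially several in parallel) instead of collecting them in a Vec. S3's minimum part size is 5 MiB; the demo uses 4 bytes so it stays readable.

```rust
// Sketch of the buffering an AsyncWrite adapter over multi-part
// upload would need: accumulate incoming bytes and cut a part
// whenever the buffer reaches the part size.
struct PartBuffer {
    part_size: usize,
    buf: Vec<u8>,
    parts: Vec<Vec<u8>>,
}

impl PartBuffer {
    fn new(part_size: usize) -> Self {
        PartBuffer { part_size, buf: Vec::new(), parts: Vec::new() }
    }

    // Corresponds to poll_write: buffer bytes, emitting full parts.
    fn write(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
        while self.buf.len() >= self.part_size {
            let part: Vec<u8> = self.buf.drain(..self.part_size).collect();
            self.parts.push(part); // real impl: enqueue UploadPart here
        }
    }

    // Corresponds to poll_shutdown: flush the (possibly short) tail
    // part and complete the multi-part upload.
    fn finish(mut self) -> Vec<Vec<u8>> {
        if !self.buf.is_empty() {
            self.parts.push(std::mem::take(&mut self.buf));
        }
        self.parts
    }
}

fn main() {
    let mut upload = PartBuffer::new(4);
    upload.write(b"hello wor");
    upload.write(b"ld");
    let parts = upload.finish();
    let sizes: Vec<usize> = parts.iter().map(|p| p.len()).collect();
    println!("{sizes:?}"); // [4, 4, 3]
}
```

Parallel part uploads then fall out naturally: each completed part is independent, so the adapter can keep several UploadPart requests in flight while continuing to accept writes.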
