thoucheese / cloud-storage-rs

A crate for uploading files to Google Cloud Storage, and for generating download URLs.

License: MIT License
There are times when we would like to specify additional headers when calling `create` or similar functions. It would be nice to have a separate set of methods that include this ability as a "mix-in", as sketched below.
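A minimal sketch of what that could look like; `create_with_headers` is hypothetical and does not exist in the crate today:

```rust
use reqwest::header::{HeaderMap, HeaderValue, CACHE_CONTROL, CONTENT_DISPOSITION};

// Build the extra headers the caller wants sent along with the upload.
let mut extra = HeaderMap::new();
extra.insert(CACHE_CONTROL, HeaderValue::from_static("no-store"));
extra.insert(
    CONTENT_DISPOSITION,
    HeaderValue::from_static("attachment; filename=\"report.pdf\""),
);

// Hypothetical "mix-in" variant of `create`; the existing method would keep
// its current signature.
// client.object().create_with_headers(bucket, content, name, mime_type, extra).await?;
```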
Some conditions are not available in `Condition`.
https://cloud.google.com/storage/docs/json_api/v1/buckets#lifecycle
https://cloud.google.com/storage/docs/lifecycle#conditions
These are missing:
Also separate:
I think the documentation for `ActionType::Delete` is incorrect: it does not delete the bucket, but the objects inside of it.
https://cloud.google.com/storage/docs/lifecycle#delete
I can make a PR for this (and other issues) at some point in the future, but just an issue to keep track of it for now.
Am I correct to assume that `Object::list` actually fetches all items? The `ListRequest` object accepts `page_token` and `max_results`, but looking at

```rust
pub async fn list(
    bucket: &str,
    list_request: ListRequest,
)
```

it actually iterates over page tokens and fetches the whole list. I really wish this were made clearer somewhere.
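If you only need a bounded number of results, something like this should work, assuming `ListRequest`'s `max_results` field (named in the issue above) maps straight onto the JSON API parameter of the same name:

```rust
// Ask Google for at most 100 results rather than walking every page.
// (Field name taken from the issue above; check your crate version.)
let request = ListRequest {
    max_results: Some(100),
    ..Default::default()
};
```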
About 60 minutes after starting my service, I always run into the following error:

```
error: Reqwest(reqwest::Error { kind: Request, url: "https://www.googleapis.com/oauth2/v4/token", source: hyper::Error(Connect, ConnectError("dns error", ResolveError { kind: Proto(ProtoError { kind: Message("could not send request"), backtrack: None })
```

I am doing `Object::download_sync` at this time. I think this coincides with the initial token expiration. I don't think the DNS request is failing (running on GKE), but I'm struggling to find out where the error actually is. Running version 0.6.
It appears that the serde attributes for at least some of the structs don't match what GCP is actually returning. For instance, a bucket can have a storage class of "ARCHIVE", which is not in the `StorageClass` enum. I'm happy to add missing items as a PR, but is there a canonical source for the schema (specifically with regard to whether an item is an `Option`)? I haven't been able to find one, but I'm reluctant to start making a bunch of fields optional, such as the `cors` field.
The crate currently depends on Tokio ^0.2, but it might be beneficial to update to the new version, since it's considered a beta before 1.0.
If you try to run a program that uses both the `cloud-storage` and tokio 0.3 crates, you get a "thread 'main' panicked at 'not currently running on the Tokio runtime.'" error.
For now, I can use the `tokio-compat-02` crate as a workaround; it could also be used while updating `cloud-storage` to tokio 0.3, until its other dependencies update themselves as well.
Would you be open to a pull request adding an async API that uses `reqwest::Client` rather than `reqwest::blocking::Client`? If that's not something you would want to maintain, I understand :)
In addition to making the code hard to reason about, having globals makes it difficult to use multiple instances of things at a time: multiple service accounts, multiple tokio runtimes (when dealing with sync code), and concurrent tests.
Refactoring this library into a `Client` with `Object` and `Bucket` types that carry instance methods rather than only static ones would be ideal. A modest rewrite would be nice.
but the whole thing still feels weird, since basically every other library takes `Item = Bytes`, especially for interop with `reqwest::wrap_stream`. I think that the `Download` with `fn size`, keeping `Item = Bytes`, was actually the correct approach.
`BucketClient::list` appears to always return an error:

```
[src/main.rs:6] client.bucket().list().await = Err(
    Reqwest(
        reqwest::Error {
            kind: Decode,
            source: Error("data did not match any variant of untagged enum GoogleResponse", line: 0, column: 0),
        },
    ),
)
```

It also seems to have some `dbg!()`s left in: it runs the query twice just to debug-print it (which is how we can see that the response successfully gets the buckets). It should not retry in this case, as it is not going to magically start working without changing the permissions.
```rust
let x = client.object().list(&bucket, Default::default()).await?;
let count = x
    .map_err(|e| dbg!(anyhow!(e).context("getting data from GCS")))
    .map_ok(|l| {
        println!("got ObjectList with {} items", l.items.len());
        l
    })
    .count()
    .await;
```
Also, the documentation claims that "This function will repeatedly query Google and merge the responses into one.", which is untrue. This function instead returns a `Stream` of `Result<ObjectList>`s, as shown clearly by the return type. You might want to fix the description while you're there.
I've used (and contributed to) the gcp_auth crate, which IMO is a pretty nice and simple way to deal with authentication/authorization for GCP. Maybe it would be nice to keep this crate focused on specific Cloud Storage APIs and integrate with it (or another existing crate) for authn/authz?
(reported on 0.8.3) The `println!` and `dbg!` macros in `src/resources/object.rs` spit out a bunch of stuff to stdout that makes it difficult to focus on my app. Would it be possible to suppress this, or to use the `log`/`env_logger` crates (or similar) to put this output in its own channel?
For uploading large files to GCS, it's recommended to use resumable uploads. If there's interest, I'd be down to contribute this upstream to this crate. More details on how to do it here: https://cloud.google.com/storage/docs/performing-resumable-uploads
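For reference, a rough protocol sketch using plain reqwest (this is not current crate API; the endpoint and header names come from the documentation linked above, and `token` is assumed to be a valid OAuth2 access token):

```rust
use reqwest::Client;

async fn resumable_upload(
    token: &str,
    bucket: &str,
    name: &str,
    data: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Step 1: initiate the session; the session URI comes back in `Location`.
    let init = client
        .post(format!(
            "https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o?uploadType=resumable&name={name}"
        ))
        .bearer_auth(token)
        .header("X-Upload-Content-Type", "application/octet-stream")
        .send()
        .await?;
    let session = init
        .headers()
        .get("location")
        .and_then(|v| v.to_str().ok())
        .ok_or("missing session URI")?
        .to_owned();

    // Step 2: PUT the data in chunks. Every chunk except the last must be a
    // multiple of 256 KiB; `Content-Range` tells Google where each one belongs.
    let total = data.len();
    for (i, chunk) in data.chunks(256 * 1024).enumerate() {
        let start = i * 256 * 1024;
        let end = start + chunk.len() - 1;
        client
            .put(&session)
            .header("Content-Range", format!("bytes {start}-{end}/{total}"))
            .body(chunk.to_vec())
            .send()
            .await?;
    }
    Ok(())
}
```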
The "create and then update" workflow does not work well in some cases.
E.g. I am setting obj.cache_control = Some("no-store".to_string()); obj.update().await
to prevent GCS's default forced cache of one hour (which cannot be circumvented from the browser).
Two issues:
Is it possible to add metadata to Object::create
so that this happens in one single request instead of two?
This would solve the two issues above by making object-creation-with-metadata a single atomic step.
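One possible shape for this, purely as a sketch (`create_with` does not exist in the crate today):

```rust
// Hypothetical variant of `create` that accepts metadata up front, so the
// object never exists without its cache_control (etc.) set.
pub async fn create_with(
    bucket: &str,
    file: Vec<u8>,
    filename: &str,
    mime_type: &str,
    metadata: &Object, // e.g. cache_control: Some("no-store".to_string())
) -> crate::Result<Object> {
    todo!("single multipart upload carrying both payload and metadata")
}
```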
This repo is linked from this crate. I want to know if this is correct, since the crate version on crates.io and in the README are different. Thanks.
The documentation implies that errors from Google become an `Error::Google`, but https://docs.rs/cloud-storage/latest/src/cloud_storage/client/object.rs.html#126 always creates an `Error::Other`:

```
Other("{\n \"error\": {\n \"code\": 403,\n \"message\": \"[email protected] does not have storage.objects.delete access to the Google Cloud Storage object.\",\n \"errors\": [\n {\n \"message\": \"[email protected] does not have storage.objects.delete access to the Google Cloud Storage object.\",\n \"domain\": \"global\",\n \"reason\": \"forbidden\"\n }\n ]\n }\n}\n")
```
The current implementation has one global `Client`:

Line 118 in 563124a

While reusing a single hyper/reqwest client is a good idea to allow connection reuse, this implementation can lead to errors from hyper when the executor exits:

```
dispatch dropped without returning error
```

This is explained in hyperium/hyper#2112:

> The `Client` spawns background tasks to monitor the HTTP connection status, and if the executor drops it before it determines the connection was closed, it panics with that message.

The easiest way to trigger this is to use something like `#[tokio::test]` with a large number of tests. Each test will spin up its own executor that ends when the test does. Although less common, it's intended to be totally possible to have multiple executors in a "normal" execution, including starting and stopping them.
Downloads via the current download URLs are served with the default headers set by Google. Sometimes it is necessary to set custom headers here (such as Content-Disposition, to make the browser download the file).
I'm not sure if I'm missing something, but the sync `ObjectClient::list` method seems to return an async `Stream`.

Changes from 0.8.4 to 0.9.0:

https://docs.rs/cloud-storage/0.8.4/cloud_storage/struct.Object.html#method.list_sync:

```rust
pub fn list_sync(
    bucket: &str,
    list_request: ListRequest
) -> Result<Vec<ObjectList>, Error>
```

was transformed into:

https://docs.rs/cloud-storage/0.9.0/cloud_storage/sync/struct.ObjectClient.html#method.list

```rust
pub fn list(
    &self,
    bucket: &'a str,
    list_request: ListRequest
) -> Result<impl Stream<Item = Result<ObjectList>> + 'a>
```

Is this intentional? Shouldn't this method return a `Result<Vec<ObjectList>, Error>`, as its global-client counterpart does?
This makes it impossible to reuse the same `Client` in a multi-threaded tokio runtime.
At the moment, the service account is initialized lazily, which makes things inconvenient. I would like to pass a JSON string with all credentials known at compile time. One option is to wrap SERVICE_ACCOUNT in a Mutex; then it would be possible to initialize it manually at the beginning of the program, but that solution is not elegant either. Another option could be a global state/context initialized by the user at any point in time and passed as a reference to the functions that currently consume SERVICE_ACCOUNT and TOKEN_CACHE.
What do you think?
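For concreteness, the second option could look something like this (every name here is hypothetical; nothing like it exists in the crate yet):

```rust
// Hypothetical explicit-context API: credentials are loaded once by the
// caller and handed to each call, instead of a lazily-initialized global.
let ctx = CloudStorageContext::from_json(include_str!("../service-account.json"))?;
let object = Object::create_in(&ctx, "bucket", content, "name.txt", "text/plain").await?;
```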
Recently, we encountered an error we traced back to a lack of a timeout when the `reqwest` client makes a call. It would be helpful either to be able to specify the timeout value or, ideally, to have access to the `client` attribute so that reqwest behaviors can be altered when needed (e.g. `Client::with_client(req: reqwest::Client)`, similar to `Client::with_cache(...)`). Alternate implementations are always welcome.
The behaviour of the `create_streamed` method on the sync `ObjectClient` does not conform to its description. From https://docs.rs/cloud-storage/0.10.2/src/cloud_storage/sync/object.rs.html:

```rust
/// Create a new object. This works in the same way as `ObjectClient::create`, except it does not need
/// to load the entire file in ram.
pub fn create_streamed<R>(
    &self,
    bucket: &str,
    mut file: R,
    length: impl Into<Option<u64>>,
    filename: &str,
    mime_type: &str,
) -> crate::Result<Object>
where
    R: std::io::Read + Send + 'static,
{
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)
        .map_err(|e| crate::Error::Other(e.to_string()))?;
```

While the function description says it does not need to load the whole file into memory, the method starts by loading the whole file into a buffer. This leads to out-of-memory errors when trying to upload files that cannot fit into memory.
I noticed that you're using SERVICE_ACCOUNT instead of GOOGLE_APPLICATION_CREDENTIALS for the environment variable; is there any particular reason for this? I think GOOGLE_APPLICATION_CREDENTIALS should probably be used, since that is how GCP services provide the service account file.
Currently it is possible to create download URLs to grant users unauthenticated access to files. It is, however, not possible to create upload URLs, to allow users to upload files.
Google bucket names are not allowed to contain uppercase characters. The library should maybe validate or normalize bucket name strings to enforce these rules. For more info: https://cloud.google.com/storage/docs/naming-buckets
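A minimal validation sketch based on the rules in that page (deliberately not exhaustive: the full rules also cover dotted names, IP-address lookalikes, and "goog" prefixes):

```rust
/// Quick plausibility check for a bucket name; a hypothetical helper, not
/// part of the crate.
fn is_plausible_bucket_name(name: &str) -> bool {
    let alnum = |c: char| c.is_ascii_lowercase() || c.is_ascii_digit();
    (3..=63).contains(&name.len())
        && name.starts_with(alnum)
        && name.ends_with(alnum)
        && name.chars().all(|c| alnum(c) || matches!(c, '-' | '_' | '.'))
}

// is_plausible_bucket_name("my-bucket-01") == true
// is_plausible_bucket_name("MyBucket")     == false (uppercase is rejected)
```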
Hello,
I need a Read + Seek for a GCS object. Is there a way to get this? Thanks!
Best, Oliver
Thanks for creating this crate! As I'm getting started, I'm noticing some errors/shortcomings in the documentation, which I want to document here so they can be fixed.

In README.md:

> The service account requires the permission devstorage.full_control

It doesn't look like this is a valid permission anymore.

In README.md, the example

```rust
// create a new Bucket
let new_bucket = NewBucket { name: "mybucket", ..Default::default() };
let bucket = Bucket::create(new_bucket).await?;
// upload a file to our new bucket
let content = b"Your file is now on google cloud storage!";
bucket.upload(content, "folder/filename.txt", "application/text").await?;
```

fails with: no method named `upload` found for struct `cloud_storage::Bucket`.

In the features section on docs.rs:

> global-client -- This feature flag does not enable additional features.

However, there are methods gated by this feature, e.g. `Bucket::list`.

Missing permissions: `Bucket::read` requires the `storage.buckets.get` permission, which is not mentioned in the README. This permission is not included in the Service Account Token Creator or Storage Object Admin roles.
Would you be willing to support more authentication types, especially `authorized_user`? I wrote a small CLI tool that I like to run locally. It would be much easier to adopt if cloud-storage could leverage the existing authentication of the `gcloud` command-line tool, which writes `~/.config/gcloud/application_default_credentials.json` with content like this:

```json
{
  "client_id": "...some email",
  "client_secret": "...some random string",
  "quota_project_id": "...some project id",
  "refresh_token": "...some random string",
  "type": "authorized_user"
}
```

Would you be interested in a PR that is able to read this file instead of only service accounts? (Skimming through the code, I would probably make `ServiceAccount` part of an enum that serde distinguishes by the `type` field.)
It returns a `Stream` of `Result<ObjectList>`, which on one level is fine, because it is exactly what Google makes available. At least for my use case, though, it would simplify things if there were a method that returned a `Stream` of `Object` instead, as sketched below.
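Something like this should get there with the current return type, assuming `list` returns `impl Stream<Item = Result<ObjectList>>` as documented:

```rust
use futures_util::{stream, TryStreamExt};

// Flatten pages of ObjectList into a single stream of Objects.
let pages = client.object().list(&bucket, Default::default()).await?;
let objects = pages
    .map_ok(|page| stream::iter(page.items.into_iter().map(Ok)))
    .try_flatten(); // impl Stream<Item = Result<Object>>
```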
`download_streamed` would ideally return the `content-length` header (or be a stream with an accurate `size_hint()`) to obviate an unnecessary extra request to fetch the object size.
Support authenticating without the GOOGLE_CLOUD_CREDENTIALS or SERVICE_ACCOUNT environment variables, as they are not provided in Google CloudBuild or CloudRun environments.

Thanks for this awesome, easy-to-use library. I'm a beginner, and the ability to just have something work with a few lines of code is a great feeling. I was looking at the Object API, and it seems like we can download a file as a `Vec<u8>`, which, if I'm not wrong, will download the entire file into memory? Is there any interest in supporting a streaming download? I did notice the download_url API. Is that the recommended way to download large files?
The API does not expose the `reason`, `message`, or other data of `GoogleError` directly. I can not do the following:

```rust
let error: GoogleError = { /* error from somewhere */ };
match error.reason {
    BadRequest => println!("Some message"),
    ParseError => println!("Some other message"),
    _ => println!("Other errors"),
}
```

Also, `GoogleError` does not implement `Display`, so it can not be used in a `format!` or `println!` statement (only `Debug` is implemented).
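What a `Display` impl could look like, as a sketch inside the crate (the `code` and `message` field names are assumed from the JSON API error shape quoted in the issues above):

```rust
use std::fmt;

impl fmt::Display for GoogleError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // e.g. "403: [email protected] does not have storage.objects.delete access ..."
        write!(f, "{}: {}", self.code, self.message)
    }
}
```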
Currently, this library sources credentials for performing operations exclusively from the environment. Although this is convenient most of the time, it introduces two problems:
As it stands, the library config is self-contained within the `ServiceAccount` struct after loading from env. Allowing the optional passing in of this struct on creation of the client should solve this issue. All downstream code would need to be updated to use this optional config over the globally-defined `crate::SERVICE_ACCOUNT`, however.
It would be useful for optional headers (especially if-match) to be implemented so that logic can live on the server side. Ideally, all optional headers should be accessible in this library to support the full functionality of Google Cloud's JSON API. I just happen to need if-match and if-none-match for my use case.
The `#[tokio::main]` attr attached to all `_sync` methods spawns a multi-threaded runtime, which can hurt performance for heavy users of those APIs. Although I'd recommend removing the `_sync` methods entirely to make the overhead of using them explicit, it can be lessened somewhat by a wrapper method that uses the basic scheduler, which is what reqwest does; see the sketch below.
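A sketch of such a wrapper (written against the tokio 1.x builder API for illustration; on tokio 0.2 the equivalent is `basic_scheduler()`):

```rust
/// Run a future to completion on a lightweight single-threaded runtime,
/// instead of the multi-threaded one #[tokio::main] spins up per call.
fn block_on<F: std::future::Future>(fut: F) -> F::Output {
    tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .expect("failed to build runtime")
        .block_on(fut)
}
```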
Support partial responses (https://cloud.google.com/storage/docs/json_api#partial-response) to speed things up.
For a bucket which contains more than 60K
The recommended way for applications in GKE to access GCP services is now via Workload Identity (WI). This replaces the need to use service accounts to provide credentials; the token is instead obtained from the GKE metadata server.
Are there plans for this crate to support this mode of authentication? As mentioned in #92, gcp_auth is a good candidate to handle auth; its documentation states that it supports getting the token from the metadata server. Are there plans to work on #92, or to support WI auth in some other way? I looked through the code of this crate, but couldn't find a clear place where alternative auth methods could be slotted in.
Based on my understanding,

```rust
let object = Object::download(&self.bucket.name, path.as_ref()).await?;
```

should return the object if available, or an error (`Error::Google`) if it does not exist. When tested with a file which does not exist:

```rust
debug!("{:?}", String::from_utf8(object.clone()).unwrap());
```

prints:

```
No such object: whatever_object_name
```

The log clearly says that the file is a 404:

```
DEBUG reqwest::async_impl::client > response '404 Not Found' for https://www.googleapis.com/storage/...
```

Am I missing something, or is there a problem with the `download` method? Using the latest 0.7 version. Thanks
Thanks for your work here! Would you mind including a LICENSE file to clarify how we're permitted to use this? In the US, anyway, I think the default is exclusive copyright (no one else is permitted to use it).
Edit: Just noticed Cargo.toml has MIT specified.
The module docs include this example for renaming/moving a file:

```rust
let client = Client::default();
let mut object = client.object().read("mybucket", "myfile").await?;
object.name = "mybetterfile".to_string();
client.object().update(&object).await?;
```

However, `update` uses the name of the given object to know what to apply the changes to. So when you try to rename a file this way, Google will tell you that it wasn't found (since it doesn't exist yet). At the very least, this example should be removed and the docs updated (`update`, as well as `compose`, seem to have the same doc entry as `read`). We can perform the operation with a `rewrite`/`copy` to the same bucket, but that's less than ideal, especially because `rewrite` has a `dbg!` left in.
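For the record, the copy-then-delete workaround looks roughly like this (the exact `copy`/`delete` signatures here are assumptions; check the version you're on):

```rust
// "Rename" as copy-then-delete, since GCS has no real rename.
let client = Client::default();
let object = client.object().read("mybucket", "myfile").await?;
client.object().copy(&object, "mybucket", "mybetterfile").await?;
client.object().delete("mybucket", "myfile").await?;
```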
Hello,
I'm using this library for work to read from Google Cloud Storage; however, I'm getting a permission error. I'm assuming this (from the README):

> Authorization can be granted using the SERVICE_ACCOUNT environment variable, which should contain path to the service-account-*******.json file that contains the Google credentials. The service account requires the permission devstorage.full_control. This is not strictly necessary, so if you need this fixed, let me know!

has something to do with it?

```
Running `target\debug\something.exe`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Other("got error response from Google: [email protected] does not have storage.buckets.get access to the Google Cloud Storage bucket.")', src\main.rs:5:18
```

Would love to see this fixed, as I definitely do have read + file create + write permissions, just not full access.
The library currently offers features allowing you to choose whether to use rustls or openssl as the TLS backend. However, it also declares a direct, non-optional dependency on openssl. If we keep this dependency, having a rustls feature doesn't really make much sense. Openssl is also quite difficult to depend on when cross-compiling (especially since cross does not support it anymore), so having the option to use this crate without openssl would be very helpful.
I'm getting a response:

```
&buckets = Err(
    Reqwest(
        reqwest::Error {
            kind: Decode,
            source: Error("data did not match any variant of untagged enum GoogleResponse", line: 0, column: 0),
        },
    ),
)
```

I'm pretty sure I have the correct rights, as when I didn't have them I was receiving an empty list. I have yet to find out what is actually returned as a response for this request.
It would be nice if `cloud-storage` would allow users to connect to an emulator like https://github.com/oittaa/gcp-storage-emulator or https://github.com/fsouza/fake-gcs-server for local testing. For this, the following behavior changes are required:
When accessing a download URL for an object that was successfully stored, I'm seeing this:

```xml
<Error>
  <Code>AccessDenied</Code>
  <Message>Access denied.</Message>
  <Details>
    Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.
  </Details>
</Error>
```

even though I have given the account `roles/storage.objectAdmin` access to the bucket:

```hcl
resource "google_storage_bucket_iam_member" "artifacts-admin" {
  bucket = google_storage_bucket.artifacts.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.artifacts-service-account.email}"
}
```

Any idea what permission is missing here?

Update: this is an example of the URL that was created:

https://storage.googleapis.com/testflight-mn-artifacts/dev-credit-notes/masht.pdf?X-Goog-Algorithm=GOOG4-RSA-SHA256\u0026X-Goog-Credential=testflight-artifacts-account%40misthos-network.iam.gserviceaccount.com%2F20200826%2Fhenk%2Fstorage%2Fgoog4_request\u0026X-Goog-Date=20200826T113518Z\u0026X-Goog-Expires=600\u0026X-Goog-SignedHeaders=host\u0026X-Goog-Signature=4a89db80bf1876d80b45b85b8a37519adb46fd66cbc5747ada3a0b47e9ca65a02029bbac0e5c4f844e30af3c6ed2989660709ce98a2b2b890a34a99b66d325f443d2c482460a2da9e09d1b130db70239dd3f8a81ac664770a2749ad7351b3fc884bcec57a0bebc7ad96bff44273f87a68c4783508ae44428e2651cdd846cfbbe997999df107463b49b2c9d0310b23b4588ae4f2c36b582bbaef5d773ca176a2c205195f9a46587ce65704c850f50dc1a0f6eb94119dba952266b2173247c84848742f169cc83cda68d07c5f895b66ea1ec2fbc858cd145f02be140ae0a262f7767b10be21e77c7ba64331b683cee78cffd4e17c6bdf4e5c3088e46e982616f17

I'm wondering if the HTML escaping (e.g. \u0026) plays a role.
Currently, Object `rewrite` is not fully supported. This is because the requester should check whether the `done` flag is set to `true`, as described here:

> This method copies data using multiple requests so large objects can be copied with a normal length timeout per request rather than one very long timeout for a single request. In such cases, you must keep calling the endpoint until the rewrite response `done` flag is `true`. If the flag is `false`, include the `rewriteToken` that's returned in the response in the subsequent request. [...]

--- https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite

In order to properly support this, the request should be repeated as described by the documentation. Right now the code does not check whether the flag is set to done, and just returns the `Object`. Also, it looks like (reading the documentation) the `resource` item will only be present if the rewrite completes in that one transaction, so deserialization will fail because the `resource` field in `RewriteResponse` is not marked as `Option<Object>`. So currently the function call will fail on deserialization if the object was not fully transferred, which is less than ideal. It should do multiple transactions, or fail by checking whether the `done` flag was set to `false`; see the sketch below.
Slashes in the object name need to be URL-encoded. The workaround I have used is to do that encoding prior to calling `ObjectAccessControl`, i.e.:

```rust
let filename = "foo/bar";
Object::create("bucket", content, filename).await?;
let filename = filename.replace("/", "%2F");
ObjectAccessControl::create("bucket", &filename, &acl).await?;
```