thoucheese / cloud-storage-rs

A crate for uploading files to Google Cloud Storage, and for generating download URLs.

License: MIT License
There are times when we would like to specify additional headers when calling `create` or similar functions. It would be nice to have a separate set of methods that include this ability as a "mix-in", as sketched below.
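A minimal sketch of what that could look like; `create_with_headers` is hypothetical and does not exist in the crate today:

```rust
use reqwest::header::{HeaderMap, HeaderValue, CACHE_CONTROL, CONTENT_DISPOSITION};

// Build the extra headers the caller wants sent along with the upload.
let mut extra = HeaderMap::new();
extra.insert(CACHE_CONTROL, HeaderValue::from_static("no-store"));
extra.insert(
    CONTENT_DISPOSITION,
    HeaderValue::from_static("attachment; filename=\"report.pdf\""),
);

// Hypothetical "mix-in" variant of `create`; the existing method would keep
// its current signature.
// client.object().create_with_headers(bucket, content, name, mime_type, extra).await?;
```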
Some conditions are not available in `Condition`.
https://cloud.google.com/storage/docs/json_api/v1/buckets#lifecycle
https://cloud.google.com/storage/docs/lifecycle#conditions
These are missing:
Also separate:
I think the documentation for `ActionType::Delete` is incorrect: it does not delete the bucket, but the objects inside of it.
https://cloud.google.com/storage/docs/lifecycle#delete
I can make a PR for this (and other issues) at some point in the future, but just an issue to keep track of it for now.
Am I correct to assume that `Object::list` actually fetches all items? The `ListRequest` object accepts `page_token` and `max_results`, but looking at

```rust
pub async fn list(
    bucket: &str,
    list_request: ListRequest,
)
```

it actually iterates over page tokens and fetches the whole list. I really wish this were made clearer somewhere.
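If you only need a bounded number of results, something like this should work, assuming `ListRequest`'s `max_results` field (named in the issue above) maps straight onto the JSON API parameter of the same name:

```rust
// Ask Google for at most 100 results rather than walking every page.
// (Field name taken from the issue above; check your crate version.)
let request = ListRequest {
    max_results: Some(100),
    ..Default::default()
};
```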
About 60 minutes after starting my service, I always run into the following error:

```
error: Reqwest(reqwest::Error { kind: Request, url: "https://www.googleapis.com/oauth2/v4/token", source: hyper::Error(Connect, ConnectError("dns error", ResolveError { kind: Proto(ProtoError { kind: Message("could not send request"), backtrack: None })
```

I am doing `Object::download_sync` at this time. I think this coincides with the initial token expiration. I don't think the DNS request is failing (running on GKE), but I'm struggling to find out where the error actually is. Running version 0.6.
It appears that the serde attributes for at least some of the structs don't match what GCP is actually returning. For instance, a bucket can have a storage class of "ARCHIVE", which is not in the `StorageClass` enum. I'm happy to add missing items as a PR, but is there a canonical source for the schema (specifically with regard to whether an item is an `Option`)? I haven't been able to find one, but I'm reluctant to start making a bunch of fields optional, such as the `cors` field.
The crate currently depends on Tokio ^0.2, but it might be beneficial to update to the new version, since it's considered a beta before 1.0.
If you try to run a program that uses both the `cloud-storage` and tokio 0.3 crates, you get a "thread 'main' panicked at 'not currently running on the Tokio runtime.'" error.
For now, I can use the `tokio-compat-02` crate as a workaround; it could also be used while updating `cloud-storage` to tokio 0.3, until its other dependencies update themselves as well.
Would you be open to a pull request adding an async API that uses `reqwest::Client` rather than `reqwest::blocking::Client`? If that's not something you would want to maintain, I understand :)
In addition to making the code hard to reason about, having globals makes it difficult to use multiple instances of things at a time: multiple service accounts, multiple tokio runtimes (when dealing with sync code), and concurrent tests.
Refactoring this library into a `Client` with `Object` and `Bucket` types that carry instance methods rather than only static ones would be ideal. A modest rewrite would be nice.
but the whole thing still feels weird, since basically every other library takes `Item = Bytes`, especially for interop with `reqwest::wrap_stream`. I think that the `Download` with `fn size`, keeping `Item = Bytes`, was actually the correct approach.
`BucketClient::list` appears to always return an error:

```
[src/main.rs:6] client.bucket().list().await = Err(
    Reqwest(
        reqwest::Error {
            kind: Decode,
            source: Error("data did not match any variant of untagged enum GoogleResponse", line: 0, column: 0),
        },
    ),
)
```

It also seems to have some `dbg!()`s left in: it runs the query twice just to debug-print it (which is how we can see that the response successfully gets the buckets). It should not retry in this case, as it is not going to magically start working without changing the permissions.
```rust
let x = client.object().list(&bucket, Default::default()).await?;
let count = x
    .map_err(|e| dbg!(anyhow!(e).context("getting data from GCS")))
    .map_ok(|l| {
        println!("got ObjectList with {} items", l.items.len());
        l
    })
    .count()
    .await;
```
Also, the documentation claims that "This function will repeatedly query Google and merge the responses into one.", which is untrue. This function instead returns a `Stream` of `Result<ObjectList>`s, as shown clearly by the return type. You might want to fix the description while you're there.
I've used (and contributed to) the gcp_auth crate, which IMO is a pretty nice and simple way to deal with authentication/authorization for GCP. Maybe it would be nice to keep this crate focused on specific Cloud Storage APIs and integrate with it (or another existing crate) for authn/authz?
(reported on 0.8.3) The `println!` and `dbg!` macros in `src/resources/object.rs` spit out a bunch of stuff to stdout that makes it difficult to focus on my app. Would it be possible to suppress this, or to use the `log`/`env_logger` crates (or similar) to put this output in its own channel?
For uploading large files to GCS, it's recommended to use resumable uploads. If there's interest, I'd be down to contribute this upstream to this crate. More details on how to do it here: https://cloud.google.com/storage/docs/performing-resumable-uploads
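For reference, a rough protocol sketch using plain reqwest (this is not current crate API; the endpoint and header names come from the documentation linked above, and `token` is assumed to be a valid OAuth2 access token):

```rust
use reqwest::Client;

async fn resumable_upload(
    token: &str,
    bucket: &str,
    name: &str,
    data: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // Step 1: initiate the session; the session URI comes back in `Location`.
    let init = client
        .post(format!(
            "https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o?uploadType=resumable&name={name}"
        ))
        .bearer_auth(token)
        .header("X-Upload-Content-Type", "application/octet-stream")
        .send()
        .await?;
    let session = init
        .headers()
        .get("location")
        .and_then(|v| v.to_str().ok())
        .ok_or("missing session URI")?
        .to_owned();

    // Step 2: PUT the data in chunks. Every chunk except the last must be a
    // multiple of 256 KiB; `Content-Range` tells Google where each one belongs.
    let total = data.len();
    for (i, chunk) in data.chunks(256 * 1024).enumerate() {
        let start = i * 256 * 1024;
        let end = start + chunk.len() - 1;
        client
            .put(&session)
            .header("Content-Range", format!("bytes {start}-{end}/{total}"))
            .body(chunk.to_vec())
            .send()
            .await?;
    }
    Ok(())
}
```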
The "create and then update" workflow does not work well in some cases.
E.g. I am setting obj.cache_control = Some("no-store".to_string()); obj.update().await
to prevent GCS's default forced cache of one hour (which cannot be circumvented from the browser).
Two issues:
Is it possible to add metadata to Object::create
so that this happens in one single request instead of two?
This would solve the two issues above by making object-creation-with-metadata a single atomic step.
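One possible shape for this, purely as a sketch (`create_with` does not exist in the crate today):

```rust
// Hypothetical variant of `create` that accepts metadata up front, so the
// object never exists without its cache_control (etc.) set.
pub async fn create_with(
    bucket: &str,
    file: Vec<u8>,
    filename: &str,
    mime_type: &str,
    metadata: &Object, // e.g. cache_control: Some("no-store".to_string())
) -> crate::Result<Object> {
    todo!("single multipart upload carrying both payload and metadata")
}
```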
This repo is linked from this crate. I want to know if this is correct, since the crate version on crates.io and in the README are different. Thanks.
The documentation implies that errors from Google become an `Error::Google`, but https://docs.rs/cloud-storage/latest/src/cloud_storage/client/object.rs.html#126 always creates an `Error::Other`:

```
Other("{\n \"error\": {\n \"code\": 403,\n \"message\": \"[email protected] does not have storage.objects.delete access to the Google Cloud Storage object.\",\n \"errors\": [\n {\n \"message\": \"[email protected] does not have storage.objects.delete access to the Google Cloud Storage object.\",\n \"domain\": \"global\",\n \"reason\": \"forbidden\"\n }\n ]\n }\n}\n")
```
The current implementation has one global `Client`:

Line 118 in 563124a

While reusing a single hyper/reqwest client is a good idea to allow connection reuse, this implementation can lead to errors from hyper when the executor exits:

```
dispatch dropped without returning error
```

This is explained in hyperium/hyper#2112:

> The `Client` spawns background tasks to monitor the HTTP connection status, and if the executor drops it before it determines the connection was closed, it panics with that message.

The easiest way to trigger this is to use something like `#[tokio::test]` with a large number of tests. Each test will spin up its own executor that ends when the test does. Although less common, it's intended to be totally possible to have multiple executors in a "normal" execution, including starting and stopping them.
Downloads via the current download URLs are served with the default headers set by Google. Sometimes it is necessary to set custom headers here (such as Content-Disposition, to make the browser download the file).
I'm not sure if I'm missing something, but the sync `ObjectClient::list` method seems to return an async `Stream`.

Changes from 0.8.4 to 0.9.0:

https://docs.rs/cloud-storage/0.8.4/cloud_storage/struct.Object.html#method.list_sync:

```rust
pub fn list_sync(
    bucket: &str,
    list_request: ListRequest
) -> Result<Vec<ObjectList>, Error>
```

was transformed into:

https://docs.rs/cloud-storage/0.9.0/cloud_storage/sync/struct.ObjectClient.html#method.list

```rust
pub fn list(
    &self,
    bucket: &'a str,
    list_request: ListRequest
) -> Result<impl Stream<Item = Result<ObjectList>> + 'a>
```

Is this intentional? Shouldn't this method return a `Result<Vec<ObjectList>, Error>`, as its global-client counterpart does?
This makes it impossible to reuse the same `Client` in a multi-threaded tokio runtime.
At the moment, the service account is initialized lazily, which makes things inconvenient. I would like to pass a JSON string with all credentials known at compile time. One option is to wrap SERVICE_ACCOUNT in a Mutex; then it would be possible to initialize it manually at the beginning of the program, but that solution is not elegant either. Another option could be a global state/context initialized by the user at any point in time and passed as a reference to the functions that currently consume SERVICE_ACCOUNT and TOKEN_CACHE.
What do you think?
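For concreteness, the second option could look something like this (every name here is hypothetical; nothing like it exists in the crate yet):

```rust
// Hypothetical explicit-context API: credentials are loaded once by the
// caller and handed to each call, instead of a lazily-initialized global.
let ctx = CloudStorageContext::from_json(include_str!("../service-account.json"))?;
let object = Object::create_in(&ctx, "bucket", content, "name.txt", "text/plain").await?;
```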
Recently, we encountered an error we traced back to a lack of a timeout when the `reqwest` client makes a call. It would be helpful either to be able to specify the timeout value or, ideally, to have access to the `client` attribute so that reqwest behaviors can be altered when needed (e.g. `Client::with_client(req: reqwest::Client)`, similar to `Client::with_cache(...)`). Alternate implementations are always welcome.
The behaviour of the `create_streamed` method on the sync `ObjectClient` does not conform to its description. From https://docs.rs/cloud-storage/0.10.2/src/cloud_storage/sync/object.rs.html:

```rust
/// Create a new object. This works in the same way as `ObjectClient::create`, except it does not need
/// to load the entire file in ram.
pub fn create_streamed<R>(
    &self,
    bucket: &str,
    mut file: R,
    length: impl Into<Option<u64>>,
    filename: &str,
    mime_type: &str,
) -> crate::Result<Object>
where
    R: std::io::Read + Send + 'static,
{
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)
        .map_err(|e| crate::Error::Other(e.to_string()))?;
```

While the function description says it does not need to load the whole file into memory, the method starts by loading the whole file into a buffer. This leads to out-of-memory errors when trying to upload files that cannot fit into memory.
I noticed that you're using SERVICE_ACCOUNT instead of GOOGLE_APPLICATION_CREDENTIALS for the environment variable; is there any particular reason for this? I think GOOGLE_APPLICATION_CREDENTIALS should probably be used, since that is how GCP services provide the service account file.
Currently it is possible to create download URLs to grant users unauthenticated access to files. It is, however, not possible to create upload URLs, to allow users to upload files.
Google bucket names are not allowed to contain uppercase characters. The library should maybe validate or normalize bucket name strings to enforce these rules. For more info: https://cloud.google.com/storage/docs/naming-buckets
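A minimal validation sketch based on the rules in that page (deliberately not exhaustive: the full rules also cover dotted names, IP-address lookalikes, and "goog" prefixes):

```rust
/// Quick plausibility check for a bucket name; a hypothetical helper, not
/// part of the crate.
fn is_plausible_bucket_name(name: &str) -> bool {
    let alnum = |c: char| c.is_ascii_lowercase() || c.is_ascii_digit();
    (3..=63).contains(&name.len())
        && name.starts_with(alnum)
        && name.ends_with(alnum)
        && name.chars().all(|c| alnum(c) || matches!(c, '-' | '_' | '.'))
}

// is_plausible_bucket_name("my-bucket-01") == true
// is_plausible_bucket_name("MyBucket")     == false (uppercase is rejected)
```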
Hello,
I need a Read + Seek for a GCS object. Is there a way to get this? Thanks!
Best, Oliver
Thanks for creating this crate! As I'm getting started, I'm noticing some errors/shortcomings in the documentation, which I want to document here so they can be fixed.

In README.md:

> The service account requires the permission devstorage.full_control

It doesn't look like this is a valid permission anymore.

In README.md, the example

```rust
// create a new Bucket
let new_bucket = NewBucket { name: "mybucket", ..Default::default() };
let bucket = Bucket::create(new_bucket).await?;
// upload a file to our new bucket
let content = b"Your file is now on google cloud storage!";
bucket.upload(content, "folder/filename.txt", "application/text").await?;
```

fails with: no method named `upload` found for struct `cloud_storage::Bucket`.

In the features section on docs.rs:

> global-client -- This feature flag does not enable additional features.

However, there are methods gated by this feature, e.g. `Bucket::list`.

Missing permissions: `Bucket::read` requires the `storage.buckets.get` permission, which is not mentioned in the README. This permission is not included in the Service Account Token Creator or Storage Object Admin roles.
Would you be willing to support more authentication types, especially `authorized_user`? I wrote a small CLI tool that I like to run locally. It would be much easier to adopt if cloud-storage could leverage the existing authentication of the `gcloud` command-line tool, which writes `~/.config/gcloud/application_default_credentials.json` with content like this:

```json
{
  "client_id": "...some email",
  "client_secret": "...some random string",
  "quota_project_id": "...some project id",
  "refresh_token": "...some random string",
  "type": "authorized_user"
}
```

Would you be interested in a PR that is able to read this file instead of only service accounts? (Skimming through the code, I would probably make `ServiceAccount` part of an enum that serde distinguishes by the `type` field.)
It returns a `Stream` of `Result<ObjectList>`, which on one level is fine, because it is exactly what Google makes available. At least for my use case, though, it would simplify things if there were a method that returned a `Stream` of `Object` instead, as sketched below.
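Something like this should get there with the current return type, assuming `list` returns `impl Stream<Item = Result<ObjectList>>` as documented:

```rust
use futures_util::{stream, TryStreamExt};

// Flatten pages of ObjectList into a single stream of Objects.
let pages = client.object().list(&bucket, Default::default()).await?;
let objects = pages
    .map_ok(|page| stream::iter(page.items.into_iter().map(Ok)))
    .try_flatten(); // impl Stream<Item = Result<Object>>
```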
`download_streamed` would ideally return the `content-length` header (or be a stream with an accurate `size_hint()`) to obviate an unnecessary extra request to fetch the object size.
Support authenticating without the GOOGLE_CLOUD_CREDENTIALS or SERVICE_ACCOUNT environment variables, as they are not provided in Google CloudBuild or CloudRun environments.

Thanks for this awesome, easy-to-use library. I'm a beginner, and the ability to just have something work with a few lines of code is a great feeling. I was looking at the Object API, and it seems like we can download a file as a `Vec<u8>`, which, if I'm not wrong, will download the entire file into memory? Is there any interest in supporting a streaming download? I did notice the download_url API. Is that the recommended way to download large files?
The API does not expose the `reason`, `message`, or other data of `GoogleError` directly. I can not do the following:

```rust
let error: GoogleError = { /* error from somewhere */ };
match error.reason {
    BadRequest => println!("Some message"),
    ParseError => println!("Some other message"),
    _ => println!("Other errors"),
}
```

Also, `GoogleError` does not implement `Display`, so it can not be used in a `format!` or `println!` statement (only `Debug` is implemented).
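What a `Display` impl could look like, as a sketch inside the crate (the `code` and `message` field names are assumed from the JSON API error shape quoted in the issues above):

```rust
use std::fmt;

impl fmt::Display for GoogleError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // e.g. "403: [email protected] does not have storage.objects.delete access ..."
        write!(f, "{}: {}", self.code, self.message)
    }
}
```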
Currently, this library sources credentials for performing operations exclusively from the environment. Although this is convenient most of the time, it introduces two problems:
As it stands, the library config is self-contained within the `ServiceAccount` struct after loading from env. Allowing the optional passing in of this struct on creation of the client should solve this issue. All downstream code would need to be updated to use this optional config over the globally-defined `crate::SERVICE_ACCOUNT`, however.
It would be useful for optional headers (especially if-match) to be implemented so that logic can live on the server side. Ideally, all optional headers should be accessible in this library to support the full functionality of Google Cloud's JSON API. I just happen to need if-match and if-none-match for my use case.
The `#[tokio::main]` attr attached to all `_sync` methods spawns a multi-threaded runtime, which can hurt performance for heavy users of those APIs. Although I'd recommend removing the `_sync` methods entirely to make the overhead of using them explicit, it can be lessened somewhat by a wrapper method that uses the basic scheduler, which is what reqwest does; see the sketch below.
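A sketch of such a wrapper (written against the tokio 1.x builder API for illustration; on tokio 0.2 the equivalent is `basic_scheduler()`):

```rust
/// Run a future to completion on a lightweight single-threaded runtime,
/// instead of the multi-threaded one #[tokio::main] spins up per call.
fn block_on<F: std::future::Future>(fut: F) -> F::Output {
    tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .expect("failed to build runtime")
        .block_on(fut)
}
```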
Support partial responses (https://cloud.google.com/storage/docs/json_api#partial-response) to speed things up.
For a bucket which contains more than 60K
The recommended way for applications in GKE to access GCP services is now via Workload Identity (WI). This replaces the need to use service accounts to provide credentials; the token is instead obtained from the GKE metadata server.
Are there plans for this crate to support this mode of authentication? As mentioned in #92, gcp_auth is a good candidate to handle auth; its documentation states that it supports getting the token from the metadata server. Are there plans to work on #92, or to support WI auth in some other way? I looked through the code of this crate, but couldn't find a clear place where alternative auth methods could be slotted in.
Based on my understanding,

```rust
let object = Object::download(&self.bucket.name, path.as_ref()).await?;
```

should return the object if available, or an error (`Error::Google`) if it does not exist. When tested with a file which does not exist:

```rust
debug!("{:?}", String::from_utf8(object.clone()).unwrap());
```

prints:

```
No such object: whatever_object_name
```

The log clearly says that the file is a 404:

```
DEBUG reqwest::async_impl::client > response '404 Not Found' for https://www.googleapis.com/storage/...
```

Am I missing something, or is there a problem with the `download` method? Using the latest 0.7 version. Thanks
Thanks for your work here! Would you mind including a LICENSE file to clarify how we're permitted to use this? In the US, anyway, I think the default is exclusive copyright (no one else is permitted to use it).
Edit: Just noticed Cargo.toml has MIT specified.
The module docs include this example for renaming/moving a file:

```rust
let client = Client::default();
let mut object = client.object().read("mybucket", "myfile").await?;
object.name = "mybetterfile".to_string();
client.object().update(&object).await?;
```

However, `update` uses the name of the given object to know what to apply the changes to. So when you try to rename a file this way, Google will tell you that it wasn't found (since it doesn't exist yet). At the very least, this example should be removed and the docs updated (`update`, as well as `compose`, seem to have the same doc entry as `read`). We can perform the operation with a `rewrite`/`copy` to the same bucket, but that's less than ideal, especially because `rewrite` has a `dbg!` left in.
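For the record, the copy-then-delete workaround looks roughly like this (the exact `copy`/`delete` signatures here are assumptions; check the version you're on):

```rust
// "Rename" as copy-then-delete, since GCS has no real rename.
let client = Client::default();
let object = client.object().read("mybucket", "myfile").await?;
client.object().copy(&object, "mybucket", "mybetterfile").await?;
client.object().delete("mybucket", "myfile").await?;
```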
Hello,
I'm using this library for work to read from Google Cloud Storage; however, I'm getting a permission error. I'm assuming this (from the README):

> Authorization can be granted using the SERVICE_ACCOUNT environment variable, which should contain path to the service-account-*******.json file that contains the Google credentials. The service account requires the permission devstorage.full_control. This is not strictly necessary, so if you need this fixed, let me know!

has something to do with it?

```
Running `target\debug\something.exe`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Other("got error response from Google: [email protected] does not have storage.buckets.get access to the Google Cloud Storage bucket.")', src\main.rs:5:18
```

Would love to see this fixed, as I definitely do have read + file create + write permissions, just not full access.
The library currently offers features allowing you to choose whether to use rustls or openssl as the TLS backend. However, it also declares a direct, non-optional dependency on openssl. If we keep this dependency, having a rustls feature doesn't really make much sense. Openssl is also quite difficult to depend on when cross-compiling (especially since cross does not support it anymore), so having the option to use this crate without openssl would be very helpful.
I'm getting a response:

```
&buckets = Err(
    Reqwest(
        reqwest::Error {
            kind: Decode,
            source: Error("data did not match any variant of untagged enum GoogleResponse", line: 0, column: 0),
        },
    ),
)
```

I'm pretty sure I have the correct rights, as when I didn't have them I was receiving an empty list. I have yet to find out what is actually returned as a response for this request.
It would be nice if `cloud-storage` would allow users to connect to an emulator like https://github.com/oittaa/gcp-storage-emulator or https://github.com/fsouza/fake-gcs-server for local testing. For this, the following behavior changes are required:
When accessing a download URL for an object that was successfully stored, I'm seeing this:

```xml
<Error>
  <Code>AccessDenied</Code>
  <Message>Access denied.</Message>
  <Details>
    Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.
  </Details>
</Error>
```

even though I have given the account `roles/storage.objectAdmin` access to the bucket:

```hcl
resource "google_storage_bucket_iam_member" "artifacts-admin" {
  bucket = google_storage_bucket.artifacts.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.artifacts-service-account.email}"
}
```

Any idea what permission is missing here?

Update: this is an example of the URL that was created:

https://storage.googleapis.com/testflight-mn-artifacts/dev-credit-notes/masht.pdf?X-Goog-Algorithm=GOOG4-RSA-SHA256\u0026X-Goog-Credential=testflight-artifacts-account%40misthos-network.iam.gserviceaccount.com%2F20200826%2Fhenk%2Fstorage%2Fgoog4_request\u0026X-Goog-Date=20200826T113518Z\u0026X-Goog-Expires=600\u0026X-Goog-SignedHeaders=host\u0026X-Goog-Signature=4a89db80bf1876d80b45b85b8a37519adb46fd66cbc5747ada3a0b47e9ca65a02029bbac0e5c4f844e30af3c6ed2989660709ce98a2b2b890a34a99b66d325f443d2c482460a2da9e09d1b130db70239dd3f8a81ac664770a2749ad7351b3fc884bcec57a0bebc7ad96bff44273f87a68c4783508ae44428e2651cdd846cfbbe997999df107463b49b2c9d0310b23b4588ae4f2c36b582bbaef5d773ca176a2c205195f9a46587ce65704c850f50dc1a0f6eb94119dba952266b2173247c84848742f169cc83cda68d07c5f895b66ea1ec2fbc858cd145f02be140ae0a262f7767b10be21e77c7ba64331b683cee78cffd4e17c6bdf4e5c3088e46e982616f17

I'm wondering if the HTML escaping (e.g. \u0026) plays a role.
Currently, Object `rewrite` is not fully supported. This is because the requester should check whether the `done` flag is set to `true`, as described here:

> This method copies data using multiple requests so large objects can be copied with a normal length timeout per request rather than one very long timeout for a single request. In such cases, you must keep calling the endpoint until the rewrite response `done` flag is `true`. If the flag is `false`, include the `rewriteToken` that's returned in the response in the subsequent request. [...]

--- https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite

In order to properly support this, the request should be repeated as described by the documentation. Right now the code does not check whether the flag is set to done, and just returns the `Object`. Also, it looks like (reading the documentation) the `resource` item will only be present if the rewrite completes in that one transaction, so deserialization will fail because the `resource` field in `RewriteResponse` is not marked as `Option<Object>`. So currently the function call will fail on deserialization if the object was not fully transferred, which is less than ideal. It should do multiple transactions, or fail by checking whether the `done` flag was set to `false`; see the sketch below.
Slashes in the object name need to be URL-encoded. The workaround I have used is to do that encoding prior to calling `ObjectAccessControl`, i.e.:

```rust
let filename = "foo/bar";
Object::create("bucket", content, filename).await?;
let filename = filename.replace("/", "%2F");
ObjectAccessControl::create("bucket", &filename, &acl).await?;
```