Comments (5)
Hello! One quick question: is this only used once the file has been uploaded, i.e. just for serving it? Otherwise, what would happen in these two scenarios:
- The storage somehow corrupts the file. Then it is not our responsibility (is it?).
- The file gets corrupted on the wire (between Invenio and the storage) and we have no means of cross-checking it if we do not calculate the checksum in Invenio.
from invenio-s3.
I see just one problem with that approach, but perhaps I'm not aware of a possible solution or it's handled otherwise ;)
When using multipart uploads, a checksum is calculated for every part that is uploaded, and those are then used to calculate the final checksum. Is the result really usable? The normal checksum type in S3 is MD5, but I can't imagine that's correct for multipart uploads.
@wgresshoff you are definitely right, and I don't have an answer for that.
@ppanero The checksum stored in Invenio's database is calculated at upload time, i.e. we compute the content digest and its hex hash. So yes, it is only used for integrity checks afterwards.
Now, if we want to verify the file integrity, we could do one of two things: (i) ask the storage server for the checksum and compare it with the one we have stored (from the upload), or (ii) calculate the checksum on our end and compare it with the one we have stored.
The first option is currently only doable for smaller files, i.e. those uploaded without multipart. As soon as you upload a big file, you get an ETag that is a combination of the hashes of each of its parts (what @wgresshoff pointed out).
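To illustrate why the multipart ETag cannot be compared directly with an MD5 of the whole file, here is a minimal sketch of the commonly observed scheme: the MD5 of the concatenated binary MD5 digests of the parts, followed by `-` and the part count. The function name `multipart_etag` is hypothetical, and treating a single-part payload as a plain upload is an assumption of this sketch, not something the thread or the storage API guarantees.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Return an ETag-style value for `data` split into `part_size`-byte parts.

    Hypothetical helper: mirrors the commonly observed multipart scheme,
    not an official API.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        # Assumption: a payload that fits in one part is a plain upload,
        # whose ETag is simply the MD5 of the content.
        return hashlib.md5(data).hexdigest()
    # Hash each part, then hash the concatenation of the binary digests.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

The resulting value depends on the part size used at upload time, so it can only be reproduced locally if that part size is known.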
The problem with the second option is that it is time-consuming, since you have to read the entire file, but it works for both small and big files. Plus, if you use a service such as AWS S3, you have to pay for the extra traffic.
Perhaps "the middle way" is the solution here: if we can get the checksum from the server, use it; otherwise calculate it ourselves...
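The "middle way" could be sketched roughly as follows. This is not invenio-s3 code: `verify`, `md5_of_stream` and `open_stream` are hypothetical names, the stored checksum is assumed to be a bare MD5 hex string, and the ETag of a non-multipart object is assumed to be the MD5 of its content (which may not hold for some server-side encryption modes).

```python
import hashlib
import io
import re
from typing import BinaryIO, Callable

# A plain (non-multipart) ETag is assumed to look like a bare MD5 hex digest.
_PLAIN_MD5 = re.compile(r"^[0-9a-f]{32}$")


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a stream without loading it at once."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify(stored_md5: str, etag: str,
           open_stream: Callable[[], BinaryIO]) -> bool:
    """Use the server-side ETag when it is a plain MD5, else re-hash locally."""
    candidate = etag.strip('"').lower()
    if _PLAIN_MD5.match(candidate):
        # Cheap path: no download needed.
        return candidate == stored_md5.lower()
    # Multipart-style ETag (e.g. "<hex>-<parts>"): read the whole file.
    with open_stream() as stream:
        return md5_of_stream(stream) == stored_md5.lower()


# Usage with in-memory data standing in for the stored object.
content = b"example file content"
stored = hashlib.md5(content).hexdigest()
print(verify(stored, f'"{stored}"', lambda: io.BytesIO(content)))
print(verify(stored, '"deadbeef-2"', lambda: io.BytesIO(content)))
```

The expensive full read then only happens for objects whose ETag cannot be interpreted as a plain MD5, which limits both the time and the extra traffic mentioned above.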
Middle way seems the best trade-off, thanks for the explanations :)