Right now, when we ask for the checksum of a file, we digest the file and calculate it appl

Use checksum from storage server instead of calculating it always (invenio-s3, open)

inveniosoftware commented on June 6, 2024

Use checksum from storage server instead of calculating it always

from invenio-s3.

Comments (5)

ppanero commented on June 6, 2024

Hello! One quick question: is this only for files that have already been uploaded, i.e. just for serving them? Otherwise, what would happen in these two scenarios:

The storage somehow corrupts the file. Then it is not our responsibility (is it?).
The file gets corrupted on the wire (between Invenio and the storage) and we have no means of cross-checking it if we do not calculate the checksum in Invenio.

wgresshoff commented on June 6, 2024

I see just one problem with that approach, but perhaps I'm not aware of a possible solution, or it's handled some other way ;)
When using multipart uploads, a checksum is calculated for every part that's uploaded, and those are then used to calculate the final checksum. Is the result really usable? The normal checksum type in S3 is MD5, but I can't imagine that's what you get for multipart uploads.

egabancho commented on June 6, 2024

@wgresshoff you are definitely right, and I don't have an answer for that ☺️

egabancho commented on June 6, 2024

@ppanero The checksum that is stored in Invenio's database gets calculated at upload time, i.e. we digest the content and store the hex hash. So yes, it's just for doing integrity checks afterward.

Now, if we want to verify the file integrity, we could do two things: (i) ask the storage server for the checksum and compare it with the one we have stored (from the upload), or (ii) calculate the checksum on our end and compare it with the one we have stored.

The first option is only doable right now for smaller files, i.e. no multipart uploads. As soon as you upload a big file, you get an ETag that is a combination of the hashes of each of its parts (what @wgresshoff pointed out).
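
To make the multipart case concrete, here is a minimal sketch of how such a combined ETag is commonly observed to be built on AWS S3: the MD5 of the concatenated raw MD5 digests of the parts, followed by `-` and the number of parts. This is an assumption based on observed behaviour, not a guarantee for every S3-compatible server or encryption mode, and `multipart_etag` and `part_size` are hypothetical names that are not part of invenio-s3.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Rebuild the ETag style commonly seen for S3 multipart uploads (assumed scheme)."""
    # Raw (binary, not hex) MD5 digest of each part, in upload order.
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    # MD5 over the concatenated part digests, then "-<number of parts>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

The result depends on the part size used for the upload, so it can only be reproduced locally if that size is known, and it never equals the plain MD5 of the whole file.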

The problem with the second option is that it's time-consuming, because you have to read the entire file, but it works for both small and big files. Plus, if you use a service such as AWS S3, you have to pay for the extra traffic.

Perhaps "the middle way" is the solution here: if we can get the checksum from the server, use it; otherwise, calculate it...
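
A rough sketch of that middle way, under a few assumptions: the stored checksum is a bare hex MD5, a usable server-side checksum is an ETag that looks like a plain 32-character hex MD5 (no `-N` multipart suffix), and `open_stream` is a hypothetical callable that returns a readable binary stream of the object. None of these names come from invenio-s3.

```python
import hashlib
import re
from typing import BinaryIO, Callable, Optional

# Assumption: a non-multipart ETag is a bare 32-character hex MD5.
_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def checksum_from_etag(etag: str) -> Optional[str]:
    """Return the MD5 carried by the ETag, or None if it is not a plain MD5."""
    value = etag.strip('"').lower()
    return value if _PLAIN_MD5.fullmatch(value) else None


def local_md5(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Digest the stream in chunks so the file is never fully held in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify(stored: str, etag: str, open_stream: Callable[[], BinaryIO]) -> bool:
    """Prefer the server-side checksum; fall back to reading the whole file."""
    server_side = checksum_from_etag(etag)
    if server_side is not None:
        return server_side == stored
    # Multipart (or otherwise opaque) ETag: download and digest locally.
    with open_stream() as stream:
        return local_md5(stream) == stored
```

With this shape, the download (and the traffic cost) only happens for objects whose ETag cannot be compared directly.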

ppanero commented on June 6, 2024

Middle way seems the best trade-off, thanks for the explanations :)
