Comments (8)

ulfjack commented on June 23, 2024

That would require a protocol change every time a new digest function is introduced. I'd prefer to go in the other direction and make the digest function more flexible instead. For example, I think we might want to introduce a digest function whose computation over a single blob can be parallelized in some form.

EricBurnett commented on June 23, 2024

Another option would be to simply add a second string field with the scheme. Slightly larger payload, but works for everyone without additional proto changes.
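
For concreteness, a minimal sketch of that option against the current two-field Digest (the added field's name and number are purely illustrative, not an agreed proto change):

    message Digest {
      // The lowercase hex hash of the blob, as today.
      string hash = 1;
      // The size of the blob in bytes, as today.
      int64 size_bytes = 2;
      // Illustrative addition: the hashing scheme used to compute `hash`,
      // e.g. "sha256" or "md5". Unset could mean the server's default.
      string hash_scheme = 3;
    }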

I don't think that's the hard part though, unfortunately. The challenge we found in supporting multiple hashing schemes is that systems expect consistency. E.g. Execute() involves building trees of digests and messages referenced by hash, and relies on all of that being deterministic. Clients and Workers need to be using the same hashing algorithm, etc. During an inline transition from SHA1 to SHA256 we did have to have our CAS support both for a while, and switched based on observed hash length; if we expected to do it regularly we'd probably want some indicator in the Digest message to make it clear. But as it required effectively a whole-ecosystem change (client, Execution API server, workers, etc. all agreeing), we chose to drop dual support as quickly as we could, and considered it a transitional hack rather than a long-term state. And if you can avoid dual-supporting within one logical namespace, you can avoid being explicit about the algorithm at the Digest level at all, because it's implicit and there's only one valid choice. (This is what we do today, with the algorithm in the Capabilities API.)

It may be different if you're doing, say, Remote Caching only, so that it's only clients that have to agree with themselves; but in that scenario it also seems simpler to point them at two different instances of the CAS (or namespaces within one), each supporting only one algorithm, than to have one logical instance with blobs of two algorithms in parallel. Otherwise, dual support requires a lot more than just being clear in the Digest, unfortunately.
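
As a rough illustration of the "namespaces within" idea: the per-request instance_name that the API already carries could be used to carve out one namespace per algorithm, with no change to the Digest message at all (the naming convention below is an assumption, not part of the spec):

    // Abridged sketch of an existing request type; only instance_name matters here.
    message FindMissingBlobsRequest {
      // Assumed convention: encode the algorithm in the instance name,
      // e.g. "builds/sha256" vs. "builds/md5", so that each namespace
      // only ever holds blobs keyed by a single digest function.
      string instance_name = 1;
      // The digests to look up, all computed with that namespace's function.
      repeated Digest blob_digests = 2;
    }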

Edit: Corrected 'compression' and 'encryption' to 'hashing' in text above - brain fail.

mostynb commented on June 23, 2024

Specifying a hash function explicitly might be useful for other scenarios besides a single client migrating from one hash to another. For example, multiple clients operating on different codebases/languages could share a build cache, because it's cheaper to have one large/fast storage service than multiple pre-partitioned caches.

Would adding an enum field be simpler than a oneof?

Or, if we want more flexibility, add an int field to the Digest message that refers to an index into the digest_function list returned by the Capabilities API.
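
Either variant would be a small change to the Digest message. A hedged sketch of both options, writing the existing capability enum as DigestFunction (field names and numbers here are illustrative only):

    message Digest {
      string hash = 1;
      int64 size_bytes = 2;

      // Option A (illustrative): tag each digest with the enum the
      // Capabilities API already defines for digest functions.
      DigestFunction digest_function = 3;

      // Option B (illustrative alternative): instead of the enum, an index
      // into the digest_function list returned by GetCapabilities.
      // int32 digest_function_index = 3;
    }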

ulfjack commented on June 23, 2024

If you want it to be cheaper, then you probably want the same digest as well so you can share cache hits? Even if you have different languages on different OSs, there's a good chance that some of the stuff can be shared.

mostynb commented on June 23, 2024

> If you want it to be cheaper, then you probably want the same digest as well so you can share cache hits?

Yes, of course. IMO clients should be encouraged to use the same hash function, but not forced to do so. Server implementations that want to give cache hits even across clients using different hash functions can still do so, for example by indexing the same blob under each function's digest.

> Even if you have different languages on different OSs, there's a good chance that some of the stuff can be shared.

I'm skeptical that this would help much in practice, but maybe my use cases are different to yours.

ulfjack commented on June 23, 2024

Regarding my previous post: I can imagine significant overlap between clients, and I can also imagine cases where there's no overlap.

Regardless of that, I'm not convinced. I spent some more time thinking about it, and it seems to me that the current protocol allows you to do what you are describing and that your actual technical problems lie elsewhere.

As a counter-proposal, you could just have your service provide multiple endpoints for clients that want to use different digest algorithms, e.g. one endpoint for sha256 and one for md5. For example, you could use a single unified storage service with an S3-like API, and then run a separate REAPI proxy for each digest algorithm. Think about it: what are the technical implications of such a setup?

My conclusion is that you can't avoid one of the following:
Either you must a) convert everything to a single digest algorithm in the proxy, or you must b) partition the storage to avoid overwriting each other's data.

a) This means you have to re-digest every blob N times on write, since you need a lookup from digest to blob for each of the N digest algorithms. However, your assumption is that the different clients do not share cache entries, which means this cost is not warranted.

b) This is easy enough - either the digests already have different value spaces (e.g., different lengths), or you prefix the digest with the algorithm name for the key, e.g. sha256:ababab81782 and md5:1234567. You can share the storage on the backend, and you don't get any cross-client caching.

This approach seems perfectly viable, and it avoids adding complexity to the protocol that would affect all clients and servers rather than just the use case described here.

All in all, I would vote against this proposal on the grounds that the complexity shouldn't be in the protocol: if you want to do this, it can be done in a way where only you have to pay for the complexity rather than everyone.

I could be convinced if a case were made that a large number of users end up with this complexity, but I don't see how that could possibly be true, especially for smaller users (and it seems fine for larger users to carry the complexity burden themselves).

mostynb commented on June 23, 2024

> b) This is easy enough - either the digests already have different value spaces (e.g., different lengths), or you prefix the digest with the algorithm name for the key, e.g. sha256:ababab81782 and md5:1234567. You can share the storage on the backend, and you don't get any cross-client caching.

I think most implementations would choose to allow different hash functions, each with its own keyspace, and not bother maintaining cross-hash indexes.

But IMO a separate field in the Digest would be better than embedding this in the hash string. My preference would be an enum value rather than a string: it's not that often that implementations will want to add new hash functions, and it avoids the hassle of maintaining consistently capitalised/styled/etc. string values. And we already have such an enum.

@EricBurnett: I'm a bit confused by your mention of encryption and compression above; could you provide a little more background on that?

EricBurnett commented on June 23, 2024

@mostynb whoops! Those should both have said 'hashing'; I clearly had my head in some other space when writing that, apologies for the confusion. I have corrected the comment above, so it should hopefully be clear now.
