Comments (8)
That would require a protocol change every time a new digest function is introduced. I'd prefer to go the other direction and make the digest function more flexible instead. For example, I think that we might want to introduce a digest function where the function itself over a single blob can be parallelized in some form.
from remote-apis.
Another option would be to simply add a second string field with the scheme. Slightly larger payload, but works for everyone without additional proto changes.
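A minimal proto sketch of that option (the field name and number are hypothetical, not part of REv2):

```proto
// Sketch only: REv2's Digest with a hypothetical scheme field added.
message Digest {
  // Lowercase hex hash, as today.
  string hash = 1;
  int64 size_bytes = 2;
  // Hypothetical addition: the digest function's name, e.g. "sha256".
  // An empty value would mean "the instance's single configured
  // function", keeping existing clients compatible.
  string digest_function = 3;
}
```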
I don't think that's the hard part though, unfortunately. The challenge we found in supporting multiple hashing schemes is that systems expect consistency. E.g. Execute() involves building trees of digests and messages referenced by hash, and relies on determinism of all this. Clients and Workers need to be using the same hashing algorithm, etc.

During an inline transition from SHA1 to SHA256 we did have to have our CAS support two for a while, and switched based on observed hash length; if we expected to do it regularly we'd probably want some indicator in the Digest message to make it clear. But as it required effectively a whole-ecosystem change (client, Execution API server, workers, etc. all agreeing), we chose to drop dual support as quickly as we could, and considered it a transitional hack rather than a long-term state.

And if you can avoid dual-supporting within one logical namespace, you can avoid being explicit about the algorithm at all at the Digest level, because it's implicit and there's only one valid choice. (This is what we do today, with the algorithm in the Capabilities API.)
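The length-based switching mentioned above can be sketched in a few lines (the mapping and function name are illustrative, not the actual implementation):

```python
import hashlib

# Hex digest lengths of the two algorithms supported during the
# transition (illustrative; SHA-1 = 20 bytes, SHA-256 = 32 bytes).
_HEX_LEN_TO_ALGO = {
    40: "sha1",
    64: "sha256",
}

def detect_algorithm(hash_hex: str) -> str:
    """Guess the digest function from the observed hash length."""
    try:
        return _HEX_LEN_TO_ALGO[len(hash_hex)]
    except KeyError:
        raise ValueError(f"unrecognised digest length: {len(hash_hex)}")

print(detect_algorithm(hashlib.sha1(b"blob").hexdigest()))    # sha1
print(detect_algorithm(hashlib.sha256(b"blob").hexdigest()))  # sha256
```

This only works while the supported algorithms happen to have distinct output lengths, which is part of why an explicit indicator would be needed for anything longer-term.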
May be different if you're doing say Remote Caching only and so it's only clients that have to agree with themselves, but in that scenario it also seems simplest to point them at two different instances of the CAS (or namespaces within) each that supports only one algorithm than to have one logical instance with blobs of two algorithms in parallel. Otherwise, dual support requires a lot more than just being clear in the Digest, unfortunately.
Edit: Corrected 'compression' and 'encryption' to 'hashing' in text above - brain fail.
Specifying a hash function explicitly might be useful for other scenarios besides a single client migrating from one hash to another. For example, multiple clients operating on different codebases/languages but sharing build cache because it's cheaper to have one large/fast storage service than multiple pre-partitioned caches.
Would adding an enum field be simpler than oneof?
Or if we want more flexibility, add an int field to the Digest message that refers to an index in the digest_function list returned by the capabilities API.
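Both variants might look roughly like this (sketch only; the field names and numbers are hypothetical, though REv2 does already define the `DigestFunction.Value` enum used by the Capabilities API):

```proto
// Sketch of the two suggested variants, shown as alternative field 3s.
message Digest {
  string hash = 1;
  int64 size_bytes = 2;

  // Variant 1: reuse the existing DigestFunction.Value enum.
  DigestFunction.Value digest_function = 3;

  // Variant 2 (alternative to the above): an index into the
  // digest_functions list returned by the Capabilities API.
  // int32 digest_function_index = 3;
}
```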
If you want it to be cheaper, then you probably want the same digest as well so you can share cache hits? Even if you have different languages on different OSs, there's a good chance that some of the stuff can be shared.
If you want it to be cheaper, then you probably want the same digest as well so you can share cache hits?
Yes, of course. IMO clients should be encouraged to use the same hash function, but not forced to do so. Server implementations can support cache hits for the same blob with different hash functions if they want to give cache hits even with clients using different hash functions.
Even if you have different languages on different OSs, there's a good chance that some of the stuff can be shared.
I'm skeptical that this would help much in practice, but maybe my use cases are different to yours.
Regarding my previous post: I can imagine significant overlap between clients, and I can also imagine cases where there's no overlap.
Regardless of that, I'm not convinced. I spent some more time thinking about it, and it seems to me that the current protocol allows you to do what you are describing and that your actual technical problems lie elsewhere.
As a counter-proposal, you could just have your service provide multiple endpoints for clients that want to use different digest algorithms, i.e., one endpoint for sha256 and one for md5. For example, you could use a single unified storage service using an s3-like API, and then run separate proxies using REAPI for each of the digest algorithms. Think about it: what are the technical implications of such a setup?
My conclusions are that you can't avoid the following:
Either you must a) convert everything to a single digest algorithm in the proxy, or b) partition the storage to avoid overwriting each other's data.
a) This means you have to re-digest N times for every blob write since you need a lookup from digest to blob and you need N digest algorithms. However, your assumption is that the different clients do not share cache entries, which means this cost is not warranted.
b) This is easy enough: either the digests already have different value spaces (e.g., different lengths), or you prefix the digest with the algorithm name for the key, e.g. sha256:ababab81782 and md5:1234567. You can share the storage on the backend, and you don't get any cross-client caching.
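Option (b) can be sketched in a few lines (the names here are illustrative, not from any real implementation):

```python
import hashlib

def storage_key(algorithm: str, data: bytes) -> str:
    """Build a per-algorithm CAS key such as 'sha256:<hex>'.

    Prefixing with the algorithm name partitions the keyspace, so
    blobs digested with different functions can never collide, even
    if two algorithms produced hashes of the same length.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{algorithm}:{digest}"

# The same blob lives under two independent keys: no cross-client
# cache sharing, but also no re-digesting on the read path.
blob = b"hello"
print(storage_key("sha256", blob))
print(storage_key("md5", blob))
```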
This approach seems perfectly viable and requires no additional complexity in the protocol that would affect all clients and servers, not just the use case described here.
All in all, I would vote against this proposal based on the grounds that the complexity shouldn't be in the protocol - if you want to do this, it can be done in a way that only you have to pay for the complexity rather than everyone.
I could be convinced if there was a case made that a large number of users end up with this complexity, but I don't see how that could possibly be true, especially for smaller users (and it seems fine for larger users to have to carry the complexity burden themselves).
b) This is easy enough: either the digests already have different value spaces (e.g., different lengths), or you prefix the digest with the algorithm name for the key, e.g. sha256:ababab81782 and md5:1234567. You can share the storage on the backend, and you don't get any cross-client caching.
I think most implementations would choose to allow different hash functions each with their own keyspace, and not bother maintaining cross-hash indexes.
But IMO a separate field in the Digest would be better than embedding this in the hash string. My preference would be an enum value rather than a string; it's not that often that implementations will want to add new hash functions, and this avoids the hassle of maintaining consistently capitalised/styled/etc. string values. And we already have such an enum.
@EricBurnett: I'm a bit confused about your mention of encryption and compression above, could you provide a little more background on that?
@mostynb whoops! Those should both have said 'hashing', I clearly had my head in some other space when writing that, apologies for the confusion. I have corrected the comment above, it should hopefully be clear now.