Comments (24)
You certainly want to use a KeyManager/Provider. I'd suggest that you take a look at the API that I made for ORC.
You are going to want to support the cloud Key Management Services (KMS), and thus you need to be compatible with their APIs.
Given that you have a KeyManager, you want to record the key name to encrypt with. For the read side to work, you pretty much need to put the encrypted local key and IV into the file's metadata. You absolutely can't have a fixed IV for the table, and you shouldn't have a fixed local key for the table either.
For column encryption, it becomes more interesting. There you have:
- a key name
- the columns to encrypt with which key
- how to mask the unencrypted version of the data (defaulting to nullify)
- a local key per file / encrypted column, which will be stored encrypted in the file's metadata
from iceberg.
I should also mention that ORC's KeyProvider API is a subset of Hadoop's and the default implementation will use the Hadoop implementation. On the cloud providers, we'll need to create an implementation for each service.
I was considering using Palantir's hadoop-crypto library to do the actual encryption portion of things. What do you think about this package?
Column encryption is interesting; we haven't explored it on our side yet, so for now we can only encrypt at the whole-file level. That is to say, our internal storage solution can't store multiple keys to decrypt different portions of the same file, and you'll notice the same limitation in the hadoop-crypto library. So whatever solution we come up with should handle either full-file encryption or per-column encryption. I suppose a given file would be encrypted strictly one way or the other; if we encrypt the whole file, you more or less lose all the benefits of per-column encryption.
Additionally, a key part of performance is reducing the number of round trips to the key storage backend, particularly if the backend supports batch operations. So ideally the KeyManager could get and put multiple keys at once, and the Spark data source and other Iceberg clients would contact the backend as few times as possible.
Tagging @vinooganesh @yifeih to follow.
I understand that column encryption takes file format support, which isn't available yet, although it will be available for ORC soon.
I haven't looked at the details of Palantir's hadoop-crypto library, but the approach looks good. For per-file encryption, I would:
- Define the key name in the Iceberg table metadata.
- When writing, call the KMS to generate a random local key and the corresponding encrypted bytes. Create a random IV for each file. When you update the manifest for the file, add the key name, key version, encryption algorithm, IV, and encrypted local key.
- When reading, use the metadata from the manifest to have the KMS decrypt the key for that file. Use the decrypted key and IV to decrypt the file as needed.
The relevant features:
- The master key stays in the KMS and is never given to the user.
- There is only one trip per file to the KMS during reading or writing.
- The encryption never reuses a local key/iv pair. Reuse of those pairs is very very bad.
- If the user keeps a local key, it can only be used to decrypt that file.
- Rolling new versions of master keys is supported.
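The write/read flow described above can be sketched end to end. This is a hedged illustration only, not Iceberg code: the in-process `masterKey` stands in for the KMS (in a real deployment the master key never leaves the service, and only encrypt/decrypt calls cross the wire), and AES/CTR stands in for whichever algorithm the manifest records.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch of the per-file envelope-encryption flow described above.
public class EnvelopeSketch {
    static final SecureRandom RNG = new SecureRandom();

    static byte[] aesCtr(int mode, byte[] key, byte[] iv, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(data);
    }

    public static byte[] roundTrip(byte[] fileContents) throws Exception {
        byte[] masterKey = new byte[16];          // would live inside the KMS
        RNG.nextBytes(masterKey);

        // Write side: fresh local key and IV per file, never reused.
        byte[] localKey = new byte[16], iv = new byte[16], keyIv = new byte[16];
        RNG.nextBytes(localKey); RNG.nextBytes(iv); RNG.nextBytes(keyIv);
        byte[] encryptedFile = aesCtr(Cipher.ENCRYPT_MODE, localKey, iv, fileContents);
        // "KMS encrypt" of the local key; the manifest would store
        // keyName, keyVersion, algorithm, iv, and encryptedLocalKey.
        byte[] encryptedLocalKey = aesCtr(Cipher.ENCRYPT_MODE, masterKey, keyIv, localKey);

        // Read side: ask the KMS to unwrap the local key, then decrypt the file.
        byte[] decryptedLocalKey = aesCtr(Cipher.DECRYPT_MODE, masterKey, keyIv, encryptedLocalKey);
        return aesCtr(Cipher.DECRYPT_MODE, decryptedLocalKey, iv, encryptedFile);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "row group bytes".getBytes("UTF-8");
        assert Arrays.equals(roundTrip(data), data);
    }
}
```

Note that a leaked `localKey` only exposes the one file it was generated for, matching the feature list above.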
Looking a little deeper at hadoop-crypto, they are doing the key management themselves. I think most users would be better served by using a KMS, such as Amazon's. Using a KMS means that you don't send big secrets to the job, which radically lowers the potential for screw-ups.
I think we can build a top-level interface that, when implemented with a KMS, accomplishes those goals, but that could also be implemented by other means to support other key storage strategies. More precisely, I really don't want to tie us to KMS as the key storage backend for all Iceberg users. Here's a rough sketch of a design that I think has that flexibility:
- Define an `EncryptionMetadata` struct that is an optional column in the `DataFile` schema. The struct has the following schema:

```java
struct EncryptionMetadata {
  String keyName();
  long keyVersion();
  String encryptedKeyAlgorithm();
  byte[] iv();
  byte[] encryptedKeyBytes();
}
```
- Define the structure `EncryptionKey` as follows; it's simply the real key (more or less hadoop-crypto's `KeyMaterial`):

```java
struct EncryptionKey {
  byte[] keyBytes();
  byte[] iv();
}
```
- Define a `KeyManager` that handles both resolution and generation of these keys:

```java
struct GeneratedEncryptionKey {
  EncryptionMetadata keyMetadata();
  EncryptionKey key();
}

interface KeyManager {
  EncryptionKey resolveEncryptionKey(EncryptionMetadata metadata);

  // Optional; can default to calling the individual resolve(...) multiple times.
  List<EncryptionKey> resolveEncryptionKeyBatch(List<EncryptionMetadata> metadatas);

  GeneratedEncryptionKey generateEncryptionKey(String filePath);

  // Optional; can default to calling the individual generate(...) multiple times.
  List<GeneratedEncryptionKey> generateEncryptionKeyBatch(List<String> filePaths);
}
```
- We remark that `GeneratedEncryptionKey` as a pair of decrypted and encrypted key is required because we need both: the decrypted key to actually encrypt the output stream wherever we do the encrypted write itself, and the encrypted key to store in the manifest file after the fact.
Now let's suppose we implemented all of the above with KMS as the storage backend. I'd suppose we could provide this as the default implementation that ships with Iceberg:

- `KeyManager#resolveEncryptionKey` goes to the KMS and asks it to decrypt the key, given the metadata's key name and the encrypted bytes. The KMS uses the master key and passes back the decrypted key; the `KeyManager` converts that into an `EncryptionKey` object and returns it.
- `KeyManager#generateEncryptionKey` generates a local key, tells the KMS to encrypt it, gets back the encrypted key, and returns the pair of decrypted and encrypted key in a `GeneratedEncryptionKey` struct.
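That default implementation might look roughly like the sketch below. This is a hedged sketch only: `KmsClient` is an invented stand-in for a cloud KMS SDK (none of its method names come from a real API), and the structs are trimmed to just the fields the sketch touches.

```java
import java.security.SecureRandom;

// Rough sketch of a KMS-backed KeyManager as described above.
public class KmsBackedKeyManager {
    // Invented stand-in for the remote service: only wrap/unwrap calls cross this line.
    interface KmsClient {
        byte[] encrypt(String keyName, byte[] plaintext);   // wrap with master key
        byte[] decrypt(String keyName, byte[] ciphertext);  // unwrap with master key
    }

    static class EncryptionMetadata {
        String keyName;
        byte[] iv;
        byte[] encryptedKeyBytes;
    }

    static class GeneratedEncryptionKey {
        EncryptionMetadata keyMetadata;
        byte[] keyBytes;   // decrypted local key, used to encrypt the output stream
    }

    private final KmsClient kms;
    private final String keyName;
    private final SecureRandom rng = new SecureRandom();

    KmsBackedKeyManager(KmsClient kms, String keyName) {
        this.kms = kms;
        this.keyName = keyName;
    }

    // Generate a local key, have the KMS encrypt it, and return both forms.
    GeneratedEncryptionKey generateEncryptionKey(String filePath) {
        GeneratedEncryptionKey gen = new GeneratedEncryptionKey();
        gen.keyBytes = new byte[16];
        rng.nextBytes(gen.keyBytes);
        gen.keyMetadata = new EncryptionMetadata();
        gen.keyMetadata.keyName = keyName;
        gen.keyMetadata.iv = new byte[16];
        rng.nextBytes(gen.keyMetadata.iv);
        gen.keyMetadata.encryptedKeyBytes = kms.encrypt(keyName, gen.keyBytes);
        return gen;
    }

    // Ask the KMS to decrypt the local key recorded in the manifest.
    byte[] resolveEncryptionKey(EncryptionMetadata metadata) {
        return kms.decrypt(metadata.keyName, metadata.encryptedKeyBytes);
    }
}
```

A fake `KmsClient` makes this easy to exercise in tests; a real one would call the cloud provider's SDK.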
Regarding hadoop-crypto, I was considering it less for the key storage layer and more for its in-memory representation of keys and its generation of ciphers from those key objects. The interface shouldn't expose hadoop-crypto objects, naturally.
Finally, to cap it all off, I'd imagine `KeyManager` instances can be provided by implementations of `TableOperations` and must thus be serializable to be sent to e.g. Spark executors.
Thoughts on the above proposal?
I'm actually going to go ahead and put this into a prototype - we can add further discussion there.
Ok, I'd separate out the key information from the encryption data:

```java
struct KeyDescription {
  String keyName;
  int keyVersion;
}

struct FileKey {
  KeyDescription key;
  byte[] decryptedKey;
  byte[] encryptedKey;
}

struct FileEncryptionMetadata {
  KeyDescription key;
  byte[] encryptedKey;
  byte[] iv;
  String algorithm;
}

interface KeyManager {
  // Create a key for a single file.
  FileKey createFileKey(String keyName);

  // Decrypt a single file key.
  byte[] decryptKey(KeyDescription key, byte[] encryptedKey);

  // Decrypt a list of keys.
  List<byte[]> decryptKeys(KeyDescription key, List<byte[]> encryptedKeys);
}
```
The KMS doesn't need the path, iv, or the algorithm. But I think this is starting to look good.
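To make the calling pattern concrete, here is an in-memory sketch of this revised interface. It is an illustration only: the XOR "wrap" is a toy stand-in for the KMS call, not real cryptography, and the class and field names simply mirror the structs above.

```java
import java.security.SecureRandom;
import java.util.Arrays;

// In-memory sketch showing how the write and read paths call the revised API.
public class SketchKeyManager {
    static class KeyDescription { String keyName; int keyVersion; }
    static class FileKey { KeyDescription key; byte[] decryptedKey; byte[] encryptedKey; }

    private final SecureRandom rng = new SecureRandom();
    private final byte[] masterKey = new byte[16];   // pretend this lives in the KMS
    { rng.nextBytes(masterKey); }

    // Toy wrap/unwrap standing in for the KMS round trip -- NOT real crypto.
    private byte[] xorMaster(byte[] in) {
        byte[] out = in.clone();
        for (int i = 0; i < out.length; i++) out[i] ^= masterKey[i % masterKey.length];
        return out;
    }

    // Create a key for a single file (write path).
    FileKey createFileKey(String keyName) {
        FileKey fk = new FileKey();
        fk.key = new KeyDescription();
        fk.key.keyName = keyName;
        fk.key.keyVersion = 1;
        fk.decryptedKey = new byte[16];
        rng.nextBytes(fk.decryptedKey);
        fk.encryptedKey = xorMaster(fk.decryptedKey);
        return fk;
    }

    // Decrypt a single file key (read path).
    byte[] decryptKey(KeyDescription key, byte[] encryptedKey) {
        return xorMaster(encryptedKey);
    }

    public static void main(String[] args) {
        SketchKeyManager km = new SketchKeyManager();
        FileKey fk = km.createFileKey("pii");                 // writer keeps decryptedKey,
                                                              // manifest stores encryptedKey
        byte[] back = km.decryptKey(fk.key, fk.encryptedKey); // reader recovers the file key
        assert Arrays.equals(back, fk.decryptedKey);
    }
}
```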
And of course, you could add `List<FileKey> createFileKeys(String keyName, int keys);`
For creation, it would be helpful to pass in the file path; in fact, I'm wondering if the name should always be the URI of the file. Thoughts? We rely on path-based lookup for our key storage system.
You typically don't want a new master key per file, just a local file key, so the KMS shouldn't know or care about file or table paths. For example, you could have an entire set of tables protected with the "pii" key:
- The "user" table's properties say to encrypt with the "pii" key.
- Each file has a unique file key that the KeyManager encrypts with the "pii" master key.
So to write, the job needs to generate random file keys and have them encrypted. The KeyManager just needs the master key name and gives back the key version, the encrypted file key, and the unencrypted file key. To read, we need to decrypt the file key, so the job passes the master key name, the master key version, and the encrypted key to the KeyManager.
> So to write, the job needs to generate random file keys and have them encrypted. The KeyManager just needs the master key name and gives back the key version, the encrypted file key, and the unencrypted file key. To read, we need to decrypt the file key, so the job passes the master key name, the master key version, and the encrypted key to the KeyManager.
This is different from the way we model key storage. We roughly follow this model: https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyStorageStrategy.java#L27, where for a given FileSystem the encryption key name is derived from the file path: https://github.com/palantir/hadoop-crypto/blob/6d9e05a1e667f150f7d98435e93a0dd6f3ea5c08/hadoop-crypto/src/main/java/com/palantir/crypto2/hadoop/EncryptedFileSystem.java#L115. Because we currently use hadoop-crypto for our existing encryption, we need to match that model. I was wondering how we could do this with the Iceberg API while still enabling the KMS model you propose.
Let me also explain in more detail. In our system we don't have a mapping from file paths to encryption key names, and we can't store one without changing the way our storage solution reasons about encryption. We are using Iceberg as an ephemeral representation of the data that will be read and written via Spark. On the write side specifically, our `KeyManager` plugin will need the file paths to store the encryption keys, because the path is the only metadata we will have available on the read side. In other words, because some internal key storage and metastore solutions don't have a notion of a key name per file, we should pass the file path into the `KeyManager` APIs so that they can use it as an additional option for their situations.
One other way we could solve for both use cases would be as follows:

```java
interface KeyManager {
  default String generateNewKeyName(String filePath) {
    return "iceberg-key-" + UUID.randomUUID();
  }

  // Create a key for a single file.
  FileKey createFileKey(String keyName);

  // Decrypt a single file key.
  byte[] decryptKey(KeyDescription key, byte[] encryptedKey);

  // Decrypt a list of keys.
  List<byte[]> decryptKeys(KeyDescription key, List<byte[]> encryptedKeys);
}
```
This allows implementations that want their own key-name derivation to provide it. For example, the name could be derived from the path, while by default, for the traditional KMS use case, it is completely random. The key name generator could also be its own module / interface.
Ok, thanks for explaining. I can see at least some of the disconnect now, which helps.
So from the hadoop-crypto point of view, the keys are the "file keys" in the way I was thinking about them. They write the file key out as a side file, encrypted with a public/private key pair. That is relatively expensive and doubles the number of S3 objects. So their model would fit into my proposal, except that they have a single global master key (their public/private key pair).
I get that your implementation hashes the path to generate the file key, but I don't see how you secure it. Obviously the hash of the path isn't a secret. :)
We would generate the name of the key using the path, but the bytes of the key stored under that name are generated securely. (Edit: meaning the name of the key has an association with the path, but the bytes of the key are generated independently of both the path and the key name.) Additionally, the built-in encryption solution that stores the key alongside the file is only the default implementation; one can implement `KeyStorageStrategy` with another backend.
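A minimal sketch of that separation, with a hypothetical naming scheme: the key name is a deterministic function of the file path (so a path-keyed store can find it again), while the key bytes come from a CSPRNG and carry no information about the path or the name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;

// Illustration only: name derived from path, bytes independent of it.
public class PathKeyedNaming {
    // Deterministic name: any reader can recompute it from the path alone,
    // which is fine because the name is not a secret.
    public static String keyNameFor(String filePath) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(filePath.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return "key-" + hex.substring(0, 16);
    }

    // Secret material: freshly generated per key, not derived from the path.
    public static byte[] newKeyBytes() {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        return key;
    }
}
```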
I'd also argue that Iceberg can't assume paths are immutable, so using the hash of the path to reconstruct the secret key won't work.
Sure, and the default implementation doesn't have to. I think the default implementation shouldn't assume the file path has any association with the key at all. The idea is that it doesn't hurt for the API to include the file URI when generating keys, because the key manager then knows what it is encrypting, and that can influence how it makes the key name and the storage metadata for that key.
We should also loop in @ggershinsky, who is working on the Parquet encryption spec.
Parquet encryption has additional metadata parameters (such as per-file `aadPrefix` values, column keys, etc.), but since this is basically a column encryption format, it might be too early to factor it in. In any case, feel free to ping me whenever needed; I'd be glad to assist with the Parquet part.
@rdblue @omalley I moved the architecture to https://docs.google.com/document/d/1LptmFB7az2rLnou27QK_KKHgjcA5vKza0dWj4h8fkno/edit?usp=sharing. Please take a look and comment, and then we can continue with implementation.
@mccheah, can you also start a thread on the dev list to point out this spec? I think other people will probably be interested that aren't necessarily following the github issues.
This is done. Thanks everyone!