Comments (24)
You certainly want to use a KeyManager/Provider. I'd suggest that you take a look at the API that I made for ORC.
You are going to want to support the cloud Key Management Services (KMS), and thus you need to be compatible with their APIs.
Given that you have a KeyManager, you want to record the key name to encrypt with. For the read side to work, you pretty much need to put the encrypted local key and IV into the file's metadata. You absolutely can't have a fixed IV for the table, and you shouldn't have a fixed local key for the table either.
For column encryption, it becomes more interesting. There you have:
- a key name
- the columns to encrypt with which key
- how to mask the unencrypted version of the data (defaulting to nullify)
- a local key per file / encrypted column, which will be stored encrypted in the file's metadata
from iceberg.
I should also mention that ORC's KeyProvider API is a subset of Hadoop's and the default implementation will use the Hadoop implementation. On the cloud providers, we'll need to create an implementation for each service.
I was considering using Palantir's hadoop-crypto library to do the actual encryption portion of things. What do you think about this package?
Column encryption is interesting; we haven't explored it on our side yet, so for now we can only encrypt at the whole-file level. That is to say, our internal storage solution can't store multiple keys to decrypt different portions of the same file, and you'll notice the same limitation in the hadoop-crypto library. So whatever solution we come up with should handle either full-file encryption or per-column encryption. I suppose a given file would be encrypted strictly one way or the other; if we encrypt the whole file, you more or less lose all the benefits of per-column encryption.
Additionally, a key part of performance is reducing the number of round trips to the key storage backend, particularly if the backend supports batch operations. So ideally the KeyManager could get and put multiple keys at once, and the Spark data source and other Iceberg clients would contact the backend as few times as possible.
Tagging @vinooganesh @yifeih to follow.
I understand that column encryption takes file format support, which isn't available yet, although it will be available for ORC soon.
I haven't looked at the details of Palantir's hadoop-crypto library, but the approach looks good. For per-file encryption, I would:
- Define the key name in the Iceberg table metadata.
- When writing, call the KMS to generate a random local key and the corresponding encrypted bytes. Create a random IV for each file. When you update the manifest for the file, add the key name, key version, encryption algorithm, IV, and encrypted local key.
- When reading, use the metadata from the manifest to have the KMS decrypt the key for that file. Use the decrypted key and IV to decrypt the file as needed.
The relevant features:
- The master key stays in the KMS and is never given to the user.
- There is only one trip per file to the KMS during reading or writing.
- The encryption never reuses a local key/iv pair. Reuse of those pairs is very very bad.
- If the user keeps a local key, it can only be used to decrypt that file.
- Rolling new versions of master keys is supported.
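The write/read flow described above can be sketched end to end. This is a hedged illustration only, not Iceberg code: the in-process `masterKey` stands in for the KMS (in a real deployment the master key never leaves the service, and only encrypt/decrypt calls cross the wire), and AES/CTR stands in for whichever algorithm the manifest records.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch of the per-file envelope-encryption flow described above.
public class EnvelopeSketch {
    static final SecureRandom RNG = new SecureRandom();

    static byte[] aesCtr(int mode, byte[] key, byte[] iv, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(data);
    }

    public static byte[] roundTrip(byte[] fileContents) throws Exception {
        byte[] masterKey = new byte[16];          // would live inside the KMS
        RNG.nextBytes(masterKey);

        // Write side: fresh local key and IV per file, never reused.
        byte[] localKey = new byte[16], iv = new byte[16], keyIv = new byte[16];
        RNG.nextBytes(localKey); RNG.nextBytes(iv); RNG.nextBytes(keyIv);
        byte[] encryptedFile = aesCtr(Cipher.ENCRYPT_MODE, localKey, iv, fileContents);
        // "KMS encrypt" of the local key; the manifest would store
        // keyName, keyVersion, algorithm, iv, and encryptedLocalKey.
        byte[] encryptedLocalKey = aesCtr(Cipher.ENCRYPT_MODE, masterKey, keyIv, localKey);

        // Read side: ask the KMS to unwrap the local key, then decrypt the file.
        byte[] decryptedLocalKey = aesCtr(Cipher.DECRYPT_MODE, masterKey, keyIv, encryptedLocalKey);
        return aesCtr(Cipher.DECRYPT_MODE, decryptedLocalKey, iv, encryptedFile);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "row group bytes".getBytes("UTF-8");
        assert Arrays.equals(roundTrip(data), data);
    }
}
```

Note that a leaked `localKey` only exposes the one file it was generated for, matching the feature list above.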
Looking a little deeper at hadoop-crypto, they are doing the key management themselves. I think most users would be better served by using a KMS, such as Amazon's. Using a KMS means that you don't send big secrets to the job, which radically lowers the potential for screw-ups.
I think we can build a top-level interface that, when implemented with a KMS, accomplishes those goals, but that could also be implemented by other means to support other key storage strategies. More precisely, I really don't want to tie us to KMS as the key storage backend for all Iceberg users. Here's a rough sketch of a design that I think has that flexibility:
- Define an `EncryptionMetadata` struct that is an optional column in the `DataFile` schema. The struct has the following schema:

```java
struct EncryptionMetadata {
  String keyName();
  long keyVersion();
  String encryptedKeyAlgorithm();
  byte[] iv();
  byte[] encryptedKeyBytes();
}
```
- Define the structure `EncryptionKey` as follows; it's simply the real key (more or less hadoop-crypto's `KeyMaterial`):

```java
struct EncryptionKey {
  byte[] keyBytes();
  byte[] iv();
}
```
- Define a `KeyManager` that handles both resolution and generation of these keys:

```java
struct GeneratedEncryptionKey {
  EncryptionMetadata keyMetadata();
  EncryptionKey key();
}

interface KeyManager {
  EncryptionKey resolveEncryptionKey(EncryptionMetadata metadata);

  // Optional; can default to calling the individual resolve(...) multiple times.
  List<EncryptionKey> resolveEncryptionKeyBatch(List<EncryptionMetadata> metadatas);

  GeneratedEncryptionKey generateEncryptionKey(String filePath);

  // Optional; can default to calling the individual generate(...) multiple times.
  List<GeneratedEncryptionKey> generateEncryptionKeyBatch(List<String> filePaths);
}
```
- We remark that `GeneratedEncryptionKey` as a pair of decrypted and encrypted key is required because we need both: the decrypted key to actually encrypt the output stream wherever we do the encrypted write itself, and the encrypted key to store in the manifest file after the fact.
Now let's suppose we implemented all of the above with KMS as the storage backend. I'd suppose we could provide this as the default implementation that ships with Iceberg:

- `KeyManager#resolveEncryptionKey` goes to the KMS and asks it to decrypt the key, given the metadata's key name and the encrypted bytes. The KMS uses the master key and passes back the decrypted key; the `KeyManager` converts that into an `EncryptionKey` object and returns it.
- `KeyManager#generateEncryptionKey` generates a local key, tells the KMS to encrypt it, gets back the encrypted key, and returns the pair of decrypted and encrypted key in a `GeneratedEncryptionKey` struct.
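That default implementation might look roughly like the sketch below. This is a hedged sketch only: `KmsClient` is an invented stand-in for a cloud KMS SDK (none of its method names come from a real API), and the structs are trimmed to just the fields the sketch touches.

```java
import java.security.SecureRandom;

// Rough sketch of a KMS-backed KeyManager as described above.
public class KmsBackedKeyManager {
    // Invented stand-in for the remote service: only wrap/unwrap calls cross this line.
    interface KmsClient {
        byte[] encrypt(String keyName, byte[] plaintext);   // wrap with master key
        byte[] decrypt(String keyName, byte[] ciphertext);  // unwrap with master key
    }

    static class EncryptionMetadata {
        String keyName;
        byte[] iv;
        byte[] encryptedKeyBytes;
    }

    static class GeneratedEncryptionKey {
        EncryptionMetadata keyMetadata;
        byte[] keyBytes;   // decrypted local key, used to encrypt the output stream
    }

    private final KmsClient kms;
    private final String keyName;
    private final SecureRandom rng = new SecureRandom();

    KmsBackedKeyManager(KmsClient kms, String keyName) {
        this.kms = kms;
        this.keyName = keyName;
    }

    // Generate a local key, have the KMS encrypt it, and return both forms.
    GeneratedEncryptionKey generateEncryptionKey(String filePath) {
        GeneratedEncryptionKey gen = new GeneratedEncryptionKey();
        gen.keyBytes = new byte[16];
        rng.nextBytes(gen.keyBytes);
        gen.keyMetadata = new EncryptionMetadata();
        gen.keyMetadata.keyName = keyName;
        gen.keyMetadata.iv = new byte[16];
        rng.nextBytes(gen.keyMetadata.iv);
        gen.keyMetadata.encryptedKeyBytes = kms.encrypt(keyName, gen.keyBytes);
        return gen;
    }

    // Ask the KMS to decrypt the local key recorded in the manifest.
    byte[] resolveEncryptionKey(EncryptionMetadata metadata) {
        return kms.decrypt(metadata.keyName, metadata.encryptedKeyBytes);
    }
}
```

A fake `KmsClient` makes this easy to exercise in tests; a real one would call the cloud provider's SDK.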
Regarding hadoop-crypto, I was considering it less for the key storage layer and more for its in-memory representation of keys and its generation of ciphers from those key objects. The interface shouldn't expose hadoop-crypto objects, naturally.
Finally, to cap it all off, I'd imagine `KeyManager` instances can be provided by implementations of `TableOperations` and must thus be serializable to be sent to e.g. Spark executors.
Thoughts on the above proposal?
I'm actually going to go ahead and put this into a prototype - we can add further discussion there.
Ok, I'd separate out the key information from the encryption data:

```java
struct KeyDescription {
  String keyName;
  int keyVersion;
}

struct FileKey {
  KeyDescription key;
  byte[] decryptedKey;
  byte[] encryptedKey;
}

struct FileEncryptionMetadata {
  KeyDescription key;
  byte[] encryptedKey;
  byte[] iv;
  String algorithm;
}

interface KeyManager {
  // Create a key for a single file.
  FileKey createFileKey(String keyName);

  // Decrypt a single file key.
  byte[] decryptKey(KeyDescription key, byte[] encryptedKey);

  // Decrypt a list of keys.
  List<byte[]> decryptKeys(KeyDescription key, List<byte[]> encryptedKeys);
}
```
The KMS doesn't need the path, iv, or the algorithm. But I think this is starting to look good.
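To make the calling pattern concrete, here is an in-memory sketch of this revised interface. It is an illustration only: the XOR "wrap" is a toy stand-in for the KMS call, not real cryptography, and the class and field names simply mirror the structs above.

```java
import java.security.SecureRandom;
import java.util.Arrays;

// In-memory sketch showing how the write and read paths call the revised API.
public class SketchKeyManager {
    static class KeyDescription { String keyName; int keyVersion; }
    static class FileKey { KeyDescription key; byte[] decryptedKey; byte[] encryptedKey; }

    private final SecureRandom rng = new SecureRandom();
    private final byte[] masterKey = new byte[16];   // pretend this lives in the KMS
    { rng.nextBytes(masterKey); }

    // Toy wrap/unwrap standing in for the KMS round trip -- NOT real crypto.
    private byte[] xorMaster(byte[] in) {
        byte[] out = in.clone();
        for (int i = 0; i < out.length; i++) out[i] ^= masterKey[i % masterKey.length];
        return out;
    }

    // Create a key for a single file (write path).
    FileKey createFileKey(String keyName) {
        FileKey fk = new FileKey();
        fk.key = new KeyDescription();
        fk.key.keyName = keyName;
        fk.key.keyVersion = 1;
        fk.decryptedKey = new byte[16];
        rng.nextBytes(fk.decryptedKey);
        fk.encryptedKey = xorMaster(fk.decryptedKey);
        return fk;
    }

    // Decrypt a single file key (read path).
    byte[] decryptKey(KeyDescription key, byte[] encryptedKey) {
        return xorMaster(encryptedKey);
    }

    public static void main(String[] args) {
        SketchKeyManager km = new SketchKeyManager();
        FileKey fk = km.createFileKey("pii");                 // writer keeps decryptedKey,
                                                              // manifest stores encryptedKey
        byte[] back = km.decryptKey(fk.key, fk.encryptedKey); // reader recovers the file key
        assert Arrays.equals(back, fk.decryptedKey);
    }
}
```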
And of course, you could add `List<FileKey> createFileKeys(String keyName, int keys);`
For creation, it would be helpful to pass in the file path; in fact, I'm wondering if the name should always be the URI of the file. Thoughts? We rely on path-based lookup for our key storage system.
You typically don't want a new master key per file, just a local file key, so the KMS shouldn't know or care about file or table paths. For example, you could have an entire set of tables protected with the "pii" key:
- The "user" table's properties say to encrypt with the "pii" key.
- Each file has a unique file key that the KeyManager encrypts with the "pii" master key.
So to write, the job needs to generate random file keys and have them encrypted. The KeyManager just needs the master key name and gives back the key version, the encrypted file key, and the unencrypted file key. To read, we need to decrypt the file key, so the job passes the master key name, the master key version, and the encrypted key to the KeyManager.
> So to write, the job needs to generate random file keys and have them encrypted. The KeyManager just needs the master key name and gives back the key version, the encrypted file key, and the unencrypted file key. To read, we need to decrypt the file key, so the job passes the master key name, the master key version, and the encrypted key to the KeyManager.
This is different from the way we model key storage. We roughly follow this model: https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyStorageStrategy.java#L27, where for a given FileSystem the encryption key name is derived from the file path: https://github.com/palantir/hadoop-crypto/blob/6d9e05a1e667f150f7d98435e93a0dd6f3ea5c08/hadoop-crypto/src/main/java/com/palantir/crypto2/hadoop/EncryptedFileSystem.java#L115. Because we currently use hadoop-crypto for our existing encryption, we need to match that model. I was wondering how we could do this with the Iceberg API while still enabling the KMS model you propose.
Let me also explain in more detail. In our system we don't have a mapping from file paths to encryption key names, and we can't store one without changing the way our storage solution reasons about encryption. We are using Iceberg as an ephemeral representation of the data that will be read and written via Spark. On the write side specifically, our `KeyManager` plugin will need the file paths to store the encryption keys, because the path is the only metadata we will have available on the read side. In other words, because some internal key storage and metastore solutions don't have a notion of a key name per file, we should pass the file path into the `KeyManager` APIs so that they can use it as an additional option for their situations.
One other way we could solve for both use cases would be as follows:

```java
interface KeyManager {
  default String generateNewKeyName(String filePath) {
    return "iceberg-key-" + UUID.randomUUID();
  }

  // Create a key for a single file.
  FileKey createFileKey(String keyName);

  // Decrypt a single file key.
  byte[] decryptKey(KeyDescription key, byte[] encryptedKey);

  // Decrypt a list of keys.
  List<byte[]> decryptKeys(KeyDescription key, List<byte[]> encryptedKeys);
}
```
This allows implementations that want their own key-name derivation to provide it. For example, the name could be derived from the path, while by default, for the traditional KMS use case, it is completely random. The key name generator could also be its own module / interface.
Ok, thanks for explaining. I can see at least some of the disconnect now, which helps.
So from the hadoop-crypto point of view, the keys are the "file keys" in the way I was thinking about them. They write the file key out as a side file, encrypted with a public/private key pair. That is relatively expensive and doubles the number of S3 objects. So their model would fit into my proposal, except that they have a single global master key (their public/private key pair).
I get that your implementation hashes the path to generate the file key, but I don't see how you secure it. Obviously the hash of the path isn't a secret. :)
We would generate the name of the key using the path, but the bytes of the key stored under that name are generated securely. (Edit: meaning the name of the key has an association with the path, but the bytes of the key are generated independently of both the path and the key name.) Additionally, the built-in encryption solution that stores the key alongside the file is only the default implementation; one can implement `KeyStorageStrategy` with another backend.
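A minimal sketch of that separation, with a hypothetical naming scheme: the key name is a deterministic function of the file path (so a path-keyed store can find it again), while the key bytes come from a CSPRNG and carry no information about the path or the name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;

// Illustration only: name derived from path, bytes independent of it.
public class PathKeyedNaming {
    // Deterministic name: any reader can recompute it from the path alone,
    // which is fine because the name is not a secret.
    public static String keyNameFor(String filePath) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(filePath.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return "key-" + hex.substring(0, 16);
    }

    // Secret material: freshly generated per key, not derived from the path.
    public static byte[] newKeyBytes() {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        return key;
    }
}
```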
I'd also argue that Iceberg can't assume paths are immutable, so using the hash of the path to reconstruct the secret key won't work.
Sure, and the default implementation doesn't have to. I think the default implementation shouldn't assume the file path has any association with the key at all. The idea is that it doesn't hurt for the API to include the file URI when generating keys, because the key manager then knows what it is encrypting, and that can influence how it makes the key name and the storage metadata for that key.
We should also loop in @ggershinsky, who is working on the Parquet encryption spec.
Parquet encryption has additional metadata parameters (such as per-file `aadPrefix` values, column keys, etc.), but since this is basically a column encryption format, it might be too early to factor it in. In any case, feel free to ping me whenever needed; I'd be glad to assist with the Parquet part.
@rdblue @omalley I moved the architecture to https://docs.google.com/document/d/1LptmFB7az2rLnou27QK_KKHgjcA5vKza0dWj4h8fkno/edit?usp=sharing. Please take a look and comment, and then we can continue with implementation.
@mccheah, can you also start a thread on the dev list to point out this spec? I think other people will probably be interested that aren't necessarily following the github issues.
This is done. Thanks everyone!