mithril-security / blindai
Confidential AI deployment with secure enclaves :lock:
Home Page: https://www.mithrilsecurity.io/
License: Apache License 2.0
The client.run_model example passes the model id and the input tensor batch as positional arguments, but this is incorrect: the first two positional parameters of that function are the model id and the model hash. The input tensor object therefore needs to be passed explicitly to the input_tensors argument.
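For example, the corrected call would look something like the sketch below (client, model_id, and run_inputs are placeholder names, not taken from the example itself):
# Passing the tensors positionally would bind them to the model-hash parameter,
# so they have to go through the input_tensors keyword explicitly.
response = client.run_model(model_id, input_tensors=run_inputs)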
I get this error when loading the pretrained whisper model you provide in the Colab example notebook:
import onnx
onnx_model = onnx.load("./whisper_tiny_en_20_tokens.onnx")
Stacktrace:
---------------------------------------------------------------------------
DecodeError Traceback (most recent call last)
<ipython-input-8-d62064a4cf0f> in <module>
1 import onnx
2
----> 3 onnx_model = onnx.load("./whisper_tiny_en_20_tokens.onnx")
/usr/local/lib/python3.8/dist-packages/onnx/__init__.py in _deserialize(s, proto)
106 )
107
--> 108 decoded = cast(Optional[int], proto.ParseFromString(s))
109 if decoded is not None and decoded != len(s):
110 raise google.protobuf.message.DecodeError(
DecodeError: Error parsing message with type 'onnx.ModelProto'
I am going through the whisper BlindAI example in examples/whisper/BlindAI_Whisper.ipynb, and one of the classes used, NNDecodingTask, is not defined in the notebook. Can the definition of this class be added so that it will be possible to export different Whisper models (small, large, multi-language, etc.)?
Great tutorial that can't be followed 100%...
The client API reference should be automatically generated from the docstrings in the package files.
It's not practical to rewrite it every time we make changes, and it can also look better this way.
The existing tools may generate the documentation as HTML pages, i.e. a website. Should we restrict ourselves to the ones that produce markdown files instead, so that they can always be included directly in the GitBook, or can we have a website that is linked from there?
Here is the advice we currently give (translated from French):
Our solution requires Intel SGX technology. It is available on most current Intel processors
(https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&2_SoftwareGuardExtensions=Yes%20with%20Intel%C2%AE%20ME), as well as on the latest-generation ones
(https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&2_SoftwareGuardExtensions=Yes%20with%20Intel%C2%AE%20SPS).
For our proof of concept, any processor supporting the SGX feature can be used to deploy our solution.
Nevertheless, we suggest using processors that support the DCAP attestation system, which is better suited to the needs of the video analysis scenario, and is also the predominant mode on the latest generation of Ice Lake processors, which seem to be the preferred range for a deployment in practice.
Processors that only support the EPID attestation system can also be used with our solution, but this would be less representative of the workflow used in production.
For a quick start to run the first tests, an Intel NUC can be relevant given its low cost and ease of use. The Intel® NUC Kit NUC7PJYH ($199) with the Intel® Pentium® Silver J5005 Processor can be a good starting point.
For a more representative test with the latest Ice Lake processors, we recommend the Intel® Xeon® Gold 5318S ($1667) with 512 GB of memory available for enclaves.
We need to make a proper English documentation page for it.
Merge hardware and software in notebook examples
Something like
client = BlindAiClient()
# Comment out this line for hardware mode
client.connect_server(addr="localhost", simulation=True)
# Comment out this line for simulation mode
client.connect_server(
    addr="localhost",
    policy="policy.toml",
    certificate="host_server.pem"
)
Make the notebooks clearer, and less redundant
The command docker run --network host mithrilsecuritysas/blindai-client-demo does not work on WSL (tested on matthias' computer).
It runs fine, but the notebook cannot be accessed from the browser.
This is probably due to --network host sharing the network with WSL rather than with Windows as a whole, but I'm not sure; I don't know how WSL networking works.
I don't know how to fix it, and I don't have a Windows machine on hand.
Also, we were talking with Daniel about packaging the server in the blindai-client-demo docker image directly. This could be awesome for getting started with the project, and that would allow us to sidestep this issue entirely.
Move telemetry to client-side, so that it is more useful.
Do not forget to update the documentation, the readme, and everywhere else it is mentioned.
BlindAI will provide managed AI APIs.
For transparency, it would be good to expose in the client Python SDK information about each model we use behind the scenes, for instance a link to the build process that was used to serve a specific model.
For instance, we could have something like:
import blindai
card = blindai.api.get_model_card("whisper", tee="sgx")
card.model_hash
>> "77af778b51abd4a3c51c5ddd97204a9c3ae614ebccb75a606c3b6865aed6744e"
card.build_process_link
>> "github.com/..."
Not top prio but could be cool for transparency.
Current usage may leak an open socket if the user forgets to call close_connection:
client = BlindAiClient()
client.connect_server(addr="localhost", simulation=True)
# do something with client...
client.close_connection()
This is not an issue right now since our users are mostly testing the app, making jupyter notebooks and not actual production usage. As the project matures, this may become an issue.
There is a way in Python to make APIs that work like this:
with BlindAiClient.connect_server(addr="localhost", simulation=True) as client:
    # do something with client...
    client.run_model("aaa")
# the connection is implicitly closed when exiting the scope
This uses the special __enter__ and __exit__ functions, iirc.
What do you think? Is this a better API surface?
This should be backward compatible with the current API.
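For illustration, here is a minimal sketch of what the context-manager variant could look like; the class and method names mirror the current client, but the bodies are hypothetical stand-ins, not the real implementation:
class BlindAiClient:
    def __init__(self):
        self._connected = False

    @classmethod
    def connect_server(cls, addr, simulation=False):
        client = cls()
        client._connected = True  # stand-in for opening the real connection
        return client

    def run_model(self, *args, **kwargs):
        pass  # stand-in for the real inference call

    def close_connection(self):
        self._connected = False  # stand-in for tearing down the real connection

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs whenever the with-block exits, even on exceptions
        self.close_connection()
        return False  # do not swallow exceptions

# The connection is closed automatically when the block exits:
with BlindAiClient.connect_server(addr="localhost", simulation=True) as client:
    client.run_model("aaa")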
unit tests
The upload_model request will return a UUID for the model.
This UUID will be used by run_model and will be added to the signed response proofs.
We may want to be backward compatible with the way our docs and blogposts are written, so that if the model name is not provided, you just use the last one that was uploaded.
Right now, once you upload a model, you won't be able to upload another one without discarding the first.
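A hypothetical usage sketch of the proposed flow (the return value of upload_model, model_id, and run_inputs are placeholders, not the final API):
# upload_model would return the UUID assigned to the model
uploaded = client.upload_model("./distilbert.onnx")
model_id = uploaded.model_id

# run_model would target a specific model by its UUID
response = client.run_model(model_id, input_tensors=run_inputs)

# backward-compatible form: with no UUID given, run the last uploaded model
response = client.run_model(input_tensors=run_inputs)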
State the different existing features that will be broken/affected by this new feature.
Server and client tests + unittests
Python client should accept numpy / torch tensors directly.
This would be a much better API.
add unit tests
We should do it in a way that does not require users to install torch or numpy if they are not using this feature.
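One possible approach, sketched below, is to duck-type the conversion so that neither torch nor numpy is imported by the client itself (flatten_tensor is a hypothetical helper, not an existing function):
def flatten_tensor(tensor):
    # torch tensors expose detach() and numpy(); we can call them without importing torch
    if hasattr(tensor, "detach") and hasattr(tensor, "numpy"):
        tensor = tensor.detach().numpy()
    # numpy arrays expose flatten() and tolist(); again, no import is needed
    if hasattr(tensor, "flatten") and hasattr(tensor, "tolist"):
        return tensor.flatten().tolist()
    # plain Python sequences pass through unchanged
    return list(tensor)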
When building in release mode for hardware mode, we should generate a policy.toml file that does not allow SGX debug mode.
This probably requires changes to the rust code in order to launch the enclave in non-debug mode.
The hardware docker image we publish on dockerhub has no reason to have SGX debug mode on.
We should add a build option / environment variable to generate allow-debug policy files, for dev purposes.
Either
This is a good opportunity to add the following tests:
allowDebug = false in policy.toml
In-depth info about execution plans
This is more of a meta-issue (/roadmap) gathering everything about execution times.
Plans I have in mind:
I am not sure whether all of this is overkill or not since we're just using tract and not really touching the perf sensitive parts. We'll see.
On multiple occasions, the parameter mentioned for the blindai.client.connect call is hazmat_http_on_untrusted_port, but the parameter should be hazmat_http_on_unattested_port, as per the docs at https://blindai.mithrilsecurity.io/en/latest/blindai/core.html
When first connecting to the server, the client should request the server's version and verify that it is supported by the client SDK.
If the version is not supported, the client should reject the connection and ask for an update.
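A minimal sketch of what the client-side check could look like (get_server_version and SUPPORTED_SERVER_VERSIONS are hypothetical names; the version RPC itself would have to be added to the API):
SUPPORTED_SERVER_VERSIONS = {"0.4", "0.5"}  # hypothetical compatibility list

def check_server_version(client):
    # hypothetical RPC that would return the server's version string
    version = client.get_server_version()
    if version not in SUPPORTED_SERVER_VERSIONS:
        raise ConnectionError(
            f"Server version {version} is not supported by this client SDK; "
            "please update the client."
        )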
There are some Cargo.lock files that still use SSH to pull in dependencies.
One example is server/inference-server/network/sgx/rpc/Cargo.lock.
We should change them to use HTTPS.
For the Signed Responses feature (#13) to be useful, we need a way to export and validate execution proofs.
Here is the proposed API:
response = client.run_model(run_inputs, sign=True)
response.save_to_file("./execution_proof.json")
from blindai.client import load_execution_file
response = load_execution_file("./execution_proof.json")
response.validate(policy_file="./policy.toml") # throws if invalid or execution is not signed
print("The proof is valid!")
We should also have these functions:
response.export_binary() :: bytes
from blindai.client import load_execution_binary
response = load_execution_binary(a :: bytes)
Should save_to_file work when sign=False?
Add tests and unit-tests.
Hi,
I encountered some problems while deploying your framework in hardware mode with the CovidNet example that you provide.
Can you think of anything I forgot?
I have an error that I cannot explain during the step of connecting the client to the server.
The issue is triggered by the following code: https://github.com/mithril-security/blindai-preview/blob/main/runner/remote_attestation_sgx/src/quote_verification_collateral.rs#L246
// Retrieving verification collateral using QPL
let mut p_quote_collateral: *mut sgx_ql_qve_collateral_t = ptr::null_mut();
let qv_ret = unsafe {
    sgx_ql_get_quote_verification_collateral(
        fmspc.as_ptr(),
        fmspc.len() as u16,
        ca_from_quote.as_ptr(),
        &mut p_quote_collateral as *mut *mut sgx_ql_qve_collateral_t,
    )
};
ensure!(
    qv_ret == Quote3Error::Success,
    "sgx_ql_get_quote_verification_collateral failed!"
);
This code usually appears to work correctly, but it is broken. We discovered the issue when trying to debug a failure from sgx_ql_get_quote_verification_collateral. While debugging, we added the following before the ensure! statement to print the error code from the QPL.
println!("sgx_ql_get_quote_verification_collateral returned {:?}", qv_ret);
Quite surprisingly, this printed sgx_ql_get_quote_verification_collateral returned Quote3Error::Success, despite the fact that qv_ret != Quote3Error::Success when the ensure! was executed... To compound the mystery, the issue disappeared when compiling in debug mode: the debug builds simply printed a status different from Quote3Error::Success (yet it was still the wrong status).
This kind of strange behavior is often the result of undefined behavior, and that is also the case here. The UB is actually due to how we declared the FFI interface with the QPL (a C library) in our Rust code:
extern "C" {
pub fn sgx_ql_get_quote_verification_collateral(
fmspc: *const u8,
fmspc_size: u16,
pck_ra: *const c_char,
pp_quote_collateral: *mut *mut sgx_ql_qve_collateral_t,
) -> Quote3Error;
pub fn sgx_ql_free_quote_verification_collateral(
p_quote_collateral: *const sgx_ql_qve_collateral_t,
) -> Quote3Error;
}
The return type of sgx_ql_get_quote_verification_collateral is declared to be Quote3Error, which is a Rust enum. But a Rust enum is assumed to only ever hold one of its declared values (it cannot hold an arbitrary integer the way a C enum often does). In our case, the UB happened when the QPL returned a value that could not be represented by the Rust enum.
For more information about this mismatch between Rust and C-like enums: https://mdaverde.com/posts/rust-bindgen-enum/
What should we do to fix it?
The best course of action would be to replace our custom FFI interface declaration with one generated by rust-bindgen. This would avoid this kind of mistake (and would also ensure that the function signatures match). We should also look into whether there is already a crate on crates.io that does this.
Security impact: No (outside of enclave).
Priority: Low (only impacts the error path).
Implement signed server responses.
The server responds to the client with a signed response, which the client could store to attest to someone else that the response was indeed emitted from the trusted enclave.
This may be made optional so that clients that do not use it don't have to pay for it.
Currently, the ModelDatumType is defined on both the server and client sides.
Adding it to the API (securedexchange.proto) should eliminate this duplication and make extending the supported data types easier.
Hi,
How can I upload a model with multiple inputs? The distilbert example does not use multiple inputs, but it's quite normal with pre-trained models. What should I pass to dtype and shape in this case?
Thanks.
We are currently using cbor for some of the serializing: transforming the flattened input tensors to a byte array.
This is probably overkill, and having a dependency on cbor is troublesome for porting the client library to other languages, like javascript, as on npmjs cbor packages are either old or nodejs-only.
I see two ways of doing this:
Dependency on cbor2 in the client and server side.
Let's not think about backward compat :)
Add a way to load policy.toml and certificate from bytes instead of a file
Quick and easy feature, should improve our API surface.
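For example, the connection call could accept already-loaded bytes in place of file paths (a hypothetical overload of the existing parameters, not the current signature):
with open("policy.toml", "rb") as f:
    policy_bytes = f.read()
with open("host_server.pem", "rb") as f:
    certificate_bytes = f.read()

# hypothetical: pass the raw bytes instead of the file paths
client.connect_server(
    addr="localhost",
    policy=policy_bytes,
    certificate=certificate_bytes,
)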
Add unit tests.
This is due to BlindAI not supporting integer tensor output yet.
This was reported on discord, more info & the model are available there.
Model runs
Fails with error Failed to load model, the model or the input format are perhaps invalid
The notebook: https://cdn.discordapp.com/attachments/965734276593242202/965734464690978866/Confidential_STT.ipynb
Input: https://cdn.discordapp.com/attachments/965734276593242202/965734464892313640/hello_world.wav
last docker version & probably on master too
This is a mini roadmap for the CI.
List of things we might want in the CI (medium term plan):
Focus is on end to end tests and building client/server packages for now.
Side goals:
Potential future work:
There used to be some kind of tutorial about how to use them, but not anymore, so it might be difficult to set up the proper environment.
Because they are awesome otherwise
This issue concerns the documentation.
This issue is a collection of questions. The goal is to fill an FAQ page in the docs.
Here are some questions I thought about when working on the readme in #36
Maybe some questions regarding the direction of the project until we have a concrete roadmap:
Please add questions in this issue using comments (:wave: @JoFrost you told me you had some)
I'll assign this to myself unless someone else wants to work on the FAQ page :)
We should do a more cautious evaluation of the quote in the case where we get a STATUS_TCB_SW_HARDENING_NEEDED.
This will require updating https://github.com/mithril-security/sgx-dcap-quote-verify-python