Giter VIP home page Giter VIP logo

secure-data-science-reference-architecture's Introduction

Secure Data Science Reference Architecture

Overview

Amazon SageMaker is a powerful enabler and a key component of a data science environment, but it’s only part of what is required to build a complete and secure data science environment. For more robust security you will need other AWS services such as Amazon CloudWatch, Amazon S3, and AWS VPC. This project aims to be an example of how to pull together these services, to use them together to create secure, self-service, data science environments.

Table of Contents

  1. Getting Started

  2. Features

  3. Architecture and Process Overview

  4. Repository Breakdown

  5. Further Reading

  6. License

Getting Started

Step 1, as yourself

Clone the repository on your local environment and follow the instructions in the codebuild_build_readme.md file. After approximately 5 minutes the stack will have deployed a Shared Services VPC to be leveraged across all data science environments and a PyPI mirror for pre-approved Python package hosting within your network. The stack also will have deployed a product portfolio for creating data science enviornments.

Step 2, as a project administrator

To access the portfolio click the Outputs tab of the CloudFormation stack and use the AssumeProjectAdminRole link to become a project administrator, capable of creating data science project environments. Once you've assumed the role you can visit the AWS Service Catalog console to deploy a data science environment.

Click the context menu icon next to the Data Science Project Environment product and click Launch product. After you provide a name for the product launch, and provide a project name, click Launch and you will have created your first data science project environment. This will launch a CloudFormation stack to provision the data science environment. This will require about 10 minutes to execute.

Step 3, as a project team member

When the data science project environment has completed its deployment you will have 2 links available from the Service Catalog console to assume user roles in the data science environment. Click on the AssumeProjectUserRole and return to the AWS Service Catalog console to launch a Jupyter notebook.

Click the context menu icon next to the SageMaker Notebook product, select Launch product, and give your notebook product a name. Provide an email address, specify the name of the project this notebook belongs to, and a username. Click Launch and 5 minutes later you will be provided with a hyperlink to open your Jupyter notebook server.

Step 4, Explore

From the Jupyter notebook server, using the sample notebooks, you can develop features, train, host, and monitor a machine learning model in a secure manner. If you assume your original AWS role you can also, from the AWS console, explore the various features deployed by the CloudFormation stacks.

Features

This source code demonstrates a self-service model for enabling project teams to create data science environments that employ a number of recommended security practices. Some of the more notable features are listed below. The controls, mechanisms, and services deployed by this source code is intended to provide operations and security teams with the assurance that their best practice is being employed while also enabling project teams to self service, move quickly, and stay focused on the data science task at hand.

Private Network per Data Science Environment

For every data science environment created a Virtual Private Cloud (VPC) is deployed to host Amazon SageMaker and other components of the data science environment. The VPC provides a familiar set of network-level controls to allow you to govern ingress and egress of data. These templates create a VPC with no Internet Gateway (IGW), therefore all subnets are private, without Internet connectivity. Network connectivity with AWS services or your own shared services is provided using VPC endpoints and PrivateLink. Security Groups are used to control traffic between different resources, allowing you to group like resources together and manage their ingress and egress traffic.

Authentication and Authorization

AWS Identity and Access Management (IAM) is used to create least-privilege, preventive controls for many aspects of the data science enviroments. These preventive controls, in the form of IAM policies, are used to control access to a project's data in Amazon S3, control who can access SageMaker resources like Notebook servers, and are also applied as VPC endpoint policies to put explicit controls around the API endpoints created in a data science environment.

There are several IAM roles deployed by this source code to manage permissions and ensure separation of concerns at scale. Those roles are:

  • Project Administrator role

    Granting permissions to create data science environments via the AWS Service Catalog.

  • Data science environment administrator role

    Granting permissions to administer project-specific resources.

  • Data science environment user role

    Granting Console access, start/stop Jupyter notebook, open Jupyter notebook, create Jupyter notebook via Service Catalog

  • Notebook execution role

    Used by a Jupyter Notebook to access AWS resources, this is created on a per user per notebook basis. This role can be re-used for training jobs, batch transformations, and other Amazon SageMaker resources to support auditbility.

The IAM policies created by this source code use many IAM conditions to grant powerful permissions but only under certain conditions.

Data Protection

It is assumed that a data science environment contains highly sensitive data to train ML models, and that there is also sensitive intellectual property in the form of algorithms, libraries, and trained models. There are many ways to protect data such as the preventive controls described above, defined as IAM policies. In addition this source code encrypts data at rest using managed encryption keys.

Many AWS services, including Amazon S3 and Amazon SageMaker, are integrated with AWS Key Management Service (KMS) to make it very easy to encrypt your data at rest. This source code takes advantage of these integrations to ensure that your data is encrypted in Amazon S3 and on Amazon SageMaker resources, end to end. This encryption is also applied to your intellectual property as it is being developed in the many places it may be stored such as Amazon S3, EC2 EBS volumes, or AWS CodeCommit git repository.

Auditability

Using cloud services in a safe and responsible manner is good, but being able to demonstrate to others that you are operating in a governed manner is even better. Developers and security officers alike will need to see activity logs for models being trained and persons interacting with the systems. Amazon CloudWatch Logs and CloudTrail are there to help, receiving logs from many different parts of your data science environment to include:

  • Amazon S3
  • Amazon SageMaker Notebooks
  • Amazon SageMaker Training Jobs
  • Amazon SageMaker Hosted Models
  • VPC Flow Logs

Architecture and Process Overview

High-level Architecture

Once deployed, this CloudFormation stack provides you with a Data Science Product Portfolio, powered by AWS Service Catalog. This allows users who have assumed the Project Administrator role to deploy new data science environments using the Data Science Environment product within the catalog. Project Administrators can specify a project name, the environment type, and a few other criteria to launch the data science environment. AWS Service Catalog will then create a data science project environment consisting of:

To use the environment, project team members can assume the Data Science Project Administrator role or the Data Science Project User role. Once they have assumed a project role users can provision resources within the data science environment. By visiting the AWS Service Catalog console they can access the project's product portfolio and launch an Amazon SageMaker notebook.

AWS Service Catalog will then deploy an Amazon SageMaker-powered Jupyter notebook server using an approved CloudFormation template. This will produce a Jupyter notebook server with:

  • A KMS-encrypted EBS volume attached
  • An IAM role associated with the notebook server which represents the intersection of user, notebook server, and project
  • An attachment to the data science project VPC
  • User access to root permissions disabled
  • Notebook server access to network resources outside of the project VPC disabled
  • A convenience Python module generated with constants defined for AWS KMS key IDs, VPC Subnet IDs, and Security Group IDs

Once the notebook server has been deployed the user can access the notebook server directly from the Service Catalog console.

Repository Breakdown

This repository contains the following files:

├── CODE_OF_CONDUCT.md                      # Guidance for participating in this open source project
├── CONTRIBUTING.md                         # Guidelines for contributing to this project
├── LICENSE                                 # Details for the MIT-0 license
├── README.md                               # This readme
├── cloudformation
│   ├── publish_cloudformation.sh           # Bash shell script to package and prepare CloudFormation for deployment
│   ├── ds_admin_detective.yaml             # Deploys a detective control to manage SageMaker resources
│   ├── ds_admin_principals.yaml            # Deploys the Project Administrator role
│   ├── ds_administration.yaml              # Deploys nested stacks
│   ├── ds_env_backing_store.yaml           # Deploys a project's S3 buckets and CodeCommit repository
│   ├── ds_env_catalog.yaml                 # Deploys a project's product portfolio
│   ├── ds_env_network.yaml                 # Deploys a project's private network
│   ├── ds_env_principals.yaml              # Creates a project administrator and user
│   ├── ds_env_sagemaker.yaml               # Creates a lifecycle configuration for this project
│   ├── ds_environment.yaml                 # Manages nested stacks for a project
│   ├── ds_notebook_v1.yaml                 # Early version of a Jupyter notebook product
│   ├── ds_notebook_v2.yaml                 # Refined version of a notebook product
│   ├── ds_shared_services_ecs.yaml         # PyPI mirror running on AWS Fargate
│   └── ds_shared_services_network.yaml     # Creates a shared services VPC
├── docs
│   └── images
│       └── hla.png
└── src
    ├── detective_control
    │   └── inspect_sagemaker_resource.py   # Lambda function to detect non-VPC-attached SageMaker resources
    └── project_template
        ├── SageMaker-Demo-Notebook.ipynb          # Sample Jupyter notebbok to demonstrate security controls
        ├── credit_card_default_data.xls                # Sample data set
        ├── util
            ├── __init__.py
            └── utilsspec.py

Further Reading

This project is a variation of the following aws-sample:

There is a multitude of material and resources available to you to advise you on how to best support your business using AWS services. The following is a non-exhaustive list in no particular order:

License

This source code is licensed under the MIT-0 License. See the LICENSE file for details.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.