Secure Data Science Reference Architecture

Overview

Amazon SageMaker is a powerful enabler and a key component of a data science environment, but it is only part of what is required to build a complete and secure one. For more robust security you will need other AWS services such as Amazon CloudWatch, Amazon S3, and Amazon VPC. This project is an example of how to combine these services to create secure, self-service data science environments.

Table of Contents

  1. Getting Started

  2. Features

  3. Architecture and Process Overview

  4. Repository Breakdown

  5. Further Reading

  6. License

Getting Started

Use the links below to quickly deploy this repository to your AWS account. There is no need to clone or fork the repository: the source code is available in Amazon S3, ready for deployment via CloudFormation. To get started, click one of the buttons below.

Region                      Launch Template
Oregon (us-west-2)          Deploy to AWS Oregon
Ohio (us-east-2)            Deploy to AWS Ohio
N. Virginia (us-east-1)     Deploy to AWS N. Virginia
Ireland (eu-west-1)         Deploy to AWS Ireland
London (eu-west-2)          Deploy to AWS London
Sydney (ap-southeast-2)     Deploy to AWS Sydney
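
Each launch button is a link into the CloudFormation console with the region and template location pre-filled. The sketch below shows how such a link can be assembled for the regions listed above. The stack name and the template URL are hypothetical placeholders, since the actual S3 location used by the buttons is not given in this README.

```python
from urllib.parse import quote

# Regions listed in the launch table above.
REGIONS = [
    "us-west-2", "us-east-2", "us-east-1",
    "eu-west-1", "eu-west-2", "ap-southeast-2",
]


def launch_stack_url(region: str, stack_name: str, template_url: str) -> str:
    """Build a CloudFormation console link that pre-fills the create-stack form."""
    if region not in REGIONS:
        raise ValueError(f"unsupported region: {region}")
    return (
        f"https://console.aws.amazon.com/cloudformation/home?region={region}"
        f"#/stacks/new?stackName={quote(stack_name, safe='')}"
        f"&templateURL={quote(template_url, safe='')}"
    )


if __name__ == "__main__":
    # Hypothetical stack name and template location, for illustration only.
    print(launch_stack_url(
        "us-west-2",
        "secure-ds-core",
        "https://example-bucket.s3.amazonaws.com/ds_administration.yaml",
    ))
```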

Step 1, as yourself

Assuming you are signed in to the AWS console, clicking one of the buttons above will take you to the AWS CloudFormation console for your selected region. Accept the stack's default values, tick the boxes for Capabilities, and click Create stack. After approximately 5 minutes the stack will have deployed a Shared Services VPC to be used across all data science environments and a PyPI mirror for hosting pre-approved Python packages within your network. The stack will also have deployed a product portfolio for creating data science environments.

Step 2, as a project administrator

To access the portfolio click the Outputs tab of the CloudFormation stack and use the AssumeProjectAdminRole link to become a project administrator, capable of creating data science project environments. Once you've assumed the role you can visit the AWS Service Catalog console to deploy a data science environment.
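
The AssumeProjectAdminRole output is a console switch-role link. As a rough illustration of what such a link contains, the sketch below builds one from an account ID and a role name. The account ID, role name, and display name used here are hypothetical; the real values come from the stack's Outputs tab.

```python
from urllib.parse import quote, urlencode


def switch_role_url(account_id: str, role_name: str, display_name: str) -> str:
    """Build an AWS console link that opens the Switch Role form pre-filled."""
    query = urlencode(
        {"account": account_id, "roleName": role_name, "displayName": display_name},
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"https://signin.aws.amazon.com/switchrole?{query}"


if __name__ == "__main__":
    # Hypothetical account ID and role name, for illustration only.
    print(switch_role_url("111122223333", "ExampleProjectAdminRole", "Project Admin"))
```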

Click the context menu icon next to the Data Science Project Environment product and click Launch product. After you provide a name for the product launch and a project name, click Launch. This launches a CloudFormation stack that provisions your first data science project environment, which takes about 10 minutes.

Step 3, as a project team member

When the data science project environment has completed its deployment you will have two links available from the Service Catalog console to assume user roles in the data science environment. Click the AssumeProjectUserRole link and return to the AWS Service Catalog console to launch a Jupyter notebook.

Click the context menu icon next to the SageMaker Notebook product, select Launch product, and give your notebook product a name. Provide an email address, the name of the project this notebook belongs to, and a username. Click Launch and about 5 minutes later you will be provided with a hyperlink to open your Jupyter notebook server.

Step 4, Explore

From the Jupyter notebook server, using the sample notebooks, you can develop features, train, host, and monitor a machine learning model in a secure manner. If you assume your original AWS role you can also, from the AWS console, explore the various features deployed by the CloudFormation stacks.

Features

This source code demonstrates a self-service model for enabling project teams to create data science environments that employ a number of recommended security practices. Some of the more notable features are listed below. The controls, mechanisms, and services deployed by this source code are intended to assure operations and security teams that their best practices are being followed, while also enabling project teams to self-serve, move quickly, and stay focused on the data science task at hand.

Private Network per Data Science Environment

For every data science environment created, a Virtual Private Cloud (VPC) is deployed to host Amazon SageMaker and other components of the data science environment. The VPC provides a familiar set of network-level controls to govern ingress and egress of data. These templates create a VPC with no Internet Gateway (IGW), so all subnets are private and have no Internet connectivity. Network connectivity with AWS services or your own shared services is provided using VPC endpoints and PrivateLink. Security Groups control traffic between different resources, allowing you to group like resources together and manage their ingress and egress traffic.
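
One way to confirm the "no Internet Gateway" property is to inspect a VPC's route tables for routes that target an internet gateway. The sketch below is not part of this repository; it assumes input shaped like the `RouteTables` list returned by the EC2 DescribeRouteTables API, and the function name is illustrative.

```python
from typing import Dict, List


def tables_routing_to_igw(route_tables: List[Dict]) -> List[str]:
    """Return the IDs of route tables that have a route targeting an internet gateway."""
    offending = []
    for table in route_tables:
        for route in table.get("Routes", []):
            # Internet gateway targets have IDs beginning with "igw-".
            if str(route.get("GatewayId", "")).startswith("igw-"):
                offending.append(table["RouteTableId"])
                break
    return offending


if __name__ == "__main__":
    # Placeholder data: one private table and one table with an IGW route.
    sample = [
        {"RouteTableId": "rtb-private", "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationPrefixListId": "pl-example", "GatewayId": "vpce-example"},
        ]},
        {"RouteTableId": "rtb-public", "Routes": [
            {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-example"},
        ]},
    ]
    print(tables_routing_to_igw(sample))
```

An empty result for a project VPC is consistent with all of its subnets being private.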

Authentication and Authorization

AWS Identity and Access Management (IAM) is used to create least-privilege, preventive controls for many aspects of the data science environments. These preventive controls, in the form of IAM policies, are used to control access to a project's data in Amazon S3 and to control who can access SageMaker resources such as notebook servers. They are also applied as VPC endpoint policies to put explicit controls around the API endpoints created in a data science environment.

There are several IAM roles deployed by this source code to manage permissions and ensure separation of concerns at scale. Those roles are:

  • Project Administrator role

    Granting permissions to create data science environments via the AWS Service Catalog.

  • Data science environment administrator role

    Granting permissions to administer project-specific resources.

  • Data science environment user role

    Granting console access and permissions to start, stop, and open Jupyter notebooks, and to create Jupyter notebooks via Service Catalog.

  • Notebook execution role

    Used by a Jupyter notebook to access AWS resources. This role is created on a per-user, per-notebook basis and can be re-used for training jobs, batch transformations, and other Amazon SageMaker resources to support auditability.

The IAM policies created by this source code use many IAM conditions to grant powerful permissions but only under certain conditions.
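
To illustrate what condition-based controls can look like, the sketch below builds a small policy document as a Python dictionary. It is not copied from this repository's templates. It uses the `sagemaker:VpcSubnets` and `aws:SourceVpce` condition keys to deny SageMaker jobs that are not attached to a VPC and S3 access that does not arrive through a given VPC endpoint. The function name, endpoint ID, and bucket ARN are hypothetical.

```python
import json
from typing import Dict


def build_guardrail_policy(vpc_endpoint_id: str, bucket_arn: str) -> Dict:
    """Return an IAM policy document with two condition-based deny statements."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny job creation when no VPC subnets are supplied.
                "Sid": "DenySageMakerJobsWithoutVpc",
                "Effect": "Deny",
                "Action": [
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:CreateProcessingJob",
                    "sagemaker:CreateModel",
                ],
                "Resource": "*",
                "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
            },
            {
                # Deny S3 access that does not come through the named endpoint.
                "Sid": "DenyS3OutsideVpcEndpoint",
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}},
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder endpoint ID and bucket ARN, for illustration only.
    policy = build_guardrail_policy("vpce-example", "arn:aws:s3:::example-project-data")
    print(json.dumps(policy, indent=2))
```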

Data Protection

It is assumed that a data science environment contains highly sensitive data to train ML models, and that there is also sensitive intellectual property in the form of algorithms, libraries, and trained models. There are many ways to protect data such as the preventive controls described above, defined as IAM policies. In addition this source code encrypts data at rest using managed encryption keys.

Many AWS services, including Amazon S3 and Amazon SageMaker, are integrated with AWS Key Management Service (KMS) to make it easy to encrypt your data at rest. This source code takes advantage of these integrations to ensure that your data is encrypted in Amazon S3 and on Amazon SageMaker resources, end to end. This encryption is also applied to your intellectual property as it is being developed, in the many places it may be stored, such as Amazon S3, EC2 EBS volumes, or an AWS CodeCommit Git repository.
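
As a minimal sketch of how these integrations surface in API calls, the helpers below assemble the encryption-related arguments for an S3 PutObject request and for a SageMaker training job. The function names and all identifiers are hypothetical, and the training job fragment is deliberately partial: a real CreateTrainingJob request needs further fields such as the instance type and count.

```python
from typing import Dict


def encrypted_put_object_args(bucket: str, key: str, body: bytes, kms_key_id: str) -> Dict:
    """Arguments for S3 PutObject that request server-side encryption with a KMS key."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }


def training_job_encryption(kms_key_id: str, output_s3_uri: str) -> Dict:
    """Partial CreateTrainingJob fields that encrypt model output and the attached volume."""
    return {
        "OutputDataConfig": {"S3OutputPath": output_s3_uri, "KmsKeyId": kms_key_id},
        "ResourceConfig": {"VolumeKmsKeyId": kms_key_id},
    }


if __name__ == "__main__":
    # Placeholder key ID, bucket, and paths, for illustration only.
    key_id = "example-kms-key-id"
    print(encrypted_put_object_args("example-bucket", "data/train.csv", b"a,b\n1,2\n", key_id))
    print(training_job_encryption(key_id, "s3://example-bucket/output/"))
```

With boto3 installed, the first dictionary could be passed as `s3_client.put_object(**args)`.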

Auditability

Using cloud services in a safe and responsible manner is good, but being able to demonstrate to others that you are operating in a governed manner is even better. Developers and security officers alike will need to see activity logs for models being trained and for persons interacting with the systems. Amazon CloudWatch Logs and CloudTrail help here, receiving logs from many different parts of your data science environment, including:

  • Amazon S3
  • Amazon SageMaker Notebooks
  • Amazon SageMaker Training Jobs
  • Amazon SageMaker Hosted Models
  • VPC Flow Logs
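
When looking for SageMaker logs in CloudWatch Logs, it helps to know which log group each resource type writes to. The sketch below maps the SageMaker resource types listed above to their log group names. The helper name and the short resource-type labels are illustrative assumptions, not names used by this repository.

```python
from typing import Optional

# CloudWatch Logs groups used by SageMaker for fixed-name resource types.
_FIXED_LOG_GROUPS = {
    "notebook": "/aws/sagemaker/NotebookInstances",
    "training": "/aws/sagemaker/TrainingJobs",
}


def log_group_for(resource_type: str, endpoint_name: Optional[str] = None) -> str:
    """Return the CloudWatch Logs group name for a SageMaker resource type."""
    if resource_type == "endpoint":
        # Hosted models log to a group named after the endpoint.
        if not endpoint_name:
            raise ValueError("endpoint_name is required for hosted models")
        return f"/aws/sagemaker/Endpoints/{endpoint_name}"
    try:
        return _FIXED_LOG_GROUPS[resource_type]
    except KeyError:
        raise ValueError(f"unknown resource type: {resource_type}") from None


if __name__ == "__main__":
    print(log_group_for("training"))
    print(log_group_for("endpoint", "example-endpoint"))
```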

Architecture and Process Overview

High-level Architecture

Once deployed, this CloudFormation stack provides you with a Data Science Product Portfolio, powered by AWS Service Catalog. This allows users who have assumed the Project Administrator role to deploy new data science environments using the Data Science Environment product within the catalog. Project Administrators can specify a project name, the environment type, and a few other criteria to launch the data science environment. AWS Service Catalog will then create a data science project environment consisting of:

To use the environment, project team members can assume the Data Science Project Administrator role or the Data Science Project User role. Once they have assumed a project role users can provision resources within the data science environment. By visiting the AWS Service Catalog console they can access the project's product portfolio and launch an Amazon SageMaker notebook.

AWS Service Catalog will then deploy an Amazon SageMaker-powered Jupyter notebook server using an approved CloudFormation template. This will produce a Jupyter notebook server with:

  • A KMS-encrypted EBS volume attached
  • An IAM role associated with the notebook server which represents the intersection of user, notebook server, and project
  • An attachment to the data science project VPC
  • User access to root permissions disabled
  • Notebook server access to network resources outside of the project VPC disabled
  • A convenience Python module generated with constants defined for AWS KMS key IDs, VPC Subnet IDs, and Security Group IDs
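
The generated convenience module is not reproduced in this README. As a hedged sketch of what such a module might contain and how a notebook could use it, consider the following. The constant names, the helper function, and all IDs are hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical contents of a generated constants module; all IDs are placeholders.
KMS_KEY_ID: str = "example-kms-key-id"
SUBNET_IDS: List[str] = ["subnet-example1", "subnet-example2"]
SECURITY_GROUP_IDS: List[str] = ["sg-example"]


def vpc_config() -> Dict[str, List[str]]:
    """Return a VpcConfig structure in the shape SageMaker job requests expect."""
    return {
        "Subnets": list(SUBNET_IDS),
        "SecurityGroupIds": list(SECURITY_GROUP_IDS),
    }


if __name__ == "__main__":
    # A notebook could pass these values when creating a training job so the
    # job runs inside the project VPC and encrypts its storage volume.
    print(vpc_config())
    print(KMS_KEY_ID)
```

Generating these constants saves users from looking up network and key identifiers by hand, which makes it easier to keep every job inside the project's VPC and encryption boundaries.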

Once the notebook server has been deployed the user can access the notebook server directly from the Service Catalog console.

Repository Breakdown

This repository contains the following files:

├── CODE_OF_CONDUCT.md                      # Guidance for participating in this open source project
├── CONTRIBUTING.md                         # Guidelines for contributing to this project
├── LICENSE                                 # Details for the MIT-0 license
├── README.md                               # This readme
├── cloudformation
│   ├── publish_cloudformation.sh           # Bash shell script to package and prepare CloudFormation for deployment
│   ├── ds_admin_detective.yaml             # Deploys a detective control to manage SageMaker resources
│   ├── ds_admin_principals.yaml            # Deploys the Project Administrator role
│   ├── ds_administration.yaml              # Deploys nested stacks
│   ├── ds_env_backing_store.yaml           # Deploys a project's S3 buckets and CodeCommit repository
│   ├── ds_env_catalog.yaml                 # Deploys a project's product portfolio
│   ├── ds_env_network.yaml                 # Deploys a project's private network
│   ├── ds_env_principals.yaml              # Creates a project administrator and user
│   ├── ds_env_sagemaker.yaml               # Creates a lifecycle configuration for this project
│   ├── ds_environment.yaml                 # Manages nested stacks for a project
│   ├── ds_notebook_v1.yaml                 # Early version of a Jupyter notebook product
│   ├── ds_notebook_v2.yaml                 # Refined version of a notebook product
│   ├── ds_shared_services_ecs.yaml         # PyPI mirror running on AWS Fargate
│   └── ds_shared_services_network.yaml     # Creates a shared services VPC
├── docs
│   └── images
│       └── hla.png
└── src
    ├── detective_control
    │   └── inspect_sagemaker_resource.py   # Lambda function to detect non-VPC-attached SageMaker resources
    └── project_template
        ├── 00_SageMaker-SysOps-Workflow.ipynb          # Sample Jupyter notebook to demonstrate security controls
        ├── 01_SageMaker-DataScientist-Workflow.ipynb   # Sample Jupyter notebook to demonstrate secure ML lifecycle
        ├── 02_SageMaker-DevOps-Workflow.ipynb          # Second half of a secure ML lifecycle
        ├── credit_card_default_data.xls                # Sample data set
        └── util
            ├── __init__.py
            └── utilsspec.py
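
The source of `inspect_sagemaker_resource.py` is not shown in this README. As a rough idea of the kind of check a detective control for non-VPC-attached resources might perform, the sketch below inspects a resource description for a populated `VpcConfig`. It assumes input shaped like a SageMaker DescribeTrainingJob response; the function name is illustrative and this is not the repository's implementation.

```python
from typing import Dict


def is_vpc_attached(description: Dict) -> bool:
    """Return True when a SageMaker resource description has subnets and security groups."""
    vpc = description.get("VpcConfig") or {}
    return bool(vpc.get("Subnets")) and bool(vpc.get("SecurityGroupIds"))


if __name__ == "__main__":
    # Placeholder descriptions, for illustration only.
    attached = {
        "TrainingJobName": "example-job-a",
        "VpcConfig": {"Subnets": ["subnet-example"], "SecurityGroupIds": ["sg-example"]},
    }
    detached = {"TrainingJobName": "example-job-b"}
    for job in (attached, detached):
        status = "ok" if is_vpc_attached(job) else "NOT attached to a VPC"
        print(job["TrainingJobName"], status)
```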

Further Reading

There is a multitude of material and resources available to you to advise you on how to best support your business using AWS services. The following is a non-exhaustive list in no particular order:

License

This source code is licensed under the MIT-0 License. See the LICENSE file for details.

Contributors

amazon-auto, jpbarto, stefannatu

secure-data-science-reference-architecture's Issues

Add a self-reference to the SageMaker security group

The SageMaker security group doesn't have any inbound rules. Any two EC2 instances within the security group will not be able to communicate with each other (e.g. SageMaker training jobs on multiple ML-instances).
The self-reference should be added to the security group:

  # Self-referencing the security group to enable communication between instances within the same SG
  SageMakerSecurityGroupSelfIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      IpProtocol: '-1'
      SourceSecurityGroupId: !Ref SageMakerSecurityGroup
      GroupId: !Ref SageMakerSecurityGroup
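
To check whether a deployed security group already has such a self-reference, a small helper can inspect the group's ingress rules. The sketch below assumes input shaped like one entry of the `SecurityGroups` list returned by the EC2 DescribeSecurityGroups API. The function name and group IDs are illustrative.

```python
from typing import Dict


def has_self_ingress(group: Dict) -> bool:
    """Return True if the group allows all inbound traffic from its own members."""
    for permission in group.get("IpPermissions", []):
        # "-1" means all protocols, matching the rule proposed above.
        if permission.get("IpProtocol") != "-1":
            continue
        for pair in permission.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == group["GroupId"]:
                return True
    return False


if __name__ == "__main__":
    # Placeholder groups, for illustration only.
    without_rule = {"GroupId": "sg-example", "IpPermissions": []}
    with_rule = {
        "GroupId": "sg-example",
        "IpPermissions": [
            {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": "sg-example"}]},
        ],
    }
    print(has_self_ingress(without_rule), has_self_ingress(with_rule))
```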

Duplicate tags in SageMaker dev notebook (Service Catalog)

In Lab 2: Deploy a Project environment, when the Service Catalog product "example-project SageMaker dev notebook" is added to the portfolio, the Service Catalog product has duplicate tags in the "Launch options" for tags "EnvironmentType" and "ProjectName". The "Launch options" duplicate tags appear when viewed through the end-user Products console page (not seen in the Service Catalog Administration console view).

With the beta version of the Service Catalog console, the "example-project SageMaker dev notebook" product will fail to launch in Lab 3 because of duplicate tags. The duplicate tags do not seem to be blocking when the product is launched through the old console.
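
The issue does not say where the duplicates originate. Whatever the cause, a quick way to confirm which keys are affected is to compare the tag lists that end up in the launch options. The sketch below assumes tags are given as lists of `{"Key": ..., "Value": ...}` dictionaries; the function name and sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def duplicate_tag_keys(*tag_sources: List[Dict[str, str]]) -> List[str]:
    """Return the sorted tag keys that appear more than once across all sources."""
    counts = Counter(tag["Key"] for source in tag_sources for tag in source)
    return sorted(key for key, count in counts.items() if count > 1)


if __name__ == "__main__":
    # Placeholder tag lists, for illustration only.
    first = [
        {"Key": "EnvironmentType", "Value": "dev"},
        {"Key": "ProjectName", "Value": "example-project"},
    ]
    second = [
        {"Key": "EnvironmentType", "Value": "dev"},
        {"Key": "ProjectName", "Value": "example-project"},
        {"Key": "Owner", "Value": "example-user"},
    ]
    print(duplicate_tag_keys(first, second))
```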

AWS Error Launching a product on AWS Service Catalog

Describe the bug
Hello,

Related to this workshop about creating a secure environment: https://sagemaker-workshop.com/security_for_sysops.html I got an error when launching the data science environment product.

To Reproduce
Follow the steps described in these labs: (https://sagemaker-workshop.com/security_for_sysops/best_practice/best_practice_lab.html & https://sagemaker-workshop.com/security_for_sysops/team_resources/secure_environment_lab.html). You will see the error when you try to launch the product.

Expected behavior
The product should have been launched without errors.

Screenshots
I have this error on AWS Service Catalog:

[image]

Checking the stack created in CloudFormation, I see the following error on the Events tab:

[image]

Additional context
As you can see in the second screenshot, the error seems to be related to the creation of the DataScienceRole. I checked the template where this role is described and tried to create it manually. It seems that this template is out of date because some policy names have changed. It would be a great help if the creator of this workshop fixed that template.

Thanks for helping,
Kind regards

Include Tag Options Library to propagate centralized tagging

Is your feature request related to a problem? Please describe.
Customers are concerned about centrally managing cost controls on AWS

Describe the solution you'd like
With the Tag Options Library, admins can provision higher-level tags, tied to teams or projects, that propagate to all the Service Catalog environments created with those tags. A set of pre-defined tags can then be used to centrally manage costs and chargebacks.

We can update the readme to describe TOL and how it can be included in Module 1 before provisioning a product.

Add ListProcessingJobs action to the SageMakerAccessInlinePolicy of the ds user role.

Now that processing jobs are visible in the console, a workshop participant might want to track the processing job in Lab 3 through the console. The ds user role doesn't have permission to perform this action though, and the user is presented with

AccessDeniedException
User: arn:aws:sts::<account>:assumed-role/ds-user-role-example-project-dev/meganleoni is not authorized to perform: sagemaker:ListProcessingJobs

when they open the SageMaker console.
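
The requested change amounts to adding one action to an existing policy statement. The sketch below shows that edit on a policy document held as a Python dictionary. The statement ID, the existing actions, and the helper name are hypothetical, since the contents of SageMakerAccessInlinePolicy are not reproduced in this issue.

```python
import copy
from typing import Dict


def with_action(policy: Dict, sid: str, action: str) -> Dict:
    """Return a copy of the policy with the action added to the statement named by sid."""
    updated = copy.deepcopy(policy)
    for statement in updated["Statement"]:
        if statement.get("Sid") == sid:
            actions = statement["Action"]
            if isinstance(actions, str):
                actions = [actions]
            if action not in actions:
                actions = actions + [action]
            statement["Action"] = actions
            return updated
    raise KeyError(f"no statement with Sid {sid!r}")


if __name__ == "__main__":
    # Hypothetical excerpt of an inline policy, for illustration only.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ExampleSageMakerListAccess",
            "Effect": "Allow",
            "Action": ["sagemaker:ListTrainingJobs"],
            "Resource": "*",
        }],
    }
    print(with_action(policy, "ExampleSageMakerListAccess", "sagemaker:ListProcessingJobs"))
```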

IAM Policy reference in cfn script is deprecated now

Describe the bug
As per this link, some Lambda policies have been deprecated.
https://docs.aws.amazon.com/lambda/latest/dg/access-control-identity-based.html

To Reproduce
When you try to launch the product from Service Catalog, the cfn script fails because the policy no longer exists.

Expected behavior
The cfn script should complete without errors.

Desktop
Mac, Chrome
