amazon-emr-management-guide's Issues

More in-depth security discussion of EMRFS AssumeRole support

It's unclear to me in the documentation today what the security model for EMRFS AssumeRole is. As far as I've been able to gather, the node running a task will transparently call AssumeRole on your behalf if you request a certain prefix or match a user pattern.

However, this seems more like a convenience mechanism than a strong security one (i.e., there's no mechanism in place to stop a task from hitting an S3 bucket with the instance credentials or another assumable role). Is that correct? It seems worth spelling this out more explicitly in the documentation so that people don't use it in an attempt to create security boundaries.

Information on essential EC2 node IAM permissions

The management guide has a fairly extensive guide to IAM permissions (including sub-pages of that), but as far as I can tell, seems to be lacking a fairly important piece of information: what EMR nodes actually need to do their job, independent of tasks running on top of them.

Right now the guidance seems to be roughly, "use the arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role managed policy, or you can customize your permissions, especially as it pertains to a security configuration that configures AssumeRole for EMRFS".

But arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role is actually pretty powerful:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "cloudwatch:*",
                "dynamodb:*",
                "ec2:Describe*",
                "elasticmapreduce:Describe*",
                "elasticmapreduce:ListBootstrapActions",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListSteps",
                "kinesis:CreateStream",
                "kinesis:DeleteStream",
                "kinesis:DescribeStream",
                "kinesis:GetRecords",
                "kinesis:GetShardIterator",
                "kinesis:MergeShards",
                "kinesis:PutRecord",
                "kinesis:SplitShard",
                "rds:Describe*",
                "s3:*",
                "sdb:*",
                "sns:*",
                "sqs:*",
                "glue:CreateDatabase",
                "glue:UpdateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:UpdatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition",
                "glue:CreateUserDefinedFunction",
                "glue:UpdateUserDefinedFunction",
                "glue:DeleteUserDefinedFunction",
                "glue:GetUserDefinedFunction",
                "glue:GetUserDefinedFunctions"
            ]
        }
    ]
}

Basically, it can do anything to S3, SNS, SQS, SDB, DynamoDB, and several other potentially scary things.
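
To make that concrete, here is a minimal Python sketch that lists which services a policy document grants as `service:*`. The `full_access_services` helper and the trimmed `POLICY_EXCERPT` are my own illustration (the excerpt keeps only a handful of the actions quoted above); nothing here comes from the EMR documentation.

```python
import json


def full_access_services(policy):
    """Return the service prefixes that an Allow statement grants as `service:*`."""
    services = set()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single action string
            actions = [actions]
        for action in actions:
            service, _, name = action.partition(":")
            if name == "*":
                services.add(service)
    return sorted(services)


# Abbreviated copy of the managed policy quoted above (most actions trimmed).
POLICY_EXCERPT = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": ["cloudwatch:*", "dynamodb:*", "ec2:Describe*", "kinesis:PutRecord",
                 "s3:*", "sdb:*", "sns:*", "sqs:*", "glue:GetTable"]
    }
  ]
}
""")

print(full_access_services(POLICY_EXCERPT))
# ['cloudwatch', 'dynamodb', 's3', 'sdb', 'sns', 'sqs']
```

Partial wildcards such as `ec2:Describe*` are deliberately not counted, since they are read-only in this policy.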

So if you don't fully trust your EMR tasks not to delete/corrupt all your S3 buckets and Dynamo tables, you probably want to customize that policy. But the documentation doesn't make a clear distinction between what EMR itself needs and some speculative permissions on what tasks running on top of it might want.

As far as I've been able to tell, these are completely unused by EMR itself:

  • DynamoDB
  • SDB
  • Kinesis

And S3 is at least partially used to upload logs to the configured logging bucket. Of course, if I tell my EMR job to fetch from s3://foo/bar, I'll need to also include permissions for that in my policy, but that separation is not very crisp right now.

It's also very hard for me to assess whether SNS/SQS is used internally by EMR today because both services have cross-account support so even if I see no relevant queues or topics in my account, I can't say with confidence that I'm not hobbling some uncommon EMR feature by not granting EMR access to those services.

The best experiment I've been able to run is to put the whole thing in a private subnet with no internet access and an S3 VPCE to send logs to S3. The EMR cluster seems quite content in that scenario, which suggests to me that everything but S3 is optional. But obviously if I were to tell an EMR package to fetch from Glue, that would break.

Ultimately, it would be nice to have a broken-down table in the documentation saying things like:

  • You always need S3 PutObject and ListBucket powers over your configured logging prefix.
  • If you want to use our Glue integration, you need permissions X, Y, Z on the instance IAM role.
  • If you want to use our EMRFS AssumeRole powers, you need to grant AssumeRole powers to the instance IAM role.

Or absent that (but this isn't a documentation thing), a cleaner separation between "task powers" and "EMR machinery powers" like what we have in ECS.
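
As a rough illustration of the first bullet, this Python sketch builds a policy document limited to a logging prefix. It is an assumption on my part that these two actions are enough; that is exactly what I'd like the documentation to confirm. The `logging_policy` helper and the bucket and prefix names are hypothetical.

```python
import json


def logging_policy(bucket, prefix):
    """Build an IAM policy limited to listing and writing under one S3 log prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so the prefix goes in a condition.
                "Sid": "ListLogPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # PutObject is an object-level action, so the prefix goes in the ARN.
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
print(json.dumps(logging_policy("my-emr-logs", "/cluster-logs/"), indent=2))
```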

How to update the emr master dns whenever the cluster terminates

Hi,
When we terminate an EMR cluster and spin up a new one within a VPC, the IP address and the DNS name of the master node, the resource manager, and the other interfaces all change. To get a friendly name that always points to the currently running EMR cluster, I found the following document:
https://aws.amazon.com/blogs/big-data/dynamically-create-friendly-urls-for-your-amazon-emr-web-interfaces/
We use our on-prem DNS. Is there really no other way to do this? If we create a VPC endpoint for the EMR cluster and add a DNS alias to that VPC endpoint, would that not serve the purpose?
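
For reference, here is a sketch of what repointing a friendly name could look like if the record lived in a Route 53 hosted zone, which is an assumption and not our setup. The `master_cname_change` helper and the names in the sample are hypothetical. `MasterPublicDnsName` is the field returned by the EMR `DescribeCluster` API, and the returned dictionary has the shape that Route 53's `ChangeResourceRecordSets` accepts as a change batch. With on-prem DNS, the same lookup would feed whatever update mechanism the DNS server offers.

```python
def master_cname_change(describe_cluster_response, friendly_name, ttl=60):
    """Build a Route 53 change batch that points friendly_name at the master node."""
    master_dns = describe_cluster_response["Cluster"]["MasterPublicDnsName"]
    return {
        "Comment": "Point friendly name at the current EMR master node",
        "Changes": [
            {
                # UPSERT creates the record or overwrites the one left by the old cluster.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": friendly_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": master_dns}],
                },
            }
        ],
    }


# Hypothetical response fragment standing in for a real DescribeCluster call.
sample_response = {"Cluster": {"MasterPublicDnsName": "ip-10-0-0-12.ec2.internal"}}
batch = master_cname_change(sample_response, "emr.example.internal")
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```

The short TTL is a design choice so that clients pick up a replacement cluster quickly. The sketch makes no AWS calls itself.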

Guide to creating cross-realm trust between EMR and AWS Managed AD

I'm having a lot of trouble getting the finicky details right when connecting an EMR cluster in my account to an AWS Managed Microsoft AD in the same account. In theory all the various knobs are in place, but a step-by-step guide would be pretty nice, especially if it included an overview of the aws-cli commands or the relevant API calls, to ease automation.

It's complicated a bit by the fact that the managed AD doesn't let you run the commands described here on it, like netdom trust EC2.INTERNAL /Domain:ad.domain.com /add /realm /passwordt:MyVeryStrongPassword, and instead exposes the trust machinery through an AWS API.

HA Supported Applications and Features Page Contains an en-dash instead of a normal dash

I would PR this in, but the docs haven't been ported here yet :)

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-applications-HDFS states:

If you need to find out which NameNode is active, you can use SSH to connect to any master node in the cluster and run the following command:

hdfs haadmin –getAllServiceState

If you stare closely, the character before "getAllServiceState" is an en-dash (–) rather than a normal dash (-). If you try to copy/paste it into a command-line window, you get this:

$ hdfs haadmin –getAllServiceState
Bad command '–getAllServiceState': expected command starting with '-'
...

If you replace the – with a - it works just fine:

$ hdfs haadmin -getAllServiceState
ip-XX-XX-XX-XX.ec2.internal:8020                  active    
ip-XX-XX-XX-XX.ec2.internal:8020                  standby
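
Until the page is fixed, a small Python sketch like the following can sanitize commands pasted from the docs. The `normalize_dashes` helper is my own illustration, and the set of look-alike characters is an assumption about what typically sneaks into rendered documentation.

```python
# Unicode characters that often replace an ASCII hyphen in rendered docs.
DASH_LOOKALIKES = {
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}


def normalize_dashes(command):
    """Replace dash look-alikes with a plain ASCII hyphen."""
    return command.translate(str.maketrans(DASH_LOOKALIKES))


print(normalize_dashes("hdfs haadmin \u2013getAllServiceState"))
# hdfs haadmin -getAllServiceState
```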

Document required connectivity for LocalDiskEncryptionKeyProvider type AwsKms

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-create-security-configuration.html and https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-awskms-keys discuss using KMS CMKs for EMR encryption. However, there is no mention that the main EC2 instances themselves require network connectivity to KMS when using AwsKms for the local disk encryption (either over the internet or over a VPC Endpoint). Having this spelled out explicitly would be helpful.
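
A quick way to check this from an instance in the target subnet is a plain TCP probe, sketched below in Python. The helper names are mine, and the hostname pattern is the standard regional KMS endpoint in the commercial partition; other partitions use a different suffix. A successful connection only shows network reachability, not that the instance profile may use the key.

```python
import socket


def kms_endpoint(region):
    """Return the standard regional KMS hostname (commercial partition only)."""
    return f"kms.{region}.amazonaws.com"


def can_reach_kms(region, timeout=3.0):
    """Return True if a TCP connection to the KMS endpoint on port 443 succeeds."""
    try:
        with socket.create_connection((kms_endpoint(region), 443), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


if __name__ == "__main__":
    # Region chosen for illustration; run this from inside the cluster's subnet.
    print(kms_endpoint("us-east-1"), can_reach_kms("us-east-1"))
```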

Encryption At Rest Options Seem to Work on HA Masters

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-features-unsupported states:

The following EMR features are currently not available in an EMR cluster with multiple master nodes:
...
* At-rest and in-transit encryption options

However, I successfully spun up a multi-master EMR cluster with encryption-at-rest options. I even terminated the active master node (at least, the node running the active YARN ResourceManager); a new node replaced it and had the encryption-at-rest options re-applied.

Document Required KMS Permissions

This is a bit similar to #9 but not fully included in it -- https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html states:

The role for the Amazon EC2 instance profile must have permissions to use the CMK you specify.

However, it doesn't actually specify what those permissions are. It then states:

You can use the AWS Management Console to add your instance profile or EC2 instance profile to the list of key users for the specified AWS KMS CMK, or you can use the AWS CLI or an AWS SDK to attach an appropriate key policy.

It then walks through using the AWS console to add the role as a "Key User", but it doesn't actually specify what the required permissions are, nor does it ever describe how one would use the AWS CLI or an AWS SDK to grant appropriate permissions. Can the required KMS permissions please be documented so we can more easily manage them in code?
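
For context, the Python sketch below builds the key policy statement that, as far as I can tell, the console writes when you add a principal as a key user. Whether EMR needs all of these actions, or others, is the open question here. The `key_user_statement` helper and the account ID in the role ARN are placeholders of mine.

```python
import json


def key_user_statement(role_arn):
    """Build a key policy statement modeled on the console's "key users" entry."""
    return {
        "Sid": "Allow use of the key",
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


# Placeholder account ID; substitute the ARN of your own instance profile role.
statement = key_user_statement("arn:aws:iam::111122223333:role/EMR_EC2_DefaultRole")
print(json.dumps(statement, indent=2))
```

A statement like this would be merged into the existing key policy and applied with the KMS `PutKeyPolicy` API, rather than replacing the policy wholesale.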

Thanks!

Incorrect instance type recommendation

The documentation at https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html does not match the content in this repo and there seems to be a typo in the live documentation around the recommended instance type.

"The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m5.xlarge instance. For clusters of more than 50 nodes, consider using an m4.xlarge."

I think it should have been:
"The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m5.large instance. For clusters of more than 50 nodes, consider using an m5.xlarge."

Basic user session policy contains invalid actions

The example basic user session policy at the following locations contains invalid actions:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-user-role.html#emr-studio-basic-session-policy

https://github.com/awsdocs/amazon-emr-management-guide/blob/main/doc_source/emr-studio-user-role.md#example-basic-user-session-policy

According to the console and the documentation, the following actions don't exist:

  • AttachEditor
  • DetachEditor
  • CreatePersistentAppUI
  • DescribePersistentAppUI
  • GetPersistentAppUIPresignedURL
  • GetOnClusterAppUIPresignedURL
  • CreateAccessTokenForManagedEndpoint

https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonelasticmapreduce.html
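
A check like this could be scripted. The Python sketch below compares a policy's actions against a caller-supplied set of valid action names, for example one copied from the service authorization page above. The `unknown_actions` helper is my own illustration, and the `KNOWN_SUBSET` in the example is a tiny subset used only to show the mechanics, not the full reference.

```python
def unknown_actions(policy, known_actions):
    """Return policy actions that are absent from known_actions (case-insensitive)."""
    known = {action.lower() for action in known_actions}
    missing = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for statement in statements:
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single action string
            actions = [actions]
        for action in actions:
            if "*" in action or "?" in action:
                continue  # wildcards need pattern matching; skipped in this sketch
            if action.lower() not in known:
                missing.append(action)
    return missing


# Illustrative subset only; a real check would load the complete action list.
KNOWN_SUBSET = {"elasticmapreduce:DescribeCluster", "elasticmapreduce:ListClusters"}
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": ["elasticmapreduce:DescribeCluster", "elasticmapreduce:AttachEditor"],
        }
    ],
}

print(unknown_actions(EXAMPLE_POLICY, KNOWN_SUBSET))
# ['elasticmapreduce:AttachEditor']
```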
