
aws-eks-best-practices's Introduction

Amazon Elastic Kubernetes Service (Amazon EKS) Best Practices

A best practices guide for day 2 operations, including operational excellence, security, reliability, performance efficiency, and cost optimization.

Return to Live Docs.

Contributing

While the best practices were originally authored by AWS employees, we encourage and welcome contributions from the Kubernetes user community. If you have a best practice that you would like to share, please review the Contributing Guidelines before submitting a PR.

License Summary

The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.

The sample code within this documentation is made available under the MIT-0 license. See the LICENSE-SAMPLECODE file.

aws-eks-best-practices's People

Contributors

alanty, andrewcr7, awsbpfeiff, bellkev, chipzoller, cmanikandan, dependabot[bot], ellistarn, federicaciuffo, geoffcline, hackmd-deploy, jamesiri, jicomusic, jicowan, jimmyraywv, kbiton, liwadman, lukemwila, marciogmorales, realvz, rodrigobersa, rothgar, senatoredu, sheetaljoshi, simyung, sotoiwa, svennam92, tzneal, urbanadventurer, wasiqaws


aws-eks-best-practices's Issues

Private EKS Cluster not accessible

Describe the problem
Hi, this is Srinivasa. I created an EKS cluster in AWS using eksctl. By default it creates a cluster with a public API server endpoint, but I need it to be private, so I changed the endpoint access to private from the AWS console. After the change, I can no longer access the cluster from the server where I installed kubectl and eksctl; I get a `tcp <ip>:443: i/o timeout` error. That server is in a private subnet, and all my worker nodes are in private subnets as well, but I don't know why I'm getting this error. Please help me troubleshoot; I can provide any additional information you need.
EKS-version 1.15
thank you

References
Please include a link to the lines where the error appears.

Windows Caveats

Speaking as someone who is attempting to run a mixed EKS cluster with both Windows and Linux workloads, there are several caveats and best practices that are specific to running Windows nodes that it would be helpful to highlight. It would also be beneficial to point out if certain tools or recommendations are incompatible with Windows.

Perhaps the best solution to this overall is an entire section dedicated to Windows EKS best practices, but intermingling notes about running Windows with existing sections could work as well.

I hope this feedback is helpful, even though it didn't quite fit the defined template.

Recommendation to use lifecycle policy in ECR

Is your idea request related to a problem that you've solved? Please describe.
NIST SP800-190 (Application Container Security Guide) lists "3.2.2 Stale images in registries" as a registry risk. As a countermeasure, it recommends automating the removal of insecure images in "4.2.2 Stale images in registries".

Describe the best practice
Amazon ECR lifecycle policies enable you to specify the lifecycle management of images in a repository. Consider using lifecycle policies to automate the removal of images for older generations that may be insecure.
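
A minimal sketch of such a lifecycle policy, assuming a hypothetical repository and a 14-day window for untagged images (both are assumptions, not values from the guide):

{
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire untagged images older than 14 days",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": 14
            },
            "action": { "type": "expire" }
        }
    ]
}

It could be applied with aws ecr put-lifecycle-policy --repository-name <repo> --lifecycle-policy-text file://policy.json.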

egress-operator && kube-scan

Is your idea request related to a problem that you've solved? Please describe.
A clear and concise description of the problem.

Describe the best practice
A clear and concise description of the best practice you developed along with any code and/or projects you used to solve the problem.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the idea here.

Whitelist the image registry

Is your idea request related to a problem that you've solved? Please describe.
In NIST SP800-190, "3.1.5 Use of untrusted images" is listed as an image risk, and "Enforcement to ensure that all hosts in the environment only run images from these approved lists" is written as a countermeasure example.

The CIS EKS Benchmark also mentions "5.1.4 Minimize Container Registries to only those approved".

The EKS Workshop has an example of whitelisting the registry with OPA.

Actually, there are sample policies for OPA, Gatekeeper, and Kyverno that whitelist the registry in this best practices guide repository.

The current EKS Best Practices Guide does not clearly mention whitelisting the container registry as a recommendation, so how about mentioning it?

Describe the best practice
Consider only allowing images from approved image registries to run. Policy solutions such as OPA and Kyverno can be used for this purpose. Example policies can be found here.
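
As one illustration, a minimal Kyverno ClusterPolicy sketch that only admits images from a single registry; the policy name, account ID, and region are placeholders, not values from the guide:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: enforce
  rules:
  - name: validate-registries
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "Images must come from the approved ECR registry."
      pattern:
        spec:
          containers:
          - image: "111122223333.dkr.ecr.us-west-2.amazonaws.com/*"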

TODO: Auditing additions

  • Audit changes to the aws-auth ConfigMap
  • Monitor increases in 403 Forbidden and 401 Unauthorized response codes (we already have Log Insights queries in the doc; need to add timeframes)
  • Anonymous calls to the API server
  • Alert when there's an increase in 403 Forbidden responses; show the host, sourceIPs, and k8s_user.username attributes (see the sample query below)
  • Misconfigured RBAC policies, unusual API calls
  • 401s: identify authentication issues (e.g., expired certificates or malformed tokens)
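
A rough CloudWatch Logs Insights sketch for the 403 alerting item; the field paths assume the standard Kubernetes audit event schema and may need adjusting:

fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter responseStatus.code = 403
| stats count(*) as count by requestURI, sourceIPs.0, user.username
| sort count desc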

Fargate pod incorrectly treated as "Task"

Describe the problem
In the infrastructure page, a Fargate Pod is incorrectly referred to as a "Task". This common occurrence leads to confusion with the ECS service:

References
Please include a link to the lines where the error appears.

With EKS Fargate, AWS will automatically update the underlying infrastructure as updates become available. Oftentimes this can be done seamlessly, but there may be times when an update will cause your task to be rescheduled.

https://aws.github.io/aws-eks-best-practices/hosts/#treat-your-infrastructure-as-immutable-and-automate-the-replacement-of-your-worker-nodes

Include content about OIDC authentication for EKS

Is your idea request related to a problem that you've solved? Please describe.
A clear and concise description of the problem.

Describe the best practice
A clear and concise description of the best practice you developed along with any code and/or projects you used to solve the problem.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the idea here.

In 1.19, it is not necessary to specify the securityContext when using IRSA in non-root containers

Describe the problem

According to the section referenced below, when using IRSA, non-root containers need to specify fsGroup in the securityContext to set the file permissions for the web identity token. In Kubernetes 1.19 this is no longer required, so it would be good to add this point.

This is documented in the EKS documentation.

https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-1.19

You're no longer required to provide a security context for non-root containers that need to access the web identity token file for use with IAM roles for service accounts. For more information, see IAM roles for service accounts and the proposal for file permission handling in projected service account volume on GitHub.

References
https://aws.github.io/aws-eks-best-practices/security/docs/iam/#run-the-application-as-a-non-root-user

RFE: Provide more guidance concerning network policies

@jicowan Please provide more guidance and clarification regarding the many options available for enforcing network policies.

The network security section recommends several network policies and mentions CNI plugins (e.g., Cilium or Calico), or a service mesh (e.g. AWS App Mesh) as means of enforcement.

How to choose between these options?

Additional clarification regarding general applicability of Security Groups for Pods and App Mesh would also be helpful:

  • Are security groups for pods a viable means of enforcing all of the recommended traffic controls? They were designed to control egress (right?) and it is unclear how generally applicable they are.
  • Now that AWS App Mesh handles ingress, egress, and virtual gateways, is it a viable one-stop solution?

Related:

awsdocs/amazon-eks-user-guide#88

Write blurb on using OPA as an alternative to PSPs

Is your idea request related to a problem that you've solved? Please describe.
A clear and concise description of the problem.

Describe the best practice
A clear and concise description of the best practice you developed along with any code and/or projects you used to solve the problem.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the idea here.
based on content from https://www.infracloud.io/kubernetes-pod-security-policies-opa/

Duplicate recommendations about running containers as non-root user

Describe the problem
There are duplicate recommendations about running the container as a non-root user in both the IAM category and the Pod Security category.

The descriptions in the IAM and Pod Security categories are duplicated, and since they are about the securityContext, I think it is better to keep it only in the Pod Security category.

It's also mentioned in the Image category, but I think that's fine as it's about specifying the USER in the Dockerfile.

References
https://aws.github.io/aws-eks-best-practices/security/docs/iam/#run-the-application-as-a-non-root-user
https://aws.github.io/aws-eks-best-practices/security/docs/pods/#do-not-run-processes-in-containers-as-root
https://aws.github.io/aws-eks-best-practices/security/docs/image/#add-the-user-directive-to-your-dockerfiles-to-run-as-a-non-root-user

TODO: Tracee for finding evasive malware

Image scanning can find vulnerabilities and malware
In libraries, packages, etc.
Compare SHA of the files with SHAs of known malware
Scanners can detect misconfigurations, e.g. secrets embedded in the container image

Chain of trust is important
Evasive malware
Scanning is important but not sufficient
Malware can be hiding dormant in container images
In tar files or nested tar files; decompressed when the image is run
Need runtime security because static analysis is not enough

Honeypot
Shift left (run containers in a sandbox)
eBPF allows you to plug into the kernel to handle evasive malware: the malware will make calls into the kernel where eBPF is waiting. Instrument parts of the kernel and make an assessment.

Tracee - CLI tool to detect malware in the sandbox.
Need to know what to look for
DTA (Dynamic Threat Analysis) wraps the session into a product: run the container in a sandbox, run Tracee in there, and present the findings in a dashboard. Assigns a risk score to the container.
Free to start.
Can be combined with Elasticsearch to look for specific findings.
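
For experimentation, Tracee can be run as a privileged container; the exact flags and image tag vary by Tracee version, so treat this invocation as a sketch:

docker run --rm -it \
  --pid=host --cgroupns=host --privileged \
  -v /etc/os-release:/etc/os-release-host:ro \
  aquasec/tracee:latest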

Recommendation to create an EKS cluster with a dedicated IAM role

Is your idea request related to a problem that you've solved? Please describe.
The IAM user/role that creates an EKS cluster always has admin access. User management for the cluster is configured through the aws-auth ConfigMap; however, this user/role is not present in that file. Unless access to this user/role is protected and monitored, it can be used to gain privileged access to the cluster.

Describe the best practice
A good solution to this problem is to create a custom IAM role that is used exclusively to create the EKS cluster. Controls can be put in place to restrict who can assume this role. Additionally, once the cluster's aws-auth ConfigMap has been configured and additional users have been granted access, this role can be deleted for extra protection, provided that it can be recreated with the same ARN. This ensures that this backdoor entry to the cluster does not remain, but that it can later be recreated to regain access in an emergency / break-glass situation. Recreating the role also gives an additional audit trail, which is especially useful for controlling user access to production clusters that do not usually have direct user access configured.
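
A hedged shell sketch of the workflow; the role and cluster names are hypothetical, and the trust policy file is assumed to exist:

# Create the dedicated cluster-creator role; its trust policy controls who can assume it.
aws iam create-role --role-name eks-cluster-creator \
    --assume-role-policy-document file://trust-policy.json

# Create the cluster while operating under this role, e.g. with eksctl.
eksctl create cluster --name prod-cluster

# Once aws-auth grants other admins access, detach any policies and delete the role.
# Recreating it later with the same name yields the same ARN for break-glass access.
aws iam delete-role --role-name eks-cluster-creator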

Describe alternatives you've considered
Other alternatives would also exist if this initial root access could be configured, but this is not currently supported by EKS. As an additional security control when this root role is recreated, an automated function could delete the role an hour after it is created, to ensure access to production clusters is automatically revoked after a period of time.

PR incoming with suggested wording

EKS support for FlexVolume Plugin

Is there a way to support FlexVolume plugin in EKS?

Currently, the only supported CSI drivers on EKS are the EBS, EFS, and FSx for Lustre storage classes, but these can't be used by Windows pods that require access to file shares (file shares on Windows natively use the SMB protocol).

Kubernetes has an alternative way to support SMB storage classes: the FlexVolume plugin. The issue is that this plugin must be installed on both master and worker nodes, but since EKS doesn't give access to the control plane ("master" nodes), it is difficult to install the plugin.

The installation guide for FlexVolume can be found here.

It requires this command to be run on each node:

VOLUME_PLUGIN_DIR="/usr/libexec/kubernetes/kubelet-plugins/volume/exec"
mkdir -p "$VOLUME_PLUGIN_DIR/fstab~cifs"
cd "$VOLUME_PLUGIN_DIR/fstab~cifs"
curl -L -O https://raw.githubusercontent.com/fstab/cifs/master/cifs
chmod 755 cifs

This installation guide works well on a self-managed K8s cluster.
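
On EKS, one possible workaround is to skip the control plane entirely and install the driver only on the worker nodes via a DaemonSet that writes the script into the kubelet plugin directory. This is a hypothetical sketch (names and images are assumptions, and the hosts still need cifs-utils and jq installed):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cifs-flexvolume-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cifs-flexvolume-installer
  template:
    metadata:
      labels:
        app: cifs-flexvolume-installer
    spec:
      initContainers:
      # Copies the fstab/cifs driver script into the node's kubelet plugin directory.
      - name: install
        image: amazonlinux:2
        command:
        - sh
        - -c
        - |
          mkdir -p /flexmnt/fstab~cifs
          curl -L -o /flexmnt/fstab~cifs/cifs https://raw.githubusercontent.com/fstab/cifs/master/cifs
          chmod 755 /flexmnt/fstab~cifs/cifs
        volumeMounts:
        - name: flexvolume-dir
          mountPath: /flexmnt
      containers:
      # Keeps the DaemonSet pod alive after the install step completes.
      - name: pause
        image: registry.k8s.io/pause:3.9
      volumes:
      - name: flexvolume-dir
        hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec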

apt requires `apt update` before `apt upgrade`

Hi,
Thanks for writing:

You should include RUN apt-get upgrade in your Dockerfiles to upgrade the packages in your images.

Could you revise two things?

  1. Please use apt-get update && apt-get upgrade. update is required before upgrade; skipping it is one of the most frequently seen mistakes.
  2. apt clean is also useful. It cleans up the files in /var/cache/apt/archives. (A combined Dockerfile sketch is shown below.)
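
A sketch of what the combined Dockerfile instruction might look like (the base image is an arbitrary example):

FROM debian:bullseye-slim
# update must run first so upgrade sees current package lists;
# clean plus removing the lists keeps the layer small.
RUN apt-get update && \
    apt-get -y upgrade && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*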

References
https://aws.github.io/aws-eks-best-practices/security/docs/image/#update-the-packages-in-your-container-images

https://www.debian.org/doc/manuals/debian-handbook/sect.apt-get.en.html

Flesh out more Auth Federation details

It would be great to have additional information about the mechanics of federating with an AD or LDAP provider.

I think it’s worth expanding a bit on how things fit together: how the federated roles are mapped to IAM roles which, in turn, are used in the ConfigMap; how that ConfigMap gets created/updated, etc.
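
For example, a minimal aws-auth sketch assuming a hypothetical FederatedAdminRole that the identity provider maps users into; the account ID and group are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/FederatedAdminRole
      username: federated-admin:{{SessionName}}
      groups:
      - system:masters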

Network Policy can be used to selectively allow metadata access to the pods

Is your idea request related to a problem that you've solved? Please describe.

The topic referenced above describes how to block access to instance metadata.

In my experience, the Kinesis Client Library used by some Pods does not support IRSA, so I could not block metadata access: using only IMDSv2 with the hop count set to 1, or using iptables, would affect all the Pods on the node.

However, I could use a Kubernetes Network Policy to selectively allow metadata access for specific pods.

How about adding a description of how to block metadata access using Network Policy?

Describe the best practice

You can use a Kubernetes Network Policy to block metadata access globally and selectively allow it for specific pods.

First, block access to the metadata service from all pods by adding the following policy.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-metadata-access
  namespace: example
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32

Then allow access from specific pods by adding the following policy.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metadata-access
  namespace: example
spec:
  podSelector:
    matchLabels:
      app: myapp  
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 169.254.169.254/32

CNI custom networking for managed node groups

Describe the problem
The web page section https://aws.github.io/aws-eks-best-practices/reliability/docs/networkmanagement/#cni-custom-networking claims "EKS managed node groups currently don’t support custom networking option." This conflicts with AWS official documentation.

References
Web page section: https://aws.github.io/aws-eks-best-practices/reliability/docs/networkmanagement/#cni-custom-networking : Claims "EKS managed node groups currently don’t support custom networking option."

AWS official docs: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html shows example with managed node groups.

Recommendation to include sessionName when mapping roles in aws-auth ConfigMap

Is your idea request related to a problem that you've solved? Please describe.
When accessing the EKS cluster as an IAM entity mapped in the aws-auth ConfigMap, the username specified in the aws-auth ConfigMap is recorded in the user field of the Kubernetes audit log. If you're using an IAM role, the actual users who assume that role aren't recorded and can't be audited.

Describe the best practice
When assigning K8s RBAC permissions to an IAM role using mapRoles in the aws-auth ConfigMap, you should include {{SessionName}} in the username. That way, the audit log will record the session name, so you can track which actual user assumed the role by correlating with the CloudTrail log.

- rolearn: arn:aws:iam::XXXXXXXXXXXX:role/testRole
  username: testRole:{{SessionName}}
  groups:
    - system:masters

Accessing the IRSA service account token as a non-root user

If you run your application as a non-root user (a best practice), you cannot access the IRSA service account token because it is assigned 0600 (root-only) permissions by default. If you update the securityContext for your container to include fsGroup: 65534 (nobody), the container will be able to read the token.

spec:
  securityContext:
    fsGroup: 65534

This is supposed to be fixed in an upcoming release of k8s, kubernetes/enhancements#1598.

Add seccomp.security.alpha.kubernetes.io/pod: "runtime/default"

Is your idea request related to a problem that you've solved? Please describe.
A clear and concise description of the problem.

Describe the best practice
This implements the runtime defaults of Docker or another CRI.

Describe alternatives you've considered
After 1.19, it's part of the securityContext for the Pod or container (see the sketch below).
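
A minimal sketch of the post-1.19 form, using an arbitrary pod name and image:

apiVersion: v1
kind: Pod
metadata:
  name: seccomp-default
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: nginx:latest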

Additional context
Add any other context or screenshots about the idea here.

Conflicting advice about mixed instance policies

Describe the problem
The spot instances section states this in the first paragraph: "Mixed Instance Policies with Spot Instances are a great way to increase diversity without increasing the number of node groups...". Later, in the third paragraph, the statement is: "It's recommended to isolate On-Demand and Spot capacity into separate EC2 Auto Scaling groups. This is preferred over using a base capacity strategy because the scheduling properties are fundamentally different...". It is unclear what exactly the recommendation is: mixed instance policies or separate ASGs. I think it would be best to remove the statement about mixed instance policies being "great".

References
https://github.com/aws/aws-eks-best-practices/blob/master/content/cluster-autoscaling/cluster-autoscaling.md#spot-instances

Embellish section on Forensics

Describe the best practice
Customers want additional information about how to do a forensics investigation involving containers.

This is an evolving space. Performing a forensic investigation against a container is challenging because containers are often ephemeral; by the time you realize a container has been compromised, it has been replaced. You can compensate for this by running software that warns of suspicious behavior while the container is running, but additional guidance is necessary for capturing evidence of a breach.

Cannot apply PodSecurityPolicy configurations in 'content/security/docs/pods.md'

Describe the problem

I cannot apply PodSecurityPolicy configurations in https://aws.github.io/aws-eks-best-practices/pods/

Here is the output of applying the PSP named "eks.privileged".

$ cat << EOF | k apply -f -
> apiVersion: extensions/v1beta1
> kind: PodSecurityPolicy
> metadata:
>   annotations:
>     kubernetes.io/description: privileged allows full unrestricted access to pod features,
>       as if the PodSecurityPolicy controller was not enabled.
>     seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
>   labels:
>     eks.amazonaws.com/component: pod-security-policy
>     kubernetes.io/cluster-service: "true"
>   name: eks.privileged
> spec:
>   allowPrivilegeEscalation: true
>   allowedCapabilities:
>   - '*'
>   fsGroup:
>     rule: RunAsAny
>   hostIPC: true
>   hostNetwork: true
>   hostPID: true
>   hostPorts:
>   - max: 65535
>     min: 0
>   privileged: true
>   runAsUser:
>     rule: RunAsAny
>   seLinux:
>     rule: RunAsAny
>   supplementalGroups:
>     rule: RunAsAny
>   volumes:
>   - '*'
> EOF
error: unable to recognize "STDIN": no matches for kind "PodSecurityPolicy" in version "extensions/v1beta1"

The apiVersion should be policy/v1beta1 instead of extensions/v1beta1.
In the PSP named "restricted", it is already policy/v1beta1.

Also, I cannot apply the "restricted" PSP.

$ cat <<EOF | k apply -f -
> apiVersion: policy/v1beta1
> kind: PodSecurityPolicy
> metadata:
>     name: restricted
>     annotations:
>     seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
>     apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
>     seccomp.security.alpha.kubernetes.io/defaultProfileName:  'runtime/default'
>     apparmor.security.beta.kubernetes.io/defaultProfileName:  'runtime/default'
> spec:
>     privileged: false
>     # Required to prevent escalations to root.
>     allowPrivilegeEscalation: false
>     # This is redundant with non-root + disallow privilege escalation,
>     # but we can provide it for defense in depth.
>     requiredDropCapabilities:
>     - ALL
>     # Allow core volume types.
>     volumes:
>     - 'configMap'
>     - 'emptyDir'
>     - 'projected'
>     - 'secret'
>     - 'downwardAPI'
>     # Assume that persistentVolumes set up by the cluster admin are safe to use.
>     - 'persistentVolumeClaim'
>     hostNetwork: false
>     hostIPC: false
>     hostPID: false
>     runAsUser:
>     # Require the container to run without root privileges.
>     rule: 'MustRunAsNonRoot'
>     seLinux:
>     # This policy assumes the nodes are using AppArmor rather than SELinux.
>     rule: 'RunAsAny'
>     supplementalGroups:
>     rule: 'MustRunAs'
>     ranges:
>         # Forbid adding the root group.
>         - min: 1
>         max: 65535
>     fsGroup:
>     rule: 'MustRunAs'
>     ranges:
>         # Forbid adding the root group.
>         - min: 1
>         max: 65535
>     readOnlyRootFilesystem: false
> EOF
error: error parsing STDIN: error converting YAML to JSON: yaml: line 40: did not find expected '-' indicator

It seems that some of the indents in the configuration are incorrect (e.g., annotations, rule).
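
For reference, this is roughly what the restricted policy looks like with consistent indentation, following the upstream Kubernetes example:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535
  readOnlyRootFilesystem: false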

Here are my cluster versions:

$ k version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:40:13Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

References
https://aws.github.io/aws-eks-best-practices/pods/

Calico and eBPF for large cluster

In the Running large clusters section, we mention the ipvs mode of kube-proxy.

There is an AWS blog about replacing kube-proxy with Calico in eBPF mode, which seems to perform even better. EKS 1.19 and above runs on the Amazon Linux 2 AMI with kernel 5.4, so Calico in eBPF mode seems to be available. How about mentioning this AWS blog?
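
For context, the ipvs mode mentioned above is set in kube-proxy's KubeProxyConfiguration (on EKS this lives in the kube-proxy ConfigMap in kube-system; the exact layout varies by platform version, so this excerpt is a sketch):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"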

Run apt-get upgrade to upgrade OS packages in your image

Is your idea request related to a problem that you've solved? Please describe.
Reference this blog: https://pythonspeed.com/articles/security-updates-in-docker/

Describe the best practice
A clear and concise description of the best practice you developed along with any code and/or projects you used to solve the problem.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the idea here.

Stream Kubernetes audit logs to S3 using CWL subscriptions and Firehose

Is your idea request related to a problem that you've solved? Please describe.
It can be difficult to analyze the audit logs when they are in CloudWatch Logs (CWL). If they're streamed to S3, you can use Athena, Glue, SageMaker, and other AWS services to analyze them. You can also use tools like audit2rbac to create RBAC policies (e.g., roles, rolebindings, clusterroles, and clusterrolebindings) from observed behavior in the logs.

Describe the best practice

  • Create S3 bucket
  • Create IAM policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>",
                "arn:aws:s3:::<bucket>/*"
            ]
        }
    ]
}
  • Create Lambda function from Lambda kinesis-firehose-cloudwatch-logs-processor-python blueprint
  • Configure Firehose stream to deliver logs to your bucket. Enable source record transformation and specify the Lambda function you created from the blueprint.
  • Create IAM policy to allow CWL to put data onto the stream:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<account>:role/CWLtoKinesisFirehoseRole"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "firehose:*",
            "Resource": "arn:aws:firehose:us-west-2:<account>:*"
        }
    ]
}
  • Create a subscription filter for the log group where your audit logs can be found:
aws logs put-subscription-filter --log-group-name "<log group>" --filter-name "Destination" --destination-arn "<firehose_arn>" --role-arn "<role_arn>" --filter-pattern ""

Logs streamed from CWL to S3 with Firehose are automatically compressed. The Lambda function decompresses the logs before they're written to S3.

Describe alternatives you've considered
https://github.com/rafaelpereyra/ekscw-export

Additional context
Once the logs are in S3 you can run the following:

audit2rbac --filename audit-delivery-stream-2-2020-06-19-19-01-58-c91bfdd2-d182-4803-8ba8-bca8284a5aaf --user=bob

This will generate RBAC permissions for user Bob based on observed behavior in the logs.

Fargate Profile

Describe the problem
Write a blurb on the Fargate pod execution role and issues with the path and the aws-auth ConfigMap.

References
Please include a link to the lines where the error appears.

More guidance for the upcoming deprecation of PSPs

@jicowan Currently, OPA Gatekeeper is only mentioned in two links in a Tools and Resources section at the bottom of the page. Is this adequate guidance? (I ask this naively not rhetorically.)

Given that PSP is deprecated, I'm trying to determine what the best practice should be regarding pod security. Can you discuss the decision of whether to replace and/or augment PSP with Gatekeeper or Kyverno in the body of this section? I would appreciate it if you could recommend a course of action. Or are we to assume that we should stick with PSP for now, even if we are creating a new cluster?

Originally posted by @joebowbeer in #16 (comment)

strace (seccomp) && docker profile

Is your idea request related to a problem that you've solved? Please describe.
A clear and concise description of the problem.

Describe the best practice
A clear and concise description of the best practice you developed along with any code and/or projects you used to solve the problem.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the idea here.

Fix statement about kube-bench reporting false positives

Many URLs are not linked

Describe the problem
While reading, I noticed a number of plain URLs not actually being links.

References
Submitting a PR to linkify all the URLs I could find.

Broken link to the k8s audit policy that EKS uses as a base

Describe the problem
At the top of the Detective Controls section, the guide mentions the k8s audit policy that EKS uses. The range of lines in the linked code is no longer correct because the referenced configure-helper.sh has been updated.

The guide refers to L983-L1108, but L1116-L1241 seems to be correct at this moment.

Is EKS still using the latest updated GCP configure-helper.sh as a base? The EKS user guide does not mention the k8s audit policy that EKS uses.

References
https://aws.github.io/aws-eks-best-practices/security/docs/detective.html

PSPs EOL: Transition to Policy as Code (PaC) solutions and/or Pod Security Standards (PSS)

Is your idea request related to a problem that you've solved? Please describe.
A clear and concise description of the problem.

Describe the best practice
A clear and concise description of the best practice you developed along with any code and/or projects you used to solve the problem.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the idea here.

Monitoring Control Plane Metrics shows guidance for deprecated metrics

Describe the problem

Metric | Notes
etcd_request_latencies_summary | Deprecated: kubernetes/kubernetes#76496; replaced with etcd_request_duration_seconds
etcd_helper_cache_entry_total | Deprecated: kubernetes/kubernetes#79520
etcd_helper_cache_hit_total | Deprecated: kubernetes/kubernetes#79520
etcd_helper_cache_miss_total | Deprecated: kubernetes/kubernetes#79520
etcd_request_cache_get_duration_seconds | Deprecated: kubernetes/kubernetes#79520
etcd_request_cache_add_duration_seconds | Deprecated: kubernetes/kubernetes#79520

The PR/code changes for this deprecation do not list alternatives.

Can anyone provide alternatives for these deprecated metrics?
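
For the first row, the linked PR names etcd_request_duration_seconds as the replacement. A PromQL sketch for p99 request latency using it (the operation label is an assumption based on upstream code; the cache metrics from #79520 appear to have no direct replacements):

histogram_quantile(0.99,
  sum(rate(etcd_request_duration_seconds_bucket[5m])) by (operation, le))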

References
https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#monitor-control-plane-metrics

Upstream Kubernetes lifecycle changes

Describe the problem
The support window for minor versions of upstream Kubernetes has been extended to one year, starting with 1.19.
The release cadence for minor versions of upstream Kubernetes has changed to three releases per year, starting with 1.22.
I think these changes need to be reflected in the EKS best practices documentation.

https://kubernetes.io/blog/2020/08/31/kubernetes-1-19-feature-one-year-support/
https://kubernetes.io/blog/2021/07/20/new-kubernetes-release-cadence/

References
https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#handling-cluster-upgrades
