
terraform-aws-eks's Introduction

Terraform EKS Module


This repo contains a set of Terraform modules that can be used to provision an Elastic Kubernetes Service (EKS) cluster on AWS.

This module provides a way to provision an EKS cluster based on the current best practices employed at Cookpad.

Using this module

To provision an EKS cluster you need (as a minimum) to specify a name, and the details of the VPC network you will create it in.

module "cluster" {
  source  = "cookpad/eks/aws"
  version = "~> 1.25"

  name       = "hal-9000"

  vpc_config = {
    vpc_id = "vpc-345abc"

    public_subnet_ids = {
      us-east-1a = "subnet-000af1234"
      us-east-1b = "subnet-123ae3456"
      us-east-1c = "subnet-456ab6789"
    }

    private_subnet_ids = {
      us-east-1a = "subnet-123af1234"
      us-east-1b = "subnet-456bc3456"
      us-east-1c = "subnet-789fe6789"
    }
  }
}

provider "kubernetes" {
  host                   = module.cluster.config.endpoint
  cluster_ca_certificate = base64decode(module.cluster.config.ca_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.cluster.config.name]
  }
}

There are many options that can be set to configure your cluster. Check the input documentation for more information.

Networking

If you only have simple networking requirements you can use the submodule cookpad/eks/aws//modules/vpc to create a VPC; its output variable config can be used to configure the vpc_config variable.
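
For example, something like the following should work. Note that the VPC module's input names shown here (cidr_block, availability_zones) are assumptions, so check the VPC module documentation for the exact variables it accepts.

module "vpc" {
  source  = "cookpad/eks/aws//modules/vpc"
  version = "~> 1.25"

  name               = "hal-9000"
  cidr_block         = "10.4.0.0/16"                                # assumed input name
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]   # assumed input name
}

module "cluster" {
  source  = "cookpad/eks/aws"
  version = "~> 1.25"

  name       = "hal-9000"
  vpc_config = module.vpc.config
}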

Check the VPC module documentation for more extensive information.

Karpenter

We use Karpenter to provision the nodes that run the workloads in our clusters. You can use the submodule cookpad/eks/aws//modules/karpenter to provision the resources required to use Karpenter, and a Fargate profile to run the Karpenter pods.
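
A minimal sketch of wiring it up; the cluster_config input name is an assumption, so check the Karpenter module documentation for the exact interface.

module "karpenter" {
  source  = "cookpad/eks/aws//modules/karpenter"
  version = "~> 1.25"

  # assumed input name - the module needs details of the cluster it provisions nodes for
  cluster_config = module.cluster.config
}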

Check the Karpenter module documentation for more information.

Requirements

In order to communicate with Kubernetes correctly, this module requires the AWS CLI to be installed and on your path.

You will need to initialise the kubernetes provider as shown in the example.

Multi User Environments

In an environment where multiple IAM users are used to run terraform plan and terraform apply, it is recommended to use the assume role functionality to assume a common IAM role in the aws provider definition.

provider "aws" {
  region              = "us-east-1"
  version             = "3.53.0"
  assume_role {
    role_arn = "arn:aws:iam::<your account id>:role/Terraform"
  }
}

See an example role here.

Alternatively, ensure that all users who need to run terraform are listed in the aws_auth_user_map variable of the cluster module.
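
A hedged sketch of what that might look like; the exact shape expected by aws_auth_user_map should be checked against the module's input documentation.

module "cluster" {
  source  = "cookpad/eks/aws"
  version = "~> 1.25"

  name       = "hal-9000"
  vpc_config = module.vpc.config

  # assumed shape: a list of objects mapping IAM users to Kubernetes RBAC groups
  aws_auth_user_map = [
    {
      userarn  = "arn:aws:iam::<your account id>:user/alice"
      username = "alice"
      groups   = ["system:masters"]
    },
  ]
}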

terraform-aws-eks's People

Contributors

aidy, akbargumbira, andrassy, crigor, dan-slinky-ckpd, dependabot[bot], errm, ettiee, jacksmith15, jportasa, mtpereira, pray, shimpeko, sikachu, sorah, takanabe


terraform-aws-eks's Issues

Make creating the cluster sg optional

By default EKS will create a security group for the cluster ... which we can just use.

We should add an option not to create aws_security_group.control_plane and just use this instead!
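
A rough sketch of how the opt-out could look; the variable name is hypothetical and not part of the module today.

variable "create_control_plane_security_group" {
  type        = bool
  default     = true
  description = "Hypothetical flag: set to false to rely on the security group EKS creates for the cluster."
}

resource "aws_security_group" "control_plane" {
  count  = var.create_control_plane_security_group ? 1 : 0
  name   = "${var.name}-control-plane"
  vpc_id = var.vpc_config.vpc_id
}

When the flag is false, the EKS-created group is still available via the cluster's vpc_config[0].cluster_security_group_id attribute.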

VPC

For completeness, a module to set up a VPC + subnets.

Nodes

Background

The cluster needs nodes so it can do things!

We need to decide if we should support the following, and then implement:

  • Managed Nodes
  • Fargate
  • On Demand, via ASG
  • Spot, via ASG

Things to consider

  • Autoscaling
  • Spot termination
  • Custom vs AWS image

Setup OIDC provider

EKS can use an OIDC provider to provide an IAM role per pod; this should be set up by the cluster module.
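
Roughly, this amounts to something like the following; the aws_eks_cluster resource name here is a placeholder for whatever the module uses internally.

data "tls_certificate" "oidc" {
  url = aws_eks_cluster.control_plane.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "cluster" {
  url             = aws_eks_cluster.control_plane.identity[0].oidc[0].issuer
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.oidc.certificates[0].sha1_fingerprint]
}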

IAM

Background

A module to create IAM resources would make this solution more complete...

Motivation

Not much, since we won't use it as we don't manage IAM with terraform at the moment!

Refactor to not use Kubernetes provider.

I have come to the conclusion that using the Kubernetes Provider to bootstrap the cluster is suboptimal for a few reasons:

  • It's a bit overkill - we don't really need terraform to manage the full lifecycle of Kubernetes objects (it can do that itself)
  • Lazy config of the Kubernetes provider isn't recommended (and was broken by the latest version hashicorp/terraform-provider-kubernetes#759) https://xkcd.com/1172/
  • There is a mismatch between terraform config and yaml that makes it harder to diff from upstream...

Alternatives to provisioners to ensure a clean cleanup...

When we want to cleanup a cluster there are some resources that won't be cleaned up by terraform.
Currently we have been adding some destroy time provisioners to deal with these cases (mainly for a clean CI run).

Both of these cases are unfortunate because they prevent some resources that are managed by terraform from being deleted, so we can't simply run a script as part of our CI run (plus document these issues).

Destroy provisioners are sub-optimal in both these cases because:

  • We are just running some shell script yuck .... (but this could be mitigated with enough testing)!
  • They will not run when a module is removed from the configuration ... due to hashicorp/terraform#13549 ...
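
For reference, the destroy-time provisioner pattern being discussed looks roughly like this; the resource name and script path are illustrative only.

resource "null_resource" "cleanup" {
  triggers = {
    cluster_name = var.name
  }

  # runs only when the resource is destroyed, to remove things terraform doesn't track
  provisioner "local-exec" {
    when    = destroy
    command = "./scripts/cleanup.sh ${self.triggers.cluster_name}"
  }
}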

Document / script a procedure to service out asg_node_group

Something like:

  • Disable cluster autoscaler for asg_node_group #111
  • Discover all the nodes managed by the asg_node_group (perhaps we need to add a label for this)
  • Cordon all the nodes
  • Drain each node with a configurable delay between
  • Remove the asg_node_group module!

Document/script a procedure to roll the nodes (upgrade)

When upgrading a cluster we want to migrate all the workloads running on a cluster to nodes running the new version!

In a cluster with the autoscaler enabled the procedure can be:

  • Check that the asgs have a high enough max_size to add additional capacity.
  • cordon all the nodes running the old version
  • drain each node running the old version (pause for some time period between each node; this could be done intelligently by waiting for cluster workloads to reach a steady state, or perhaps just pause some amount of time between each node).

The autoscaler should handle provisioning additional capacity (using the new version) as we drain the old nodes, but to avoid waiting for new nodes to be provisioned we could do some overprovisioning first! https://medium.com/scout24-engineering/cluster-overprovisiong-in-kubernetes-79433cb3ed0e

Once the old nodes have been drained the cluster autoscaler should mark them as under-utilised and remove them in due course.

Nginx Ingress controller

In the same vein as #87, can you see value in adding the Nginx Ingress controller to the module?

There are going to be situations where we might want to use the Nginx Ingress controller with an NLB rather than the AWS ALB, for example a service we want to place behind a Basic Auth implementation.
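
For context, exposing the controller through an NLB is mostly a matter of annotating its Service, e.g. via the kubernetes provider. The names below are placeholders and not part of this module.

resource "kubernetes_service" "nginx_ingress" {
  metadata {
    name      = "ingress-nginx-controller"
    namespace = "ingress-nginx"
    annotations = {
      # tells the in-tree AWS cloud provider to create an NLB instead of a classic ELB
      "service.beta.kubernetes.io/aws-load-balancer-type" = "nlb"
    }
  }

  spec {
    type = "LoadBalancer"
    selector = {
      "app.kubernetes.io/name" = "ingress-nginx"
    }
    port {
      port        = 443
      target_port = 443
    }
  }
}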

Assume EKSAdmin role in manifests from templates

@jnavarro86 and I are experiencing the following:

I created a cluster with this module yesterday.

Today @jnavarro86 ran a terraform plan on https://github.com/cookpad/global-aws and got some in-place changes from this module:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create
-/+ destroy and then create replacement
Terraform will perform the following actions:
  # module.iam.aws_iam_policy.cookpad_global_1_research_data_writable will be created
  + resource "aws_iam_policy" "cookpad_global_1_research_data_writable" {
      + arn         = (known after apply)
      + description = "Access to read/write artifacts in/from research-data.cookpad-global-1 in cookpad-global-1 AWS cookpad account."
      + id          = (known after apply)
      + name        = "cookpad-global-1-research-data-writable"
      + path        = "/"
      + policy      = jsonencode(
            {
              + Statement = [
                  + {
                      + Action   = [
                          + "s3:PutObjectAcl",
                          + "s3:PutObject",
                          + "s3:ListBucketVersions",
                          + "s3:ListBucket",
                          + "s3:GetObjectVersion",
                          + "s3:GetObjectAcl",
                          + "s3:GetObject",
                          + "s3:DeleteObject",
                        ]
                      + Effect   = "Allow"
                      + Resource = [
                          + "arn:aws:s3:::research-data.cookpad-global-1/*",
                          + "arn:aws:s3:::research-data.cookpad-global-1",
                        ]
                      + Sid      = ""
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
    }
  # module.iam.aws_iam_role_policy_attachment.attach_cookpad_global_1_research_data_writable_to_MLServicesDeployment will be created
  + resource "aws_iam_role_policy_attachment" "attach_cookpad_global_1_research_data_writable_to_MLServicesDeployment" {
      + id         = (known after apply)
      + policy_arn = (known after apply)
      + role       = "MLServicesDeploymentsstaging"
    }
  # module.testing-eks-cluster.module.aws_auth.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "2876031626230467720" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "fd0b2cd0d1b8c03172146eb2ae05d934101ea3c9" -> "cf2ffb9f14ddf3310b0a8d4c0e653ad2cf4d4cfc"
        }
    }
  # module.testing-eks-cluster.module.aws_node_termination_handler.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "6960512851880308979" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "f776501a0a073a3468954b3e49d0f3072f719d06" -> "8900a3db6e4471100784c62183483ae71c552ba5"
        }
    }
  # module.testing-eks-cluster.module.cluster_autoscaler.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "7127138984923775440" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "a9f328dacbea6796d3278ac7df0d3adf3f96eb65" -> "600acb30f72144b0e00155d12fac8023a48aaa89"
        }
    }
  # module.testing-eks-cluster.module.metrics_server.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "8957918169943094131" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "b557313e65407769f40427871758f980c7eaf0e9" -> "14cff8c248aec8a589090dffa2dab79c80dae9dc"
        }
    }
  # module.testing-eks-cluster.module.pod_nanny.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "2349652268422680257" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "a9d96d1424a177997c7849c78e192fd5b9b481be" -> "d4a0a0ac2380b275d8fbf529b743384053e6df4f"
        }
    }
  # module.testing-eks-cluster.module.prometheus_node_exporter.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "8495425553566447814" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "ece2ae801ea663851bc22a0af667937c680a5547" -> "e9c91943f6fb87a6ce824a9779b1163938aba250"
        }
    }
  # module.testing-eks-cluster.module.storage_classes.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "1737947817272327439" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "a5349d15d4301f5ab68d91e15175bc9726602cdd" -> "b12737dd4e9aac3c336e59b06146cbdcb1aff152"
        }
    }
Plan: 9 to add, 0 to change, 7 to destroy.

I believe updating the manifest templates to assume the EKSAdmin role will stop us seeing these diffs?

Handle cluster bootstrap from 0 nodes.

Background

When a cluster is provisioned it has 0 nodes... so there is nowhere for the cluster autoscaler to run in order to scale out the auto scaling groups.

Work

  • Work out how to handle this and implement a solution.
    • Create a dedicated asg to run "cluster services"
    • Run cluster autoscaler on fargate... (see the sketch after this list)
    • Just require a minimum number of nodes in each ASG.
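
If the Fargate route were taken, it would look roughly like this; the resource names and the pod execution role are placeholders, not things this module currently creates.

resource "aws_eks_fargate_profile" "cluster_autoscaler" {
  cluster_name           = aws_eks_cluster.control_plane.name
  fargate_profile_name   = "cluster-autoscaler"
  pod_execution_role_arn = aws_iam_role.fargate_pod_execution.arn
  subnet_ids             = values(var.vpc_config.private_subnet_ids)

  # schedules the cluster autoscaler pods on Fargate so they can run before any nodes exist
  selector {
    namespace = "kube-system"
    labels = {
      app = "cluster-autoscaler"
    }
  }
}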

Correctly label nodes with `node-role.kubernetes.io/worker`

So we (and the cluster autoscaler) can tell if a node is spot, or on-demand.

node-role.kubernetes.io/worker=true
node-role.kubernetes.io/spot-worker=true

We should also tag the asg with:

k8s.io/cluster-autoscaler/node-template/label/node-role.kubernetes.io/{worker/spot-worker}
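
For the ASG tag, that would be something along these lines; the resource and launch template names are placeholders.

resource "aws_autoscaling_group" "spot_workers" {
  name                = "hal-9000-spot-workers"
  min_size            = 0
  max_size            = 10
  vpc_zone_identifier = values(var.vpc_config.private_subnet_ids)

  launch_template {
    id      = aws_launch_template.spot_workers.id
    version = "$Latest"
  }

  # lets the cluster autoscaler know nodes from this group will carry the label,
  # even before any node exists
  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/node-role.kubernetes.io/spot-worker"
    value               = "true"
    propagate_at_launch = true
  }
}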
