
terraform-aws-eks's Introduction

Terraform EKS Module


This repo contains a set of Terraform modules that can be used to provision an Elastic Kubernetes Service (EKS) cluster on AWS.

This module provides a way to provision an EKS cluster based on the current best practices employed at Cookpad.

Using this module

To provision an EKS cluster you need (as a minimum) to specify a name, and the details of the VPC network you will create it in.

module "cluster" {
  source  = "cookpad/eks/aws"
  version = "~> 1.25"

  name       = "hal-9000"

  vpc_config = {
    vpc_id = "vpc-345abc"

    public_subnet_ids = {
      us-east-1a = "subnet-000af1234"
      us-east-1b = "subnet-123ae3456"
      us-east-1c = "subnet-456ab6789"
    }

    private_subnet_ids = {
      us-east-1a = "subnet-123af1234"
      us-east-1b = "subnet-456bc3456"
      us-east-1c = "subnet-789fe6789"
    }
  }
}

provider "kubernetes" {
  host                   = module.cluster.config.endpoint
  cluster_ca_certificate = base64decode(module.cluster.config.ca_data)

  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.cluster.config.name]
  }
}

There are many options that can be set to configure your cluster. Check the input documentation for more information.

Networking

If you only have simple networking requirements you can use the submodule cookpad/eks/aws//modules/vpc to create a VPC; its output variable config can be used to configure the vpc_config variable.
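
For example, something like the following should work. Note that the VPC module's input names shown here (cidr_block, availability_zones) are assumptions, so check the VPC module documentation for the exact variables it accepts.

module "vpc" {
  source  = "cookpad/eks/aws//modules/vpc"
  version = "~> 1.25"

  name               = "hal-9000"
  cidr_block         = "10.4.0.0/16"                                # assumed input name
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]   # assumed input name
}

module "cluster" {
  source  = "cookpad/eks/aws"
  version = "~> 1.25"

  name       = "hal-9000"
  vpc_config = module.vpc.config
}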

Check the VPC module documentation for more extensive information.

Karpenter

We use Karpenter to provision the nodes that run the workloads in our clusters. You can use the submodule cookpad/eks/aws//modules/karpenter to provision the resources required to use Karpenter, and a Fargate profile to run the Karpenter pods.
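
A minimal sketch of wiring it up; the cluster_config input name is an assumption, so check the Karpenter module documentation for the exact interface.

module "karpenter" {
  source  = "cookpad/eks/aws//modules/karpenter"
  version = "~> 1.25"

  # assumed input name - the module needs details of the cluster it provisions nodes for
  cluster_config = module.cluster.config
}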

Check the Karpenter module documentation for more information.

Requirements

In order to communicate with Kubernetes correctly, this module requires the AWS CLI to be installed and on your path.

You will need to initialise the kubernetes provider as shown in the example.

Multi User Environments

In an environment where multiple IAM users are used to run terraform plan and terraform apply, it is recommended to use the assume role functionality to assume a common IAM role in the aws provider definition.

provider "aws" {
  region              = "us-east-1"
  version             = "3.53.0"
  assume_role {
    role_arn = "arn:aws:iam::<your account id>:role/Terraform"
  }
}

See an example role here.

Alternatively, ensure that all users who need to run terraform are listed in the aws_auth_user_map variable of the cluster module.
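
A hedged sketch of what that might look like; the exact shape expected by aws_auth_user_map should be checked against the module's input documentation.

module "cluster" {
  source  = "cookpad/eks/aws"
  version = "~> 1.25"

  name       = "hal-9000"
  vpc_config = module.vpc.config

  # assumed shape: a list of objects mapping IAM users to Kubernetes RBAC groups
  aws_auth_user_map = [
    {
      userarn  = "arn:aws:iam::<your account id>:user/alice"
      username = "alice"
      groups   = ["system:masters"]
    },
  ]
}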

terraform-aws-eks's People

Contributors

aidy, akbargumbira, andrassy, crigor, dan-slinky-ckpd, dependabot[bot], errm, ettiee, jacksmith15, jportasa, mtpereira, pray, shimpeko, sikachu, sorah, takanabe


terraform-aws-eks's Issues

Make creating the cluster sg optional

By default EKS will create a security group for the cluster ... which we can just use.

We should add an option not to create aws_security_group.control_plane and just use this instead!
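
A rough sketch of how the opt-out could look; the variable name is hypothetical and not part of the module today.

variable "create_control_plane_security_group" {
  type        = bool
  default     = true
  description = "Hypothetical flag: set to false to rely on the security group EKS creates for the cluster."
}

resource "aws_security_group" "control_plane" {
  count  = var.create_control_plane_security_group ? 1 : 0
  name   = "${var.name}-control-plane"
  vpc_id = var.vpc_config.vpc_id
}

When the flag is false, the EKS-created group is still available via the cluster's vpc_config[0].cluster_security_group_id attribute.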

VPC

For completeness, a module to set up a VPC + subnets.

Nodes

Background

The cluster needs nodes so it can do things!

We need to decide if we should support the following, and then implement:

  • Managed Nodes
  • Fargate
  • On Demand, via ASG
  • Spot, via ASG

Things to consider

  • Autoscaling
  • Spot termination
  • Custom vs AWS image

Setup OIDC provider

EKS can use an OIDC provider to provide an IAM role per pod; this should be set up by the cluster module.
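
Roughly, this amounts to something like the following; the aws_eks_cluster resource name here is a placeholder for whatever the module uses internally.

data "tls_certificate" "oidc" {
  url = aws_eks_cluster.control_plane.identity[0].oidc[0].issuer
}

resource "aws_iam_openid_connect_provider" "cluster" {
  url             = aws_eks_cluster.control_plane.identity[0].oidc[0].issuer
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.oidc.certificates[0].sha1_fingerprint]
}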

IAM

Background

A module to create IAM resources would make this solution more complete...

Motivation

Not much, since we won't use it as we don't manage IAM with terraform at the moment!

Refactor to not use Kubernetes provider.

I have come to the conclusion that using the Kubernetes Provider to bootstrap the cluster is suboptimal for a few reasons:

  • It's a bit overkill - we don't really need terraform to manage the full lifecycle of Kubernetes objects (it can do that itself)
  • Lazy config of the Kubernetes provider isn't recommended (and was broken by the latest version hashicorp/terraform-provider-kubernetes#759) https://xkcd.com/1172/
  • There is a mismatch between terraform config and yaml that makes it harder to diff from upstream...

Alternatives to provisioners to ensure a clean cleanup...

When we want to cleanup a cluster there are some resources that won't be cleaned up by terraform.
Currently we have been adding some destroy time provisioners to deal with these cases (mainly for a clean CI run).

Both of these cases are unfortunate because they prevent some resources that are managed by terraform from being deleted, so we can't simply run a script as part of our CI run (plus document these issues).

Destroy provisioners are sub-optimal in both these cases because:

  • We are just running some shell script yuck .... (but this could be mitigated with enough testing)!
  • They will not run when a module is removed from the configuration ... due to hashicorp/terraform#13549 ...
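
For reference, the destroy-time provisioner pattern being discussed looks roughly like this; the resource name and script path are illustrative only.

resource "null_resource" "cleanup" {
  triggers = {
    cluster_name = var.name
  }

  # runs only when the resource is destroyed, to remove things terraform doesn't track
  provisioner "local-exec" {
    when    = destroy
    command = "./scripts/cleanup.sh ${self.triggers.cluster_name}"
  }
}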

Document / script a procedure to service out asg_node_group

Something like:

  • Disable cluster autoscaler for asg_node_group #111
  • Discover all the nodes managed by the asg_node_group (perhaps we need to add a label for this)
  • Cordon all the nodes
  • Drain each node with a configurable delay between
  • Remove the asg_node_group module!

Document/script a procedure to roll the nodes (upgrade)

When upgrading a cluster we want to migrate all the workloads running on a cluster to nodes running the new version!

In a cluster with the autoscaler enabled the procedure can be:

  • Check that the asgs have a high enough max_size to add additional capacity.
  • cordon all the nodes running the old version
  • drain each node running the old version (pause for some time period between each node; this could be done intelligently by waiting for cluster workloads to reach a steady state, or perhaps just pause some amount of time between each node).

The autoscaler should handle provisioning additional capacity (using the new version) as we drain the old nodes, but to avoid waiting for new nodes to be provisioned we could do some overprovisioning first! https://medium.com/scout24-engineering/cluster-overprovisiong-in-kubernetes-79433cb3ed0e

Once the old nodes have been drained the cluster autoscaler should mark them as under-utilised and remove them in due course.

Nginx Ingress controller

In the same vein as #87, can you see value in adding the Nginx Ingress controller to the module?

There are going to be situations where we might want to use the Nginx Ingress controller with an NLB rather than the AWS ALB, for example a service we want to place behind a Basic Auth implementation.
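
For context, exposing the controller through an NLB is mostly a matter of annotating its Service, e.g. via the kubernetes provider. The names below are placeholders and not part of this module.

resource "kubernetes_service" "nginx_ingress" {
  metadata {
    name      = "ingress-nginx-controller"
    namespace = "ingress-nginx"
    annotations = {
      # tells the in-tree AWS cloud provider to create an NLB instead of a classic ELB
      "service.beta.kubernetes.io/aws-load-balancer-type" = "nlb"
    }
  }

  spec {
    type = "LoadBalancer"
    selector = {
      "app.kubernetes.io/name" = "ingress-nginx"
    }
    port {
      port        = 443
      target_port = 443
    }
  }
}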

Assume EKSAdmin role in manifests from templates

@jnavarro86 and I are experiencing the following:

I created a cluster with this module yesterday.

Today @jnavarro86 ran a terraform plan on https://github.com/cookpad/global-aws and got some in-place changes from this module:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create
-/+ destroy and then create replacement
Terraform will perform the following actions:
  # module.iam.aws_iam_policy.cookpad_global_1_research_data_writable will be created
  + resource "aws_iam_policy" "cookpad_global_1_research_data_writable" {
      + arn         = (known after apply)
      + description = "Access to read/write artifacts in/from research-data.cookpad-global-1 in cookpad-global-1 AWS cookpad account."
      + id          = (known after apply)
      + name        = "cookpad-global-1-research-data-writable"
      + path        = "/"
      + policy      = jsonencode(
            {
              + Statement = [
                  + {
                      + Action   = [
                          + "s3:PutObjectAcl",
                          + "s3:PutObject",
                          + "s3:ListBucketVersions",
                          + "s3:ListBucket",
                          + "s3:GetObjectVersion",
                          + "s3:GetObjectAcl",
                          + "s3:GetObject",
                          + "s3:DeleteObject",
                        ]
                      + Effect   = "Allow"
                      + Resource = [
                          + "arn:aws:s3:::research-data.cookpad-global-1/*",
                          + "arn:aws:s3:::research-data.cookpad-global-1",
                        ]
                      + Sid      = ""
                    },
                ]
              + Version   = "2012-10-17"
            }
        )
    }
  # module.iam.aws_iam_role_policy_attachment.attach_cookpad_global_1_research_data_writable_to_MLServicesDeployment will be created
  + resource "aws_iam_role_policy_attachment" "attach_cookpad_global_1_research_data_writable_to_MLServicesDeployment" {
      + id         = (known after apply)
      + policy_arn = (known after apply)
      + role       = "MLServicesDeploymentsstaging"
    }
  # module.testing-eks-cluster.module.aws_auth.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "2876031626230467720" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "fd0b2cd0d1b8c03172146eb2ae05d934101ea3c9" -> "cf2ffb9f14ddf3310b0a8d4c0e653ad2cf4d4cfc"
        }
    }
  # module.testing-eks-cluster.module.aws_node_termination_handler.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "6960512851880308979" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "f776501a0a073a3468954b3e49d0f3072f719d06" -> "8900a3db6e4471100784c62183483ae71c552ba5"
        }
    }
  # module.testing-eks-cluster.module.cluster_autoscaler.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "7127138984923775440" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "a9f328dacbea6796d3278ac7df0d3adf3f96eb65" -> "600acb30f72144b0e00155d12fac8023a48aaa89"
        }
    }
  # module.testing-eks-cluster.module.metrics_server.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "8957918169943094131" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "b557313e65407769f40427871758f980c7eaf0e9" -> "14cff8c248aec8a589090dffa2dab79c80dae9dc"
        }
    }
  # module.testing-eks-cluster.module.pod_nanny.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "2349652268422680257" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "a9d96d1424a177997c7849c78e192fd5b9b481be" -> "d4a0a0ac2380b275d8fbf529b743384053e6df4f"
        }
    }
  # module.testing-eks-cluster.module.prometheus_node_exporter.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "8495425553566447814" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "ece2ae801ea663851bc22a0af667937c680a5547" -> "e9c91943f6fb87a6ce824a9779b1163938aba250"
        }
    }
  # module.testing-eks-cluster.module.storage_classes.null_resource.apply[0] must be replaced
-/+ resource "null_resource" "apply" {
      ~ id       = "1737947817272327439" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "manifest_sha1" = "a5349d15d4301f5ab68d91e15175bc9726602cdd" -> "b12737dd4e9aac3c336e59b06146cbdcb1aff152"
        }
    }
Plan: 9 to add, 0 to change, 7 to destroy.

I believe updating the manifest templates to assume the EKSAdmin role will stop us seeing these diffs?

Handle cluster bootstrap from 0 nodes.

Background

When a cluster is provisioned it has 0 nodes... so there is nowhere for the cluster autoscaler to run in order to scale out the auto scaling groups.

Work

  • Work out how to handle this and implement a solution.
    • Create a dedicated asg to run "cluster services"
    • Run cluster autoscaler on fargate... (see the sketch after this list)
    • Just require a minimum number of nodes in each ASG.
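
If the Fargate route were taken, it would look roughly like this; the resource names and the pod execution role are placeholders, not things this module currently creates.

resource "aws_eks_fargate_profile" "cluster_autoscaler" {
  cluster_name           = aws_eks_cluster.control_plane.name
  fargate_profile_name   = "cluster-autoscaler"
  pod_execution_role_arn = aws_iam_role.fargate_pod_execution.arn
  subnet_ids             = values(var.vpc_config.private_subnet_ids)

  # schedules the cluster autoscaler pods on Fargate so they can run before any nodes exist
  selector {
    namespace = "kube-system"
    labels = {
      app = "cluster-autoscaler"
    }
  }
}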

Correctly label nodes with `node-role.kubernetes.io/worker`

So we (and the cluster autoscaler) can tell if a node is spot, or on-demand.

node-role.kubernetes.io/worker=true
node-role.kubernetes.io/spot-worker=true

We should also tag the asg with:

k8s.io/cluster-autoscaler/node-template/label/node-role.kubernetes.io/{worker/spot-worker}
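
For the ASG tag, that would be something along these lines; the resource and launch template names are placeholders.

resource "aws_autoscaling_group" "spot_workers" {
  name                = "hal-9000-spot-workers"
  min_size            = 0
  max_size            = 10
  vpc_zone_identifier = values(var.vpc_config.private_subnet_ids)

  launch_template {
    id      = aws_launch_template.spot_workers.id
    version = "$Latest"
  }

  # lets the cluster autoscaler know nodes from this group will carry the label,
  # even before any node exists
  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/node-role.kubernetes.io/spot-worker"
    value               = "true"
    propagate_at_launch = true
  }
}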
