
terraform-modules's Introduction

Terraform modules

This repository contains a set of (opinionated) Terraform modules to provision HashiCorp's suite of tools on AWS, including:

  • Consul: service discovery, distributed key-value store, and service mesh
  • Nomad: workload scheduling
  • Vault: secrets management

These tools help you deploy a basic cloud infrastructure on which your developers can run their applications and services.

To get started, see the Core module. Some of the modules are optional and add additional features after you have provisioned the Core module.

Contributing

See CONTRIBUTING.md for more details.

Submodules

This repository has various submodules. When you are cloning it for the first time, make sure to do so with

git clone --recursive https://github.com/GovTechSG/terraform-modules.git

To update an already cloned repository, you can do

git submodule update --init --recursive

Modules

This module sets up a VPC, plus a Consul and Nomad cluster, on which you can run your applications.
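For reference, consuming the Core module from Terraform might look like the following sketch; the variables are placeholders, not the module's actual inputs.

    module "core" {
      source = "github.com/GovTechSG/terraform-modules//modules/core"

      # VPC, Consul and Nomad cluster variables go here; see the module's
      # own documentation for the real inputs
    }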

This module configures Vault to accept authentication via EC2 instance metadata. This is required for use with some of the Vault integration modules.
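As a rough illustration, enabling EC2 authentication with a recent Terraform Vault provider could look like the sketch below; the role name, bound AMI, and policy are assumptions, not what this module actually configures.

    resource "vault_auth_backend" "aws" {
      type = "aws"
    }

    # Hypothetical role that lets instances booted from a given AMI log in
    resource "vault_aws_auth_backend_role" "consul" {
      backend        = vault_auth_backend.aws.path
      role           = "consul"
      auth_type      = "ec2"
      bound_ami_ids  = ["ami-12345678"]
      token_policies = ["consul"]
    }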

This module serves as a post-bootstrap addon for the Core Module. It integrates Vault into Nomad so that jobs may acquire secrets from Vault.

This module serves as a post-bootstrap addon for the Core Module. It enables ACLs for Nomad, with Nomad ACL tokens retrievable from Vault.

This module uses Vault's SSH secrets engine to generate signed certificates for accessing your machines via SSH.
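For illustration, a CA-based signing role for the SSH secrets engine might be set up roughly as below; the mount path, role name, and allowed user are assumptions.

    resource "vault_mount" "ssh" {
      path = "ssh"
      type = "ssh"
    }

    resource "vault_ssh_secret_backend_role" "user" {
      name                    = "user"
      backend                 = vault_mount.ssh.path
      key_type                = "ca"
      allow_user_certificates = true
      default_user            = "ubuntu" # assumed login user
      allowed_users           = "ubuntu"
    }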

This module serves as a post-bootstrap addon for the Core Module. It provisions load balancers on top of a Traefik reverse proxy to expose applications running on your Nomad cluster to the internet.

This module serves as a post-bootstrap addon for the Core Module. It allows you to configure Nomad clients to authenticate with private Docker registries.

This module serves as a bootstrap addon for the Core module. It provisions the PKI secrets engine in Vault, which allows you to maintain an internal CA and allows Vault users to request certificates.

This module is required for some of the other Vault integrations.
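As a sketch, provisioning the PKI secrets engine with the Terraform Vault provider could look like this; the mount path, CA common name, and TTLs are assumptions.

    resource "vault_mount" "pki" {
      path                  = "pki"
      type                  = "pki"
      max_lease_ttl_seconds = 315360000 # 10 years
    }

    resource "vault_pki_secret_backend_root_cert" "ca" {
      backend     = vault_mount.pki.path
      type        = "internal"
      common_name = "internal.example.com" # hypothetical internal CA name
      ttl         = "87600h"
    }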

This module serves as a post-bootstrap addon for the Core Module. It adds the managed AWS Elasticsearch service (with Kibana). The module also integrates with the Traefik set-up, so that a redirect service can send users to the Kibana visualisation UI via a friendlier URL.

This module runs Curator as a Cron job in Nomad to clean up old indices in your Elasticsearch cluster.

This module sets up a Lambda function with an API Gateway trigger, secured with API key authentication.

This module sets up the Telegraf service for collecting and reporting metrics on instances running the consul, nomad_client, nomad_server, and vault services.

This module enables td-agent, the stable distribution package of Fluentd, for log forwarding on instances running the consul, nomad_client, nomad_server, and vault services.

This module sets up an additional cluster of Nomad clients after the initial bootstrap of the core module.

This module is an addon that adds policies allowing application services to access key/value secrets stored in your already-provisioned Vault.

This module runs Fluentd on Nomad to forward logs to Elasticsearch and (optionally) S3.

Provisions additional resources to enable Vault Auto Unseal when used with the Core module.
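For context, auto-unseal is configured on the Vault servers with a seal stanza along these lines; the region and KMS key alias are assumptions.

    seal "awskms" {
      region     = "ap-southeast-1"
      kms_key_id = "alias/vault-unseal" # hypothetical KMS key alias
    }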

Roles

Contains Ansible roles for installation of various services. For more details, check out the README in the respective role directories.

terraform-modules's People

Contributors

abby-ng, binhoul, briantjt, chrissng, guangie88, jrlonan-gt, knightniwrem, lawliet89, qbiqing, ryanoolala, sturdek, sunakan, tingweiftw, tyng94, xtrntr


terraform-modules's Issues

Upgrade to Ansible 2.9

The current Packer commands use Ansible 2.7 syntax. We should upgrade to Ansible 2.9, which is officially compatible with Python 3.8, to avoid issues like this:

ansible/ansible#63973

"Bad gateway" when performing rolling updates of Traefik

Currently, when Traefik is upgraded in a rolling manner, there is a window, while the ELB updates its target groups, during which end users get a "Bad gateway" error.

This is because the Traefik job can potentially be running on fewer than the maximum number of Nomad clients.

There are two ways to solve this:

  • Run Traefik as a system job. But we must be careful to write a constraint so that it is scheduled only on the instances in the ASG that serves as the target group for Traefik's ELB (see the sketch after this list).
  • Run some kind of service that uses consul watch to react to changes in the Traefik service and responds accordingly.
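A minimal sketch of the first option, assuming a hypothetical node class that marks the instances in Traefik's target-group ASG:

    job "traefik" {
      datacenters = ["dc1"]
      type        = "system"

      # Only schedule on nodes in the ASG behind Traefik's ELB
      constraint {
        attribute = "${node.class}"
        value     = "traefik" # hypothetical node class
      }

      group "traefik" {
        task "traefik" {
          driver = "docker"

          config {
            image        = "traefik:1.7-alpine"
            network_mode = "host"
          }
        }
      }
    }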

Create a Nomad job to configure Elasticsearch

We still have to configure Elasticsearch via the HTTP API; see https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-createupdatedomains.html#es-createdomain-configure-slow-logs.

We need to configure:

  • Slow index logs
  • Slow search logs
  • Rolling indices
    # https://www.elastic.co/guide/en/elasticsearch/reference/5.5/index-modules-slowlog.html
    # This is PER INDEX
    # We should probably do an index template...?
    curl -XPUT \
      'https://vpc-tf-l-xxx.ap-southeast-1.es.amazonaws.com/syslog-*/_settings' \
      -H 'Content-Type: application/json' \
      --data '{
        "index.indexing.slowlog.threshold.index.warn": "10s",
        "index.indexing.slowlog.threshold.index.info": "5s",
        "index.indexing.slowlog.threshold.index.debug": "2s",
        "index.indexing.slowlog.threshold.index.trace": "500ms",
        "index.indexing.slowlog.level": "info",
        "index.indexing.slowlog.source": "1000"
      }'

Refactor logic for configuring Telegraf

Currently, the run-telegraf script configures itself with a statsd server and output to Elasticsearch. At the same time, it also generates the telemetry stanza for Consul, Nomad and Vault.

This presents us with a few problems:

  • It is not easily extensible to "future" server types.
  • It does not make it easy to add further telemetry options. For example, for Nomad clients you would want to enable allocation and node metrics.

Let's refactor the logic for generating the Nomad/Consul/Vault configuration by moving it into additional or existing "configure" scripts for each tool. Nomad already has such a script; we will have to create one for Consul and Vault.
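For example, the "configure" script for Nomad clients might emit a telemetry stanza along these lines (a sketch; the statsd address is an assumption):

    telemetry {
      statsd_address             = "127.0.0.1:8125"
      publish_allocation_metrics = true
      publish_node_metrics       = true
    }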

Add host information to global tags for Telegraf

Including:

  • IP address in dotted-octet format (currently the hostname is something obscure like "ip-10-123-123-123", which is not nice to work with)
  • EC2 Instance ID

Retrieve these from the instance metadata.

Prometheus Module

The Packer image can be provisioned with the aid of https://github.com/cloudalchemy/ansible-prometheus

  • Each node should run its own "Prometheus server" via Telegraf, and then advertise it as such on Consul. We can then use Prometheus' built-in Consul service discovery to do the scraping (see the sketch after this list).
  • Integrate this into the Telegraf module by updating it to generate a Telegraf configuration that runs the Prometheus server and advertises its presence via Consul for Prometheus to discover.
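A rough sketch of the Consul service registration such a setup might use. Telegraf's prometheus_client output listens on port 9273 by default; the service name and tags are assumptions.

    service {
      name = "telegraf-prometheus" # hypothetical service name
      port = 9273
      tags = ["prometheus"]

      check {
        http     = "http://localhost:9273/metrics"
        interval = "10s"
      }
    }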

Integrations:

  • AWS Auth
  • Curator
  • td-agent
  • Traefik
  • Vault SSH

Detect `user_data` failure

We run pretty elaborate scripts in the user_data portions of the EC2 instances.

We need some way to detect if these scripts have failed.

Ideas:

  • Define the existence of a user_data completion marker file as a service health check in Consul (see the sketch after the TODO list below)
  • Forward user_data logs via td-agent

TODOs:

  • Consul Server
  • Consul Agents (#131)
  • Forward user_data logs via td-agent
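A sketch of what the marker-file check could look like as a Consul check definition; the marker path and check ID are assumptions, and script checks must be enabled on the agent.

    check {
      id       = "user-data-complete"
      name     = "user_data completion marker"
      args     = ["/bin/test", "-f", "/var/run/user-data-complete"] # hypothetical marker path
      interval = "60s"
    }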

Remove self token rendering for Consul Template

#95 tried to reduce how frequently the template reads from Vault's auth/token/lookup-self endpoint, but I failed to realise that this also delayed the initial rendering of the template. #101 reverts that change.

We should change this behaviour. The Vault token should never change anyway, so instead of having a template write it out, we should just write it to ~/.vault-token as part of the run-consul-template.sh script.

Monitor Telegraf Health

Register a health check in Consul by opening a connection to its socket_listener.
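Assuming the socket_listener is configured with a TCP listener, this could be a plain TCP check; the port is an assumption.

    check {
      id       = "telegraf-socket-listener"
      name     = "Telegraf socket_listener"
      tcp      = "localhost:8125" # assumed socket_listener address
      interval = "30s"
    }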

This won't check that the agent is actually collecting and sending stats. We probably need some other kind of "absence" monitoring.

Fix Ansible APT Deprecation warnings

Seen since Ansible 2.7:

    ubuntu-1804-xxx-ami: [DEPRECATION WARNING]: Invoking "apt" only once while using a loop via
    ubuntu-1804-xxx-ami: squash_actions is deprecated. Instead of using a loop to supply multiple items
    ubuntu-1804-xxx-ami: and specifying `name: {{ item }}`, please use `name: ['apt-transport-https',
    ubuntu-1804-xxx-ami: 'ca-certificates', 'curl', 'software-properties-common']` and remove the loop.
    ubuntu-1804-xxx-ami: This feature will be removed in version 2.11. Deprecation warnings can be
    ubuntu-1804-xxx-ami: disabled by setting deprecation_warnings=False in ansible.cfg.

Rethink Vault Integration Implementation

There are several current (and future) integrations in the user_data scripts that require reading secrets from Vault. We might want to consider refactoring these to better support Vault operations. In particular, many applications are unaware of Vault's existence and may fail to renew the leases on their secrets.

This is usually not a problem for secrets passed to Nomad (except for the future Consul ACL token) because Nomad will update the secrets as needed.

Steps the user_data script should perform:

  1. Check with Consul whether Vault is up, and check the KV store to see whether AWS authentication is enabled for this type of server.
  2. Get a Vault token (via vault login with AWS auth) and save it to the usual spot at ~/.vault_token. This is done as the root user, so the path is /root/.vault_token.
  3. Change the existing Nomad Vault integration to derive a token from this token to pass to Nomad, instead of using the vault login token directly.
  4. Get additional secrets from Vault if needed (e.g. a future Consul ACL token).
  5. Set up a cron job to renew the leases for the login token and any other secrets. Currently, only the token passed to Nomad for Vault integration is automatically renewed (by Nomad).

We might want to consider using consul-template to render these tokens and manage their renewal.
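A minimal consul-template configuration sketch for this; the Vault address and destination path are assumptions.

    vault {
      address     = "https://vault.service.consul:8200"
      renew_token = true
    }

    template {
      contents    = "{{ with secret \"auth/token/lookup-self\" }}{{ .Data.id }}{{ end }}"
      destination = "/root/.vault-token"
    }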

Open Questions:

  • What if we enable ACLs in Consul and have Vault provide a Consul secrets backend to issue Consul ACL tokens? Won't this cause a chicken-and-egg problem when we try to discover Vault? Possibilities: anonymous tokens, or agent-specific default tokens.

Core Module cannot be bootstrapped from scratch anymore

This is a regression.

    Error: Error running plan: 6 error(s) occurred:

    * module.core.module.vault_consul_gossip.aws_security_group_rule.allow_serf_lan_udp_inbound_from_security_group_ids: aws_security_group_rule.allow_serf_lan_udp_inbound_from_security_group_ids: value of 'count' cannot be computed
    * module.core.module.nomad_clients.module.nomad_client_consul_gossip.aws_security_group_rule.allow_serf_lan_tcp_inbound_from_security_group_ids: aws_security_group_rule.allow_serf_lan_tcp_inbound_from_security_group_ids: value of 'count' cannot be computed
    * module.core.module.vault_consul_gossip.aws_security_group_rule.allow_serf_lan_tcp_inbound_from_security_group_ids: aws_security_group_rule.allow_serf_lan_tcp_inbound_from_security_group_ids: value of 'count' cannot be computed
    * module.core.module.nomad_server_consul_gossip.aws_security_group_rule.allow_serf_lan_tcp_inbound_from_security_group_ids: aws_security_group_rule.allow_serf_lan_tcp_inbound_from_security_group_ids: value of 'count' cannot be computed
    * module.core.module.nomad_server_consul_gossip.aws_security_group_rule.allow_serf_lan_udp_inbound_from_security_group_ids: aws_security_group_rule.allow_serf_lan_udp_inbound_from_security_group_ids: value of 'count' cannot be computed
    * module.core.module.nomad_clients.module.nomad_client_consul_gossip.aws_security_group_rule.allow_serf_lan_udp_inbound_from_security_group_ids: aws_security_group_rule.allow_serf_lan_udp_inbound_from_security_group_ids: value of 'count' cannot be computed

Maybe Terraform 0.12 will fix this, but for now this doesn't work.

Alternatively, hashicorp/terraform#4149 would allow this.

Random failings of apt installing packages during packer build

Packer builds for Nomad servers fail many times (likely the same for the other three services) because apt is unable to install some package that should obviously be available, such as gcc, even though Ansible has stated that apt has already updated the cache and upgraded packages to the latest versions.

The log looks something like this:

    ubuntu-1604-nomad-server-ami:
    ubuntu-1604-nomad-server-ami: PLAY [Provision AMI] ***********************************************************
    ubuntu-1604-nomad-server-ami:
    ubuntu-1604-nomad-server-ami: TASK [Gathering Facts] *********************************************************
    ubuntu-1604-nomad-server-ami: ok: [default]
    ubuntu-1604-nomad-server-ami:
    ubuntu-1604-nomad-server-ami: TASK [Upgrade all packages to the latest version] ******************************
    ubuntu-1604-nomad-server-ami:  [WARNING]: Could not find aptitude. Using apt-get instead.
    ubuntu-1604-nomad-server-ami: changed: [default]

...

    ubuntu-1604-nomad-server-ami: TASK [xxx/terraform-modules/modules/core/packer/nomad_servers/../../../../roles/td-agent : Install dependencies for td-agent] ***
    ubuntu-1604-nomad-server-ami: failed: [default] (item=['gcc', 'make']) => {"changed": false, "item": ["gcc", "make"], "msg": "No package matching 'gcc' is available"}
    ubuntu-1604-nomad-server-ami:       to retry, use: --limit @xxx/terraform-modules/modules/core/packer/nomad_servers/site.retry
    ubuntu-1604-nomad-server-ami:
    ubuntu-1604-nomad-server-ami: PLAY RECAP *********************************************************************
    ubuntu-1604-nomad-server-ami: default                    : ok=5    changed=3    unreachable=0    failed=1

Monitor td-agent Health

Find some mechanism to register a health check with Consul.

This will not check that logs are being delivered. We need some form of "absence" monitoring.
