panfactum / stack

The Panfactum Stack

License: Other

HCL 30.53% Shell 2.24% Nix 0.71% JavaScript 0.29% Dockerfile 0.08% Starlark 0.06% MDX 59.57% TypeScript 6.13% CSS 0.29% Lua 0.07% HTML 0.04%

aws devenv devops grafana karpenter kubernetes nix platform-engineering prometheus terraform vault cilium linkerd2 terragrunt argocd authentik infrastructure-as-code

stack's Introduction

Panfactum Stack

The Panfactum Stack is an integrated set of OpenTofu (Terraform) modules and local tooling aimed at providing the best experience for building, deploying, and managing software on AWS and Kubernetes.

Check out our demos here.

Installation

If you'd like to add the Panfactum stack to your organization, see our deployment guide.

If you'd like to connect to an existing stack, see the new user guide.

Structure

This repository contains the following components of the Panfactum architecture, which are all versioned together to ensure internal consistency:

Licensing

Unless an alternative license is supplied in a specific directory, all files in this repository are governed by this license. If a directory contains an alternative license, all files contained in that directory (and its descendants) are governed by that alternative license exclusively.

Maintainers

fullykubed

stack's Issues

[Bug]: kube_bastion module fails to apply

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

The kube_bastion module fails to apply with the error below. All other modules are running the latest and greatest, and I believe I am on the latest flake as well. Let me know what other debugging information might be helpful!

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

terraform

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: Redis Sentinel value typo automaticClusterRecovery -> automateClusterRecovery

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

I was exploring the underlying chart for the Panfactum Redis Sentinel setup, and it seems that one of the values (sentinel.automateClusterRecovery) is misspelled in the Panfactum usage of the chart (sentinel.automaticClusterRecovery). It may be an issue with the chart documentation, but I thought I'd flag it.
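A misspelled chart value like this is usually ignored silently, so one way to catch it is to compare the override keys against the chart's default values. The following is a minimal sketch of that idea, not part of the stack: the `chart_defaults` dictionary is an illustrative stand-in for the chart's real values file, and both function names are hypothetical.

```python
def dotted_keys(values, prefix=""):
    """Yield a dotted path for every leaf in a nested dict of chart values."""
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            yield from dotted_keys(value, path)
        else:
            yield path


def unknown_keys(overrides, chart_defaults):
    """Return override paths that do not exist in the chart defaults."""
    known = set(dotted_keys(chart_defaults))
    return sorted(k for k in dotted_keys(overrides) if k not in known)


# Illustrative stand-in for the chart's defaults (assumption, not the real file).
chart_defaults = {"sentinel": {"enabled": False, "automateClusterRecovery": False}}
# Overrides containing the misspelled key from this report.
overrides = {"sentinel": {"enabled": True, "automaticClusterRecovery": True}}

print(unknown_keys(overrides, chart_defaults))
```

A check like this only works for charts whose defaults list every supported key; charts that accept free-form maps would need an allowlist.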

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

terraform

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: pf-update-aws --build fails to apply after wiping .aws/config

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

When running pf-update-aws --build the script seems to clear the .aws/config file and then attempt to get the roles from the aws_iam_identity_center_permissions module with a role that no longer exists in the config file. Kind of circular. See the error below.

Steps to Reproduce

See above

Version

main (development branch)

Relevant log output

No response

[feature]: Replace Vault with OpenBao

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Using HashiCorp Vault as part of a self-hosted infrastructure stack might not be compatible with Vault's updated BUSL license. If it turns out to be incompatible, replace Vault with the open-source Vault fork: https://github.com/openbao.

How would you use this new functionality?

To keep the Stack license compliant.

What primary components of the stack would this impact?

terraform, nix, website, reference

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Reduce downtime in fck-nat with rolling ASG upgrades

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Currently, the fck-nat configuration forces us to spin down NAT nodes before launching new nodes in the same AZ. Ideally, we would launch the new nodes while the old nodes are still running and then update the routing table rules in real time, reducing downtime from minutes to a few seconds.
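The ordering proposed above can be sketched with a toy in-memory model. This is an illustration only: the `Zone` class, `rolling_replace`, and the node names are hypothetical, and a real implementation would call the cloud provider's route-replacement API at step 2 rather than mutate a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Toy model of one availability zone's NAT setup (hypothetical)."""
    name: str
    route_target: str                      # node the private route table points at
    running: set = field(default_factory=set)
    events: list = field(default_factory=list)


def rolling_replace(zone: Zone, new_node: str) -> Zone:
    old_node = zone.route_target
    # 1. Launch the replacement while the old node keeps serving traffic.
    zone.running.add(new_node)
    zone.events.append(f"launch {new_node}")
    # 2. Repoint the default route; this is the only step where traffic can drop.
    zone.route_target = new_node
    zone.events.append(f"route {old_node} -> {new_node}")
    # 3. Only now retire the old node.
    zone.running.discard(old_node)
    zone.events.append(f"terminate {old_node}")
    return zone


zone = Zone(name="az-a", route_target="nat-old", running={"nat-old"})
rolling_replace(zone, "nat-new")
print(zone.events)
```

The point of the sketch is the ordering: the old node is terminated last, so the outage window shrinks to the route swap itself.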

How would you use this new functionality?

During fck-nat node upgrades

What primary components of the stack would this impact?

terraform

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Add aws_external_delegated_zone module

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

We should have a module for setting up zones for domains registered in non-AWS providers.

How would you use this new functionality?

This would make it easier for users who cannot transfer their domains to AWS accounts for management.

This would not be the default / recommended setup, but would be a valuable escape hatch module to include.

[feature]: Add pvc-autoresizer

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Add the pvc-autoresizer functionality to take care of automatically adjusting the underlying PVC storage.

How would you use this new functionality?

It would eliminate manual maintenance and monitoring to ensure that PVCs are large enough to hold the data they contain.
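As a rough illustration of what automatic resizing involves, the sketch below shows a threshold-based growth rule: grow the volume by a fixed percentage once free space falls below a threshold, and never exceed a cap. The function name, parameters, and default values are all assumptions made for this example and are not taken from pvc-autoresizer itself.

```python
def next_size_gib(capacity_gib: int, used_gib: float,
                  threshold_pct: int = 10, increase_pct: int = 10,
                  limit_gib: int = 100) -> int:
    """Return the size a volume should be resized to (hypothetical rule)."""
    free_pct = 100 * (capacity_gib - used_gib) / capacity_gib
    if free_pct >= threshold_pct:
        return capacity_gib                      # enough headroom, no change
    # Round the step up to a whole GiB using integer arithmetic.
    step = max(1, -(-capacity_gib * increase_pct // 100))
    return min(capacity_gib + step, limit_gib)   # never grow past the cap


print(next_size_gib(10, 9.5))   # low on space, so the volume grows
print(next_size_gib(10, 5.0))   # plenty of space, so the size is unchanged
```

The cap matters because volume expansion is generally one-way: a runaway writer would otherwise grow the volume (and the bill) without bound.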

What primary components of the stack would this impact?

terraform

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: /docs/guides/bootstrapping/preparing-aws crashes the application

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

When going to /docs/guides/bootstrapping/preparing-aws, the Next.js application crashes. It seems like an issue with an image asset, but it takes down the whole site and any other open tabs.

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

website

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: Devenv setup does not build when making small changes to devenv.nix

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

We are adding some custom utilities to our devenv setup, which requires reloading the shell often. However, when we make small modifications we get the error seen below. I had just replaced a different custom module with the get-buildkit-address module, so the only difference between the two files was the name. I suspect a caching issue, but it may be something else. I have already tried deleting my .direnv and .devenv directories, and neither resolved the issue.

{ pkgs, config, inputs, ... }: 
let
  # Import a custom package definition from ./packages/nix/<module>.
  customModule = module: import ./packages/nix/${module} { inherit pkgs; };

  custom_packages = [
    (customModule "docker-credential-aws")
    (customModule "local-build")
    (customModule "sf")
    (customModule "get-buildkit-address")
    (customModule "enter-shell")
  ];
in
{
  enterShell = ''source enter-shell'';
  env = {
    PF_REPO_NAME = "hudsonts";
    PF_REPO_URL = "github.com/Hudson-Technology-Systems/hudsonts";
    PF_REPO_PRIMARY_BRANCH = "main";
    PF_ENVIRONMENTS_DIR = "environments";
    PF_IAC_DIR = "packages/terraform";
  };

  packages = custom_packages;
}

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

nix

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Allow AWS NAT Gateways in aws_vpc module

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Allow users to choose AWS NAT Gateways instead of our EC2 NAT nodes in the aws_vpc module.

How would you use this new functionality?

For users who want to pay more money to AWS in exchange for eliminating some of the minor limitations of fck-nat

What primary components of the stack would this impact?

terraform

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: Communication between Authentik and Redis is not encrypted

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

It is currently not possible to deploy Authentik with an encrypted Redis connection. This is caused by two issues:

(1) Authentik does not allow for passing the proper parameters for an encrypted redis connection: goauthentik/authentik#9123

(2) The Linkerd service mesh cannot be used due to this issue: linkerd/linkerd2#12382

Authentik will be enhancing their Redis support in this PR, so we will revisit once this is available.

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

terraform

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Move pf_website to be a first-party infrastructure module

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Move the pf_website from the infrastructure package to the reference package to serve as an example of a first-party module.

How would you use this new functionality?

This will help users understand how to write first-party modules while we flesh out more robust documentation and tutorials.

[Bug]: macOS pf-tunnel does not accept inputs correctly

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

We use the pf-tunnel command in our organization to help facilitate a "local development" environment. Using pf-tunnel worked on my machine; however, when using it on macOS with an ARM chip, one of my team members was met with the following output:

This output was consistent whether running pf-tunnel in the context of the enclosing script or directly in the terminal. From inspecting the script, it looks like the usage block only prints when there is an error with the inputs, so I am wondering whether there is some difference in argument parsing between Linux and Mac machines.

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

nix

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: Issue with AWS CLI in recommended NixOS package registry version

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

The AWS CLI seems broken in the release of the NixOS package registry recommended in the docs. I've pinned back my NixOS version in the meantime, but it may be good to recommend a more stable version.

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

nix, website

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: Issue setting variables in .env when migrating from previous iteration of the stack

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

When migrating from the non-flake version of the stack and attempting to set environment variables, there appear to be conflicts from an unknown file, potentially as a result of a previous devenv setup in the same file location.

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

nix

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

Use Terragrunt Scaffolding for Module Instantiation

See https://terragrunt.gruntwork.io/docs/features/scaffold/#scaffold

[Bug]: Descheduler logs seem to indicate mis-configuration or missing plug-ins

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

I was investigating an unrelated issue with the Descheduler and noticed that some plugins were not being loaded successfully. See the attached logs: descheduler_plugin_issue.csv

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

terraform

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Remove EKS node groups

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Currently, we have two different types of nodes in the Panfactum stack:

Nodes managed by EKS node groups
Nodes managed by Karpenter

The EKS node groups are significantly less flexible than the nodes provisioned by Karpenter. Moreover, there is extra maintenance involved in maintaining two paradigms for provisioning nodes.

We should set up the base cluster so that it does not depend on EKS node groups.

How would you use this new functionality?

See above.

[feature]: Force pods that are scheduled on spot instances to be scheduled on different instance types

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Pods that can be scheduled on spot instances should be spread across instance types.

This cannot be done until #37 is completed as the EKS node group isn't intelligent enough to deal with this scheduling constraint.

How would you use this new functionality?

Currently, when all pods in a deployment are scheduled on the same spot instance type, they are vulnerable to mass preemption. Scheduling on different instance types will reduce the likelihood of this occurring.
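One possible way to express this constraint is a Kubernetes topology spread constraint keyed on the well-known `node.kubernetes.io/instance-type` node label. The sketch below builds such a constraint as a plain dictionary. It is an assumption about how the request could be met, not the stack's implementation; the helper name and the `app: api` label are hypothetical.

```python
def spread_across_instance_types(match_labels: dict, max_skew: int = 1) -> dict:
    """Build one entry for a pod spec's topologySpreadConstraints list."""
    return {
        "maxSkew": max_skew,
        # Well-known node label carrying the instance type of each node.
        "topologyKey": "node.kubernetes.io/instance-type",
        # "Force" in the request maps to a hard constraint.
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": dict(match_labels)},
    }


constraint = spread_across_instance_types({"app": "api"})
pod_spec_fragment = {"topologySpreadConstraints": [constraint]}
print(pod_spec_fragment)
```

A hard constraint like this relies on the node provisioner being able to launch a different instance type on demand, which is why the request depends on the node group work referenced above.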

[feature]: Replace Terraform with OpenTofu

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

With the recent changes to Terraform's license, it is unclear whether using more recent versions of Terraform within the Panfactum stack would violate the new license's "embedded use" policy. It seems likely. As a result, we should migrate to OpenTofu once 1.7 is released.

How would you use this new functionality?

Same as terraform

What primary components of the stack would this impact?

terraform, nix, website, reference

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Improve documentation about how to version the Panfactum stack

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

Expand the documentation on how to version the Panfactum stack infrastructure modules.

How would you use this new functionality?

This will better help new users consume the stack.

[Question]: Unexpected pod disruptions of custom redis module

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

We deploy a first-party module that involves a Redis deployment similar to Panfactum's Sentinel deployment and uses the same underlying chart. On our master instance we have a Pod Disruption Budget with maxUnavailable = 0. However, we consistently see this pod rescheduled, leading to an outage in our application's ability to communicate with Redis.

We added the PDB to our Redis deployment at 11:00AM EST Thursday and observed four disruptions to that deployment throughout Thursday and Friday:

11:24AM Thursday for 55 seconds
6:01PM Thursday for 5 seconds
6:39PM Thursday for 5 seconds
6:51PM Friday for 25 seconds

In three of the four cases, the failure was not long enough to trigger a failover in an HA Sentinel deployment. Each time, when I inspected what had happened in the cluster, it seemed the pod had been rescheduled to a new node managed by Karpenter. Currently, the pod seems to have made a "stable" home on one of the dedicated instances not managed by Karpenter, but I'd like to understand whether this is the expected behavior, since Karpenter does not seem to be respecting the PDB.

See also the Karpenter docs discussing Disruption Budgets. It has language about Pods that "fail to shut down" which makes me think Karpenter will attempt to evict Pods and back off/retry if the Pods themselves are not successfully evicted.

TLDR: Karpenter is taking down my Redis cluster :(
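For context on the budget described above, the sketch below reproduces the arithmetic Kubernetes applies to a maxUnavailable budget: the number of voluntary evictions allowed is the count of healthy pods above the desired minimum. The function name is hypothetical. Note that a PDB only constrains voluntary evictions made through the Eviction API; a node that disappears (for example, a reclaimed spot instance) or a direct pod deletion is not blocked by it. Whether that explains the disruptions in this report is not established here.

```python
def disruptions_allowed(expected_pods: int, healthy_pods: int,
                        max_unavailable: int) -> int:
    """Evictions a maxUnavailable PDB currently permits (illustrative helper)."""
    # The budget requires at least this many pods to stay healthy.
    desired_healthy = max(expected_pods - max_unavailable, 0)
    # Only healthy pods above that floor may be evicted.
    return max(healthy_pods - desired_healthy, 0)


# Single Redis master with maxUnavailable = 0, as in this report.
print(disruptions_allowed(expected_pods=1, healthy_pods=1, max_unavailable=0))
# Three replicas with maxUnavailable = 1 for comparison.
print(disruptions_allowed(expected_pods=3, healthy_pods=3, max_unavailable=1))
```

With a budget of zero, every eviction request should be refused, so a pod that still moves was most likely removed by a path that does not go through eviction.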

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

terraform

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Add example for consuming infrastructure from other repositories

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

An example of the stack being used to deploy infrastructure modules from other repositories

How would you use this new functionality?

Many users will not have a mono-repository setup and may want to develop infrastructure modules in other repos but still deploy on the stack repo.

What primary components of the stack would this impact?

website, reference

Code of Conduct

I agree to follow this project's Code of Conduct

[feature]: Replace kube_rbac with aws_eks_access_entries

Prior Search

I have already searched this project's issues to determine if a similar request has already been made.

What new functionality would you like to see?

aws_eks_access_entries provides a mechanism to deploy access directly in the aws_eks module, and we can eliminate the kube_rbac module entirely.

How would you use this new functionality?

See above

What primary components of the stack would this impact?

terraform, website, reference

Code of Conduct

I agree to follow this project's Code of Conduct

[Bug]: tf_bootstrap_resources fails with current instructions [docs]

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

Per Configuring Infrastructure-as-Code:

aws_secondary_region: Set this to the aws_region from above.

# region.yaml
aws_region: us-west-2
aws_secondary_region: us-west-2

Causes the aws provider to throw the following error:

│ Error: updating Amazon DynamoDB Table (corgeek-production-state-lock-table): updating replicas, while creating: creating replica (us-west-2): ValidationException: Cannot add, delete, or update the local region through ReplicaUpdates. Use CreateTable, DeleteTable, or UpdateTable as required.
│       status code: 400, request id: L1M1VGBDCEB99I5MK47GPK9EQFVV4KQNSO5AEMVJF66Q9ASUAAJG

This can be remediated by changing the secondary region:

# region.yaml
aws_region: us-west-2
aws_secondary_region: us-east-2

Suggestions

Given the importance of replicating the lock table in another region for stability and disaster recovery, it may be worth updating the documentation to recommend setting aws_secondary_region to a different region than aws_region, so that the lock tables and keys can be replicated.
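Until the documentation changes, a pre-flight check could catch this misconfiguration before the provider does. The following is a minimal sketch that assumes region.yaml has already been parsed into a dictionary; the `check_regions` helper is hypothetical and not part of the stack's tooling.

```python
def check_regions(config: dict) -> list:
    """Return a list of problems found in a parsed region.yaml (illustrative)."""
    errors = []
    primary = config.get("aws_region")
    secondary = config.get("aws_secondary_region")
    if not primary:
        errors.append("aws_region is not set")
    if not secondary:
        errors.append("aws_secondary_region is not set")
    # DynamoDB rejects a replica in the table's own region, as in the error above.
    if primary and secondary and primary == secondary:
        errors.append(
            f"aws_secondary_region must differ from aws_region ({primary})"
        )
    return errors


failing = {"aws_region": "us-west-2", "aws_secondary_region": "us-west-2"}
working = {"aws_region": "us-west-2", "aws_secondary_region": "us-east-2"}
print(check_regions(failing))
print(check_regions(working))
```

The two dictionaries mirror the failing and remediated region.yaml files shown in this report.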

Steps to Reproduce

Starting from scratch
Create an AWS account and subaccounts
Follow the Guides through "Configuring Infrastructure-as-Code"

Version

main (development branch)

Relevant log output

...
aws_s3_bucket_lifecycle_configuration.state: Creation complete after 30s [id=corgeek-production-state-bucket]
╷
│ Error: updating Amazon DynamoDB Table (corgeek-production-state-lock-table): updating replicas, while creating: creating replica (us-west-2): ValidationException: Cannot add, delete, or update the local region through ReplicaUpdates. Use CreateTable, DeleteTable, or UpdateTable as required.
│       status code: 400, request id: L1M1VGBDCEB99I5MK47GPK9EQFVV4KQNSO5AEMVJF66Q9ASUAAJG
│ 
│   with aws_dynamodb_table.lock,
│   on main.tf line 171, in resource "aws_dynamodb_table" "lock":
│  171: resource "aws_dynamodb_table" "lock" {
│ 
╵
ERRO[0046] tofu invocation failed in /home/matt/Projects/corgeek.net/stack/.terragrunt-cache/w5dThPls3tG3mqtPwds7ZROlBzU/K2rNS_bBptYOziF7qIevbc5oi_I/packages/infrastructure/tf_bootstrap_resources  prefix=[/home/matt/Projects/corgeek.net/stack/environments/production/global/tf_bootstrap_resources] 
ERRO[0046] 1 error occurred:
        * [/home/matt/Projects/corgeek.net/stack/.terragrunt-cache/w5dThPls3tG3mqtPwds7ZROlBzU/K2rNS_bBptYOziF7qIevbc5oi_I/packages/infrastructure/tf_bootstrap_resources] exit status 1

[Bug]: Cannot set recovery times for password reset links

Prior Search

I have already searched this project's issues to determine if a bug report has already been made.

What happened?

See goauthentik/authentik#9671

Password reset links will always be 30 minutes.

Version

main (development branch)

What primary components of the stack are you seeing the problem on?

terraform

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct