mspnp / aks-baseline-multi-region

This is the Azure Kubernetes Service (AKS) baseline for multi-region reference implementation as produced by the Microsoft Azure Architecture Center.

Home Page: https://aka.ms/architecture/aks-baseline-multi-region

License: MIT License

Shell 16.94% Bicep 83.06%
kuberenetes aks aks-kubernetes-cluster gateway-services azure-application-gateway azure-front-door load-balancer multi-cluster multi-region

aks-baseline-multi-region's Introduction

Azure Kubernetes Service (AKS) for multiregion deployment

This reference implementation revisits some design decisions from the baseline to explain them in more detail, and incorporates new recommended infrastructure options for a multicluster (and multiregion) architecture. This implementation and document are meant to guide the multiple distinct teams introduced in the AKS baseline through the process of expanding from a single cluster to a multicluster solution, with one fundamental driver in mind: reliability, via the Geode cloud design pattern.

Note: This implementation does not use AKS Fleet Manager or any other fleet-management technology; instead it represents a manual approach to operating multiple AKS clusters together. Operating fleets that contain a large number of clusters is usually best done with advanced, dedicated tooling. This implementation supports a small scale and introduces some of the core concepts that are necessary regardless of scale or tooling.

Throughout the reference implementation, you will see references to Contoso Bicycle, a fictional, small, fast-growing startup that provides online web services to its clientele on the east coast of the United States. This narrative grounds some implementation details, naming conventions, and so on. Adapt it as you see fit.

🎓 Foundational Understanding
If you haven't familiarized yourself with the general-purpose AKS baseline cluster architecture, you should start there before continuing here. The architecture rationalized and constructed in that implementation is the direct foundation of this body of work. This reference implementation avoids rearticulating points that are already addressed in the AKS baseline cluster.

The Contoso Bicycle app team that owns the a0042 workload is planning to deploy an AKS cluster in the East US 2 region, where most of their customer base is found. They will operate this single AKS cluster following Microsoft's recommended baseline architecture.

AKS baseline clusters are designed to span multiple availability zones within a single region. But the team now realizes that if East US 2 went fully down, zone coverage would not be sufficient. Even though the SLAs are acceptable for their business continuity plan, they are starting to think about their options, and about how their stateless application (Application ID: a0042) could stay available through a complete regional outage. They have started conversations with their business unit (BU0001) about adding one more cluster. In other words, they are proposing to move to a multicluster infrastructure in which multiple instances of the same application can run.

This architectural decision has multiple implications for the Contoso Bicycle organization. It is not just a matter of following the baseline twice and adding another region to get a twin infrastructure. They also need to determine how to efficiently share Azure resources and which new ones must be added; how to deploy and operate more than one cluster; which specific regions to deploy to; and many more considerations in the pursuit of higher availability.

Azure Architecture Center guidance

This project has a companion set of articles that describe the challenges, design patterns, and best practices for an AKS multicluster solution deployed across multiple regions for high availability. You can find these articles on the Azure Architecture Center at Azure Kubernetes Service (AKS) baseline for multiregion clusters. If you haven't reviewed them, we suggest you do, as they give added context to the considerations applied in this implementation. Ultimately, this is the direct implementation of that specific architectural guidance.

Architecture

This architecture is focused more on infrastructure than on the workload. It concentrates on two AKS clusters, covering concerns such as multiregion deployments, the desired state and bootstrapping of the clusters, geo-replication, network topologies, and more.

The implementation presented here, like the baseline, is the minimum recommended starting point for a multicluster AKS solution. It integrates with Azure services that deliver geo-replication, a centralized observability approach, a network topology that can grow with multiregional expansion, and the added benefit of global traffic balancing.

Finally, this implementation uses the ASP.NET Docker samples as an example workload. This workload is purposefully uninteresting; it is here exclusively to help you experience the multicluster infrastructure.
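If you want to poke at that sample image by hand, a minimal sketch looks like the following. The deployment name and port are assumptions (the reference implementation bootstraps the workload through its own manifests):

# Minimal sketch, not the repo's actual manifests: run the public ASP.NET
# sample image in a cluster and reach it locally.
kubectl create deployment aspnetapp --image=mcr.microsoft.com/dotnet/samples:aspnetapp
kubectl port-forward deployment/aspnetapp 8080:8080  # port 8080 assumed; older sample images listen on 80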

Core architecture components

Azure platform

  • Azure Kubernetes Service (AKS) v1.29
  • Azure Virtual Networks (hub-spoke)
  • Azure Front Door (classic)
  • Azure Application Gateway (WAF)
  • Azure Container Registry
  • Azure Monitor Log Analytics
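To make the stamp-per-region (Geode) pattern concrete, the per-region deployment can be thought of as a loop like the one below. This is only a sketch; the resource group names, regions, and template file name are illustrative, not this repo's actual files:

# Hypothetical per-region deployment loop: each region gets its own
# independent stamp of the cluster and its supporting resources.
for REGION in eastus2 centralus; do
  az group create -n "rg-bu0001a0042-${REGION}" -l "${REGION}"
  az deployment group create -g "rg-bu0001a0042-${REGION}" \
    -f cluster-stamp.bicep \
    -p location="${REGION}"
done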

In-cluster OSS components

Federation diagram depicting the proposed cluster fleet topology, with each cluster running its own instance of the same application.

Deploy the reference implementation

🧹 Clean up resources

Most of the Azure resources deployed in the prior steps will incur ongoing charges unless removed.

Cost Considerations

The main costs of this reference implementation are (in order):

  1. Azure Firewall dedicated to control outbound traffic - ~35%
  2. Node pool virtual machines used inside the cluster - ~30%
  3. Application Gateway which controls the ingress traffic to the workload - ~15%
  4. Log Analytics - ~10%

Azure Firewall can be a shared resource; your company may already have one in existing regional hubs that you can reuse.

The virtual machines in the AKS clusters are required, though a cluster can be shared by several applications. Analyze the size and the number of nodes: the reference implementation ships with the minimum recommended node count for production environments, but in a multicluster environment with at least two clusters, you will choose a scale appropriate to your workload based on your traffic analysis, failover strategy, and autoscaling configuration.
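For example, one way to let a surviving cluster absorb failover traffic without permanently over-provisioning is the cluster autoscaler. A sketch, with the cluster and node pool names as placeholders:

# Hypothetical example: enable the cluster autoscaler on a user node pool so
# the cluster can scale out during a regional failover instead of running
# at failover capacity all the time.
az aks nodepool update \
  -g rg-bu0001a0042-03 \
  --cluster-name <cluster-name> \
  -n npuser01 \
  --enable-cluster-autoscaler --min-count 3 --max-count 6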

Keep an eye on Log Analytics data growth over time, and manage which information is collected. The main cost driver is data ingestion into the Log Analytics workspace, which you can fine-tune.
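Two common levers, shown here as a sketch with placeholder workspace and group names, are the retention period and a daily ingestion cap:

# Sketch of workspace-level cost controls: shorten retention to 30 days and
# cap daily ingestion at 5 GB (set --quota -1 for unlimited).
az monitor log-analytics workspace update \
  -g rg-shared \
  -n la-workspace \
  --retention-time 30 \
  --quota 5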

WAF protection is enabled on both Application Gateway and Azure Front Door. The WAF rules on Azure Front Door carry an extra cost, and you can disable them; the consequence is that invalid traffic will reach Application Gateway and consume resources there, instead of being rejected as early as possible.
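If you'd rather dial the Front Door WAF back than remove it, one option is detection-only mode, so traffic is logged but not blocked. A sketch, assuming the front-door Azure CLI extension is installed and with a placeholder policy name:

# Requires the 'front-door' Azure CLI extension; names are placeholders.
az network front-door waf-policy update \
  -g rg-shared \
  --name fdwafpolicy \
  --mode Detection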

Next Steps

This reference implementation intentionally does not cover all scenarios. If you are looking for topics that are not addressed here, visit the AKS baseline for the complete list of covered scenarios around AKS.

Related documentation

Contributions

Please see our contributor guide.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

With โค๏ธ from Microsoft Patterns & Practices, Azure Architecture Center.

aks-baseline-multi-region's People

Contributors

agger1995, balteravishay, ckittel, dcasati, dstrebel, ferantivero, goprasad, hallihan, idanshahar, kyleburnsdev, lastcoolnameleft, magrande, neilpeterson, pelithne, raykao, richlander, skabou, v-fearam


aks-baseline-multi-region's Issues

Consider adding a few helpful checks in this content

Because of the complexity of this content and its many disparate dependencies, it may be helpful to add a few 'checks' along the way. Some examples that I would have found helpful:

After the GitHub secrets section, add a note to visually inspect the secrets and ensure that the necessary five are present (add a screenshot for comparison).

After performing a sed replacement, add a note on what the results should look like, or a note to verify the results. In my case, a terminal session had been terminated, removing variable values; when performing the replacement with sed, an empty value was written into the Flux configuration.

In general, understood that this is a complex solution; it has taken me a few attempts to get things right, and anything that can be done to remove some of the troubleshooting and guesswork for the consumer would be helpful.
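A lightweight guard along these lines (one variable shown as an example) could catch the empty-value case before the sed runs:

# Example guard: abort before templating if the value was lost with the
# terminal session, rather than writing an empty string into the manifests.
if [[ -z "${APPGW_FQDN_BU0001A0042_03:-}" ]]; then
  echo "APPGW_FQDN_BU0001A0042_03 is empty; re-export it before running sed." >&2
  exit 1
fi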

Issue when running letsencrypt-pip-cert-generation.sh

I encounter the following error when trying to run through the cert generation.

It might be worth noting that I'm running on a Mac, as the commands might be specific to Linux.

➜  aks-baseline-multi-region git:(main) ✗ ./certs/letsencrypt-pip-cert-generation.sh $APPGW_SUBDOMAIN_BU0001A0042_03 $APPGW_FQDN_BU0001A0042_03 $APPGW_IP_RESOURCE_ID_03 eastus2

Location    Name
----------  ---------------------------
eastus2     rg-cert-let-encrypt-eastus2
Name                ResourceGroup                State      Timestamp                         Mode
------------------  ---------------------------  ---------  --------------------------------  -----------
ca-cert-generation  rg-cert-let-encrypt-eastus2  Succeeded  2021-04-28T17:57:21.069648+00:00  Incremental
Storage Account: regionkqijwomkik3vi

The request may be blocked by network rules of storage account. Please check network rule set using 'az storage account show -n accountname --query networkRuleSet'.
If you want to change the default action to apply when no rule matches, please use 'az storage account update'.

UnrecognizedArgumentError: unrecognized arguments: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=regionkqijwomkik3vi;AccountKey=<REMOVED>
Try this: 'az storage blob upload --account-name <mystorageaccount> --account-key <0000-0000> --container-name <mycontainer> --file </path/to/file> --name <myblob>'
Still stuck? Run 'az --help' to view all commands or go to 'https://docs.microsoft.com/cli/azure/reference-index' to learn more

Add procedure to populate variables on step 4 of the validation content

On step 4 of the 10-validation.md content, we instruct the reader to open a new terminal and run the following commands.

# [This whole execution takes about 40 minutes.]
# Stopping/starting the 03 gateway simulates the first incident; 04, the second.
az network application-gateway stop -g rg-bu0001a0042-03 -n $APPGW_FQDN_BU0001A0042_03 && \
az network application-gateway start -g rg-bu0001a0042-03 -n $APPGW_FQDN_BU0001A0042_03 && \
az network application-gateway stop -g rg-bu0001a0042-04 -n $APPGW_FQDN_BU0001A0042_04 && \
az network application-gateway start -g rg-bu0001a0042-04 -n $APPGW_FQDN_BU0001A0042_04

It seems we need to add a step to populate the $APPGW_FQDN_BU0001A0042_03 and $APPGW_FQDN_BU0001A0042_04 variables.
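Until that step exists, something like the following sketch can recover the values as used above (the variables serve as the gateway names); it assumes a single Application Gateway per resource group:

# Sketch: repopulate the gateway-name variables from each resource group.
export APPGW_FQDN_BU0001A0042_03=$(az network application-gateway list -g rg-bu0001a0042-03 --query "[0].name" -o tsv)
export APPGW_FQDN_BU0001A0042_04=$(az network application-gateway list -g rg-bu0001a0042-04 --query "[0].name" -o tsv)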

Scheduled query is failing to deploy with the shared resources template

I am seeing this across two subscriptions when deploying the shared services template.

{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",
  "details": [
    {
      "code": "BadRequest",
      "message": "{\r\n  \"error\": {\r\n    \"message\": \"The request had some invalid properties\",\r\n    \"code\": \"BadArgumentError\",\r\n    \"correlationId\": \"ac8d0b97-011d-4561-9fd2-8f91834a61b2\",\r\n    \"innererror\": {\r\n      \"code\": \"SemanticError\",\r\n      \"message\": \"A semantic error occurred.\",\r\n      \"innererror\": {\r\n        \"code\": \"SEM0100\",\r\n        \"message\": \"'distinct' operator: Failed to resolve table or column expression named 'KubePodInventory'\"\r\n      }\r\n    }\r\n  }\r\n}"
    }
  ]
}

Unable to pull from ACR after deploying a second cluster

I am having some issues when pulling from ACR in my cluster.

Steps to reproduce:

  • Deploy the shared services and az acr import my custom image to ACR
  • Deploy cluster A and manually apply my yaml file (here the ACR pull is successful)
  • Deploy cluster B and manually apply my yaml file (here the ACR pull is also successful)
  • Go back to cluster A and delete my app's pod, which triggers the pod to recreate and attempt to pull the image again; this fails. It seems DNS is no longer resolving my registry acraksmemilf46kr3qe.azurecr.io:
Warning  Failed     37m (x4 over 38m)      kubelet, aks-npuser01-42199627-vmss000001  Failed to pull image "acraksmemilf46kr3qe.azurecr.io/retaildevcrews/ngsa-lr:beta": rpc error: code = Unknown desc = failed to pull and unpack image "acraksmemilf46kr3qe.azurecr.io/retaildevcrews/ngsa-lr:beta": failed to resolve reference "acraksmemilf46kr3qe.azurecr.io/retaildevcrews/ngsa-lr:beta": failed to do request: Head "https://acraksmemilf46kr3qe.azurecr.io/v2/retaildevcrews/ngsa-lr/manifests/beta": EOF

Highly appreciate the help.
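One thing worth checking in a setup like this (a sketch; the resource group name is a placeholder) is whether both spoke virtual networks are still linked to the registry's private DNS zone, since the second cluster's deployment may have replaced the first link:

# Diagnostic sketch: list the VNet links on the ACR private DNS zone; both
# clusters' spoke VNets should appear for in-cluster pulls to resolve.
az network private-dns link vnet list \
  -g rg-enterprise-networking-spokes \
  -z privatelink.azurecr.io \
  -o table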

Deployment failure: invalid Application Gateway certificate retrieved from Key Vault

Assuming I missed something but could use help troubleshooting.

Deployment failed. Correlation ID: fe07567e-08ba-4a5b-856e-0ac6fd9391cd. {
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "ApplicationGatewayKeyVaultSecretException",
        "message": "Problem occured while accessing and validating KeyVault Secrets associated with Application Gateway '/subscriptions/3762d87c-ddb8-425f-b2fc-29e5e859edaf/resourceGroups/rg-bu0001a0042-03-nepeters-02/providers/Microsoft.Network/applicationGateways/apw-aks-m2zje6oaepxpc'. See details below:",
        "details": [
          {
            "code": "ApplicationGatewaySslCertificateInvalidData",
            "message": "Data or Password for certificate /subscriptions/3762d87c-ddb8-425f-b2fc-29e5e859edaf/resourceGroups/rg-bu0001a0042-03-nepeters-02/providers/Microsoft.Network/applicationGateways/apw-aks-m2zje6oaepxpc/sslCertificates/apw-aks-m2zje6oaepxpc-ssl-certificate is invalid."
          }
        ]
      }
    ]
  }
}

I've verified that the appgw-ingress-internal-aks-ingress-contoso-com-tls secret has been created and populated with the base64-encoded value created in this step (screenshot omitted).

If I pull the value out of Key Vault and decode it, it matches the value found in the traefik-ingress-internal-aks-ingress-contoso-com-tls.crt file.

Any thought on other things to check?
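One check that might narrow this down (a sketch with placeholder names): Application Gateway listener certificates generally need the private key included (a PKCS#12/PFX bundle), not just the public .crt, so confirming what the decoded bytes actually parse as can be revealing:

# Sketch: pull the secret and see whether it parses as a bare certificate
# (public part only) versus a PKCS#12/PFX bundle that includes the key.
az keyvault secret show --vault-name <kv-name> -n <secret-name> --query value -o tsv \
  | base64 -d > appgw-cert.bin
openssl x509 -in appgw-cert.bin -noout -subject -dates 2>/dev/null \
  || openssl pkcs12 -in appgw-cert.bin -nokeys -info -passin pass: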

Several commands not working on macos

I think it would be beneficial to work through this content end to end on macOS. I am finding several commands that do not work as specified in these documents.

One example:

# Create an Azure Service Principal
az ad sp create-for-rbac --name "github-workflow-aks-cluster" --sdk-auth --skip-assignment > sp.json
APP_ID=$(grep -oP '(?<="clientId": ").*?[^\\](?=",)' sp.json)

results in:

usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
        [-e pattern] [-f file] [--binary-files=value] [--color=when]
        [--context[=num]] [--directories=action] [--label] [--line-buffered]
        [--null] [pattern] [file ...]

For this particular example, we could consider moving from grep to the az CLI --query, which should work across all platforms.
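For instance, a portable variant along these lines (same service principal name as above; python3 ships with macOS developer tooling) avoids the GNU-grep-only -P flag:

# Portable sketch: keep writing sp.json as before, then parse it with
# python3 instead of GNU grep's Perl regex mode, which BSD grep lacks.
az ad sp create-for-rbac --name "github-workflow-aks-cluster" --sdk-auth --skip-assignment > sp.json
APP_ID=$(python3 -c "import json; print(json.load(open('sp.json'))['clientId'])")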

gh set secret failing with GraphQL issue

The commands to set GH secrets via:

gh secret set NAME  -b "VALUE" -repo=":owner/:repo"

... are currently failing with the following message ...

failed to look up IDs for repositories [epo=:owner/:repo]: failed to look up repositories: GraphQL: Could not resolve to a Repository with the name 'epo=:owner/:repo'. (repo_000)

This has been reproduced multiple times.

Removing the -repo=":owner/:repo" bit from the command allows the command to succeed.
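The error text ([epo=:owner/:repo]) suggests the single-dash -repo is being parsed as the short flag -r plus the value epo=...; the double-dash form with a real owner/repo pair should work (placeholder shown):

# Likely fix: use the double-dash --repo flag, since single-dash -repo
# parses as -r with the leftover value "epo=...".
gh secret set NAME -b "VALUE" --repo "<owner>/<repo>"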

Supporting information:

gh auth status
github.com
  ✓ Logged in to github.com as ckittel (C:\Users\chkittel\AppData\Roaming\GitHub CLI\hosts.yml)
  ✓ Git operations for github.com configured to use https protocol.
  ✓ Token: *******************

gh --version
gh version 2.6.0 (2022-03-15)
https://github.com/cli/cli/releases/tag/v2.6.0

HT: @anihitk07 for bringing this to our attention.

PodFailedScheduledQuery should DependsOn container insights

There is a race condition where PodFailedScheduledQuery (a scheduled query rule) might fail to deploy if Container Insights hasn't finished installing. The error is about failing to resolve KubePodInventory. The AKS baseline solves this with a dependsOn, but that doesn't look like it's been implemented here. Evaluate whether that is the right solution here.

If for some reason going with the latest API version and the dependsOn still doesn't fix the issue, we'd be fine using skipQueryValidation: true here.
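A minimal Bicep sketch of that shape (resource names, API versions, and the query are illustrative, not this repo's actual template):

// Hypothetical sketch: deploy Container Insights in the same template and
// make the scheduled query rule depend on it, so the KubePodInventory table
// exists before query validation runs.
param location string = resourceGroup().location
param logAnalyticsWorkspaceName string

resource law 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: logAnalyticsWorkspaceName
}

resource containerInsights 'Microsoft.OperationsManagement/solutions@2015-11-01-preview' = {
  name: 'ContainerInsights(${logAnalyticsWorkspaceName})'
  location: location
  properties: {
    workspaceResourceId: law.id
  }
  plan: {
    name: 'ContainerInsights(${logAnalyticsWorkspaceName})'
    publisher: 'Microsoft'
    product: 'OMSGallery/ContainerInsights'
    promotionCode: ''
  }
}

resource podFailedAlert 'Microsoft.Insights/scheduledQueryRules@2022-06-15' = {
  name: 'PodFailedScheduledQuery'
  location: location
  properties: {
    enabled: true
    severity: 3
    scopes: [
      law.id
    ]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      allOf: [
        {
          query: 'KubePodInventory | where PodStatus == "Failed" | distinct Name'
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 3
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    skipQueryValidation: true // fallback if validation still races the install
  }
  dependsOn: [
    containerInsights // ensures KubePodInventory exists on first deploy
  ]
}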

Thanks @anihitk07 for bringing this to our attention!

Steps made too complex for no reason

The steps are made too complex by using GitHub pipelines, Flux, and all the fancy tooling. It is too difficult to follow. Honestly, I have been trying to follow this over multiple attempts, but I can't even get past step 6, deploying the AKS clusters, because of so much tooling. Can you simplify the examples or create a video guide for how to follow this stuff?
