
TKG Lab

[Diagrams: TKG Lab Base Diagram; TKG Lab Deployment Diagram]

In this lab, we will deploy Tanzu Kubernetes Grid to vSphere, AWS, or Azure. We will additionally deploy TKG packages for ingress, logging, metrics, service discovery and container registry services.

OSS Signed and Supported Packages:

  • Contour for ingress
  • Fluent-bit for logging
  • Cert-manager for certificate management
  • Harbor for image registry
  • Velero for backup/restore, via Tanzu Mission Control Data Protection
  • Prometheus and Grafana for monitoring
  • External DNS as a Kubernetes-native way to manage DNS records

Incorporates the following Tanzu SaaS products:

  • Tanzu Mission Control for multi-cluster management
  • Tanzu Observability by Wavefront for enterprise full-stack observability (via optional Bonus Lab)

Leverages the following external services:

  • AWS S3 as an object store for Velero backups
  • AWS Route 53, GCP Cloud DNS or Azure DNS as DNS provider
  • Okta as an OIDC provider
  • Let's Encrypt as Certificate Authority

Additional OSS components not supported by VMware:

  • Elasticsearch and Kibana for log aggregation and viewing
  • Minio for object storage

Goals and Audience

The following demo is for Tanzu field team members to see how various components of the Tanzu and OSS ecosystem come together to build a modern application platform. We will highlight two different roles: the platform team and the application team's DevOps role. This could be delivered as a presentation and demo, or it could be extended to have the audience deploy the full solution on their own using their cloud resources. The latter would be for SEs and would likely require a full day.

What we have is a combination of open source and proprietary components, with a bias toward providing VMware-built and -signed OSS components by default, and with the flexibility to swap components and integrations.

VMware commercial products included are: TKG, TO and TMC.

3rd-party SaaS services included are: AWS S3, AWS Route 53, GCP Cloud DNS, Azure DNS, Let's Encrypt, and Okta. Note: there is flexibility in deployment planning. For instance, you could swap GCP Cloud DNS for Route 53, or swap Okta for Google or Auth0 as the OpenID Connect provider.

Scenario Business Context

The Acme corporation is looking to grow its business by improving its customer engagement channels and quickly testing various marketing and sales campaigns. Its current business model and methods cannot keep pace with this anticipated growth. Acme recognizes that software will play a critical role in this business transformation. Its development and operations engineers have chosen microservices and Kubernetes as foundational components of their new delivery model, and they have engaged us as a partner to help them achieve their ambitious goals.

App Team

The acme fitness team has reached out to the platform team requesting platform services. They have asked for:

  • Kubernetes based environment to deploy their acme-fitness microservices application
  • Secure ingress for customers to access their application
  • Ability to access application logs in real-time as well as 30 days of history
  • Ability to access application metrics as well as 90+ days of history
  • Visibility into overall platform settings and policy
  • Daily backups of application deployment configuration and data
  • 4 GB total RAM, 3 total CPU cores, and 10 GB of disk for persistent application data
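For reference, the capacity request maps naturally onto a Kubernetes ResourceQuota. A minimal sketch, not taken from the lab scripts; the namespace and object names are placeholders:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: acme-fitness-quota        # placeholder name
  namespace: acme-fitness         # placeholder namespace
spec:
  hard:
    requests.cpu: "3"             # 3 total CPU cores
    requests.memory: 4Gi          # 4 GB total RAM
    requests.storage: 10Gi        # 10 GB for persistent application data
EOF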

Shortly after submitting their request, the acme fitness team received an email with the following:

  • Cluster name
  • Namespace name
  • Base domain for ingress
  • Link to view overall platform data, high-level observability, and policy
  • Link to login to kubernetes and retrieve kubeconfig
  • Link to search and browse logs
  • Link to access detailed metrics

DEMO: With this information, let’s go explore and make use of the platform…

  • Retrieve kubeconfig with tanzu cli
  • Update ingress definition based upon base domain and deploy application (acme-fitness)
  • Test access to the app as an end user (contour)
  • View application logs (kibana, elasticsearch, fluent-bit)
  • View application metrics (prometheus and grafana or tanzu observability)
  • View backup configuration (velero)
  • Browse overall platform data, observability, and policy (tmc)
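As a taste of the first few demo steps, a hedged sketch follows; cluster, namespace, and domain names are placeholders, and exact tanzu CLI flags may vary by version:

tanzu cluster kubeconfig get acme-workload --export-file acme-workload.kubeconfig   # retrieve a kubeconfig
kubectl --kubeconfig acme-workload.kubeconfig get pods -n acme-fitness              # confirm access to the assigned namespace
curl -kI https://acme-fitness.<base-domain>                                         # hit the app through the Contour ingress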

Wow, that was awesome! What happened on the other side of the request for platform services? How did that all happen?

Required CLIs

  • kubectl
  • tmc
  • tanzu v1.5.1
  • velero v1.7.0
  • helm v3
  • yq v4.12+ (to install, use brew on Mac or apt-get on Linux)
  • kind (helpful, but not required)
  • ytt, kapp, imgpkg, kbld (bundled with tanzu cli)
  • jq
  • aws (for deploying on AWS or using Route53 DNS)
  • az (when deploying to Azure or using Azure DNS)
  • terraform (for deploying on AWS)
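A quick way to sanity-check that the required CLIs are present before starting (a simple sketch; the versions shown in the lab may differ):

kubectl version --client
tanzu version
velero version --client-only
helm version
yq --version
jq --version
aws --version
az --version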

Foundational Lab Setup Guides

There are a few options to set up the foundational lab of three clusters: a management cluster, a shared services cluster, and a workload cluster.

  1. Step by Step Guide - Provides instructional guidance for each step, along with validation actions. This is the best option for really learning how each cluster is set up and for developing experience with the enterprise packages and integrations configured for the lab.
  2. One Step Scripted Deployment - This method assumes you have done any required manual steps. A single script deploys all clusters and performs the integrations. It is best used after you have already completed the step by step guide, as any configuration issues specific to your environment will have been worked out in that process.

Acme Fitness Lab

This lab walks through our simulated experience of receiving a request from an app team for cloud resources, covering the steps both for the platform team fulfilling the request and for the app team accessing and deploying their app once the request has been fulfilled.

Platform Team Steps

Switch to the App Team Perspective

Bonus Labs

The following additional labs can be run on the base lab configuration.

Contributors

afewell, afewellvmware, ansergit, bkirkware, bthelen, ccollicutt, cdelashmutt-pivotal, crdant, dbbaskette, doddatpivotal, guillaumemorini, jaimegag, jeffellin, jkhan24558, keithrichardlee, rhardt-pivotal, scottbri, tkrausjr


tkg-lab's Issues

Re-structure TMC integrations

We should consider either:

  1. Removing TMC from the main flow and adding the steps as a single lab or labs. If we do this, all of the labs up to deployment of the workload would require manual RBAC.
  2. Creating two flows - with TMC and without. In the former, TMC should be connected early on and used for Data Protection, Observability, Inspection, and RBAC.

Remove Google Cloud DNS from Primary Flow

I think Google Cloud DNS should be removed from the primary flow of the labs, with Route 53 chosen as the default option. We could add a separate doc page showing how to swap Route 53 out for Google Cloud DNS.

Wavefront values.yaml file collecting tech debt

The Wavefront values file wf.yaml sets values beyond what is intentionally overridden, creating drift from the chart defaults. The most obvious example is the container versions.

Instead, the values file should only contain what is explicitly set.

From what I can tell, that only includes the following:

  • kubeStateMetrics.enabled=true
  • proxy.ports (all types)

collector.discovery.config is defined twice, which calls into question whether it works at all. Recommend removing the second key and leaving the first as an example of how to do discovery.

Create a lab for the Create Full Baseline Lab Configuration in one script.

The script create-all-aws.sh exists to create the baseline three clusters in one shot, but there is no lab supporting it. This could be added as a separate doc, and the main Readme.md could get a section for alternative bootstrap options. The Helm chart baseline lab install lab could be referenced there too.

tmc delete requires -m and -p options

In order to detach a TMC cluster that was attached, you have to use delete, but the -m and -p options are also required even though the CLI doesn't say so. Without those options you get an error message such as the one below:

x rpc error: code = InvalidArgument desc = invalid DeleteClusterRequest.FullName: embedded message failed validation | caused by: invalid FullName.ProvisionerName: value must be a valid Tanzu name | caused by: a name cannot be empty or be longer than 63 characters 

e.g., the correct command:

tmc cluster delete $CLUSTER_NAME --force -m attached -p attached

"attached" is what tmc cluster list will show for both MANAGEMENTCLUSTER and PROVISIONER.

Argo CD Lab Improvements

  • Should be deploying to workload cluster
  • Create a service account for argo in the cluster
  • Kustomizations seem like they are bloated and could be reduced to only the required changes over base
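For the service-account item, a minimal sketch of what could be created in the workload cluster; the names and the broad cluster-admin binding are assumptions, and a real setup should scope the role down:

kubectl create serviceaccount argocd-manager -n kube-system
kubectl create clusterrolebinding argocd-manager \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:argocd-manager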

Incorporate PSPs

Since PSPs are likely to exist in clusters provisioned from vSphere 7 or TMC (if we try to do the labs this way), we should set up a PSP and cluster roles/bindings that exemplify what needs to be added in order to run the lab. At the least, we could apply a privileged PSP to certain namespaces: (projectcontour|tanzu-system-ingress), vmware-system-tmc, and (wavefront|tanzu-observability-saas).

Perhaps we only need to do this if we support a variation of the overall lab that uses vSphere 7, but it might also be a separate topic to document up front as general guidance on using PSPs.
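As a rough sketch (assuming clusters that ship the vmware-system-privileged PSP, as vSphere 7 Tanzu clusters do; the namespace and binding name are placeholders), granting the privileged PSP to one of the affected namespaces could look like:

kubectl create rolebinding default-privileged -n tanzu-system-ingress \
  --clusterrole=psp:vmware-system-privileged \
  --group=system:serviceaccounts:tanzu-system-ingress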

Dex fails with custom Okta endpoints with Let's Encrypt certificates

When you've got a custom URL and issuer on Okta and use Let's Encrypt for its certs, Dex will fail because LE isn't a trusted CA in the image Dex is built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach.

I've got a working fix that depends on #115 and will be submitting a PR once this issue is in.

AWS Cert-manager ClusterIssuer requires HostedZoneID if more than one zone is in use

If you have more than one hosted zone in AWS, the challenge will fail to propagate because the zone is ambiguous:

Status:
  Presented:  false
  Processing: true
  Reason:     Failed to determine Route 53 hosted zone ID: Zone homelab.arg-pivotal.com. not found in Route 53 for domain _acme-challenge.dex.tkg-mgmt.tkg-vsphere-lab.homelab.arg-pivotal.com.
  State:      pending

I will look at adding it to the ClusterIssuer template.
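A hedged sketch of the relevant portion of such a ClusterIssuer; the API version, issuer name, email, region, and zone ID are placeholders, and hostedZoneID is the field that removes the ambiguity:

cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - dns01:
        route53:
          region: us-east-1
          hostedZoneID: Z0123456789ABCDEF   # explicit zone ID disambiguates overlapping zones
EOF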

DNS challenge fails with Google CloudDNS

We have found cases where the DNS challenge fails with Google Cloud DNS because the Cloud DNS solver does not scan zones "below" the TLD. Recreating the cert-manager pods solves the issue, as suggested here.
However, this comment seems to indicate there are ways to fix the ClusterIssuer configuration to avoid the problem entirely.
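For reference, recreating the pods can be as simple as the following (assuming cert-manager runs in the cert-manager namespace; the deployments will recreate the pods):

kubectl delete pods --all -n cert-manager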

Bump TKG version to 1.1.3

Consider bumping the TKG version to 1.1.3. This primarily impacts the vSphere scripts, as they reference OVA versions directly. However, due to the incompatibility between TMC and TKG 1.1.3 regarding health, perhaps we should wait until that is resolved.

Offer a Concourse/Helm Lab

Given our modular approach, it would be good to offer additional products, like Concourse. I will investigate adding it as an additional lab on the shared services cluster.

Upgrade Harbor Lab to 1.2

Upgrade the Harbor lab to use the new Harbor extension included in TKG 1.2 and the new concept of Shared Services.
The implementation will leverage the Envoy VIP, so we will only install the tanzu-registry-webhook and not the tkg-connectivity-operator, as per the documentation.

Create Workload Cluster fails to add context

I failed on adding the Shared Services cluster to TMC:

Workload cluster 'tkg-shared' created

storageclass.storage.k8s.io/aws-sc unchanged
ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
error: no context exists with the name: "tkg-shared-admin@tkg-shared"
ubuntu@ip-172-31-39-146:~/tkg-lab$ k config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin

But then I noticed the script to create it doesn't call "tkg get credentials". I ran it manually:

ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get cluster
NAME         NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES
tkg-shared   default    running  1/1           2/2      v1.18.2+vmware.1
ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get credentials tkg-shared
Credentials of workload cluster 'tkg-shared' have been saved
You can now access the cluster by switching the context to 'tkg-shared-admin@tkg-shared'
ubuntu@ip-172-31-39-146:~/tkg-lab$ kubectl config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin
          tkg-shared-admin@tkg-shared          tkg-shared                           tkg-shared-admin

ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
Switched to context "tkg-shared-admin@tkg-shared".
✔ cluster "gregoryan-tkg-shared" created successfully

Harbor OIDC login fails with custom Okta endpoint using Let's Encrypt certificate

Just like with #118, when you've got a custom URL and issuer on Okta and use Let's Encrypt for its certs, Harbor will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach. Waiting on #108 to be complete before submitting a PR on this one.

I'll be using the same approach to this as #119, leveraging the overlay from #112 and the method from #115.

Update AWS deployment to leverage single VPC

As of TKG 1.1, we have the ability to deploy workload clusters into an existing VPC. We would benefit from demonstrating how to add the shared services cluster and workload cluster to the VPC created when deploying the management cluster.

Need to add jaeger integration for distributed tracing

This will be done:

  1. Assume acme-fitness is running normally in workload cluster
  2. Update the Wavefront proxy (via helm)
  3. Update the deployment YAMLs for acme-fitness to point to wavefront proxy for tracing
  4. Documentation

AWS Install references v1.1 method for cloudformation

Per docs....

NOTE: If in Tanzu Kubernetes Grid v1.1 you set AWS_B64ENCODED_CREDENTIALS as an environment variable, unset the variable before deploying management clusters with v1.2 of the CLI. In v1.2 and later, Tanzu Kubernetes Grid calculates the value of AWS_B64ENCODED_CREDENTIALS automatically. To enable Tanzu Kubernetes Grid to calculate this value, you must set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION variables in .tkg/config.yaml or as environment variables. See Create the Cluster Configuration File in Deploy Management Clusters to Amazon EC2 with the CLI.

02-deploy-aws-mgmt-cluster.sh still refers to the legacy CloudFormation stack approach and won't work for new users who have never executed this process in v1.1.
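For reference, a sketch of setting those variables as environment variables before deploying the management cluster (values are placeholders):

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_REGION=us-east-1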

Move KubeApps to Workload Cluster

My original design for the demo would leverage TSM to make connections between clusters viable without exposing external ports. Since TSM won't be added to the demo yet, Kubeapps should be on the cluster with the deployed applications so that you don't have to use a LoadBalancer, etc., to connect.

Make server a Env Var for ArgoCD Lab

One suggestion for the ArgoCD lab when you commit: you may want to try externalizing the --dest-server into an env var that is easily reusable in the docs.

e.g.:

SERVER=$(argocd cluster list -o json | jq '.[0] .server')

Then in the docs your commands will work exactly as-is:

argocd app create fortune-app-prod \
  --repo https://github.com/Pivotal-Field-Engineering/tkg-lab.git \
  --revision argocd-integration-exercise --path argocd/production \
  --dest-server $SERVER \
  --dest-namespace production \
  --sync-policy automated

kubeapps to use tac.bitnami instead of charts.bitnami

tac.bitnami is more closely aligned with the services available for TAC than charts.bitnami. The latter has a more extensive inventory of services, which could cause confusion when shown to customers as part of our TAC sales motion.

Oauth 2: Failed to refresh token

Hi,

I have followed the lab guide and deployed all the steps successfully.
But after a few hours, I lose the connection to my Kubernetes cluster with the following error:

Unable to connect to the server: failed to refresh token: oauth2: cannot fetch token: 500 Internal Server Error
Response: {"error":"server_error"}

I can get it back to normal by downloading a new kubeconfig.
Could you help me troubleshoot, or configure the setup to extend the token lifetime or auto-refresh the token?

Thanks

Gitlab integration needs ssh support

The GitLab Helm install currently uses Contour, which doesn't really support non-HTTP workloads (TCP ports). Some other apps, like Argo/Flux, leverage SSH for git integration, and port 22 isn't exposed.

Data Protection Still Needs Work

This still needs the following:

  • the deploy-all-vsphere.sh script
  • deletion of the velero.sh script
  • updates to the following readmes
    • docs/mgmt-cluster/10_velero_mgmt.md
      • Need to write a readme covering setting up a data protection account in TMC
    • docs/shared-services-cluster/09_velero_ssc.md

Enhance scripts to wait for pods Running

We could add another while loop after applying the deployment, the same way we wait for the certs, instead of asking users in the instructions to check afterward.
Ideally we should implement it with a timeout; a minimal sketch follows.
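A minimal sketch of such a loop (the namespace and timeout are placeholders):

# wait up to 5 minutes for all pods in the namespace to reach Running
TIMEOUT=300
ELAPSED=0
while [ $ELAPSED -lt $TIMEOUT ]; do
  NOT_RUNNING=$(kubectl get pods -n tanzu-system-ingress \
    --field-selector=status.phase!=Running --no-headers 2>/dev/null | wc -l)
  if [ "$NOT_RUNNING" -eq 0 ]; then
    echo "All pods Running"
    break
  fi
  sleep 5
  ELAPSED=$((ELAPSED + 5))
done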

External DNS

ExternalDNS makes Kubernetes resources discoverable via public DNS servers. It retrieves a list of resources (Services, Ingresses, etc.) from the Kubernetes API to determine a desired list of DNS records.

External DNS eliminates the scripting we have today to create/delete DNS entries in Route 53.

Changes:

  • Create a policy that External DNS uses
  • Use that policy as part of clusterawsadm so that it can be added to the same role
  • Deploy External DNS
  • Modify the Contour Envoy service to add an annotation that creates a wildcard entry once the service is created. External DNS does not support the HTTPProxy CRD, but does support IngressRoute; once the wildcard entry is added for the Envoy service, things will work as they do today.
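A hedged sketch of the annotation in question; the service name, namespace, and domain are assumptions based on how the lab deploys Contour:

kubectl annotate service envoy -n tanzu-system-ingress \
  "external-dns.alpha.kubernetes.io/hostname=*.shared.example.com"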

Work for this issue has already started as part of tkg-lab <> tkg-hol branch.

Runaway Contour App Changes

Contour is generating tons of updates (kubectl get cm -n tanzu-system-ingress). I suspect this started happening with the TKG 1.2 updates, given the kapp-controller. I also suspect that adding the annotation to the Envoy service after the fact creates a condition where the two are constantly fighting each other. Recommend creating external-dns ahead of time and then using an overlay to set the annotation on the Envoy service when deploying the Contour extension.

YQ Failures

If you get YQ failures like this:

andrew@ubuntu-jump:~/tkg/tkg-lab$ ./scripts/deploy-workload-cluster.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.worker-replicas)
12:32:55 main [ERRO] open /home/andrew/.tkg/config.yaml: permission denied

You may have an issue with the installed version of yq. It appears that installing it on Ubuntu via snap causes this problem:

andrew@ubuntu-jump:~/tkg/tkg-lab$ which yq
/snap/bin/yq

To remedy this, remove the snap package and install yq via apt:

sudo snap remove yq
sudo add-apt-repository ppa:rmescandon/yq
sudo apt-get update
sudo apt install yq -y

Then log out and log back in.

Use FQDN and trusted CA for AVI Controller

The current AVI setup uses the AVI Controller IP and a self-signed certificate for that server. We want to align this with the rest of the lab and:

  • Configure a FQDN to access the AVI Controller
  • When we create the AVI Controller Server certificate we should use a trusted CA

Check consistency on script params vs env vars

Some scripts (contour, gangway) receive the cluster name and FQDN as params (extracted from params.yaml inline).
Others (dex, elastic) receive no params and extract the info from params.yaml inside the script.

One reason for this is that contour and gangway are deployed in many clusters, while dex and elastic are only deployed in one specific cluster.

We may want to make all scripts follow the same pattern (like the dex one), favoring cleanliness in the readme over reusability of the scripts; the sketch below illustrates the two patterns.
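To illustrate the two patterns side by side (the script names and the ingress-fqdn key below are hypothetical):

# pattern A: caller extracts values from params.yaml and passes them as arguments
./scripts/generate-and-apply-contour-yaml.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.ingress-fqdn)

# pattern B: the script takes no arguments and reads params.yaml itself
./scripts/generate-and-apply-dex-yaml.sh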
