
TKG Lab

[Diagrams: TKG Lab Base Diagram; TKG Lab Deployment Diagram]

In this lab, we will deploy Tanzu Kubernetes Grid to vSphere, AWS, or Azure. We will additionally deploy TKG packages for ingress, logging, metrics, service discovery and container registry services.

OSS Signed and Supported Packages:

  • Contour for ingress
  • Fluent-bit for logging
  • Cert-manager for certificate management
  • Harbor for image registry
  • Velero for backup/restore, via Tanzu Mission Control Data Protection
  • Prometheus and Grafana for monitoring
  • External DNS as a Kubernetes-native way to manage DNS records

Incorporates the following Tanzu SaaS products:

  • Tanzu Mission Control for multi-cluster management
  • Tanzu Observability by Wavefront for enterprise full-stack observability (via optional Bonus Lab)

Leverages the following external services:

  • AWS S3 as an object store for Velero backups
  • AWS Route 53, GCP Cloud DNS or Azure DNS as DNS provider
  • Okta as an OIDC provider
  • Let's Encrypt as Certificate Authority

Additional OSS components not supported by VMware:

  • Elasticsearch and Kibana for log aggregation and viewing
  • Minio for object storage

Goals and Audience

The following demo is for Tanzu field team members to see how various components of the Tanzu and OSS ecosystem come together to build a modern application platform. We will highlight two different roles: the platform team and the application team's DevOps role. This could be delivered as a presentation and demo, or it could be extended to have the audience deploy the full solution on their own using their cloud resources. The latter would be for SEs and would likely require a full day.

What we have is a combination of open source and proprietary components, with a bias toward providing VMware-built and -signed OSS components by default, and with the flexibility to swap components and integrations.

VMware commercial products included are: TKG, TO and TMC.

3rd-party SaaS services included are: AWS S3, AWS Route 53, GCP Cloud DNS, Azure DNS, Let's Encrypt, and Okta. Note: there is flexibility in deployment planning. For instance, you could swap GCP Cloud DNS for Route 53, or swap Okta for Google or Auth0 as the OpenID Connect provider.

Scenario Business Context

The Acme corporation is looking to grow its business by improving its customer engagement channels and quickly testing various marketing and sales campaigns. Its current business model and methods cannot keep pace with this anticipated growth. Acme recognizes that software will play a critical role in this business transformation. Its development and operations engineers have chosen microservices and Kubernetes as foundational components of their new delivery model, and they have engaged us as a partner to help them achieve their ambitious goals.

App Team

The acme fitness team has reached out to the platform team requesting platform services. They have asked for:

  • Kubernetes based environment to deploy their acme-fitness microservices application
  • Secure ingress for customers to access their application
  • Ability to access application logs in real-time as well as 30 days of history
  • Ability to access application metrics as well as 90+ days of history
  • Visibility into overall platform settings and policy
  • Daily backups of application deployment configuration and data
  • 4 GB total RAM, 3 total CPU cores, and 10 GB of disk for persistent application data
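For reference, the capacity request maps naturally onto a Kubernetes ResourceQuota. A minimal sketch, not taken from the lab scripts; the namespace and object names are placeholders:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: acme-fitness-quota        # placeholder name
  namespace: acme-fitness         # placeholder namespace
spec:
  hard:
    requests.cpu: "3"             # 3 total CPU cores
    requests.memory: 4Gi          # 4 GB total RAM
    requests.storage: 10Gi        # 10 GB for persistent application data
EOF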

Shortly after submitting their request, the acme fitness team received an email with the following:

  • Cluster name
  • Namespace name
  • Base domain for ingress
  • Link to view overall platform data, high-level observability, and policy
  • Link to login to kubernetes and retrieve kubeconfig
  • Link to search and browse logs
  • Link to access detailed metrics

DEMO: With this information, let’s go explore and make use of the platform…

  • Retrieve kubeconfig with tanzu cli
  • Update ingress definition based upon base domain and deploy application (acme-fitness)
  • Test access to the app as an end user (contour)
  • View application logs (kibana, elasticsearch, fluent-bit)
  • View application metrics (prometheus and grafana or tanzu observability)
  • View backup configuration (velero)
  • Browse overall platform data, observability, and policy (tmc)
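As a taste of the first few demo steps, a hedged sketch follows; cluster, namespace, and domain names are placeholders, and exact tanzu CLI flags may vary by version:

tanzu cluster kubeconfig get acme-workload --export-file acme-workload.kubeconfig   # retrieve a kubeconfig
kubectl --kubeconfig acme-workload.kubeconfig get pods -n acme-fitness              # confirm access to the assigned namespace
curl -kI https://acme-fitness.<base-domain>                                         # hit the app through the Contour ingress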

Wow, that was awesome! What happened on the other side of the request for platform services? How did that all happen?

Required CLIs

  • kubectl
  • tmc
  • tanzu v1.5.1
  • velero v1.7.0
  • helm v3
  • yq v4.12+ (to install, use brew on Mac or apt-get on Linux)
  • kind (helpful, but not required)
  • ytt, kapp, imgpkg, kbld (bundled with tanzu cli)
  • jq
  • aws (for deploying on AWS or using Route53 DNS)
  • az (when deploying to Azure or using Azure DNS)
  • terraform (for deploying on AWS)
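A quick way to sanity-check that the required CLIs are present before starting (a simple sketch; the versions shown in the lab may differ):

kubectl version --client
tanzu version
velero version --client-only
helm version
yq --version
jq --version
aws --version
az --version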

Foundational Lab Setup Guides

There are a few options to set up the foundational lab of three clusters: a management cluster, a shared services cluster, and a workload cluster.

  1. Step by Step Guide - Provides instructional guidance for each step, along with validation actions. This is the best option for really learning how each cluster is set up and for developing experience with the enterprise packages and integrations configured for the lab.
  2. One Step Scripted Deployment - This method assumes you have done any required manual steps. A single script deploys all clusters and performs the integrations. It is best used after you have already completed the step by step guide, as any configuration issues specific to your environment will have been worked out in that process.

Acme Fitness Lab

This lab walks through our simulated experience of receiving a request from an app team for cloud resources, covering the steps both for the platform team fulfilling the request and for the app team accessing and deploying their app once the request has been fulfilled.

Platform Team Steps

Switch to the App Team Perspective

Bonus Labs

The following additional labs can be run on the base lab configuration.

Contributors

afewell, afewellvmware, ansergit, bkirkware, bthelen, ccollicutt, cdelashmutt-pivotal, crdant, dbbaskette, doddatpivotal, guillaumemorini, jaimegag, jeffellin, jkhan24558, keithrichardlee, rhardt-pivotal, scottbri, tkrausjr


tkg-lab's Issues

Re-structure TMC integrations

We should consider either:

  1. Removing TMC from the main flow and adding the steps as a single lab or labs. If we do this, all of the labs up to deployment of the workload would require manual RBAC.
  2. Creating two flows - with TMC and without. In the former, TMC should be connected early on and used for Data Protection, Observability, Inspection, and RBAC.

Remove Google Cloud DNS from Primary Flow

I think Google Cloud DNS should be removed from the primary flow of the labs, with Route 53 chosen as the default option. We could add a separate doc page showing how to swap Route 53 out for Google Cloud DNS.

Wavefront values.yaml file collecting tech debt

The Wavefront values file wf.yaml sets values beyond what is intentionally overridden, creating drift from the chart defaults. The most obvious example is the container versions.

Instead, the values file should only contain what is explicitly set.

From what I can tell, that only includes the following:

  • kubeStateMetrics.enabled=true
  • proxy.ports (all types)

collector.discovery.config is defined twice, which calls into question whether it works at all. Recommend removing the second key and leaving the first as an example of how to do discovery.

Create a lab for the Create Full Baseline Lab Configuration in one script.

The script create-all-aws.sh exists to create the baseline three clusters in one shot, but there is no lab supporting it. This could be added as a separate doc, and the main Readme.md could get a section for alternative bootstrap options. The Helm chart baseline lab install lab could be referenced there too.

tmc delete requires -m and -p options

In order to detach a TMC cluster that was attached, you have to use delete, but the -m and -p options are also required even though the CLI doesn't say so. Without those options you get an error message such as the one below:

x rpc error: code = InvalidArgument desc = invalid DeleteClusterRequest.FullName: embedded message failed validation | caused by: invalid FullName.ProvisionerName: value must be a valid Tanzu name | caused by: a name cannot be empty or be longer than 63 characters 

e.g., the correct command:

tmc cluster delete $CLUSTER_NAME --force -m attached -p attached

"attached" is what tmc cluster list will show for both MANAGEMENTCLUSTER and PROVISIONER.

Argo CD Lab Improvements

  • Should be deploying to workload cluster
  • Create a service account for argo in the cluster
  • Kustomizations seem like they are bloated and could be reduced to only the required changes over base
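For the service-account item, a minimal sketch of what could be created in the workload cluster; the names and the broad cluster-admin binding are assumptions, and a real setup should scope the role down:

kubectl create serviceaccount argocd-manager -n kube-system
kubectl create clusterrolebinding argocd-manager \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:argocd-manager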

Incorporate PSPs

Since PSPs are likely to exist in clusters provisioned from vSphere 7 or TMC (if we try to do the labs this way), we should set up a PSP and cluster roles/bindings that exemplify what needs to be added in order to run the lab. At the least, we could apply a privileged PSP to certain namespaces: (projectcontour|tanzu-system-ingress), vmware-system-tmc, and (wavefront|tanzu-observability-saas).

Perhaps we only need to do this if we support a variation of the overall lab that uses vSphere 7, but it might also be a separate topic to document up front as general guidance on using PSPs.
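As a rough sketch (assuming clusters that ship the vmware-system-privileged PSP, as vSphere 7 Tanzu clusters do; the namespace and binding name are placeholders), granting the privileged PSP to one of the affected namespaces could look like:

kubectl create rolebinding default-privileged -n tanzu-system-ingress \
  --clusterrole=psp:vmware-system-privileged \
  --group=system:serviceaccounts:tanzu-system-ingress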

Dex fails with custom Okta endpoints with Let's Encrypt certificates

When you've got a custom URL and issuer on Okta and use Let's Encrypt for its certs, Dex will fail because LE isn't a trusted CA in the image Dex is built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach.

I've got a working fix that depends on #115 and will be submitting a PR once this issue is in.

AWS Cert-manager ClusterIssuer requires HostedZoneID if more than one zone is in use

If you have more than one hosted zone in AWS, the challenge will fail to propagate because the zone is ambiguous:

Status:
  Presented:  false
  Processing: true
  Reason:     Failed to determine Route 53 hosted zone ID: Zone homelab.arg-pivotal.com. not found in Route 53 for domain _acme-challenge.dex.tkg-mgmt.tkg-vsphere-lab.homelab.arg-pivotal.com.
  State:      pending

I will look at adding it to the ClusterIssuer template.
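A hedged sketch of the relevant portion of such a ClusterIssuer; the API version, issuer name, email, region, and zone ID are placeholders, and hostedZoneID is the field that removes the ambiguity:

cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - dns01:
        route53:
          region: us-east-1
          hostedZoneID: Z0123456789ABCDEF   # explicit zone ID disambiguates overlapping zones
EOF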

DNS challenge fails with Google CloudDNS

We have found cases where the DNS challenge fails with Google Cloud DNS because the Cloud DNS solver does not scan zones "below" the TLD. Recreating the cert-manager pods solves the issue, as suggested here.
However, this comment seems to indicate there are ways to fix the ClusterIssuer configuration to avoid the problem entirely.
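For reference, recreating the pods can be as simple as the following (assuming cert-manager runs in the cert-manager namespace; the deployments will recreate the pods):

kubectl delete pods --all -n cert-manager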

Bump TKG version to 1.1.3

Consider bumping the TKG version to 1.1.3. This primarily impacts the vSphere scripts, as they reference OVA versions directly. However, due to the incompatibility between TMC and TKG 1.1.3 regarding health, perhaps we should wait until that is resolved.

Offer a Concourse/Helm Lab

Given our modular approach, it would be good to offer additional products, like Concourse. I will investigate adding it as an additional lab on the shared services cluster.

Upgrade Harbor Lab to 1.2

Upgrade the Harbor lab to use the new Harbor extension included in TKG 1.2 and the new concept of Shared Services.
The implementation will leverage the Envoy VIP, so we will only install the tanzu-registry-webhook and not the tkg-connectivity-operator, as per the documentation.

Create Workload Cluster fails to add context

I failed on adding the Shared Services cluster to TMC:

Workload cluster 'tkg-shared' created

storageclass.storage.k8s.io/aws-sc unchanged
ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
error: no context exists with the name: "tkg-shared-admin@tkg-shared"
ubuntu@ip-172-31-39-146:~/tkg-lab$ k config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin

But then I noticed the script to create it doesn't call "tkg get credentials". I ran it manually:

ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get cluster
NAME         NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES
tkg-shared   default    running  1/1           2/2      v1.18.2+vmware.1
ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get credentials tkg-shared
Credentials of workload cluster 'tkg-shared' have been saved
You can now access the cluster by switching the context to 'tkg-shared-admin@tkg-shared'
ubuntu@ip-172-31-39-146:~/tkg-lab$ kubectl config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin
          tkg-shared-admin@tkg-shared          tkg-shared                           tkg-shared-admin

ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
Switched to context "tkg-shared-admin@tkg-shared".
✔ cluster "gregoryan-tkg-shared" created successfully

Harbor OIDC login fails with custom Okta endpoint using Let's Encrypt certificate

Just like with #118, when you've got a custom URL and issuer on Okta and use Let's Encrypt for its certs, Harbor will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach. Waiting on #108 to be complete before submitting a PR on this one.

I'll be using the same approach to this as #119, leveraging the overlay from #112 and the method from #115.

Update AWS deployment to leverage single VPC

As of TKG 1.1, we have the ability to deploy workload clusters into an existing VPC. We would benefit from demonstrating how to add the shared services cluster and workload cluster to the VPC created when deploying the management cluster.

Need to add jaeger integration for distributed tracing

This will be done:

  1. Assume acme-fitness is running normally in workload cluster
  2. Update the Wavefront proxy (via helm)
  3. Update the deployment YAMLs for acme-fitness to point to wavefront proxy for tracing
  4. Documentation

AWS Install references v1.1 method for cloudformation

Per docs....

NOTE: If in Tanzu Kubernetes Grid v1.1 you set AWS_B64ENCODED_CREDENTIALS as an environment variable, unset the variable before deploying management clusters with v1.2 of the CLI. In v1.2 and later, Tanzu Kubernetes Grid calculates the value of AWS_B64ENCODED_CREDENTIALS automatically. To enable Tanzu Kubernetes Grid to calculate this value, you must set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION variables in .tkg/config.yaml or as environment variables. See Create the Cluster Configuration File in Deploy Management Clusters to Amazon EC2 with the CLI.

02-deploy-aws-mgmt-cluster.sh still refers to the legacy CloudFormation stack approach and won't work for new users who have never executed this process in v1.1.
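For reference, a sketch of setting those variables as environment variables before deploying the management cluster (values are placeholders):

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_REGION=us-east-1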

Move KubeApps to Workload Cluster

My original design for the demo would leverage TSM to make connections between clusters viable without exposing external ports. Since TSM won't be added to the demo yet, Kubeapps should be on the cluster with the deployed applications so that you don't have to use a LoadBalancer, etc., to connect.

Make server a Env Var for ArgoCD Lab

One suggestion for the ArgoCD lab when you commit: you may want to try externalizing the --dest-server into an env var that is easily reusable in the docs.

e.g.:

SERVER=$(argocd cluster list -o json | jq '.[0] .server')

Then in the docs your commands will work exactly as-is:

argocd app create fortune-app-prod \
  --repo https://github.com/Pivotal-Field-Engineering/tkg-lab.git \
  --revision argocd-integration-exercise --path argocd/production \
  --dest-server $SERVER \
  --dest-namespace production \
  --sync-policy automated

kubeapps to use tac.bitnami instead of charts.bitnami

tac.bitnami is more closely aligned with the services available for TAC than charts.bitnami. The latter has a more extensive inventory of services, which could cause confusion when shown to customers as part of our TAC sales motion.

Oauth 2: Failed to refresh token

Hi,

I have followed the lab guide and deployed all the steps successfully.
But after a few hours, I lose the connection to my Kubernetes cluster with the following error:

Unable to connect to the server: failed to refresh token: oauth2: cannot fetch token: 500 Internal Server Error
Response: {"error":"server_error"}

I can get it back to normal by downloading a new kubeconfig.
Could you help me troubleshoot, or configure the setup to extend the token lifetime or auto-refresh the token?

Thanks

Gitlab integration needs ssh support

The GitLab Helm install currently uses Contour, which doesn't really support non-HTTP workloads (TCP ports). Some other apps, like Argo/Flux, leverage SSH for git integration, and port 22 isn't exposed.

Data Protection Still Needs Work

This still needs the following:

  • the deploy-all-vsphere.sh script
  • deletion of the velero.sh script
  • updates to the following readmes
    • docs/mgmt-cluster/10_velero_mgmt.md
      • Need to write a readme covering setting up a data protection account in TMC
    • docs/shared-services-cluster/09_velero_ssc.md

Enhance scripts to wait for pods Running

We could add another while loop after applying the deployment, the same way we wait for the certs, instead of asking users in the instructions to check afterward.
Ideally we should implement it with a timeout; a minimal sketch follows.
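A minimal sketch of such a loop (the namespace and timeout are placeholders):

# wait up to 5 minutes for all pods in the namespace to reach Running
TIMEOUT=300
ELAPSED=0
while [ $ELAPSED -lt $TIMEOUT ]; do
  NOT_RUNNING=$(kubectl get pods -n tanzu-system-ingress \
    --field-selector=status.phase!=Running --no-headers 2>/dev/null | wc -l)
  if [ "$NOT_RUNNING" -eq 0 ]; then
    echo "All pods Running"
    break
  fi
  sleep 5
  ELAPSED=$((ELAPSED + 5))
done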

External DNS

ExternalDNS makes Kubernetes resources discoverable via public DNS servers. It retrieves a list of resources (Services, Ingresses, etc.) from the Kubernetes API to determine a desired list of DNS records.

External DNS eliminates the scripting we have today to create/delete DNS entries in Route 53.

Changes:

  • Create a policy that External DNS uses
  • Use that policy as part of clusterawsadm so that it can be added to the same role
  • Deploy External DNS
  • Modify the Contour Envoy service to add an annotation that creates a wildcard entry once the service is created. External DNS does not support the HTTPProxy CRD, but does support IngressRoute; once the wildcard entry is added for the Envoy service, things will work as they do today.
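A hedged sketch of the annotation in question; the service name, namespace, and domain are assumptions based on how the lab deploys Contour:

kubectl annotate service envoy -n tanzu-system-ingress \
  "external-dns.alpha.kubernetes.io/hostname=*.shared.example.com"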

Work for this issue has already started as part of tkg-lab <> tkg-hol branch.

Runaway Contour App Changes

Contour is generating tons of updates (kubectl get cm -n tanzu-system-ingress). I suspect this started happening with the TKG 1.2 updates, given the kapp-controller. I also suspect that adding the annotation to the Envoy service after the fact creates a condition where the two are constantly fighting each other. Recommend creating external-dns ahead of time and then using an overlay to set the annotation on the Envoy service when deploying the Contour extension.

YQ Failures

If you get YQ failures like this:

andrew@ubuntu-jump:~/tkg/tkg-lab$ ./scripts/deploy-workload-cluster.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.worker-replicas)
12:32:55 main [ERRO] open /home/andrew/.tkg/config.yaml: permission denied

You may have an issue with the installed version of yq. It appears that installing it on Ubuntu via snap causes this problem:

andrew@ubuntu-jump:~/tkg/tkg-lab$ which yq
/snap/bin/yq

To remedy this, remove the snap package and install yq via apt:

sudo snap remove yq
sudo add-apt-repository ppa:rmescandon/yq
sudo apt-get update
sudo apt install yq -y

Then log out and log back in.

Use FQDN and trusted CA for AVI Controller

The current AVI setup uses the AVI Controller IP and a self-signed certificate for that server. We want to align this with the rest of the lab and:

  • Configure a FQDN to access the AVI Controller
  • When we create the AVI Controller Server certificate we should use a trusted CA

Check consistency on script params vs env vars

Some scripts (contour, gangway) receive the cluster name and FQDN as params (extracted from params.yaml inline).
Others (dex, elastic) receive no params and extract the info from params.yaml inside the script.

One reason for this is that contour and gangway are deployed in many clusters, while dex and elastic are only deployed in one specific cluster.

We may want to make all scripts follow the same pattern (like the dex one), favoring cleanliness in the readme over reusability of the scripts; the sketch below illustrates the two patterns.
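To illustrate the two patterns side by side (the script names and the ingress-fqdn key below are hypothetical):

# pattern A: caller extracts values from params.yaml and passes them as arguments
./scripts/generate-and-apply-contour-yaml.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.ingress-fqdn)

# pattern B: the script takes no arguments and reads params.yaml itself
./scripts/generate-and-apply-dex-yaml.sh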
