
tkg-lab's Issues

Incorporate PSPs

Since PSPs are likely to exist in clusters provisioned from vSphere 7 or TMC (if we try to do the labs this way), we should set up a PSP and cluster roles/bindings that exemplify what needs to be added in order to run the lab. At a minimum, we could apply a privileged PSP to certain namespaces: (projectcontour|tanzu-system-ingress), vmware-system-tmc, and (wavefront|tanzu-observability-saas).

Perhaps we only need to do this if we support a variation of the overall lab that uses vSphere 7, but it might also be worth a separate topic documented up front with general guidance on using PSPs.
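A minimal sketch of what that could look like, assuming the privileged PSP that ships with vSphere 7 clusters is named vmware-system-privileged (adjust the PSP name and the namespace list to your environment):

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-privileged
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["vmware-system-privileged"]
  verbs: ["use"]
EOF

# Grant the PSP to all service accounts in the namespaces the lab uses
for ns in tanzu-system-ingress vmware-system-tmc tanzu-observability-saas; do
  kubectl -n "$ns" create rolebinding psp-privileged \
    --clusterrole=psp-privileged --group="system:serviceaccounts:$ns"
done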

Remove Google Cloud DNS from Primary Flow

I think Google Cloud DNS should be removed from the primary flow of the labs, with Route 53 chosen as the default option. We could add a separate doc page showing how to swap in Google Cloud DNS in place of Route 53.

Argo CD Lab Improvements

  • Should be deploying to the workload cluster
  • Create a service account for argo in the cluster (see the sketch after this list)
  • Kustomizations seem bloated and could be reduced to only the required changes over base
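For the service account item, a minimal sketch (names are illustrative, and the role should be narrowed from cluster-admin for real use):

# Hypothetical names; Argo CD's own "argocd cluster add" creates a similar account
kubectl -n kube-system create serviceaccount argocd-deployer
kubectl create clusterrolebinding argocd-deployer \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:argocd-deployer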

Move KubeApps to Workload Cluster

My original design for the demo would leverage TSM and make connections between clusters viable without exposing external ports. Since TSM won't be added to the demo yet, Kubeapps should be on the cluster with the deployed applications so that you don't have to use a LoadBalancer, etc., to connect.

Check consistency on script params vs env vars

Some scripts (contour, gangway) receive the cluster name and fqdn as params (extracted from params.yaml inline).
Some others (dex, elastic) receive no params and extract the info from the params.yaml inside the script.

One reason for this is that contour and gangway are deployed in many clusters, while dex and elastic are only deployed in one specific cluster.

We may want to make all scripts consistent (like the dex one), favoring cleanliness in the readme over reusability of the scripts.
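For reference, the two patterns look roughly like this (script names and params.yaml keys are illustrative, not exact):

# Pattern 1: caller passes values extracted from params.yaml inline (contour, gangway)
./scripts/generate-and-apply-contour-yaml.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.ingress-fqdn)

# Pattern 2: script reads params.yaml itself (dex, elastic)
CLUSTER_NAME=$(yq r params.yaml management-cluster.name)
DEX_FQDN=$(yq r params.yaml management-cluster.dex-fqdn)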

Create Workload Cluster fails to add context

I failed when adding the Shared Services cluster to TMC:

Workload cluster 'tkg-shared' created

storageclass.storage.k8s.io/aws-sc unchanged
ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
error: no context exists with the name: "tkg-shared-admin@tkg-shared"
ubuntu@ip-172-31-39-146:~/tkg-lab$ k config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin

But then I noticed the script to create it doesn't call "tkg get credentials". I ran it manually:

ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get cluster
NAME         NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES
tkg-shared   default    running  1/1           2/2      v1.18.2+vmware.1
ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get credentials tkg-shared
Credentials of workload cluster 'tkg-shared' have been saved
You can now access the cluster by switching the context to 'tkg-shared-admin@tkg-shared'
ubuntu@ip-172-31-39-146:~/tkg-lab$ kubectl config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin
          tkg-shared-admin@tkg-shared          tkg-shared                           tkg-shared-admin

ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
Switched to context "tkg-shared-admin@tkg-shared".
✔ cluster "gregoryan-tkg-shared" created successfully
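A minimal fix would be for the workload cluster creation script to fetch the credentials right after the cluster comes up, before anything tries to switch context; a sketch:

CLUSTER_NAME=$1
# (existing 'tkg create cluster' call for $CLUSTER_NAME happens here)
# Pull the admin kubeconfig locally so later scripts can switch to the new context
tkg get credentials "$CLUSTER_NAME"
kubectl config use-context "$CLUSTER_NAME-admin@$CLUSTER_NAME"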

Wavefront values.yaml file collecting tech debt

The wavefront values file wf.yaml has values set beyond what is intentionally overridden, accumulating drift from the chart defaults. The clearest example is the pinned container versions.

Instead, the values file should only contain what is explicitly set.

From what I can tell, that only includes the following:

  • kubeStateMetrics.enabled=true
  • proxy.ports (all types)

collector.discovery.config is defined twice, which calls into question whether it works at all. Recommend removing the second key and leaving this as an example of how to do discovery.
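A trimmed wf.yaml, assuming those are the only intentional overrides (the exact proxy port keys should be carried over from the current file and checked against the chart's values.yaml):

cat > wf.yaml <<'EOF'
kubeStateMetrics:
  enabled: true
proxy:
  # port overrides (all types) copied from the existing wf.yaml;
  # key names must match the chart's values.yaml
EOF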

Enhance scripts to wait for pods Running

We could add another while loop after applying the deployment, the same way we wait for the certs, instead of asking users in the instructions to check afterwards.
Ideally we should implement it with a timeout.
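One way to get the wait plus a built-in timeout (the namespace here is a placeholder) is kubectl wait rather than a hand-rolled loop:

# Wait up to 5 minutes for every pod in the namespace to report Ready
kubectl wait --for=condition=Ready pod --all \
  -n tanzu-system-ingress --timeout=300s \
  || { echo "pods not Ready after 5 minutes" >&2; exit 1; }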

Make server a Env Var for ArgoCD Lab

One suggestion for the ArgoCD lab when you commit: you may want to try externalizing the --dest-server into an env var that is easily reusable in the docs.

e.g.: SERVER=$(argocd cluster list -o json | jq -r '.[0].server')

Then in the docs your commands will work exactly as-is:
argocd app create fortune-app-prod \
  --repo https://github.com/Pivotal-Field-Engineering/tkg-lab.git \
  --revision argocd-integration-exercise --path argocd/production \
  --dest-server $SERVER \
  --dest-namespace production \
  --sync-policy automated

Runaway Contour App Changes

Contour is generating tons of updates (see kubectl get cm -n tanzu-system-ingress). I suspect this started happening with the TKG 1.2 updates, given the kapp-controller. I also suspect that adding the annotation to the envoy service after the fact causes a condition where the two are constantly fighting each other. Recommend creating external-dns ahead of time and then using an overlay to set the annotation on the envoy service when deploying the contour extension.
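A rough sketch of such an overlay (the domain is a placeholder, and how the overlay gets wired into the contour extension depends on the lab's extension tooling):

cat > overlay-envoy-dns.yaml <<'EOF'
#@ load("@ytt:overlay", "overlay")
#@overlay/match by=overlay.subset({"kind": "Service", "metadata": {"name": "envoy"}})
---
metadata:
  #@overlay/match missing_ok=True
  annotations:
    #@overlay/match missing_ok=True
    external-dns.alpha.kubernetes.io/hostname: "*.example.com"
EOF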

Dex fails with custom Okta endpoints with Let's Encrypt certificates

When you've got a custom URL and issuer on Okta and use Let's Encrypt for certs on it, Dex will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach.

I've got a working fix that depends on #115 and will be submitting a PR once this issue is in.

Offer a Concourse/Helm Lab

Given our modular approach, it would be good to offer additional products, like Concourse. I will investigate adding it as an additional lab to the shared services cluster.

AWS Install references v1.1 method for cloudformation

Per the docs:

NOTE: If in Tanzu Kubernetes Grid v1.1 you set AWS_B64ENCODED_CREDENTIALS as an environment variable, unset the variable before deploying management clusters with v1.2 of the CLI. In v1.2 and later, Tanzu Kubernetes Grid calculates the value of AWS_B64ENCODED_CREDENTIALS automatically. To enable Tanzu Kubernetes Grid to calculate this value, you must set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION variables in .tkg/config.yaml or as environment variables. See Create the Cluster Configuration File in Deploy Management Clusters to Amazon EC2 with the CLI.

02-deploy-aws-mgmt-cluster.sh still refers to the legacy CloudFormation stack approach and won't work for new users who have never executed this process in v1.1.
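Under the v1.2 flow, the script should only need the individual credential variables set (values below are placeholders) and must not export AWS_B64ENCODED_CREDENTIALS itself:

export AWS_ACCESS_KEY_ID=AKIAEXAMPLE                 # placeholder
export AWS_SECRET_ACCESS_KEY=examplesecretkey        # placeholder
export AWS_REGION=us-east-1
# Do NOT set AWS_B64ENCODED_CREDENTIALS; v1.2 computes it from the values above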

Re-structure TMC integrations

We should consider either:

  1. Removing TMC from the main flow and adding the steps as a single lab or set of labs. If we do this, all of the labs up to deployment of the workload cluster would require manual RBAC.
  2. Creating two flows, with TMC and without. In the former, TMC should be connected early on and used for data protection, observability, inspection, and RBAC.

Use FQDN and trusted CA for AVI Controller

The current AVI setup uses the AVI Controller IP and a self-signed certificate for that server. We want to align this with the rest of the lab and:

  • Configure an FQDN to access the AVI Controller
  • When we create the AVI Controller server certificate, use one issued by a trusted CA

kubeapps to use tac.bitnami instead of charts.bitnami

tac.bitnami is more aligned with the services available for TAC than charts.bitnami. The latter has a more extensive inventory of services, which could cause confusion when shown to customers as part of our TAC sales motion.
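One way to point Kubeapps at TAC is to register it as an AppRepository (the URL and namespace below are illustrative; use your org's actual TAC endpoint and the namespace where Kubeapps is installed):

kubectl apply -f - <<EOF
apiVersion: kubeapps.com/v1alpha1
kind: AppRepository
metadata:
  name: tac
  namespace: kubeapps
spec:
  url: https://example.tac.bitnami.com/index   # placeholder TAC chart repo URL
EOF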

Create a lab for the Create Full Baseline Lab Configuration in one script.

The script create-all-aws.sh exists to create the three baseline clusters in one shot, but there is no lab supporting it. This could be added as a separate doc. Then, on the main Readme.md, there could be a section for alternative bootstrap options. The Helm chart baseline install lab could be referenced there too.

Gitlab integration needs ssh support

The GitLab Helm install is currently using Contour, which doesn't really support non-HTTP workloads (TCP ports). Some other apps, like Argo/Flux, leverage SSH for git integration, and port 22 isn't exposed.
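One option, since Contour/Envoy only fronts HTTP(S) here, is a dedicated LoadBalancer Service for the SSH port. A sketch, with the selector label and target port as assumptions to be checked against the GitLab chart's gitlab-shell service:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: gitlab-shell-ssh
  namespace: gitlab
spec:
  type: LoadBalancer
  selector:
    app: gitlab-shell          # assumed label from the GitLab chart
  ports:
  - name: ssh
    port: 22
    targetPort: 2222           # assumed gitlab-shell container port
EOF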

Oauth 2: Failed to refresh token

Hi,

I have followed the lab guide and successfully deployed all the steps.
But after a few hours, I lose the connection to my Kubernetes cluster with the following error:

Unable to connect to the server: failed to refresh token: oauth2: cannot fetch token: 500 Internal Server Error
Response: {"error":"server_error"}

I can get it back to normal after downloading a new kubeconfig.
Could you help me troubleshoot, or configure a longer token lifetime or automatic token refresh?

Thanks

AWS Cert-manager ClusterIssuer requires HostedZoneID if >1 zones are in use.

If you have more than one hosted zone in AWS, it will fail to propagate the challenge, because it is ambiguous.
Status:
  Presented:   false
  Processing:  true
  Reason:      Failed to determine Route 53 hosted zone ID: Zone homelab.arg-pivotal.com. not found in Route 53 for domain _acme-challenge.dex.tkg-mgmt.tkg-vsphere-lab.homelab.arg-pivotal.com.
  State:       pending

I will look at adding it to the ClusterIssuer template.
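A sketch of the ClusterIssuer with the zone pinned (the apiVersion depends on the cert-manager version installed, and the values are placeholders):

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - dns01:
        route53:
          region: us-east-1
          hostedZoneID: Z0123456789EXAMPLE   # pinning the zone removes the ambiguity
EOF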

Update AWS deployment to leverage single VPC

As of TKG 1.1, we have the ability to deploy workload clusters to an existing VPC. We would benefit from demonstrating how to add the shared services cluster and workload cluster to the VPC created when deploying the management cluster.
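Assuming the standard TKG-on-AWS variables (verify the names against the docs for your TKG version), the later clusters could target the management cluster's VPC roughly like this:

# Reuse the VPC/subnets created for the management cluster (IDs are placeholders)
export AWS_VPC_ID=vpc-0123456789abcdef0
export AWS_PUBLIC_SUBNET_ID=subnet-0aaaaaaaaaaaaaaaa
export AWS_PRIVATE_SUBNET_ID=subnet-0bbbbbbbbbbbbbbbb
tkg create cluster tkg-shared --plan=dev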

Harbor OIDC login fails with custom Okta endpoint using Let's Encrypt certificate

Just like with #118, when you've got a custom URL and issuer on Okta and use Let's Encrypt for certs on it, Harbor will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach. Waiting on #108 to be complete before submitting a PR on this one.

I'll be using the same approach to this as #119, leveraging the overlay from #112 and the method from #115.

Need to add jaeger integration for distributed tracing

This will be done:

  1. Assume acme-fitness is running normally in the workload cluster
  2. Update the Wavefront proxy (via helm)
  3. Update the deployment YAMLs for acme-fitness to point to the Wavefront proxy for tracing
  4. Documentation

Bump TKG version to 1.1.3

Consider bumping the TKG version to 1.1.3. This primarily impacts the vSphere scripts, as they reference OVA versions directly. However, due to the incompatibility between TMC and TKG 1.1.3 with respect to health, perhaps we should wait until that is resolved.

DNS challenge fails with Google CloudDNS

Found cases where the DNS challenge fails with Google Cloud DNS, where the Cloud DNS certbot does not scan zones "below" the TLD. Recreating the cert-manager pods solves the issue, as suggested here.
However, this comment seems to indicate there are ways to fix the configuration of the ClusterIssuer to avoid the problem entirely.
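For reference, the workaround amounts to the following (adjust the namespace to wherever cert-manager runs in your cluster):

# Force cert-manager to restart and re-evaluate the DNS zones
kubectl delete pods -n cert-manager --all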

tmc delete requires -m and -p options

In order to detach a TMC cluster that was attached, you have to use delete, and the -m and -p options are also required even though the CLI doesn't say so. Without those options you get an error message such as the one below:

x rpc error: code = InvalidArgument desc = invalid DeleteClusterRequest.FullName: embedded message failed validation | caused by: invalid FullName.ProvisionerName: value must be a valid Tanzu name | caused by: a name cannot be empty or be longer than 63 characters 

E.g., the correct command:

tmc cluster delete $CLUSTER_NAME --force -m attached -p attached

"attached" is what tmc cluster list will show for both MANAGEMENTCLUSTER and PROVISIONER.

YQ Failures

If you get YQ failures like this:

andrew@ubuntu-jump:~/tkg/tkg-lab$ ./scripts/deploy-workload-cluster.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.worker-replicas)
12:32:55 main [ERRO] open /home/andrew/.tkg/config.yaml: permission denied

You may have an issue with the version of YQ installed. It appears that if installed on Ubuntu via SNAP, there is an issue:

andrew@ubuntu-jump:~/tkg/tkg-lab$ which yq
/snap/bin/yq

To remedy this, remove the snap and install it via apt instead:

sudo snap remove yq
sudo add-apt-repository ppa:rmescandon/yq
sudo apt-get update
sudo apt install yq -y

Then log out and log in again.

Upgrade Harbor Lab to 1.2

Upgrade Harbor Lab to use the new Harbor Extension included in TKG 1.2 and the new concept of Shared Services.
Implementation will leverage the Envoy VIP so we will only install the tanzu-registry-webhook and not the tkg-connectivity-operator as per the documentation.

External DNS

ExternalDNS makes Kubernetes resources discoverable via public DNS servers. It retrieves a list of resources (Services, Ingresses, etc.) from the Kubernetes API to determine a desired list of DNS records.

External DNS eliminates the scripting we have today to create/delete DNS entries in Route 53.

Changes:

  • Create an IAM policy that external-dns uses (see the sketch below)
  • Use that policy as part of clusterawsadm so that it can be added to the same role
  • Deploy external-dns
  • Modify the Contour envoy service to add an annotation that creates a wildcard entry once that service is created. External DNS does not support the HTTPProxy CRD but does support IngressRoute; however, once the wildcard entry is added for the envoy service, it will work as it does today.

Work for this issue has already started as part of tkg-lab <> tkg-hol branch.
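The IAM policy referenced in the first bullet is essentially the standard Route 53 policy from the external-dns docs; a sketch of creating it (the policy name is arbitrary):

cat > external-dns-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["route53:ChangeResourceRecordSets"],
      "Resource": ["arn:aws:route53:::hostedzone/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["route53:ListHostedZones", "route53:ListResourceRecordSets"],
      "Resource": ["*"]
    }
  ]
}
EOF
aws iam create-policy --policy-name tkg-lab-external-dns \
  --policy-document file://external-dns-policy.json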

Data Protection Still Needs Work

This still needs the following:

  • deploy-all-vsphere.sh script
  • delete the velero.sh script
  • update the following readmes
    • docs/mgmt-cluster/10_velero_mgmt.md
      • Need to write readme regarding setting up a data protection account in TMC
    • docs/shared-services-cluster/09_velero_ssc.md
