tanzu-solutions-engineering / tkg-lab
Day in the life of a TKG platform team.
Ideas include:
ssh into worker nodes
recover kubeconfig for management cluster
upgrade tkg cluster
Currently the velero labs suggest that they should only be run if you are using AWS for your IaaS. With a small accommodation for restic, I think the labs would run fine on vSphere, using S3 for the backups.
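A minimal sketch of what that could look like, assuming the AWS plugin is used purely for S3 object storage and restic handles volumes (bucket name, plugin version, and region are illustrative):
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.1.0 \
  --bucket tkg-lab-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --use-volume-snapshots=false \
  --use-restic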
Since PSPs are likely to exist in clusters provisioned from vSphere 7 or TMC (if we try to do the labs this way), we should set up a PSP and cluster roles/bindings that exemplify what needs to be added in order to run the lab. At the least, we could apply a privileged PSP to certain namespaces: (projectcontour|tanzu-system-ingress), vmware-system-tmc, (wavefront|tanzu-observability-saas).
Perhaps we only need to do this if we support a variation of the overall lab that uses vSphere 7. But it might also be a separate topic to document up front: general guidance on using PSPs.
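A minimal sketch of the bindings, assuming a pre-existing privileged PSP (the vmware-system-privileged name is an assumption; the namespaces come from the list above):
# Allow "use" of the assumed privileged PSP via a ClusterRole
kubectl create clusterrole psp:privileged \
  --verb=use \
  --resource=podsecuritypolicies \
  --resource-name=vmware-system-privileged
# Bind it to all service accounts in each lab namespace
for ns in tanzu-system-ingress vmware-system-tmc tanzu-observability-saas; do
  kubectl create rolebinding psp:privileged -n "$ns" \
    --clusterrole=psp:privileged \
    --group="system:serviceaccounts:$ns"
done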
I think that Google Cloud DNS should be removed from the primary flow of the labs, with Route 53 chosen as the default option. We could add a separate doc page indicating how you could swap Route 53 out for Google Cloud DNS.
This script has the infrastructure provider version hard-coded to 0.5.2, but for me it was 0.5.3. How do we make this more generic?
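One hedged option, assuming the providers are unpacked under ~/.tkg/providers (the path and directory layout are assumptions), is to discover the latest installed version instead of hard-coding it:
# Pick the highest installed infrastructure-aws provider version
PROVIDER_VERSION=$(ls -d ~/.tkg/providers/infrastructure-aws/v* | sort -V | tail -1 | xargs basename)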
My original design for the demo would leverage TSM and make connections between clusters without exposing external ports. Since TSM won't get added to the demo yet, Kubeapps should be on the cluster with the deployed applications so that you don't have to use a LoadBalancer, etc., to connect.
Some scripts (contour, gangway) receive the cluster name and FQDN as params (extracted from params.yaml inline by the caller).
Some others (dex, elastic) receive no params and extract the info from params.yaml inside the script.
One reason for this is that contour and gangway are deployed to many clusters, while dex and elastic are only deployed to one specific cluster.
We may want to make all scripts follow the same pattern (like the dex one), favoring cleanliness in the readme over reusability of the scripts.
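For illustration, the two patterns look like this (script names and params.yaml keys are assumptions following the conventions above):
# Pattern 1 (contour/gangway style): caller extracts values and passes them in
./scripts/generate-and-apply-contour-yaml.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.ingress-fqdn)
# Pattern 2 (dex style): no params; the script reads params.yaml itself
CLUSTER_NAME=$(yq r params.yaml management-cluster.name)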
This needs to be updated to work within the shared cluster.
I failed on adding the Shared Services cluster to TMC:
Workload cluster 'tkg-shared' created
storageclass.storage.k8s.io/aws-sc unchanged
ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
error: no context exists with the name: "tkg-shared-admin@tkg-shared"
ubuntu@ip-172-31-39-146:~/tkg-lab$ k config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
          tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin
But then I noticed the script to create it doesn't call "tkg get credentials". I ran it manually:
ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get cluster
NAME         NAMESPACE   STATUS    CONTROLPLANE   WORKERS   KUBERNETES
tkg-shared   default     running   1/1            2/2       v1.18.2+vmware.1
ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get credentials tkg-shared
Credentials of workload cluster 'tkg-shared' have been saved
You can now access the cluster by switching the context to 'tkg-shared-admin@tkg-shared'
ubuntu@ip-172-31-39-146:~/tkg-lab$ kubectl config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
          tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin
          tkg-shared-admin@tkg-shared          tkg-shared                           tkg-shared-admin
ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
Switched to context "tkg-shared-admin@tkg-shared".
✔ cluster "gregoryan-tkg-shared" created successfully
Gearing towards a Spring Boot app could facilitate expanding with other labs that showcase the Tanzu build portfolio (e.g. TBS).
The wavefront values file wf.yaml sets values beyond what is intentionally overridden, causing drift from the defaults. The clearest example is the container versions.
Instead, the values file should only contain what is explicitly set.
From what I can tell, that only includes the following:
collector.discovery.config is defined twice, which raises the question of whether it works at all. Recommend removing the second key and leaving this as an example of how to do discovery.
We could add another wait loop after applying the deployment, the same way we did when waiting for the certs, instead of asking users in the instructions to check afterwards.
Ideally we should implement it with a timeout.
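A minimal sketch, assuming an illustrative deployment name and namespace, with a five-minute timeout:
DEADLINE=$((SECONDS + 300))   # give up after five minutes
while [[ "$(kubectl get deployment my-app -n my-namespace -o jsonpath='{.status.readyReplicas}')" != "1" ]]; do
  if (( SECONDS >= DEADLINE )); then
    echo "Timed out waiting for deployment to become ready" >&2
    exit 1
  fi
  sleep 5
done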
One suggestion for the ArgoCD lab when you commit: you may want to try externalizing --dest-server into a variable that is easily reusable in the docs,
e.g.: SERVER=$(argocd cluster list -o json | jq -r '.[0].server')
Then in the docs your commands will work exactly as-is:
argocd app create fortune-app-prod \
  --repo https://github.com/Pivotal-Field-Engineering/tkg-lab.git \
  --revision argocd-integration-exercise --path argocd/production \
  --dest-server $SERVER \
  --dest-namespace production \
  --sync-policy automated
Contour is generating tons of updates (kubectl get cm -n tanzu-system-ingress). I suspect this started happening with the TKG 1.2 updates, given the kapp-controller. I also suspect that adding the annotation to the envoy service after the fact creates a condition where the two are constantly fighting each other. Recommend creating external-dns ahead of time and then using an overlay to set the annotation on the envoy service when deploying the contour extension.
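A sketch of such an overlay, assuming the ytt overlay mechanism used by the TKG extensions (the file name and wildcard domain are illustrative):
cat > overlays/contour-envoy-dns.yaml <<'EOF'
#@ load("@ytt:overlay", "overlay")
#@overlay/match by=overlay.subset({"kind": "Service", "metadata": {"name": "envoy"}})
---
metadata:
  #@overlay/match missing_ok=True
  annotations:
    #@overlay/match missing_ok=True
    external-dns.alpha.kubernetes.io/hostname: "*.example.com"
EOF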
When you've got a custom URL and issuer on Okta and use Let's Encrypt for certs on it, Dex will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach.
I've got a working fix that depends on #115 and will be submitting a PR once this issue is in.
As we continue to integrate Avi w/ Tanzu, let's update the vSphere version of the lab to use Avi instead of MetalLB.
It would be good, given our modular approach, to offer additional products, like Concourse. I will investigate adding it as an additional lab to the shared services cluster.
Per the docs:
NOTE: If in Tanzu Kubernetes Grid v1.1 you set AWS_B64ENCODED_CREDENTIALS as an environment variable, unset the variable before deploying management clusters with v1.2 of the CLI. In v1.2 and later, Tanzu Kubernetes Grid calculates the value of AWS_B64ENCODED_CREDENTIALS automatically. To enable Tanzu Kubernetes Grid to calculate this value, you must set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION variables in .tkg/config.yaml or as environment variables. See Create the Cluster Configuration File in Deploy Management Clusters to Amazon EC2 with the CLI.
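So, following that note, the flow for the v1.2 CLI would be along these lines (the credential values are placeholders):
unset AWS_B64ENCODED_CREDENTIALS        # only needed if it was set for v1.1
export AWS_ACCESS_KEY_ID=<your-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>
export AWS_REGION=us-east-1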
02-deploy-aws-mgmt-cluster.sh still refers to the legacy CloudFormation stack approach and won't work for new users who have never executed this process with v1.1.
We should consider either:
The current AVI setup uses the AVI Controller IP and a self-signed certificate for that server. We want to align this with the rest of the lab and:
tac.bitnami is more aligned with the services available for TAC than charts.bitnami. The latter has a more extensive inventory of services, which could cause confusion when shown to customers as part of our TAC sales motion.
Fix typos and move it to the shared services cluster.
The script create-all-aws.sh exists to create the baseline 3 clusters in one shot, but there is no lab supporting it. This could be added as a separate doc. And then on the main Readme.md, there could be a section for alternative bootstrap options. The helm chart baseline install lab could be referenced there too.
The Gitlab helm install is currently using Contour, which doesn't really support non-HTTP workloads (TCP ports). Some other apps, like Argo/Flux, leverage ssh for git integration, and port 22 isn't exposed.
The workload-cluster deployment lab was created, but not referenced in the primary readme.
Hi,
I have followed the lab guide and deployed all the steps successfully.
But after a few hours, I lose the connection to my kubernetes cluster with the following error:
Unable to connect to the server: failed to refresh token: oauth2: cannot fetch token: 500 Internal Server Error
Response: {"error":"server_error"}
I can get it back to normal by downloading a new kubeconfig.
Could you help me troubleshoot this, or configure it to extend or auto-refresh the token?
Thanks
If you have more than one hosted zone in AWS, it will fail to propagate the challenge, because the zone lookup is ambiguous.
Status:
  Presented: false
  Processing: true
  Reason: Failed to determine Route 53 hosted zone ID: Zone homelab.arg-pivotal.com. not found in Route 53 for domain _acme-challenge.dex.tkg-mgmt.tkg-vsphere-lab.homelab.arg-pivotal.com.
  State: pending
I will look at adding the hosted zone ID to the template for the ClusterIssuer.
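A sketch of what that could look like, assuming a cert-manager ClusterIssuer with a Route 53 dns01 solver (the zone ID, email, and apiVersion are illustrative and depend on the cert-manager version in use):
cat > clusterissuer-route53.yaml <<'EOF'
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - dns01:
        route53:
          region: us-east-1
          hostedZoneID: Z2EXAMPLE123   # pins the zone, removing the ambiguity
EOF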
As of TKG 1.1, we have the ability to deploy workload clusters to an existing VPC. We would benefit from demonstrating how to add the shared services cluster and workload cluster to the VPC created when deploying the management cluster.
Currently all instructions and scripts expect params.yaml to be at the root. This is not very friendly when you maintain separate params.yaml files for multiple deployments, like vSphere and AWS.
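One hedged option (the variable name is an assumption): have every script honor an environment override that defaults to the current behavior:
PARAMS_YAML="${PARAMS_YAML:-params.yaml}"
CLUSTER_NAME=$(yq r "$PARAMS_YAML" shared-services-cluster.name)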
The docs don't explain how you get ~/.tkg/config.yaml.
AWS_B64ENCODED_CREDENTIALS=$(yq r ~/.tkg/config.yaml AWS_B64ENCODED_CREDENTIALS)
Just like with #118, when you've got a custom URL and issuer on Okta and use Let's Encrypt for certs on it, Harbor will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach. Waiting on #108 to be complete before submitting a PR on this one.
I'll be using the same approach to this as #119, leveraging the overlay from #112 and the method from #115.
This will be done:
Consider bumping the TKG version to 1.1.3. This primarily impacts the vSphere scripts, as there are direct references to OVA versions. However, due to the incompatibility between TMC and TKG 1.1.3 for cluster health, perhaps we should wait until that is resolved.
Found cases where the DNS challenge fails with Google Cloud DNS, where the Cloud DNS solver does not scan zones "below" the TLD. Recreating the cert-manager pods solves the issue, as suggested here.
However, this comment seems to indicate there are ways to fix the configuration of the ClusterIssuer to avoid the problem entirely.
In order to detach a TMC cluster that was attached, you have to use delete; additionally, the -m and -p options are required, though the CLI doesn't say so. Without those options you get an error message such as the one below:
x rpc error: code = InvalidArgument desc = invalid DeleteClusterRequest.FullName: embedded message failed validation | caused by: invalid FullName.ProvisionerName: value must be a valid Tanzu name | caused by: a name cannot be empty or be longer than 63 characters
e.g. the correct command:
tmc cluster delete $CLUSTER_NAME --force -m attached -p attached
"attached" is what tmc cluster list will show for both MANAGEMENTCLUSTER and PROVISIONER.
If you get YQ failures like this:
andrew@ubuntu-jump:~/tkg/tkg-lab$ ./scripts/deploy-workload-cluster.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.worker-replicas)
12:32:55 main [ERRO] open /home/andrew/.tkg/config.yaml: permission denied
You may have an issue with the version of yq installed. It appears that if it was installed on Ubuntu via snap, there is an issue:
andrew@ubuntu-jump:~/tkg/tkg-lab$ which yq
/snap/bin/yq
To remedy this, install it via apt instead:
sudo snap remove yq
sudo add-apt-repository ppa:rmescandon/yq
sudo apt-get update
sudo apt install yq -y
Log out and log in again.
Upgrade Harbor Lab to use the new Harbor Extension included in TKG 1.2 and the new concept of Shared Services.
Implementation will leverage the Envoy VIP, so we will only install the tanzu-registry-webhook and not the tkg-connectivity-operator, as per the documentation.
ExternalDNS makes Kubernetes resources discoverable via public DNS servers. It retrieves a list of resources (Services, Ingresses, etc.) from the Kubernetes API to determine a desired list of DNS records.
ExternalDNS eliminates the scripting we have today to create/delete DNS entries in Route 53.
Changes:
Create a policy that ExternalDNS uses (see the sketch after this list).
Use that policy as part of clusterawsadm so that it can be added to the same role.
Deploy ExternalDNS.
Modify the Contour envoy service to add an annotation so that a wildcard entry is created once that service is up. ExternalDNS does not support the HTTPProxy CRD (it does support IngressRoute), but once the wildcard entry for the envoy service is in place, things will work as they do today.
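A sketch of the policy step, using the commonly documented ExternalDNS Route 53 permissions (the policy name and file are illustrative):
cat > external-dns-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["route53:ChangeResourceRecordSets"],
      "Resource": ["arn:aws:route53:::hostedzone/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["route53:ListHostedZones", "route53:ListResourceRecordSets"],
      "Resource": ["*"]
    }
  ]
}
EOF
aws iam create-policy --policy-name external-dns-route53 \
  --policy-document file://external-dns-policy.json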
Work for this issue has already started as part of tkg-lab <> tkg-hol branch.
This still needs to:
Would like to see kubeapps in the lab. Certainly in the workload cluster and perhaps in the shared services cluster.
A few issues:
Please add steps to set up clusterawsadm:
https://github.com/kubernetes-sigs/cluster-api-provider-aws/releases
SSH_KEY_FILE_NAME=$MANAGEMENT_CLUSTER_ENVIRONMENT_NAME-ssh.pem
MANAGEMENT_CLUSTER_ENVIRONMENT_NAME is not set
Reference implementation: http://tech.paulcz.net/kubernetes-cookbook/gcp/gcp-external-dns/