
tkg-lab's Issues

Incorporate PSPs

Since PSPs are likely to exist in clusters provisioned from vSphere 7 or TMC (if we try to do the labs this way), we should set up a PSP and cluster roles/bindings that exemplify what needs to be added in order to run the lab. At a minimum, we could apply a privileged PSP to certain namespaces: (projectcontour|tanzu-system-ingress), vmware-system-tmc, and (wavefront|tanzu-observability-saas).

Perhaps we only need to do this if we support a variation of the overall lab that uses vSphere 7, but it might also be worth a separate topic documented up front with general guidance on using PSPs.
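A minimal sketch of what that could look like, assuming the privileged PSP that ships with vSphere 7 clusters is named vmware-system-privileged (adjust the PSP name and the namespace list to your environment):

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-privileged
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["vmware-system-privileged"]
  verbs: ["use"]
EOF

# Grant the PSP to all service accounts in the namespaces the lab uses
for ns in tanzu-system-ingress vmware-system-tmc tanzu-observability-saas; do
  kubectl -n "$ns" create rolebinding psp-privileged \
    --clusterrole=psp-privileged --group="system:serviceaccounts:$ns"
done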

Remove Google Cloud DNS from Primary Flow

I think Google Cloud DNS should be removed from the primary flow of the labs, with Route 53 chosen as the default option. We could add a separate doc page showing how to swap in Google Cloud DNS in place of Route 53.

Argo CD Lab Improvements

  • Should be deploying to the workload cluster
  • Create a service account for argo in the cluster (see the sketch after this list)
  • Kustomizations seem bloated and could be reduced to only the required changes over base
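For the service account item, a minimal sketch (names are illustrative, and the role should be narrowed from cluster-admin for real use):

# Hypothetical names; Argo CD's own "argocd cluster add" creates a similar account
kubectl -n kube-system create serviceaccount argocd-deployer
kubectl create clusterrolebinding argocd-deployer \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:argocd-deployer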

Move KubeApps to Workload Cluster

My original design for the demo would leverage TSM and make connections between clusters viable without exposing external ports. Since TSM won't be added to the demo yet, Kubeapps should be on the cluster with the deployed applications so that you don't have to use a LoadBalancer, etc., to connect.

Check consistency on script params vs env vars

Some scripts (contour, gangway) receive the cluster name and fqdn as params (extracted from params.yaml inline).
Some others (dex, elastic) receive no params and extract the info from the params.yaml inside the script.

One reason for this is that contour and gangway are deployed in many clusters, while dex and elastic are only deployed in one specific cluster.

We may want to make all scripts consistent (like the dex one), favoring cleanliness in the readme over reusability of the scripts.
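For reference, the two patterns look roughly like this (script names and params.yaml keys are illustrative, not exact):

# Pattern 1: caller passes values extracted from params.yaml inline (contour, gangway)
./scripts/generate-and-apply-contour-yaml.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.ingress-fqdn)

# Pattern 2: script reads params.yaml itself (dex, elastic)
CLUSTER_NAME=$(yq r params.yaml management-cluster.name)
DEX_FQDN=$(yq r params.yaml management-cluster.dex-fqdn)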

Create Workload Cluster fails to add context

I failed when adding the Shared Services cluster to TMC:

Workload cluster 'tkg-shared' created

storageclass.storage.k8s.io/aws-sc unchanged
ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
error: no context exists with the name: "tkg-shared-admin@tkg-shared"
ubuntu@ip-172-31-39-146:~/tkg-lab$ k config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin

But then I noticed the script to create it doesn't call "tkg get credentials". I ran it manually:

ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get cluster
NAME         NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES
tkg-shared   default    running  1/1           2/2      v1.18.2+vmware.1
ubuntu@ip-172-31-39-146:~/tkg-lab$ tkg get credentials tkg-shared
Credentials of workload cluster 'tkg-shared' have been saved
You can now access the cluster by switching the context to 'tkg-shared-admin@tkg-shared'
ubuntu@ip-172-31-39-146:~/tkg-lab$ kubectl config get-contexts
CURRENT   NAME                                 CLUSTER                              AUTHINFO                             NAMESPACE
          kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg   kind-tkg-kind-bqb1cdt3pmbpi3c12tjg
*         tkg-mgmt-admin@tkg-mgmt              tkg-mgmt                             tkg-mgmt-admin
          tkg-shared-admin@tkg-shared          tkg-shared                           tkg-shared-admin

ubuntu@ip-172-31-39-146:~/tkg-lab$ ./scripts/tmc-attach.sh $(yq r params.yaml shared-services-cluster.name)
Switched to context "tkg-shared-admin@tkg-shared".
✔ cluster "gregoryan-tkg-shared" created successfully
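A minimal fix would be for the workload cluster creation script to fetch the credentials right after the cluster comes up, before anything tries to switch context; a sketch:

CLUSTER_NAME=$1
# (existing 'tkg create cluster' call for $CLUSTER_NAME happens here)
# Pull the admin kubeconfig locally so later scripts can switch to the new context
tkg get credentials "$CLUSTER_NAME"
kubectl config use-context "$CLUSTER_NAME-admin@$CLUSTER_NAME"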

Wavefront values.yaml file collecting tech debt

The wavefront values file wf.yaml has values set beyond what is intentionally overridden, accumulating drift from the chart defaults. The clearest example is the pinned container versions.

Instead, the values file should only contain what is explicitly set.

From what I can tell, that only includes the following:

  • kubeStateMetrics.enabled=true
  • proxy.ports (all types)

collector.discovery.config is defined twice, which calls into question whether it works at all. Recommend removing the second key and leaving this as an example of how to do discovery.
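A trimmed wf.yaml, assuming those are the only intentional overrides (the exact proxy port keys should be carried over from the current file and checked against the chart's values.yaml):

cat > wf.yaml <<'EOF'
kubeStateMetrics:
  enabled: true
proxy:
  # port overrides (all types) copied from the existing wf.yaml;
  # key names must match the chart's values.yaml
EOF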

Enhance scripts to wait for pods Running

We could add another while loop after applying the deployment, the same way we wait for the certs, instead of asking users in the instructions to check afterwards.
Ideally we should implement it with a timeout.
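One way to get the wait plus a built-in timeout (the namespace here is a placeholder) is kubectl wait rather than a hand-rolled loop:

# Wait up to 5 minutes for every pod in the namespace to report Ready
kubectl wait --for=condition=Ready pod --all \
  -n tanzu-system-ingress --timeout=300s \
  || { echo "pods not Ready after 5 minutes" >&2; exit 1; }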

Make server a Env Var for ArgoCD Lab

One suggestion for the ArgoCD lab when you commit: you may want to try externalizing the --dest-server into an env var that is easily reusable in the docs.

e.g.: SERVER=$(argocd cluster list -o json | jq -r '.[0].server')

Then in the docs your commands will work exactly as-is:
argocd app create fortune-app-prod \
  --repo https://github.com/Pivotal-Field-Engineering/tkg-lab.git \
  --revision argocd-integration-exercise --path argocd/production \
  --dest-server $SERVER \
  --dest-namespace production \
  --sync-policy automated

Runaway Contour App Changes

Contour is generating tons of updates (see kubectl get cm -n tanzu-system-ingress). I suspect this started happening with the TKG 1.2 updates, given the kapp-controller. I also suspect that adding the annotation to the envoy service after the fact causes a condition where the two are constantly fighting each other. Recommend creating external-dns ahead of time and then using an overlay to set the annotation on the envoy service when deploying the contour extension.
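A rough sketch of such an overlay (the domain is a placeholder, and how the overlay gets wired into the contour extension depends on the lab's extension tooling):

cat > overlay-envoy-dns.yaml <<'EOF'
#@ load("@ytt:overlay", "overlay")
#@overlay/match by=overlay.subset({"kind": "Service", "metadata": {"name": "envoy"}})
---
metadata:
  #@overlay/match missing_ok=True
  annotations:
    #@overlay/match missing_ok=True
    external-dns.alpha.kubernetes.io/hostname: "*.example.com"
EOF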

Dex fails with custom Okta endpoints with Let's Encrypt certificates

When you've got a custom URL and issuer on Okta and use Let's Encrypt for certs on it, Dex will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach.

I've got a working fix that depends on #115 and will be submitting a PR once this issue is in.

Offer a Concourse/Helm Lab

Given our modular approach, it would be good to offer additional products, like Concourse. I will investigate adding it as an additional lab to the shared services cluster.

AWS Install references v1.1 method for cloudformation

Per the docs:

NOTE: If in Tanzu Kubernetes Grid v1.1 you set AWS_B64ENCODED_CREDENTIALS as an environment variable, unset the variable before deploying management clusters with v1.2 of the CLI. In v1.2 and later, Tanzu Kubernetes Grid calculates the value of AWS_B64ENCODED_CREDENTIALS automatically. To enable Tanzu Kubernetes Grid to calculate this value, you must set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION variables in .tkg/config.yaml or as environment variables. See Create the Cluster Configuration File in Deploy Management Clusters to Amazon EC2 with the CLI.

02-deploy-aws-mgmt-cluster.sh still refers to the legacy CloudFormation stack approach and won't work for new users who have never executed this process in v1.1.
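Under the v1.2 flow, the script should only need the individual credential variables set (values below are placeholders) and must not export AWS_B64ENCODED_CREDENTIALS itself:

export AWS_ACCESS_KEY_ID=AKIAEXAMPLE                 # placeholder
export AWS_SECRET_ACCESS_KEY=examplesecretkey        # placeholder
export AWS_REGION=us-east-1
# Do NOT set AWS_B64ENCODED_CREDENTIALS; v1.2 computes it from the values above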

Re-structure TMC integrations

We should consider either:

  1. Removing TMC from the main flow and adding the steps as a single lab or set of labs. If we do this, all of the labs up to deployment of the workload cluster would require manual RBAC.
  2. Creating two flows, with TMC and without. In the former, TMC should be connected early on and used for data protection, observability, inspection, and RBAC.

Use FQDN and trusted CA for AVI Controller

The current AVI setup uses the AVI Controller IP and a self-signed certificate for that server. We want to align this with the rest of the lab and:

  • Configure an FQDN to access the AVI Controller
  • When we create the AVI Controller server certificate, use one issued by a trusted CA

kubeapps to use tac.bitnami instead of charts.bitnami

tac.bitnami is more aligned with the services available for TAC than charts.bitnami. The latter has a more extensive inventory of services, which could cause confusion when shown to customers as part of our TAC sales motion.
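One way to point Kubeapps at TAC is to register it as an AppRepository (the URL and namespace below are illustrative; use your org's actual TAC endpoint and the namespace where Kubeapps is installed):

kubectl apply -f - <<EOF
apiVersion: kubeapps.com/v1alpha1
kind: AppRepository
metadata:
  name: tac
  namespace: kubeapps
spec:
  url: https://example.tac.bitnami.com/index   # placeholder TAC chart repo URL
EOF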

Create a lab for the Create Full Baseline Lab Configuration in one script.

The script create-all-aws.sh exists to create the three baseline clusters in one shot, but there is no lab supporting it. This could be added as a separate doc. Then, on the main Readme.md, there could be a section for alternative bootstrap options. The Helm chart baseline install lab could be referenced there too.

Gitlab integration needs ssh support

The GitLab Helm install is currently using Contour, which doesn't really support non-HTTP workloads (TCP ports). Some other apps, like Argo/Flux, leverage SSH for git integration, and port 22 isn't exposed.
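One option, since Contour/Envoy only fronts HTTP(S) here, is a dedicated LoadBalancer Service for the SSH port. A sketch, with the selector label and target port as assumptions to be checked against the GitLab chart's gitlab-shell service:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: gitlab-shell-ssh
  namespace: gitlab
spec:
  type: LoadBalancer
  selector:
    app: gitlab-shell          # assumed label from the GitLab chart
  ports:
  - name: ssh
    port: 22
    targetPort: 2222           # assumed gitlab-shell container port
EOF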

Oauth 2: Failed to refresh token

Hi,

I have followed the lab guide and successfully deployed all the steps.
But after a few hours, I lose the connection to my Kubernetes cluster with the following error:

Unable to connect to the server: failed to refresh token: oauth2: cannot fetch token: 500 Internal Server Error
Response: {"error":"server_error"}

I can get it back to normal after downloading a new kubeconfig.
Could you help me troubleshoot, or configure a longer token lifetime or automatic token refresh?

Thanks

AWS Cert-manager ClusterIssuer requires HostedZoneID if >1 zones are in use.

If you have more than one hosted zone in AWS, it will fail to propagate the challenge, because it is ambiguous.
Status:
  Presented:   false
  Processing:  true
  Reason:      Failed to determine Route 53 hosted zone ID: Zone homelab.arg-pivotal.com. not found in Route 53 for domain _acme-challenge.dex.tkg-mgmt.tkg-vsphere-lab.homelab.arg-pivotal.com.
  State:       pending

I will look at adding it to the ClusterIssuer template.
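A sketch of the ClusterIssuer with the zone pinned (the apiVersion depends on the cert-manager version installed, and the values are placeholders):

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - dns01:
        route53:
          region: us-east-1
          hostedZoneID: Z0123456789EXAMPLE   # pinning the zone removes the ambiguity
EOF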

Update AWS deployment to leverage single VPC

As of TKG 1.1, we have the ability to deploy workload clusters to an existing VPC. We would benefit from demonstrating how to add the shared services cluster and workload cluster to the VPC created when deploying the management cluster.
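Assuming the standard TKG-on-AWS variables (verify the names against the docs for your TKG version), the later clusters could target the management cluster's VPC roughly like this:

# Reuse the VPC/subnets created for the management cluster (IDs are placeholders)
export AWS_VPC_ID=vpc-0123456789abcdef0
export AWS_PUBLIC_SUBNET_ID=subnet-0aaaaaaaaaaaaaaaa
export AWS_PRIVATE_SUBNET_ID=subnet-0bbbbbbbbbbbbbbbb
tkg create cluster tkg-shared --plan=dev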

Harbor OIDC login fails with custom Okta endpoint using Let's Encrypt certificate

Just like with #118, when you've got a custom URL and issuer on Okta and use Let's Encrypt for certs on it, Harbor will fail because LE isn't a trusted CA in the image it's built from. I originally addressed this in #105, but #100 switched to using the new extension mechanism and invalidated that approach. Waiting on #108 to be complete before submitting a PR on this one.

I'll be using the same approach to this as #119, leveraging the overlay from #112 and the method from #115.

Need to add jaeger integration for distributed tracing

This will be done:

  1. Assume acme-fitness is running normally in the workload cluster
  2. Update the Wavefront proxy (via helm)
  3. Update the deployment YAMLs for acme-fitness to point to the Wavefront proxy for tracing
  4. Documentation

Bump TKG version to 1.1.3

Consider bumping the TKG version to 1.1.3. This primarily impacts the vSphere scripts, as they reference OVA versions directly. However, due to the incompatibility between TMC and TKG 1.1.3 with respect to health, perhaps we should wait until that is resolved.

DNS challenge fails with Google CloudDNS

Found cases where the DNS challenge fails with Google Cloud DNS, where the Cloud DNS certbot does not scan zones "below" the TLD. Recreating the cert-manager pods solves the issue, as suggested here.
However, this comment seems to indicate there are ways to fix the configuration of the ClusterIssuer to avoid the problem entirely.
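For reference, the workaround amounts to the following (adjust the namespace to wherever cert-manager runs in your cluster):

# Force cert-manager to restart and re-evaluate the DNS zones
kubectl delete pods -n cert-manager --all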

tmc delete requires -m and -p options

In order to detach a TMC cluster that was attached, you have to use delete, and the -m and -p options are also required even though the CLI doesn't say so. Without those options you get an error message such as the one below:

x rpc error: code = InvalidArgument desc = invalid DeleteClusterRequest.FullName: embedded message failed validation | caused by: invalid FullName.ProvisionerName: value must be a valid Tanzu name | caused by: a name cannot be empty or be longer than 63 characters 

E.g., the correct command:

tmc cluster delete $CLUSTER_NAME --force -m attached -p attached

"attached" is what tmc cluster list will show for both MANAGEMENTCLUSTER and PROVISIONER.

YQ Failures

If you get YQ failures like this:

andrew@ubuntu-jump:~/tkg/tkg-lab$ ./scripts/deploy-workload-cluster.sh \
  $(yq r params.yaml shared-services-cluster.name) \
  $(yq r params.yaml shared-services-cluster.worker-replicas)
12:32:55 main [ERRO] open /home/andrew/.tkg/config.yaml: permission denied

You may have an issue with the version of YQ installed. It appears that if installed on Ubuntu via SNAP, there is an issue:

andrew@ubuntu-jump:~/tkg/tkg-lab$ which yq
/snap/bin/yq

To remedy this, remove the snap and install it via apt instead:

sudo snap remove yq
sudo add-apt-repository ppa:rmescandon/yq
sudo apt-get update
sudo apt install yq -y

Then log out and log in again.

Upgrade Harbor Lab to 1.2

Upgrade Harbor Lab to use the new Harbor Extension included in TKG 1.2 and the new concept of Shared Services.
Implementation will leverage the Envoy VIP so we will only install the tanzu-registry-webhook and not the tkg-connectivity-operator as per the documentation.

External DNS

ExternalDNS makes Kubernetes resources discoverable via public DNS servers. It retrieves a list of resources (Services, Ingresses, etc.) from the Kubernetes API to determine a desired list of DNS records.

External DNS eliminates the scripting we have today to create/delete DNS entries in Route 53.

Changes:

  • Create an IAM policy that external-dns uses (see the sketch below)
  • Use that policy as part of clusterawsadm so that it can be added to the same role
  • Deploy external-dns
  • Modify the Contour envoy service to add an annotation that creates a wildcard entry once that service is created. External DNS does not support the HTTPProxy CRD but does support IngressRoute; however, once the wildcard entry is added for the envoy service, it will work as it does today.

Work for this issue has already started as part of tkg-lab <> tkg-hol branch.
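The IAM policy referenced in the first bullet is essentially the standard Route 53 policy from the external-dns docs; a sketch of creating it (the policy name is arbitrary):

cat > external-dns-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["route53:ChangeResourceRecordSets"],
      "Resource": ["arn:aws:route53:::hostedzone/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["route53:ListHostedZones", "route53:ListResourceRecordSets"],
      "Resource": ["*"]
    }
  ]
}
EOF
aws iam create-policy --policy-name tkg-lab-external-dns \
  --policy-document file://external-dns-policy.json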

Data Protection Still Needs Work

This still needs the following:

  • deploy-all-vsphere.sh script
  • delete the velero.sh script
  • update the following readmes
    • docs/mgmt-cluster/10_velero_mgmt.md
      • Need to write readme regarding setting up a data protection account in TMC
    • docs/shared-services-cluster/09_velero_ssc.md
