selkies-operator's Introduction

Selkies - Stateful Workload Operator

Selkies is a platform built on GKE to orchestrate per-user stateful workloads.

Quick start

Assumptions

  • You are a member of a Google Cloud organization.
    • This is required for setup/scripts/create_oauth_client.sh to use gcloud alpha iap oauth-brand commands, because these implicitly operate on organization-internal brands. For more information, see this guide.
  • You are granted the Owner role in a project in that organization.
  • You have gcloud installed in your environment.

Steps

The steps below will create the infrastructure for the app launcher. You should deploy to a new project.

  1. Clone the source repository:

    git clone -b master https://github.com/selkies-project/selkies.git
    cd selkies
  2. Configure gcloud (replace XXX & us-west1 with your project ID & preferred region):

    export PROJECT_ID=XXX
    export REGION=us-west1
    gcloud config set project ${PROJECT_ID?}
    gcloud config set compute/region ${REGION?}
  3. Enable the required GCP project services:

    gcloud services enable \
        --project ${PROJECT_ID?} \
        cloudresourcemanager.googleapis.com \
        compute.googleapis.com \
        container.googleapis.com \
        cloudbuild.googleapis.com \
        servicemanagement.googleapis.com \
        serviceusage.googleapis.com \
        stackdriver.googleapis.com \
        secretmanager.googleapis.com \
        iap.googleapis.com
  4. Grant the cloud build service account permissions on your project:

    PROJECT_NUMBER=$(
      gcloud projects describe ${PROJECT_ID?} \
        --format='value(projectNumber)'
    ) && \
      CLOUDBUILD_SA="${PROJECT_NUMBER?}@cloudbuild.gserviceaccount.com" && \
      gcloud projects add-iam-policy-binding ${PROJECT_ID?} \
        --member serviceAccount:${CLOUDBUILD_SA?} \
        --role roles/owner && \
      gcloud projects add-iam-policy-binding ${PROJECT_ID?} \
        --member serviceAccount:${CLOUDBUILD_SA?} \
        --role roles/iam.serviceAccountTokenCreator
  5. Deploy with Cloud Build:

    ACCOUNT=$(gcloud config get-value account) && \
      gcloud builds submit \
        --project=${PROJECT_ID?} \
        --substitutions=_USER=${ACCOUNT?},_REGION=${REGION?}
  6. Deploy sample app:

    (cd examples/jupyter-notebook/ && \
      gcloud builds submit \
        --project=${PROJECT_ID?} \
        --substitutions=_REGION=${REGION?})
  7. Connect to the App Launcher web interface at the URL output below:

    echo "https://broker.endpoints.${PROJECT_ID?}.cloud.goog/"
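To confirm that the services from step 3 are enabled before deploying, a check along the following lines may help. This is a minimal sketch and not part of the Selkies scripts: `missing_services` is a hypothetical helper that reads enabled service names on stdin and prints any required names that are absent.

```shell
# Hypothetical helper (not part of the Selkies repo): print each required
# service (given as arguments) that does not appear in the list of enabled
# services read from stdin, one name per line.
missing_services() {
  enabled="$(cat)"
  for svc in "$@"; do
    # -x matches whole lines, -F treats the name as a fixed string.
    printf '%s\n' "$enabled" | grep -qxF "$svc" || printf '%s\n' "$svc"
  done
}

# Usage (requires gcloud and PROJECT_ID from step 2):
#   gcloud services list --enabled --project "${PROJECT_ID?}" \
#       --format='value(config.name)' |
#     missing_services container.googleapis.com iap.googleapis.com
```

No output means every listed service is enabled.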

Troubleshooting

  • If the initial cloud build fails with the message Step #2 - "create-oauth-client": ERROR: (gcloud.alpha.iap.oauth-brands.list) INVALID_ARGUMENT: Request contains an invalid argument., it is most likely due to running as a user that is not a member of the Cloud Identity Organization. See the assumption described above.

  • If the initial cloud build fails with the message Step #2 - "create-oauth-client": ERROR: (gcloud.alpha.iap.oauth-clients.create) FAILED_PRECONDITION: Precondition check failed., it is most likely due to reusing a project that already had its OAuth consent screen set to "External", which cannot be changed via gcloud. Click the "MAKE INTERNAL" button here in your project.

  • If a wget step fails, retry the same command. Some third-party artifact URLs are flaky (due to globally-rate-limited hosts).

  • If your region only has 500 GB of Persistent Disk SSD quota, run the following to reduce the node pool disk sizes to 100 GB each. Keep in mind that the number of apps and image pull performance will be affected.

    cat - > selkies-min-ssd.auto.tfvars <<EOF
    default_pool_disk_size_gb = 100
    turn_pool_disk_size_gb = 100
    gpu_cos_pool_disk_size_gb = 100
    tier1_pool_disk_size_gb = 100
    EOF
    gcloud secrets create broker-tfvars-selkies-min-ssd \
        --replication-policy=automatic \
        --data-file selkies-min-ssd.auto.tfvars
  • If the load balancer never comes online and you receive 500 errors after the deployment has completed for at least 30 minutes, the autoneg controller annotation may need to be reset:

    gcloud container clusters get-credentials broker-${REGION?}
    ./setup/scripts/fix_autoneg.sh

selkies-operator's People

Contributors

danisla, deepak7093, jancvanb, mike-ensor, reisbel, robinpercy, videlanicolas

selkies-operator's Issues

Rename to `selkies-core`?

This would reinforce to users that this repo is not the entirety of "the Selkies project/platform", and it would make writing/reading/talking about this repo less ambiguous.

Remove TURN components

The TURN components have been externalized to the selkies-gstreamer repo and will be removed from the core repo.

This includes removing:

  • coturn images
  • coturn infra deployments
  • coturn manifests

Port project to vendor-agnostic Kubernetes

Now is the time to do this. Selkies has grown from a GCP-centric affiliate project into one with a much wider audience.
Other Kubernetes providers, including Vast.ai, RunPod, and CoreWeave, are watching and incorporating our projects.

Migrate images/gce-proxy to standalone repository

The GCE proxy is used by the vdi-vm example to create a localhost proxy that uses the GCE VM service account credentials to transparently authenticate to the pod broker.

This should be moved to a top-level selkies-project repo so it can be maintained separately and should remove most/all of the dependabot warnings.

TODO:

  • Copy images/gce-proxy to new repo in selkies-project
  • Remove cloud builder
  • Remove any other references in this repo.
  • Create GH workflow to automate image building.
  • Update selkies-examples/vdi-vm if necessary.

Failed to apply deployment

Brand new GCP project, after I run:

$ ACCOUNT=$(gcloud config get-value account) && \
  gcloud builds submit \
    --project=${PROJECT_ID?} \
    --substitutions=_USER=${ACCOUNT?},_REGION=${REGION?}

I get the following error:

Step #10 - "deploy-cluster-manifests-region": Step #1 - "deploy-manifests": error: unable to recognize "STDIN": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
Step #10 - "deploy-cluster-manifests-region": Step #1 - "deploy-manifests": Error: failed to apply deployment: failed to apply CustomResourceDefinition configuration file with name "istiocontrolplanes.install.istio.io" to cluster: failed to apply config from string: command to apply kubernetes config from string to cluster failed: exit status 1

I think the error is at https://github.com/selkies-project/selkies/blob/master/setup/manifests/deploy.sh:

log_cyan "Installing CRDs"
gke-deploy apply --project ${PROJECT_ID} --cluster ${CLUSTER_NAME} --location ${CLUSTER_LOCATION} --filename /opt/istio-operator/deploy/crds/istio_v1alpha2_istiocontrolplane_crd.yaml
gke-deploy apply --project ${PROJECT_ID} --cluster ${CLUSTER_NAME} --location ${CLUSTER_LOCATION} --filename base/pod-broker/crd.yaml

But I'm no expert. Will continue debugging this.
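The "no matches for kind CustomResourceDefinition in version apiextensions.k8s.io/v1beta1" message is what a cluster returns when it no longer serves that API version, which Kubernetes removed in 1.22. A minimal sketch for locating affected manifests follows. `find_v1beta1_crds` is a hypothetical helper, and the directory argument is whatever holds the manifests in your checkout or image.

```shell
# Hypothetical helper: list YAML files under a directory that still declare
# the apiextensions.k8s.io/v1beta1 API version.
find_v1beta1_crds() {
  dir="${1:-.}"
  grep -rlF --include='*.yaml' --include='*.yml' \
    'apiextensions.k8s.io/v1beta1' "$dir"
}

# Usage:
#   find_v1beta1_crds setup/manifests
# To see which apiextensions versions the cluster serves:
#   kubectl api-versions | grep apiextensions.k8s.io
```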

Any production dependencies on non-`master` branches?

I'm planning to delete every branch other than master, since they're all either behind (pull-secrets, dev), abandoned (app-launcher-pwa-1, dependabot/npm_and_yarn/images/gce-proxy/src/http-proxy-1.18.1), or replaced by a tag (v1.0.0). Do we know of any production/live environments that depend on those branches?

Switch from `master` to `main`?

main is becoming the new default branch name (it already is on GitHub), and now seems like a good time to switch, since we're already de-forking and revisiting our branching workflow.

I've already pushed a main branch at the master commit, so if we want to be cautious with this, we could

  1. switch the "default branch" on GitHub (for cloning & PRs) to main
  2. manually fast-forward master to meet main as PRs merge
  3. slowly update production references from master to main
  4. delete master when it feels safe

without needing to update all production references from master to main right away.
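Step 2 of this plan could be scripted roughly as follows. The sketch assumes a remote that has both branches, and `ff_master_to_main` is a hypothetical helper. Because the push is not forced, the remote rejects it unless master can be fast-forwarded to main.

```shell
# Hypothetical helper: fast-forward the remote's master branch to the
# remote's main branch. The push is deliberately not forced, so it fails
# if master has commits that main lacks.
ff_master_to_main() {
  remote="${1:-origin}"
  git fetch --quiet "$remote" || return 1
  git push --quiet "$remote" "refs/remotes/${remote}/main:refs/heads/master"
}

# Usage, from any clone with push access:
#   ff_master_to_main origin
```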

Migrate from istio self-managed to Managed ASM

Remove dependency on fixed istio releases in favor of Managed ASM release channel.

Some things to keep in mind:

  • support for EnvoyFilters to preserve OPA addon support.

This will reduce the manual dependencies and deployment complexity.

Use a simple HEAD request to resolve tags to digests

You're calling list tags in order to determine the digest of an image here.

It is much cheaper for the registry to serve a simpler HEAD /v2/<repo>/manifests/<tag> request instead of producing the entire tags list response.

In some situations (e.g. if you need to resolve 100 tags in a single repository), it might be cheaper to do this by listing tags, but you don't appear to be deduping these findImageTags calls by their repository. You could cache the response by repo and re-use that for subsequent checks, but I would guess that the HEAD calls will average out to be cheaper in almost any circumstance.

I maintain a library that does most of the things you're doing, e.g. google.List, google.NewEnvAuthenticator, and remote.Head should cover most of it.
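For illustration, the HEAD approach looks roughly like this from a shell. The Docker Registry HTTP API V2 returns the manifest digest in the Docker-Content-Digest response header. `digest_from_headers`, `resolve_digest`, and the `REGISTRY_TOKEN` variable are hypothetical names introduced for this sketch, and the Accept header lists only the Docker schema 2 manifest type.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of Docker-Content-Digest (header names are case-insensitive).
digest_from_headers() {
  tr -d '\r' |
    awk 'tolower($1) == "docker-content-digest:" { print $2; exit }'
}

# Hypothetical helper: resolve <registry>/<repo>:<tag> to a digest with a
# single HEAD request. REGISTRY_TOKEN is an assumed, optional bearer token.
resolve_digest() {
  url="https://$1/v2/$2/manifests/$3"
  accept='Accept: application/vnd.docker.distribution.manifest.v2+json'
  if [ -n "${REGISTRY_TOKEN:-}" ]; then
    curl -fsSI -H "$accept" \
      -H "Authorization: Bearer ${REGISTRY_TOKEN}" "$url"
  else
    curl -fsSI -H "$accept" "$url"
  fi | digest_from_headers
}

# Usage (registry, repository, and tag are placeholders):
#   resolve_digest gcr.io my-project/my-image latest
```

Only the response headers are transferred, so the registry never has to build the full tags list.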

Create GitHub workflow for building images

After the web and coturn cleanup is done (blocked by #24 and #25) build a workflow to build all of the images and push them to GHCR with versions.

Follow the same build philosophy as the selkies-gstreamer repo:
https://github.com/selkies-project/selkies-gstreamer/tree/master/.github/workflows

Preserve the cloudbuild image building to allow folks to build and self-host images.

Strategy:

  • The master and dev branches are built on every push, and the images are tagged with the branch name.
  • Pushing a tag builds the images, creates a release, and tags the images with latest.
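The tagging rules in this strategy can be expressed as a small function. This is a sketch only: `image_tags_for_ref` is a hypothetical name, the input is a full Git ref as GitHub Actions exposes it in GITHUB_REF, and emitting the pushed tag name alongside latest is an assumption based on the goal of versioned images.

```shell
# Hypothetical helper: print the image tags to apply for a given Git ref,
# one per line. Refs outside the strategy return a non-zero status.
image_tags_for_ref() {
  case "$1" in
    refs/heads/master | refs/heads/dev)
      # Branch builds are tagged with the branch name.
      printf '%s\n' "${1#refs/heads/}" ;;
    refs/tags/*)
      # Tag builds get the tag name (assumed) and latest.
      printf '%s\n' "${1#refs/tags/}" latest ;;
    *)
      return 1 ;;
  esac
}

# Usage inside a workflow step:
#   image_tags_for_ref "${GITHUB_REF}"
```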

Migrate cluster ingress to GKE Gateway API

Instead of creating the GCLB, Managed Certs and AutoNEG controller for the ingress tier, create the GCLB using the new Gateway API.

This will reduce the manual dependencies and deployment complexity.

Migrate image references to GHCR

Depends on #34

Convert all image references in base yamls and kustomizations to use the GHCR images.

  • base yamls should reference the ghcr.io image with the latest tag.
  • kustomization image patches should point to a specific tagged version.
  • update cloud build pipelines, build scripts etc to parameterize passing a target version (tag).

Ready for v1.1.0 or v2.0.0 release?

Is master currently stable? If so, I think we should release what we currently have as v1.1.0 or v2.0.0, depending on whether it contains breaking changes from v1.0.0. My motivation is that we haven't "released" since v1.0.0 (17 months ago), and I think we should aim to be on a new-minor-version-every-month-or-two release cycle this year. Additionally, with a recent release we could re-pin our quick start to a version, instead of pointing it at live master.

I created a v1.1.0 tag & pre-release as a preview of what this release could look like, but I'm happy to delete those.

What do you think?

Error creating Topic: googleapi: Error 409: Resource already exists in the project (resource=gcr)

While setting up a new GCP project I got the following error:

Step #4 - "deploy-infra-base": Step #1 - "terraform-apply": Error: Error creating Topic: googleapi: Error 409: Resource already exists in the project (resource=gcr).
Step #4 - "deploy-infra-base": Step #1 - "terraform-apply":
Step #4 - "deploy-infra-base": Step #1 - "terraform-apply":   on gcr.tf line 17, in resource "google_pubsub_topic" "gcr":
Step #4 - "deploy-infra-base": Step #1 - "terraform-apply":   17: resource "google_pubsub_topic" "gcr" {
Step #4 - "deploy-infra-base": Step #1 - "terraform-apply":
Step #4 - "deploy-infra-base": Step #1 - "terraform-apply":
Step #4 - "deploy-infra-base": Step #1 - "terraform-apply":

Not sure why Terraform complains about the resource already existing? Shouldn't Terraform make sure the configured state is reflected in the GCP project?
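Terraform only tracks resources recorded in its state, so a topic that already exists in the project but is absent from the state produces this 409 on create. One common remedy is to import the existing topic. The sketch below is an assumption about how that could be done here, not a documented Selkies procedure: `pubsub_topic_import_id` is a hypothetical helper, and the import would have to run against the same state backend the build uses.

```shell
# Hypothetical helper: build the import ID that the Google provider accepts
# for a google_pubsub_topic resource.
pubsub_topic_import_id() {
  printf 'projects/%s/topics/%s\n' "$1" "$2"
}

# Usage (resource address and topic name taken from the error output):
#   gcloud pubsub topics describe gcr --project "${PROJECT_ID?}"
#   terraform import google_pubsub_topic.gcr \
#       "$(pubsub_topic_import_id "${PROJECT_ID?}" gcr)"
```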

Your client has issued a malformed or illegal request.

After a successful deployment I try to access https://broker.endpoints.<PROJECT_ID>.cloud.goog/ and I get the following message:

Error: Bad Request
Your client has issued a malformed or illegal request.

This is error HTTP 400, so it's a client error? What am I doing wrong here?

Make image puller and cloudbuild compatible with regional gcr registries

Per the docs:

Hostname       Storage location
gcr.io         Stores images in data centers in the United States
asia.gcr.io    Stores images in data centers in Asia
eu.gcr.io      Stores images in data centers within member states of the European Union
us.gcr.io      Stores images in data centers in the United States
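One way to pick a hostname from this table is to map the GCP region prefix to the matching multi-region. The mapping below is an assumption for illustration: `gcr_host_for_region` is a hypothetical helper, and falling back to gcr.io for unlisted prefixes is a design choice rather than documented behavior.

```shell
# Hypothetical helper: choose a Container Registry hostname from a GCP
# region name such as us-west1 or europe-west1.
gcr_host_for_region() {
  case "$1" in
    us-*)     echo us.gcr.io ;;
    europe-*) echo eu.gcr.io ;;
    asia-*)   echo asia.gcr.io ;;
    *)        echo gcr.io ;;   # assumed default for other regions
  esac
}

# Usage:
#   REGISTRY_HOST="$(gcr_host_for_region "${REGION?}")"
```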
