ai-on-gke's Introduction

AI on GKE Assets

This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE).

Overview

Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers:

  • Infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale
  • Flexible integration with distributed computing and data processing frameworks
  • Support for multiple teams on the same infrastructure to maximize utilization of resources

Infrastructure

The AI-on-GKE application modules assume you already have a functional GKE cluster. If not, follow the instructions under infrastructure/README.md to provision a Standard or Autopilot GKE cluster.

.
├── LICENSE
├── README.md
├── infrastructure
│   ├── README.md
│   ├── backend.tf
│   ├── main.tf
│   ├── outputs.tf
│   ├── platform.tfvars
│   ├── variables.tf
│   └── versions.tf
├── modules
│   ├── gke-autopilot-private-cluster
│   ├── gke-autopilot-public-cluster
│   ├── gke-standard-private-cluster
│   ├── gke-standard-public-cluster
│   ├── jupyter
│   ├── jupyter_iap
│   ├── jupyter_service_accounts
│   ├── kuberay-cluster
│   ├── kuberay-logging
│   ├── kuberay-monitoring
│   ├── kuberay-operator
│   └── kuberay-serviceaccounts
└── tutorial.md

To deploy a new GKE cluster, update the platform.tfvars file with the appropriate values and then run the following Terraform commands:

terraform init
terraform apply -var-file platform.tfvars

Applications

The repo structure looks like this:

.
├── LICENSE
├── Makefile
├── README.md
├── applications
│   ├── jupyter
│   └── ray
├── contributing.md
├── dcgm-on-gke
│   ├── grafana
│   └── quickstart
├── gke-a100-jax
│   ├── Dockerfile
│   ├── README.md
│   ├── build_push_container.sh
│   ├── kubernetes
│   └── train.py
├── gke-batch-refarch
│   ├── 01_gke
│   ├── 02_platform
│   ├── 03_low_priority
│   ├── 04_high_priority
│   ├── 05_compact_placement
│   ├── 06_jobset
│   ├── Dockerfile
│   ├── README.md
│   ├── cloudbuild-create.yaml
│   ├── cloudbuild-destroy.yaml
│   ├── create-platform.sh
│   ├── destroy-platform.sh
│   └── images
├── gke-disk-image-builder
│   ├── README.md
│   ├── cli
│   ├── go.mod
│   ├── go.sum
│   ├── imager.go
│   └── script
├── gke-dws-examples
│   ├── README.md
│   ├── dws-queues.yaml
│   ├── job.yaml
│   └── kueue-manifests.yaml
├── gke-online-serving-single-gpu
│   ├── README.md
│   └── src
├── gke-tpu-examples
│   ├── single-host-inference
│   └── training
├── indexed-job
│   ├── Dockerfile
│   ├── README.md
│   └── mnist.py
├── jobset
│   └── pytorch
├── modules
│   ├── gke-autopilot-private-cluster
│   ├── gke-autopilot-public-cluster
│   ├── gke-standard-private-cluster
│   ├── gke-standard-public-cluster
│   ├── jupyter
│   ├── jupyter_iap
│   ├── jupyter_service_accounts
│   ├── kuberay-cluster
│   ├── kuberay-logging
│   ├── kuberay-monitoring
│   ├── kuberay-operator
│   └── kuberay-serviceaccounts
├── saxml-on-gke
│   ├── httpserver
│   └── single-host-inference
├── training-single-gpu
│   ├── README.md
│   ├── data
│   └── src
├── tutorial.md
└── tutorials
    ├── e2e-genai-langchain-app
    ├── finetuning-llama-7b-on-l4
    └── serving-llama2-70b-on-l4-gpus

Jupyter Hub

This repository contains a Terraform template for running JupyterHub on Google Kubernetes Engine. We've also included some example notebooks (under applications/ray/example_notebooks), including one that serves a GPT-J-6B model with Ray AIR (see here for the original notebook). To run these, follow the instructions at applications/ray/README.md to install a Ray cluster.

This jupyter module deploys the following resources, once per user:

  • JupyterHub deployment
  • User namespace
  • Kubernetes service accounts

Learn more about JupyterHub on GKE here

Ray

This repository contains a Terraform template for running Ray on Google Kubernetes Engine.

This module deploys the following, once per user:

  • User namespace
  • Kubernetes service accounts
  • Kuberay cluster
  • Prometheus monitoring
  • Logging container

Learn more about Ray on GKE here

Important Considerations

  • Make sure to configure the Terraform backend to use a GCS bucket so that Terraform state persists across environments; a minimal backend sketch is shown below.
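
A minimal sketch of such a backend block, assuming you have already created a state bucket (the bucket name and prefix below are placeholders):

terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket"  # placeholder: an existing GCS bucket you own
    prefix = "ai-on-gke/infrastructure"   # placeholder: path for the state within the bucket
  }
}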

Licensing


ai-on-gke's Issues

Use Ray interactive client in the example notebooks

The notebook used for RAG currently writes a Python file locally and uses it as an entry point to a Ray job using JobSubmissionClient.

Instead of this approach, we can use Ray's interactive remote client to run the steps remotely. This would allow the notebook to be broken down into smaller cells that can run in isolation, as opposed to a single job that executes everything at once.

The trade-off is that the remote interactive client may not be as reliable as using a Ray job.

gke-disk-image-builder: Configurable service account

Customer need: be able to use a preexisting GCP service account when running the image builder instance.

For example, what I did:

Instance: compute.Instance{
        Name:        fmt.Sprintf("%s-instance", name),
        MachineType: fmt.Sprintf("zones/%s/machineTypes/%s", req.Zone, req.MachineType),
+       ServiceAccounts: []*compute.ServiceAccount{
+               &compute.ServiceAccount{
+                       Email: req.ServiceAccount,
+                       Scopes: []string{
+                               "https://www.googleapis.com/auth/devstorage.read_only",
+                               "https://www.googleapis.com/auth/logging.write",
+                               "https://www.googleapis.com/auth/monitoring.write",
+                               "https://www.googleapis.com/auth/pubsub",
+                               "https://www.googleapis.com/auth/service.management.readonly",
+                               "https://www.googleapis.com/auth/servicecontrol",
+                               "https://www.googleapis.com/auth/trace.append",
+                       },
+               },
+       },

gke-disk-image-builder: Configurable shielded instance config

Customer requirement: need to be able to configure the compute instance to run using shielded settings:

What I needed:

+       ShieldedInstanceConfig: &compute.ShieldedInstanceConfig{
+               EnableSecureBoot:          true,
+               EnableVtpm:                true,
+               EnableIntegrityMonitoring: true,
+       },

GCSFuse permission issues when deploying RAG on existing cluster

The Ray pod is seeing this permission issue:

  Warning  FailedMount  55s (x7 over 88s)  kubelet            MountVolume.SetUp failed for volume "gcs-fuse-csi-ephemeral" : rpc error: code = PermissionDenied desc = the sidecar container failed with error: mountWithArgs: mountWithStorageHandle: fs.NewServer: create file system: SetUpBucket: Error in iterating through objects: Get "https://storage.googleapis.com/storage/v1/b/rag-data-andrewsy/o?alt=json&delimiter=&endOffset=&includeTrailingDelimiter=false&matchGlob=&maxResults=1&pageToken=&prefix=&prettyPrint=false&projection=full&startOffset=&versions=false": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist).
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
  https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to
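
The usual fix is to let the Ray pods' Kubernetes service account impersonate the Google service account that has access to the bucket. A hedged Terraform sketch, where the project, namespace, and service account names are all placeholders:

resource "google_service_account_iam_member" "ray_workload_identity" {
  # Placeholder names: replace with the Google SA that owns bucket access
  # and the Kubernetes SA used by the Ray pods.
  service_account_id = "projects/my-project/serviceAccounts/ray-sa@my-project.iam.gserviceaccount.com"
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:my-project.svc.id.goog[ray-namespace/ray-ksa]"
}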

Error when running gpt-j-online.ipynb example: AttributeError: 'JobConfig' object has no attribute 'py_driver_sys_path'

Following the steps to access JupyterLab, I installed Ray from the terminal with pip install -U "ray[air]" and then ran the gpt-j-online example. I got the following error:

ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 703, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 298, in start_specific_server
    runtime_env_config = job_config.get_proto_runtime_env_config()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/job_config.py", line 143, in get_proto_runtime_env_config
    return self.get_proto_job_config().runtime_env_info.runtime_env_config
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/job_config.py", line 115, in get_proto_job_config
    pb.py_driver_sys_path.extend(self.py_driver_sys_path)
AttributeError: 'JobConfig' object has no attribute 'py_driver_sys_path'

Kuberay TPU Webhook

Create a mutating admission webhook to inject TPU environment variables into pods started by the KubeRay operator.

gke-disk-image-builder: Configurable attached disk source image

Customer requirement: need the compute instance to run with a configurable disk source image.

Reference:

Disks: []*compute.AttachedDisk{
        &compute.AttachedDisk{
                AutoDelete: true,
@@ -136,9 +169,10 @@ func GenerateDiskImage(ctx context.Context, req Request) error {
                DeviceName: fmt.Sprintf("%s-bootable-disk", name),
                Mode:       "READ_WRITE",
                InitializeParams: &compute.AttachedDiskInitializeParams{
-                       DiskSizeGb:  req.DiskSizeGB,
-                       DiskType:    fmt.Sprintf("projects/%s/zones/%s/diskTypes/%s", req.ProjectName, req.Zone, req.DiskType),
-                       SourceImage: /* NEED THIS TO BE CONFIGURABLE */

Autopilot e2e tests are flaky due to GMP webhook

Sometimes the Autopilot e2e tests fail because of the GMP webhook:

Error: Internal error occurred: failed calling webhook "default.podmonitorings.gmp-operator.gke-gmp-system.monitoring.googleapis.com": failed to call webhook: Post "https://gmp-operator.gke-gmp-system.svc:443/default/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": No agent available

  with module.kuberay-monitoring[0].helm_release.gmp-engine,
  on ../../modules/kuberay-monitoring/main.tf line 16, in resource "helm_release" "gmp-engine":
  16: resource "helm_release" "gmp-engine" {

My guess is that this is because the Autopilot cluster has no nodes initially, so the webhook can't serve this request.
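
One possible mitigation, sketched here rather than taken from the module, is to delay the gmp-engine Helm release until the cluster has had time to schedule the GMP operator. The cluster reference below is a placeholder:

# Sketch only: pause before installing charts that depend on the GMP webhook.
resource "time_sleep" "wait_for_gmp_webhook" {
  depends_on      = [google_container_cluster.autopilot] # placeholder cluster reference
  create_duration = "180s"
}

# The existing helm_release "gmp-engine" would then add:
#   depends_on = [time_sleep.wait_for_gmp_webhook]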

style guide: inconsistent representations of placeholder values

We should try to be consistent with documentation style, which may warrant adding a style guide. For starters, I've noticed inconsistencies in how <placeholder> values are represented:

<placeholder>
PLACEHOLDER
%placeholder%
<compound placeholder>
<compound-placeholder>
<compound_placeholder>
$PLACEHOLDER

Angle brackets <placeholder> seem to be the most widely used (recommended by the k8s docs style guide). Capitalization and spacing might be optional, but my personal preference is uppercase with underscores: <COMPOUND_PLACEHOLDER>

JupyterHub service account is missing GCS related roles

When I tried to run JupyterHub backed by GCSFuse, I ran into this error from JupyterHub:

2024-03-22T18:48:02Z [Warning] MountVolume.SetUp failed for volume "gcs-fuse-csi-ephemeral" : rpc error: code = PermissionDenied desc = failed to get GCS bucket "gcsfuse-admin": googleapi: Error 403: jupyter-sa@<project-id>.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., forbidden

It looks like the predefined roles are missing GCS related roles:

# TODO review all permissions
variable "predefined_iam_roles" {
  description = "Predefined list of IAM roles to assign"
  type        = list(string)
  default = [
    "roles/compute.networkViewer",
    "roles/viewer",
    "roles/cloudsql.client",
    "roles/artifactregistry.reader",
    "roles/storage.admin",
    "roles/iam.serviceAccountAdmin",
    "roles/compute.loadBalancerServiceUser",
    "roles/iam.serviceAccountTokenCreator",
  ]
}
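
One way to unblock the mount without touching the project-level role list is a bucket-level grant for the JupyterHub service account; a sketch where the project ID is a placeholder and the bucket name comes from the error above:

resource "google_storage_bucket_iam_member" "jupyter_gcs_access" {
  bucket = "gcsfuse-admin"             # bucket named in the error above
  role   = "roles/storage.objectAdmin" # or roles/storage.objectViewer for read-only mounts
  member = "serviceAccount:jupyter-sa@my-project.iam.gserviceaccount.com" # placeholder project ID
}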

kubernetes ingress instead of kubernetes service to gather ingress data?

data "kubernetes_service" "jupyter-ingress" {

I got:

│ Error: services "jupyter-ingress" not found

│ with module.jupyterhub.data.kubernetes_service.jupyter-ingress,
│ on ../../modules/jupyter/main.tf line 184, in data "kubernetes_service" "jupyter-ingress":
│ 184: data "kubernetes_service" "jupyter-ingress" {

constantly, until I updated this data source to use data "kubernetes_ingress_v1".
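
For reference, the replacement data source that worked for me looks roughly like this (the namespace is a placeholder):

data "kubernetes_ingress_v1" "jupyter-ingress" {
  metadata {
    name      = "jupyter-ingress"
    namespace = "jupyterhub" # placeholder namespace
  }
}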

gke-disk-image-builder: Configurable instance network settings with limited permissions

Customer requirement: need to be able to pass configurable network/subnet when running the compute instance and run with limited permissions.

NOTE: To get this working myself, I needed to update the daisy dependency (I only did this locally - did not push changes). Daisy was assuming I was either creating the network or had the permissions to list subnets. In this environment, I only have permissions to: gcloud compute networks subnets list-usable --project SOME_OTHER_PROJECT (not a full list operation on subnets).

Tutorial: Finetuning Llama 7b on GKE using L4 GPUs invalid gcloud container clusters create

Multiple syntax errors in the following gcloud command:

gcloud container clusters create l4-demo --location ${REGION}
--workload-pool ${PROJECT_ID}.svc.id.goog
--enable-image-streaming --enable-shielded-nodes
--shielded-secure-boot --shielded-integrity-monitoring
--enable-ip-alias
--node-locations=${REGION}-a
--workload-pool=${PROJECT_ID}.svc.id.goog
--labels="ai-on-gke=l4-demo"
--addons GcsFuseCsiDriver

Resolve SQLAlchemy warning from RAG notebook

The RAG notebook spits out this warning from the SQLAlchemy library:

/tmp/ipykernel_71/582989623.py:7: MovedIn20Warning: The ``declarative_base()`` function is now available as sqlalchemy.orm.declarative_base(). (deprecated since: 2.0) (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
  Base = declarative_base()

Terraform variables best practices

Several READMEs recommend editing the Terraform variable declarations directly. I do not think this is a best practice for working with Terraform variables. The GCP Terraform best practices docs provide the following recommendation:

Store variables in a tfvars file
For root modules, provide variables by using a .tfvars variables file. For consistency, name variable files terraform.tfvars.

Don't specify variables by using alternative var-files or var='key=val' command-line options. Command-line options are ephemeral and easy to forget. Using a default variables file is more predictable.

Other relevant variable best practices are defined here, and discourage providing default values for required variables.
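
Concretely, that would mean each root module ships a default terraform.tfvars that users fill in, instead of editing variables.tf; a minimal sketch with hypothetical variable names:

# terraform.tfvars (variable names here are illustrative, not the module's actual inputs)
project_id   = "my-gcp-project"
cluster_name = "ml-cluster"
region       = "us-central1"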

Add GKE cluster requirements to the ray-on-gke README

TL;DR: the ray-on-gke README should be updated with the requirements for setting up the prerequisite GKE clusters.

  • GKE Cluster must have Workload Identity Enabled
  • GKE Cluster must have KubeRay Operator deployed

Here are the details on the errors when these prerequisites are NOT met. These are based on a GKE Standard cluster (1.27.3-gke.100).

module.service_accounts.google_project_iam_binding.monitoring-viewer: Creation complete after 7s [id=ie-raycluster-0f2aa542/roles/monitoring.viewer]
╷
│ Error: unable to build kubernetes objects from release manifest: resource mapping not found for name: "example-cluster-kuberay" namespace: "" from "": no matches for kind "RayCluster" in version "ray.io/v1alpha1"
│ ensure CRDs are installed first
│ 
│   with module.kuberay.helm_release.ray-cluster,
│   on modules/kuberay/kuberay.tf line 15, in resource "helm_release" "ray-cluster":
│   15: resource "helm_release" "ray-cluster" {
│ 
╵

This is a result of the KubeRay operator not being installed on the cluster.

The IAM role binding will also fail; this is the result of Workload Identity not being enabled on the cluster.
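
For the Workload Identity prerequisite, the README could show the cluster-level setting explicitly; a hedged Terraform sketch of a Standard cluster with it enabled (names and project are placeholders):

resource "google_container_cluster" "ray_prereq" {
  name               = "ray-cluster" # placeholder
  location           = "us-central1"
  initial_node_count = 1

  workload_identity_config {
    workload_pool = "my-project.svc.id.goog" # placeholder project
  }
}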

Cannot view logs/metrics for Ray in cloud monitoring

I'm calling Ray on a GKE cluster via the gpt-j-online.ipynb notebook and prediction is working as expected. I'm trying to get logging results in Logs Explorer as per the instructions, so I'm adding this to the filters:

resource.type="k8s_container"
resource.labels.cluster_name=%CLUSTER_NAME%
resource.labels.pod_name=%RAY_HEAD_POD_NAME%
resource.labels.container_name="fluentbit"

(substituting cluster_name and pod_name)

However, I don't get any results at all for the resource.labels.container_name="fluentbit" container.
Any suggestions?

Remove dependency on GPUs in CI

Our Cloud Build presubmit creates a GPU node pool. Node pool creation can fail if no GPUs are available. If possible, the clusters created as part of the presubmit should use only CPUs.

Update the README to reflect the correct sample dataset & surface this info earlier in the README

  1. Currently, https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/rag/README.md mentions that the example uses https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews as the input dataset. However, per #385, the example uses a "Netflix shows" dataset.

  2. To help users quickly understand what the sample application does, consider mentioning the example dataset upfront in the introductory section of the README. Currently, the dataset info is buried rather deep in the README.

gke-disk-image-builder does not work with hyperdisks

When attempting to use the gke-disk-image-builder tool with hyperdisk, we ran into the following error:

Nov 22 22:58:32 debian google_metadata_script_runner[2040]: startup-script-url: Device /dev/sdb does not exist. Please rerun the tool to try it again.

Steps to Reproduce:

go run ./cli --project-name=$PROJECT_NAME --image-name=triton-2309-py3-hd --zone=$ZONE --gcs-path=gs://$GCS_PATH/ --container-image='nvcr.io/nvidia/tritonserver:23.09-py3' --machine-type=h3-standard-88 --disk-type=hyperdisk-balanced

RAG cannot be deployed on an existing cluster due to CloudSQL requiring at least 1 private service connection

Trying to deploy RAG on an existing cluster results in this error:

Step #0 - "Apply blueprint": on .terraform/modules/jupyterhub.jupyterhub-workload-identity/modules/workload-identity/main.tf line 51
Step #0 - "Apply blueprint": 
Step #0 - "Apply blueprint": time="2024-04-02T14:44:14Z" level=error msg="error writing to GCS: error closing temp logfile: googleapi: got HTTP response code 503 with body: Service Unavailable"
Step #0 - "Apply blueprint": Error: Error, failed to create instance because the network doesn't have at least 1 private services connection. Please see https://cloud.google.com/sql/docs/mysql/private-ip#network_requirements for how to create this connection.
Step #0 - "Apply blueprint": time="2024-04-02T14:44:14Z" level=error msg="Error (exit code 1) running \"terraform apply -json /tmp/tfplan-2667148229/plan.out\". Stderr:\n"
Step #0 - "Apply blueprint": Error: Error, failed to create instance because the network doesn't have at least 1 private services connection. Please see https://cloud.google.com/sql/docs/mysql/private-ip#network_requirements for how to create this connection.
Step #0 - "Apply blueprint": error: Error, failed to create instance because the network doesn't have at least 1 private services connection. Please see https://cloud.google.com/sql/docs/mysql/private-ip#network_requirements for how to create this connection.
Step #0 - "Apply blueprint": 
Step #0 - "Apply blueprint": on .terraform/modules/cloudsql.cloudsql/modules/postgresql/main.tf line 54

I think this is because we only create the private service connection when creating new networks, which doesn't happen when running the RAG deployment on an existing cluster.
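
For existing networks, the missing private services connection can be created explicitly before the CloudSQL instance; a hedged sketch (the network self link is a placeholder):

resource "google_compute_global_address" "private_service_range" {
  name          = "cloudsql-private-range" # placeholder
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = "projects/my-project/global/networks/my-vpc" # placeholder network
}

resource "google_service_networking_connection" "private_service_connection" {
  network                 = "projects/my-project/global/networks/my-vpc" # placeholder network
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_service_range.name]
}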

RAG frontend is getting OOMKilled

Steps to reproduce:

  1. Scale down RAG frontend to 1 replica
  2. Send multiple prompts
  3. Run kubectl get pods to see that the frontend pod is restarted due to an OOM kill.

This is also possible to replicate when running multiple frontend replicas, but it just takes more prompts due to prompts being load balanced across the replicas.
