ibm / autopilot
A tool to detect infrastructure issues on cloud native AI systems
License: Apache License 2.0
@cmisale I tried going through a helm install on the 2-node dev cluster but kept running into the same issue: Error: Chart.yaml file is missing
Can you help identify what I am doing wrong? Thanks!
Here are the steps I took, following the instructions in the README. I first tried to uninstall and then install the autopilot daemon via the helm-charts:
jcadden@XXXXX:autopilot $ helm uninstall autopilot -n default
release "autopilot" uninstalled
jcadden@XXXXX:autopilot $ helm repo add autopilot git+https://github.com/IBM/autopilot.git@autopilot-daemon/helm-charts/autopilot?ref=gh-pages
"autopilot" has been added to your repositories
jcadden@XXXXX:autopilot $ cd ../
jcadden@XXXXX:git $ helm upgrade autopilot autopilot/autopilot-daemon --install -n default --set image.tag=vjcadden-dev -f ~/autopilot-config.yml
Release "autopilot" does not exist. Installing it now.
Error: Chart.yaml file is missing
The config file looks like this:
$ cat ~/autopilot-config.yml
namespace:
  create: true
  name: autopilot
image:
  repository: quay.io/autopilot/autopilot
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: network.nvidia.com/operator.mofed.wait
              operator: DoesNotExist
        - matchExpressions:
            - key: network.nvidia.com/operator.mofed.wait
              operator: In
              values:
                - "false"
Autopilot is currently deployed using Helm. If this proves limiting, we may want to adopt an operator model.
I noticed the "all" target in the switch statement of runAllTestsLocal does not execute runGPUMem. Is this missing, or is it an intentional omission?
Code-wise, the "all" target duplicates a decent amount of code that I would like to clean up.
Should we drop "all" as an option and require the list to be explicit?
Or should "all" run all tests, incl. runGPUMem?
A web-based dashboard, accessible through an OCP/K8s login, that provides controls for Autopilot's health checks to be configured and launched.
We need an entrypoint/healthz to assess whether Autopilot is healthy. It might even just run the briefings.sh script.
Preferred output should be in JSON format.
CPU model and GPU model data points are needed in the Prometheus outputs.
autopilot-daemon/pkg/util/global.go:L22
Expand the HchecksGauge with cpu and gpu columns
Pull the CPU and GPU values from os.Getenv??
We need the ability to automatically launch a configurable set of tests before the job starts. This could happen through an admission webhook, with object labels, or manually via an API call or a future dashboard (#2). The job should not begin if the pre-flight test has failures.
Pre-flight tests can include dcgmi -r 3
Add the ability to toggle individual checks on/off
Granularity: node or deployment?
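If both granularities turn out to be wanted, one option is to resolve them in layers. This is only a sketch of that design choice, with hypothetical names (toggles, enabled): a node-level setting wins over the deployment-level one, and a check that is listed nowhere defaults to on.

```go
package main

import "fmt"

// toggles maps a check name to on (true) or off (false).
type toggles map[string]bool

// enabled resolves whether a check should run: a node-level entry wins,
// then the deployment-level entry; checks listed nowhere default to on.
func enabled(check string, deployment, node toggles) bool {
	if v, ok := node[check]; ok {
		return v
	}
	if v, ok := deployment[check]; ok {
		return v
	}
	return true
}

func main() {
	deployment := toggles{"gpumem": false}
	node := toggles{"gpumem": true, "ping": false}
	for _, c := range []string{"gpumem", "ping", "pciebw"} {
		fmt.Printf("%s enabled: %v\n", c, enabled(c, deployment, node))
	}
}
```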
Add the ability for autopilot to add/remove labels on its node based on the results of tests.
e.g., autopilot-node-status=healthy, autopilot-node-status=unhealthy, autopilot-node-status=unknown
Introduce the k8s client-go library to interact with node labels
https://github.com/kubernetes/client-go/tree/v0.29.2/examples
You will likely need to extend the Helm rules to allow node modifications:
The sketch below assumes a single autopilot label per node
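// label_num indexes the possible values of the autopilot node label.
type label_num int

// labelMap maps each label_num to the label value written on the node.
var labelMap = map[label_num]string{
	0: "healthy",
	1: "unhealthy",
	2: "unknown",
	// ... more?
}

// Proposed API (signatures only; bodies are still to be written):
func AddLabel(l label_num)
func ClearLabel()
func GetLabel() label_num
func printLabel(l label_num) string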
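As a complement to the API above, this sketch shows the request body that adding or clearing the label would need, built with the standard library only. It assumes the label is applied as a JSON merge patch (RFC 7396) on the Node object; labelKey and labelPatch are hypothetical names, and no call to the cluster is made here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// labelKey is the label name used in the examples of the issue.
const labelKey = "autopilot-node-status"

// labelPatch builds a JSON merge patch body for a Node object.
// A nil value is serialized as null, which in a merge patch removes the key.
func labelPatch(value *string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]interface{}{labelKey: value},
		},
	}
	return json.Marshal(patch)
}

func main() {
	healthy := "healthy"
	add, err := labelPatch(&healthy)
	if err != nil {
		panic(err)
	}
	clearBody, err := labelPatch(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(add))
	fmt.Println(string(clearBody))
}
```

With client-go, a body like this would typically be sent through the Nodes().Patch call with types.MergePatchType, and the daemon's service account would need the patch verb on nodes for it to succeed.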
Add the ability for Autopilot to perform pairwise NCCL tests across all the active GPU nodes in the cluster. Pairwise NCCL tests are used to identify bad nodes. This should be available as a pre-flight test (#1).
Autopilot will need the ability to launch new pods on specific nodes (cluster-level awareness)
https://github.com/NVIDIA/nccl-tests
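One way to cover every pair of nodes without running two tests on the same node at once is a round-robin schedule (the "circle method"). The sketch below only computes the schedule; pairwiseRounds is a hypothetical helper, the node names are made up, and launching the actual NCCL test pods is out of scope.

```go
package main

import "fmt"

// pairwiseRounds returns rounds of node pairs such that every pair of nodes
// appears exactly once overall and no node appears twice within a round.
// It uses the round-robin circle method; with an odd node count an empty
// "bye" slot is added, so one node sits out each round.
func pairwiseRounds(nodes []string) [][][2]string {
	if len(nodes) < 2 {
		return nil
	}
	ring := append([]string(nil), nodes...) // copy, the input is not modified
	if len(ring)%2 == 1 {
		ring = append(ring, "") // bye slot
	}
	n := len(ring)
	var rounds [][][2]string
	for r := 0; r < n-1; r++ {
		var round [][2]string
		for i := 0; i < n/2; i++ {
			a, b := ring[i], ring[n-1-i]
			if a != "" && b != "" {
				round = append(round, [2]string{a, b})
			}
		}
		rounds = append(rounds, round)
		// Keep slot 0 fixed and rotate the other slots one step to the right.
		last := ring[n-1]
		copy(ring[2:], ring[1:n-1])
		ring[1] = last
	}
	return rounds
}

func main() {
	for i, round := range pairwiseRounds([]string{"node-a", "node-b", "node-c", "node-d"}) {
		fmt.Printf("round %d: %v\n", i+1, round)
	}
}
```

Pairs within a round can run concurrently, so N nodes need N-1 rounds (N rounds when N is odd) instead of N(N-1)/2 sequential tests. A node that fails with several different partners is then a likely bad node.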