
docs's Introduction

Run:ai Documentation Library

Welcome to the Run:ai documentation area. For an introduction to the Run:ai platform, see the Run:ai platform page on the run.ai website.

This documentation is built with mkdocs. To view the library as a website, go to docs.run.ai.

The Run:ai documentation targets three personas:

  • Run:ai Administrator - Responsible for the setup and the day-to-day administration of the product. Administrator documentation can be found here.

  • Researcher - Using Run:ai to submit jobs. Researcher documentation can be found here.

  • Developer - Using various APIs to manipulate Jobs and integrate with other systems. Developer documentation can be found here.

Example Docker Images

Code for the Docker images referred to in these docs is available here.

How to get Support

To get support, use one of the following channels:

  • On our website, use the support form under Support.

  • On the bottom right of the Run:ai user interface, use the Help widget.

  • On the bottom right of this page, use the Help widget.

docs's People

Contributors

davidlif, doronkg, enoodle, eyal-run-ai, gal-revach, galbarnissan, galbenyair, gshaibi, gshaibi-runai, guysalton21, hagay-runai, itayvallach, jasonnovichrunai, javiplav, jonathancosme, liohill, lliranbabi, morangux, natasharomm, oferla, omer-dayan, omerbenedict, omerbenedictrunai, ozrunai, razrotenberg, roir, romanbaron, saranachmias, yarongol, yodarshafrir1


docs's Issues

The docs are inconsistent: the On-prem tab is missing

The docs are inconsistent. There is supposed to be an On-prem tab just like
there is for 2.9. Not sure why you would get rid of an on-prem tab when
it’s the majority of deployment models.

On Mon, Jul 24, 2023 at 11:13 PM Yaron @.***> wrote:

It says "Follow the Getting Started guide to install the NVIDIA GPU
Operator, or see the distribution-specific instructions below...." So there
are a number of supported environments (including native k8s which you
mention) that fall under this catch-all phrase...
Since you are the native English speaker here, you are welcome to
re-phrase it and send a pull request.

I will change to 16GB @kirson-git https://github.com/kirson-git



Originally posted by @runneramb in #411 (comment)

Help message suggests the cluster URL may be an IP address, but the code only accepts a DNS name

Hello. When trying to authenticate to Run:ai the input form has this in the help message next to the field labeled "Cluster URL":

The Run:ai user interface requires a URL or IP address of the Kubernetes cluster (e.g. https://143.23.55.2 or https://cluster.myorg.com)

However, when I use the IP of the ingress controller (which brings us to the next problem), I receive the following error:

```
runai system is not yet available due to: enabled operands handling error: Ingress.extensions "researcher-service-ingress" is invalid: spec.rules[0].host: Invalid value: "10.150.98.170": must be a DNS name, not an IP address
```

There's no such thing as "Cluster URL"

Not only is this help message contradictory to the implementation, there is no way to tell what you meant by "Cluster URL". Please rename the field so that it describes what it actually expects, and change the wording of the help tooltip to match. There is no need to include examples in the help message: users who make it as far as that message have already opened a web browser and seen examples of URLs.

(Chromium) The Run:ai binary download documentation page omits the step that actually allows the download

This page: https://docs.run.ai/admin/researcher-setup/cli-install/#install-runai-cli reads:

  • Go to the Run:ai user interface. On the top right select Researcher Command Line Interface.
  • Select Mac or Linux.
  • Download directly using the button or copy the command and run it on a remote machine
  • ...

However, when clicking Researcher Command Line Interface, the user is simply redirected to the documentation page again.

Other users reported that when using Google Chrome or Brave they are able to reach the download form, so it may be a browser-compatibility issue. The users who succeed in reaching the download page also report that the drop-down

  • Documentation Center
  • Administrator API
  • Researcher Command Line Interface
  • Contact Support

looks different: the Researcher Command Line Interface item doesn't have an "arrow poking out of the box" (external link) icon next to it.

RunAI CLI

Opening an issue here, since issues cannot be opened in the RunAI CLI repository:

https://github.com/run-ai/runai-cli/releases
In your latest version, 2.3.1, the install script copies 'charts' into the installation folder; however, that folder is not present in the install files. For now I've copied the 'charts' folder from version 2.3.0 and it seems to work.

Additionally, you're missing an uninstall option that removes the command line interface in a user-friendly manner.
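In the meantime, uninstalling amounts to deleting the binary and any local state by hand. A minimal sketch, assuming the CLI was installed as `runai` under `/usr/local/bin` and keeps its configuration under `~/.runai` (both paths are assumptions, not documented):

```shell
# Hypothetical uninstall helper; the install location and config directory
# below are assumptions, not documented paths. Adjust them to your setup.
uninstall_runai_cli() {
  bindir=${1:-/usr/local/bin}
  confdir=${2:-$HOME/.runai}
  rm -f "$bindir/runai"   # remove the CLI binary
  rm -rf "$confdir"       # remove local CLI configuration, if any
}
```

A documented, supported uninstall command would still be preferable, since it could also clean up anything the install script placed elsewhere.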

Audit Log - Broken link

The Audit Log page refers to a broken link:

To retrieve the Audit log you need to call an API. You can do this via code or by using the Audit function via a [user interface for calling APIs](https://yaron.runailabs.net/api/docs/#/Audit/get_v1_k8s_audit){target=_blank}.

The link redirects to a site that cannot be reached.

WARN[0000] Error in getting user details: token is missing the 'username' claim

Hello.

I'm trying to understand whether this is a configuration error on my part or a problem with Run:ai.

I'm getting this warning in the following contexts:

```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list jobs -p test1
WARN[0000] Error in getting user details: token is missing the 'username' claim
ERRO[0000] project test1 does not exist. Run 'runai list project' to view all available projects
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list jobs -A
WARN[0000] Error in getting user details: token is missing the 'username' claim
WARN[0000] Error in getting user details: token is missing the 'username' claim
NAME  STATUS  AGE  NODE  IMAGE  TYPE  PROJECT  USER  GPUs Allocated (Requested)  PODs Running (Pending)  SERVICE URL(S)
```

As you can see, it sometimes results in an error, and other times it does not.

Additionally, the jobs are actually running since:

```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 describe job train1 -p test1 | head -5
WARN[0000] Error in getting user details: token is missing the 'username' claim
Name: train1
Namespace: runai-test1
Type: Train
Status: Running
Duration: 9m
```

succeeds, but with the same warning.

Also, when trying to list projects, none of them are found:

```
root@ose-t-u2004-02-28-1:~# ./runai-cli-linux-amd64 list project
WARN[0000] Error in getting user details: token is missing the 'username' claim
PROJECT  DEPARTMENT  DESERVED GPUs  ALLOCATED GPUs  INT LIMIT  INT AFFINITY  TRAIN AFFINITY  MANAGED NAMESPACE
root@ose-t-u2004-02-28-1:~#
```

Even though I was able to submit the job and execute describe on it.

So, what is going on?


NB. It would be nice if, instead of the warning message being directed at developers, it conveyed useful information to users (administrators are users for this purpose). As it stands, there is no way for the user to understand which token this message is talking about.
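The token the warning refers to is presumably a JWT, so the claim can be checked directly by decoding the token's payload (the second dot-separated, base64url-encoded segment). A sketch, assuming you can extract the bearer token the CLI is using (where the CLI stores it is not documented here):

```shell
# Check whether a JWT carries a 'username' claim by decoding its payload.
# How to obtain the token is up to you; this function only inspects it.
jwt_has_username() {
  payload=$(printf '%s' "$1" | cut -d. -f2)
  # base64url -> base64: restore padding, then swap the URL-safe alphabet
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do payload="${payload}="; pad=$((pad - 1)); done
  printf '%s' "$payload" | tr '_-' '/+' | base64 -d 2>/dev/null | grep -q '"username"'
}
```

If the function returns non-zero, the identity provider never put a `username` claim into the token, which points at the tenant/SSO configuration rather than at the CLI invocation itself.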

Unlabeled Node Roles

The following is described in the node roles doc:

## Dedicated GPU & CPU Nodes

Separate nodes into those that:

* Run GPU workloads
* Run CPU workloads
* Do not run Run:ai at all. these jobs will not be monitored using the Run:ai Administration User interface. 

This is actually not true: all nodes in the cluster are displayed under the Nodes tab in the Administration UI. That includes Run:ai worker nodes, Run:ai system nodes, regular workers, and cluster masters.

All nodes that contain GPUs and have DCGM exporting metrics on them count as "GPU nodes" in the Overview dashboard. That includes nodes that don't have the runai-container-toolkit and runai-container-toolkit-exporter DaemonSets running on them: no Run:ai pod will be scheduled on such nodes, but they are still counted.

Review node names using `kubectl get nodes`. For each such node, run:

```
runai-adm set node-role --gpu-worker <node-name>
```

or

```
runai-adm set node-role --cpu-worker <node-name>
```

Nodes not marked as GPU worker or CPU worker will not run Run:ai at all.

That's also not true: nodes marked as neither GPU worker nor CPU worker will still run any kind of Run:ai workload. The same behavior occurs if both roles are assigned to a node.
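For completeness, assigning a role to every node is easy to script around the `runai-adm` command quoted from the doc. A sketch; the wrapper function is hypothetical, only the `runai-adm set node-role` invocations come from the doc:

```shell
# Hypothetical wrapper around the documented runai-adm command.
# Set RUNAI_ADM=echo to preview the commands instead of executing them.
set_node_role() {
  role=$1 node=$2
  case "$role" in
    gpu) ${RUNAI_ADM:-runai-adm} set node-role --gpu-worker "$node" ;;
    cpu) ${RUNAI_ADM:-runai-adm} set node-role --cpu-worker "$node" ;;
    *)   echo "usage: set_node_role gpu|cpu <node-name>" >&2; return 1 ;;
  esac
}
```

Combined with `kubectl get nodes -o name`, this lets you mark a whole list of nodes in one loop rather than typing the command per node.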

Elaborate about runai-reservation namespace

When installing Run:ai, a namespace named runai-reservation is created in the cluster. This namespace is used to reserve GPUs for jobs that request fractional GPUs.

When a new fractional-GPU job is submitted, a new pod is created in the runai-reservation namespace; that pod is responsible for preventing "full GPU" workloads from using the shared GPU.

There is no reference for that namespace at all in the docs.
An official elaboration would be great :)

Upgrade Run:ai Airgapped 2.8.X

In the Run:ai installation tar (for airgapped environments), you will find the following:

```
deploy/
|__...
|__runai-backend/
|__runai-backend-<version>.tgz
|__...
```

The runai-backend directory is empty, while runai-backend-<version>.tgz should be inside it according to the docs:

```
helm upgrade runai-backend runai-backend/runai-backend-<version>.tgz -n \
    runai-backend -f runai-backend-values.yaml
```

Therefore, the tar should be moved into that directory, or alternatively the command in the docs should be changed to use ./runai-backend-<version>.tgz instead.
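Until the tar is fixed, the first workaround is easy to script: move the chart tarball into the empty directory the documented helm command expects. A sketch; the helper name is hypothetical, and the chart filename is whatever version you downloaded:

```shell
# Hypothetical workaround: stage the backend chart where the documented
# `helm upgrade runai-backend runai-backend/<chart>.tgz ...` command
# expects it. $1 is the path to the chart tarball, run from deploy/.
stage_backend_chart() {
  chart=$1
  mkdir -p runai-backend
  mv "$chart" runai-backend/
}
```

After staging, the helm command from the docs works unchanged.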

Following the k8s install docs yields errors

Following the doc here, I get the following errors on a fresh install of Ubuntu 22.10:

```
...
Err:6 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B53DC80D13EDEF05
Reading package lists... Done
W: GPG error: https://packages.cloud.google.com/apt kubernetes-xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY B53DC80D13EDEF05
E: The repository 'https://apt.kubernetes.io kubernetes-xenial InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
installing kubectl kubeadm kubelet...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done

No apt package "kubeadm", but there is a snap with that name.
Try "snap install kubeadm"

No apt package "kubectl", but there is a snap with that name.
Try "snap install kubectl"

No apt package "kubelet", but there is a snap with that name.
Try "snap install kubelet"

E: Unable to locate package kubelet
E: Unable to locate package kubeadm
E: Unable to locate package kubectl
```

Unable to submit Inference workloads using quick start docs

Hi, I was trying to deploy an inference workload using the UI as well as YAML. Here is the YAML; the example was taken from the quickstart guide in the Run:ai 2.15 docs.

```
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: inference1
  namespace: runai-demo
spec:
  name:
    value: inference1
  gpu:
    value: "0.5"
  image:
    value: "gcr.io/run-ai-demo/example-triton-server"
  minScale:
    value: 1
  maxScale:
    value: 2
  metric:
    value: concurrency
  target:
    value: 80
  ports:
    items:
      port1:
        value:
          container: 8000
          protocol: http
```

I get the following error:

```
Error from server (validation failed: must not set the field(s): spec.template.spec.schedulerName, spec.template.spec.securityContext): error when creating "inferenceworkload.yaml": admission webhook "workload-controller.runai.svc" denied the request: validation failed: must not set the field(s): spec.template.spec.schedulerName, spec.template.spec.securityContex
```

This was done on OpenShift 4.13.
