dstackai / dstack

dstack is an easy-to-use and flexible container orchestrator for running AI workloads in any cloud or data center.

Home Page: https://dstack.ai

License: Mozilla Public License 2.0

Python 90.58% Dockerfile 0.28% Shell 0.99% Go 7.73% Mako 0.03% HCL 0.22% Jinja 0.18%
machine-learning python aws azure gcp gpu llms cloud orchestration fine-tuning

dstack's Introduction

dstack is an open-source container orchestration engine designed for running AI workloads across any cloud or data center.

The supported cloud providers include AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, and CUDO. You can also use dstack to run workloads on on-prem servers.

Installation

Before using dstack through the CLI or API, set up a dstack server.

Install the server

The easiest way to install the server is via pip:

pip install "dstack[all]" -U

Configure backends

If you have default AWS, GCP, Azure, or OCI credentials on your machine, the dstack server will pick them up automatically.

Otherwise, you need to manually specify the cloud credentials in ~/.dstack/server/config.yml.

See the server/config.yml reference for details on how to configure backends for all supported cloud providers.

Start the server

To start the server, use the dstack server command:

$ dstack server

Applying ~/.dstack/server/config.yml...

The admin token is "bbae0f28-d3dd-4820-bf61-8f4bb40815da"
The server is running at http://127.0.0.1:3000/

Note: It's also possible to run the server via Docker.

CLI & API

Once the server is up, you can use either dstack's CLI or API to run workloads.

Dev environments

You specify the required environment and resources, then run it. dstack provisions the dev environment in the cloud and enables access via your desktop IDE.

Tasks

Tasks allow for convenient scheduling of any kind of batch job, such as training, fine-tuning, or data processing, as well as running web applications.

Specify the environment and resources, then run it. dstack executes the task in the cloud, enabling port forwarding to your local machine for convenient access.

Services

Services let you deploy any kind of model or web application as a public endpoint.

Use any serving framework and specify the required resources. dstack deploys the service in the configured backend, handles authorization, and provides an OpenAI-compatible interface if needed.

Pools

Pools simplify managing the lifecycle of cloud instances and enable their efficient reuse across runs.

You can have instances provisioned in the cloud automatically, or add them manually, configuring the required resources, idle duration, etc.

Examples

See the dstack examples for featured examples and more.

License

Mozilla Public License 2.0

dstack's Issues

Allow to schedule runs

It would be great to have an option to either specify a launch time for the run or simply say "start this run in 2 hours".
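
A minimal sketch of how such an option could be interpreted. The helper name `resolve_start_time` and the delay syntax ("2h", "30m", "1d") are assumptions made for illustration; they are not part of dstack.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical unit suffixes for relative delays; not a dstack convention.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def resolve_start_time(spec: str, now: datetime) -> datetime:
    """Turn a relative delay ("2h") or an ISO 8601 timestamp into a start time."""
    match = re.fullmatch(r"(\d+)([mhd])", spec.strip())
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        # Relative form: add the delay to the current time.
        return now + timedelta(**{_UNITS[unit]: amount})
    # Absolute form: let the standard library parse the timestamp.
    return datetime.fromisoformat(spec.strip())


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    print(resolve_start_time("2h", current))
```

A scheduler could then keep the run in a pending state until the resolved time has passed.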

Git patch apply error

If a run or a job is restarted on the same runner, the runner tries to apply the Git patch (repo diff) again and fails with a conflict, because the patch has already been applied to that folder.

Steps to reproduce:

  1. Make sure you have only one runner (e.g. disable on-demand runners, and start a runner locally)
  2. Submit a run (or job) and wait until it finishes (you can stop it if needed)
  3. Restart the run.

Expected:

  1. The run is executed exactly as the first time

Actual:

  1. There is an error

Log:

ERRO[2022-05-25T11:21:58Z] diff applier error                            ae=ApplyError{Fragment: 1, FragmentLine: 3, Line: 3} run_name=odd-rabbit-1 job_id=e7fa162e70b1 workflow=train-mnist filename=.dstack/variables.yaml err=conflict: fragment line does not match src line
ERRO[2022-05-25T11:21:58Z] run job is finished with error                job_id=e7fa162e70b1 err=conflict: fragment line does not match src line workflow=train-mnist run_name=odd-rabbit-1
INFO[2022-05-25T11:24:57Z] New job submitted                             job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1
WARN[2022-05-25T11:24:57Z] count of log arguments must be odd            job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 count=1
INFO[2022-05-25T11:24:58Z] git checkout                                  path=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 url=https://github.com/dstackai/dstack-examples.git branch=main hash=f219066b2379c69263f281f65167c8f6046874a2 job_id=e7fa162e70b1 auth=*http.BasicAuth
WARN[2022-05-25T11:24:58Z] git clone ref==nil                            branch=main hash=f219066b2379c69263f281f65167c8f6046874a2 job_id=e7fa162e70b1 path=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 url=https://github.com/dstackai/dstack-examples.git
INFO[2022-05-25T11:24:58Z] apply diff start                              run_name=odd-rabbit-1 dir=/root/.dstack/tmp/runs/odd-rabbit-1/e7fa162e70b1 job_id=e7fa162e70b1 workflow=train-mnist
ERRO[2022-05-25T11:24:58Z] diff applier error                            job_id=e7fa162e70b1 workflow=train-mnist run_name=odd-rabbit-1 filename=.dstack/variables.yaml err=conflict: fragment line does not match src line ae=ApplyError{Fragment: 1, FragmentLine: 3, Line: 3}
ERRO[2022-05-25T11:24:58Z] run job is finished with error                run_name=odd-rabbit-1 job_id=e7fa162e70b1 err=conflict: fragment line does not match src line workflow=train-mnist
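
One possible way to make the patch step repeatable is to reset the checkout to the submitted commit before applying the diff. The sketch below only illustrates that idea: the function names are hypothetical, and it shells out to the `git` CLI rather than reflecting how the runner itself handles repositories.

```python
import subprocess
from typing import List


def clean_apply_commands(repo_dir: str, commit_hash: str, patch_path: str) -> List[List[str]]:
    """Build the git commands that restore a clean tree and then apply the patch.

    patch_path should be absolute, because `git -C` changes the working directory.
    """
    git = ["git", "-C", repo_dir]
    return [
        git + ["reset", "--hard", commit_hash],  # drop changes left by a previous apply
        git + ["clean", "-fd"],                  # remove untracked files the patch created
        git + ["apply", patch_path],             # apply the diff to the clean tree
    ]


def apply_patch_cleanly(repo_dir: str, commit_hash: str, patch_path: str) -> None:
    """Run the commands in order, raising CalledProcessError on the first failure."""
    for command in clean_apply_commands(repo_dir, commit_hash, patch_path):
        subprocess.run(command, check=True)
```

With this approach a restarted job always starts from the same tree, so the second apply sees the same source lines as the first.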

Simplify AWS settings

Questions:

  1. How to avoid requiring the user to add limits manually through the UI?
  2. How to determine which regions and instance types the user is allowed to use?
  3. Is it possible to let the user configure it through the code and not UI?
  4. How to make AWS configuration as easy as possible?

Provide on-demand compute

Currently, in order to use dstack, the user needs to either have an existing cloud account or own hardware.
It would be great if dstack provided its own compute and allowed users to use dstack without having their own cloud account or hardware.
On one hand, dstack could provide a number of free GPU hours for the trial.
On the other hand, dstack could provide a way to pay for the spent hours, e.g. via a card.

Tag doesn't work after uploading

Using dstack artifacts upload, I provided the tag I wanted to assign to the data. Unfortunately, subsequent runs depending on this data failed without any logs. I then removed the tag from the data (after it had been uploaded successfully) and assigned the same tag again. Without any changes in the local repo, the code then launched on dstack without problems.

Support multiple CUDA versions

Here's one way to do it:

  1. Allow dstack-runner to read the CUDA version from the runner.yaml. Add config --cuda <...> argument to dstack-runner.
  2. Allow dstack-runner to replace ${{ cuda }} within jobs' image_name with the configured CUDA version. Do the same for the Docker image that is used to run nvidia-smi.
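
The substitution in step 2 can be sketched as follows. The function name `resolve_image_name` and the example image name are assumptions used only for illustration; the placeholder syntax `${{ cuda }}` is taken from the proposal above.

```python
import re

# Matches ${{ cuda }} with optional whitespace inside the braces.
_CUDA_PLACEHOLDER = re.compile(r"\$\{\{\s*cuda\s*\}\}")


def resolve_image_name(image_name: str, cuda_version: str) -> str:
    """Replace every ${{ cuda }} placeholder with the configured CUDA version."""
    # A callable replacement avoids any special handling of backslashes in the version string.
    return _CUDA_PLACEHOLDER.sub(lambda _match: cuda_version, image_name)


if __name__ == "__main__":
    # Example image name; any name containing the placeholder works the same way.
    print(resolve_image_name("nvidia/cuda:${{ cuda }}-base", "11.1"))
```

Image names without the placeholder pass through unchanged, so existing jobs would not be affected.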

Log runner errors related to run with the run logs

Currently, the user doesn't see why a run is failing...

Examples:

  • workflows.yaml is missing
  • Specified workflow cannot be found
  • Specified provider cannot be found
  • Specified tag cannot be found
  • Can’t fetch the repo
  • Can’t apply the diff + error message
  • Can’t download the provider
  • Can’t create/start Docker container
  • Can’t find/mount the artifact

etc

dstack fails with large uncommitted changes

This regularly happens if you work with ipynb notebooks locally and then submit a Python file, regardless of whether the latter was changed.
Sometimes it fails at the dstack run stage with requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.dstack.ai/runs/submit
What is worse, sometimes the submission reaches the server successfully but fails there without any notification.

Show workflow parameters in dashboard

  1. Apart from depends-on, it would be nice to have a textual representation of the dependency (its name)
  2. Apart from the run itself, its tag may be helpful

Allow CLI to upload artifacts

In the tutorial, the data is downloaded via the library, which is not customisable enough.
It would be nice to have an option to pass the data to the execution environment. For example, a tag in the workflows could specify a path from which the data is taken and copied to the AWS instance.

Thank you in advance!

dstack run tagging feature

It would be nice to be able to set a tag for the run right from the console, like:
dstack run train-model --tag latest
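
A minimal sketch of how the requested flag could be parsed, using Python's standard argparse module. The parser layout here is an assumption for illustration and does not describe the actual dstack CLI implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a `run` subcommand that accepts an optional --tag."""
    parser = argparse.ArgumentParser(prog="dstack")
    subparsers = parser.add_subparsers(dest="command", required=True)
    run = subparsers.add_parser("run", help="submit a workflow")
    run.add_argument("workflow", help="name of the workflow to run")
    # Hypothetical flag from the request above; None means the run is left untagged.
    run.add_argument("--tag", default=None, help="tag to assign to the run")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["run", "train-model", "--tag", "latest"])
    print(args.workflow, args.tag)
```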

Reduce the number of dependencies

There are tons of dependencies apart from the ones passed by the user. These dependencies are installed each time a run is submitted. It would be nice to optimize this part.
Ideas:

  1. Select only the libraries that are really necessary
  2. Make several pre-built sets for the most common use cases

Containers from Colab/Kaggle would be really nice, as they are more or less standard and behave as expected with popular libraries.

Support "restart" command

The command should work similarly to dstack run but instead of creating new jobs, it should change the existing jobs to the Submitted status.
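
The intended behaviour can be sketched as follows. The `Job` class, its fields, and `restart_run` are hypothetical names chosen for illustration; only the Submitted status comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Hypothetical minimal job record; real jobs carry many more fields.
    job_id: str
    run_name: str
    status: str


def restart_run(jobs: List[Job], run_name: str) -> List[Job]:
    """Move every existing job of the given run back to the Submitted status."""
    restarted = []
    for job in jobs:
        if job.run_name == run_name:
            job.status = "Submitted"  # reuse the job instead of creating a new one
            restarted.append(job)
    return restarted
```

Jobs that belong to other runs are left untouched, and no new job IDs are created.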

Pass job environment variables to the container

Now, every job may have its own environment variables set by the provider – see the property environment in the job. It's a map of string to string. The runner should pass these environment variables to the job container.
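
As a rough sketch, the environment map could be translated into `-e NAME=VALUE` arguments for `docker run`. The helper name `docker_env_args` is an assumption, and a runner that talks to the Docker API directly would pass the same pairs through that API instead.

```python
from typing import List, Mapping


def docker_env_args(environment: Mapping[str, str]) -> List[str]:
    """Convert a string-to-string map into `-e NAME=VALUE` arguments for docker run."""
    args: List[str] = []
    # Sort by name so the resulting command line is deterministic.
    for name, value in sorted(environment.items()):
        args += ["-e", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Example variables; the names are illustrative only.
    print(docker_env_args({"EPOCHS": "10", "BATCH_SIZE": "32"}))
```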
