
lablup / backend.ai


Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.

Home Page: https://www.backend.ai

License: GNU Lesser General Public License v3.0

Python 94.43% Shell 1.58% Dockerfile 0.05% Starlark 0.56% Go 0.04% Java 0.04% Mako 0.01% CSS 0.34% Jinja 0.09% Vim Script 0.01% C 0.08% HTML 2.55% JavaScript 0.23%
python docker distributed-computing api documentation cloud-computing backendai containers hpc monitoring

backend.ai's People

Contributors

achimnol, adrysn, agatha197, chisacam, dependabot[bot], fregataa, hephaex, hoyajigi, hydroxyde, inureyes, jopemachine, kangjuseong, kimjinmyeong, kimjmin, kyujin-cho, leejiwon1125, lizable, minseokey, mirageoasis, nayeonkeum, pderer, qkoo0833, rapsealk, sangwonyoon, sanxiyn, soheeeep, studioego, syeong2, yaminyam, zeniuus


backend.ai's Issues

Add more integration tests

First, read the official integration testing guide. Then let's implement it.

Some of the functionalities below are already covered by the manager's test cases, but we also need to test the APIs' input validation and verify that the client-side and server-side implementations match.

  • Kernel integration tests (tests/test_kernel.py)
    • Use this as a starting point and reference. (It may also require some updates, though.)
  • Admin integration tests (tests/test_admin.py) -> needs updates!
    • Create, delete, modify domains: use testing-XXXX format (with resource limits)
    • Create, delete, modify groups (with resource limits)
    • Create, delete, modify keypairs and users in different domains and groups (with resource limits)
    • Check if the current user's resource limit reflects the domain/group limits as well if configured.
  • Advanced kernel integration tests (tests/test_kernel.py)
    • As a fixture, you need to create a domain, a group, and a user who belongs to them.
    • Execute with custom environment variables using the above user
    • Execute via websockets (stream APIs) using the above user
    • Execute with the batch mode's build/clean commands, including cases with both satisfying and exceeding the batch-mode API's file number/size limits
    • Activate service ports after creating kernels, and check if they give valid responses (e.g., some HTML codes for web-based container services like Jupyter)
    • Group/domain segregation
      • As a fixture, you need to create two or more domain/group/user sets.
      • Check that running kernels for a specific domain and group are not listed to users in a different domain and group.
    • Resource limits
      • Execute with different resource limits and check/hit those limits (e.g., check the number of CPU cores, check if an OOM event (forced termination) occurs)
      • Try to create kernels whose creation configs exceed the configured domain, group, user (keypair) resource limits and check if the limits are enforced as expected.
  • VFolder integration tests (tests/test_vfolder.py)
    • Create and delete personal vfolders, check them using the listing API
    • Hit the vfolder limit of the keypair resource policy by creating too many vfolders
    • Upload/download small (10 MiB) and large (10 GiB) randomly generated files with hash checks
    • Group vfolders
      • As a fixture, you need to create two or more groups and corresponding domain admin users.
      • Create and delete group vfolders, check them using the listing API
      • Check if users who belong to only a specific group cannot list & access another group's vfolders.
    • Domain segregation
      • As a fixture, you need to create two or more domain/group/user sets.
      • Check that vfolders in a specific domain cannot send/receive invitations to/from users in a different domain and group.
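
As a concrete illustration of the fixture items above, here is a minimal pytest sketch for the domain/group/user setup. It assumes a hypothetical async admin client; create_domain(), create_group(), create_user(), and their delete counterparts are illustrative names, not the actual client API.

import secrets

import pytest

@pytest.fixture
async def domain_group_user(admin_client):
    # Names follow the testing-XXXX convention so leftovers are easy to spot.
    suffix = secrets.token_hex(4)
    domain = await admin_client.create_domain(f"testing-{suffix}", total_resource_slots={"cpu": "4"})
    group = await admin_client.create_group(f"testing-{suffix}", domain=domain.name)
    user = await admin_client.create_user(f"testing-{suffix}@example.com", domain=domain.name, group=group.name)
    yield domain, group, user
    # Clean up in reverse order of creation.
    await admin_client.delete_user(user.email)
    await admin_client.delete_group(group.name)
    await admin_client.delete_domain(domain.name)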

Add physical hardware information for admins

In the admin GraphQL queries for agents, let's include physical hardware information including:

  • CPU architecture, vendor, family, cores, frequency, feature flags
  • Memory
  • NUMA or not, number of NUMA nodes
  • Accelerators (driver/runtime version, compute capability, processors and memory, etc.)
    • Include utilization reports like nvidia-smi
  • Disk size

In conjunction with lablup/backend.ai-manager#103, let's add the following to the agent:

  • Add collect_live_stats() abstract method to AbstractComputeDevice
  • Add get_physical_info() abstract method to AbstractComputeDevice
  • Add collect_live_stats_summary() abstract method to AbstractComputePlugin
  • Add get_physical_info_summary() abstract method to AbstractComputePlugin

All of the above methods should return an arbitrary JSON-serializable dict.
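
A rough sketch of these interface additions (the four method names come from this issue; the base-class details and return types are simplified):

from abc import ABCMeta, abstractmethod
from typing import Any, Mapping

class AbstractComputeDevice(metaclass=ABCMeta):

    @abstractmethod
    async def collect_live_stats(self) -> Mapping[str, Any]:
        """Per-device live utilization (like nvidia-smi), as a JSON-serializable dict."""

    @abstractmethod
    async def get_physical_info(self) -> Mapping[str, Any]:
        """Static hardware details (vendor, cores, memory, ...), as a JSON-serializable dict."""

class AbstractComputePlugin(metaclass=ABCMeta):

    @abstractmethod
    async def collect_live_stats_summary(self) -> Mapping[str, Any]:
        """Aggregated live stats across all devices managed by this plugin."""

    @abstractmethod
    async def get_physical_info_summary(self) -> Mapping[str, Any]:
        """Aggregated physical info across all devices managed by this plugin."""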

Our own PyPI repository to host Backend.AI-optimized Python packages

We are now managing our own builds of TensorFlow and precompiled Python packages for Alpine Linux.

Let's host them on a private PyPI repository so that we can simplify the docker-build steps here.
For instance, we could run a separate script to build those packages using containers and automatically upload them to the private PyPI repository. Then downloading/installing those packages inside the Dockerfiles becomes much simpler than using multi-stage builds. (The big shortcoming of multi-stage builds is that it is difficult to control caches and intermediate image names/tags.)

Previously it was cumbersome and resource-consuming to host a private PyPI repository because we had to run a dedicated server. But now, there is a tool called s3pypi which allows us to build and host a PyPI repository with only S3 static website hosting + CloudFront.

Suggested repository URLs:

  • https://pypi.backend.ai/kernels/alpine/3.8/
  • https://pypi.backend.ai/kernels/ubuntu/16.04/
  • https://pypi.backend.ai/kernels/ubuntu/18.04/

This could eliminate the use of our "build.py" script and its complicated build chains.

Also, @tlqaksqhr's https://github.com/lablup/backend.ai-packages could be hosted in this way as well, probably via:

  • https://pypi.backend.ai/dist/ubuntu/16.04/ or something similar.
    In this case the repository is for distributing Backend.AI itself.
    Specifically, we could also make a secret repository for enterprise customers.

Kernel get-or-create error when agent is temporarily lost

2018-10-05 05:28:57 WARNING ai.backend.manager.registry [Process-0] agent i-033... heartbeat timeout detected.
2018-10-05 05:28:58 INFO ai.backend.gateway.kernel [Process-1] GET_OR_CREATE (u:AKIA..., lang:python-tensorflow, token:a4d36...)
2018-10-05 05:28:58 ERROR ai.backend.gateway.kernel [Process-1] GET_OR_CREATE: unexpected error!
Traceback (most recent call last):
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 330, in get_or_create_session
    kern = await self.get_session(sess_id, access_key)
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 274, in get_session
    raise KernelNotFound
ai.backend.gateway.exceptions.KernelNotFound: (404, 'Not Found', 'https://api.backend.ai/probs/kernel-not-found')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/devops/backend.ai-manager/ai/backend/gateway/kernel.py", line 106, in create
    conn=conn)
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 340, in get_or_create_session
    conn=conn)
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 446, in create_session
    result = await conn.execute(query)
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/utils.py", line 70, in __await__
    resp = yield from self._coro
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/sa/connection.py", line 110, in _execute
    yield from cursor.execute(str(compiled), post_processed_params[0])
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/cursor.py", line 114, in execute
    yield from self._conn._poll(waiter, timeout)
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/connection.py", line 238, in _poll
    yield from asyncio.wait_for(self._waiter, timeout, loop=self._loop)
  File "/home/devops/.pyenv/versions/3.6.4/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
    return fut.result()
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/connection.py", line 135, in _ready
    state = self._conn.poll()
psycopg2.IntegrityError: duplicate key value violates unique constraint "ix_kernels_unique_sess_token"
DETAIL:  Key (access_key, sess_id)=(AKIA..., a4d36...) already exists.

This happens when a user tries to reuse a kernel on a temporarily lost agent after previous kernel launches on that agent have succeeded.
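
One possible mitigation, sketched below: have get_or_create_session() treat the unique-constraint violation as "the session already exists" and re-query, instead of letting the IntegrityError propagate. The method names mirror the traceback; the control flow is illustrative, not the actual registry code.

import psycopg2

from ai.backend.gateway.exceptions import KernelNotFound

async def get_or_create_session(self, sess_id, access_key, conn=None):
    try:
        return await self.get_session(sess_id, access_key)
    except KernelNotFound:
        pass
    try:
        return await self.create_session(sess_id, access_key, conn=conn)
    except psycopg2.IntegrityError:
        # A record with the same (access_key, sess_id) already exists, e.g.,
        # a kernel on a temporarily lost agent; return it instead of failing.
        return await self.get_session(sess_id, access_key)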

How do I set up keys in an on-premise installation?

I've set up Backend.AI (manager, agent, client) on a local server.
Then I set BACKEND_ENDPOINT for the client.

However, I couldn't find the keys for my local manager:
BACKEND_ACCESS_KEY
BACKEND_SECRET_KEY

Where can I find or set them?

Automatic instance recycling

There is a private report that after creating/destroying Docker containers more than 20K times (!), the Linux kernel panics. We do not have experimentally confirmed results yet, though.

Let's add automatic instance recycling (e.g., reboot after 1K kernel executions) as a safety measure, if the same problem is confirmed.


Per-user web shell

Google Cloud provides a personalized web shell so that users don't have to install the gcloud CLI tools themselves but can use them in their web browsers. (official docs)

Let's provide something similar via our cloud.backend.ai and local web UI.

Features:

  • Pre-configured access/secret keys
  • Pre-installed latest stable version of backend.ai command-line tools
  • Auto-mounts of all virtual folders owned by the user and shared with the user
  • 1 GiB of free personal shell storage - need to implement lablup/backend.ai-agent#86
  • One-click reset option in case of any errors and problems

I think we could build a minimal CLI-dedicated kernel image and use it via a new set of web-shell APIs based on the current streaming APIs.
We also need to update the backend.ai-media library and embed it into our work-in-progress web console app for seamless integration.


Workers die with an error when aiotools num_workers is set to the number of CPU cores

If I set num_workers to the number of CPU cores, as mentioned in the title, an error occurs.
So I specified 20 workers instead. Is there any other way to solve this?

$ grep -c processor /proc/cpuinfo
40
ERROR aiotools.server [Process-7] Worker 7: Error during initialization
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiotools/server.py", line 56, in _worker_main
    loop.run_until_complete(ctx.__aenter__())
  File "uvloop/loop.pyx", line 1203, in uvloop.loop.Loop.run_until_complete (uvloop/loop.c:25636)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiotools/context.py", line 78, in __aenter__
    return (await self._agen.__anext__())
  File "/home/ubuntu/backend.ai-manager/ai/backend/gateway/server.py", line 284, in server_main
    await gw_init(app)
  File "/home/ubuntu/backend.ai-manager/ai/backend/gateway/server.py", line 188, in gw_init
    timeout=30, pool_recycle=30,
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/utils.py", line 70, in __await__
    resp = yield from self._coro
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/sa/engine.py", line 70, in _create_engine
    pool_recycle=pool_recycle, **kwargs)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/utils.py", line 65, in __iter__
    resp = yield from self._coro
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/pool.py", line 47, in _create_pool
    yield from pool._fill_free_pool(False)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/pool.py", line 209, in _fill_free_pool
    **self._conn_kwargs)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/utils.py", line 65, in __iter__
    resp = yield from self._coro
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/connection.py", line 74, in _connect
    yield from conn._poll(waiter, timeout)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/connection.py", line2018-03-30 14:02:57 INFO ai.backend.gateway.server [Process-6] shutting down...
 238, in _poll
    yield from asyncio.wait_for(self._waiter, timeout, loop=self._loop)
  File "/home/ubuntu/.pyenv/versions/3.6.4/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
    return fut.result()
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/connection.py", line 135, in _ready
    state = self._conn.poll()
psycopg2.OperationalError: FATAL:  sorry, too many clients already
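
For context: the FATAL error comes from PostgreSQL's max_connections limit (100 by default). Each gateway worker creates its own aiopg connection pool, so the total is roughly num_workers × pool maxsize (aiopg's default maxsize is 10), which easily exceeds 100 with 40 workers. A hedged sketch of dividing a global connection budget across workers (the budget numbers and connection parameters are illustrative):

import aiopg.sa

async def create_db_engine(num_workers: int, db_max_connections: int = 100):
    # Reserve some headroom for maintenance/superuser connections.
    budget = int(db_max_connections * 0.9)
    per_worker = max(1, budget // num_workers)
    return await aiopg.sa.create_engine(
        host="localhost", dbname="backend", user="postgres",
        minsize=1, maxsize=per_worker,
        timeout=30, pool_recycle=30,
    )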

Accuracy badge for public ML models

Idea by @serialx.

  • Let's provide "accuracy" badges for public ML models – visiting their repositories shows the accuracy value as a badge in the README (like CI build status badges)
  • Feature ideas
    • per-dataset / per-organization ranking
    • per-commit history (i.e., how has the accuracy changed over time?)
    • public dashboard to compare various open-source ML repositories


Self auto-upgrade when there are no kernels

Let each agent automatically check for updates and upgrade itself when it has had no running kernels for a specified amount of time (e.g., 1 min). This should be controllable via configuration, and the version range should be restricted to the same minor series.

Optionally, during upgrades, let it pull the latest docker images.
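
A rough sketch of the idle check and the minor-series restriction (the agent attributes and the release-lookup/upgrade helpers are hypothetical):

from importlib.metadata import version

from packaging.version import Version

async def maybe_self_upgrade(agent, idle_threshold: float = 60.0) -> None:
    # Only consider upgrading after the agent has been kernel-free long enough.
    if agent.running_kernel_count > 0 or agent.idle_seconds < idle_threshold:
        return
    current = Version(version("backend.ai-agent"))
    latest = Version(await agent.fetch_latest_release())  # hypothetical: e.g., query the PyPI JSON API
    # Restrict the upgrade range to the same minor series.
    if (latest.major, latest.minor) == (current.major, current.minor) and latest > current:
        await agent.pull_latest_kernel_images()  # the optional step from this issue
        await agent.run_self_upgrade(latest)     # hypothetical: pip upgrade + restart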

Re-establish DB connection pool when DB failover happens

Occasionally, cloud DB instances go through an automatic fail-over process.

  • Change the configuration to use a DNS name instead of a static IP for DB.
  • If possible, detect the connection resets and re-create the DB connection pool.
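
A sketch of the second bullet for an aiopg-style engine (the probe query and the DNS name are illustrative):

import aiopg.sa
import psycopg2

async def ensure_db_engine(app) -> None:
    # Probe the pool; if connections were reset by a fail-over, rebuild the pool.
    try:
        async with app["db_engine"].acquire() as conn:
            await conn.execute("SELECT 1")
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        app["db_engine"].close()
        await app["db_engine"].wait_closed()
        # Reconnect through the DNS name so we follow the failed-over instance.
        app["db_engine"] = await aiopg.sa.create_engine(
            host="db.internal.example.com", dbname="backend", user="postgres")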

Extend the manager CLI

To inspect the current registry, we currently have to access the Redis database directly or issue API calls.
Let's make a simple interactive shell to inspect and control the registry.
It will be extremely useful in service operation.

  • Manage DB fixtures.
    • List fixtures
    • Populate fixtures
    • Reset fixtures
  • List current running instances.
    • Filter instances by having running kernels or not.
    • Pagination
  • List current running kernels.
    • Filter by users and entries
    • Pagination
  • Ping instances and kernels.
  • Manually add instance.
  • Manually destroy instance and kernel.
  • Manually schedule kernel clean up operation.
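
A minimal sketch of what such a management CLI entry could look like, assuming a Click-based command group (the registry access itself is elided):

import click

@click.group()
def cli():
    """Inspect and control the kernel registry."""

@cli.command()
@click.option("--with-kernels/--without-kernels", default=None,
              help="Filter instances by whether they have running kernels.")
@click.option("--page", default=1, show_default=True, help="Page number for pagination.")
def instances(with_kernels, page):
    """List current running instances."""
    # Placeholder: query the registry (Redis) and print a table of agents.
    ...

if __name__ == "__main__":
    cli()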


Extra functionalities for shared vfolders

  • Renaming of private vfolders and of shared vfolders that have no invitees or joined collaborators
  • Leaving shared vfolders that I have already joined
  • Kicking out collaborators from self-owned shared vfolders
  • Forking/cloning shared (read-only) vfolders into my (read-writable) private vfolders

Fully functional docker-compose configuration

Currently, launching Sorna's server-side gateway with working agents from scratch is very difficult. Let's add a docker-compose configuration so that new developers can run it instantly.

Agent backend abstraction

Currently we use only Docker containers as the kernel session host.
However, this may be extended to other types of services, such as cloud vendor-specific remote APIs or local processes for debugging and testing purposes.
For instance, we could implement a "data agent" by specializing a subset of agents in a cluster to manage a memory-cached distributed filesystem such as Alluxio.

  • Configurable driver selection on startup from agent.toml
  • #363
  • #865
  • Report the list of active drivers when sending heartbeats to the manager
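
One way the backend abstraction could look (a sketch with illustrative names; not the actual agent code):

from abc import ABCMeta, abstractmethod

class AbstractKernelBackend(metaclass=ABCMeta):
    """A backend that hosts kernel sessions: Docker, a cloud API, or bare processes."""

    name: str  # reported to the manager in heartbeats as an active driver

    @abstractmethod
    async def create_kernel(self, kernel_id: str, config: dict) -> None: ...

    @abstractmethod
    async def destroy_kernel(self, kernel_id: str) -> None: ...

class DockerBackend(AbstractKernelBackend):
    name = "docker"

    async def create_kernel(self, kernel_id: str, config: dict) -> None: ...

    async def destroy_kernel(self, kernel_id: str) -> None: ...

def load_backend(agent_config: dict) -> AbstractKernelBackend:
    # Configurable driver selection on startup, e.g., [agent] backend = "docker" in agent.toml.
    registry = {"docker": DockerBackend}
    return registry[agent_config["agent"]["backend"]]()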

User activity logs

Some customers want to keep detailed command logs executed by their users due to their security audit policies.
Let's add an option to store all query-mode and batch-mode code snippets.

1st phase:

  • Add an etcd configuration option
  • Add the activity_logs database table
  • Add an admin API to retrieve activity logs

2nd phase:

  • Extend jail to log system call arguments and return values
    • This will incur non-negligible performance overhead. We should implement this very efficiently, buffering & compressing the logs on the fly, as hundreds of syscalls may be issued per second. (See the sketch below.)
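
A sketch of the on-the-fly buffering & compression mentioned above (the threshold and storage step are illustrative):

import zlib

class ActivityLogBuffer:
    """Accumulate log records and flush them as a single compressed blob."""

    def __init__(self, flush_threshold: int = 64 * 1024) -> None:
        self._chunks = []
        self._size = 0
        self._flush_threshold = flush_threshold

    def append(self, record: bytes):
        self._chunks.append(record + b"\n")
        self._size += len(record) + 1
        if self._size >= self._flush_threshold:
            return self.flush()
        return None

    def flush(self) -> bytes:
        # One zlib.compress() per batch keeps the per-syscall overhead negligible.
        blob = zlib.compress(b"".join(self._chunks), 6)
        self._chunks.clear()
        self._size = 0
        return blob  # e.g., to be inserted into the activity_logs table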

Extend plug-in architecture

  • Plugin base interfaces (a plugin may provide multiple interfaces):
    • Web application: provides additional API endpoints
    • Auth provider: provides an additional authentication/authorization scheme
    • Scaling driver: provides an auto-scaling driver
    • CLI command: provides additional manager commands
    • DB guest: declares additional DB tables
    • Kernel tracer: inserts hooks in the kernel lifecycle
    • others to come
  • Auto-discovery of installed plugins (like pytest)
    • Plugins should provide their metadata: name, version, and available interfaces.
  • Plugin inventory DB table for the manager
    • Includes plugin metadata, (discovered) import path, currently installed version, when (de)activated last time.
    • Additional fields:
      • config: json "{}"
      • scope: string "global"
        In the future, we may support scaling-group-specific plugins.
  • Manager commands to activate/deactivate plugins and list them
    • Current draft "BACKEND_EXTENSIONS" will be dropped if this is implemented.
    • The changes are applied upon next restart of the server.
      • DB table creation/drops also happen on the next restart.
    • Live reload is future work.
  • Public framework APIs for the plugin authors
    • Access to intrinsic DB tables/models
    • Access to the configuration
    • Interface-specific additions
  • Auth provider plugin
    • The base interface defines authenticate().
    • Target implementations: File-based, LDAP/AD-based, SAML-based
  • Web application plugin
    • The base interface is aiohttp.web.Application.
    • The plugin routes are added under "/ext/{plugin-id}/".
  • Scaling driver plugin
  • CLI command plugin
    • The base interface defines signature for adding custom commands.
    • Utilize register_command like intrinsic manager commands
    • Could we seamlessly support auto-completion like lablup/backend.ai-client-py#24 ?
  • DB guest plugin
    • The base interface defines signatures for creation, upgrades, and deletion of prefixed DB tables
    • On activation, the db tables are created.
    • On deactivation, the admin is asked whether to drop the tables or keep them at rest.
    • The upgrade is automatically performed via the plugin. The plugin must implement its own way to migrate its table schemas.
    • Plugin databases are not part of the admin GraphQL interface. The plugin should provide its own API endpoint via web application interface if such functionality is required.
  • Kernel tracer plugin
    • The base interface defines hook methods that are called upon kernel/session-related events (e.g., creation and termination)
    • It can also configure hooks to intercept execution requests and inspect/modify the requested code.
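
For the auto-discovery item, a sketch using package entry points, the same mechanism pytest plugins use (the entry-point group name is hypothetical):

from importlib.metadata import entry_points

def discover_plugins(group: str = "backendai.plugins"):  # hypothetical group name
    """Discover installed plugins via package entry points, like pytest does."""
    found = {}
    for ep in entry_points(group=group):  # Python 3.10+ keyword form
        plugin_cls = ep.load()
        meta = plugin_cls.metadata  # expected: name, version, available interfaces
        found[meta["name"]] = plugin_cls
    return found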

Implement network access restriction in sorna-jail

Sorna's "jail" subproject(1) is a seccomp-based sandbox written in Go. We use it in all kernel containers to prevent malicious user code from executing potentially dangerous system calls, as well as to enforce our customized ACLs on file systems and networks.

(1) This will be moved to a separate repository.

Already implemented:

  • Limitation of the maximum allowed number of threads/processes
  • seccomp-based system call filter

Half-implemented:

  • Path-based file system operation check: reading the path string from syscall arguments works, but there is no detailed policy implementation yet. This would be a good practice task before getting into the network restriction work.

To do for you:

  • Host-based and IP-based network connection restriction. For example, allow only HTTPS/SSH access to GitHub but forbid network connections to everything else.
    • This requires intercepting DNS resolution and the connect() system call, with some inspection of the socket file descriptor.


Shell Auto-completion Support

  • Implement a completion entry point that is aware of our subcommands.
  • Extend it to support completion of remote session names.
  • Guide the user on how to install shell completions after installation.

Consider using argcomplete?
Consider migrating to argh?
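
If we go with argcomplete, the wiring could look roughly like this; a real session-name completer would query the manager API, while the one below returns canned names for illustration:

#!/usr/bin/env python
# PYTHON_ARGCOMPLETE_OK
import argparse

import argcomplete

def session_names(prefix, parsed_args, **kwargs):
    # Would query the manager for live session names; canned values for illustration.
    return [s for s in ("sess-alpha", "sess-beta") if s.startswith(prefix)]

parser = argparse.ArgumentParser(prog="backend.ai")
subparsers = parser.add_subparsers(dest="command")
terminate = subparsers.add_parser("terminate")
terminate.add_argument("session").completer = session_names
argcomplete.autocomplete(parser)
args = parser.parse_args()

Users would then register the completer in their shell, e.g., eval "$(register-python-argcomplete backend.ai)".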

Internal ticket: OP#707

Prepare open-source release

  • Add LICENSE files to each sub-project.
  • Fill setup.py with rich information (license, classifiers, etc.)
  • Clean up / reorganize milestones.
  • Register packages to PyPI (Python Package Index).
  • Update documentation

Rolling update of agents

  • Check if etcd could be used as a distributed lock coordinator so that only a small fraction of the instances enters the updating state.
  • Implement the locking mechanism.
  • Notify the manager so that no new kernel requests come in.
  • Implement the self-update procedure with automatic restarts.

This issue is linked with lablup/backend.ai-agent#43.
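
A sketch of the etcd-based lock for the first two items, using the python-etcd3 client (the host, lock name, TTL, and the drain/update helpers are illustrative):

import etcd3

def notify_manager_draining() -> None: ...  # hypothetical: stop accepting new kernels

def run_self_update() -> None: ...          # hypothetical: pip upgrade + restart

def try_rolling_update() -> bool:
    client = etcd3.client(host="etcd.internal", port=2379)
    # The lock TTL bounds how long a crashed updater can block the others.
    lock = client.lock("agent-rolling-update", ttl=120)
    if not lock.acquire(timeout=5):  # give up quickly if another instance is updating
        return False
    try:
        notify_manager_draining()
        run_self_update()
    finally:
        lock.release()
    return True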


Bundling containers for a kernel session

For large-scale computations, sometimes we need to run multiple containers on different hosts for resource aggregation and distributed/parallel processing.

In the past, this was very difficult to implement because Docker's networking was limited to linking a container to another via a hostname alias (--link), which is essentially a one-to-one private link. Now it's 2017, and Docker offers a nice distributed coordination mode called "Swarm" which includes overlay networking.

Docker Swarm uses the Raft algorithm to share node information, and any new Docker daemon can join an existing Swarm via host:port and a secret token. Once joined, the containers of any daemon in the swarm can be connected to volatile overlay networks created and destroyed at runtime.

Let's try this and support multi-container distributed computing!

Update for 2020!

Docker Swarm has problems with overlapping IP addresses across different overlay networks, and creating/destroying and attaching/detaching networks has proven to be unstable.

After some testing by @kyujin-cho, we decided to fall back to the "classic" Swarm mode, which uses an external etcd to manage multi-host networks, and to use namespaced container hostname aliasing to access other containers in the same overlay network.

Basically we keep the same "kernels" table extension as we prototyped in 2017-2018. A single record of the kernels table corresponds to a container, and multiple records may share the same sess_id, indicating that they belong to an overlay cluster.

Phase 1

  • Automatic reconfiguration of existing Docker daemons in agents so that they use our etcd.
  • Add a simple integration test script that manually spawns two containers on different hosts to check that the overlay network actually works. Let's place the script in the scripts directory of this meta repository.

Phase 2

  • Design and implement a template format for a compute session. → #70
    • A task template contains the target image, resource occupation, environment variables, default vfolder mounts, bootstrap script, etc.
  • Design and implement a template format for a cluster composed of session templates.
    • The cluster template may declare multiple "roles" of containers and the min/max numbers of containers for each role.
    • Each container should have special environment variables so that they can detect the role and index. (e.g., BACKEND_CLUSTER_ROLE, BACKEND_CLUSTER_ROLE_IDX)
  • (Re-)implement lifecycle mgmt. of multiple containers for a single session
    • First, use a plain for-loop to trigger creation/destruction of each kernel.
    • Set up custom hostnames via Docker daemon/API so that all containers in the same overlay cluster can access each other.
      • e.g., Spawning 2 "master" and 12 "worker" nodes: "master1", "master2", "worker1", "worker2", ..., "worker12"
      • Don't pad zeros for lexicographical ordering because it causes extra complexity in automated scripts for images and task templates.
    • We need to extend the semantics of kernel lifecycle operations:
      • All lifecycle operations against a session must wait until the same operations have completed for all kernels of that session.
      • Restarting and terminating any of the kernels triggers the same operation of the whole overlay cluster.
    • Kernel creation options like environment variables and vfolder mounts must be applied to all kernels of a session; they override those defined by task templates, which in turn override those defined in the images.
  • Extend the GUI to have a "collapsible" kernel list so that kernels for a single session are grouped and folded together.
    • The primary kernel of a session (which is exposed in the main session list) is the first indexed kernel of the first role as in the defined order.
    • Each kernel with different roles may define their own service ports and those service ports are transparently supported.

Phase 3

  • Keep consistency when the manager is interrupted (e.g., restarted) during multi-container spawning and destruction (i.e., partially provisioned).
    • We may need to add a column specifying "desired state" so that we could resume and continue the cluster provisioning jobs upon manager restarts.
  • Optimize the provisioning using asyncio.gather() with proper interruption handling.
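
A sketch of the gather-based provisioning with interruption handling (create_kernel/destroy_kernel are stand-ins for the real lifecycle calls):

import asyncio

async def create_kernel(spec): ...     # stand-in for the real creation call

async def destroy_kernel(kernel): ...  # stand-in for the real destruction call

async def provision_cluster(kernel_specs):
    # Spawn all kernels concurrently; on any failure, tear down the survivors.
    results = await asyncio.gather(
        *(create_kernel(spec) for spec in kernel_specs),
        return_exceptions=True,
    )
    errors = [r for r in results if isinstance(r, BaseException)]
    if errors:
        created = [r for r in results if not isinstance(r, BaseException)]
        await asyncio.gather(
            *(destroy_kernel(k) for k in created),
            return_exceptions=True,
        )
        raise errors[0]
    return results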


Limit sizes of scratch directories

Let's separate the functionality and refactor out the giant agent body.

  • Let's determine which design would be better:
    • A docker volume plugin
      • We will migrate to containerd and/or the CRI and CSI standards eventually. For now, let's implement it on our side.
    • An agent-specific abstract interface for scratch directory mounts
    • Could each design be used with k8s as well?
      • Let's first focus on the Docker backend.
  • Enforce a size limit on the container filesystem (the rest of the container except /home/work) (ref: docker API docs)
    • NOTE: For the overlay2 storage driver (our default setup), this limit only works if the backing filesystem is xfs. (ref: docker run command reference) Let's set this to the old device-mapper Docker default (10 GiB) for CentOS 7 environments via the storage-opt option.
  • Replace existing scratch dir mount implementation to use the new mechanism
  • Implement size limits for scratch directories
    • When creating kernels
      • Create a loopback-mounted file with limited size
      • mkfs on the file
      • mount the file as a host directory in the scratch root = the scratch directory
      • bind-mount the scratch directory into the container
    • When destroying kernels (after deleting the container)
      • umount the scratch directory
      • rm the loopback file
    • Wrap the above operations in an asyncio executor (see the sketch after this list)
  • Add the disk space to the manager's resource slot types and consider it in the scheduler (#49)
    • This will be handled in generalization of agent selection strategy
  • Provide configuration options
    • [agent] section in agent.toml
      • scratch-type: one of "hostdir", "hostfile", "xxx-plugin" (depending on what we implement)
        • "hostdir": the current implementation with no size enforcement
      • Reuse the scratch-size option, which already exists but is not used yet
      • Decide a "good" default size (at least 1 GiB, at most 10 GiB)
  • Perform stress-test
    • Run, as a kernel session, a code snippet that fills up all available space in the scratch directory (/home/work)
    • Spawn such kernel sessions to the limit of the agent host and repeat at least 10 times
  • Update documentation
    • Add a description of how to configure the resulting implementation to the README

Related: lablup/backend.ai-agent#60
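
A sketch of the hostfile-style scratch creation/destruction steps listed above (requires root; the sizes and command invocations are illustrative):

import asyncio
import subprocess
from pathlib import Path

def _create_scratch(scratch_root: Path, kernel_id: str, size_mib: int = 1024) -> Path:
    loop_file = scratch_root / f"{kernel_id}.img"
    mount_dir = scratch_root / kernel_id
    mount_dir.mkdir(parents=True, exist_ok=True)
    # 1) a sparse file with the limited size, 2) mkfs on it, 3) loop-mount it;
    # the mount point is then bind-mounted into the container as /home/work.
    subprocess.run(["truncate", "-s", f"{size_mib}M", str(loop_file)], check=True)
    subprocess.run(["mkfs.ext4", "-q", str(loop_file)], check=True)
    subprocess.run(["mount", "-o", "loop", str(loop_file), str(mount_dir)], check=True)
    return mount_dir

def _destroy_scratch(scratch_root: Path, kernel_id: str) -> None:
    # After the container is deleted: umount the scratch dir, then rm the loop file.
    subprocess.run(["umount", str(scratch_root / kernel_id)], check=True)
    (scratch_root / f"{kernel_id}.img").unlink()

async def create_scratch(scratch_root: Path, kernel_id: str) -> Path:
    # Blocking mount operations are pushed to the default asyncio executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _create_scratch, scratch_root, kernel_id)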

Document image metadata and etcd data structures

Let's document this old issue, which has since been implemented in many ways, such as the configuration APIs and enhanced CLI support.

Original issue content
  • Let's remove the necessity of managing etcd manually. Just pulling images in agents should be sufficient for users.
    • When scanning installed images in the agent, let's scan their metadata labels as well.
    • The following metadata will be used:
      • ai.backend.limits.*.min, ai.backend.limits.*.preferred, ai.backend.limits.*.soft
      • min is the minimum required amount.
      • preferred is the default amount.
      • There is no maximum limit: preferred is the default maximum limit and a scaling-group configuration can override it. (0 means no limit except agent's host limits)
  • Automatically generate aliases (while keeping manual aliases):
    • lablup/kernel-python-tensorflow:1.12-py36-cuda10
      => (tensorflow, python-tensorflow) x (:1.12-cuda, :1.12-py36-cuda, :1.12-cuda10, :1.12-py36-cuda10)
    • If there are multiple platform tag versions and the kernel creation request does not specify a specific version, prefer the latest ones.
    • The accelerator tags are always required when using accelerators of the corresponding type.

Make debian packages

Provide an easier way to install for Linux users, with shell auto-completion support (#92) natively activated.

Public shared virtual folder

Let's add the ability to create shared virtual folders for the public and for arbitrary (access key) groups.
This is required to run tutorials/workshops with large datasets with ease.

Support RStudio

rocker-org/rocker#295 (comment)
RStudio's execution model requires the root account since it manages non-root users by itself, spawning each session as a non-root user (usually "rstudio").

This conflicts with non-root single-user container scenarios, like ours or OpenShift's.

We need to support the following in entrypoint.sh when running RStudio images:

To customize the agent and kernel-runner behavior, let's add new kernel labels:

  • ai.backend.run-as-root=1 to enable bypassing gosu. (default: disabled)
  • ai.backend.envs.userid=USERID to set additional environment variables carrying the agent-indicated user IDs, similar to ai.backend.envs.corecount. (default: empty)

Optional support for GPU sharing with CUDA MPS

NVIDIA offers a proxy process that coalesces CUDA commands from multiple processes for concurrent kernel execution, called MPS (Multi-Process Service).
However, it requires the --ipc=host option for nvidia-docker setups (NVIDIA/nvidia-docker#419), which may compromise security in multi-tenant setups like Backend.AI.

Let's keep track of how this technology evolves and apply it to Backend.AI when appropriate.

First, we could make it an opt-in feature so that customers who run private Backend.AI clusters with semi-trusted users (e.g., employees of the same company) can benefit from the performance improvements.

Missing kernel creation parameter in API Documentation

One kernel creation parameter is missing from the API documentation. The current document and example show three parameters:

{
  "lang": "python3",
  "resourceLimits": {
    "maxMem": 51240,
    "timeout": 5000
  }
}

However, an additional clientSessionToken parameter is required (mandatory) to call the API.
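
The corrected example would presumably look like this (the token value is illustrative):

{
  "lang": "python3",
  "clientSessionToken": "EXAMPLE-TOKEN-1234",
  "resourceLimits": {
    "maxMem": 51240,
    "timeout": 5000
  }
}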

Enhance vector drawing library in sorna-media

We have a very early implementation of a vector drawing library in sorna-media.
It allows users to write drawing code in Python but see the result in the browser on codeonweb.com. Currently we use a home-brewed JavaScript library that parses the msgpack'ed data generated by the sorna-media Python library and renders shapes using fabric.js.

It currently supports drawing simple lines, rectangles, circles, and triangles with configurable colors and line widths, plus animation by translation, fade in/out, etc.

  • Goal 1: Refactor the current JavaScript/Python libraries, which are composed of monolithic long function bodies, so that we can add new features with ease.
  • Goal 2: Extend the drawing API to support more various types of shapes such as bezier curves and text output. Also add support for specifying more detailed shape attributes such as patterns and gradients, as well as animation of those attributes.
  • Goal 3: Implement/port a turtle API on top of the drawing API.

As long as you keep the user-facing Python API compatible (which of course requires a lot of extension), all implementation details are fully up to you!

Update API authentication for streaming-first communication

Currently our API includes the body payload when computing authentication signatures.
This is good for strong authentication of short messages.

However, as we are migrating to streaming-first environments, body-included authentication raises several issues:

  1. When the request/response bodies are fully streamed, there are many edge cases that complicate implementing a "correct" body-included authentication scheme, such as checking for the existence & combinations of Content-Length / Transfer-Encoding headers and multipart payloads.
    The current API implementation simply treats the body as an empty string (b'') for bodyless and multipart requests, but this is counter-intuitive.

  2. When an API request has a very long body with a finite Content-Length, we must read the entire body even when the request fails authentication. (We cannot shut down the connection in such cases!) This wastes our server resources and may be a potential attack surface for DDoS.

  3. The current Python SDK implementation has potential bugs around aiohttp request context management, such as reading a streaming response outside of its request context. This has worked by luck until now, but we should avoid designs disallowed by aiohttp.

  4. Implementing a transparent & high-performance API proxy becomes a lot more complicated because we cannot simply stream requests/responses up and down but must inspect the headers and the full body depending on the header combinations.

So I am going to change the authentication scheme as follows:

  • The calculation of the authentication signature no longer includes any body bytes. The body is simply assumed to be an empty (zero-length) string.
  • Whatever combination of Content-Length / Transfer-Encoding headers and multipart messages is used, the API gateway performs authentication against the headers only, upfront, and drops the connection as soon as possible if the signature verification fails.
  • Not authenticating request bodies is not a big security drawback, as we have always recommended using HTTPS for production deployments.
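
A sketch of what header-only signing looks like; the actual header set, canonicalization, and key derivation follow the Backend.AI API spec, not this snippet:

import hashlib
import hmac

def sign_request(secret_key: str, method: str, path: str, date: str) -> str:
    # The body is always treated as an empty byte string now, regardless of
    # the Content-Length / Transfer-Encoding / multipart combinations.
    empty_body_hash = hashlib.sha256(b"").hexdigest()
    msg = "\n".join([method.upper(), path, date, empty_body_hash])
    return hmac.new(secret_key.encode(), msg.encode(), hashlib.sha256).hexdigest()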

Potential impacts of this change:

  • We need to notify our customers in various places so they can update their API clients.
  • We need to coordinate updates of the cloud.backend.ai service and CodeOnWeb.

Reduce the size of kernel docker images

In the Sorna REPL repository and its Docker Hub counterpart, we maintain several container images for different programming environments.

Currently, most images are over 400 MB and often 1~2 GB in size. This prevents fast iteration of testing and deployment. Let's reduce the image sizes using various techniques, such as sharing a common base image, building from smaller base images, using prebuilt system and binary packages, and/or removing unnecessary packages.

Another approach is to use rocker so that build processes can utilize shared pip cache directories, etc.

Enhance testing and CI

We are using pytest and Travis CI for testing and continuous integration.
However, we still have limited code coverage, and the test cases lag behind due to eager feature additions and code refactoring.

The goal: raise the code coverage as much as possible, by adding or modifying test cases.

The benefits:

  • You will get familiar with unit testing in Python. We hope that afterwards you will find it natural to mkdir tests from the very beginning of all your future Python projects.
  • You will get guidance on how to write "good" tests as well as realistic tests, using various techniques including mocks and virtual servers/clients.

The challenges:

  • You need to understand how the code works and what the intended behavior is. The code may have bugs (of course) and you might find surprising mistakes!
  • Our codebase changes fast, as we implement new features and refactor them at a fast pace. Modifying existing test cases may require some time-consuming repetitive work. Your editor/regex/scripting/typing skills will shine here.
  • The internal functions and APIs are not well-documented. Some small functions are trivial, but others are not. You need good communication skills to ask questions frequently and discuss them with us.

Coverage:

  • You may choose all or a subset of the Sorna subprojects to work with.
  • The sorna-media project also has JavaScript code; it would also be interesting if you want to play with testing in JavaScript.

