
lablup / backend.ai


Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.

Home Page: https://www.backend.ai

License: GNU Lesser General Public License v3.0

Python 94.43% Shell 1.58% Dockerfile 0.05% Starlark 0.56% Go 0.04% Java 0.04% Mako 0.01% CSS 0.34% Jinja 0.09% Vim Script 0.01% C 0.08% HTML 2.55% JavaScript 0.23%
python docker distributed-computing api documentation cloud-computing backendai containers hpc monitoring

backend.ai's People

Contributors

achimnol, adrysn, agatha197, chisacam, dependabot[bot], fregataa, hephaex, hoyajigi, hydroxyde, inureyes, jopemachine, kangjuseong, kimjinmyeong, kimjmin, kyujin-cho, leejiwon1125, lizable, minseokey, mirageoasis, nayeonkeum, pderer, qkoo0833, rapsealk, sangwonyoon, sanxiyn, soheeeep, studioego, syeong2, yaminyam, zeniuus


backend.ai's Issues

Add more integration tests

First, read the official integration testing guide. Then let's implement it.

Some of the functionalities below are already covered by the manager's test cases, but we also need to test the APIs' input validation and verify that the client-side and server-side implementations match.

  • Kernel integration tests (tests/test_kernel.py)
    • Use this as a starting point and reference. (It may also require some updates, though.)
  • Admin integration tests (tests/test_admin.py) -> needs updates!
    • Create, delete, modify domains: use testing-XXXX format (with resource limits)
    • Create, delete, modify groups (with resource limits)
    • Create, delete, modify keypairs and users in different domains and groups (with resource limits)
    • Check if the current user's resource limit reflects the domain/group limits as well if configured.
  • Advanced kernel integration tests (tests/test_kernel.py)
    • As a fixture, you need to create a domain, a group, and a user who belongs to them.
    • Execute with custom environment variables using the above user
    • Execute via websockets (stream APIs) using the above user
    • Execute with the batch mode's build/clean commands, including cases with both satisfying and exceeding the batch-mode API's file number/size limits
    • Activate service ports after creating kernels, and check if they give valid responses (e.g., some HTML codes for web-based container services like Jupyter)
    • Group/domain segregation
      • As a fixture, you need to create two or more domain/group/user sets.
      • Check that running kernels for a specific domain and group are not listed to users in a different domain and group.
    • Resource limits
      • Execute with different resource limits and check/hit those limits (e.g., check the number of CPU cores, check if an OOM event (forced termination) occurs)
      • Try to create kernels whose creation configs exceed the configured domain, group, user (keypair) resource limits and check if the limits are enforced as expected.
  • VFolder integration tests (tests/test_vfolder.py)
    • Create and delete personal vfolders, check them using the listing API
    • Hit the vfolder limit of the keypair resource policy by creating too many vfolders
    • Upload/download small (10 MiB) and large (10 GiB) randomly generated files with hash checks
    • Group vfolders
      • As a fixture, you need to create two or more groups and corresponding domain admin users.
      • Create and delete group vfolders, check them using the listing API
      • Check if users who belong to only a specific group cannot list & access another group's vfolders.
    • Domain segregation
      • As a fixture, you need to create two or more domain/group/user sets.
      • Check that vfolders in a specific domain cannot send/receive invitations to/from users in a different domain and group.
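
As a concrete illustration of the fixture items above, here is a minimal pytest sketch for the domain/group/user setup. It assumes a hypothetical async admin client; create_domain(), create_group(), create_user(), and their delete counterparts are illustrative names, not the actual client API.

import secrets

import pytest

@pytest.fixture
async def domain_group_user(admin_client):
    # Names follow the testing-XXXX convention so leftovers are easy to spot.
    suffix = secrets.token_hex(4)
    domain = await admin_client.create_domain(f"testing-{suffix}", total_resource_slots={"cpu": "4"})
    group = await admin_client.create_group(f"testing-{suffix}", domain=domain.name)
    user = await admin_client.create_user(f"testing-{suffix}@example.com", domain=domain.name, group=group.name)
    yield domain, group, user
    # Clean up in reverse order of creation.
    await admin_client.delete_user(user.email)
    await admin_client.delete_group(group.name)
    await admin_client.delete_domain(domain.name)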

Add physical hardware information for admins

In the admin GraphQL queries for agents, let's include physical hardware information including:

  • CPU architecture, vendor, family, cores, frequency, feature flags
  • Memory
  • NUMA or not, number of NUMA nodes
  • Accelerators (driver/runtime version, compute capability, processors and memory, etc.)
    • Include utilization reports like nvidia-smi
  • Disk size

In conjunction with lablup/backend.ai-manager#103, let's add the following to the agent:

  • Add collect_live_stats() abstract method to AbstractComputeDevice
  • Add get_physical_info() abstract method to AbstractComputeDevice
  • Add collect_live_stats_summary() abstract method to AbstractComputePlugin
  • Add get_physical_info_summary() abstract method to AbstractComputePlugin

All of the above methods should return an arbitrary JSON-serializable dict.
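
A rough sketch of these interface additions (the four method names come from this issue; the base-class details and return types are simplified):

from abc import ABCMeta, abstractmethod
from typing import Any, Mapping

class AbstractComputeDevice(metaclass=ABCMeta):

    @abstractmethod
    async def collect_live_stats(self) -> Mapping[str, Any]:
        """Per-device live utilization (like nvidia-smi), as a JSON-serializable dict."""

    @abstractmethod
    async def get_physical_info(self) -> Mapping[str, Any]:
        """Static hardware details (vendor, cores, memory, ...), as a JSON-serializable dict."""

class AbstractComputePlugin(metaclass=ABCMeta):

    @abstractmethod
    async def collect_live_stats_summary(self) -> Mapping[str, Any]:
        """Aggregated live stats across all devices managed by this plugin."""

    @abstractmethod
    async def get_physical_info_summary(self) -> Mapping[str, Any]:
        """Aggregated physical info across all devices managed by this plugin."""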

Our own PyPI repository to host Backend.AI-optimized Python packages

We are now managing our own builds of TensorFlow and precompiled Python packages for Alpine Linux.

Let's host them on a private PyPI repository so that we can simplify the docker-build steps here.
For instance, we could run a separate script to build those packages using containers and automatically upload them to the private PyPI repository. Then downloading/installing those packages inside the Dockerfiles becomes much simpler than using multi-stage builds. (The big shortcoming of multi-stage builds is that it is difficult to control caches and intermediate image names/tags.)

Previously it was cumbersome and resource-consuming to host a private PyPI repository because we had to run a dedicated server. But now, there is a tool called s3pypi which allows us to build and host a PyPI repository with only S3 static website hosting + CloudFront.

Suggested repository URLs:

  • https://pypi.backend.ai/kernels/alpine/3.8/
  • https://pypi.backend.ai/kernels/ubuntu/16.04/
  • https://pypi.backend.ai/kernels/ubuntu/18.04/

This could eliminate the use of our "build.py" script and its complicated build chains.

Also, @tlqaksqhr's https://github.com/lablup/backend.ai-packages could be hosted in this way as well, probably via:

  • https://pypi.backend.ai/dist/ubuntu/16.04/ or something similar.
    In this case the repository is for distributing Backend.AI itself.
    Specifically, we could also make a secret repository for enterprise customers.

Kernel get-or-create error when agent is temporarily lost

2018-10-05 05:28:57 WARNING ai.backend.manager.registry [Process-0] agent i-033... heartbeat timeout detected.
2018-10-05 05:28:58 INFO ai.backend.gateway.kernel [Process-1] GET_OR_CREATE (u:AKIA..., lang:python-tensorflow, token:a4d36...)
2018-10-05 05:28:58 ERROR ai.backend.gateway.kernel [Process-1] GET_OR_CREATE: unexpected error!
Traceback (most recent call last):
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 330, in get_or_create_session
    kern = await self.get_session(sess_id, access_key)
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 274, in get_session
    raise KernelNotFound
ai.backend.gateway.exceptions.KernelNotFound: (404, 'Not Found', 'https://api.backend.ai/probs/kernel-not-found')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/devops/backend.ai-manager/ai/backend/gateway/kernel.py", line 106, in create
    conn=conn)
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 340, in get_or_create_session
    conn=conn)
  File "/home/devops/backend.ai-manager/ai/backend/manager/registry.py", line 446, in create_session
    result = await conn.execute(query)
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/utils.py", line 70, in __await__
    resp = yield from self._coro
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/sa/connection.py", line 110, in _execute
    yield from cursor.execute(str(compiled), post_processed_params[0])
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/cursor.py", line 114, in execute
    yield from self._conn._poll(waiter, timeout)
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/connection.py", line 238, in _poll
    yield from asyncio.wait_for(self._waiter, timeout, loop=self._loop)
  File "/home/devops/.pyenv/versions/3.6.4/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
    return fut.result()
  File "/home/devops/venv/lib/python3.6/site-packages/aiopg/connection.py", line 135, in _ready
    state = self._conn.poll()
psycopg2.IntegrityError: duplicate key value violates unique constraint "ix_kernels_unique_sess_token"
DETAIL:  Key (access_key, sess_id)=(AKIA..., a4d36...) already exists.

This happens when a user tries to reuse a kernel on a temporarily lost agent after previous kernel launches on that agent have succeeded.
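
One possible mitigation, sketched below: have get_or_create_session() treat the unique-constraint violation as "the session already exists" and re-query, instead of letting the IntegrityError propagate. The method names mirror the traceback; the control flow is illustrative, not the actual registry code.

import psycopg2

from ai.backend.gateway.exceptions import KernelNotFound

async def get_or_create_session(self, sess_id, access_key, conn=None):
    try:
        return await self.get_session(sess_id, access_key)
    except KernelNotFound:
        pass
    try:
        return await self.create_session(sess_id, access_key, conn=conn)
    except psycopg2.IntegrityError:
        # A record with the same (access_key, sess_id) already exists, e.g.,
        # a kernel on a temporarily lost agent; return it instead of failing.
        return await self.get_session(sess_id, access_key)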

How do I set up keys in an on-premise installation?

I've set up Backend.AI (manager, agent, client) on a local server.
Then I set BACKEND_ENDPOINT for the client.

However, I couldn't find the keys for my local manager:
BACKEND_ACCESS_KEY
BACKEND_SECRET_KEY

Where can I find or set them?

Automatic instance recycling

There is a private report that after creating/destroying Docker containers more than 20K times (!), the Linux kernel panics. We do not have experimentally confirmed results yet, though.

Let's add automatic instance recycling (e.g., reboot after 1K kernel executions) as a safety measure, if the same problem is confirmed.


Per-user web shell

Google Cloud provides a personalized web shell so that users don't have to install the gcloud CLI tools themselves but can use them in their web browsers. (official docs)

Let's provide something similar via our cloud.backend.ai and local web UI.

Features:

  • Pre-configured access/secret keys
  • Pre-installed latest stable version of backend.ai command-line tools
  • Auto-mounts of all virtual folders owned by the user and shared with the user
  • 1 GiB of free personal shell storage - need to implement lablup/backend.ai-agent#86
  • One-click reset option in case of any errors and problems

I think we could build a minimal CLI-dedicated kernel image and use it via a new set of web-shell APIs based on the current streaming APIs.
We also need to update the backend.ai-media library and embed it into our work-in-progress web console app for seamless integration.


Workers die with an error when aiotools num_workers is set to the number of CPU cores

If I set num_workers to the number of CPU cores, as mentioned in the title, an error occurs.
So I specified 20 workers instead. Is there any other way to solve this?

$ grep -c processor /proc/cpuinfo
40
ERROR aiotools.server [Process-7] Worker 7: Error during initialization
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiotools/server.py", line 56, in _worker_main
    loop.run_until_complete(ctx.__aenter__())
  File "uvloop/loop.pyx", line 1203, in uvloop.loop.Loop.run_until_complete (uvloop/loop.c:25636)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiotools/context.py", line 78, in __aenter__
    return (await self._agen.__anext__())
  File "/home/ubuntu/backend.ai-manager/ai/backend/gateway/server.py", line 284, in server_main
    await gw_init(app)
  File "/home/ubuntu/backend.ai-manager/ai/backend/gateway/server.py", line 188, in gw_init
    timeout=30, pool_recycle=30,
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/utils.py", line 70, in __await__
    resp = yield from self._coro
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/sa/engine.py", line 70, in _create_engine
    pool_recycle=pool_recycle, **kwargs)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/utils.py", line 65, in __iter__
    resp = yield from self._coro
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/pool.py", line 47, in _create_pool
    yield from pool._fill_free_pool(False)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/pool.py", line 209, in _fill_free_pool
    **self._conn_kwargs)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/utils.py", line 65, in __iter__
    resp = yield from self._coro
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/connection.py", line 74, in _connect
    yield from conn._poll(waiter, timeout)
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/connection.py", line2018-03-30 14:02:57 INFO ai.backend.gateway.server [Process-6] shutting down...
 238, in _poll
    yield from asyncio.wait_for(self._waiter, timeout, loop=self._loop)
  File "/home/ubuntu/.pyenv/versions/3.6.4/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
    return fut.result()
  File "/home/ubuntu/.pyenv/versions/venv-manager/lib/python3.6/site-packages/aiopg/connection.py", line 135, in _ready
    state = self._conn.poll()
psycopg2.OperationalError: FATAL:  sorry, too many clients already
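
For context: the FATAL error comes from PostgreSQL's max_connections limit (100 by default). Each gateway worker creates its own aiopg connection pool, so the total is roughly num_workers × pool maxsize (aiopg's default maxsize is 10), which easily exceeds 100 with 40 workers. A hedged sketch of dividing a global connection budget across workers (the budget numbers and connection parameters are illustrative):

import aiopg.sa

async def create_db_engine(num_workers: int, db_max_connections: int = 100):
    # Reserve some headroom for maintenance/superuser connections.
    budget = int(db_max_connections * 0.9)
    per_worker = max(1, budget // num_workers)
    return await aiopg.sa.create_engine(
        host="localhost", dbname="backend", user="postgres",
        minsize=1, maxsize=per_worker,
        timeout=30, pool_recycle=30,
    )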

Accuracy badge for public ML models

Idea by @serialx.

  • Let's provide "accuracy" badges for public ML models – visiting their repositories shows the accuracy value as a badge in the README (like CI build status badges)
  • Feature ideas
    • per-dataset / per-organization ranking
    • per-commit history (i.e., how has the accuracy changed over time?)
    • public dashboard to compare various open-source ML repositories


Self auto-upgrade when there are no kernels

Let each agent automatically check for updates and upgrade itself when it has had no running kernels for a specified amount of time (e.g., 1 min). This should be controllable via configuration, and the version range should be restricted to the same minor series.

Optionally, during upgrades, let it pull the latest docker images.
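
A rough sketch of the idle check and the minor-series restriction (the agent attributes and the release-lookup/upgrade helpers are hypothetical):

from importlib.metadata import version

from packaging.version import Version

async def maybe_self_upgrade(agent, idle_threshold: float = 60.0) -> None:
    # Only consider upgrading after the agent has been kernel-free long enough.
    if agent.running_kernel_count > 0 or agent.idle_seconds < idle_threshold:
        return
    current = Version(version("backend.ai-agent"))
    latest = Version(await agent.fetch_latest_release())  # hypothetical: e.g., query the PyPI JSON API
    # Restrict the upgrade range to the same minor series.
    if (latest.major, latest.minor) == (current.major, current.minor) and latest > current:
        await agent.pull_latest_kernel_images()  # the optional step from this issue
        await agent.run_self_upgrade(latest)     # hypothetical: pip upgrade + restart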

Re-establish DB connection pool when DB failover happens

Occasionally, cloud DB instances go through an automatic fail-over process.

  • Change the configuration to use a DNS name instead of a static IP for DB.
  • If possible, detect the connection resets and re-create the DB connection pool.
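
A sketch of the second bullet for an aiopg-style engine (the probe query and the DNS name are illustrative):

import aiopg.sa
import psycopg2

async def ensure_db_engine(app) -> None:
    # Probe the pool; if connections were reset by a fail-over, rebuild the pool.
    try:
        async with app["db_engine"].acquire() as conn:
            await conn.execute("SELECT 1")
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        app["db_engine"].close()
        await app["db_engine"].wait_closed()
        # Reconnect through the DNS name so we follow the failed-over instance.
        app["db_engine"] = await aiopg.sa.create_engine(
            host="db.internal.example.com", dbname="backend", user="postgres")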

Extend the manager CLI

To inspect the current registry, we currently have to access the Redis database directly or issue API calls.
Let's make a simple interactive shell to inspect and control the registry.
It will be extremely useful in service operation.

  • Manage DB fixtures.
    • List fixtures
    • Populate fixtures
    • Reset fixtures
  • List current running instances.
    • Filter instances by having running kernels or not.
    • Pagination
  • List current running kernels.
    • Filter by users and entries
    • Pagination
  • Ping instances and kernels.
  • Manually add instance.
  • Manually destroy instance and kernel.
  • Manually schedule kernel clean up operation.
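
A minimal sketch of what such a management CLI entry could look like, assuming a Click-based command group (the registry access itself is elided):

import click

@click.group()
def cli():
    """Inspect and control the kernel registry."""

@cli.command()
@click.option("--with-kernels/--without-kernels", default=None,
              help="Filter instances by whether they have running kernels.")
@click.option("--page", default=1, show_default=True, help="Page number for pagination.")
def instances(with_kernels, page):
    """List current running instances."""
    # Placeholder: query the registry (Redis) and print a table of agents.
    ...

if __name__ == "__main__":
    cli()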


Extra functionalities for shared vfolders

  • Renaming of private vfolders and of shared vfolders that have no invitees or joined collaborators
  • Leaving shared vfolders that I have already joined
  • Kicking out collaborators from self-owned shared vfolders
  • Forking/cloning shared (read-only) vfolders into my (read-writable) private vfolders

Fully functional docker-compose configuration

Currently, launching Sorna's server-side gateway with working agents from scratch is very difficult. Let's add a docker-compose configuration so that new developers can run it instantly.

Agent backend abstraction

Currently we use only Docker containers as the kernel session host.
However, this may be extended to other types of services, such as cloud vendor-specific remote APIs or local processes for debugging and testing purposes.
For instance, we could implement a "data agent" by specializing a subset of agents in a cluster to manage a memory-cached distributed filesystem such as Alluxio.

  • Configurable driver selection on startup from agent.toml
  • #363
  • #865
  • Report the list of active drivers when sending heartbeats to the manager
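
One way the backend abstraction could look (a sketch with illustrative names; not the actual agent code):

from abc import ABCMeta, abstractmethod

class AbstractKernelBackend(metaclass=ABCMeta):
    """A backend that hosts kernel sessions: Docker, a cloud API, or bare processes."""

    name: str  # reported to the manager in heartbeats as an active driver

    @abstractmethod
    async def create_kernel(self, kernel_id: str, config: dict) -> None: ...

    @abstractmethod
    async def destroy_kernel(self, kernel_id: str) -> None: ...

class DockerBackend(AbstractKernelBackend):
    name = "docker"

    async def create_kernel(self, kernel_id: str, config: dict) -> None: ...

    async def destroy_kernel(self, kernel_id: str) -> None: ...

def load_backend(agent_config: dict) -> AbstractKernelBackend:
    # Configurable driver selection on startup, e.g., [agent] backend = "docker" in agent.toml.
    registry = {"docker": DockerBackend}
    return registry[agent_config["agent"]["backend"]]()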

User activity logs

Some customers want to keep detailed command logs executed by their users due to their security audit policies.
Let's add an option to store all query-mode and batch-mode code snippets.

1st phase:

  • Add an etcd configuration option
  • Add the activity_logs database table
  • Add an admin API to retrieve activity logs

2nd phase:

  • Extend jail to log system call arguments and return values
    • This will incur non-negligible performance overhead. We should implement this very efficiently, buffering & compressing the logs on the fly, as hundreds of syscalls may be issued per second. (See the sketch below.)
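
A sketch of the on-the-fly buffering & compression mentioned above (the threshold and storage step are illustrative):

import zlib

class ActivityLogBuffer:
    """Accumulate log records and flush them as a single compressed blob."""

    def __init__(self, flush_threshold: int = 64 * 1024) -> None:
        self._chunks = []
        self._size = 0
        self._flush_threshold = flush_threshold

    def append(self, record: bytes):
        self._chunks.append(record + b"\n")
        self._size += len(record) + 1
        if self._size >= self._flush_threshold:
            return self.flush()
        return None

    def flush(self) -> bytes:
        # One zlib.compress() per batch keeps the per-syscall overhead negligible.
        blob = zlib.compress(b"".join(self._chunks), 6)
        self._chunks.clear()
        self._size = 0
        return blob  # e.g., to be inserted into the activity_logs table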

Extend plug-in architecture

  • Plugin base interfaces (a plugin may provide multiple interfaces):
    • Web application: provides additional API endpoints
    • Auth provider: provides an additional authentication/authorization scheme
    • Scaling driver: provides an auto-scaling driver
    • CLI command: provides additional manager commands
    • DB guest: declares additional DB tables
    • Kernel tracer: inserts hooks in the kernel lifecycle
    • others to come
  • Auto-discovery of installed plugins (like pytest)
    • Plugins should provide their metadata: name, version, and available interfaces.
  • Plugin inventory DB table for the manager
    • Includes plugin metadata, (discovered) import path, currently installed version, when (de)activated last time.
    • Additional fields:
      • config: json "{}"
      • scope: string "global"
        In the future, we may support scaling-group-specific plugins.
  • Manager commands to activate/deactivate plugins and list them
    • Current draft "BACKEND_EXTENSIONS" will be dropped if this is implemented.
    • The changes are applied upon next restart of the server.
      • DB table creation/drops also happen on the next restart.
    • Live reload is future work.
  • Public framework APIs for the plugin authors
    • Access to intrinsic DB tables/models
    • Access to the configuration
    • Interface-specific additions
  • Auth provider plugin
    • The base interface defines authenticate().
    • Target implementations: File-based, LDAP/AD-based, SAML-based
  • Web application plugin
    • The base interface is aiohttp.web.Application.
    • The plugin routes are added under "/ext/{plugin-id}/".
  • Scaling driver plugin
  • CLI command plugin
    • The base interface defines signature for adding custom commands.
    • Utilize register_command like intrinsic manager commands
    • Could we seamlessly support auto-completion like lablup/backend.ai-client-py#24 ?
  • DB guest plugin
    • The base interface defines signatures for creation, upgrades, and deletion of prefixed DB tables
    • On activation, the db tables are created.
    • On deactivation, the admin is asked whether to drop the tables or keep them at rest.
    • The upgrade is automatically performed via the plugin. The plugin must implement its own way to migrate its table schemas.
    • Plugin databases are not part of the admin GraphQL interface. The plugin should provide its own API endpoint via web application interface if such functionality is required.
  • Kernel tracer plugin
    • The base interface defines hook methods that are called upon kernel/session-related events (e.g., creation and termination)
    • It can also configure hooks to intercept execution requests and inspect/modify the requested code.
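
For the auto-discovery item, a sketch using package entry points, the same mechanism pytest plugins use (the entry-point group name is hypothetical):

from importlib.metadata import entry_points

def discover_plugins(group: str = "backendai.plugins"):  # hypothetical group name
    """Discover installed plugins via package entry points, like pytest does."""
    found = {}
    for ep in entry_points(group=group):  # Python 3.10+ keyword form
        plugin_cls = ep.load()
        meta = plugin_cls.metadata  # expected: name, version, available interfaces
        found[meta["name"]] = plugin_cls
    return found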

Implement network access restriction in sorna-jail

Sorna's "jail" subproject(1) is a seccomp-based sandbox written in Go. We use it in all kernel containers to prevent malicious user code from executing potentially dangerous system calls, as well as to enforce our customized ACLs on file systems and networks.

(1) This will be moved to a separate repository.

Already implemented:

  • Limitation of the maximum allowed number of threads/processes
  • seccomp-based system call filter

Half-implemented:

  • Path-based file system operation check: reading the path string from syscall arguments works, but there is no detailed policy implementation yet. This would be a good practice task before getting into the network restriction work.

To do for you:

  • Host-based and IP-based network connection restriction. For example, allow only HTTPS/SSH access to GitHub but forbid network connections to everything else.
    • This requires intercepting DNS resolution and the connect() system call, with some inspection of the socket file descriptor.


Shell Auto-completion Support

  • Implement a completion entry point that is aware of our subcommands.
  • Extend it to support completion of remote session names.
  • Guide the user on how to install shell completions after installation.

Consider using argcomplete?
Consider migrating to argh?
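
If we go with argcomplete, the wiring could look roughly like this; a real session-name completer would query the manager API, while the one below returns canned names for illustration:

#!/usr/bin/env python
# PYTHON_ARGCOMPLETE_OK
import argparse

import argcomplete

def session_names(prefix, parsed_args, **kwargs):
    # Would query the manager for live session names; canned values for illustration.
    return [s for s in ("sess-alpha", "sess-beta") if s.startswith(prefix)]

parser = argparse.ArgumentParser(prog="backend.ai")
subparsers = parser.add_subparsers(dest="command")
terminate = subparsers.add_parser("terminate")
terminate.add_argument("session").completer = session_names
argcomplete.autocomplete(parser)
args = parser.parse_args()

Users would then register the completer in their shell, e.g., eval "$(register-python-argcomplete backend.ai)".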

Internal ticket: OP#707

Prepare open-source release

  • Add LICENSE files to each sub-project.
  • Fill setup.py with rich information (license, classifiers, etc.)
  • Clean up / reorganize milestones.
  • Register packages to PyPI (Python Package Index).
  • Update documentation

Rolling update of agents

  • Check if etcd could be used as a distributed lock coordinator so that only a small fraction of the instances enters the updating state.
  • Implement the locking mechanism.
  • Notify the manager so that no new kernel requests come in.
  • Implement the self-update procedure with automatic restarts.

This issue is linked with lablup/backend.ai-agent#43.
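
A sketch of the etcd-based lock for the first two items, using the python-etcd3 client (the host, lock name, TTL, and the drain/update helpers are illustrative):

import etcd3

def notify_manager_draining() -> None: ...  # hypothetical: stop accepting new kernels

def run_self_update() -> None: ...          # hypothetical: pip upgrade + restart

def try_rolling_update() -> bool:
    client = etcd3.client(host="etcd.internal", port=2379)
    # The lock TTL bounds how long a crashed updater can block the others.
    lock = client.lock("agent-rolling-update", ttl=120)
    if not lock.acquire(timeout=5):  # give up quickly if another instance is updating
        return False
    try:
        notify_manager_draining()
        run_self_update()
    finally:
        lock.release()
    return True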


Bundling containers for a kernel session

For large-scale computations, sometimes we need to run multiple containers on different hosts for resource aggregation and distributed/parallel processing.

In the past, this was very difficult to implement because Docker's networking was limited to linking a container to another via a hostname alias (--link), which is essentially a one-to-one private link. Now it's 2017, and Docker offers a nice distributed coordination mode called "Swarm" which includes overlay networking.

Docker Swarm uses the Raft algorithm to share node information, and any new Docker daemon can join an existing Swarm via host:port and a secret token. Once joined, the containers of any daemon in the swarm can be connected to volatile overlay networks created and destroyed at runtime.

Let's try this and support multi-container distributed computing!

Update for 2020!

Docker Swarm has problems with overlapping IP addresses across different overlay networks, and creating/destroying and attaching/detaching networks has proven to be unstable.

After some testing by @kyujin-cho, we decided to fall back to the "classic" Swarm mode, which uses an external etcd to manage multi-host networks, and to use namespaced container hostname aliasing to access other containers in the same overlay network.

Basically we keep the same "kernels" table extension as we prototyped in 2017-2018. A single record of the kernels table corresponds to a container, and multiple records may share the same sess_id, indicating that they belong to an overlay cluster.

Phase 1

  • Automatic reconfiguration of existing Docker daemons in agents so that they use our etcd.
  • Add a simple integration test script that manually spawns two containers on different hosts to check that the overlay network actually works. Let's place the script in the scripts directory of this meta repository.

Phase 2

  • Design and implement a template format for a compute session. → #70
    • A task template contains the target image, resource occupation, environment variables, default vfolder mounts, bootstrap script, etc.
  • Design and implement a template format for a cluster composed of session templates.
    • The cluster template may declare multiple "roles" of containers and the min/max numbers of containers for each role.
    • Each container should have special environment variables so that they can detect the role and index. (e.g., BACKEND_CLUSTER_ROLE, BACKEND_CLUSTER_ROLE_IDX)
  • (Re-)implement lifecycle mgmt. of multiple containers for a single session
    • First, use a plain for-loop to trigger creation/destruction of each kernel.
    • Set up custom hostnames via Docker daemon/API so that all containers in the same overlay cluster can access each other.
      • e.g., Spawning 2 "master" and 12 "worker" nodes: "master1", "master2", "worker1", "worker2", ..., "worker12"
      • Don't pad zeros for lexicographical ordering because it causes extra complexity in automated scripts for images and task templates.
    • We need to extend the semantics of kernel lifecycle operations:
      • All lifecycle operations against a session must wait until the same operations have completed for all kernels of that session.
      • Restarting and terminating any of the kernels triggers the same operation of the whole overlay cluster.
    • Kernel creation options like environment variables and vfolder mounts must be applied to all kernels of a session; they override those defined by task templates, which in turn override those defined in the images.
  • Extend the GUI to have a "collapsible" kernel list so that kernels for a single session are grouped and folded together.
    • The primary kernel of a session (which is exposed in the main session list) is the first indexed kernel of the first role as in the defined order.
    • Each kernel with different roles may define their own service ports and those service ports are transparently supported.

Phase 3

  • Keep consistency when the manager is interrupted (e.g., restarted) during multi-container spawning and destruction (i.e., partially provisioned).
    • We may need to add a column specifying "desired state" so that we could resume and continue the cluster provisioning jobs upon manager restarts.
  • Optimize the provisioning using asyncio.gather() with proper interruption handling.
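
A sketch of the gather-based provisioning with interruption handling (create_kernel/destroy_kernel are stand-ins for the real lifecycle calls):

import asyncio

async def create_kernel(spec): ...     # stand-in for the real creation call

async def destroy_kernel(kernel): ...  # stand-in for the real destruction call

async def provision_cluster(kernel_specs):
    # Spawn all kernels concurrently; on any failure, tear down the survivors.
    results = await asyncio.gather(
        *(create_kernel(spec) for spec in kernel_specs),
        return_exceptions=True,
    )
    errors = [r for r in results if isinstance(r, BaseException)]
    if errors:
        created = [r for r in results if not isinstance(r, BaseException)]
        await asyncio.gather(
            *(destroy_kernel(k) for k in created),
            return_exceptions=True,
        )
        raise errors[0]
    return results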


Limit sizes of scratch directories

Let's separate the functionality and refactor out the giant agent body.

  • Let's determine which design would be better:
    • A docker volume plugin
      • We will migrate to containerd and/or the CRI and CSI standards eventually. For now, let's implement it on our side.
    • An agent-specific abstract interface for scratch directory mounts
    • Could each design be used with k8s as well?
      • Let's first focus on the Docker backend.
  • Enforce a size limit on the container filesystem (the rest of the container except /home/work) (ref: docker API docs)
    • NOTE: For the overlay2 storage driver (our default setup), this limit only works if the backing filesystem is xfs. (ref: docker run command reference) Let's set this to the old device-mapper Docker default (10 GiB) for CentOS 7 environments via the storage-opt option.
  • Replace existing scratch dir mount implementation to use the new mechanism
  • Implement size limits for scratch directories
    • When creating kernels
      • Create a loopback-mounted file with limited size
      • mkfs on the file
      • mount the file as a host directory in the scratch root = the scratch directory
      • bind-mount the scratch directory into the container
    • When destroying kernels (after deleting the container)
      • umount the scratch directory
      • rm the loopback file
    • Wrap the above operations in an asyncio executor (see the sketch after this list)
  • Add the disk space to the manager's resource slot types and consider it in the scheduler (#49)
    • This will be handled in generalization of agent selection strategy
  • Provide configuration options
    • [agent] section in agent.toml
      • scratch-type: one of "hostdir", "hostfile", "xxx-plugin" (depending on what we implement)
        • "hostdir": the current implementation with no size enforcement
      • Reuse the scratch-size option, which already exists but is not used yet
      • Decide a "good" default size (at least 1 GiB, at most 10 GiB)
  • Perform stress-test
    • Run, as a kernel session, a code snippet that fills up all available space in the scratch directory (/home/work)
    • Spawn such kernel sessions to the limit of the agent host and repeat at least 10 times
  • Update documentation
    • Add a description of how to configure the resulting implementation to the README

Related: lablup/backend.ai-agent#60
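
A sketch of the hostfile-style scratch creation/destruction steps listed above (requires root; the sizes and command invocations are illustrative):

import asyncio
import subprocess
from pathlib import Path

def _create_scratch(scratch_root: Path, kernel_id: str, size_mib: int = 1024) -> Path:
    loop_file = scratch_root / f"{kernel_id}.img"
    mount_dir = scratch_root / kernel_id
    mount_dir.mkdir(parents=True, exist_ok=True)
    # 1) a sparse file with the limited size, 2) mkfs on it, 3) loop-mount it;
    # the mount point is then bind-mounted into the container as /home/work.
    subprocess.run(["truncate", "-s", f"{size_mib}M", str(loop_file)], check=True)
    subprocess.run(["mkfs.ext4", "-q", str(loop_file)], check=True)
    subprocess.run(["mount", "-o", "loop", str(loop_file), str(mount_dir)], check=True)
    return mount_dir

def _destroy_scratch(scratch_root: Path, kernel_id: str) -> None:
    # After the container is deleted: umount the scratch dir, then rm the loop file.
    subprocess.run(["umount", str(scratch_root / kernel_id)], check=True)
    (scratch_root / f"{kernel_id}.img").unlink()

async def create_scratch(scratch_root: Path, kernel_id: str) -> Path:
    # Blocking mount operations are pushed to the default asyncio executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _create_scratch, scratch_root, kernel_id)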

Document image metadata and etcd data structures

Let's document this old issue, which has since been implemented in many ways, such as the configuration APIs and enhanced CLI support.

Original issue content
  • Let's remove the necessity of managing etcd manually. Just pulling images in agents should be sufficient for users.
    • When scanning installed images in the agent, let's scan their metadata labels as well.
    • The following metadata will be used:
      • ai.backend.limits.*.min, ai.backend.limits.*.preferred, ai.backend.limits.*.soft
      • min is the minimum required amount.
      • preferred is the default amount.
      • There is no maximum limit: preferred is the default maximum limit and a scaling-group configuration can override it. (0 means no limit except agent's host limits)
  • Automatically generate aliases (while keeping manual aliases):
    • lablup/kernel-python-tensorflow:1.12-py36-cuda10
      => (tensorflow, python-tensorflow) x (:1.12-cuda, :1.12-py36-cuda, :1.12-cuda10, :1.12-py36-cuda10)
    • If there are multiple platform tag versions and the kernel creation request does not specify a specific version, prefer the latest ones.
    • The accelerator tags are always required when using accelerators of the corresponding type.

Make debian packages

Provide an easier way to install for Linux users, with shell auto-completion support (#92) natively activated.

Public shared virtual folder

Let's add the ability to create shared virtual folders for the public and for arbitrary (access key) groups.
This is required to run tutorials/workshops with large datasets with ease.

Support RStudio

rocker-org/rocker#295 (comment)
RStudio's execution model requires the root account since it manages non-root users by itself, spawning each session as a non-root user (usually "rstudio").

This conflicts with non-root single-user container scenarios, like ours or OpenShift's.

We need to support the following in entrypoint.sh when running RStudio images:

To customize the agent and kernel-runner behavior, let's add new kernel labels:

  • ai.backend.run-as-root=1 to enable bypassing gosu. (default: disabled)
  • ai.backend.envs.userid=USERID to set additional environment variables carrying the agent-indicated user IDs, similar to ai.backend.envs.corecount. (default: empty)

Optional support for GPU sharing with CUDA MPS

NVIDIA offers a proxy process that coalesces CUDA commands from multiple processes for concurrent kernel execution, called MPS (Multi-Process Service).
However, it requires the --ipc=host option for nvidia-docker setups (NVIDIA/nvidia-docker#419), which may compromise security in multi-tenant setups like Backend.AI.

Let's keep track of how this technology evolves and apply it to Backend.AI when appropriate.

First, we could make it an opt-in feature so that customers who run private Backend.AI clusters with semi-trusted users (e.g., employees of the same company) can benefit from the performance improvements.

Missing kernel creation parameter in API Documentation

One kernel creation parameter is missing from the API documentation. The current document and example show three parameters:

{
  "lang": "python3",
  "resourceLimits": {
    "maxMem": 51240,
    "timeout": 5000
  }
}

However, an additional clientSessionToken parameter is required (mandatory) to call the API.
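
The corrected example would presumably look like this (the token value is illustrative):

{
  "lang": "python3",
  "clientSessionToken": "EXAMPLE-TOKEN-1234",
  "resourceLimits": {
    "maxMem": 51240,
    "timeout": 5000
  }
}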

Enhance vector drawing library in sorna-media

We have a very early implementation of a vector drawing library in sorna-media.
It allows users to write drawing code in Python but see the result in the browser on codeonweb.com. Currently we use a home-brewed JavaScript library that parses the msgpack'ed data generated by the sorna-media Python library and renders shapes using fabric.js.

It currently supports drawing simple lines, rectangles, circles, and triangles with configurable colors and line widths, plus animation by translation, fade in/out, etc.

  • Goal 1: Refactor the current JavaScript/Python libraries, which are composed of monolithic long function bodies, so that we can add new features with ease.
  • Goal 2: Extend the drawing API to support more various types of shapes such as bezier curves and text output. Also add support for specifying more detailed shape attributes such as patterns and gradients, as well as animation of those attributes.
  • Goal 3: Implement/port a turtle API on top of the drawing API.

As long as you keep the user-facing Python API compatible (which of course requires a lot of extension), all implementation details are fully up to you!

Update API authentication for streaming-first communication

Currently our API includes the body payload when computing authentication signatures.
This is good for strong authentication of short messages.

However, as we are migrating to streaming-first environments, body-included authentication raises several issues:

  1. When the request/response bodies are fully streamed, there are many edge cases that complicate implementing a "correct" body-included authentication scheme, such as checking for the existence & combinations of Content-Length / Transfer-Encoding headers and multipart payloads.
    The current API implementation simply treats the body as an empty string (b'') for bodyless and multipart requests, but this is counter-intuitive.

  2. When an API request has a very long body with a finite Content-Length, we must read the entire body even when the request fails authentication. (We cannot shut down the connection in such cases!) This wastes our server resources and may be a potential attack surface for DDoS.

  3. The current Python SDK implementation has potential bugs around aiohttp request context management, such as reading a streaming response outside of its request context. This has worked by luck until now, but we should avoid designs disallowed by aiohttp.

  4. Implementing a transparent & high-performance API proxy becomes a lot more complicated because we cannot simply stream requests/responses up and down but must inspect the headers and the full body depending on the header combinations.

So I am going to change the authentication scheme as follows:

  • The calculation of the authentication signature no longer includes any body bytes. The body is simply assumed to be an empty (zero-length) string.
  • Whatever combination of Content-Length / Transfer-Encoding headers and multipart messages is used, the API gateway performs authentication against the headers only, upfront, and drops the connection as soon as possible if the signature verification fails.
  • Not authenticating request bodies is not a big security drawback, as we have always recommended using HTTPS for production deployments.
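
A sketch of what header-only signing looks like; the actual header set, canonicalization, and key derivation follow the Backend.AI API spec, not this snippet:

import hashlib
import hmac

def sign_request(secret_key: str, method: str, path: str, date: str) -> str:
    # The body is always treated as an empty byte string now, regardless of
    # the Content-Length / Transfer-Encoding / multipart combinations.
    empty_body_hash = hashlib.sha256(b"").hexdigest()
    msg = "\n".join([method.upper(), path, date, empty_body_hash])
    return hmac.new(secret_key.encode(), msg.encode(), hashlib.sha256).hexdigest()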

Potential impacts of this change:

  • We need to notify our customers in various places so they can update their API clients.
  • We need to coordinate updates of the cloud.backend.ai service and CodeOnWeb.

Reduce the size of kernel docker images

In the Sorna REPL repository and its Docker Hub counterpart, we maintain several container images for different programming environments.

Currently, most images are over 400 MB and often 1~2 GB in size. This prevents fast iteration of testing and deployment. Let's reduce the image sizes using various techniques, such as sharing a common base image, building from smaller base images, using prebuilt system and binary packages, and/or removing unnecessary packages.

Another approach is to use rocker so that build processes can utilize shared pip cache directories, etc.

Enhance testing and CI

We are using pytest and Travis CI for testing and continuous integration.
However, we still have limited code coverage, and the test cases lag behind due to eager feature additions and code refactoring.

The goal: raise the code coverage as much as possible, by adding or modifying test cases.

The benefits:

  • You will get familiar with unit testing in Python. We hope that afterwards you will find it natural to mkdir tests from the very beginning of all your future Python projects.
  • You will get guidance on how to write "good" tests as well as realistic tests, using various techniques including mocks and virtual servers/clients.

The challenges:

  • You need to understand how the code works and what the intended behavior is. The code may have bugs (of course) and you might find surprising mistakes!
  • Our codebase changes fast, as we implement new features and refactor them at a fast pace. Modifying existing test cases may require some time-consuming repetitive work. Your editor/regex/scripting/typing skills will shine here.
  • The internal functions and APIs are not well-documented. Some small functions are trivial, but others are not. You need good communication skills to ask questions frequently and discuss them with us.

Coverage:

  • You may choose all or a subset of the Sorna subprojects to work with.
  • The sorna-media project also has JavaScript code; it would also be interesting if you want to play with testing in JavaScript.

