
abaco's People

Contributors

branritz, deardooley, ehb54, jlooney, joestubbs, julianpistorius, kreshel, kwhitley54, mwvaughn, notchristiangarcia, nraufiero, waltermoreira

abaco's Issues

Executions should return msg and any query params

For the sake of reproducibility, executions should report the message and any query parameters that were sent. The message (but not the query parameters) is reported when the message is posted, and neither is apparently available when one polls an actor's /executions endpoint.

register actors from tarballs

Allow users to register actors from tarballs ('.tar', '.gz', '.tgz').

This issue depends upon having a solution to the image caching issue.

Extend actors with a schema (or schemas) field

For actors designed to receive JSON messages (a good majority of the current crop), it is difficult to know the format of a valid message. The sd2e tenant has adopted a convention of bundling a descriptive JSON Schema document with the deployed container, which is used to validate incoming messages. However, one must consult the source code, some sort of central registry, or the container itself to discover the appropriate format for a JSON message. I propose adding a schema field to the actor data model, populated at registration/update time and then returned either by default as part of an actor's listing or at a dedicated child endpoint. This enhances discoverability and permits the use of client tools that can render JSON Schema into user interfaces or language-specific objects. Here's an example schema from one of the sd2e actors:

{
	"$schema": "http://json-schema.org/draft-04/schema#",
	"title": "PipelinesJobStateEvent",
	"description": "Directly send a state-change event to a Pipelines Job",
	"type": "object",
	"properties": {
		"uuid": {
			"description": "a UUID referring to a known Pipeline Job",
			"type": "string"
		},
		"event": {
			"description": "a valid Pipeline Job state-change event",
			"type": "string",
			"enum": ["update", "run", "fail", "finish", "validate", "validated", "reject", "finalize", "retire"]
		},
		"data": {
			"description": "an object containing additional context about the event (optional)",
			"type": "object"
		},
		"token": {
			"description": "an authorization token issued when the job was created",
			"type": "string",
			"minLength": 16,
			"maxLength": 17
		},
		"__options": {
			"type": "object",
			"description": "an object used to pass runtime options to a pipeline (private, optional)"
		}
	},
	"required": ["uuid", "event", "token"],
	"additionalProperties": false
}

image caching

Currently, when the first worker on a given Docker host is created for an actor, the actor's image is pulled from Docker Hub. If the image has been updated on the hub since the actor was registered, the newer image will be pulled. This might not be desirable.

Instead, we should cache actor images within the Abaco system (perhaps in a private Docker registry).

Actors' lastUpdateTime does not change after update

Here's an extract of an actor record for reference. This entity has been updated several times, but lastUpdateTime never changes. This makes it hard to trust that an update has gone through when debugging on a rapid iteration cycle.

{ "_links": {
      "executions": "https://api.sd2e.org/actors/v2/w7LMK0k7JGZZQ/executions",
      "owner": "https://api.sd2e.org/profiles/v2/sd2eadm",
      "self": "https://api.sd2e.org/actors/v2/w7LMK0k7JGZZQ"
    },
    "createTime": "2018-04-19 16:32:00.506656",
    "defaultEnvironment": {},
    "description": "",
    "gid": 845002,
    "id": "w7LMK0k7JGZZQ",
    "image": "sd2e/agave-test:dev",
    "lastUpdateTime": "2018-04-19 16:32:00.506656" }

Add support for resource limiting

Proposal:
Global defaults for memory and CPU limits are provided as configuration values in the [workers] stanza. Operators with elevated privileges can override these defaults when registering actors. These values are stored with the actor in the actors_store and are retrieved (and enforced) by the workers.
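
A minimal sketch of how a worker might apply such limits when starting an actor container, using the Docker SDK for Python (2.x). The specific option names, values, and image are illustrative assumptions, not the actual Abaco configuration keys:

import docker

# Values that would come from the [workers] stanza (or a privileged per-actor
# override); these particular names and numbers are hypothetical.
mem_limit = "256m"     # hard memory cap for the actor container
cpu_period = 100000
cpu_quota = 50000      # with cpu_period=100000, caps the container at 0.5 CPU

client = docker.from_env()
container = client.containers.run(
    "abacosamples/test",   # example actor image
    detach=True,
    mem_limit=mem_limit,
    cpu_period=cpu_period,
    cpu_quota=cpu_quota,
)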

Implement streaming logs

One Abaco tenant rolled their own immediate-term logging system within their Docker images so that events could be captured from inside the execution and shipped to an ELK stack instance in real time, rather than waiting for the execution to complete. This has become important for monitoring and debugging complex, long-running, or highly stateful actors, and it also adds sophisticated search and analytics over the logs. Still, it seems a shame to duplicate Abaco's logging functionality. This is not a very concrete proposal, but it would be interesting to explore whether this capability could be pushed down to the service level.
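
As a rough illustration of what "pushing this down to the service level" could look like, the Docker SDK for Python (2.x) can already stream a container's logs while it runs; the container id and the forwarding destination below are placeholders:

import docker

client = docker.from_env()
container = client.containers.get("actor-execution-container")  # placeholder id

# Stream log lines as they are produced rather than waiting for the execution to finish.
for line in container.logs(stream=True, follow=True):
    # Forward each line to an external sink (e.g., an ELK stack); printing is a stand-in here.
    print(line.decode("utf-8", errors="replace").rstrip())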

Abaco performance study

Let's do a performance study of the aggregate throughput for the Abaco system across a few different dimensions. We should write a program to perform the measurement on a fixed instance and then run the program a number of times on different instance sizes (compute cluster sizes) to determine how performance scales.

Here are some initial thoughts; we should move to a google doc to more easily collaborate on the design.

  • We should consider CPU-bound workloads and I/O-bound workloads. We should use standard benchmarks where possible; e.g., LINPACK (or a derivative thereof) for CPU-bound workloads, and possibly reading from and writing to a cloud/HTTP storage API for the I/O-bound workloads.
  • We should measure runs with pre-scheduled workers as well as runs that only leverage the autoscaler. The latter will be less performant, but if it is within a small percentage of the former, that will be a compelling result.

privileged mode

Allow users to register actors to run in privileged mode.

set autoremove for workers

When the spawner schedules worker containers to run, they should be started with the autoremove flag set to true unless the "leave_containers" configuration is set in the workers stanza of the config file.

A previous analysis indicated that an upgrade to docker-py 2.x might be required due to bugs in 1.x.
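
With the 2.x Docker SDK for Python, the behavior described above could be expressed roughly as follows; the config file path, worker image, and option name are assumptions for illustration:

import docker
from configparser import ConfigParser

conf = ConfigParser()
conf.read("/etc/abaco.conf")  # example path
# Honor the "leave_containers" option from the workers stanza when present.
leave_containers = conf.getboolean("workers", "leave_containers", fallback=False)

client = docker.from_env()
worker = client.containers.run(
    "abaco/core:dev",              # example worker image
    detach=True,
    auto_remove=not leave_containers,
)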

Validate instance configs at startup

Currently, the config.py module reads an abaco.conf file provided by the Abaco instance operator at startup but does not validate the config values themselves. We should be able to add some validation directly in the config.py module and exit immediately if required configs are missing or invalid.
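
A minimal sketch of what such startup validation in config.py could look like; the required sections and options listed here are purely illustrative:

import sys
from configparser import ConfigParser

# Hypothetical set of required options; the real list would come from the Abaco modules.
REQUIRED = {
    "store": ["host", "port"],
    "rabbit": ["uri"],
    "workers": ["init_count"],
}

conf = ConfigParser()
if not conf.read("/etc/abaco.conf"):
    sys.exit("FATAL: could not read /etc/abaco.conf")

for section, options in REQUIRED.items():
    for option in options:
        if not conf.has_option(section, option):
            sys.exit("FATAL: missing config {}.{}".format(section, option))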

Support an ephemeral scratch disk

Executions run as non-root users, which is a security bonus as it effectively makes the container immutable. However, there is some demand for temporary writable space, which is accomplished in the sd2e and iplantc.org tenants by having a writable directory (/mnt/ephemeral) in their tenant-specific base image. This is not ideal, as files written there persist in the container after it exits and can't be easily cleaned up. I propose something like the following:

  • Provision a hardened directory on the host (permissions, acls, etc)
  • Extend Abaco so that it mounts this directory, extended by a unique subdirectory for each execution, at /mnt/ephemeral inside the container.
  • Add a cron task that empties the contents of this directory on a recurring basis

Example

  1. On the Docker host, provision /scratch/executions
  2. Extend Abaco to implement (see the Python sketch after this list):
    docker run -v /scratch/executions/<executionId>:/mnt/ephemeral:rw tacc/abaco_container
  3. Add a cron job that runs a purge script that looks like:
#!/bin/bash

# delete files more than one day old
find /scratch/executions -type f -mtime +0 -exec rm -f {} \;
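
Step 2 could be expressed with the Docker SDK for Python roughly as below; the execution id is a placeholder and the image is the example used above:

import docker

execution_id = "abc123"  # placeholder execution id
host_dir = "/scratch/executions/{}".format(execution_id)

client = docker.from_env()
container = client.containers.run(
    "tacc/abaco_container",   # actor image from the example above
    detach=True,
    volumes={host_dir: {"bind": "/mnt/ephemeral", "mode": "rw"}},
)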

Implement log redaction

Related to the bespoke logging solution described in #51, the implementers also included log redaction of any environment variables set when the actor was deployed. This helps ensure the safety of sensitive information, especially as we await development of more fully featured secrets management.
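
A naive sketch of the idea: strip any environment-variable values out of captured log text before storing it. The function and variable names are illustrative, not the implementation referenced above:

def redact(log_text, env):
    """Replace any environment-variable values that appear in the log output."""
    for key, value in env.items():
        if value:  # skip empty values so the whole log isn't mangled
            log_text = log_text.replace(value, "*{}-REDACTED*".format(key))
    return log_text

# Example with made-up values:
print(redact("calling API with token s3cr3t", {"API_TOKEN": "s3cr3t"}))
# -> calling API with token *API_TOKEN-REDACTED*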

default environment

Allow users to register a default environment (key:value pairs) for actors. These values should be overridden by values passed in the query parameters of the execution request.
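
The intended precedence can be summarized in a small sketch (names are illustrative): start from the registered defaults and let the query parameters of the execution request override them.

def build_environment(default_env, query_params):
    """Merge the actor's registered defaults with per-execution query parameters."""
    env = dict(default_env)      # registered defaults
    env.update(query_params)     # query parameters win on conflicts
    return env

# Example:
build_environment({"LEVEL": "info", "REGION": "us"}, {"LEVEL": "debug"})
# -> {"LEVEL": "debug", "REGION": "us"}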

Actor invocations should clean up after themselves

Currently, when invoking actors repeatedly, the completed containers are left on the execution host. This builds up quite a stockpile of containers and consumes disk space. Having them cleaned up after completion would help the system stay up longer, achieve better throughput, and handle requests for larger containers more promptly.

core service build broken

The images in the Docker Registry build fine, but they are not produced from an automated build, so I'm not sure where they're coming from or whether they reflect the current HEAD of this repo.

The image builds fine. When running via Docker Toolbox on Yosemite, the admin and message APIs fail due to an issue in the underlying Flask dependencies. Google seems to think it's a Python version compatibility issue. Here is my setup and log info.

Docker daemon

$ docker info
Containers: 14
Images: 486
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 514
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.0.9-boot2docker
Operating System: Boot2Docker 1.8.1 (TCL 6.3); master : 7f12e95 - Thu Aug 13 03:24:56 UTC 2015
CPUs: 1
Total Memory: 1.956 GiB
Name: default
ID: XXN6:UBNG:JWHG:GQGL:IW3Z:CMRK:PU7C:XROS:DEVN:PPXU:J2O4:KL3A
Debug mode (server): true
File Descriptors: 33
Goroutines: 77
System Time: 2015-11-09T01:36:40.657932324Z
EventsListeners: 0
Init SHA1: 
Init Path: /usr/local/bin/docker
Docker Root Dir: /mnt/sda1/var/lib/docker
Username: deardooley
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox

Docker Compose version:

$ docker-compose -v
docker-compose version: 1.4.0

Docker build logs:

$ docker build --rm=true -t jstubbs/abaco_core .
Sending build context to Docker daemon 366.6 kB
Step 0 : FROM alpine:3.2
 ---> f4fddc471ec2
Step 1 : RUN apk add --update musl python3 && rm /var/cache/apk/*
 ---> Running in 95e23743d3c3
fetch http://dl-4.alpinelinux.org/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
(1/6) Installing libbz2 (1.0.6-r3)
(2/6) Installing libffi (3.2.1-r0)
(3/6) Installing ncurses-terminfo-base (5.9-r3)
(4/6) Installing ncurses-widec-libs (5.9-r3)
(5/6) Installing sqlite-libs (3.8.10.2-r0)
(6/6) Installing python3 (3.4.3-r2)
Executing busybox-1.23.2-r0.trigger
OK: 55 MiB in 21 packages
 ---> 0cd0db9ee740
Removing intermediate container 95e23743d3c3
Step 2 : RUN apk add --update bash && rm -f /var/cache/apk/*
 ---> Running in 7e68151bc9e4
fetch http://dl-4.alpinelinux.org/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
(1/3) Installing ncurses-libs (5.9-r3)
(2/3) Installing readline (6.3.008-r0)
(3/3) Installing bash (4.3.33-r0)
Executing busybox-1.23.2-r0.trigger
OK: 56 MiB in 24 packages
 ---> e2a68b5d8e77
Removing intermediate container 7e68151bc9e4
Step 3 : RUN apk add --update git && rm -f /var/cache/apk/*
 ---> Running in c8b2c86bc586
fetch http://dl-4.alpinelinux.org/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
(1/11) Installing run-parts (4.4-r0)
(2/11) Installing openssl (1.0.2d-r0)
(3/11) Installing lua5.2-libs (5.2.4-r0)
(4/11) Installing lua5.2 (5.2.4-r0)
(5/11) Installing lua5.2-posix (33.3.1-r2)
(6/11) Installing ca-certificates (20141019-r2)
(7/11) Installing libssh2 (1.5.0-r0)
(8/11) Installing curl (7.42.1-r0)
(9/11) Installing expat (2.1.0-r1)
(10/11) Installing pcre (8.37-r1)
(11/11) Installing git (2.4.1-r0)
Executing busybox-1.23.2-r0.trigger
Executing ca-certificates-20141019-r2.trigger
OK: 73 MiB in 35 packages
 ---> 588a7086fde7
Removing intermediate container c8b2c86bc586
Step 4 : RUN apk add --update g++ -f /var/cache/apk/*
 ---> Running in a7ef117082c9
fetch http://dl-4.alpinelinux.org/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
(1/16) Installing libgcc (4.9.2-r5)
(2/16) Installing libstdc++ (4.9.2-r5)
(3/16) Installing binutils-libs (2.25-r3)
(4/16) Installing binutils (2.25-r3)
(5/16) Installing libgomp (4.9.2-r5)
(6/16) Installing pkgconf (0.9.11-r0)
(7/16) Installing pkgconfig (0.25-r1)
(8/16) Installing gmp (6.0.0a-r0)
(9/16) Installing mpfr3 (3.1.2-r0)
(10/16) Installing mpc1 (1.0.1-r0)
(11/16) Installing gcc (4.9.2-r5)
(12/16) Installing musl-dbg (1.1.11-r2)
(13/16) Installing libc6-compat (1.1.11-r2)
(14/16) Installing musl-dev (1.1.11-r2)
(15/16) Installing libc-dev (0.7-r0)
(16/16) Installing g++ (4.9.2-r5)
Executing busybox-1.23.2-r0.trigger
OK: 197 MiB in 51 packages
 ---> 83bb9ca81428
Removing intermediate container a7ef117082c9
Step 5 : RUN apk add --update python3-dev -f /var/cache/apk/*
 ---> Running in 8d6070c2c740
fetch http://dl-4.alpinelinux.org/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
(1/3) Installing python3-doc (3.4.3-r2)
(2/3) Installing python3-tests (3.4.3-r2)
(3/3) Installing python3-dev (3.4.3-r2)
Executing busybox-1.23.2-r0.trigger
OK: 255 MiB in 54 packages
 ---> b405e6129a1c
Removing intermediate container 8d6070c2c740
Step 6 : ADD actors/requirements.txt /requirements.txt
 ---> 40a83b96dce9
Removing intermediate container d6f00812c437
Step 7 : RUN pip3 install -r /requirements.txt
 ---> Running in 9598e2892fd9
You are using pip version 6.0.8, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Obtaining channelpy from git+git://github.com/TACC/channelpy.git#egg=channelpy (from -r /requirements.txt (line 1))
  Cloning git://github.com/TACC/channelpy.git to /src/channelpy
Collecting flask==0.10.1 (from -r /requirements.txt (line 2))
  Downloading Flask-0.10.1.tar.gz (544kB)
Collecting Flask-RESTful==0.3.3 (from -r /requirements.txt (line 3))
  Downloading Flask_RESTful-0.3.3-py2.py3-none-any.whl
Collecting redis==2.10.3 (from -r /requirements.txt (line 4))
  Downloading redis-2.10.3.tar.gz (86kB)
Collecting pika==0.9.13 (from -r /requirements.txt (line 5))
  Downloading pika-0.9.13.tar.gz (63kB)
Collecting docker-py (from -r /requirements.txt (line 6))
  Downloading docker-py-1.5.0.tar.gz (59kB)
Collecting pycrypto==2.6.1 (from -r /requirements.txt (line 7))
  Downloading pycrypto-2.6.1.tar.gz (446kB)
Collecting PyJWT==0.2.3 (from -r /requirements.txt (line 8))
  Downloading PyJWT-0.2.3-py2.py3-none-any.whl
Collecting rabbitpy (from channelpy->-r /requirements.txt (line 1))
  Downloading rabbitpy-0.26.2-py2.py3-none-any.whl (47kB)
Collecting pytest (from channelpy->-r /requirements.txt (line 1))
  Downloading pytest-2.8.2-py2.py3-none-any.whl (149kB)
Collecting six (from channelpy->-r /requirements.txt (line 1))
  Downloading six-1.10.0-py2.py3-none-any.whl
Collecting PyYAML (from channelpy->-r /requirements.txt (line 1))
  Downloading PyYAML-3.11.tar.gz (248kB)
Collecting Werkzeug>=0.7 (from flask==0.10.1->-r /requirements.txt (line 2))
  Downloading Werkzeug-0.11-py2.py3-none-any.whl (304kB)
Collecting Jinja2>=2.4 (from flask==0.10.1->-r /requirements.txt (line 2))
  Downloading Jinja2-2.8-py2.py3-none-any.whl (263kB)
Collecting itsdangerous>=0.21 (from flask==0.10.1->-r /requirements.txt (line 2))
  Downloading itsdangerous-0.24.tar.gz (46kB)
Collecting aniso8601>=0.82 (from Flask-RESTful==0.3.3->-r /requirements.txt (line 3))
  Downloading aniso8601-1.1.0.tar.gz (49kB)
Collecting pytz (from Flask-RESTful==0.3.3->-r /requirements.txt (line 3))
  Downloading pytz-2015.7-py2.py3-none-any.whl (476kB)
Collecting requests>=2.5.2 (from docker-py->-r /requirements.txt (line 6))
  Downloading requests-2.8.1-py2.py3-none-any.whl (497kB)
Collecting websocket-client>=0.32.0 (from docker-py->-r /requirements.txt (line 6))
  Downloading websocket_client-0.34.0.tar.gz (193kB)
Collecting pamqp<2.0,>=1.6.1 (from rabbitpy->channelpy->-r /requirements.txt (line 1))
  Downloading pamqp-1.6.1-py2.py3-none-any.whl
Collecting py>=1.4.29 (from pytest->channelpy->-r /requirements.txt (line 1))
  Downloading py-1.4.30-py2.py3-none-any.whl (81kB)
Collecting MarkupSafe (from Jinja2>=2.4->flask==0.10.1->-r /requirements.txt (line 2))
  Downloading MarkupSafe-0.23.tar.gz
Collecting python-dateutil (from aniso8601>=0.82->Flask-RESTful==0.3.3->-r /requirements.txt (line 3))
  Downloading python_dateutil-2.4.2-py2.py3-none-any.whl (188kB)
Installing collected packages: python-dateutil, MarkupSafe, py, pamqp, websocket-client, requests, pytz, aniso8601, itsdangerous, Jinja2, Werkzeug, PyYAML, six, pytest, rabbitpy, PyJWT, pycrypto, docker-py, pika, redis, Flask-RESTful, flask, channelpy

  Running setup.py install for MarkupSafe
    building 'markupsafe._speedups' extension
    gcc -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Os -fomit-frame-pointer -fPIC -I/usr/include/python3.4m -c markupsafe/_speedups.c -o build/temp.linux-x86_64-3.4/markupsafe/_speedups.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/markupsafe/_speedups.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/markupsafe/_speedups.cpython-34m.so


  Running setup.py install for websocket-client
    changing mode of build/scripts-3.4/wsdump.py from 644 to 755
    changing mode of /usr/bin/wsdump.py to 755


  Running setup.py install for aniso8601
  Running setup.py install for itsdangerous


  Running setup.py install for PyYAML
    checking if libyaml is compilable
    gcc -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Os -fomit-frame-pointer -fPIC -I/usr/include/python3.4m -c build/temp.linux-x86_64-3.4/check_libyaml.c -o build/temp.linux-x86_64-3.4/check_libyaml.o
    build/temp.linux-x86_64-3.4/check_libyaml.c:2:18: fatal error: yaml.h: No such file or directory
     #include <yaml.h>
                      ^
    compilation terminated.
    libyaml is not found or a compiler error: forcing --without-libyaml
    (if libyaml is installed correctly, you may need to
     specify the option --include-dirs or uncomment and
     modify the parameter include_dirs in setup.cfg)




  Running setup.py install for pycrypto
    checking for gcc... gcc
    checking whether the C compiler works... yes
    checking for C compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C compiler... yes
    checking whether gcc accepts -g... yes
    checking for gcc option to accept ISO C89... none needed
    checking for __gmpz_init in -lgmp... no
    checking for __gmpz_init in -lmpir... no
    checking whether mpz_powm is declared... no
    checking whether mpz_powm_sec is declared... no
    checking how to run the C preprocessor... gcc -E
    checking for grep that handles long lines and -e... /bin/grep
    checking for egrep... /bin/grep -E
    checking for ANSI C header files... yes
    checking for sys/types.h... yes
    checking for sys/stat.h... yes
    checking for stdlib.h... yes
    checking for string.h... yes
    checking for memory.h... yes
    checking for strings.h... yes
    checking for inttypes.h... yes
    checking for stdint.h... yes
    checking for unistd.h... yes
    checking for inttypes.h... (cached) yes
    checking limits.h usability... yes
    checking limits.h presence... yes
    checking for limits.h... yes
    checking stddef.h usability... yes
    checking stddef.h presence... yes
    checking for stddef.h... yes
    checking for stdint.h... (cached) yes
    checking for stdlib.h... (cached) yes
    checking for string.h... (cached) yes
    checking wchar.h usability... yes
    checking wchar.h presence... yes
    checking for wchar.h... yes
    checking for inline... inline
    checking for int16_t... yes
    checking for int32_t... yes
    checking for int64_t... yes
    checking for int8_t... yes
    checking for size_t... yes
    checking for uint16_t... yes
    checking for uint32_t... yes
    checking for uint64_t... yes
    checking for uint8_t... yes
    checking for stdlib.h... (cached) yes
    checking for GNU libc compatible malloc... yes
    checking for memmove... yes
    checking for memset... yes
    configure: creating ./config.status
    config.status: creating src/config.h
    building 'Crypto.Hash._MD2' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/MD2.c -o build/temp.linux-x86_64-3.4/src/MD2.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/MD2.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Hash/_MD2.cpython-34m.so
    building 'Crypto.Hash._MD4' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/MD4.c -o build/temp.linux-x86_64-3.4/src/MD4.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/MD4.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Hash/_MD4.cpython-34m.so
    building 'Crypto.Hash._SHA256' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/SHA256.c -o build/temp.linux-x86_64-3.4/src/SHA256.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/SHA256.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Hash/_SHA256.cpython-34m.so
    building 'Crypto.Hash._SHA224' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/SHA224.c -o build/temp.linux-x86_64-3.4/src/SHA224.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/SHA224.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Hash/_SHA224.cpython-34m.so
    building 'Crypto.Hash._SHA384' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/SHA384.c -o build/temp.linux-x86_64-3.4/src/SHA384.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/SHA384.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Hash/_SHA384.cpython-34m.so
    building 'Crypto.Hash._SHA512' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/SHA512.c -o build/temp.linux-x86_64-3.4/src/SHA512.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/SHA512.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Hash/_SHA512.cpython-34m.so
    building 'Crypto.Hash._RIPEMD160' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -DPCT_LITTLE_ENDIAN=1 -Isrc/ -I/usr/include/python3.4m -c src/RIPEMD160.c -o build/temp.linux-x86_64-3.4/src/RIPEMD160.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/RIPEMD160.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Hash/_RIPEMD160.cpython-34m.so
    building 'Crypto.Cipher._AES' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/AES.c -o build/temp.linux-x86_64-3.4/src/AES.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/AES.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_AES.cpython-34m.so
    building 'Crypto.Cipher._ARC2' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/ARC2.c -o build/temp.linux-x86_64-3.4/src/ARC2.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/ARC2.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_ARC2.cpython-34m.so
    building 'Crypto.Cipher._Blowfish' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/Blowfish.c -o build/temp.linux-x86_64-3.4/src/Blowfish.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/Blowfish.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_Blowfish.cpython-34m.so
    building 'Crypto.Cipher._CAST' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/CAST.c -o build/temp.linux-x86_64-3.4/src/CAST.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/CAST.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_CAST.cpython-34m.so
    building 'Crypto.Cipher._DES' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -Isrc/libtom/ -I/usr/include/python3.4m -c src/DES.c -o build/temp.linux-x86_64-3.4/src/DES.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/DES.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_DES.cpython-34m.so
    building 'Crypto.Cipher._DES3' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -Isrc/libtom/ -I/usr/include/python3.4m -c src/DES3.c -o build/temp.linux-x86_64-3.4/src/DES3.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/DES3.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_DES3.cpython-34m.so
    building 'Crypto.Cipher._ARC4' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/ARC4.c -o build/temp.linux-x86_64-3.4/src/ARC4.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/ARC4.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_ARC4.cpython-34m.so
    building 'Crypto.Cipher._XOR' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/XOR.c -o build/temp.linux-x86_64-3.4/src/XOR.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/XOR.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Cipher/_XOR.cpython-34m.so
    building 'Crypto.Util.strxor' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/strxor.c -o build/temp.linux-x86_64-3.4/src/strxor.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/strxor.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Util/strxor.cpython-34m.so
    building 'Crypto.Util._counter' extension
    gcc -Wno-unused-result -fwrapv -Wall -Wstrict-prototypes -fomit-frame-pointer -fPIC -std=c99 -O3 -fomit-frame-pointer -Isrc/ -I/usr/include/python3.4m -c src/_counter.c -o build/temp.linux-x86_64-3.4/src/_counter.o
    gcc -shared -Wl,--as-needed build/temp.linux-x86_64-3.4/src/_counter.o -L/usr/lib -lpython3.4m -o build/lib.linux-x86_64-3.4/Crypto/Util/_counter.cpython-34m.so
    warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath.
  Running setup.py install for docker-py
  Running setup.py install for pika
  Running setup.py install for redis

  Running setup.py install for flask
  Running setup.py develop for channelpy
    Creating /usr/lib/python3.4/site-packages/channelpy.egg-link (link to .)
    Adding channelpy 0.2 to easy-install.pth file
    Installed /src/channelpy
Successfully installed Flask-RESTful-0.3.3 Jinja2-2.8 MarkupSafe-0.23 PyJWT-0.2.3 PyYAML-3.11 Werkzeug-0.11 aniso8601-1.1.0 channelpy docker-py-1.5.0 flask-0.10.1 itsdangerous-0.24 pamqp-1.6.1 pika-0.9.13 py-1.4.30 pycrypto-2.6.1 pytest-2.8.2 python-dateutil-2.4.2 pytz-2015.7 rabbitpy-0.26.2 redis-2.10.3 requests-2.8.1 six-1.10.0 websocket-client-0.34.0
 ---> 4445768e3634
Removing intermediate container 9598e2892fd9
Step 8 : ADD abaco.conf /etc/abaco.conf
 ---> 09730e4640f9
Removing intermediate container 3ef3f4d9b638
Step 9 : ADD actors /actors
 ---> 8529a438e53a
Removing intermediate container 243277df6000
Step 10 : ADD entry.sh /entry.sh
 ---> f97dca57f57d
Removing intermediate container 3698d9190965
Step 11 : RUN chmod +x /entry.sh
 ---> Running in 79d7d10f9416
 ---> 7b6d74decc80
Removing intermediate container 79d7d10f9416
Step 12 : EXPOSE 5000
 ---> Running in 23084bc683b2
 ---> 8d81dd16f223
Removing intermediate container 23084bc683b2
Step 13 : CMD ./entry.sh
 ---> Running in e1238ba90d30
 ---> 098eb584d679
Removing intermediate container e1238ba90d30
Successfully built 098eb584d679

Docker Compose instantiation logs:

$ docker-compose -f docker-compose-local.yml up
Creating abaco_nginx_1...
Creating abaco_rabbit_1...
Creating abaco_redis_1...
Creating abaco_spawner_1...
Creating abaco_admin_1...
Creating abaco_mes_1...
Creating abaco_reg_1...
Attaching to abaco_nginx_1, abaco_rabbit_1, abaco_redis_1, abaco_spawner_1, abaco_admin_1, abaco_mes_1, abaco_reg_1
redis_1   | 1:C 09 Nov 01:23:37.621 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
redis_1   |                 _._                                                  
redis_1   |            _.-``__ ''-._                                             
redis_1   |       _.-``    `.  `_.  ''-._           Redis 3.0.5 (00000000/0) 64 bit
redis_1   |   .-`` .-```.  ```\/    _.,_ ''-._                                   
redis_1   |  (    '      ,       .-`  | `,    )     Running in standalone mode
redis_1   |  |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
redis_1   |  |    `-._   `._    /     _.-'    |     PID: 1
redis_1   |   `-._    `-._  `-./  _.-'    _.-'                                   
redis_1   |  |`-._`-._    `-.__.-'    _.-'_.-'|                                  
redis_1   |  |    `-._`-._        _.-'_.-'    |           http://redis.io        
redis_1   |   `-._    `-._`-.__.-'_.-'    _.-'                                   
redis_1   |  |`-._`-._    `-.__.-'    _.-'_.-'|                                  
redis_1   |  |    `-._`-._        _.-'_.-'    |                                  
redis_1   |   `-._    `-._`-.__.-'_.-'    _.-'                                   
redis_1   |       `-._    `-.__.-'    _.-'                                       
redis_1   |           `-._        _.-'                                           
redis_1   |               `-.__.-'                                               
redis_1   | 
redis_1   | 1:M 09 Nov 01:23:37.624 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
redis_1   | 1:M 09 Nov 01:23:37.624 # Server started, Redis version 3.0.5
redis_1   | 1:M 09 Nov 01:23:37.624 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis_1   | 1:M 09 Nov 01:23:37.624 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
redis_1   | 1:M 09 Nov 01:23:37.624 * The server is now ready to accept connections on port 6379
rabbit_1  | 
rabbit_1  |               RabbitMQ 3.5.3. Copyright (C) 2007-2014 GoPivotal, Inc.
rabbit_1  |   ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
rabbit_1  |   ##  ##
rabbit_1  |   ##########  Logs: tty
rabbit_1  |   ######  ##        tty
rabbit_1  |   ##########
rabbit_1  |               Starting broker...
rabbit_1  | =INFO REPORT==== 9-Nov-2015::01:23:39 ===
rabbit_1  | Starting RabbitMQ 3.5.3 on Erlang 17.5.3
rabbit_1  | Copyright (C) 2007-2014 GoPivotal, Inc.
rabbit_1  | Licensed under the MPL.  See http://www.rabbitmq.com/
rabbit_1  | 
rabbit_1  | =INFO REPORT==== 9-Nov-2015::01:23:39 ===
rabbit_1  | node           : abaco-rabbit@3d15b54d7b6b
rabbit_1  | home dir       : /var/lib/rabbitmq
rabbit_1  | config file(s) : /etc/rabbitmq/rabbitmq.config
rabbit_1  | cookie hash    : Di4rqcuayMGOy+h/gO2vnQ==
rabbit_1  | log            : tty
rabbit_1  | sasl log       : tty
rabbit_1  | database dir   : /var/lib/rabbitmq/mnesia/abaco-rabbit
admin_1   |  * Restarting with stat
mes_1     |  * Restarting with stat
reg_1     |  * Restarting with stat
mes_1     |  * Debugger is active!
mes_1     | Traceback (most recent call last):
mes_1     |   File "/actors/message_api.py", line 21, in <module>
mes_1     |     app.run(host='0.0.0.0', debug=True)
mes_1     |   File "/usr/lib/python3.4/site-packages/flask/app.py", line 772, in run
mes_1     |     run_simple(host, port, self, **options)
mes_1     |   File "/usr/lib/python3.4/site-packages/werkzeug/serving.py", line 633, in run_simple
mes_1     |     application = DebuggedApplication(application, use_evalex)
mes_1     |   File "/usr/lib/python3.4/site-packages/werkzeug/debug/__init__.py", line 169, in __init__
mes_1     |     if self.pin is None:
mes_1     |   File "/usr/lib/python3.4/site-packages/werkzeug/debug/__init__.py", line 179, in _get_pin
mes_1     |     self._pin, self._pin_cookie = get_pin_and_cookie_name(self.app)
mes_1     |   File "/usr/lib/python3.4/site-packages/werkzeug/debug/__init__.py", line 96, in get_pin_and_cookie_name
mes_1     |     h.update('cookiesalt')
mes_1     | TypeError: Unicode-objects must be encoded before hashing
admin_1   |  * Debugger is active!
admin_1   | Traceback (most recent call last):
admin_1   |   File "/actors/admin_api.py", line 22, in <module>
admin_1   |     app.run(host='0.0.0.0', debug=True)
admin_1   |   File "/usr/lib/python3.4/site-packages/flask/app.py", line 772, in run
admin_1   |     run_simple(host, port, self, **options)
admin_1   |   File "/usr/lib/python3.4/site-packages/werkzeug/serving.py", line 633, in run_simple
admin_1   |     application = DebuggedApplication(application, use_evalex)
admin_1   |   File "/usr/lib/python3.4/site-packages/werkzeug/debug/__init__.py", line 169, in __init__
admin_1   |     if self.pin is None:
admin_1   |   File "/usr/lib/python3.4/site-packages/werkzeug/debug/__init__.py", line 179, in _get_pin
admin_1   |     self._pin, self._pin_cookie = get_pin_and_cookie_name(self.app)
admin_1   |   File "/usr/lib/python3.4/site-packages/werkzeug/debug/__init__.py", line 96, in get_pin_and_cookie_name
admin_1   |     h.update('cookiesalt')
admin_1   | TypeError: Unicode-objects must be encoded before hashing
abaco_mes_1 exited with code 1
Gracefully stopping... (press Ctrl+C again to force)
Stopping abaco_reg_1... done
Stopping abaco_admin_1... done
Stopping abaco_spawner_1... done
Stopping abaco_redis_1... done
Stopping abaco_rabbit_1... done
Stopping abaco_nginx_1... done

Implement callback support

Though Abaco allows us to build event-driven systems, we can't learn information about its state without polling. This puts load on the system and introduces latency to downstream consumers in the form of minimum polling intervals (especially if they implement exponential backoff and the desired event occurs right after the last poll).

It would be advantageous if Abaco could post to callbacks when events such as the following occur.

  • actor created
  • actor updated
  • actor scaled
  • actor shared
  • actor deleted
  • execution start
  • execution end
  • execution failed
  • nonce created
  • nonce deleted

At a minimum, HTTPS POST with a non-customizable payload should be supported. Support for one or more authn/authz HTTP headers would be useful. URL parameters should be allowed in callback URLs, and any user-specified ordering should be respected.
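
A rough sketch of the delivery side, assuming callbacks are plain HTTPS POSTs of a small JSON body; the field names and URL are illustrative, not a settled payload format:

import requests

def notify(callback_url, event_type, payload):
    """POST a fixed-format JSON body to a registered callback URL."""
    body = {"event": event_type, "data": payload}
    try:
        requests.post(callback_url, json=body, timeout=5)
    except requests.RequestException:
        # A failing callback should not take down the core service; log and move on.
        pass

notify("https://example.org/hooks/abaco", "execution.end", {"executionId": "abc123"})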

updating an actor causes the mounts to disappear

register the actor:

$ curl -X POST -sk -H "Authorization: Bearer XXXXXXXXXXXXXXXXXXXXX" -H "Content-Type: application/json" --data '{"image":"jturcino/update-demo:latest", "name":"update-demo"}' 'https://api.sd2e.org/actors/v2'

update the actor:

curl -X PUT -sk -H "Authorization: Bearer XXXXXXXXXXXXXXXXXXXXX" -H "Content-Type: application/json" --data '{"image":"jturcino/update-demo:update"}' 'https://api.sd2e.org/actors/v2/ly8R1RKx4E7Kj'

list the actor - mounts are missing:

curl -sk -H "Authorization: Bearer XXXXXXXXXXXXXXXXXXXXX" 'https://api.sd2e.org/actors/v2/ly8R1RKx4E7Kj'

Execution ordering preserved?

When I query for executions, I get back an unordered list of messages that were processed, with an unordered list of ids assigned by abaco. Is this the actual order they executed in? Is there a way to guarantee messages are processed in the order they are received?

Extend management of actor mailboxes

At present, we can either POST a new message to an actor's mailbox or do a GET to discover how many messages are in that mailbox. Additional management actions are desirable:

  1. DELETE would immediately clear out the mailbox. This would be helpful in case of an unexpected backlog that we don't actually want to process.
  2. Adding ?count=N to GET would retrieve the contents of the N most recent messages. This would be useful for debugging message sending behavior from other agents.

Enable state to be used by scalable actors

The actor state variable is super useful, but it is limited to single-worker actors. Some developers have routed around this limitation by imitating the state variable with the Agave metadata service, namespacing the state by executionId. That's a viable approach, but it doesn't let the actor develop any sense of collective state across its workers. What could be very cool is to replace the simple state variable, which is just a Python dict, with a CRDT dictionary type.
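
As a sketch of the flavor of data type being proposed (not a specific library), a per-key last-writer-wins register is one of the simplest CRDTs and would let two workers merge their copies of state deterministically:

import time

def lww_merge(state_a, state_b):
    """Merge two copies of actor state of the form {key: (timestamp, value)};
    for each key, the write with the newer timestamp wins."""
    merged = dict(state_a)
    for key, (ts, value) in state_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two workers updating state concurrently:
a = {"count": (time.time(), 3)}
b = {"count": (time.time() + 1, 5)}
lww_merge(a, b)  # -> {"count": (<later timestamp>, 5)}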

Add proper optimistic locking to store

The store class should implement the Redis optimistic locking feature to provide proper transactional support for modifying collections. The existing transaction method requires the calling code to re-implement parts of the python-redis bindings. See safe_delete() in models.py for an example.
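
For reference, the standard redis-py pattern for optimistic locking (WATCH/MULTI/EXEC) looks roughly like this; the key name and update function are placeholders, and this is not the store class's actual interface:

import redis

r = redis.StrictRedis(host="localhost", port=6379)

def safe_update(key, update_fn, max_retries=10):
    """Apply update_fn to the value at `key`, retrying if another client
    modifies the key between the read and the write."""
    with r.pipeline() as pipe:
        for _ in range(max_retries):
            try:
                pipe.watch(key)              # abort the transaction if key changes
                current = pipe.get(key)      # immediate-mode read while watching
                new_value = update_fn(current)
                pipe.multi()                 # start buffering the transaction
                pipe.set(key, new_value)
                pipe.execute()
                return True
            except redis.WatchError:
                continue                     # key changed under us; retry
    return False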

Timestamps needed on all resources

We need timestamps (ISO 8601 preferred) for the following; a formatting sketch follows the list:

  • actor created
  • actor updated
  • actor scaled
  • message was received
  • execution started
  • execution completed
  • log created
  • log last updated
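
A one-line sketch of producing such a timestamp in Python (the variable name is illustrative):

from datetime import datetime, timezone

# ISO 8601 in UTC, e.g. "2018-04-19T16:32:00.506656+00:00"
created_at = datetime.now(timezone.utc).isoformat()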

Implement a pause function for actors

We cannot easily stop an actor (temporarily) from accepting new work. This can be inconvenient when trying to test parallel sets of linked actors. It would be handy to be able to pause/resume an actor without deleting it.

get_context is missing state and actor_id

The get_context() function in samples/base/actors.py isn't returning values for actor_id and state on the tacc.prod deployment. The actor_id and state keys are empty, forcing me to use the .get accessor (and still getting an empty result for those values).

Here's my actor code

def main():
    context = get_context()
    ag = get_client()
    print 'raw_message', context.get('raw_message')
    print 'content_type', context.get('content_type')
    print 'execution_id', context.get('execution_id')
    print 'username', context.get('username')
    print 'actor_id', context.get('actor_id')
    print 'state', context.get('state')
    print 'client', ag
    print 'ENV'
    print json.dumps(dict(os.environ), indent=2)

and here's the content of os.environ inside the actor when it runs

{
  "_abaco_actor_state": "{}", 
  "_abaco_access_token": "", 
  "_abaco_Content-Type": "application/json", 
  "_abaco_actor_dbid": "TACC-PROD_c9fa1536-98ba-11e7-8b45-0242ac110005-059", 
  "HOSTNAME": "d1b445de0ee5", 
  "_abaco_execution_id": "d19cdf2e-98ba-11e7-9d8d-0242ac110006-053", 
  "_abaco_username": "vaughn", 
  "_abaco_actor_id": "c9fa1536-98ba-11e7-8b45-0242ac110005-059", 
  "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", 
  "MSG": "{'key2': 'val2', 'key1': 'val1'}", 
  "HOME": "/root", 
  "_abaco_api_server": "https://api.tacc.utexas.edu", 
  "_abaco_jwt_header_name": "X-Jwt-Assertion-Tacc-Prod"
}

upgrade to dockerpy 2.x

The current Abaco code base leverages docker-py==1.10.6 (it is actually not pinned in the requirements file, but that is the last version of the Docker Python bindings with that package name). It would be best to update to the latest 2.x version (and pin the version). I also think some of the other issues might be simplified by this change.
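
For illustration, the 2.x SDK collapses the old create_container/start pair into a single high-level call; the image name below is an example:

import docker

# docker >= 2.x exposes a high-level client; docker.from_env() reads the usual
# DOCKER_HOST/TLS environment settings.
client = docker.from_env()
container = client.containers.run(
    "abacosamples/test",  # example image
    detach=True,
)
print(container.id)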

Automate release to Zenodo

Our tagged releases aren't being propagated to Zenodo, possibly because they are not GitHub releases.

autoscale

Allow users to register actors with an autoscale flag (TRUE/FALSE). With autoscale=true, abaco will run scheduled health checks to ensure that the number of workers is adequate for the given number of messages in the actor's message queue.

Add webhook support

It would be helpful to have webhook support for the following events. Writing to beanstalkd, HTTP, and email would be the preferred initial targets.

  • actor created
  • actor removed
  • actor scaled
  • actor deleted
  • actor pull start
  • actor pull stop
  • execution created (message received)
  • execution setup
  • execution start
  • execution retry
  • execution completed
  • execution failed

Actor status message not reset after actor is corrected

Register an actor:

$ curl -H "Authorization: Bearer $tok" -H "Content-Type: application/json" --data '{"image":"abacosamples/test"}' 'https://api.sd2e.org/actors/v2'

update it with an "invalid" image:

$ curl -X PUT -H "Authorization: Bearer $tok" -H "Content-Type: application/json" --data '{"image":"abacosamples/fooy"}' 'https://api.sd2e.org/actors/v2/$aid'

actor now has status ERROR with a status message.

$ curl -H "Authorization: Bearer $tok" 'https://api.sd2e.org/actors/v2/$aid'

Now, correct the actor:

$ curl -X PUT -H "Authorization: Bearer $tok" -H "Content-Type: application/json" --data '{"image":"abacosamples/test"}' 'https://api.sd2e.org/actors/v2/$aid'

actor still has status ERROR with a status message:

$ curl -H "Authorization: Bearer $tok" 'https://api.sd2e.org/actors/v2/$aid'

Implement automated garbage collection for actor images

Currently, images are not removed at any time by Abaco processes, including on actor delete. One challenge is that multiple actors can be registered with the same image. Another issue is that images can be cached on any number of compute nodes, depending on which compute nodes the actor associated with the image has had workers running on. Depending on the implementation we choose, there can also be race conditions if an actor (the last actor) referencing an image is deleted and then quickly re-registered with the same image. (This is a more common user pattern than one might expect.)

Nevertheless, the lack of image management is becoming an issue in production; with increased usage, disk space on the compute nodes is filling up.

Add "comment" or "description" field to nonces

Once several nonces for a given actor are created and deployed out in the wild (e.g., as part of webhooks or handed out to trusted delegates) to enable various use cases, it becomes challenging to manage them, since there's no obvious way of recalling how each nonce is used. One possible (and simple) solution is to add an optional "comment"-like field that can be populated when the nonce is requested. The contents of said field would then be returned as part of the nonce's record.

Example

{
  "actorId": "6rRKoBDgbzrjk",
  "apiServer": "https://api.tacc.cloud",
  "createTime": "2018-12-07 23:36:21.549079",
  "currentUses": 5,
  "description": "Github integration",
  "id": "TACC_rVv5P1RWWNPkY",
  "lastUseTime": "None",
  "level": "EXECUTE",
  "maxUses": -1,
  "owner": "tacobot",
  "remainingUses": -1,
  "roles": [
    "Internal/TACC_tacobot_jenkins_PRODUCTION",
    "Internal/TACC_tacobot_statusio_jul2018_PRODUCTION",
    "Internal/everyone"
  ]
}

Make nonce-bearing Abaco URIs cleaner via URL rewriting

Here's an example callback used in some ongoing work that leverages Abaco, where actor EEDKw7NAr4E0x can accept this message and use it to set a value for a variable shorthash in database record 1073f4ff-c2b9-5190-bd9a-e6a406d9796a.

https://api.sd2e.org/actors/v2/EEDKw7NAr4E0x/messages?x-nonce=TACC_kOMDBMNGo1r3m&shorthash=3f643e7b2722f16e&uuid=1073f4ff-c2b9-5190-bd9a-e6a406d9796a

Compared to the callback URLs generated by several popular platforms, this feels a bit clunky, and I think that is because the extended string of URL parameters carries both the nonce and the payload parameters.

It feels more intuitive for the base URL to include the nonce inline:

https://api.sd2e.org/actors/v2/EEDKw7NAr4E0x/messages/x-nonce/TACC_kOMDBMNGo1r3m

This makes it very clear which part of the URL is user payload versus the portion authorizing access to the actor.

https://api.sd2e.org/actors/v2/EEDKw7NAr4E0x/messages/x-nonce/TACC_kOMDBMNGo1r3m?token=3f643e7b2722f16e&uuid=1073f4ff-c2b9-5190-bd9a-e6a406d9796a

documentation

  1. swagger 2.0 definition files (already started in the docs directory)
  2. general documentation on the internals.

Make user_role configurable

Currently, the required user role for "basic" level access is hard-coded in the codes.py file:

USER_ROLE = 'Internal/abaco-user'

Change this to be a configuration option.
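
A minimal sketch of what that could look like with the existing ConfigParser-based config module; the section and option names here are assumptions:

from configparser import ConfigParser

conf = ConfigParser()
conf.read("/etc/abaco.conf")  # example path

# Fall back to the current hard-coded value when no config is provided.
USER_ROLE = conf.get("web", "user_role", fallback="Internal/abaco-user")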

Include container image short hash in actor record

I propose including the short hash of the deployed container image in the /actors response to help validate that the correct image is deployed. In theory, the org/image:tag|hash should be good enough, but a common usage pattern is for containers with semantic versioning schemes to be rebuilt several times under the same tag while iterating on functionality.

Current

{   "id": "w7LMK0k7JGZZQ",
    "image": "sd2e/agave-test:dev",
    "lastUpdateTime": "2018-04-19 16:32:00.506656" }

Proposed

{    "id": "w7LMK0k7JGZZQ",
    "image": "sd2e/agave-test:dev",
    "image_id": "8e481d5e3679",
    "lastUpdateTime": "2018-04-19 16:32:00.506656"  }

where image_id is the value from docker images sd2e/agave-test:dev:

docker images sd2e/agave-test:dev
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
sd2e/agave-test     dev                 8e481d5e3679        11 minutes ago      629MB
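
The same value could be obtained programmatically by the workers via the Docker SDK for Python; this is a sketch, not the proposed implementation:

import docker

client = docker.from_env()
image = client.images.get("sd2e/agave-test:dev")
# image.id is the full "sha256:..." digest; take a 12-character prefix to match `docker images`.
image_id = image.id.split(":")[-1][:12]
print(image_id)  # -> 8e481d5e3679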

health checks

Run scheduled health checks in a separate process for all worker containers.

The check should confirm the status of the worker and remove the worker from the database if it has exited or failed.

Relatedly, the health check should ensure the number of worker containers for a given actor is appropriate for the number of messages in the actor's message queue, and scale appropriately if the actor has been registered with autoscale=true (see the autoscale issue).

Allow config through environment variables

Currently, the config module uses the ConfigParser class to parse a .ini file. Let's expand this to first look for an environment variable named section_option (e.g., workers_auto_remove). If the environment variable is present, it will be used; otherwise, the config object will fall back to its current behavior: looking for a file called service.conf and parsing it for the variable.

Providing a path for moving away from config files and towards environment variables will make it easier to deploy onto orchestration systems like Swarm and Kubernetes.
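
A sketch of the proposed lookup order (environment variable first, then the config file); the helper name is illustrative and the file name matches the one mentioned above:

import os
from configparser import ConfigParser

def get_config(section, option, fallback=None):
    """Look for an environment variable named section_option (e.g. workers_auto_remove)
    before falling back to the service.conf file."""
    env_name = "{}_{}".format(section, option)
    if env_name in os.environ:
        return os.environ[env_name]
    conf = ConfigParser()
    conf.read("service.conf")
    return conf.get(section, option, fallback=fallback)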

Add actor alias feature

Add support for a new collection, aliases, that maps user-provided identifiers (an "alias") to an existing actor_id. Once an alias is created, all endpoints involving a specific actor can be accessed with either the alias or the actor_id. For example, if a user creates an actor with id rljRykvYRawLO and then creates an alias jane to that actor_id, the following routes and all sub-routes would be equivalent:

/actors/rljRykvYRawLO
/actors/jane

The alias must be unique across the tenant. The service will create an internal alias_id from the alias and the tenant, similarly to how an actor's db_id is created. It follows that the alias_id will be globally unique. We'll use the technique suggested and implemented by @mwvaughn to distinguish actor_ids from aliases by attempting to decode an identifier with the Abaco salt. Aliases will not be allowed to be Abaco hashids (i.e., hashids produced with the Abaco salt).

Aliases will have separate permissions for managing the alias definition itself. Access to an actual actor endpoint will still be governed by the actor's permissions: alias permissions are only used to govern who can manage the alias definition.

New endpoints -
/actors/aliases:

  • GET - list all aliases
  • POST - create a new alias. Required fields are actor_id and alias

/actors/aliases/{alias}:

  • GET - retrieve details about an alias.
  • DELETE - delete an alias

/actors/aliases/{alias}/permissions:

  • GET - list permissions associated with the alias.
  • POST - add a permission to an alias. Required fields (user, level) the same as actor permissions.

Expose hostname in actor.statusMessage when there is an error

When an error is encountered creating an actor, statusMessage is populated with informative detail as to what might have caused the issue. Here's an example:

"statusMessage": "Unable to start worker; error: Got exception trying to run container from image: abaco/core:0.11.0. Exception: 500 Server Error: Internal Server Error (\"devmapper: Thin Pool has 30979 free data blocks which is less than minimum required 31128 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior\")"

We don't easily know which Abaco host this occurred on, which makes investigating just a bit harder than it needs to be. I suggest we consider including the hostname or another host identifier in statusMessage to assist in the process.
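
A trivial sketch of the idea (the function name and message format are illustrative):

import socket

def error_status_message(detail):
    """Prefix worker error messages with the compute host they occurred on."""
    return "[host: {}] {}".format(socket.gethostname(), detail)

error_status_message("Unable to start worker; error: ...")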

mongo storage driver

Redis is fast, but its memory requirement will be a deal breaker for certain use cases, and sharding still has to be implemented in the application layer. We should implement Mongo 3 as an optional storage backend.
