This repository outlines the formal API design for TaskCluster.
- AMQP is only for message exchange, not for keep-alive, state, or data; we should only use it for events that are relevant right now.
- Task state should be eventually consistent on S3; data that doesn't change should be uploaded directly to S3.
- The database should hold state while a task is executing; once a task is resolved (`completed` or `failed`) it can be removed from the database.
- Keep it simple.
- Be dynamic initially, while allowing for less dynamic behavior as we evolve the system.
- Task, a unit of work executed by the task cluster.
- Artifact, a result generated by a worker.
- Queue, the place that holds the state of pending and running tasks and ensures they are eventually scheduled (or failed if they time out).
- Resolution, a task is resolved exactly once; the two resolved states are `completed` and `failed`; if a task gets canceled, it fails.
Whenever we talk about an id, it is (with the exception of `run-id`) a string of at most 36 alphanumeric characters (plus `_` and `-`).
Name | Description |
---|---|
`task-id` | Identifies a unique task |
`run-id` | Identifies a run of a task (this is an integer, max 999) |
`worker-group` | Identifies a group of workers |
`worker-id` | Identifies a specific worker within a group |
`provisioner-id` | Identifies a provisioner |
`worker-type` | Identifies a worker type for a given provisioner |
Note that `worker-id` and `worker-type` are not globally unique; they are merely identifiers within a given `worker-group` and `provisioner-id`, respectively.
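
As a rough sketch, validating identifiers against these rules might look as follows (the helper names are hypothetical; the character set, the 36-character limit, and the 999 cap on `run-id` come straight from the rules above):

```ts
// Sketch: validating identifiers against the rules described above.

// At most 36 characters drawn from alpha-numerics plus `_` and `-`.
const IDENTIFIER_PATTERN = /^[A-Za-z0-9_-]{1,36}$/;

function isValidIdentifier(id: string): boolean {
  return IDENTIFIER_PATTERN.test(id);
}

// `run-id` is the exception: an integer starting from 1, at most 999.
function isValidRunId(runId: number): boolean {
  return Number.isInteger(runId) && runId >= 1 && runId <= 999;
}

console.log(isValidIdentifier("aws-provisioner")); // true
console.log(isValidIdentifier("has space"));       // false
console.log(isValidRunId(1));                      // true
console.log(isValidRunId(1000));                   // false
```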
The only identifiers assigned by the queue are `run-id` and `task-id`. The `run-id` is assigned by the queue because we want run-ids to form a numerically increasing sequence per task. The `task-id` is assigned by the queue to ensure it is a random UUID, as a security measure: non-random `task-id`s would allow new tasks to overwrite older tasks, which is a security issue.
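
Purely as an illustration, a random `task-id` could be produced from a v4 UUID along these lines; the exact encoding the queue uses is not specified here, so treat this as an assumption:

```ts
import { randomUUID } from "node:crypto";

// Sketch: generate a random task-id as a v4 UUID (36 characters,
// alpha-numerics plus `-`), so task-ids cannot be guessed and used
// to overwrite older tasks.
function generateTaskId(): string {
  return randomUUID(); // e.g. "110ec58a-a0f2-4ac4-8393-c866d813b8d1"
}
```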
The other identifiers are dynamically allocated when you call into the queue. For example, if you want to add a new machine type and provisioner, you just give them unique names and start submitting tasks for them.
By convention, non-UUID identifiers should either be prefixed with the IRC nickname of the person who invented them, or registered in the queue documentation, to ensure that they are unique. For details see the Future Security Design section below.
Worker identification: the alert reader will notice that a worker is identified by two ids, `worker-group` and `worker-id`. A group of workers could, for instance, identify a master node that manages a cluster of specialized hardware. The `worker-group` could also identify a multi-core EC2 instance, under which each `worker-id` identifies a process. The `worker-group` identifier is often useful for routing, whereas the `worker-id` (in combination with `worker-group`) identifies the process, specialized hardware node, or folder within which the task ran.
The task status structure contains all data stored by the queue about a task. The purpose of this structure is to track the state of a task until it is resolved.
```js
{
  "task_id":        // Unique task identifier
  "provisioner_id": // Provisioner identifier
  "worker_type":    // Type of worker to be provisioned by the provisioner
  "runs": [
    {
      "run_id":       // run-id, an integer starting from 1
      "worker_group": // Worker group identifier
      "worker_id":    // Worker identifier
    }
  ],
  "state":       // pending|running|completed|failed
  "reason":      // String such as none, retries-failed, timeout, canceled
  "routing":     // Task specific routing keys
  "retries":     // Number of retries left
  "priority":    // Double, relative priority
  "created":     // Creation time (ISO 8601)
  "deadline":    // Deadline for resolution; after this the task is either failed or completed
  "taken_until": // Time until the task reverts from running to pending
}
```
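
For readers who prefer types, here is a minimal TypeScript sketch of the same structure together with the `taken_until` rule; the field names mirror the JSON above, but the type itself is illustrative rather than an official schema:

```ts
// Sketch of the task status structure; names mirror the JSON above.
interface Run {
  run_id: number;        // integer starting from 1
  worker_group: string;
  worker_id: string;
}

interface TaskStatus {
  task_id: string;
  provisioner_id: string;
  worker_type: string;
  runs: Run[];
  state: "pending" | "running" | "completed" | "failed";
  reason: string;        // e.g. "none", "retries-failed", "timeout", "canceled"
  routing: string;       // task specific routing keys
  retries: number;       // number of retries left
  priority: number;      // double, relative priority
  created: string;       // ISO 8601
  deadline: string;      // ISO 8601
  taken_until: string;   // ISO 8601
}

// A running task whose `taken_until` has passed should revert to pending.
function shouldRevertToPending(task: TaskStatus, now: Date): boolean {
  return task.state === "running" && now > new Date(task.taken_until);
}
```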
The actual task definition, results, and logs should be stored in S3; the queue will sign URLs for the worker so it can upload these files without AWS credentials.
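
A sketch of how such a signed upload URL could be produced with the AWS SDK for JavaScript v3; the bucket name and key layout here are made-up examples, not the queue's actual layout:

```ts
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

// Sketch: the queue signs a PUT URL so the worker can upload an
// artifact directly to S3 without holding AWS credentials.
const s3 = new S3Client({ region: "us-west-2" });

async function signArtifactUploadUrl(taskId: string, runId: number, name: string) {
  const command = new PutObjectCommand({
    Bucket: "taskcluster-artifacts",        // hypothetical bucket
    Key: `${taskId}/runs/${runId}/${name}`, // hypothetical key layout
  });
  // URL expires after one hour; the worker PUTs the file to this URL.
  return getSignedUrl(s3, command, { expiresIn: 3600 });
}
```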
As the system evolves we may want to shift away from ensuring identifier uniqueness by convention. Specifically, we will probably want provisioners to register with the queue and provide a JSON schema for the task payloads they accept, as well as define the set of OAuth scopes required to post tasks for workers provisioned by that provisioner.
Essentially, we'll need to lock down the system so that there are different scopes for posting and consuming tasks with a given `provisioner-id`. Additionally, registering JSON schemas for each `worker-type` would allow us to reject invalid tasks much sooner.
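
For illustration, a provisioner might register a payload schema for one of its worker types roughly like this (the fields shown are invented for the example; nothing in this document defines the actual schema shape):

```ts
// Sketch: a hypothetical JSON schema registered for a worker-type,
// letting the queue reject tasks with invalid payloads up front.
const workerTypePayloadSchema = {
  $schema: "http://json-schema.org/draft-04/schema#",
  type: "object",
  properties: {
    command:    { type: "array", items: { type: "string" } }, // command to run
    env:        { type: "object" },                           // environment variables
    maxRunTime: { type: "integer", minimum: 1 },              // seconds
  },
  required: ["command"],
};
```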
This document leaves these security considerations as future work, as initially we'll just want something fairly dynamic.