Open source tools for building custom schedulers on Amazon ECS

License: Apache License 2.0

Java 97.29% Groovy 1.30% Gherkin 1.41%

amazon-ecs aws container-management blox

blox's Introduction

Blox: Open Source schedulers for Amazon ECS

Blox provides open source schedulers optimized for running applications on Amazon ECS. Developers now have greater control over how their applications are deployed across clusters of resources, run and scale in production, and can take advantage of powerful placement capabilities of Amazon ECS. Blox is being delivered as a managed service via the Amazon ECS Console, API and CLIs. Blox v1.0 provides daemon scheduling for Amazon ECS. We will continue to add additional schedulers as part of this project. Blox schedulers are built using AWS primitives, and the Blox designs and code are open source. If you are interested in learning more or collaborating on the designs, please read the design. If you are currently using Blox v0.3, please read the FAQ.

Project structure

For an overview of the components of Blox, run:

```
./gradlew projects
```

Testing

To run the full unit test suite, run:

```
./gradlew check
```

This will run the same tests that we run in the Travis CI build.

Deploying

First, take a look at what Blox will put in your personal stack by running the showStackConfig task:

```
$ ./gradlew showStackConfig

> Task :showStackConfig
Blox deployment stack configuration:

  Default resource name         (blox.name): blox-<username>-alpha-us-west-2 (default)
  API Gateway stage            (blox.stage): alpha (default)
  Stack prefix                (blox.prefix): <username>-alpha (default)
  AWS Region                  (blox.region): us-west-2 (default)
  AWS Credential Profile     (blox.profile): blox-<username>-alpha-us-west-2 (default)
  Cloudformation stack name (blox.cfnStack): blox-<username>-alpha-us-west-2 (default)
  Deployment S3 bucket name (blox.s3Bucket): blox-<username>-alpha-us-west-2 (default)

To customize these values, modify ~/.gradle/gradle.properties to override the property listed.

AWS CLI configuration for profile blox-<username>-alpha-us-west-2:

The config profile (blox-<username>-alpha-us-west-2) could not be found
```

If you wish to customize any of these values, you can do so by overriding the property in parentheses using any of the supported ways to override Gradle properties. The easiest way is to override it for your user in ~/.gradle/gradle.properties:

blox.profile=default
blox.region=us-east-1

Next, in order to deploy your personal stack:

1. Install the official AWS CLI.

2. Create an IAM user with the following permissions:

   ```
   {
       "Version":"2012-10-17",
       "Statement":[{
           "Effect":"Allow",
           "Action":[
               "s3:*",
               "lambda:*",
               "apigateway:*",
               "cloudformation:*",
               "iam:*",
               "execute-api:*",
               "events:DescribeRule"
           ],
           "Resource":"*"
       }]
   }
   ```

   These permissions are pretty broad, so we recommend you use a separate, test account.

3. Configure the AWS Credential Profile shown by the showStackConfig task with the AWS credentials for the user you created above:

   ```
   aws configure --profile blox-<username>-alpha-us-west-2 set region us-west-2
   aws configure --profile blox-<username>-alpha-us-west-2
   ```

4. Create an S3 bucket where all resources to be deployed (code, CloudFormation templates, etc.) will be stored:

   ```
   ./gradlew createBucket
   ```

5. Deploy the Blox stack:

   ```
   ./gradlew deploy
   ```

End to end testing

Once you have a stack deployed, you can test it with:

```
./gradlew testEndToEnd
```

Contact

License

All projects under Blox are released under Apache 2.0 and contributions are accepted under individual Apache Contributor Agreements.

blox's Issues

Flaky Test?

I'm still new to Go and the testing framework being used, but sometimes when I run make it will fail with 1 error and this looks like it might be related.

--- PASS: TestRunLoadInstancesReturnsError (0.00s)
panic: Fail in goroutine after TestOverlappingRunInvocationsAreSkipped has completed

goroutine 9 [running]:
panic(0x51cce0, 0xc42016e910)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
testing.(*common).Fail(0xc4200a4540)
	/usr/local/go/src/testing/testing.go:412 +0x11f
testing.(*common).FailNow(0xc4200a4540)
	/usr/local/go/src/testing/testing.go:431 +0x2b
testing.(*common).Fatalf(0xc4200a4540, 0x5fef7b, 0x24, 0xc4201a6390, 0x3, 0x3)
	/usr/local/go/src/testing/testing.go:496 +0x83
github.com/blox/blox/vendor/github.com/golang/mock/gomock.(*Controller).Call(0xc42027f200, 0x54faa0, 0xc4203f2570, 0x5ee0c6, 0x9, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/williamthurston/go/src/github.com/blox/blox/vendor/github.com/golang/mock/gomock/controller.go:113 +0x452
github.com/blox/blox/cluster-state-service/handler/mocks.(*MockTaskLoader).LoadTasks(0xc4203f2570, 0x26, 0x0)
	/Users/williamthurston/go/src/github.com/blox/blox/cluster-state-service/handler/mocks/taskloader_mocks.go:45 +0x73
github.com/blox/blox/cluster-state-service/handler/reconcile.(*Reconciler).RunOnce(0xc420404d20, 0x0, 0x0)
	/Users/williamthurston/go/src/github.com/blox/blox/cluster-state-service/handler/reconcile/reconciler.go:89 +0xbf
github.com/blox/blox/cluster-state-service/handler/reconcile.(*Reconciler).Run.func1(0xc420404d20)
	/Users/williamthurston/go/src/github.com/blox/blox/cluster-state-service/handler/reconcile/reconciler.go:69 +0x2f
created by github.com/blox/blox/cluster-state-service/handler/reconcile.(*Reconciler).Run
	/Users/williamthurston/go/src/github.com/blox/blox/cluster-state-service/handler/reconcile/reconciler.go:73 +0x1c6
FAIL	github.com/blox/blox/cluster-state-service/handler/reconcile	0.092s

Question: how does Blox handle resource creation, cleanup, and reclamation?

Since multiple resources are embedded (VPC, subnets, etc.), how does Blox make sure they are released when it brings down a cluster, rather than left dangling? And how does it handle the case where it cannot clean them up by itself?

Update "deploy CFT on AWS" to infer the available AZ using an intrinsic function

My first attempts to get Blox up on AWS failed because of the AZ mapping in the CFT, and the stack rolled back.
A more graceful approach would be to use:

"AvailabilityZone" : {
"Fn::Select" : [ "0", { "Fn::GetAZs" : "" } ]
},

With this, the CFT looks up the AZs available in the region and picks one of them when creating the subnet.

daemon-scheduler constraint is not enforced when a container is started manually

I was testing daemon-scheduler locally and observed this behavior:

1. I created an environment and a deployment for the daemon-scheduler. The ECS service involved initially had 0 containers running. There are 3 hosts.
2. The daemon deployed, and 1 copy of the container was running on each host.
3. I ssh'd onto one of the hosts and killed the container. Looking at the log, a new one got started, as expected. This also worked when I added a new host to the cluster.
4. I manually started a container of the same service on one of the hosts, and both containers continued to run. Here I (kind of) expected the daemon to step in and enforce the constraint by removing one.
5. I killed the container started by the daemon and left the one I had started running. The daemon started a new one again, so 2 copies are still running.

Is my assumption of how this should work wrong?

[Proposal] Support Identity for scheduled services

Description

This feature provides the ability to associate specific resources to a task that survives task restarts, such as a unique token, DNS hostname or data volume.

Motivation

Strong identity is cornerstone in running persistent applications that have a notion of membership. Identity is provided by either supplying the same token at launch/relaunch of a task instance, providing the same DNS name upon task relaunch, or by making sure the state of a task instance (e.g., data volumes) moves along with it.

Use cases

Ability to run clustered applications like Kafka. For example, Kafka requires a broker id when it’s launched. When the broker exits and needs to be relaunched, it should launch with the state such as broker id and the broker data.
Ability to run applications that advertise their location. Peers of this application would need to reach each other to form a cluster. When a peer exits and gets relaunched it would need to get the same DNS name so that other peers are able to reach it.

CI setup

Set up Travis CI to trigger builds on pull requests and check-ins.

Update Swagger spec with streaming APIs

Swagger JSON should expose cluster-state-service streaming APIs.

Proposal: Visualize Container Packing in Cluster

This is an enhancement request to this toolset.

It would be great if Blox offered a tool that gives a visual representation of how the containers are packed in the cluster.

[Proposal] Support highly available Blox deployments

Description

Currently Blox is a single-instance application stack that can be run locally or orchestrated by ECS. To provide better resiliency to failures, every component in the Blox framework should be highly available. Blox is made of stateless services along with a stateful datastore. All the components should be configured to restart automatically upon exit in order to improve reliability. The components should also be run in a replicated manner to make them redundant, so that when one instance fails, others can take over the responsibility. Etcd, which is the datastore in the stack, should be set up to run as a cluster.

Motivation

In order to improve the production readiness of Blox framework, single points of failure need to be eliminated and a redundant system design put in place to offer an acceptable level of uptime for the users.

Use cases

Able to achieve high degree of operational uptime when deploying Blox in production.

Support retention of terminal records independent of ECS retention

Currently ECS cleans up terminal cluster state, like stopped tasks, from its database after a certain period of time. When cluster-state-service reconciles, and learns about the terminal records being cleaned up by ECS, these records are also cleaned up from local store. We should look into providing a configurable retention period of terminal records so that the data is available for auditing, debugging purpose for a longer period of time.
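As a rough illustration of the configurable retention idea, the sketch below decides which terminal records may be dropped from the local store. It is not taken from cluster-state-service: the function name, the dictionary layout, and the choice to measure retention from the time the task stopped are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def records_to_purge(stopped_at_by_arn, removed_from_ecs, retention, now):
    """Return ARNs of terminal records that may be deleted from the local store.

    stopped_at_by_arn: dict mapping task ARN -> datetime the task stopped
    removed_from_ecs:  set of ARNs that ECS no longer returns (hypothetical input)
    retention:         timedelta to keep a record after the task stopped
    now:               current time, passed in so the function is easy to test
    """
    return sorted(
        arn
        for arn, stopped_at in stopped_at_by_arn.items()
        # Keep the record while ECS still knows it, or while it is within retention.
        if arn in removed_from_ecs and now - stopped_at >= retention
    )
```

With a retention of zero this reduces to the current behavior, where a record is removed as soon as reconciliation finds that ECS has cleaned it up.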

Ensure that a filter is specified only once when calling the CSS List APIs using curl

If a filter is specified multiple times in the curl URL (curl "http://localhost:3000/v1/instances?cluster=cluster1&cluster=cluster2"), we use the first value and succeed. We should instead throw an error.
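The requested validation could look roughly like the sketch below, which rejects a URL whose query string repeats a filter key. The function name `parse_filters` is hypothetical and the code only illustrates the check; it is not how cluster-state-service parses requests.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_filters(url):
    """Return the query filters as a dict, rejecting any key given more than once."""
    filters = {}
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if key in filters:
            raise ValueError(f"filter '{key}' specified more than once")
        filters[key] = value
    return filters
```

Called with the URL from the report above, `parse_filters` raises `ValueError` instead of silently using `cluster1`.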

Add metrics to daemon-scheduler

daemon-scheduler should gather metrics to help assess the operational characteristics and expose it via an API endpoint. For example, metrics such as number of deployments created, number of environments, scheduling latency.

[Proposal] Support for Mesos Scheduling Frameworks

Description

This feature supports Mesos scheduling frameworks, such as Marathon and Chronos, with Amazon ECS.

Motivation

The Mesos ecosystem has scheduling frameworks that cover a wide range of use cases. This feature lets customers choose from existing schedulers that meet their needs. It also enables customers that already use Mesos scheduling frameworks to continue to use those frameworks without the need to operate a resource manager.

Use cases

Allows customers to use Mesos scheduling frameworks with Amazon ECS.
Supports the same scheduling framework in multiple environments.

CloudWatch logs for cluster-state-service

Add support for CloudWatch logs for cluster-state-service container.

Add swagger and API support to filter tasks by started_by

The backend implementation is complete; we just have to add frontend and Swagger support.

API to get environment for a cluster ARN

Provide an API to get environment for a particular cluster name or ARN.

Kinesis in CFN template

Add Kinesis stream option to the cloud formation template and ensure it is used and tested in cucumber tests.

Metadata API for cluster-state-service

Add a metadata API to cluster-state-service which prints version info and any other metadata that we want to expose.

Use etcd transactions when adding environment and deployment

Etcd v3 supports transactional model for CRUD operations. All the environment and deployment operations need to be implemented using the transactional APIs provided by etcd.

Finish unit tests for deployment worker

Support multiple data stores

Are there any plans for the cluster-state-service to support other data stores such as Consul?

On a side-note, is there a better place to ask questions like these?

API to get environment based on started_by

Provide a filter API to get environment based on started_by field.

Automate cluster-state-service e2e test setup

[Proposal] Support for Kubernetes

Description

This feature supports Kubernetes framework.

It is currently possible to run Kubernetes on AWS but the whole setup is complex and having it as a managed service integrated with AWS would be great.

EDIT: As it is mentioned in comments supporting Kubernetes APIs is a better framing of this request.

Motivation

Kubernetes is a very popular framework for running containers. It has a big community and tooling around it.

Use cases

Allow the use of the Kubernetes container scheduler and any of its other features that can fit into ECS.

List Resources API

Provide an API that will return the available and remaining resources in a container instance.

Pagination for list and filter APIs

List and Filter APIs in cluster-state-service should support pagination when returning responses.
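One common way to paginate a list response is an opaque continuation token. The sketch below assumes an offset-based token and the hypothetical names `paginate`, `max_results`, and `next_token`; the issue itself does not specify a design.

```python
def paginate(items, max_results, next_token=None):
    """Return one page of items plus the token for the next page (None when done)."""
    if max_results <= 0:
        raise ValueError("max_results must be positive")
    # In this sketch the token is simply the offset of the next item, as a string.
    start = int(next_token) if next_token else 0
    page = items[start:start + max_results]
    end = start + len(page)
    return page, (str(end) if end < len(items) else None)
```

A client keeps calling with the returned token until it receives `None`. An offset token is only stable if the underlying list does not change between calls, which is a known limitation of this simple approach.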

Support versioned task and instance streaming

CloudWatch logs for daemon-scheduler

Add support for CloudWatch logs for daemon-scheduler container.

Handle client-side termination (Ctrl+C) in the instance and task streaming APIs

Support environment updates

The daemon-scheduler api and demo cli don't seem to support updating environments. I've been deleting, stopping tasks manually, and recreating when I modify a task definition. Updating an environment to a new revision and having changes propagate to existing deployments would be useful.

[Bug] Fix resource information in CSS instance API responses

Resource field 'value' in responses to instance API calls using curl (Ex. curl -v http://localhost:3000/v1/instances) is always null.

```
"registeredResources": [
    {
        "name": "CPU",
        "type": "INTEGER",
        "value": null
    },
    {
        "name": "MEMORY",
        "type": "INTEGER",
        "value": null
    },
    {
        "name": "PORTS",
        "type": "STRINGSET",
        "value": null
    },
    {
        "name": "PORTS_UDP",
        "type": "STRINGSET",
        "value": null
    }
],
"remainingResources": [
    {
        "name": "CPU",
        "type": "INTEGER",
        "value": null
    },
    {
        "name": "MEMORY",
        "type": "INTEGER",
        "value": null
    },
    {
        "name": "PORTS",
        "type": "STRINGSET",
        "value": null
    },
    {
        "name": "PORTS_UDP",
        "type": "STRINGSET",
        "value": null
    }
]
```

Tune etcd configuration

Investigate and tune etcd configuration for optimal operation for Blox workloads.

Make pending deployments asynchronous

Refactor environment and deployment types

Delete environment

Currently the delete environment API deletes the environment regardless of whether the environment has running tasks. Add a check to prevent the deletion of an "active" environment, ie one that has running tasks, and provide a force delete environment option that stops the tasks before deleting the environment.

This will also help with test cleanup as deleting the environment after the test runs will clean up its artifacts (tasks started by the environment) as well.
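The proposed behavior can be sketched as follows. The in-memory `environments` dictionary, the `running_tasks` key, and the `stop_task` callback are placeholders invented for this example and do not reflect the daemon-scheduler data model.

```python
class ActiveEnvironmentError(Exception):
    """Raised when deleting an environment that still has running tasks."""


def delete_environment(environments, name, force=False, stop_task=None):
    """Delete an environment; refuse if it is active unless force is set."""
    running = environments[name]["running_tasks"]
    if running and not force:
        raise ActiveEnvironmentError(
            f"environment '{name}' has {len(running)} running task(s)"
        )
    # Force delete: stop every task first, then remove the environment itself.
    while running:
        task = running.pop()
        if stop_task is not None:
            stop_task(task)
    del environments[name]
```

In a test teardown, calling `delete_environment(envs, name, force=True, stop_task=...)` would clean up the tasks the environment started along with the environment.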

Refactor daemon scheduler to use CSS's streaming API

Support for multiple accounts

Today, CSS consumes data for clusters within a single account. We should add support for multiple accounts.

Also, since we deal with a single account, APIs referring to entities (like cluster, etc.) by just names and not ARNs work fine. We'll have to figure out a way for data disambiguation when we support multiple accounts.
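One possible way to disambiguate is to key clusters by the account and region embedded in the cluster ARN, which has the form `arn:aws:ecs:<region>:<account-id>:cluster/<name>`. The sketch below is only an illustration; `ClusterKey` and `cluster_key` are hypothetical names.

```python
from typing import NamedTuple


class ClusterKey(NamedTuple):
    account_id: str
    region: str
    name: str


def cluster_key(arn):
    """Split an ECS cluster ARN into (account, region, name)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "ecs":
        raise ValueError(f"not an ECS ARN: {arn!r}")
    kind, _, name = parts[5].partition("/")
    if kind != "cluster" or not name:
        raise ValueError(f"not a cluster ARN: {arn!r}")
    return ClusterKey(account_id=parts[4], region=parts[3], name=name)
```

Two clusters that are both named `default` in different accounts then map to different keys, while name-only lookups remain ambiguous and would need an account (or a full ARN) supplied by the caller.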

Generating `license.go` is non-deterministic

It looks like generating license.go is non-deterministic; I get a different order when running on my Ubuntu laptop than what is checked in here. license.sh does shell globbing which can differ between machines.
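The usual remedy for this kind of problem is to sort the file list explicitly instead of relying on glob order. The sketch below shows the idea in Python rather than in `license.sh` itself; `stable_order` is a hypothetical helper, and the real fix would live in the shell script (for example by sorting under a fixed locale).

```python
def stable_order(paths):
    """Return paths de-duplicated and sorted by code point.

    Python compares strings by code point regardless of the machine's locale,
    so the result is the same on every system.
    """
    return sorted(set(paths))
```

Feeding the generator from `stable_order(...)` means the output depends only on which files exist, not on the order in which the shell or filesystem happened to list them.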

Refactor CSS swagger artifact generation to be consistent with the daemon scheduler

Also, with any changes in the swagger.json file, the swagger artifacts have to be manually regenerated for the gucumber (e2e) tests. Change this so the tests use the artifacts generated using make generate.

[Proposal] Support Time based Task deployments

Description

This feature provides the ability to launch task definitions at a specified time or frequency using a cron-like syntax.

Motivation

Many system maintenance and batch jobs need to be automatically run at a specified time or frequency.

Use cases

Ability to aggregate data in a database for reporting, auditing, etc.
Ability to backup data at a particular time.
Ability to launch resource intensive task during lower utilization periods.
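To make "cron-like syntax" concrete, here is a deliberately small matcher for five-field expressions (minute, hour, day of month, month, day of week). It is a sketch under several assumptions, not a proposal for the Blox API: only `*`, `*/n`, and comma-separated numbers are supported, steps are counted from zero (which matches cron for minutes and hours only), and all five fields must match, whereas standard cron treats day-of-month and day-of-week as alternatives when both are restricted.

```python
from datetime import datetime


def _field_matches(field, value):
    """Match one cron field supporting '*', '*/n', and comma-separated numbers."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def is_due(expression, when):
    """Return True if the five-field expression matches the given datetime."""
    minute, hour, day, month, weekday = expression.split()
    # cron uses 0 = Sunday, while datetime.weekday() uses 0 = Monday.
    cron_weekday = (when.weekday() + 1) % 7
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(day, when.day)
        and _field_matches(month, when.month)
        and _field_matches(weekday, cron_weekday)
    )
```

A scheduler loop could evaluate `is_due` once per minute for each registered expression and launch the associated task definition when it returns `True`.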

Implement the existing AWS API for all list and describe calls

Rather than a separate client and separate API I'd prefer to just use the existing AWS SDK and command line tools to query my cluster state from this system, just without the risk of throttling that exists when speaking to the hosted API.

It'd also be cool if it implemented the various write APIs as well and forwarded them, but that's probably less valuable for me.

cluster-state-service task is failing

Hi,
I am using the AWS installation described in https://github.com/blox/blox/tree/dev/deploy#local-installation.
The service starts successfully, but the bloxoss/cluster-state-service:0.1.0 task is failing with the error below:
2016-12-12T02:06:29Z [INFO] Reconciler loading tasks and instances
2016-12-12T02:07:29Z [CRITICAL] Error starting event stream handler: Error bootstrapping: Failed to reconcile. Could not load tasks.: Error loading tasks from data store: Error loading tasks from store: Context deadline is exceeded: context deadline exceeded

Am I missing something here?

All other tasks are running fine.
```
docker ps
CONTAINER ID   IMAGE                            COMMAND                  CREATED         STATUS         PORTS                     NAMES
00999f64e828   bloxoss/daemon-scheduler:0.1.0   "/daemon-scheduler --"   5 minutes ago   Up 4 minutes   0.0.0.0:32770->2000/tcp   ecs-BloxFramework-1-scheduler-8cb28feefab9e48c5900
ede79b28d18c   quay.io/coreos/etcd:v3.0.13      "/usr/local/bin/etcd "   5 minutes ago   Up 5 minutes   2379-2380/tcp             ecs-BloxFramework-1-etcd-faba859abcfba88e4e00
0c5c1af7f37d   amazon/amazon-ecs-agent:latest   "/agent"                 2 days ago      Up 2 days                                ecs-agent
```

Input validation

Validate input parameters like cluster name, task definition and deployment tokens to daemon-scheduler APIs.

Support unlimited cluster-state-service task filter combinations

Currently we only allow the cluster and status filter combination on the cluster-state-service /v1/tasks API method. You can't filter by cluster and startedBy, or by all three filters. We should refactor the CSS filter tasks method to support any possible filter combination. This will future-proof us for adding new filters down the road.
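A filter method that accepts any combination can be written generically, so that adding a filter means adding one name to a set. The sketch below assumes tasks are plain dictionaries keyed by the filter names; `filter_tasks` and `ALLOWED_FILTERS` are hypothetical and not taken from the cluster-state-service code.

```python
ALLOWED_FILTERS = {"cluster", "status", "startedBy"}


def filter_tasks(tasks, **filters):
    """Return the tasks that match every supplied filter (any combination)."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    return [
        task
        for task in tasks
        if all(task.get(key) == value for key, value in filters.items())
    ]
```

Because the match is a conjunction over whatever filters were passed, calling it with no filters returns every task and calling it with all three narrows on all three.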

[Proposal] Create Web User Interface

Description

This feature is to provide a web user interface for consuming, visualizing, and modifying the Blox state.

Motivation

Having a web UI would be a nice addition for visualizing cluster state and managing deployed environments. Please submit comments for any features you would like to see in the web UI.

Use cases

View cluster-state-service instances and tasks.
View daemon-scheduler environments and deployments.

Support version parameter in Streaming API

Streaming API in cluster-state-service allows any consumer to listen for state changes in a streaming fashion. This API should support a fromVersion parameter so that any consumer can catch up to the events from the point where they dropped off.
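The catch-up behavior can be illustrated with a generator that replays only the events newer than the version a consumer last saw. This is a sketch with assumptions: events are dictionaries with an integer `version` key, they are already ordered by version, and `from_version` is treated as exclusive (the last version the consumer already has). The issue does not specify any of these details.

```python
def events_since(events, from_version=None):
    """Yield events with a version greater than from_version (all if None)."""
    for event in events:
        if from_version is None or event["version"] > from_version:
            yield event
```

A consumer that records the version of the last event it processed can reconnect with that value and resume without gaps or duplicates, as long as the service still retains the intervening events.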

Filter based on remaining resources

Provide an API that can be used to search for container instances with available resources. For example, query to identify instances with 128 M and 2 CPUs available. This can be used by schedulers to quickly identify possible container instances to place a task on.
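A query like this could be evaluated over instance records shaped like the `remainingResources` lists that the instance API returns (a list of `name`/`type`/`value` entries). The sketch below assumes numeric values in whatever units ECS reports them, and the `id` field and function names are hypothetical.

```python
def remaining(instance, name):
    """Return the remaining value of a named resource, or None if unknown."""
    for resource in instance.get("remainingResources", []):
        if resource["name"] == name:
            return resource["value"]
    return None


def instances_with_capacity(instances, cpu, memory):
    """Return the instances that have at least the requested CPU and memory free."""
    matches = []
    for instance in instances:
        free_cpu = remaining(instance, "CPU")
        free_memory = remaining(instance, "MEMORY")
        # Instances with missing or null values cannot be judged, so skip them.
        if free_cpu is None or free_memory is None:
            continue
        if free_cpu >= cpu and free_memory >= memory:
            matches.append(instance)
    return matches
```

A scheduler could call `instances_with_capacity(instances, cpu=..., memory=...)` to narrow the candidate set before applying its own placement logic.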

Add metrics to cluster-state-service

cluster-state-service should gather metrics to help assess the operational characteristics and expose it via an API endpoint. For example, metrics such as number of ECS events processed, number of reconciliations, last reconciliation time, next reconciliation time.

Metadata API

Add a metadata API to daemon-scheduler which prints version info and any other metadata that we want to expose.
