tritondatacenter / containerpilot
A service for autodiscovery and configuration of applications running in containers
License: Mozilla Public License 2.0
Update: I thought this was due to using the Catalog API rather than the Agent API but we are already using the Agent API. Needs a bit more investigation - perhaps my expectations are wrong.
The service definition that I get is:
{
"Node": "consul-server1",
"Address": "10.111.0.1",
"ServiceID": "my-service-8d462d139c39",
"ServiceName": "my-service",
"ServiceTags": [],
"ServiceAddress": "10.222.0.100",
"ServicePort": 3000,
"ServiceEnableTagOverride": false,
"CreateIndex": 2403,
"ModifyIndex": 2409
},
What I expect to get is:
{
"Node": "consul-agent1",
"Address": "10.222.1.100",
"ServiceID": "my-service-8d462d139c39",
"ServiceName": "my-service",
"ServiceTags": [],
"ServiceAddress": "10.222.0.100",
"ServicePort": 3000,
"ServiceEnableTagOverride": false,
"CreateIndex": 2403,
"ModifyIndex": 2409
},
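If it helps narrow this down, here's a minimal sketch (assuming the standard github.com/hashicorp/consul/api client and a hypothetical agent address) of registering through a specific local agent and reading the entry back; in Consul, the Node and Address of a catalog entry reflect whichever agent owns the registration:

package main

import (
	"fmt"
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Point the client at the local agent (hypothetical address), not the server.
	cfg := consul.DefaultConfig()
	cfg.Address = "consul-agent1:8500"

	client, err := consul.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Register through the Agent API; the catalog entry's Node/Address
	// come from the agent that performed this registration.
	err = client.Agent().ServiceRegister(&consul.AgentServiceRegistration{
		ID:      "my-service-8d462d139c39",
		Name:    "my-service",
		Address: "10.222.0.100",
		Port:    3000,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read the entry back via the health endpoint to inspect Node/Address.
	entries, _, err := client.Health().Service("my-service", "", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Println(e.Node.Node, e.Node.Address, e.Service.Address, e.Service.Port)
	}
}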
I can't seem to get containerbuddy to use CONSUL_HTTP_TOKEN. Logging from the agent:
`Feb 11 14:07:12 host consul[18926]: agent: Synced service 'mesos-consul:xxxxxxxx:tst-logredis:31398'
Feb 11 14:08:41 host consul: 2016/02/11 14:08:41 [ERR] http: Request PUT /v1/agent/check/pass/containername-81eb092d0568?note=ok, error: CheckID does not have associated TTL from=172.17.0.3:55587
Feb 11 14:08:41 host consul[18926]: http: Request PUT /v1/agent/check/pass/containername-81eb092d0568?note=ok, error: CheckID does not have associated TTL from=172.17.0.3:55587
Feb 11 14:08:41 host consul: 2016/02/11 14:08:41 [WARN] agent: Service 'containername-81eb092d0568' registration blocked by ACLs
Feb 11 14:08:41 host consul[18926]: agent: Service 'containername-81eb092d0568' registration blocked by ACLs
Feb 11 14:08:41 host consul: 2016/02/11 14:08:41 [WARN] agent: Check 'containername-81eb092d0568' registration blocked by ACLs
Feb 11 14:08:41 host consul[18926]: agent: Check 'containername-81eb092d0568' registration blocked by ACLs
Feb 11 14:08:51 host consul: 2016/02/11 14:08:51 [WARN] agent: Check 'containername-81eb092d0568' registration blocked by ACLs
Feb 11 14:08:51 host consul[18926]: agent: Check 'containername-81eb092d0568' registration blocked by ACLs
Feb 11 14:08:53 host consul: 2016/02/11 14:08:53 [WARN] agent: Check 'containername-649ba536bd72' missed TTL, is now critical
Feb 11 14:08:53 host consul[18926]: agent: Check 'containername-649ba536bd72' missed TTL, is now critical
Feb 11 14:08:53 host consul[18926]: agent: Check 'containername-649ba536bd72' registration blocked by ACLs
Feb 11 14:08:53 host consul: 2016/02/11 14:08:53 [WARN] agent: Check 'containername-649ba536bd72' registration blocked by ACLs`
Am I correct in believing this is simply not supported at the moment? I've tried the 0.1.1 release and building off of master.
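For reference, the Go Consul client reads CONSUL_HTTP_TOKEN from the environment when building its default config; a minimal sketch (assuming github.com/hashicorp/consul/api) of wiring the token in either way:

package discovery

import (
	"os"

	consul "github.com/hashicorp/consul/api"
)

// newConsulClient builds a client. DefaultConfig already honors the
// CONSUL_HTTP_TOKEN environment variable, but the token can also be set
// explicitly on the config for clarity.
func newConsulClient() (*consul.Client, error) {
	cfg := consul.DefaultConfig()
	if token := os.Getenv("CONSUL_HTTP_TOKEN"); token != "" {
		cfg.Token = token
	}
	return consul.NewClient(cfg)
}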
Allow the configuration of a service registration to include tags that the discovery backend can expose for consumption by backend handlers (e.g. "prod", "dev").
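A rough sketch of what passing those tags through to Consul could look like (the ServiceConfig type here is hypothetical; Tags is a real field on the client's AgentServiceRegistration):

package discovery

import consul "github.com/hashicorp/consul/api"

// ServiceConfig is a hypothetical slice of the Containerbuddy service config.
type ServiceConfig struct {
	Name string   `json:"name"`
	Port int      `json:"port"`
	Tags []string `json:"tags"` // e.g. ["prod"] or ["dev"]
}

// register passes the configured tags straight through to the Consul
// registration; consul-template and other consumers can then filter on them.
func register(client *consul.Client, svc ServiceConfig, address string) error {
	return client.Agent().ServiceRegister(&consul.AgentServiceRegistration{
		ID:      svc.Name,
		Name:    svc.Name,
		Port:    svc.Port,
		Address: address,
		Tags:    svc.Tags,
	})
}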
Some fields that are required need to have better error messages.
Some bad UX I've seen.
When running in bridge mode, none of the interfaces available to containerbuddy contains the actual IP that needs to be advertised to consul.
I'd actually expect that leaving out the interfaces configuration would magically use the consul agent IP. But whatever the case, it seems useful to be able to explicitly set the IP through a variable for docker bridged-mode users.
All commands in the app.json are just strings. Currently, Containerbuddy splits these strings on space.
This could be fragile if the command requires an argument to have a space.
This leaves three options:
Supporting a JSON array of arguments makes practical sense:
All executable fields, such as onStart and onChange, accept either a string or an array. If a string is given, the command and its arguments are separated by spaces; otherwise, the first element of the array is the command path and the rest are its arguments.
String Command
"health": "/usr/bin/curl --fail -s http://localhost/app"
Array Command
"health": [
"/usr/bin/curl",
"--fail",
"-s",
"http://localhost/app"
]
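A sketch of how a command field could accept either form when decoding the config, assuming a custom JSON unmarshaler (illustrative, not the actual Containerbuddy type):

package config

import (
	"encoding/json"
	"strings"
)

// CommandArgs holds an exec-style argv; it accepts either a JSON string
// (split on spaces) or a JSON array of strings.
type CommandArgs []string

func (c *CommandArgs) UnmarshalJSON(data []byte) error {
	// Try the array form first: ["/usr/bin/curl", "--fail", ...]
	var args []string
	if err := json.Unmarshal(data, &args); err == nil {
		*c = args
		return nil
	}
	// Fall back to the string form: "/usr/bin/curl --fail ..."
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	*c = strings.Fields(s)
	return nil
}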
So there are multiple issues with the local deployment…
cd ./examples/nginx
./start -p example -f docker-compose-local.yml
This should be:
cd ./examples
./start.sh -p example -f docker-compose-local.yml
There is no executable called start in ./examples/nginx; it is in ./examples and it's called start.sh.
The build and start of nginx is broken.
When it tries to start nginx I get this error:
Successfully built a09f1b400979
Creating example_nginx_1...
Cannot start container c0bd9d5f1e31674146c0800c884a273580fa35b6f93356c66aaeea2e0fe449d9: [8] System error: exec: "opt/containerbuddy/containerbuddy": stat opt/containerbuddy/containerbuddy: no such file or directory
This is obviously because there is no file like /opt/containerbuddy/containerbuddy in the repo nor in the built image.
Accept a signal to reload the Containerbuddy configuration file on external changes. Implementation details: there is a quit channel for each of the goroutines running checkHealth or checkForChange, so the easiest way to reload the configuration would be to send the quit signal to all these goroutines and recreate them with the new config.
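A minimal sketch of that flow, with placeholder names (Config, quitters, and the anonymous poll loop stand in for the real checkHealth/checkForChange machinery):

package poller

import "time"

// Config is a hypothetical stand-in for the parsed Containerbuddy config.
type Config struct {
	Services []string
}

var quitters []chan struct{}

// startPollers launches one polling goroutine per service (standing in for
// checkHealth / checkForChange) and records a quit channel for each.
func startPollers(cfg *Config) {
	for range cfg.Services {
		quit := make(chan struct{})
		quitters = append(quitters, quit)
		go func(quit <-chan struct{}) {
			ticker := time.NewTicker(time.Second)
			defer ticker.Stop()
			for {
				select {
				case <-ticker.C:
					// run the health check / onChange handler here
				case <-quit:
					return
				}
			}
		}(quit)
	}
}

// reload shuts every poller down and recreates them from the new config.
func reload(newCfg *Config) {
	for _, quit := range quitters {
		close(quit)
	}
	quitters = nil
	startPollers(newCfg)
}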
It would be great if we could get automated builds on TravisCI.
Could make reviewing easier, since Travis will update the build status on the PR.
This came up in discussion in #51, and we decided to split this problem out for later work.
This enhancement includes:
From @misterbisson:
22 opens up the door to more than just deregistering services. Consider running Couchbase in Docker. The blueprint and demo makes deploying Couchbase and scaling it up easy (repo), but scaling down requires more steps that haven't been automated.
A SIGTERM handler could be exactly what's needed to add that automation. If it could also execute a user-defined executable (and wait for it), it would allow us to mark the Couchbase node for removal from the cluster and automatically rebalance the data to the remaining nodes before stopping it.
I haven't tested it, but I think the right command to call would be:
couchbase-cli rebalance -c 127.0.0.1:8091 -u $COUCHBASE_USER -p $COUCHBASE_PASS --server-remove=${IP_PRIVATE}:8091
And when that is done, it should be safe to stop (and remove/delete) the container.
Hey,
I'm getting an error related to Nginx port when running the example locally.
I did some digging and this command does not work.
NGINX_PORT=$(docker inspect example_nginx_1 | json -a NetworkSettings.Ports."80/tcp".0.HostPort)
Error produced:
command not found: json
write /dev/stdout: broken pipe
So I suggest using the command below instead (tested and working):
Docker docs -> Find a Specific Port Mapping
NGINX_PORT=$(docker inspect --format='{{(index (index .NetworkSettings.Ports "80/tcp") 0).HostPort}}' ${PREFIX}_nginx_1)
I can submit a PR if necessary, just let me know :)
We've seen a number of cases where we want Containerbuddy to perform some kind of bootstrapping behavior prior to starting our application.
The implementation I'm thinking of is that we'd add a new configuration value onStart to the Config struct. We would then execute a run of a user-defined executable just before the main application is run.
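A minimal sketch of the shape this could take; the OnStart field and runOnStart helper are assumptions for illustration, not the final design:

package core

import (
	"os"
	"os/exec"
)

// Config is a hypothetical slice of the Containerbuddy configuration.
type Config struct {
	OnStart []string `json:"onStart"` // user-defined bootstrap command
}

// runOnStart executes the onStart command, if any, and reports its exit code
// so the caller can bail out before the main application is started.
func runOnStart(cfg *Config) (int, error) {
	if len(cfg.OnStart) == 0 {
		return 0, nil
	}
	cmd := exec.Command(cfg.OnStart[0], cfg.OnStart[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			return exitErr.ExitCode(), err
		}
		return 1, err
	}
	return 0, nil
}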
The first thing my container does is:
PUT /v1/agent/check/pass/tst-logredis2-tst-5a49064b027c?note=ok&token=xxxxxxxxxxxxx
Which fails with (output from tcpdump):
HTTP/1.1 500 Internal Server Error
Date: Fri, 04 Mar 2016 08:29:42 GMT
Content-Length: 36
Content-Type: text/plain; charset=utf-8
CheckID does not have associated TTL
Consul agent logs:
Mar 4 09:59:27 tst-xxxx-xxxx-001 consul: 2016/03/04 09:59:27 [ERR] http: Request PUT /v1/agent/check/pass/tst-logredis2-tst-5a49064b027c?note=ok&token=, error: CheckID does not have associated TTL from=X.X.X.X:60299
Afterwards containerbuddy registers the service and check (just the check tcpdump below)
PUT /v1/agent/check/register?token=x
{"ID":"tst-logredis2-tst-5a49064b027c","Name":"tst-logredis2-tst-5a49064b027c","Notes":"TTL for tst-logredis2-tst set by containerbuddy","ServiceID":"tst-logredis2-tst-5a49064b027c","TTL":"30s"}
This behaviour seems incorrect - shouldn't the service be registered first?
Here's my app.json:
{
"consul": "{{.HOST}}:8500",
"stopTimeout": -1,
"services": [
{
"name": "tst-logredis2-tst",
"port": 8080,
"health": [
"socat",
"-",
"TCP4:localhost:8080"
],
"poll": 10,
"ttl": 30,
"interfaces": [
"eth0[0]",
"x.x.x.x/16",
"inet",
"inet6"
]
}
]
}
Ran into this while working on #91. The etcd configuration is supposed to accept a list of endpoints as documented in the README:
{
"etcd": {
"endpoints": ["http://etcd1:4001"]
}
}
But this throws the error Must provide etcd endpoints, which comes from where we switch over the config values. Passing a single string for a single host works fine. This section of our etcd tests also works, which suggests that the problem is in the configuration file parsing and not the switch where we're throwing the error.
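One possible culprit, hedged since I haven't traced the parser: if the raw config is decoded into interface{} first, a JSON array arrives as []interface{} rather than []string, so a type switch that only handles string and []string would fall through to the error. A sketch of handling all three cases:

package config

import "fmt"

// parseEndpoints accepts the raw "endpoints" value from an already-decoded
// config map and normalizes it to a []string.
func parseEndpoints(raw interface{}) ([]string, error) {
	switch v := raw.(type) {
	case string:
		return []string{v}, nil
	case []string:
		return v, nil
	case []interface{}:
		// encoding/json decodes JSON arrays into []interface{}, so this is
		// the case an array of endpoints actually hits.
		out := make([]string, 0, len(v))
		for _, item := range v {
			s, ok := item.(string)
			if !ok {
				return nil, fmt.Errorf("endpoint is not a string: %v", item)
			}
			out = append(out, s)
		}
		return out, nil
	default:
		return nil, fmt.Errorf("must provide etcd endpoints")
	}
}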
My goal is to make it somewhat easy to create new integration tests by factoring
out the common logic into a simple framework.
It might be overkill though, maybe there is an easier way to do it.
- integration_tests/fixtures - contains a folder for each test harness
- integration_tests/tests - contains a folder for each integration test
- integration_tests/fixtures/fixture_name - contains a Dockerfile for building a Containerbuddy app, resulting in an image named fixture_name. The Dockerfile should specify an ENTRYPOINT/CMD and should not require any arguments.
- integration_tests/tests/test_name - contains a test.sh script and a docker-compose.yml for setting up the test environment. If test.sh returns success (0) then the test passed, otherwise it failed.

This script can make some assumptions:

- containerbuddy_etcd and containerbuddy_consul are running and are fresh (no data)
- the fixtures in integration_tests/fixtures are created and are available as images, requiring no arguments

An integration_test Makefile target runs the top-level script:

integration_test: build
	./test.sh

The script in the top-level folder named test.sh will, for each folder in integration_tests/fixtures in alpha order:

- copy build into integration_test/fixtures/fixture_name/build so it can easily be sourced by the Dockerfile
- cd integration_tests/fixtures/fixture_name
- docker build -t fixture_name .
- remove the build folder to clean up (should add to .gitignore also)

Note: since fixtures are created in alpha order, they can have FROM directives for previously created images, so as to reduce duplication.

Then, for each folder in integration_tests/tests in alpha order, it will:

- reset containerbuddy_etcd and containerbuddy_consul
- use docker-compose.yml to bring up the test environment
- run integration_test/tests/test_name/test.sh
Bugs like #40 may be difficult to find if we are not exercising more complicated code paths.
Need to add some scripts to test critical paths for a running instance of Containerbuddy.
We should be able to compose a set of docker containers which can exercise code paths that are important from outside the codebase.
I'm still trying to hunt down the details, but while working on autopilotpattern/mysql#1 I'm discovering that a configuration reload can apparently cause Containerbuddy to stop making forward progress.
I'm going to try to work up a reproduction, because it definitely doesn't happen every time, so there might be a race with other signals or something like that.
I'm trying to combine Weave and containerbuddy to register weave IPs in consul-dns. Unfortunately, discovery uses the first IP that it finds from the list of interfaces - which in my case is usually the docker bridge, not the one I want (weave interface).
I'd like to configure containerbuddy to use ethwe directly, because I know my containers will be on the weave network:
Example of my container's interfaces on the Weave network
root@aab54fefc215:/# ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:ac:11:00:02
inet addr:172.17.0.2 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:acff:fe11:2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:648 (648.0 B) TX bytes:648 (648.0 B)
ethwe Link encap:Ethernet HWaddr ee:71:f3:9b:ed:4a
inet addr:10.130.16.1 Bcast:0.0.0.0 Mask:255.255.224.0
inet6 addr: fe80::ec71:f3ff:fe9b:ed4a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1410 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:648 (648.0 B) TX bytes:690 (690.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Adding an interface override like "interface": "ethwe" may help me get the correct IP:
{
"consul": "consul:8500",
"services": [
{
"name": "nginx",
"port": 80,
"interface": "ethwe",
"health": "/usr/bin/curl --fail -s http://localhost/health.txt",
"poll": 10,
"ttl": 25
}
],
"backends": [ ... ]
}
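For illustration, a sketch of how an interface-name override could resolve the advertised IP with the standard library net package (not the current Containerbuddy lookup code):

package discovery

import (
	"fmt"
	"net"
)

// ipForInterface returns the first IPv4 address bound to the named
// interface, e.g. "ethwe".
func ipForInterface(name string) (string, error) {
	iface, err := net.InterfaceByName(name)
	if err != nil {
		return "", err
	}
	addrs, err := iface.Addrs()
	if err != nil {
		return "", err
	}
	for _, addr := range addrs {
		if ipnet, ok := addr.(*net.IPNet); ok {
			if ip4 := ipnet.IP.To4(); ip4 != nil {
				return ip4.String(), nil
			}
		}
	}
	return "", fmt.Errorf("no IPv4 address on interface %s", name)
}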
Containerbuddy should be performing the duties of the init process if it is run as PID 1: reaping child processes.
One possible way to do this is by using this go library:
https://github.com/ramr/go-reaper
Although since Containerbuddy is already handling signals, we could just run:
func reapChildren() {
	// Collect the exit status of any terminated child, retrying if the
	// wait is interrupted by a signal.
	var wstatus syscall.WaitStatus
	pid, err := syscall.Wait4(-1, &wstatus, 0, nil)
	for syscall.EINTR == err {
		pid, err = syscall.Wait4(-1, &wstatus, 0, nil)
	}
	if syscall.ECHILD == err {
		// No remaining children to reap.
		return
	}
	log.Printf("Reaped: pid=%d, wstatus=%+v", pid, wstatus)
}

// ... in handleSignals: only subscribe to SIGCHLD when we are PID 1
if 1 == os.Getpid() {
	signal.Notify(sig, syscall.SIGCHLD)
}

// ... and in the signal-handling switch:
case syscall.SIGCHLD:
	reapChildren()
Running ./containerbuddy -version fails to bring up the GitHash and Version identifiers:
docker run --rm -it -v build/containerbuddy:/containerbuddy debian:jessie /bin/bash
root@5823395a7845:/# ./containerbuddy -version
Version:
GitHash:
I've been able to verify that it works with the 0.1.2-RC build so presumably this got introduced when we moved the directories around in 0.1.3.
On my local workstation, I have the following crazy amount of interfaces:
docker0 Link encap:Ethernet HWaddr 02:42:27:ff:b2:cc
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:27ff:feff:b2cc/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:128306 errors:0 dropped:0 overruns:0 frame:0
TX packets:158590 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:8752297 (8.7 MB) TX bytes:483052292 (483.0 MB)
eth0 Link encap:Ethernet HWaddr 10:c3:7b:45:a2:ff
inet addr:192.168.0.7 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::12c3:7bff:fe45:a2ff/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5764724 errors:0 dropped:0 overruns:0 frame:0
TX packets:2369492 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5741557978 (5.7 GB) TX bytes:390985056 (390.9 MB)
Interrupt:18 Memory:fbf00000-fbf20000
eth1 Link encap:Ethernet HWaddr 40:16:7e:37:99:a6
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Interrupt:19 Memory:fb800000-fb820000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:353966 errors:0 dropped:0 overruns:0 frame:0
TX packets:353966 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:57731938 (57.7 MB) TX bytes:57731938 (57.7 MB)
lxcbr0 Link encap:Ethernet HWaddr 7a:01:23:0a:6a:1f
inet addr:10.0.3.1 Bcast:10.0.3.255 Mask:255.255.255.0
inet6 addr: fe80::7801:23ff:fe0a:6a1f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:18425 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:3809150 (3.8 MB)
veth9a44ba0 Link encap:Ethernet HWaddr 3e:fb:89:da:76:7f
inet6 addr: fe80::3cfb:89ff:feda:767f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:33 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:648 (648.0 B) TX bytes:7584 (7.5 KB)
vmnet1 Link encap:Ethernet HWaddr 00:50:56:c0:00:01
inet addr:10.99.99.1 Bcast:10.99.99.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:fec0:1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:160348 errors:0 dropped:16648 overruns:0 frame:0
TX packets:18424 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
vmnet8 Link encap:Ethernet HWaddr 00:50:56:c0:00:08
inet addr:10.88.88.1 Bcast:10.88.88.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:fec0:8/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:222122 errors:0 dropped:16648 overruns:0 frame:0
TX packets:240397 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
wlan0 Link encap:Ethernet HWaddr 00:24:01:ee:dc:bc
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
If you look at eth1, you can see that nothing is plugged into it, so its IP addresses will be empty.
When I try to execute HAProxy using the HAProxy base image for v1.6 I get
docker run --rm mbbender/haproxy /opt/containerbuddy/containerbuddy -config file:///opt/containerbuddy/haproxy.json /usr/local/sbin/haproxy -f /usr/local/etc/haproxy/haproxy.cfg
2015/11/06 04:34:46 fork/exec : no such file or directory
It takes a bit for that error to show up, maybe 5-10 seconds so it's not an immediate thing. If I run this image without the containerbuddy wrapper it works as expected.
My image only adds consul-template to the mix which isn't even in play in any of these tests so I don't expect that should matter.
I don't know go so I'm attempting to troubleshoot/debug but not having much luck.
I realized that if I stop and start the container, containerbuddy does not refresh the service IP address in consul. I would have to deregister the service manually. Is this a bug?
When I run tests with the -race flag to detect data races, there's what looks to be a linker failure associated with a missing glibc component. We should fix this so that we can run the race detector as part of our integration tests.
$ make test
docker rm -f containerbuddy_consul > /dev/null 2>&1 || true
docker run -d -m 256m --name containerbuddy_consul \
progrium/consul:latest -server -bootstrap-expect 1 -ui-dir /ui
bf62fd69cf6a9208790d33de6536c804d6a671889cb61775e56f901d69c1b20a
docker rm -f containerbuddy_etcd > /dev/null 2>&1 || true
docker run -d -m 256m --name containerbuddy_etcd -h etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd0 \
-advertise-client-urls http://etcd:2379,http://etcd:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
-initial-advertise-peer-urls http://etcd:2380 \
-listen-peer-urls http://0.0.0.0:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster etcd0=http://etcd:2380 \
-initial-cluster-state new
04fb620862a9f165f051abe0b0b9c70bbc299033e445cf2fa279b11e07201f25
docker run --rm --link containerbuddy_consul:consul --link containerbuddy_etcd:etcd \
-v /src/containerbuddy:/go/src/containerbuddy \
-v /src/containerbuddy/.godeps:/go/src \
-v /src/containerbuddy/build:/build \
-v /src/containerbuddy/cover:/cover \
-v /src/containerbuddy/examples:/root/examples:ro \
-v /src/containerbuddy/Makefile.docker:/go/makefile:ro \
-e LDFLAGS='-X main.GitHash=8e9e751 -X main.Version=dev-build-not-for-release' \
containerbuddy_build test
cd /go/src/containerbuddy && go test -v -race -coverprofile=/cover/coverage.out
# testmain
runtime/race(.text): __libc_malloc: not defined
runtime/race(.text): getuid: not defined
runtime/race(.text): pthread_self: not defined
runtime/race(.text): madvise: not defined
runtime/race(.text): madvise: not defined
runtime/race(.text): madvise: not defined
runtime/race(.text): sleep: not defined
runtime/race(.text): usleep: not defined
runtime/race(.text): abort: not defined
runtime/race(.text): isatty: not defined
runtime/race(.text): __libc_free: not defined
runtime/race(.text): getrlimit: not defined
runtime/race(.text): pipe: not defined
runtime/race(.text): __libc_stack_end: not defined
runtime/race(.text): getrlimit: not defined
runtime/race(.text): setrlimit: not defined
runtime/race(.text): setrlimit: not defined
runtime/race(.text): setrlimit: not defined
runtime/race(.text): exit: not defined
runtime/race(.text.unlikely): __errno_location: not defined
runtime/race(.text): undefined: __libc_malloc
/usr/local/go/pkg/tool/linux_amd64/link: too many errors
FAIL containerbuddy [build failed]
makefile:36: recipe for target 'test' failed
make: *** [test] Error 2
make: *** [test] Error 2
The PR #4 provided by @bbox-kula made the changes we needed to have a generic interface for service discovery so we can have pluggable backends as intended. Including a second implementation of this backend as an example would make sure this works out in practice.
TODO on this issue: pick a second backend
Our first build on TravisCI failed: https://travis-ci.org/joyent/containerbuddy/builds/94954762
The failing test is one that passed when run locally. I believe there's a race in the signal-sending setup or tear-down.
=== RUN TestMaintenanceSignal
2015/12/04 20:13:36 we are paused!
--- FAIL: TestMaintenanceSignal (0.00s)
signals_test.go:51: Should not be in maintenance mode after receiving second SIGUSR1
=== RUN TestTerminateSignal
.2015/12/04 20:13:36 Caught SIGTERM
2015/12/04 20:13:36 Deregister service: test-service
--- PASS: TestTerminateSignal (1.00s)
I'm looking into it, but I'm going to cc @justenwalker because he's been in this area of the code recently.
From https://www.joyent.com/blog/automatic-dns-updates-with-containerbuddy:
The problem is that if we simply remove the old container we'll have a period of lost traffic between the time we remove the container and the TTL expires. We need a way to signal Containerbuddy to mark the node for planned maintenance. I'll circle back to that in a revision to Containerbuddy and discuss this change in an upcoming post.
This is poorly thought through so far, but I wanted to open up discussion for it.
Health checking gives us a binary way to determine if the app is working or not. If the app is not healthy, we should not send any requests to it and we most likely need to spawn a replacement. But scaling depends on more than binary app health. Every app has one or more performance indicators that can be used to determine if the app is nearing overload and should be scaled up or is too lightly loaded and should be scaled down. In the spirit of what Containerbuddy does to make applications container-native, I think it might also make sense to add an awareness of those performance indicators.
Some performance indicators can be read from the system. The five minute load average reported by the kernel may work well for many apps, but many of the most interesting indicators come from the app itself.
In MySQL, the number of Query entries from SHOW PROCESSLIST that are in any Waiting state can be a very significant performance indicator, and it's one that is best retrieved from within the container.
In Nginx, the average request processing time is useful info, but that's only output in the logs, which are hopefully not inside the container (if they were, ngxtop would be a nice tool to help understand them). But we do expect http_stub_status_module in our triton-nginx image, so we can look at that instead. Active connections vs. the worker_connections limit is a hugely important number there. Any delta between accepts and handled is a huge red flag. Or, perhaps, the Waiting number is an inverse indicator (high numbers indicate low activity, low numbers are high activity, zero could be critical). (See autopilotpattern/nginx#3 for further musing on this.)
Here's what I'm imagining it would look like in the config (though it would be obviously unlikely to actually mix Nginx and MySQL in a single image):
"kpis": [
{
"name": "system-loadavg",
"poll": 103,
"kpi": "/opt/containerbuddy/system-loadavg.sh"
},
{
"name": "mysql-queries-waiting",
"poll": 31,
"kpi": "/my/bin/mysql-queries-waiting.sh"
},
{
"name": "nginx-busy",
"poll": 17,
"kpi": "/my/bin/nginx-busy.sh"
},
{
"name": "nginx-missed-connections",
"poll": 137,
"kpi": "/my/bin/nginx-missed-connections.sh"
}
]
In order, the above four KPI entries would return:
1. The system load average.
2. The number of Query entries from SHOW PROCESSLIST that are in any Waiting state. 0 is great. 1 or above can be trouble. 10 or more is probably critical.
3. (${Active_connections} - ${Waiting}) / ${worker_connections}, a decimal value. 1 is maxed out, 0.5 is 50% busy.
4. The delta between the accepts and handled vars, an integer value. Anything other than 0 here is probably critical.
The executables that get these KPIs are to be provided by the app developer/packager, though it might make sense to have a common way to get the system load average as part of Containerbuddy.
As with the health checks, the executables would be run periodically at the time specified. The KPI values could then be returned on stdout as newline-delimited JSON as they're executed, for interpretation and use by whatever tools are reading the logs.
I wanted to offer those specific use cases as an exercise to figure out what data would be handled here. In short, I'm thinking each KPI is a numeric (decimal allowed) time series value. The executable will simply return that value and Containerbuddy can pass it on with the name given in the config.
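To make that concrete, here's a rough sketch (all names hypothetical) of a poller that runs each KPI executable on its interval and emits the name and numeric value as newline-delimited JSON on stdout:

package kpi

import (
	"encoding/json"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// KPI mirrors one entry from the proposed "kpis" config block.
type KPI struct {
	Name string `json:"name"`
	Poll int    `json:"poll"` // seconds between runs
	Cmd  string `json:"kpi"`  // executable that prints a single number
}

// watch runs the KPI executable every Poll seconds and writes one JSON line
// per sample, e.g. {"name":"nginx-busy","value":0.42,"ts":...}, to stdout.
func (k KPI) watch() {
	enc := json.NewEncoder(os.Stdout)
	ticker := time.NewTicker(time.Duration(k.Poll) * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		out, err := exec.Command(k.Cmd).Output()
		if err != nil {
			continue // a failed KPI probe is skipped, not fatal
		}
		value, err := strconv.ParseFloat(strings.TrimSpace(string(out)), 64)
		if err != nil {
			continue
		}
		enc.Encode(map[string]interface{}{
			"name":  k.Name,
			"value": value,
			"ts":    time.Now().Unix(),
		})
	}
}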
Containerbuddy is a bit chatty. Health checks in particular pile up in the Docker logs. Adding (and documenting) log levels wouldn't be a bad idea. A couple options:
When I try to run the example either locally or on Trident, it gets to where it starts writing the template file and never exits. See pastebin http://pastebin.com/GQW5KGE6. It has been running at least half an hour.
If Triton allows multiple IPs per interface (@bahamat on #50), and we potentially have IPV6 (#52), then perhaps we need to allow services to select their IP in a more flexible way.
Propose something like:
{
"cidr": "192.168.0.0/16"
}
instead of (or in addition to) the "interfaces" option.
Proposed behavior:
1. interfaces given - pick the first IP on the first found interface listed in interfaces - already supported
2. cidr given - pick the first IP address on any interface matching the CIDR
3. Both given - pick the first IP address on the first found interface (listed in interfaces) matching the CIDR
Note: the first IP could be IPV4 or IPV6 - maybe that's a flag or environment variable that dictates which one is preferred? Depends on what is decided in #52 I suppose. In the cidr case, it will always be determined by whatever version of CIDR is given, so this disambiguation logic applies to 1 and 2.
Also, case 4 may be unnecessary if each interface will only have at most 1 v4 and 1 v6 address. In this case it should probably be a parse error instead of handling it specially.
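For illustration, a sketch of what the cidr matching could look like using the standard library (an assumption about the mechanics, not a spec):

package discovery

import (
	"fmt"
	"net"
)

// ipForCIDR returns the first address on any interface that falls inside
// the configured CIDR, e.g. "192.168.0.0/16".
func ipForCIDR(cidr string) (string, error) {
	_, subnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", err
	}
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return "", err
	}
	for _, addr := range addrs {
		if ipnet, ok := addr.(*net.IPNet); ok && subnet.Contains(ipnet.IP) {
			return ipnet.IP.String(), nil
		}
	}
	return "", fmt.Errorf("no interface address matches %s", cidr)
}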
According to the description, containerbuddy accepts POSIX signals to change its runtime behavior. However, when we tested it, we found that the signals were not accepted by the process at all, and after diving into the code we have some hints:
// Line 25 in containerbuddy/main.go
// Run the onStart handler, if any, and exit if it returns an error
if onStartCode, err := run(config.onStartCmd); err != nil {
	os.Exit(onStartCode)
}

// Set up handlers for polling and to accept signal interrupts
if 1 == os.Getpid() {
	reapChildren()
}
handleSignals(config)
handlePolling(config)
In the above code snippet, handleSignals is not called until the run procedure has finished; however, run blocks until the launched process itself exits. Given this design, handleSignals never takes effect at all.
So is this a design bug? A similar project, https://github.com/Yelp/dumb-init, handles signals correctly.
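One possible restructuring, sketched with placeholder names: install the signal handler before launching the long-running command, and run that command asynchronously so the handler loop stays reachable:

package core

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// startApp launches the main application without blocking, so signal
// handling works regardless of how long the application runs.
func startApp(args []string) (*exec.Cmd, error) {
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd, cmd.Start()
}

func mainLoop(args []string) error {
	// Subscribe to signals before anything long-running starts.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGUSR1)

	cmd, err := startApp(args)
	if err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	for {
		select {
		case s := <-sig:
			// toggle maintenance mode, deregister services, etc.
			_ = s
		case err := <-done:
			return err // the application exited
		}
	}
}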
Rather than having to push a commit to bump the version (like this #23), it'd be better to be able to have a -version flag that exposes the version, which we can inject via LDFLAGS.
In the code we'd need a couple of global variables:
var Version string // version for this build, set at build time via LDFLAGS
var GitHash string // short-form hash of the commit at HEAD of this build, set at build time via LDFLAGS
At the top of the makefile we can have:
VERSION ?= dev-build-not-for-release
LDFLAGS := '-X containerbuddy.GitHash $(shell git rev-parse --short HEAD) -X containerbuddy.Version ${VERSION}'
Our go build would then include -ldflags ${LDFLAGS} to inject those values into the global variables, and then we'd just need a flag argument to echo that text to stdout.
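A minimal sketch of the flag handling that would pair with those variables (output format matching the examples below):

package main

import (
	"flag"
	"fmt"
	"os"
)

var (
	Version string // set at build time via -ldflags (see LDFLAGS above)
	GitHash string // set at build time via -ldflags (see LDFLAGS above)
)

func main() {
	version := flag.Bool("version", false, "print version and exit")
	flag.Parse()
	if *version {
		fmt.Printf("Version: %s\nGitHash: %s\n", Version, GitHash)
		os.Exit(0)
	}
	// ... normal startup continues here
}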
This means that when we build during development, we'll get:
$ make build
...
$ ./build/containerbuddy -version
Version: dev-build-not-for-release
GitHash: deadbeef
But when we do a release build, we'll get:
$ VERSION=0.0.2 make build
...
$ ./build/containerbuddy -version
Version: 0.0.2
GitHash: deadbeef
$ VERSION=0.0.2 make release
...
Upload this file to Github release:
464ee7708b3bd93c6c996a22f76c426bc144ec71 release/containerbuddy-0.0.2-alpha.tar.gz
Godep is pretty much the de facto tool for managing dependencies in Go projects. I suggest we consider using Godep instead of manually checking out and manipulating the source repositories.
This requires that the directory structure be altered to conform more closely to the golang coding standards.
Following the discussion in #80, I'm going to move the example applications into the integration tests. This makes sure we keep the example applications working, and also lets us point to more complex example applications in different repos for more production-ready examples.
It would be nice to be able to reference environment variables within the configuration file, e.g. if I want to use one Dockerfile that can be used on a 'dev consul' and later in production, I might choose to change (future) tag support or use a different consul master port.
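A sketch of one way this could work, rendering the raw config through Go's text/template with the environment exposed as data before JSON parsing (an illustration, not Containerbuddy's actual template support):

package config

import (
	"bytes"
	"os"
	"strings"
	"text/template"
)

// renderConfig substitutes {{ .Env.NAME }} style references in the raw
// config file with values from the process environment.
func renderConfig(raw []byte) ([]byte, error) {
	env := map[string]string{}
	for _, kv := range os.Environ() {
		parts := strings.SplitN(kv, "=", 2)
		env[parts[0]] = parts[1]
	}
	tmpl, err := template.New("config").Parse(string(raw))
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, struct{ Env map[string]string }{env}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}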
After #76 landed we ended up with a main.go and separate containerbuddy
package. This opens up the suggestion of making the core functionality of Containerbuddy a library that other applications could import. Then the Containerbuddy binary that we ship would be the main.go and probably the configuration loaders? I've experimented with this a little bit back before we added a lot of our features but it seems like it's feasible and useful.
This would definitely be a post-1.0 release item and needs some discussion about where the "seams" are before we start putting up a bunch of PRs. I do want to avoid making the project unapproachably over-factored.
Unfortunately, I did not do adequate testing for #38 and introduced a bug.
Parsing the config yields a Cmd object for all commands. However, health checks and backend change hooks are run more than once. Once run, the Cmd object is effectively dead - it cannot be run again.
The result is that we hit the failure case in the executeAndWait function and exit prematurely.
This does not affect the preStop, postStop, and onStart Commands since they are only ever run once.
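The underlying constraint is that an os/exec Cmd can only be started once; a sketch of the fix shape, rebuilding the Cmd from stored path and arguments on every invocation (names hypothetical):

package core

import "os/exec"

// Command stores the pieces needed to build a fresh exec.Cmd on each run,
// since a started exec.Cmd cannot be reused.
type Command struct {
	Path string
	Args []string
}

// executeAndWait builds a new Cmd every time, so health checks and onChange
// handlers can run repeatedly without hitting "exec: already started".
func (c *Command) executeAndWait() error {
	cmd := exec.Command(c.Path, c.Args...)
	return cmd.Run()
}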
At some point the time to start the build container got noticeably long, which means test runs have gone from a few seconds to ~20. I can reproduce this both locally and on TravisCI. It's not the tests themselves:
$ time make test
docker rm -f containerbuddy_consul > /dev/null 2>&1 || true
docker run -d -m 256m --name containerbuddy_consul \
progrium/consul:latest -server -bootstrap-expect 1 -ui-dir /ui
199c7b2d0d11c6450887b329aeb86e737770f6a93733c149d9397a9018385b43
docker rm -f containerbuddy_etcd > /dev/null 2>&1 || true
docker run --rm --link containerbuddy_consul:consul --link containerbuddy_etcd:etcd -v /Users/tim.gross/src/justenwalker/containerbuddy/vendor:/go/src -v /Users/tim.gross/src/justenwalker/containerbuddy:/go/src/github.com/joyent/containerbuddy -v /Users/tim.gross/src/justenwalker/containerbuddy/build:/build -v /Users/tim.gross/src/justenwalker/containerbuddy/cover:/cover -v /Users/tim.gross/src/justenwalker/containerbuddy/examples:/root/examples:ro -v /Users/tim.gross/src/justenwalker/containerbuddy/Makefile.docker:/go/makefile:ro -e LDFLAGS='-X containerbuddy.GitHash=4ae8cb3 -X containerbuddy.Version=dev-build-not-for-release' containerbuddy_build test
...
(long pause)
...
cd /go/src/github.com/joyent/containerbuddy && go test -v -coverprofile=/cover/coverage.out ./containerbuddy
...
(lots of output)
...
ok github.com/joyent/containerbuddy/containerbuddy 6.130s
make test 0.73s user 0.04s system 2% cpu 26.594 total
Going to mark this as an enhancement for post-1.0 release and take it on myself, probably while working on #75 in parallel.
Given the following config file:
{
"consul": "consul:8500",
"logging": {
"level": "DEBUG",
"format": "default",
"output": "stderr"
},
"services": [
{
"name": "myservice",
"port": 80,
"poll": 10,
"ttl": 30
}
]
}
we expect the output to be:
$ ./containerbuddy -config file:///containerbuddy.json hello-world.sh
2016/03/09 19:16:18 `health` is required in service myservice
but instead we get:
$ ./containerbuddy -config file:///containerbuddy.json hello-world.sh
$ echo $?
1
This failure happens after the logging framework is set up in this section. But if we force the failure to happen before we set up the logging, for example with the following config file, we get the same non-logging behavior:
{
"consul": "consul:8500",
"etcd": "etcd:4001",
"logging": {
"level": "DEBUG",
"format": "default",
"output": "stderr"
},
"services": [
{
"name": "myservice",
"port": 80,
"poll": 10,
"ttl": 30
}
]
}
Just a few days ago there was a folder with example files and now it has gone missing. A few links coming from the blog are broken now.
The way that the DOCKERMAKE is used in #57 is causing all dependencies to be re-cloned into the container for every test run. I'm going to look into this to reduce test iteration time.
Currently, we have preStart, preStop and postStart events. As the container is running though, it may be useful to have periodic tasks execute to report on status to external systems, separate from the health checks that report to the service discovery backend.
The primary use case would be a logical extension point for push-style metrics without having to build any backends into Containerbuddy directly. (See #27 for discussion.)
Configuration may look something like:
{
"onScheduled": [
{ "frequency": "1s", "command": [ "/bin/push_metrics.sh" ] },
{ "frequency": "10s", "command": [ "/bin/push_other_metrics.sh" ] }
]
}
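A rough sketch of how those scheduled tasks could be driven, assuming the frequency and command shapes shown above (everything else is a placeholder):

package tasks

import (
	"os/exec"
	"time"
)

// Task mirrors one entry in the proposed "onScheduled" block.
type Task struct {
	Frequency string   `json:"frequency"` // e.g. "1s", "10s"
	Command   []string `json:"command"`
}

// run executes the task command at the configured frequency until the
// stop channel is closed.
func (t Task) run(stop <-chan struct{}) error {
	if len(t.Command) == 0 {
		return nil
	}
	freq, err := time.ParseDuration(t.Frequency)
	if err != nil {
		return err
	}
	ticker := time.NewTicker(freq)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Fire and forget; a failing metrics push shouldn't kill the loop.
			_ = exec.Command(t.Command[0], t.Command[1:]...).Run()
		case <-stop:
			return nil
		}
	}
}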
Some other things to consider:
Observation: when creating the default.conf file via the consul template, an upstream is created for all services. If your cluster has containers which do not expose an external port and are not managed by nginx, including an upstream for them will break the nginx config file because it will try to emit a port of 0.
This doesn't affect the example programs but would affect other situations.
I have modified my code to only create entries in the conf file for those services which are linked to nginx.
There are undoubtedly cleaner methods than the one I am using, because I am certainly not a bash guru, but I am including the file in case it is of interest (and in the hope that you say "Geez Don, it is a lot simpler to just do this ...").
start.sh.zip
When I add a tag to a service, I cannot access it in a consul template.
hexo.json
{
"consul": "consul:8500",
"onStart": "/opt/containerbuddy/reload-hexo.sh",
"services": [
{
"name": "hexo",
"port": 4000,
"tags":["joyent.blog.vawter.com"],
"health": "/usr/bin/curl --fail -s http://localhost:4000/",
"poll": 10,
"ttl": 25
}
],
"backends": [
]
}
template
{{range services}}
{{range service .Name}}
{{.Name}}
{{.Address}}
{{.Port}}
{{.Tags}}
{{end}}
{{end}}
template output
hexo
192.168.128.235
4000
[]