for-azure's Issues
Lost node labels on upgrade
When running the upgrade.sh script, all node labels are lost and have to be manually re-applied.
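A workaround sketch, assuming it is run from a manager node before the upgrade; the label key in the last command is a placeholder:

# Save every node's labels before running upgrade.sh:
for NODE in $(docker node ls -q); do
  docker node inspect "$NODE" -f '{{.Hostname}} {{.Spec.Labels}}' >> node-labels.txt
done
# After the upgrade, re-apply each label by hand, e.g.:
docker node update --label-add com.example.role=db swarm-worker000000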
Very high memory usage by docker4x/agent-azure:17.05.0-ce-azure2
Expected behavior
A swarm running multiple stacks, 12 services per stack, without excessive memory usage from the agent container.
Actual behavior
The container uses up a lot of RAM, and its logs show the following error repeating constantly:
Information
- Full output of the diagnostics from "docker-diagnose" run from one of the instances
- A reproducible case if this is a bug, Dockerfiles FTW
- Page URL if this is a docs issue or the name of a man page
Steps to reproduce the behavior
- ...
- ...
panic: runtime error in Swarm manager node
Actual behavior
We deployed Docker CE for Azure using the template below:
https://store.docker.com/editions/community/docker-ce-azure
but after a few days, the Docker service on the manager node crashed.
Information
Cannot execute any docker command on the manager node:
swarm-manager000000:~$ docker-diagnose
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
The following error appeared at the end of the docker.log file.
Stack trace:
Oct 26 04:41:43 moby root: time="2017-10-26T04:41:43.239733160Z" level=debug msg=subscribed method="(*LogBroker).SubscribeLogs" subscription.id=v61j7ly06ey19w6gvscclp6ri
Oct 26 04:41:43 moby root: panic: runtime error: index out of range
Oct 26 04:41:43 moby root: goroutine 1174563 [running]:
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*SubscriptionMessage).MarshalTo(0xc423f6a630, 0xc4224a67d0, 0x47, 0x47, 0x47, 0x47, 0x1a4d120)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/logbroker.pb.go:1162 +0x34a
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*SubscriptionMessage).Marshal(0xc423f6a630, 0x7fd07c460088, 0xc423f6a630, 0x7fd07c4600c0, 0xc423f6a630, 0xe729201)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/logbroker.pb.go:1123 +0x84
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/golang/protobuf/proto.(*Buffer).Marshal(0xc421e8e0d8, 0x7fd07c460088, 0xc423f6a630, 0xc4230b83c0, 0x0)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/golang/protobuf/proto/encode.go:264 +0x7a
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.protoCodec.marshal(0x1a4d120, 0xc423f6a630, 0xc421e8e0d0, 0x43efe5, 0xc42642f790, 0x3, 0x3, 0xc42642f820)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/codec.go:78 +0xe8
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.protoCodec.Marshal(0x1a4d120, 0xc423f6a630, 0x0, 0x3, 0x3, 0x3, 0x0)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/codec.go:88 +0x73
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.(*protoCodec).Marshal(0x28aa198, 0x1a4d120, 0xc423f6a630, 0xc425860008, 0xc8, 0xc8, 0xc42642f498, 0x40d219)
Oct 26 04:41:43 moby root: ^I<autogenerated>:35 +0x59
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.encode(0x2832620, 0x28aa198, 0x1a4d120, 0xc423f6a630, 0x0, 0x0, 0x0, 0x0, 0xc425f46920, 0xc425f468b8, ...)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/rpc_util.go:253 +0x2f9
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc4259b0c80, 0x1a4d120, 0xc423f6a630, 0x0, 0x0)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:581 +0x113
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc424b31ce0, 0x1a4d120, 0xc423f6a630, 0x377dc7e7aad672ec, 0xc4244b54d0)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:61 +0x4b
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*logBrokerListenSubscriptionsServer).Send(0xc425804b50, 0xc423f6a630, 0xc425860340, 0xc42237b080)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/logbroker.pb.go:748 +0x49
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*LogBroker_ListenSubscriptionsServerWrapper).Send(0xc424b31d20, 0xc423f6a630, 0xc423f6a630, 0xc425f472e0)
Oct 26 04:41:43 moby root: ^I<autogenerated>:459 +0x53
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/logbroker.(*LogBroker).ListenSubscriptions(0xc421695b00, 0x28aa198, 0x283a600, 0xc424b31d20, 0x0, 0x0)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/logbroker/broker.go:368 +0xa8d
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*authenticatedWrapperLogBrokerServer).ListenSubscriptions(0xc420e14780, 0x28aa198, 0x283a600, 0xc424b31d20, 0x0, 0x0)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/logbroker.pb.go:276 +0x127
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/api.(*raftProxyLogBrokerServer).ListenSubscriptions(0xc42093bb80, 0x28aa198, 0x2839fa0, 0xc425804b50, 0xc42093bb80, 0x4120b8)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/logbroker.pb.go:1483 +0x23e
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/docker/swarmkit/api._LogBroker_ListenSubscriptions_Handler(0x1962e80, 0xc42093bb80, 0x28384a0, 0xc424b31ce0, 0xc4228ae280, 0xc421553c00)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/api/logbroker.pb.go:735 +0x113
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.StreamServerInterceptor(0x1962e80, 0xc42093bb80, 0x2838740, 0xc4259b0c80, 0xc424b31cc0, 0x1be9f18, 0xffffffffffffffff, 0xc4207f06c8)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server.go:40 +0x13b
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc4210a10e0, 0x283a240, 0xc42134b1e0, 0xc424855680, 0xc421f381e0, 0x27f5b60, 0xc42589bc20, 0x0, 0x0)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/server.go:872 +0x363
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.(*Server).handleStream(0xc4210a10e0, 0x283a240, 0xc42134b1e0, 0xc424855680, 0xc42589bc20)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/server.go:959 +0x1539
Oct 26 04:41:43 moby root: github.com/docker/docker/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4254c1880, 0xc4210a10e0, 0x283a240, 0xc42134b1e0, 0xc424855680)
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/server.go:517 +0xa9
Oct 26 04:41:43 moby root: created by github.com/docker/docker/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
Oct 26 04:41:43 moby root: ^I/go/src/github.com/docker/docker/vendor/google.golang.org/grpc/server.go:518 +0xa1
question: what does docker4x/guide-azure do?
Hi. I'd like to know the purpose of docker4x/guide-azure.
It's calling:
storage_keys = storage_client.storage_accounts.list_keys(RG_NAME, SA_NAME)
sending 3 requests per minute per node to the Microsoft Azure API. This API has a limit of 15K requests per hour, so if you have multiple Docker for Azure deployments running in the same subscription, you get throttled.
I'd like to know what this service is for, to help us determine whether we can stop it as a workaround until the "bug" is fixed.
Thank you.
Container accessing Docker API and mounting Azure File Storage breaks whole machine
We have a 5 node cluster (3 managers, 2 workers) and I'm working on a small helper image to view the container logs nicely. In theory, my container makes some HTTP requests to the Docker API to get the IDs of the tasks, and mounts the Azure File Storage, which holds the actual log files.
Inspired by the editions_logger (image docker4x/logger-azure:17.06.0-ce-azure1), I also want to mount the actual storage right inside the container.
In my case the script is not ready yet, so please don't judge the script itself. :) I wrote a simple NodeJS app which mounts the storage and gets the tasks.
This is my Dockerfile:
FROM node:8-alpine
ENV APP_DIR /app
ENV DOCKER_HOST /var/run/docker.sock
ENV DOCKER_API_VERSION v1.30
RUN apk add --update cifs-utils
RUN mkdir -p $APP_DIR
WORKDIR $APP_DIR
COPY package* $APP_DIR/
RUN npm install
COPY . $APP_DIR
CMD ["npm", "start"]
To do requests to the Docker API:
const path = require('path');
const http = require('http');

/*
 * This is used to do requests against the Docker API.
 */
module.exports = (method, uri, data) => {
  if(!process.env.DOCKER_HOST || !process.env.DOCKER_API_VERSION) {
    throw Error('Please provide DOCKER_HOST and DOCKER_API_VERSION to contact Docker API properly.');
  }
  const options = {
    socketPath: process.env.DOCKER_HOST, // unix socket; a `port` option would be ignored here
    headers: { 'Content-Type': 'application/json' },
    dockerAPI: process.env.DOCKER_API_VERSION
  };
  let rawData = '';
  options.method = method;
  options.path = path.join('/', options.dockerAPI, uri);
  return new Promise((resolve, reject) => {
    const req = http.request(options, res => {
      res.setEncoding('utf8');
      res.on('error', reject);
      res.on('data', chunk => { rawData += chunk });
      res.on('end', () => {
        if([200, 201].indexOf(res.statusCode) == -1) {
          return reject(Error(`[${res.statusCode}] ${options.path} (${JSON.stringify(data)}) failed: ${rawData}`));
        }
        resolve(JSON.parse(rawData));
      });
    });
    req.on('error', reject); // handle socket-level errors as well
    req.end(JSON.stringify(data));
  });
}
And the actual script:
const request = require('./request');
const fs = require('fs');
const { execSync } = require('child_process');

const storage = '//xxx.file.core.windows.net/xxx';
const logmountFolder = '/logmnt';
const username = 'xxx';
const password = 'xxx';

if(!fs.existsSync(logmountFolder)) {
  fs.mkdirSync(logmountFolder);
}

const mount = execSync(`mount -t cifs ${storage} ${logmountFolder} -o vers=2.1,username=${username},password=${password},dir_mode=0777,file_mode=0777,uid=0,gid=0`);
const files = fs.readdirSync(logmountFolder);

request('get', '/tasks?filters={"label":["com.docker.stack.namespace=production"]}')
  .then(tasks => {
    tasks.forEach(task => {
      console.log('task', task.ID);
      files.forEach(file => {
        if(file.indexOf(task.ID) != -1) {
          console.log('file', file);
        }
      })
    });
  })
Expected behavior
I used this command to run it on a manager machine:
docker run --rm -ti -v /var/run/docker.sock:/var/run/docker.sock --privileged infra-log
And it works without any trouble, but only on the first run.
Actual behavior
The second time, the whole machine breaks and is unable to rejoin the cluster after the restart. After the restart, around 3-5 minutes later, the whole machine breaks again, continuously. After a bunch of restarts, Azure itself deallocates the machine and creates a new machine in the scale set (or reimages the broken machine; I can't really tell).
In the past I have also reimaged the broken machine and rejoined it to the cluster by hand.
Information
I ran docker-diagnose after Azure created the new machine:
swarm-manager000001:~$ docker-diagnose
curl: (7) Failed to connect to 10.0.0.7 port 44554: Connection refused
OK hostname=swarm-manager000002 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
OK hostname=swarm-manager000003 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
OK hostname=swarm-worker000000 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
OK hostname=swarm-worker000001 session=1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
Done requesting diagnostics.
Your diagnostics session ID is 1500387848-1vtGIWvbMflyjRA2SQWBXR2iXTZPVLSH
Please provide this session ID to the maintainer debugging your issue.
I also got the docker.log file from the broken machine after a bunch of restarts, but I'm not going to post it here because it may contain sensitive information. I can send it to you, though.
Unable to connect to Manager and Worker VMSS's after shutting down or restarting
Expected behavior
- Restart or Deallocate > Start the Manager and Worker VMSS's
- Start both VMSS's
- Able to SSH into manager. Swarm is running as before restart.
Actual behavior
- SSH using PuTTY returns "Network error: Connection refused"
- The website that was running returns INET_E_RESOURCE_NOT_FOUND
- Curl returns
curl : Unable to connect to the remote server
Steps to reproduce the behavior
- Spin up a Docker for Azure swarm using the template
- Wait for everything to get provisioned, test by deploying a simple stack, etc
- Worker VMSS > Deallocate, wait till that completes
- Manager VMSS > Deallocate, wait till completion
- Manager VMSS > Start, wait till completion
- Worker VMSS > Start, wait till completion
Add templates to repo
The templates can currently be downloaded and used with Azure but it would be useful to have them in this GitHub repo so pull requests can be submitted rather than just issues.
new VM sizes not listed
Expected behavior
New VM sizes should be listed.
Actual behavior
It lists only older VM sizes (D2_v2 instead of D2_v3).
Volumes with `cloudstor:azure` driver prevent changing permissions on files
Whenever I try to use the cloudstor:azure driver on a volume used by a container that changes the ownership of the mounted files, the files never get assigned to the new owner.
Observed cases: running postgres or rabbitmq with the data volume on the cloudstor:azure driver will always fail. Both have entrypoint scripts that try to ensure the data files belong to a different user.
Expected behavior
Changing the ownership of files inside a cloudstor:azure volume should succeed.
Actual behavior
Trying to change the ownership of files inside a cloudstor:azure volume fails silently.
Information
- Running docker-diagnose failed with Error: No such object: meta-azure
Steps to reproduce the behavior
Given the following compose file:
# stack.yml
version: '3.1'

volumes:
  data:
    driver: cloudstor:azure

services:
  rabbitmq:
    image: rabbitmq:3.6-alpine
    volumes: [ "data:/var/lib/rabbitmq" ]
- Deploy the stack:
docker stack deploy --compose-file stack.yml rabbit
- Observe the service failing to start:
watch -n 2 docker stack ps rabbit
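For a quicker check outside a stack, a minimal repro sketch; the volume name testvol and the image are placeholders:

docker volume create -d "cloudstor:azure" testvol
docker run --rm -v testvol:/data alpine \
  sh -c 'touch /data/f && chown nobody /data/f; ls -ln /data/f'
# On a local volume the file would now belong to nobody; on cloudstor:azure
# it reportedly stays owned by uid 0, with no error raised.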
docker service logs command not responding
Expected behavior
We created a swarm cluster in Azure using the following template:
https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fdownload.docker.com%2Fazure%2Fstable%2FDocker.tmpl
docker service logs -f should show service logs
Actual behavior
After scaling up and down several times, the docker service logs command stopped responding.
Steps to reproduce the behavior
- Create a service serv1 with replicas across multiple nodes
- Run docker service logs -f serv1
- Initially observe logs from multiple containers across different nodes
- Scale up and down several times
- Run docker service logs -f serv1
- The command does not respond (see the condensed repro sketch below)
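A condensed repro sketch; the service name and image are placeholders:

docker service create --name serv1 --replicas 6 nginx:alpine
docker service logs -f serv1        # works at this point
docker service scale serv1=12
docker service scale serv1=3        # repeat the scale up/down a few times
docker service logs -f serv1        # reportedly hangs after several cycles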
Information
docker-diagnose output
swarm-manager000003:~$ docker-diagnose
OK hostname=swarm-manager000001 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
OK hostname=swarm-manager000002 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
OK hostname=swarm-manager000003 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
OK hostname=swarm-worker000000 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
OK hostname=swarm-worker000001 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
OK hostname=swarm-worker000002 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
OK hostname=swarm-worker000003 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
OK hostname=swarm-worker000004 session=1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
Done requesting diagnostics.
Your diagnostics session ID is 1510318044-c5urt3zgyY9ulkooLzIoM8Vjv28fKqZg
Please provide this session ID to the maintainer debugging your issue.
docker version output
swarm-manager000003:~$ docker version
Client:
Version: 17.09.0-ce
API version: 1.32
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:39:28 2017
OS/Arch: linux/amd64

Server:
Version: 17.09.0-ce
API version: 1.32 (minimum version 1.12)
Go version: go1.8.3
Git commit: afdb6d4
Built: Tue Sep 26 22:45:38 2017
OS/Arch: linux/amd64
Experimental: false
docker info output
Containers: 8
Running: 6
Paused: 0
Stopped: 2
Images: 8
Server Version: 17.09.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: zbbpsttjfubkuumf9p0e214d0
Is Manager: true
ClusterID: wyn1lmhtgecbnb2r2rwhzjm5s
Managers: 3
Nodes: 8
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.0.0.9
Manager Addresses:
10.0.0.10:2377
10.0.0.11:2377
10.0.0.9:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.49-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 6.785GiB
Name: swarm-manager000003
ID: JRZS:L436:UFYH:KTKG:7T4K:4HP5:TGFI:TOZC:4CSS:HQLW:KNEK:GI4K
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 90
Goroutines: 152
System Time: 2017-11-10T13:10:13.190976179Z
EventsListeners: 1
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Expose docker daemon tcp port to allow commands via ssh tunnel
The Docker daemon on the manager instance listens only on a Unix socket. To allow commands to reach this manager via an SSH tunnel, should it also expose a listener on a TCP port?
That way, one can run commands from a remote docker client, and use a remote docker-compose instance for creating services.
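In the meantime, a possible workaround sketch: OpenSSH 6.7+ can forward a local TCP port to the remote Unix socket, so the daemon does not need to listen on TCP at all. The port number 2374 is arbitrary:

# Forward local port 2374 to the manager's Docker socket:
ssh -NL localhost:2374:/var/run/docker.sock docker@<manager-public-ip> &
# Point the local client (or docker-compose) at the tunnel:
export DOCKER_HOST=tcp://localhost:2374
docker info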
Scaling swarm-worker-vmss virtual machine scale set takes down the stack
Expected behavior
I want to add more worker nodes and scale up one of my stack services. I would expect my stack to keep working while scaling up workers.
Actual behavior
When the swarm-worker-vmss virtual machine scale set starts resizing, I lose connectivity to all the stack endpoints.
Steps to reproduce the behavior
- Login into cloud.docker.com
- Using Azure as a provider create a 1 Manager (VM DS4) and 4 Worker (VM DS3) swarm.
- Once it is provisioned deploy a simple stack that exposes an HTTP endpoint.
- Create a HTTP client that loops over calling one of the HTTP services. Leave it running forever.
- Log in to the Azure portal, navigate to the swarm resource group, select the swarm-worker-vmss and scale to 10 workers.
- The HTTP client from step 4 dies with connection errors.
- Retrying makes the client fail for a couple of minutes until the swarm services recover (see the loop sketch below).
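A loop along these lines reproduces step 4; the endpoint URL is a placeholder:

while true; do
  curl -s -o /dev/null -w '%{http_code}\n' http://<swarm-lb-dns>/ || echo 'connection error'
  sleep 1
done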
Service events not streaming via events API.
Expected behavior
Service events should be streamed via the events API since version 1.30.
Actual behavior
Service events not streaming.
Information
The service events stream should be supported since API version 1.30.
Docker version:
Client:
Version: 17.06.2-ce
API version: 1.30
Go version: go1.8.3
Git commit: cec0b72
Built: Tue Sep 5 19:57:21 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.2-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: cec0b72
Built: Tue Sep 5 19:59:19 2017
OS/Arch: linux/amd64
Experimental: false
Steps to reproduce the behavior
- Create or update a service in the cluster while streaming from the events API
- There are no messages regarding create/update/remove, just container events (see the check sketch below)
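A minimal check sketch; the API version matches the one reported above:

# Via the CLI:
docker events --filter type=service
# Or directly against the API socket (the filter is URL-encoded {"type":["service"]}):
curl --unix-socket /var/run/docker.sock \
  'http://localhost/v1.30/events?filters=%7B%22type%22%3A%5B%22service%22%5D%7D'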
Azure service container logs need to be rotated
Expected behavior
Should be able to access the manager/worker host file system in order to clean up log files
Actual behavior
As SSH sessions are directed to the agent container, there is no way to access host file system directories other than the ones that are automatically mounted.
Information
Almost all of the file system is taken, but I can't find what is using it:
swarm-manager000000: df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 30831524 28023996 1218332 96% /
tmpfs 7168368 4 7168364 0% /dev
tmpfs 7168368 0 7168368 0% /sys/fs/cgroup
tmpfs 7168368 165104 7003264 2% /etc
/dev/sda1 30831524 28023996 1218332 96% /home
tmpfs 7168368 165104 7003264 2% /mnt
shm 7168368 0 7168368 0% /dev/shm
/dev/sda1 30831524 28023996 1218332 96% /etc/ssh
tmpfs 7168368 165104 7003264 2% /lib/modules
tmpfs 7168368 165104 7003264 2% /lib/firmware
/dev/sda1 30831524 28023996 1218332 96% /var/log
/dev/sda1 30831524 28023996 1218332 96% /etc/hosts
/dev/sda1 30831524 28023996 1218332 96% /etc/hostname
/dev/sda1 30831524 28023996 1218332 96% /etc/resolv.conf
tmpfs 1433676 1816 1431860 0% /var/run/docker.sock
/dev/sda1 30831524 28023996 1218332 96% /var/lib/waagent
tmpfs 7168368 165104 7003264 2% /usr/local/bin/docker
/dev/sdb1 209713148 121824 209591324 0% /mnt/resource
Output of du:
swarm-manager000000: sudo du / -h -d 1
1.5M /sbin
0 /proc
111.6M /usr
1.2M /etc
7.0M /lib
16.0K /media
4.0K /srv
8.0K /tmp
4.0K /dev
12.0K /run
172.0K /root
0 /sys
720.0K /home
4.0K /mnt
88.0M /var
1.9M /bin
32.0K /opt
8.0K /daemons
7.5M /WALinuxAgent
219.6M /
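A stopgap sketch, assuming the host's /var/log is among the mounts visible in the df output above, and that docker.log is one of the large files (this is not a substitute for real rotation):

# Find the largest logs, then truncate one in place without deleting it:
sudo sh -c 'ls -lS /var/log'
sudo sh -c ': > /var/log/docker.log'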
Azure Managed Disks
Please update the script to use Azure Managed Disks.
upgrade doesn't work in stable channel
Expected behavior
Upgrade Docker to the latest stable version.
Actual behavior
Still on the previous version.
Information
Just created the swarm, then ran:
docker run \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /usr/bin/docker:/usr/bin/docker \
  -ti \
  docker4x/upgrade-azure:17.06.1-ce-azure1
The whole process seems to go without errors; 2 of 3 nodes get restarted (the one running the upgrade never got restarted).
After the process, the swarm is still on version 17.06.0.
Steps to reproduce the behavior
- run upgrade container
- run version
Default logging backend and doc guidance
Logging is configured to use syslog. This means docker logs does not work, and docker service logs <servicename> just hangs without producing any output.
Please provide some guidance in the documentation about how to manage logging within Azure and swarm mode. According to moby/moby#24812 this should generally be working, but perhaps not with the syslog default. The default setup should probably allow docker service logs to work correctly, and then people can modify their logging setup as desired from there (a possible per-service workaround is sketched below).
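A per-service workaround sketch, assuming an arbitrary test service; with the json-file driver (and size caps to avoid filling the disk), docker service logs is expected to work:

docker service create --name web \
  --log-driver json-file \
  --log-opt max-size=10m --log-opt max-file=3 \
  nginx:alpine
docker service logs web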
Network interfaces are not being cleared
Expected behavior
Hi, whilst deploying a swarm update, the update failed and the following error was reported:
starting container failed: container 48f94450916b0511b7066c5e735b7c13ed5ecd7bbcd748783b04ff4c2435af30: endpoint create on GW Network failed: failed to create endpoint gateway_48f94450916b on network docker_gwbridge: adding interface veth3a88d2a to bridge docker_gwbridge failed: exchange full"
Counting the network interfaces with ifconfig | grep HWaddr | wc -l, there were 1030 interfaces. I'm running a small test swarm of about 10 replicas in total across 4 nodes, so this is a bit excessive!
Rebooting each node from the Azure portal cleared the interfaces.
Actual behavior
Not to have so many network interfaces.
Information
- Full output of the diagnostics from "docker-diagnose" run from one of the instances:
diagnostics session: 1488816054-57qpScTMNJqWtP4VdM4RdcoooisvSGmn
Steps to reproduce the behavior
I think this has slowly occurred over a few weeks of usage, so it's hard to reproduce immediately, but I wanted to put this out there in case other people are experiencing problems. I'll monitor the interface count over time and see if it recurs.
Unable to deploy swarm using Standard_A0 manager and worker size
Expected behavior
Deploy swarm using standard Docker for Azure template from https://docs.docker.com/docker-for-azure/#quickstart using VM size Standard_A0 for manager and worker
Actual behavior
Deployment times out after ~35 minutes and the Manager and Worker VMSS's never start
Information
- Standard_A0 is a valid size according to the template. See https://download.docker.com/azure/stable/Docker.tmpl
Steps to reproduce the behavior
- Use standard Docker for Azure template to deploy swarm
- Choose Standard_A0 from the list of valid VM sizes for manager and worker
- Deployment times out after ~35 minutes and manager and worker VMSS's never start
Ability to obtain client IP address on container HTTP requests
When deploying containers into Docker for Azure, it appears there is no way to obtain the original client IP address for HTTP requests. The container sees only the internal Docker network address, e.g. 10.0.x.x.
Normally, this would be handled via X-Forwarded-For headers, but by the time the request reaches an haproxy container, the source IP is already obscured.
Is there a solution?
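One commonly cited workaround (an assumption here, not confirmed for this setup) is host-mode publishing, which bypasses the routing mesh so the container sees the real source address; the trade-off is one task per node on that port:

docker service create --name web \
  --publish mode=host,target=80,published=80 \
  --mode global \
  nginx:alpine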
IPv6 support
In the same way there is a static IPv4 address routed to the load balancer it would be useful to have an IPv6 address added by default in the initial setup.
Docker-CE-Basic Cannot Be Purchased due to validation errors
Expected behavior
- Successful Azure Deployment
Actual behavior
- Error message when clicking "Purchase" of the following:
{"telemetryId":"bcf038e5-fb71-4311-8bbd-da6ed8c42f8c","bladeInstanceId":"Blade_2d169c75bd024e6a82928663cc106edb_0_0","galleryItemId":"Microsoft.Template","createBlade":"DeployToAzure","code":"MarketplacePurchaseEligibilityFailed","message":"Marketplace purchase eligibilty check returned errors. See inner errors for details. ","details":[{"code":"BadRequest","message":"Offer with PublisherId: docker, OfferId: docker-ce-basic cannot be purchased due to validation errors. See details for more information.[{\"Offer with PublisherId: docker and OfferId: docker-ce-basic not found. If this offer has been created recently, please allow upto 30 minutes for this offer to be available for Purchase. If error persists, contact support.\":\"StoreApi\"}]"},{"code":"BadRequest","message":"Offer with PublisherId: docker, OfferId: docker-ce-basic cannot be purchased due to validation errors. See details for more information.[{\"Offer with PublisherId: docker and OfferId: docker-ce-basic not found. If this offer has been created recently, please allow upto 30 minutes for this offer to be available for Purchase. If error persists, contact support.\":\"StoreApi\"}]"}]}
Information
- Tried multiple VM sizes, resource groups, etc to validate that it was nothing on my specific account
- Someone else just posted this question to the Azure forum here
Steps to reproduce the behavior
- Navigate to the Docker-CE Template on Azure
- Fill out required fields and click "Purchase"
UCP Not showing accurate disk usage
Expected behavior
UCP should have accurate indication of worker disk usage
Actual behavior
Worker disk appears full despite UCP reporting available space
Information
- Full output of the diagnostics from "docker-diagnose" run from one of the instances
OK hostname=swarm-manager000000 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-manager000001 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-manager000002 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000000 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000001 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000002 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000003 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000004 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
Done requesting diagnostics.
Your diagnostics session ID is 1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
Please provide this session ID to the maintainer debugging your issue.
Steps to reproduce the behavior
- Spin up docker cluster using beta template from #38 (worker instances are D3_V2)
- Deploy a number of services (accumulated worker images are about 14GB)
- Service deployments begin to fail with "No such image: <image-name>"
- Verify image exists in DTR and is pullable
- Log on to worker and attempt to pull image (~200MB image)
swarm-worker000003:~$ docker pull <image-name>: Pulling from <repo>
6d987f6f4279: Already exists
d0e8a23136b3: Already exists
5ad5b12a980e: Already exists
275352573fee: Pull complete
ffbeb13b7578: Pull complete
027bb24d721d: Pull complete
aa04d7355dfa: Extracting [==================================================>] 45.51MB/45.51MB
failed to register layer: Error processing tar file(exit status 1): mkdir /app/node_modules/@types/lodash/gt: no space left on device
- Check disk space from worker
swarm-worker000003:~$ df -h
Filesystem Size Used Available Use% Mounted on
overlay 29.4G 17.5G 10.4G 63% /
tmpfs 6.8G 4.0K 6.8G 0% /dev
tmpfs 6.8G 0 6.8G 0% /sys/fs/cgroup
tmpfs 6.8G 161.4M 6.7G 2% /etc
/dev/sda1 29.4G 17.5G 10.4G 63% /home
tmpfs 6.8G 161.4M 6.7G 2% /mnt
shm 6.8G 0 6.8G 0% /dev/shm
tmpfs 6.8G 161.4M 6.7G 2% /lib/firmware
/dev/sda1 29.4G 17.5G 10.4G 63% /var/log
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/ssh
tmpfs 6.8G 161.4M 6.7G 2% /lib/modules
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/hosts
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/hostname
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/resolv.conf
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/docker
tmpfs 1.4G 1.3M 1.4G 0% /var/run/docker.sock
/dev/sda1 29.4G 17.5G 10.4G 63% /var/lib/waagent
tmpfs 6.8G 161.4M 6.7G 2% /usr/local/bin/docker
/dev/sdb1 200.0G 119.0M 199.9G 0% /mnt/resource
The fact that the disk is full at all with only 14GB of data seems likely related to #19 and #29. But unlike when we experienced #38, there was no indication from the dashboard (or even from the worker instance container itself) that some underlying storage resource was full (see the df output above).
cloudstor:azure doesn't work with PostgreSQL
Original problem is described here: Azure/azurefile-dockervolumedriver#65
Persistent disk partition not showing
I'm currently experimenting with persistent storage when deploying Docker for Azure. I'm using the docker4azure image. When I attach disks to the VM scale set, I can see the disks in /dev. However, when I create a partition (I tried both fdisk and parted) the newly created partition does not show up in the /dev/ tree. I'm not quite sure why this is. I know that docker4azure is an Alpine Linux image which doesn't have something like udev, but the partition should still appear in the /dev/ tree.
The partition is listed in the dmesg output:
sdc: sdc1
But it is not available at /dev/sdc1.
I know that persistent storage with docker4azure is sort of in an experimental state, but I simply want to attach some disks and partition them. In my view this should work, but for some reason it doesn't.
docker4x/logger-azure:azure-v1.13.0-1 logs also to docker daemon
The container with image docker4x/logger-azure:azure-v1.13.0-1 presumably is responsible for writing logs to Azure storage. However, it also appears to write logs to the Docker daemon. Does this mean that the logs are duplicated in multiple places: in Docker's storage as well as in the log storage? I use logspout to send my Docker daemon logs to Elasticsearch, and I have a lot of irrelevant output from editions_logger ending up in ES.
The "create SP" container uses incorrect subscription to create resources.
Hi, see the output below. Note I chose option (3) = 2c0...bf4, but the resources were actually created in option (1) = 246...94b.
[vagrant@localhost ~]$ docker run -ti docker4x/create-sp-azure "WebFarm Deployment with campus access" docker-for-azure-test "UK South"
info: Executing command login
\info: To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code DQLD3964Z to authenticate.
-info: Added subscription Enterprise Dev/Test
info: Added subscription Visual Studio Enterprise(Converted to EA)
info: Added subscription WebFarm Deployment with campus access
info: Added subscription SLSP Microsoft Azure Enterprise
info: Setting subscription "Enterprise Dev/Test" as default
+
info: login command OK
The following subscriptions were retrieved from your Azure account
1) 2463b7e7-0abb-4617-acff-48123430594b:Enterprise_Dev/Test
2) f0e68d37-fdb9-4359-a6ea-eebb4624351d:Visual_Studio_Enterprise(Converted_to_EA)
3) 2c0a4016-8c3a-4d9c-b88f-908dc4697bf4:WebFarm_Deployment_with_campus_access
4) a02ac5a4-d8ff-4cd6-808b-c3f67ebf7afa:SLSP_Microsoft_Azure_Enterprise
Please select the subscription option number to use for Docker swarm resources: 3
Using subscription 2c0a4016-8c3a-4d9c-b88f-908dc4697bf4
Creating AD application WebFarm Deployment with campus access
Created AD application, APP_ID=a44704fd-18a7-495d-8d17-3e849557bc1a
Creating AD App ServicePrincipal
Created ServicePrincipal ID=d09aa650-4117-4a21-8515-8fb90d202e51
Create new Azure Resource Group docker-for-azure-test in UK South
info: Executing command group create
+ Getting resource group docker-for-azure-test
+ Creating resource group docker-for-azure-test
info: Created resource group docker-for-azure-test
data: Id: /subscriptions/2463b7e7-0abb-4617-acff-48123430594b/resourceGroups/docker-for-azure-test
data: Name: docker-for-azure-test
data: Location: uksouth
data: Provisioning State: Succeeded
data: Tags: null
data:
info: group create command OK
Parameterize network subnet
Currently, when using the template, the subnet is set up automatically as 10.0.0.0/8. This is extremely broad, and if other services within Azure are using any IP within that class A network, we cannot easily connect them to services running on Docker.
The subnet used by Docker for Azure should be a parameter that is filled in by the user during the setup phase.
cloudstor:azure plugin doesn't load storage correctly
We experienced issues with the cloudstor:azure plugin where the plugin didn't load the Azure storage correctly.
We have a bunch of services which use the same volume. We created the services using the docker stack deploy command.
I created a dummy container to check the loaded storage on two different nodes:
docker service create --constraint "node.id == 5ry73uzy3m4jf8p933civtbar" --mount type=volume,source=production_audio,destination=/audio --name logger --log-driver json-file alpine sh -c 'while true; do sleep 5; ls -l /audio; done'
docker service logs -f logger # no output
docker service create --constraint "node.id == qb4oajnqi8tc0wvegkr87ssmi" --mount type=volume,source=production_audio,destination=/audio --name logger --log-driver json-file alpine sh -c 'while true; do sleep 5; ls -l /audio; done'
docker service logs -f logger
logger.1.78dlp35p10kg@swarm-manager00000K | drwxrwxrwx 2 root root 0 Jul 4 09:49 projects
logger.1.78dlp35p10kg@swarm-manager00000K | drwxrwxrwx 2 root root 0 Sep 4 14:20 recordings
logger.1.78dlp35p10kg@swarm-manager00000K | drwxrwxrwx 2 root root 0 Jul 4 09:08 uploads
logger.1.78dlp35p10kg@swarm-manager00000K | drwxrwxrwx 2 root root 0 Sep 4 14:21 waveforms
We cannot reproduce this issue reliably; it seems to happen randomly, and most of the time when we create a new node on the cluster.
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
4gi5kwzwlron5y7ekdrnnynm5 swarm-manager00000E Ready Active Leader
5ry73uzy3m4jf8p933civtbar swarm-manager00000J Ready Active Reachable
hwb5qgfwqtfhko9w4y3lfsc62 * swarm-manager00000H Ready Active Reachable
qb4oajnqi8tc0wvegkr87ssmi swarm-manager00000K Ready Active Reachable
vj5ct7afr9u2syptiy3qe8nik swarm-worker000006 Ready Active
z9lqn97sub3p2og7kx8ganni4 swarm-worker000005 Ready Active
OK hostname=swarm-manager00000E session=1506678273-FLLiB0hHe2gg6PtTOE3ygphZafxPqLZX
OK hostname=swarm-manager00000H session=1506678273-FLLiB0hHe2gg6PtTOE3ygphZafxPqLZX
OK hostname=swarm-manager00000J session=1506678273-FLLiB0hHe2gg6PtTOE3ygphZafxPqLZX
OK hostname=swarm-manager00000K session=1506678273-FLLiB0hHe2gg6PtTOE3ygphZafxPqLZX
OK hostname=swarm-worker000005 session=1506678273-FLLiB0hHe2gg6PtTOE3ygphZafxPqLZX
OK hostname=swarm-worker000006 session=1506678273-FLLiB0hHe2gg6PtTOE3ygphZafxPqLZX
Done requesting diagnostics.
Your diagnostics session ID is 1506678273-FLLiB0hHe2gg6PtTOE3ygphZafxPqLZX
Are there any known issues about this behaviour? Is there a way I can check this or re-initialize the plugin?
To fix this problem we have to create a new node and delete the old one.
Installing git removes sudo and other packages
Expected behavior
Git to be installed, and no other changes.
Actual behavior
sudo, bash and other critical packages are removed. This is pretty fatal, and requires reimaging the host from the Azure portal and rejoining it to the swarm.
Information
swarm-manager000000:~$ docker-diagnose
OK hostname=swarm-manager000000 session=1488383469-FivOCmxzAQA589aYV0tiX5uTSKawYLwI
OK hostname=swarm-manager000001 session=1488383469-FivOCmxzAQA589aYV0tiX5uTSKawYLwI
OK hostname=swarm-manager000002 session=1488383469-FivOCmxzAQA589aYV0tiX5uTSKawYLwI
OK hostname=swarm-worker000000 session=1488383469-FivOCmxzAQA589aYV0tiX5uTSKawYLwI
Done requesting diagnostics.
Your diagnostics session ID is 1488383469-FivOCmxzAQA589aYV0tiX5uTSKawYLwI
Please provide this session ID to the maintainer debugging your issue.
Steps to reproduce the behavior
run sudo apk add git
The output of this is:
swarm-manager000000:~$ sudo apk add git
(1/60) Purging bash (4.3.46-r4)
Executing bash-4.3.46-r4.pre-deinstall
(2/60) Purging openssh (7.4_p1-r0)
(3/60) Purging openssh-sftp-server (7.4_p1-r0)
(4/60) Purging sudo (1.8.19_p1-r0)
(5/60) Purging gawk (4.1.4-r0)
(6/60) Purging ifupdown (0.7.53.1-r1)
(7/60) Purging net-tools (1.60_git20140218-r1)
(8/60) Purging mii-tool (1.60_git20140218-r1)
(9/60) Purging openssl (1.0.2j-r2)
(10/60) Purging parted (3.2-r5)
(11/60) Purging py2-pip (9.0.0-r0)
(12/60) Purging rsyslog (8.20.0-r1)
(13/60) Purging supervisor (3.2.0-r0)
(14/60) Purging py-meld3 (1.0.2-r0)
(15/60) Purging py-setuptools (29.0.1-r0)
(16/60) Purging python2 (2.7.13-r0)
(17/60) Purging util-linux (2.28.2-r1)
(18/60) Purging findmnt (2.28.2-r1)
(19/60) Installing busybox-initscripts (3.0-r8)
Executing busybox-initscripts-3.0-r8.post-install
(20/60) Installing libcap (2.25-r1)
(21/60) Installing chrony (2.4-r0)
Executing chrony-2.4-r0.pre-install
(22/60) Installing keyutils-libs (1.5.9-r1)
(23/60) Installing krb5-conf (1.0-r1)
(24/60) Installing libcom_err (1.43.3-r0)
(25/60) Installing libverto (0.2.5-r0)
(26/60) Installing krb5-libs (1.14.3-r1)
(27/60) Installing talloc (2.1.8-r0)
(28/60) Installing cifs-utils (6.6-r0)
(29/60) Installing dhcpcd (6.11.5-r0)
(30/60) Installing e2fsprogs-libs (1.43.3-r0)
(31/60) Installing e2fsprogs (1.43.3-r0)
(32/60) Installing e2fsprogs-extra (1.43.3-r0)
(33/60) Installing fuse (2.9.7-r0)
(34/60) Installing hvtools (4.4.15-r0)
(35/60) Installing libmnl (1.0.4-r0)
(36/60) Installing libnftnl-libs (1.0.7-r0)
(37/60) Installing iptables (1.6.0-r0)
(38/60) Installing openrc (0.21.7-r4)
Executing openrc-0.21.7-r4.post-install
(39/60) Installing strace (4.14-r0)
(40/60) Installing sysklogd (1.5.1-r0)
(41/60) Installing xz-libs (5.2.2-r1)
(42/60) Installing xz (5.2.2-r1)
(43/60) Purging readline (6.3.008-r4)
(44/60) Purging ncurses-libs (6.0-r7)
(45/60) Purging ncurses-terminfo (6.0-r7)
(46/60) Purging ncurses-terminfo-base (6.0-r7)
(47/60) Purging libssl1.0 (1.0.2j-r2)
(48/60) Purging libcrypto1.0 (1.0.2j-r2)
(49/60) Purging device-mapper-libs (2.02.168-r3)
(50/60) Purging libbz2 (1.0.6-r5)
(51/60) Purging libffi (3.2.1-r2)
(52/60) Purging gdbm (1.12-r0)
(53/60) Purging sqlite-libs (3.15.2-r0)
(54/60) Purging libestr (0.1.10-r0)
(55/60) Purging libfastjson (0.99.4-r0)
(56/60) Purging libgcrypt (1.7.3-r0)
(57/60) Purging libgpg-error (1.24-r0)
(58/60) Purging liblogging (1.0.5-r1)
(59/60) Purging libnet (1.1.6-r2)
(60/60) Purging libmount (2.28.2-r1)
Executing busybox-1.25.1-r0.trigger
Executing ca-certificates-20161130-r0.trigger
OK: 38 MiB in 50 packages
swarm-manager000000:~$ sudo
-sh: sudo: not found
Support for multiple node sizes
Currently D4A creates 2 VMSSs (one for managers, one for workers); I suggest allowing more VMSSs to be created. These new sets could be used for different environments (production/staging/etc.) or different purposes (CPU intensive/memory intensive/etc.).
Use cases
- isolate environments
- use labels to deploy containers to more suitable VMs
Expected behavior
Be able to use several VM sizes as workers
Actual behavior
Only one VM size is allowed
Additional Resources without Tags
Expected behavior
I should be able to add additional resources to the Azure Template that don't have tags.
Actual behavior
When I add a resource that doesn't have any tags on it, the swarm will come up without having cloudstor:azure plugin installed, because it is not able to determine the channel tag.
This may depend on the order that the resources are created, or just the order that the resources are listed in the resource group.
The error message I am getting is:
Traceback (most recent call last):
File "/usr/bin/aztags.py", line 47, in <module>
main()
File "/usr/bin/aztags.py", line 44, in main
print(get_tag_value(resource_client, args.tag_name))
File "/usr/bin/aztags.py", line 23, in get_tag_value
if tag_name in item.tags:
TypeError: argument of type 'NoneType' is not iterable
Skip cloudstor installation
Information
aztags.py has a function/method:
def get_tag_value(resource_client, tag_name):
    for item in resource_client.resource_groups.list_resources(RG_NAME):
        if tag_name in item.tags:
            return item.tags[tag_name]
    raise KeyError(tag_name + " Not found in any resource")
get_tag_value in aztags.py should check for null tags (please don't trust my Python skills):
def get_tag_value(resource_client, tag_name):
    for item in resource_client.resource_groups.list_resources(RG_NAME):
        if item.tags is not None and tag_name in item.tags:
            return item.tags[tag_name]
    raise KeyError(tag_name + " Not found in any resource")
Home directory still requires sudo to write
According to https://docs.docker.com/docker-for-azure/release-notes/ the latest version should no longer require sudo to write to the home directory. It still does, as it is owned by root with 755 permissions:
swarm-manager000000:~$ ls -al
total 12
drwxr-xr-x 3 root root 4096 Jan 24 15:52 .
drwxr-xr-x 3 docker docker 4096 Jan 24 15:52 ..
drwx------ 2 docker docker 4096 Jan 24 15:52 .ssh
Add the possibility to add mount option to cloudstor
Expected behavior
Add a mount option like nobrl.
Actual behavior
We can't.
Information
Grafana uses a SQLite database, and it doesn't work on a shared volume using cloudstor on Azure.
This seems to be a "common" CIFS issue related to byte-range locking behaving unexpectedly with SQLite locks. It's usually resolved by using the nobrl flag in the mount options.
Adding this parameter can be dangerous because it can corrupt the database, so it should not be enabled by default; but if we could add it on a per-mount basis, it could resolve some issues.
Proposition
We could add mount flags like volume-opt=smb_mount_param_X=... or volume-opt=smb_mount_param_Y when the flag doesn't have a value.
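For illustration, the proposed usage might look like the following hypothetical sketch (this syntax is not currently supported by cloudstor; the volume name is a placeholder):

# Hypothetical: pass nobrl through to the SMB mount for one volume only.
docker volume create -d "cloudstor:azure" \
  -o smb_mount_param_nobrl \
  grafana-data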
Thanks!
Custom Script Extension Failing to Install
Cross-posting from Azure/custom-script-extension-linux#90 for visibility.
Stable channel deployment missing latest stable version
Expected behavior
After deployment I expect to have the latest stable version (17.06.2-ce)
Actual behavior
The version deployed is 17.06.0-ce
Information
swarm-manager000001:~$ docker version
Client:
Version: 17.06.0-ce
API version: 1.30
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:15:15 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.0-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:51:55 2017
OS/Arch: linux/amd64
Experimental: false
Steps to reproduce the behavior
- deploy using azure portal
- run docker version
waagent.log getting too big after some period of time
Expected behavior
There should be some log rotation policy for waagent.log, such as:
SizeBasedTriggeringPolicy
TimeBasedTriggeringPolicy
or maybe just different log level by default.
Actual behavior
I was running out of space on my VMs and found out that waagent.log was taking almost 1/3 of my disk space (25GB); the log size was around 8GB.
Adding SSH key to authorized_keys
Hi,
I want to add another ssh key to the authorized keys of each of the nodes in my swarm so a colleague can also ssh into the swarm - is there a way to do this without logging into each node in turn?
I thought something like this might do the trick:
swarm-exec docker run -v /home/docker/.ssh:/docker-ssh bash bash -c "echo \"<PUBLIC SSH KEY>\" >> /docker-ssh/authorized_keys"
but the file remained unchanged. I tested this on my local machine and it had the correct effect, but not on the swarm.
Sorry if this isn't the correct place to ask this, let me know if I should ask elsewhere.
Cheers
Dave
Document how to do host mounts and/or backups/restores
The fact that managers and workers are running in containers creates unexpected behavior when doing things like host mounts.
For example, on a worker node:
$ cd ~docker
$ touch foobar
$ docker run -it --rm -v /home/docker/:/foo ubuntu /bin/bash
# ls -a /foo
.ssh
Presumably this is because the mount is on the host, and not from the container that executes the docker command, so the ubuntu container is seeing the VM's /home/docker and not the worker's /home/docker.
I noticed the latest version of the documentation does not even mention that the manager and workers are running inside containers themselves, so the above behavior would be very surprising to someone who does not know this.
The reason I was doing the host mount was to restore some volume data from a backup in a tar.gz. Because I was unable to do the host mount, I ended up piping the tar.gz into the restore container via standard input. Perhaps this technique, or another recommended approach for this use case, could be documented/mentioned also?
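For reference, the stdin technique amounts to something like this sketch; the volume and archive names are placeholders:

# Restore a tar.gz backup into a named volume without a host mount:
cat backup.tar.gz | docker run -i --rm -v myvolume:/restore alpine \
  tar xzf - -C /restore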
Swarm managers using up all available memory
Expected behavior
A swarm to which we can deploy multiple stacks, each consisting of 12 services
and representing a testing environment for each open Pull Request of our app, automatically provisioned.
Whenever a PR is closed, a task removes its corresponding stack from the swarm. It runs circa 400 containers.
Actual behavior
As mentioned above, except that whenever we approach the 400 containers mentioned above, the swarm drastically increases its RAM and CPU usage, until the point at which it is using all available RAM on all nodes and it is no longer responsive.
Information
Our swarm is spec'd as follows:
3 managers (Standard_D11_v2)
7 workers (Standard_D12_v2)
Full output of the diagnostics from "docker-diagnose" run from one of the instances:
Um, a bit of a problem here, as all nodes return the following on running docker-diagnose:
swarm-worker000003:~$ docker-diagnose
Error: No such object: meta-azure
A reproducible case if this is a bug, Dockerfiles FTW
Working on this...
Steps to reproduce the behavior
Provision a swarm as described above, and launch approximately 30 stacks, each made of 12 services.
After a few hours, the swarm's resource usage begins to increase drastically until it becomes nigh-unresponsive.
Any more info you require, please feel free to ask me.
This is a showstopper for us, as our testing environments cannot be properly deployed on the swarm at the moment.
Volumes with `cloudstor:azure` driver prevent setting timestamps
I've created a jenkins service:
docker service create --name jenkins \
  --mount type=volume,volume-driver=cloudstor:azure,source={{.Service.Name}}-{{.Task.Slot}}-vol,destination=/var/jenkins_home \
  -p 8080:8080 -p 4040:4040 jenkinsci/jenkins
But Jenkins stops with an exception:
SEVERE: Failed to initialize Jenkins
hudson.util.HudsonFailedToLoad: java.lang.RuntimeException: java.io.IOException: Failed to set the timestamp of /var/jenkins_home/secrets/initialAdminPassword to 1495234797271
at hudson.WebAppMain$3.run(WebAppMain.java:252)
Caused by: java.lang.RuntimeException: java.io.IOException: Failed to set the timestamp of /var/jenkins_home/secrets/initialAdminPassword to 1495234797271
at jenkins.install.InstallState$3.initializeState(InstallState.java:107)
at jenkins.model.Jenkins.setInstallState(Jenkins.java:1060)
at jenkins.install.InstallUtil.proceedToNextStateFrom(InstallUtil.java:96)
at jenkins.model.Jenkins.(Jenkins.java:950)
at hudson.model.Hudson.(Hudson.java:86)
at hudson.model.Hudson.(Hudson.java:82)
at hudson.WebAppMain$3.run(WebAppMain.java:235)
Caused by: java.io.IOException: Failed to set the timestamp of /var/jenkins_home/secrets/initialAdminPassword to 1495234797271
at hudson.FilePath$22.invoke(FilePath.java:1481)
at hudson.FilePath$22.invoke(FilePath.java:1470)
at hudson.FilePath.act(FilePath.java:997)
at hudson.FilePath.act(FilePath.java:975)
at hudson.FilePath.touch(FilePath.java:1470)
at jenkins.install.SetupWizard.init(SetupWizard.java:114)
at jenkins.install.InstallState$3.initializeState(InstallState.java:105)
... 6 more
swarm-manager000000:~$ docker version
Client:
Version: 17.05.0-ce
API version: 1.29
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 21:43:09 2017
OS/Arch: linux/amd64
Server:
Version: 17.05.0-ce
API version: 1.29 (minimum version 1.12)
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 21:43:09 2017
OS/Arch: linux/amd64
Experimental: false
No way to use a VM's attached VHD
I use D2_v2 VMs for my swarm. They offer a 100GB VHD, which I consider enough for my use case for the time being. Still, this disk isn't being used at all by Docker; instead it uses the system mount (30GB), which ends up running out of space very quickly.
Expected behavior
Docker to use the full storage extent of the VM I'm paying for (i.e: store images in the attached VHD)
Actual behavior
Docker uses the system mount (30GB) and it runs out of space pretty quickly, making it impossible for me to run services because newer images never get downloaded. Also, even if I buy VMs with larger disks, it'd make no difference since the system mount is always 30GB.
Information
swarm-manager000000:~$ docker-diagnose
OK hostname=swarm-manager000000 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
OK hostname=swarm-manager000001 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
OK hostname=swarm-manager000002 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
OK hostname=swarm-worker000000 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
OK hostname=swarm-worker000001 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
OK hostname=swarm-worker000002 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
OK hostname=swarm-worker000003 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
OK hostname=swarm-worker000004 session=1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
Done requesting diagnostics.
Your diagnostics session ID is 1496436851-iVO4EGwLG7PNWob3jI5qOnPF16rW0Les
Please provide this session ID to the maintainer debugging your issue.
This could be solved by making the attached disk the default storage location for the Docker daemon (a possible direction is sketched after the steps below). Even though this would make all images and containers be lost when the VM is reset, all services and stacks would be re-scheduled to other nodes, so it shouldn't have much impact on existing applications, and it would allow users to leverage their swarms better. Also, it'd be good to actually get the space advertised on the VM size page.
Steps to reproduce the behavior
- Create any swarm with D2_v2 size VMs
- ssh into any node and pull 30GB+ worth of images
- See yourself running out of space, even though you're supposed to have 100GB, by doing:
  3.1. cd /
  3.2. sudo du -d 1 -h -c
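A possible direction sketch, assuming the daemon options could be changed (on Docker for Azure they are baked into the image, so this is illustrative only); --data-root is the dockerd flag that relocates image and container storage:

# Point Docker's storage at the large attached disk:
dockerd --data-root /mnt/resource/docker
# or equivalently in /etc/docker/daemon.json:
# { "data-root": "/mnt/resource/docker" }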
Missing upgrade of stable channel to 17.12-ce
Hi, I want to upgrade a 17.09-ce swarm, as I'm having network issues (e.g. "Address already in use").
Expected behavior
upgrade.sh 17.12.0-ce-azure1
or
upgrade.sh 17.12.1-ce-azure1
Actual behavior
I see that there is a 17.12 release on the stable channel
https://docs.docker.com/docker-for-azure/release-notes/#stable-channel
However there doesn't seem to be a 17.12 tag in docker4x/upgrade-azure
https://hub.docker.com/r/docker4x/upgrade-azure/tags/
Actually the naming format for the upgrade tags seems to have changed
https://docs.docker.com/docker-for-azure/upgrade/#upgrading
Was: docker4x/upgrade-azure:17.06.1-ce-azure1
Now: docker4x/upgrade-azure:18.02-latest
Or are we waiting for https://github.com/docker/docker-ce/releases/tag/v17.12.1-ce
After deploy on Azure, no swarm mode is available
Expected behavior
After deploying on Azure, log in to swarm-manager and issue docker node ls to check the list of managers and nodes.
Actual behavior
An error message stating that:
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
Information
Steps to reproduce the behavior
- Deploy Docker for Azure template
- Log in to swarm-manager000000
- docker node ls
- Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
Any ideas on what might be happening here?
`/etc/sudoers` lost on virtual machine restart
I restarted a VM in a scale set. It now does not let me sudo any more:
$ sudo -i
sudo: unable to stat /etc/sudoers: No such file or directory
sudo: no valid sudoers sources found, quitting
sudo: unable to initialize policy plugin
docker.log and messages.log files fill up the first partition on a busy worker
Expected behavior
log files don't fill up the drive
Actual behavior
log files fill up the drive
Steps to reproduce the behavior
Hi @ddebroy - related to #31, @jparkerCAA and I noticed that the docker and messages logs on each worker are being written to the small 30GB partition on each host. Once we enabled our continuous integration pipeline, we filled up the drive within 2 days. It would be nice to specify a maximum file size for the various logs, and they should be relocated to the much larger secondary partition on each host.
Install docker-compose into the manager shell
The manager shell does not have docker-compose installed. This would be useful to deploy services via compose files.
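In the meantime, a workaround sketch using the docker/compose image against the mounted socket; the image tag is an assumption:

alias docker-compose='docker run --rm -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$PWD:$PWD" -w "$PWD" \
  docker/compose:1.16.1'
docker-compose up -d

Note that for swarm services, docker stack deploy --compose-file already accepts compose (v3) files.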
Update doesn't work on edge channel
I want to upgrade to Docker 17.05.0-ce but the upgrade.sh script fails.
Expected behavior
upgrade.sh https://download.docker.com/azure/edge/Docker.tmpl
Actual behavior
upgrade.sh https://download.docker.com/azure/edge/Docker.tmpl
executing upgrade on d12eb6d1e505
File "/usr/bin/azupgrade.py", line 402
subprocess.check_output(["docker", "node", "demote", node_id])
^
IndentationError: unindent does not match any outer indentation level
Information
Client:
Version: 17.04.0-ce
API version: 1.28
Go version: go1.7.5
Git commit: 4845c56
Built: Tue Apr 4 00:37:25 2017
OS/Arch: linux/amd64
Server:
Version: 17.04.0-ce
API version: 1.28 (minimum version 1.12)
Go version: go1.7.5
Git commit: 4845c56
Built: Tue Apr 4 00:37:25 2017
OS/Arch: linux/amd64
Experimental: false
We have a swarm cluster with 3 managers and 2 workers. I called the script on one manager node (not the leader).
Logger-azure should flush its buffers after some timeout
Logs are sometimes buffered for some time before ending up in the Azure storage account. I did some tests, and it looks like docker4x/logger-azure flushes the log buffer only when the buffer is full or when some content has been buffered for more than 30 seconds; but this check is done only when new logs are coming in.
It might then happen that some content in the buffer is older than 30 seconds, but because no new logs are coming in, the check is never performed and the logs stay in the buffer.
I tested this by deploying the following test container and looking when the logs arrive in the storage account. The observed lag was between 5-8 minutes.
FROM bash:4.4.12
COPY start.sh /
CMD ["/start.sh"]
#!/usr/local/bin/bash
count=0
while :; do
  echo -n "$((count++)): "
  date
  sleep 1
  if (( count > 20 )); then
    sleep 3600
  fi
done
Add container to attachable network doesn't expose ports
Expected behavior
expose ports
Actual behavior
no ports
Steps
Trying to connect a VPN container to one attachable overlay network following this workaround:
I tried to expose the port by hand in the Azure portal, but it seems it's not possible to edit it manually when it was created by D4A.