Deploying an HA Docker Swarm Cluster

January 2, 2016

Given Docker’s propensity for creating easy-to-use tools, it shouldn’t come as a surprise that Docker Swarm is one of the easier “Docker Clustering” options to understand and run. I recently built some Terraform configs for deploying a Highly Available Docker Swarm cluster on OpenStack and learned a fair bit about Swarm in the process.

This guide is meant to be a platform-agnostic how-to on installing and running a Highly Available Docker Swarm. It aims to show the ideas and concepts that may not be easy to pick up from just reading config management code.

CoreOS

The reason for using CoreOS here is that Swarm needs service discovery, both to run in High Availability mode and to support Docker networking between hosts. We can choose etcd, Consul, or ZooKeeper for this. CoreOS ships with etcd, which makes it an excellent choice for running Docker Swarm.

You will need three servers capable of running CoreOS. See the “Try Out CoreOS” section of their website for various installation methods for different infrastructure. For this guide I will use the official CoreOS Vagrant Example.

Skip the rest of this section if you install CoreOS on a different platform.

Clone down the Vagrant example:

$ git clone https://github.com/coreos/coreos-vagrant.git vagrant-docker-swarm 
Cloning into 'vagrant-docker-swarm'...
remote: Counting objects: 411, done.
remote: Total 411 (delta 0), reused 0 (delta 0), pack-reused 411
Receiving objects: 100% (411/411), 100.33 KiB | 0 bytes/s, done.
Resolving deltas: 100% (181/181), done.
Checking connectivity... done.
$ cd vagrant-docker-swarm

Edit the Vagrantfile to set $num_instances = 3:

On Unix-like systems you can do this easily with sed:

sed -i 's/\$num_instances = 1/\$num_instances = 3/' Vagrantfile

Get a new etcd discovery-url:

If you are on a Windows box and don’t have curl, you can paste the URL into a web browser to get the discovery-url.

$ curl https://discovery.etcd.io/new\?size\=3
https://discovery.etcd.io/6a9c62105f04dac40a29b90fbed322ef

Create a cloud-init file called user-data in the base of the repo using the discovery-url from above:

#cloud-config

coreos:
  etcd2:
    discovery: https://discovery.etcd.io/6a9c62105f04dac40a29b90fbed322ef
    advertise-client-urls: http://$public_ipv4:2379
    initial-advertise-peer-urls: http://$public_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://$public_ipv4:2380,http://$public_ipv4:7001
  units:
    - name: etcd2.service
      command: start
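To avoid pasting the discovery-url by hand, the file can also be generated. The following is only a minimal sketch, assuming a POSIX shell; the function name render_user_data is hypothetical and not part of the coreos-vagrant repo.

```shell
#!/bin/sh
# Hypothetical helper: print a cloud-config that mirrors the file above,
# with the etcd discovery URL passed in as the first argument.
render_user_data() {
    discovery_url=$1
    # The heredoc delimiter is quoted so that $public_ipv4 is written out
    # literally, exactly as in the hand-written file.
    sed "s|@DISCOVERY_URL@|${discovery_url}|" <<'EOF'
#cloud-config

coreos:
  etcd2:
    discovery: @DISCOVERY_URL@
    advertise-client-urls: http://$public_ipv4:2379
    initial-advertise-peer-urls: http://$public_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://$public_ipv4:2380,http://$public_ipv4:7001
  units:
    - name: etcd2.service
      command: start
EOF
}

# Example (needs network access, so it is left commented out):
#   render_user_data "$(curl -s 'https://discovery.etcd.io/new?size=3')" > user-data
```

Each cluster needs its own discovery-url, so request a fresh one whenever you rebuild the VMs from scratch.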

Start up the CoreOS VMs and log into the first one to check everything worked ok:

$ vagrant up
Bringing machine 'core-01' up with 'virtualbox' provider...
Bringing machine 'core-02' up with 'virtualbox' provider...
Bringing machine 'core-03' up with 'virtualbox' provider...
...
$ vagrant ssh core-01
$ etcdctl member list
3c5901a3db54efa3: name=f1bae7bba7714ed7b4585c6b1256ddb2 peerURLs=http://172.17.8.101:2380 clientURLs=http://172.17.8.101:2379
9eeb141350af8439: name=5c8e57890d114d7d9d7aef662033a6e0 peerURLs=http://172.17.8.103:2380 clientURLs=http://172.17.8.103:2379
ebcc652087dfe6e8: name=de426249d3b34e23a5706d99b4900665 peerURLs=http://172.17.8.102:2380 clientURLs=http://172.17.8.102:2379
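If you want to script this check rather than eyeball it, something like the following could work. It is a rough sketch that assumes the output format shown above (one line per member, each containing a peerURLs= field); both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the members in `etcdctl member list` output
# read from stdin. Each member line carries a "peerURLs=" field.
count_etcd_members() {
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # helper usable under "set -e" while still printing 0.
    grep -c 'peerURLs=' || true
}

# Hypothetical helper: succeed only if stdin lists the expected member count.
expect_etcd_members() {
    expected=$1
    actual=$(count_etcd_members)
    [ "$actual" -eq "$expected" ]
}

# Example, run on one of the CoreOS nodes:
#   etcdctl member list | expect_etcd_members 3 && echo "etcd cluster is complete"
```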

Docker Swarm

Now that we have several CoreOS servers with a working etcd cluster we can move on to setting up Docker Swarm.

We need to modify Docker to listen on TCP port 2376 and to register itself with service discovery (which will allow us to set up overlay networking later on). We do this by creating a file named custom.conf in /etc/systemd/system/docker.service.d/ on each server.

If not using Vagrant, change eth1 to match the primary interface of your server.

[Service]
Environment="DOCKER_OPTS=-H=0.0.0.0:2376 -H unix:///var/run/docker.sock --cluster-advertise eth1:2376 --cluster-store etcd://127.0.0.1:2379"

We then need to reload the systemd configuration and restart Docker for these changes to take effect.

sudo systemctl daemon-reload
sudo systemctl restart docker

Check that you can access docker via tcp on one of your hosts:

$ docker -H tcp://172.17.8.101:2376 info
Containers: 0
Images: 0
Engine Version: 1.9.1
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.3.3-coreos
Operating System: CoreOS 899.1.0
CPUs: 1
Total Memory: 997.4 MiB
Name: core-01
ID: BK64:WF3J:5JU6:VYLI:YJSO:CAQH:HPYM:MPTG:FMTA:VLE3:HSMP:F4VQ
Cluster store: etcd://127.0.0.1:2379/docker

We’re now ready to run Docker Swarm itself. There are two extra components to running Docker Swarm: a Swarm Agent and a Swarm Manager.

The Swarm Agent watches the local Docker service via its TCP port and registers it into service discovery (etcd in our case). We will run this on each server like so:

Set the --addr= argument to match the primary IP of each node.

$ docker run -d --name swarm-agent \
    --net=host swarm:latest \
        join --addr=172.17.8.101:2376 \
        etcd://127.0.0.1:2379
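Because each Docker daemon now listens on TCP port 2376, the agents can also be started from a single workstation instead of logging into every node. The following is a sketch under that assumption, using the three Vagrant IPs from this guide; swarm_join_cmd is a hypothetical helper, and it only prints the commands so they can be reviewed before anything runs.

```shell
#!/bin/sh
# Node IPs from the Vagrant example; adjust these for other platforms.
NODES="172.17.8.101 172.17.8.102 172.17.8.103"

# Hypothetical helper: print the command that starts a Swarm Agent on one node.
# The etcd address stays 127.0.0.1 because the container shares the node's
# network stack (--net=host) and every node runs its own etcd member.
swarm_join_cmd() {
    ip=$1
    printf 'docker -H tcp://%s:2376 run -d --name swarm-agent --net=host swarm:latest join --addr=%s:2376 etcd://127.0.0.1:2379\n' "$ip" "$ip"
}

# Print one command per node. Pipe the output to `sh` to actually run them.
for ip in $NODES; do
    swarm_join_cmd "$ip"
done
```

Printing first and executing second keeps the per-node --addr value visible, which is the one argument that must differ between hosts.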

The Swarm Manager watches service discovery and exposes a TCP port (2375). When a Docker client connects to that port, the manager performs the requested actions and schedules containers across the Swarm cluster.

To ensure High Availability of our cluster we’ll run a Swarm Manager on each server:

$ docker run -d --name swarm-manager \
    --net=host swarm:latest manage \
    etcd://127.0.0.1:2379

Assuming everything went smoothly, we can now access the Swarm cluster via the Swarm Manager’s TCP port on any of the servers:

$ docker -H tcp://172.17.8.101:2375 info
Containers: 6
Images: 5
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
 core-01: 172.17.8.101:2376
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 1.023 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.3.3-coreos, operatingsystem=CoreOS 899.1.0, storagedriver=overlay
 core-02: 172.17.8.102:2376
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 1.023 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.3.3-coreos, operatingsystem=CoreOS 899.1.0, storagedriver=overlay
 core-03: 172.17.8.103:2376
  └ Status: Healthy
  └ Containers: 2
  └ Reserved CPUs: 0 / 1
  └ Reserved Memory: 0 B / 1.023 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.3.3-coreos, operatingsystem=CoreOS 899.1.0, storagedriver=overlay
CPUs: 3
Total Memory: 3.068 GiB
Name: core-01
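For a quick health check, the node section of that output can be summarised with awk. This is a rough sketch that assumes the output format shown above (a "Nodes:" line plus one "Status:" line per node); swarm_health_summary is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read Swarm `docker info` output on stdin and print
# "<healthy>/<total>", using the "Nodes:" and "Status: Healthy" lines.
swarm_health_summary() {
    awk '
        /^Nodes:/         { total = $2 }     # e.g. "Nodes: 3"
        /Status: Healthy/ { healthy++ }      # one match per healthy node
        END               { printf "%d/%d\n", healthy, total }
    '
}

# Example:
#   docker -H tcp://172.17.8.101:2375 info | swarm_health_summary
```

With the output above, all three nodes report Healthy, so the counts on both sides of the slash should match.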

Our next step is to create an overlay network using the docker network command:

$ docker -H tcp://172.17.8.101:2375 network create --driver overlay my-net
614913b275dee43a63b48d08b4f5e52f7c0e531d70c63eeb8bb35624470da0c4

$ docker -H tcp://172.17.8.101:2375 network ls                            
NETWORK ID          NAME                DRIVER
86ecb0cf32c6        core-02/none        null                
c7a291ed8366        core-01/host        host                
3747364c5961        core-03/none        null                
8245d6d3ac67        core-02/host        host                
614913b275de        my-net              overlay             
61ead145e9dd        core-01/bridge      bridge              
c9457c4f4588        core-03/bridge      bridge              
b8a6c75cb3b9        core-03/host        host                
bdc4d5ccd778        core-02/bridge      bridge              
66afdc892361        core-01/none        null

Finally, we’ll create a container on one host and then check that it is accessible from another:

Replace the node==XXXX argument with the hostname of one of your hosts. Make sure to use a different node for each docker command.

$ docker -H tcp://172.17.8.101:2375 run -d \
    --name=web --net=my-net \
    --env="constraint:node==core-01" nginx
e0fe18c946a5692806608f939d4d6f31c670e3f42bf3942a77142bed2095983e

$ docker -H tcp://172.17.8.101:2375 run -it --rm \
    --net=my-net \
    --env="constraint:node==core-02" busybox wget -O- http://web
Connecting to web (10.0.0.2:80)
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

If you’ve been following along, you have successfully deployed a Highly Available Docker Swarm cluster. From here you could put a load balancer in front of the Swarm Manager port (2375) or even use round-robin DNS.
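Until a load balancer or DNS entry is in place, a client can simply try each manager in turn. The following is a minimal sketch of that idea and not a replacement for a real load balancer; pick_manager, probe_with_docker, and the PROBE variable are hypothetical names, and the addresses are the Vagrant ones used in this guide.

```shell
#!/bin/sh
# Swarm Manager endpoints from this guide; adjust for other platforms.
MANAGERS="172.17.8.101:2375 172.17.8.102:2375 172.17.8.103:2375"

# Default probe: a manager counts as reachable if `docker info` succeeds.
probe_with_docker() {
    docker -H "tcp://$1" info >/dev/null 2>&1
}

# Name of the probe function to use. Override PROBE to exercise the
# selection logic without a running cluster.
PROBE=${PROBE:-probe_with_docker}

# Hypothetical helper: print the first manager that answers the probe,
# or return non-zero if none of them does.
pick_manager() {
    for m in $MANAGERS; do
        if "$PROBE" "$m"; then
            printf 'tcp://%s\n' "$m"
            return 0
        fi
    done
    return 1
}

# Example:
#   DOCKER_HOST=$(pick_manager) docker ps
```

The Docker client reads DOCKER_HOST from the environment, so the selected endpoint applies without passing -H on every command.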

You may have noticed that there is no authentication or authorization on any of this: anybody with a Docker binary and TCP access to your hosts could spin up Docker containers. This is fairly easily fixed by using Docker’s TLS certificate-based authentication.

To learn how to secure both Docker and Docker Swarm with TLS, read the follow-up post, Secure Docker with TLS.