Docker images are “supposed” to be small and fast. However, unless you’re precompiling Go binaries and dropping them into the busybox image, they can get quite large and complicated. Without a well-constructed Dockerfile to improve build cache hits, your Docker builds can become unnecessarily slow.
Dockerfiles are regularly (and incorrectly) treated like bash scripts, and are therefore often written out as the same series of commands you would curl | sudo bash from a website to install. This usually makes for an inefficient and slow Dockerfile.
When you’re building a new Dockerfile for an application, there can be a lot of trial and error in determining what packages are needed and what commands need to run. Optimizing your Dockerfile ensures that the build cache will hit more often and that each build between changes will be faster.
The general rule of thumb is to sort your commands by how often they change, how long they take to run, and how shareable they are with other images.
This means that commands like WORKDIR, CMD, and ENV should go towards the bottom, while a RUN apt-get -y update should go towards the top, as it takes longer to run and can be shared with all of your images.
Finally, any ADD commands (or other commands that invalidate the cache) should go as far towards the bottom as possible, as this is where you’re likely to make lots of changes that will invalidate the cache of subsequent commands.
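As a rough sketch of that ordering (the curl package, the APP_HOME variable, and the /app/bin/start path are placeholders for illustration, not a recommendation):

```dockerfile
# Slow and widely shareable: goes first so its cached layer is reused.
FROM ubuntu:trusty
RUN apt-get -y update \
    && apt-get -y install curl

# Cheap and only occasionally tweaked: goes near the bottom.
ENV APP_HOME /app
WORKDIR /app
CMD ["/app/bin/start"]

# Changes on almost every build: goes last.
ADD . /app
```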
There are a lot of base images to choose from, ranging from bare OS images like ubuntu:trusty to application-specific ones like python:2 or java:7. Common sense might tell you to use ruby:2 to run a Ruby-based app and python:3 to run a Python app. However, you now have two base images with little in common that you need to download and build. If you instead use ubuntu:trusty for both, you only need to download the base image once.
Each command in a Dockerfile adds an extra layer. You can very quickly end up with an image that has 30+ layers. This is not necessarily a problem, but by joining RUN commands together and using a single EXPOSE line to list all of your open ports, you can reduce the number of layers.
By grouping RUN commands together intelligently you can share more layers between containers. Of course, if you have a common set of packages across multiple containers, you should look at creating a separate base image containing them that all of your images are built from.
Each layer that you can share across multiple images saves disk space, because the host only stores it once.
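A minimal sketch of that idea, assuming a hypothetical image name example/base that you build and tag yourself (the package choices are illustrative too). First the shared base:

```dockerfile
# base/Dockerfile: built once and tagged as example/base (hypothetical name).
FROM ubuntu:trusty
RUN apt-get -yq update \
    && apt-get -yqq install \
       curl \
       git
```

Then each application image starts from it, so the layers above are stored only once on the host:

```dockerfile
# app/Dockerfile: reuses every layer of example/base.
FROM example/base
RUN apt-get -yqq install apache2
```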
If you use volume containers, don’t bother trying to save space by using a small image. Use the image of the application you’ll be serving data to. If you do that and docker commit the data volume, you not only have your data committed to the container, but the actual application as well, which is very useful for debugging.
If you’ve built an image and discover when you run it that there’s a package missing, add it to the bottom of your Dockerfile rather than to the RUN apt-get command at the top. This means you can rebuild the image faster, because every cached layer above the new line is reused. Once your image is correct and working, you can reorganize your Dockerfile to clean such changes up before committing it to source control.
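For example, while iterating you might temporarily end up with something like this (sqlite3 simply stands in for whichever package turned out to be missing):

```dockerfile
FROM ubuntu:trusty
RUN apt-get -yq update \
    && apt-get -yqq install \
       apache2 \
       python-dev

ADD . /app

# Temporary: found missing at run time. Everything above stays cached,
# so only this layer is built. Fold it into the install at the top once
# the image works.
RUN apt-get -yqq install sqlite3
```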
A Dockerfile for installing Graphite would look something like this if it was written like a bash script:
FROM ubuntu:trusty
MAINTAINER Paul Czarkowski "paul@paulcz.net"
RUN apt-get -yq update
# Apache
RUN \
apt-get -yqq install \
apache2 \
apache2-utils \
libapache2-mod-python \
python-dev \
python-pip \
python-cairo \
python-pysqlite2 \
python-mysqldb \
python-jinja2 \
sqlite3 \
curl \
wget \
git \
software-properties-common
RUN \
curl -sSL https://bootstrap.pypa.io/get-pip.py | python && \
pip install whisper \
carbon \
graphite-web \
'Twisted<12.0' \
'django<1.6' \
django-tagging
# Add start scripts etc
ADD . /app
RUN mkdir -p /app/wsgi
RUN useradd -d /app -c 'application' -s '/bin/false' graphite
RUN chmod +x /app/bin/*
RUN chown -R graphite:graphite /app
RUN chown -R graphite:graphite /opt/graphite
RUN rm -f /etc/apache2/sites-enabled/*
ADD ./apache-graphite.conf /etc/apache2/sites-enabled/apache-graphite.conf
# Expose ports.
EXPOSE 80
EXPOSE 2003
EXPOSE 2004
EXPOSE 7002
ENV APACHE_CONFDIR /etc/apache2
ENV APACHE_ENVVARS $APACHE_CONFDIR/envvars
ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_RUN_DIR /var/run/apache2
ENV APACHE_PID_FILE $APACHE_RUN_DIR/apache2.pid
ENV APACHE_LOCK_DIR /var/lock/apache2
ENV APACHE_LOG_DIR /var/log/apache2
WORKDIR /app
# Define default command.
CMD ["/app/bin/start_graphite"]
However, an optimized version of this same Dockerfile, based on what was discussed earlier, would look like the following:
# 1 - Common Header / Packages
FROM ubuntu:trusty
MAINTAINER Paul Czarkowski "paul@paulcz.net"
RUN apt-get -yq update \
&& apt-get -yqq install \
wget \
curl \
git \
software-properties-common
# 2 - Python
RUN \
apt-get -yqq install \
python-dev \
python-pip \
python-pysqlite2 \
python-mysqldb
# 3 - Apache
RUN \
apt-get -yqq install \
apache2 \
apache2-utils
# 4 - Apache ENVs
ENV APACHE_CONFDIR /etc/apache2
ENV APACHE_ENVVARS $APACHE_CONFDIR/envvars
ENV APACHE_RUN_USER www-data
ENV APACHE_RUN_GROUP www-data
ENV APACHE_RUN_DIR /var/run/apache2
ENV APACHE_PID_FILE $APACHE_RUN_DIR/apache2.pid
ENV APACHE_LOCK_DIR /var/lock/apache2
ENV APACHE_LOG_DIR /var/log/apache2
# 5 - Graphite and Deps
RUN \
apt-get -yqq install \
libapache2-mod-python \
python-cairo \
python-jinja2 \
sqlite3
RUN \
pip install whisper \
carbon \
graphite-web \
'Twisted<12.0' \
'django<1.6' \
django-tagging
# 6 - Other
EXPOSE 80 2003 2004 7002
WORKDIR /app
VOLUME /opt/graphite/data
# Define default command.
CMD ["/app/bin/start_graphite"]
# 7 - First use of ADD
ADD . /app
# 8 - Final setup
RUN mkdir -p /app/wsgi \
&& useradd -d /app -c 'application' -s '/bin/false' graphite \
&& chmod +x /app/bin/* \
&& chown -R graphite:graphite /app \
&& chown -R graphite:graphite /opt/graphite \
&& rm -f /etc/apache2/sites-enabled/* \
&& mv /app/apache-graphite.conf /etc/apache2/sites-enabled/apache-graphite.conf
Section 1 is our most shareable layer. All the images running on the same host should start with this. You can see I’ve added a few things like curl and git which, while not strictly needed, are useful for debugging, and because they’re in such a shareable layer they don’t take up much room.
Sections 2 and 3 are where we get to our language specifics. I’ve included the Python and Apache sections together here because it’s not super clear which should go first.
If we put Python first, then any other image that uses Apache gets a few free Python packages. If we put Apache first, then we could have a Ruby app that also includes that layer and gets Apache for free (hell, you can just give it Python for free anyway).
I’m calling the Apache ENVs in section 4 out separately for a few reasons.
Firstly, they should come directly after the Apache section so that it’s easier to make them common (and cached) between multiple images. You might not think it matters since calls like ENV are so cheap, but I have seen random ENV calls take 10 seconds or so. If you have a lot of them, it’s good to keep them cached, but you also don’t want a changed ENV to invalidate the cache of installing Apache.
They’re a pretty good example of something you might want to start with at the bottom of your Dockerfile and move higher once you’re unlikely to change them again.
Secondly, I really wish Docker provided a way to specify multiple ENVs on the same line so that I could reduce the number of layers I end up with.
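On Docker releases that accept the ENV key=value form, several variables can share one instruction. The sketch below assumes such a release and reuses the Apache variables from section 4. It needs two instructions because substitution inside a single ENV instruction only sees values set before that instruction:

```dockerfile
# Sketch only: assumes a Docker release that accepts ENV key=value pairs.
# Base values first.
ENV APACHE_CONFDIR=/etc/apache2 \
    APACHE_RUN_USER=www-data \
    APACHE_RUN_GROUP=www-data \
    APACHE_RUN_DIR=/var/run/apache2 \
    APACHE_LOCK_DIR=/var/lock/apache2 \
    APACHE_LOG_DIR=/var/log/apache2

# Derived values go in a second instruction so that $APACHE_CONFDIR and
# $APACHE_RUN_DIR are already defined when they are substituted.
ENV APACHE_ENVVARS=$APACHE_CONFDIR/envvars \
    APACHE_PID_FILE=$APACHE_RUN_DIR/apache2.pid
```

That turns eight ENV layers into two.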
Section 5 contains some Graphite-specific apt and pip packages. You could combine them into a single command by joining them with &&, but I kept them separate so that if the pip package requirements change, the build won’t need to fetch the apt packages again.
Section 6 contains a bunch of cheap commands like EXPOSE and VOLUME. They’re probably less likely to change than the previous package installs, but they are also cheaper to run, so it’s less important if their cache is invalidated.
Keep them towards the bottom though, as you don’t want any changes to them to invalidate the cache for a more costly command.
Section 7 is the first use of ADD. You should wait until the last possible moment to use the ADD command, as any commands after it are re-run whenever the added files change.
In section 8 I have grouped the final commands into a single layer, and they come after the ADD command as they manipulate files that come from the ADD.
Hopefully this has given you some insight into how to build a better Dockerfile. These are all things I have learned from experience in building my own Docker images, and while they may not apply to all situations (or may be flat-out wrong), they definitely seem to improve my development experience.