Chapter 1: Understanding Docker Image Layers


Before you can optimise a Docker image, you need a precise mental model of what an image actually is and how its size accumulates. This chapter builds that model. Understanding it makes every subsequent technique feel obvious rather than arbitrary.


What Is a Docker Image?

A Docker image is not a single file. It is an ordered sequence of read-only layers, each representing a filesystem change. When you run a container, Docker adds a thin writable layer on top — the container layer — but the image layers beneath it are never modified.

The filesystem technology that makes this work is called a union filesystem. On Linux, Docker’s default storage driver, overlay2, is built on OverlayFS, which presents the stacked layers as a single unified view. Files in higher layers shadow files in lower layers; if a file exists in layer 3 and layer 7, the container sees layer 7’s version.

Container (writable)          ← ephemeral, discarded on rm
─────────────────────
Layer 4: COPY . /app          ← your source code
Layer 3: RUN pip install ...  ← your dependencies
Layer 2: RUN apt-get install  ← system packages
Layer 1: Base image (Ubuntu)  ← OS foundation

Each layer stores only the diff relative to the layer below it — the files added, modified, or marked for deletion.


How Layers Are Created

Every Dockerfile instruction that modifies the filesystem (chiefly FROM, RUN, COPY, and ADD) creates a new layer.

Instructions that do not modify the filesystem — ENV, ARG, EXPOSE, LABEL, WORKDIR, CMD, ENTRYPOINT — create metadata but do not add layer size.
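To make the distinction concrete, here is a short hypothetical Dockerfile (the image tag, file names, and app are illustrative) annotated with which instructions produce filesystem layers and which only record metadata:

```dockerfile
# Filesystem layers: the base image's own layer stack
FROM python:3.12-slim

# Metadata only: no layer size added
ENV PYTHONUNBUFFERED=1
WORKDIR /app

# New layer: a single small file
COPY requirements.txt /app/

# New layer: the installed packages
RUN pip install -r /app/requirements.txt

# Metadata only
EXPOSE 8000
CMD ["python", "app.py"]
```

Running docker history on the resulting image shows nonzero sizes only for the base image, the COPY, and the RUN.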


The Whiteout Problem

This is the single most important concept for understanding why naive cleanup commands do not reduce image size.

Suppose you write this Dockerfile:

FROM ubuntu:22.04
RUN apt-get update && apt-get install -y build-essential
RUN apt-get purge -y build-essential && apt-get autoremove -y

You might expect the final image to be roughly the size of the base Ubuntu image, since you installed and then removed the same package. It is not. The build-essential files are baked into layer 2. Layer 3 does not delete them — it adds whiteout entries that hide them from the union filesystem view, but the layer data is still present in the image and still downloaded when the image is pulled.
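The mechanics can be illustrated with a toy model in plain Python (a sketch of the lookup rule, not of OverlayFS itself): each layer is a dict mapping paths to content, a whiteout is a special marker, and the merged view is resolved bottom-up so later layers win. Note that the hidden file still occupies space in the layer that created it.

```python
WHITEOUT = object()  # marker meaning "this path is deleted from the merged view"

def merged_view(layers):
    """Resolve a stack of layers (listed bottom to top) into the visible filesystem."""
    view = {}
    for layer in layers:                 # apply bottom-up; higher layers shadow lower
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)     # hidden from view, but still stored below
            else:
                view[path] = content
    return view

layer1 = {"/etc/os-release": "ubuntu"}
layer2 = {"/usr/bin/gcc": "<binary>"}    # apt-get install build-essential
layer3 = {"/usr/bin/gcc": WHITEOUT}      # apt-get purge: adds a whiteout entry

print(merged_view([layer1, layer2, layer3]))  # gcc is invisible to the container
print(sum(len(l) for l in [layer1, layer2, layer3]))  # but all 3 entries still exist
```

The purge layer makes the image larger, not smaller: it adds a whiteout entry on top of data that is still shipped with every pull.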

The only way to prevent a file from being in the final image is to never create it in a layer that is committed to the image. This is why cleanup must happen in the same RUN instruction that creates the temporary files: once an instruction’s layer is committed, its contents are permanent.
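Applied to the apt-get example above, the standard remedy is to install, use, and remove within a single RUN instruction, so the compiler never reaches a committed layer (the make step is a hypothetical stand-in for whatever actually needs build-essential):

```dockerfile
FROM ubuntu:22.04
# Install, build, and clean up in one instruction: one layer, no leftover files
RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential \
 && make -C /src install \
 && apt-get purge -y build-essential \
 && apt-get autoremove -y \
 && rm -rf /var/lib/apt/lists/*
```

Because the removal happens before the layer is committed, build-essential contributes nothing to the final image.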


Inspecting Layers

Docker provides several commands for examining an image’s layer structure.

docker history

docker history --no-trunc myapp:latest

This prints each layer, the instruction that created it, and its size. The --no-trunc flag shows the full command without truncation. This is your first diagnostic tool when an image is larger than expected.

docker image inspect

docker image inspect myapp:latest

Returns JSON with RootFS.Layers (the list of layer digests) and Size (total uncompressed size in bytes). Useful for scripting.
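A minimal scripting sketch, using a canned JSON snippet in place of live docker output (the digests and size are made up; the field names match the real inspect format):

```python
import json

# Stand-in for the output of: docker image inspect myapp:latest
inspect_output = json.loads("""
[{
  "RootFS": {"Layers": ["sha256:aaa", "sha256:bbb", "sha256:ccc"]},
  "Size": 187654321
}]
""")

image = inspect_output[0]                    # inspect returns a JSON array
layer_count = len(image["RootFS"]["Layers"])
size_mb = image["Size"] / 1_000_000          # Size is uncompressed bytes

print(f"{layer_count} layers, {size_mb:.1f} MB")
```

In a real script you would feed this the output of `docker image inspect`, then track layer_count and size_mb across builds to catch regressions.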

dive (Chapter 8)

The dive tool provides an interactive terminal UI that shows which files are in each layer, which files were modified across layers (wasted space), and an efficiency score. It is the most practical tool for identifying where size is coming from.


Layer Caching

Layer caching is Docker’s mechanism for avoiding redundant work during repeated builds. It is also the reason that layer ordering matters for build speed.

Docker computes a cache key for each layer from the parent layer’s key, the instruction text itself, and, for COPY and ADD, a checksum of the copied files’ contents.

If a layer’s key matches a cached layer, Docker reuses the cache and skips execution. If it does not match, because the instruction changed or a file referenced by COPY or ADD changed, Docker executes that layer and invalidates the cache for all subsequent layers.
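A simplified model of this chaining (illustrative only; Docker’s real implementation differs in detail): each key mixes in the parent key, so a change anywhere propagates to every layer after it.

```python
import hashlib

def cache_key(parent_key: str, instruction: str, file_bytes: bytes = b"") -> str:
    """Toy cache key: chains the parent key, instruction text, and copied content."""
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(instruction.encode())
    h.update(file_bytes)   # for COPY/ADD the file contents matter, not just the path
    return h.hexdigest()

base = cache_key("", "FROM python:3.12-slim")
deps = cache_key(base, "COPY requirements.txt /app/", b"flask==3.0\n")
inst = cache_key(deps, "RUN pip install -r /app/requirements.txt")

# Editing requirements.txt changes the COPY key, and because the pip install
# key includes its parent, that layer is invalidated too:
deps2 = cache_key(base, "COPY requirements.txt /app/", b"flask==3.1\n")
inst2 = cache_key(deps2, "RUN pip install -r /app/requirements.txt")
print(inst != inst2)
```

The RUN instruction text never changed, yet its cache entry is lost, purely because its parent key changed.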

This has a direct implication for how you order your Dockerfile instructions:

# BAD: source code copied before dependencies
# Any source change invalidates the pip install cache
COPY . /app
RUN pip install -r /app/requirements.txt

# GOOD: dependency manifest copied first
# pip install cache is only invalidated when requirements.txt changes
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt
COPY . /app

In the “good” version, pip install only re-runs when requirements.txt changes. In the “bad” version, it re-runs on every single source code change.


Key Takeaways

- A Docker image is an ordered stack of read-only layers; each layer stores only the diff against the layer below it.
- Deleting a file in a later layer adds a whiteout entry; it never removes data from earlier layers or shrinks the image.
- To keep a file out of the final image, create and remove it within the same instruction.
- Order Dockerfile instructions from least to most frequently changing to preserve the build cache.
- Use docker history, docker image inspect, and dive to find where an image’s size actually comes from.