Building healthier containers

Intro

Containers are nothing like virtual machines!

Now that we’ve cleared that up, this post will try to shed some light on:

  • How containers, as we know them, came to exist
  • Major differences between containers and virtual machines
  • Examples of how to build minimal containers
  • Demystifying the scratch container
  • Examples of how to debug running containers using other containers
  • Benefits of minimal containers
  • Tools that help build minimal containers

Disclaimer

I’ll be using Docker throughout this post, as it’s the most widely used, but these concepts should apply to other container runtimes such as rkt, lxd or containerd.

It’s all about abstraction

When virtual machine hypervisors started their rise, they provided full virtualization or paravirtualization: fancy names for either virtualizing all of the hardware or using special drivers on the guest to interact more efficiently with the real machine (the host). Either way, both guest and host had a full operating system copy, including their own kernel, libraries, tools and so on.

With containers (jails, zones, etc.), the host and the “guest” share the same kernel to achieve process isolation. Eventually, a set of nifty new Linux kernel features called cgroups(7) (CPU, memory, disk I/O, network, etc.) and namespaces(7) (mnt, pid, net, ipc, uts and user) appeared to better restrict and enforce that isolation.
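
If you want to get a feel for what these namespaces do, here’s a minimal sketch you can try on a Linux host (assuming util-linux’s unshare(1) is installed and you have root):

# Start a shell in new PID and mount namespaces, with a freshly mounted /proc
sudo unshare --fork --pid --mount-proc /bin/sh
# Inside that shell, ps only shows processes from the new namespace,
# with the shell itself running as PID 1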

It used to be a very daunting task to manage those kernel features, so tools were created to abstract that complexity. LXC was the first I used and spent more time with. It wasn’t very user-friendly, but it got the job done.

Apparently, it was so cool that some folks created an abstraction layer over it to make it trivial for anyone. I first saw that abstraction showcased back in 2013, in this talk by Solomon Hykes, an engineer working for a company called dotCloud, nowadays known as Docker.

And the rest is history. Eventually Docker dropped the need for LXC; it now handles the kernel feature abstraction directly (libcontainer) and has an entire ecosystem for container management.

But it still looks like a Virtual Machine to me

I can understand why we compare containers to virtual machines. They “feel” the same, and that’s great. But keep in mind virtual machines need their own kernel, init system, drivers, etc., and containers just use the host’s kernel to isolate processes (preferably, one process per container).

So, why are people shipping an entire operating system’s worth of libraries and system tooling inside a container, generating massive images full of stuff that will never be used?

The container runtime provides the basic filesystem and kernel features for your application to run, which means you can focus on your application and benefit from the advantages of a minimal container.

I’ve prepared a few examples to help materialize these concepts.

Meet busybox

Busybox is a very handy binary: it performs different functions depending on the name it’s called by. We’ll use it as our example application:

mkdir -p /tmp/container/bin
cd /tmp/container/bin
curl -LO https://busybox.net/downloads/binaries/1.27.1-i686/busybox
chmod +x ./busybox
# If you want all the things
# for T in $(./busybox --list); do ln -s busybox $T; done
ln -s busybox ls
ln -s busybox sleep
# Pack the whole tree into a tar file: that's our container image
cd /tmp/container
tar -cvf /tmp/container.tar .

Now that we have a shiny new tar file (our container image) containing just a binary and a couple of symlinks, with no kernel or extra junk, it’s time to import it:

# myapp is just a tag
docker import /tmp/container.tar myapp

Now things start to get interesting. Let’s run our myapp container and do a simple ls:

docker run -ti myapp /bin/ls -lah
total 16
drwxr-xr-x    1 0        0           4.0K Dec 28 15:41 .
drwxr-xr-x    1 0        0           4.0K Dec 28 15:41 ..
-rwxr-xr-x    1 0        0              0 Dec 28 15:41 .dockerenv
drwxr-xr-x    2 501      0           4.0K Dec 28 15:39 bin
drwxr-xr-x    5 0        0            360 Dec 28 15:41 dev
drwxr-xr-x    2 0        0           4.0K Dec 28 15:41 etc
dr-xr-xr-x  133 0        0              0 Dec 28 15:41 proc
dr-xr-xr-x   13 0        0              0 Dec 28 15:41 sys

Where did all that stuff come from? Shouldn’t it only have /bin?

There are differences between a container image and that same image at runtime: before your process starts, the runtime mounts pseudo-filesystems like /proc, /sys and /dev, and Docker injects files such as /etc/hosts, /etc/resolv.conf and .dockerenv. The Open Container Initiative (OCI) libcontainer spec explains it quite nicely.
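
If you want to see that for yourself, here’s a quick sketch: busybox’s mount applet essentially reads /proc/mounts, so it will list everything that was mounted before our process started:

# List the mounts set up for the container at runtime
docker run --rm myapp /bin/busybox mount
# Expect entries for /, /proc, /sys, /dev, /etc/hosts, /etc/resolv.conf, ...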

That’s sorcery, I want my Dockerfile back

Sure, whatever floats your boat.
You probably heard about the scratch container. Let’s build our own and call it zero:

cd /tmp/container

# Create and import an empty container image, just like scratch
touch zero.tar
docker import zero.tar zero

# Generate a Dockerfile for our app
cat <<EOF> Dockerfile
FROM zero
COPY bin/ /bin/
EOF

# Build and run our app
docker build . -t myapp-zero
docker run -ti myapp-zero /bin/ls -lah
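
For the record, Docker already ships a reserved, empty image that behaves just like our zero image: scratch. Here’s a sketch of the same build using it, with no manual import needed (the Dockerfile.scratch name is just an example):

cat <<EOF > Dockerfile.scratch
FROM scratch
COPY bin/ /bin/
EOF

docker build -f Dockerfile.scratch -t myapp-scratch .
docker run -ti myapp-scratch /bin/ls -lah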

Spoiler alert

If we’re on the same page, you’re probably realizing that all this fuss around containers is really fuss around tar files, right?

🤔

I demand a shell

One could argue that a shell is mandatory for debugging. Obviously strace has to be present too, and what if I need to copy files to and from the container? Maybe run an SSH daemon?

Well, let me make this crystal clear: you don’t!

Since namespaces are one of the underlying building blocks of containers, you can use nsenter(1) to run programs inside the namespaces of other processes.
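
Here’s a minimal sketch of what that looks like from the host (requires root and assumes iproute2 is installed on the host; the container ID is whatever docker ps gives you):

# Find the host PID of the container's main process
PID=$(docker inspect --format '{{.State.Pid}}' <container-id>)

# Enter only its network namespace and inspect it with the host's own tools
sudo nsenter --target "$PID" --net ip addr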

If that’s the case, why not share the same PID/NET namespaces between containers, effectively sharing those resources?

For instance, you could build a toolkit container with all the tools one could ever need and attach it to a container that doesn’t even have a shell.

I, for one, did exactly that. And we’ll be using it in this example:

# Instantiate myapp using sleep to keep it up
docker run -tid myapp /bin/sleep 600

# Get the hash of the running myapp
CONTAINER_HASH=$(docker ps | grep myapp | grep Up | awk '{print $1}')

# Copy something to myapp from the host
touch SOMETHING
docker cp SOMETHING $CONTAINER_HASH:/bin

# Attach a toolkit container to myapp
docker run -it \
  --pid=container:$CONTAINER_HASH \
  --net=container:$CONTAINER_HASH \
  --cap-add sys_admin \
  kintoandar/toolkit

Now we’re in a bash shell inside the toolkit container, attached to the running myapp. Let’s look around.

root@d10ba9eb50c7:/# ps aux
PID   USER     TIME   COMMAND
    1 root       0:00 /bin/sleep 600
    6 root       0:00 bash
   15 root       0:00 ps aux

We can see the sleep process is running as PID 1, but where’s the myapp filesystem?

root@d10ba9eb50c7:/# ls -lah /proc/1/root/bin/
total 908K
drwxr-xr-x 1  501 root 4.0K Dec 29 19:13 .
drwxr-xr-x 1 root root 4.0K Dec 29 19:13 ..
-rw-r--r-- 1  501 root    0 Dec 29 19:13 SOMETHING
-rwxr-xr-x 1  501 root 900K Dec 29 18:43 busybox
lrwxrwxrwx 1  501 root    7 Dec 29 18:43 ls -> busybox
lrwxrwxrwx 1  501 root    7 Dec 29 18:43 sleep -> busybox

So, do you still think someone needs a shell and all those tools on the myapp container?

I can think of someone who would: if, for example, a remote code execution vulnerability were found in the application, that malicious someone would love to have a shell lying around, and maybe some useful tools like curl or wget.

With that said, let’s strive to reduce the attack surface of our containers. As a bonus, you’ll get:

  • Less network bandwidth required to move container images around
  • Less storage requirements due to image size
  • Less IOPS needed due to image size
  • Less software equals fewer vulnerabilities to scan, manage, patch, upgrade…
  • Faster build times
  • Faster ship times

Dependency hell

I get it: it’s hard to manage all the dependencies of a real application and completely detach it from the operating system where it was built. But rest assured, plenty of people feel the same pain, and the community is here to help.

Here are some pointers to make things less painful.

If you want complete control over what’s inside your container and don’t want to depend on prebuilt packages (rpm, deb, etc.), just use buildroot.
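
As a rough sketch, assuming you’re inside an unpacked buildroot tree and keep the default tar root filesystem image enabled:

# Select only the packages your application actually needs
make menuconfig
# Build the root filesystem (lands in output/images/rootfs.tar by default)
make

# Import the generated root filesystem as a container image
docker import output/images/rootfs.tar my-minimal-base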

For more buzzwordy tools, I recommend this talk by Michael Ducy.

That’s it, that’s all

Well, not quite; this is just the beginning. There are a lot of standards and implementations evolving and being adopted (OCI, CNI, CRI, etc.).

All of them improve the ecosystem around containerization, allowing everyone to step in and contribute.

Containers are here to stay and understanding what makes them tick is no longer optional.
