In late 2017, the Kata Containers* project was announced. This project is in development now and is part of an emerging group of technologies based around the concept of wrapping container workloads in extremely lightweight virtual machines (VMs). These technologies combine the speed and flexibility of containers with the isolation and security of VMs, making them ideal candidates for busy multitenant deployments.
Meanwhile in the network function virtualization (NFV) space, containerization is proving to be an interesting alternative to full-function VMs for virtual network functions (VNFs).
These areas are combining at a fortuitous time. This article introduces the idea of lightweight virtualized containers for NFV usage, explaining how they fit into existing container technology. It also describes how integrating NFV-friendly technologies like the Data Plane Development Kit (DPDK) and Vector Packet Processing (VPP) is not too heavy a lift for NFV developers and operators.
Demystifying Container Management
The Kata Containers project descends from two other successful virtualized container projects: Intel® Clear Containers and the Hyper.sh* container runtime, runv. Launched under the governance of the OpenStack* Foundation, it combines the best parts of both projects with contributions from anyone who wants to pitch in, with the goal of building the best hypervisor-driven container workload stack. The project is currently in development, with the first release anticipated in the first half of 2018. For now (as of this writing), if you'd like to experiment with lightweight virtualized container workloads, we recommend trying out an existing runtime, for example, Intel Clear Containers.
Let's take a look at the components of a container management system, because the parts, especially what's called a runtime, can be a bit overwhelming in the face of rapid development and change. The diagram in Figure 1 comes from the Intel Clear Containers project. We'll reference it as we work through the various generic components.
Figure 1. Components of one container system (Docker* and Intel® Clear Containers).
The Components of Container Management
The OCI specifications
Before we get into the actual components, it is important to understand the Open Container Initiative (OCI) specifications, which all of the projects under discussion adhere to. There are two major specifications: runtime and image.
The image specification is the easiest one to understand. It defines how to package and distribute a container image that runs a specific workload. Whenever you use docker pull to fetch a container image from the Internet or even from a local container image registry, you are likely fetching an OCI-compliant image. The key takeaway is that OCI container images should run via OCI-compliant runtimes regardless of the underlying container technology.
The runtime specification takes a bit more explanation. It does not specify the API or command set that is used to launch and manage containers. That is the province of the container management system, such as Docker*, or CRI-O (more about this later). It also doesn't specify the technology used to launch container workloads, such as Linux* cgroups or Virtual Machine Managers (VMMs). What it does specify is the set of characteristics of a runtime as a program that launches and destroys containers, independent of either operating system or hardware. A working implementation of this specification was provided to the OCI by Docker, in the form of runc.
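To make the runtime specification a little more concrete, here is a minimal Python sketch of the container lifecycle it describes. The operation names (create, start, kill, delete, state) and the status values (created, running, stopped) follow the OCI runtime specification as we understand it. The ToyRuntime class itself is hypothetical: it only tracks state in memory and launches nothing, and it simplifies kill by treating the container as stopped immediately.

```python
class ToyRuntime:
    """Hypothetical in-memory model of the OCI runtime lifecycle.

    No real containers are launched; only the state transitions are modeled.
    """

    def __init__(self):
        self._status = {}  # container id -> status string

    def create(self, container_id):
        if container_id in self._status:
            raise ValueError(f"container {container_id!r} already exists")
        self._status[container_id] = "created"

    def start(self, container_id):
        # Only a created container may be started.
        self._require(container_id, "created")
        self._status[container_id] = "running"

    def kill(self, container_id):
        # Simplification: a real runtime sends a signal and the process
        # exits later; here the container is marked stopped right away.
        self._require(container_id, "created", "running")
        self._status[container_id] = "stopped"

    def delete(self, container_id):
        # Only a stopped container may be deleted.
        self._require(container_id, "stopped")
        del self._status[container_id]

    def state(self, container_id):
        return self._status[container_id]

    def _require(self, container_id, *allowed):
        current = self._status.get(container_id)
        if current not in allowed:
            raise RuntimeError(
                f"{container_id!r} is {current!r}, expected one of {allowed}"
            )


if __name__ == "__main__":
    runtime = ToyRuntime()
    runtime.create("web")
    runtime.start("web")
    print(runtime.state("web"))
    runtime.kill("web")
    runtime.delete("web")
```

Whether these operations end up manipulating cgroups or booting a VM is exactly the detail that the specification leaves to the implementation.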
The runc runtime was capable of taking Docker-issued commands and running Linux cgroup-based containers (normal or bare-metal containers). Other container management and orchestration systems, notably Kubernetes* and the CoreOS rkt* project, developed OCI-compliant abstracted runtimes of their own. The abstraction meant that while these runtimes implemented the OCI-specified independence from the operating system and hardware, they also went a step further and abstracted the container, network, and volume implementations.
Figure 2. Container runtimes of multiple varieties.
Abstracted runtimes have led to a much more open container management ecosystem. Docker has also developed an abstracted runtime (our term, not an industry standard) called containerd. It is visible in Figure 1, the diagram of an Intel Clear Containers implementation. Kubernetes' abstracted runtime is known as CRI-O. There is yet another abstracted runtime in development called cri-containerd that aims to unify Docker and Kubernetes management into a single runtime.
If the abstracted runtimes implement the OCI spec for container management and orchestration systems but abstract away the specifics of container technology, another runtime will be needed to actually launch and destroy containers. These spec-compliant runtimes (again, not an industry term) tend to be aimed at launching and managing a particular kind of container technology. This is where the overloading of the term runtime can get rather confusing.
Examples of this type of runtime are the original runc implementation from Docker, the cc-runtime from Intel Clear Containers, runv from Hyper.sh, and several more. The future Kata Containers runtime will also fall into this category and will work with several of the abstracted runtimes previously mentioned.
Up until this point, we have been discussing runtimes. Abstracted runtimes generally consist of a single process running as a daemon on the host that is launching containers. The specification-compliant runtime that actually launches container processes generally does its job once per container, and then exits. Shims are launched per-container as well and maintain the small number of open communication channels that are needed to keep a container in contact with the management system and available for use. They exit when the container exits.
In Figure 1, two shims are shown: containerd-shim and the Intel Clear Containers shim. In this instance, the containerd-shim is accustomed to working with runc to set up the I/O communication with the container process. Since it is not natively set up to work with cc-runtime, the Intel Clear Containers shim is required to broker this interaction.
The Intel Clear Containers shim, called cc-shim, forms a connection between the abstracted runtime and the proxy (see below), which is a necessary component of VM-based container implementations. Since containerd doesn't have a native method of interacting with the proxy, cc-shim or its equivalent in other systems brokers this communication.
In general, a shim component in a container management system performs this kind of translation between other components, on a per-container, persistent basis.
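As a rough illustration of the per-container, persistent role described above, the following Python sketch starts one workload process, relays its output to a caller-supplied callback, and exits when the workload exits. The function name run_shim and the callback are hypothetical. Real shims such as containerd-shim and cc-shim also handle things this sketch ignores, such as standard input and signal forwarding.

```python
import subprocess
import sys


def run_shim(argv, report):
    """Hypothetical shim: run one workload, relay its output, return its exit code.

    `report` stands in for the channel back to the management system.
    """
    with subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    ) as proc:
        # Stay alive for as long as the workload produces output.
        for line in proc.stdout:
            report(line.rstrip("\n"))
    # Leaving the `with` block waits for the process, so returncode is set.
    return proc.returncode


if __name__ == "__main__":
    relayed = []
    workload = [sys.executable, "-c", "print('hello from the workload')"]
    exit_code = run_shim(workload, relayed.append)
    print(relayed, exit_code)
```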
The agent is a unique component of VM-based container systems. It is a daemon that runs inside each container VM, and its purpose is to configure the VM on boot to load and run the container workload correctly. It also maintains communication with the proxy.
Container management systems that work directly with Linux kernel cgroups (normal containers) can set up I/O channels, networking devices, volume mounts, and so on without needing to communicate with a different operating system running inside a VM. Therefore, they do not need a proxy. VM-based systems do need a proxy to handle inside-the-VM configuration and structures. For example, mounting a persistent volume requires both external preparation (configuring the hypervisor for the virtual volume device) and internal preparation (the volume mount). The proxy communicates with the agent to handle internal configuration items.
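The two-sided nature of the volume example can be sketched in a few lines of Python. Everything here is hypothetical: the class names, the request format, and the device naming are ours and do not reflect the actual Intel Clear Containers proxy or agent protocol. The sketch only shows the ordering of the external step (hypervisor) and the internal step (agent, reached through the proxy).

```python
class ToyHypervisor:
    """Hypothetical stand-in for host-side hypervisor configuration."""

    def __init__(self):
        self.devices = []

    def attach_block_device(self, host_path):
        self.devices.append(host_path)
        # Illustrative guest device name: first device "vda", then "vdb", ...
        return "vd" + chr(ord("a") + len(self.devices) - 1)


class ToyAgent:
    """Hypothetical stand-in for the daemon running inside the container VM."""

    def __init__(self):
        self.mounts = {}  # mount point inside the VM -> guest device

    def handle(self, request):
        if request["op"] == "mount":
            self.mounts[request["target"]] = request["device"]
            return {"ok": True}
        return {"ok": False, "error": "unknown op " + request["op"]}


class ToyProxy:
    """Hypothetical proxy: the only path from the host to the agent."""

    def __init__(self, agent):
        self._agent = agent

    def send(self, request):
        return self._agent.handle(request)


def add_volume(hypervisor, proxy, host_path, target):
    # External preparation: expose the volume to the VM as a virtual device.
    device = hypervisor.attach_block_device(host_path)
    # Internal preparation: ask the agent, via the proxy, to mount it.
    reply = proxy.send({"op": "mount", "device": device, "target": target})
    if not reply["ok"]:
        raise RuntimeError(reply["error"])
    return device


if __name__ == "__main__":
    agent = ToyAgent()
    device = add_volume(ToyHypervisor(), ToyProxy(agent), "/srv/data.img", "/data")
    print(device, agent.mounts)
```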
Component: hypervisor and virtual machine
The hypervisor/VMM and VM used in a container system are specialized. The VM needs to have a highly tuned, lightweight kernel that boots in milliseconds, instead of the more common full operating systems in ordinary VMs that can take several seconds (or longer) to boot. To achieve this, the hypervisor is tuned to strip out any and all pieces of device emulation or passthrough that are not useful for the container's operation. For example, only one type of network device needs to be probed for, since the VM kernel will only support one device type. Another example is that CD-ROM drives do not need to be probed for in a container VM.
This is how lightweight container VMs are created and why they function at very close to parity with bare-metal cgroup-based containers. Only the most relevant and needed portions of the VM system are retained. Intel Clear Containers also works with some additional capabilities like Kernel Samepage Merging (KSM) to further speed operation. KSM merges identical memory pages, so read-only content that is shared by all the containers on the system, such as the container kernel, occupies a single memory range on the host.
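If you want to see whether KSM is active on a Linux host, the kernel exposes counters under /sys/kernel/mm/ksm. The following Python sketch reads three of them; the function name read_ksm_counters is ours. It returns None when the directory is missing, for example on a kernel built without KSM or on a non-Linux system.

```python
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")


def read_ksm_counters(ksm_dir=KSM_DIR):
    """Return selected KSM counters as integers, or None if unavailable."""
    # run: nonzero when KSM is enabled
    # pages_shared: number of merged pages currently in use
    # pages_sharing: number of additional references to those merged pages
    names = ("run", "pages_shared", "pages_sharing")
    counters = {}
    for name in names:
        try:
            counters[name] = int((ksm_dir / name).read_text().strip())
        except (OSError, ValueError):
            return None
    return counters


if __name__ == "__main__":
    counters = read_ksm_counters()
    if counters is None:
        print("KSM counters are not available on this system")
    else:
        print(counters)
```

A pages_sharing value that is much larger than pages_shared suggests that many identical pages, such as those of a shared container kernel, have been merged.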
Component summary: composability
There are many different moving parts in a container management system. To some degree this is due to the history of how containers came to be popularized and how the various dividing lines have broken down over this history. In general, a goal of many containerization projects is composability, meaning that each of these components can swap in different binaries without reducing or breaking the capability of the overall system. In reality, things are not quite there yet.
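One place where composability is already visible is the Docker daemon configuration, which accepts a map of named runtimes and an optional default. The Python sketch below builds such a configuration fragment. The keys runtimes, path, and default-runtime are Docker daemon settings as we understand them; the binary location /usr/bin/cc-runtime is an assumption and depends on how the runtime was installed.

```python
import json


def docker_runtime_config(name, path, make_default=False):
    """Build a daemon configuration fragment that registers one extra runtime."""
    config = {"runtimes": {name: {"path": path}}}
    if make_default:
        config["default-runtime"] = name
    return config


if __name__ == "__main__":
    # Assumed install path; adjust to match the local installation.
    fragment = docker_runtime_config(
        "cc-runtime", "/usr/bin/cc-runtime", make_default=True
    )
    print(json.dumps(fragment, indent=2))
```

With a runtime registered this way, the rest of the Docker workflow stays the same, and the swap is confined to one configuration entry.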
In the next section, we'll see how one element of composability makes NFV-friendly workloads not only possible, but also relatively simple to implement in a virtualized container system.
So, What about NFV?
Here is an interesting fact: most of the systems and components that we've described in the previous section are written in Go*. There are good reasons for that, but for the NFV world, the real benefit is that container systems that are written in Go can utilize the virtcontainers Go language library to handle networking and volume connections.
Virtcontainers is now a sub-project of Kata Containers. It was brought into that project from Intel Clear Containers, for which virtcontainers was originally developed. Therefore, both Intel Clear Containers and the forthcoming Kata Containers will link against virtcontainers.
Here is the important part: virtcontainers natively supports:
- SR-IOV (Single-Root I/O Virtualization) via vfio devices
- DPDK poll-mode and vhost devices
- FD.io VPP
These technologies are critical for the NFV industry. Providing these capabilities out-of-the-box makes it that much easier for NFV to take the leap from cgroup-based containers to VM-based containers.
A Closer Look at Container Networking
Virtcontainers provides support for both the Container Network Model (CNM) and the Container Network Interface (CNI). Docker uses CNM for plug-in-based networking in its container system. The CNI does the same for CoreOS* and Kubernetes.
Let's take a high-level look at how the CNI works with a VM-based containerization system (see Figure 3).
Figure 3. Container Network Interface (CNI) implementation for virtual-machine-based containers.
As shown in the figure, the generic runtime here is the per-container specification-compliant runtime, that is, cc-runtime, runv, or the to-be-named Kata Containers runtime. The CNI implementation, libcni, is a part of this runtime.
In step 1, the runtime creates the blue-bordered network namespace, which should be a reasonably familiar feature to NFV operators. This namespace contains all devices associated with the VM. In step 2, the configuration required for the container is read from the CNI configuration files, which is where information specific to the plug-in will be obtained.
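For readers who have not seen one, a CNI network configuration is a small JSON document. The Python sketch below parses an example for the bridge plug-in and checks for the keys that the CNI specification requires (cniVersion, name, and type). The network name, subnet, and helper name load_cni_conf are illustrative values of ours, not taken from any particular deployment.

```python
import json

# Illustrative configuration for the bridge plug-in; values are examples only.
EXAMPLE_CONF = """
{
  "cniVersion": "0.3.1",
  "name": "examplenet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16",
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}
"""


def load_cni_conf(text):
    """Parse a CNI network configuration and check its required keys."""
    conf = json.loads(text)
    missing = [key for key in ("cniVersion", "name", "type") if key not in conf]
    if missing:
        raise ValueError(f"CNI configuration is missing keys: {missing}")
    return conf


if __name__ == "__main__":
    conf = load_cni_conf(EXAMPLE_CONF)
    # "type" names the plug-in that will be invoked for this network.
    print(conf["name"], "uses plug-in", conf["type"])
```

The type field is what selects the plug-in, so switching a network from bridge to an NFV-oriented plug-in is a configuration change rather than a runtime change.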
The plug-ins for CNI are how networking is actually implemented for all containers on the host system. Native interface plug-ins are available, such as bridge, ptp (veth pair), vlan, and so on. Currently, Intel Clear Containers doesn't support all interface plug-ins, but Kata Containers aims to support all of them. There is also a wide variety of meta-plug-ins and many different types of third-party plug-ins. These plug-ins are how NFV-friendly technologies like those previously mentioned are implemented for CNM/CNI. For example, SR-IOV and DPDK-vhostuser plug-ins are each available from their own repositories.
All of this is part of the CNI static configuration on the host. Nothing changes for the parts of the networking system that we're setting up for the container, regardless of the plug-in configuration. To continue the outlined process, in step 3 the runtime will communicate with the configured plug-in to start network service for the container. A device is created, in this case cni0, and a veth pair is set up between that device and the container's network namespace.
From here, the rest is plumbing for the VM. In step 4, a bridge inside the namespace is created, a tap device is plumbed to the bridge for the VM to use with standard virtio drivers, and the previous veth pair endpoint is plumbed to the bridge as well. With that path for traffic established, in step 5 the VM and container workload are started inside the network namespace.
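To relate the steps to tools that NFV operators already know, the Python sketch below lists roughly equivalent iproute2 commands for the namespace, veth pair, bridge, and tap device. It only builds the command lines and does not run them. The device names are hypothetical, the host bridge cni0 is assumed to exist already, and bringing the links up is omitted. The actual runtimes perform this plumbing programmatically rather than by calling ip, so treat this purely as an illustration.

```python
def plumbing_commands(netns, host_bridge="cni0", veth_host="veth-host",
                      veth_ns="veth-ns", vm_bridge="br0", tap="tap0"):
    """Return iproute2 command lines approximating the plumbing steps.

    All device names are illustrative defaults, not names used by any runtime.
    """
    in_ns = ["ip", "netns", "exec", netns]
    return [
        # Step 1: create the network namespace for the container VM.
        ["ip", "netns", "add", netns],
        # Step 3: veth pair between the host bridge and the namespace.
        ["ip", "link", "add", veth_host, "type", "veth", "peer", "name", veth_ns],
        ["ip", "link", "set", veth_host, "master", host_bridge],
        ["ip", "link", "set", veth_ns, "netns", netns],
        # Step 4: bridge and tap device inside the namespace, joined to the veth end.
        in_ns + ["ip", "link", "add", vm_bridge, "type", "bridge"],
        in_ns + ["ip", "tuntap", "add", "dev", tap, "mode", "tap"],
        in_ns + ["ip", "link", "set", tap, "master", vm_bridge],
        in_ns + ["ip", "link", "set", veth_ns, "master", vm_bridge],
    ]


if __name__ == "__main__":
    for command in plumbing_commands("container1"):
        print(" ".join(command))
```

The tap device created in step 4 is what the hypervisor would hand to the VM as its virtio network interface when the VM starts in step 5.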
Container technology continues to be an exciting area of development for data centers and for NFV. Later this year, a Kata Containers release will be available that implements industry-standard lightweight VM-based containers. This will offer the security and isolation of VMs with the speed and flexibility of containers, using the same container management tools.
Until Kata has a release, Intel Clear Containers is available to try out the technology, and most of what we've discussed is available in that project.
NFV developers and operators can take advantage of these systems quickly since NFV-friendly technologies are baked in and are independently available as plug-ins to the CNM and CNI networking interfaces used in Docker, Kubernetes, and other container management and orchestration systems.
About the Author
Jim Chamings is a senior software apps engineer in the Developer Relations Division at Intel. He works with cloud and NFV developers, operators, and other industry partners to help people get the most out of their data centers and cloud installations.