Creating your own linux containers
Last edited on Jul 10, 2016
I'm not trying to say that this solution is better than LXC or docker. I'm just doing this because it is very simple to get a basic container created with chroot and cgroups. Of course, docker provides many more features than this, but this really is the basis. It's easy to make containers in linux, depending on how many features you need.
cgroups are a way to run a process while limiting its resources, such as CPU time and memory. The way it works is by creating a group (a cgroup), defining the limits for various resources (you can control more than just memory and CPU time) and then running a process under that cgroup. It is important to know that every child process of that process will also run under the same cgroup. So if you start a bash shell under a cgroup, any program that you manually launch from that shell will be constrained to the same cgroup as the shell.
If you create a cgroup with a memory limit of 10 MB, this limit applies to all processes running in the cgroup. The 10 MB is a constraint on the sum of the memory used by all processes under the same cgroup.
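One way to see this inheritance for yourself is to compare the cgroup membership of the shell with that of a child it spawns. This is only a minimal sketch: it assumes a Linux /proc that exposes the per-process "cgroup" file, and the helper name cgroup_of is made up for the illustration.

```shell
#!/bin/sh
# Print the cgroup membership of a given PID, as reported by the kernel.
# Assumes /proc/<pid>/cgroup is available (Linux with cgroups enabled).
cgroup_of() {
    cat "/proc/$1/cgroup" 2>/dev/null
}

# Membership of this shell.
parent_membership=$(cgroup_of "$$")

# Membership of a descendant: "self" resolves to the process that reads
# the file, which was started from this shell.
child_membership=$(sh -c 'cat /proc/self/cgroup 2>/dev/null')

if [ "$parent_membership" = "$child_membership" ]; then
    echo "child is in the same cgroup(s) as its parent"
else
    echo "child is in a different cgroup"
fi
```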
On slackware 14.2 RC2, I didn't have to install or set up anything: cgroups were available to use with the tools already installed. I had to manually enable cgroups in the kernel though, since I compiled my own kernel. Without going into the details (because this is covered on a million other websites) you need to make sure that:
- cgroup kernel modules are built in or loaded as modules
- cgroup tools are installed
- cgroup filesystem is mounted (normally accessible through /sys/fs/cgroup/)
Here's how to run a process in a cgroup:
    cgcreate -g memory:testgroup
    # now the "/sys/fs/cgroup/memory/testgroup" exists and contains files that control the limits of the group

    # assign a limit of 4mb to that cgroup
    echo "4194304" > /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes

    # run bash in that cgroup
    cgexec -g memory:testgroup /bin/bash
Note that instead of using cgexec, you could also just write the current shell's PID into
Making your own containers
A container is nothing more than a chroot'd environment with processes confined in a cgroup. It's not difficult to write your own software to automate the environment setup. A "chroot" system call already exists. For cgroups, I was wondering if there were any system calls available to create them. Using strace while running cgcreate, I found out that cgcreate only manipulates files in the cgroup file system. I then got the rest of the information I needed from the documentation file located in the Documentation folder of the linux kernel source: Documentation/cgroups/cgroups.txt.
Creating a cgroup
To create a new cgroup, it is simply a matter of creating a new directory under the folder of each submodule that the cgroup needs to control. For example, to create a cgroup that controls memory and cpu usage, you just need to create a directory "AwesomeControlGroup" under /sys/fs/cgroup/memory and /sys/fs/cgroup/cpu. These directories will automatically be populated with the files needed to control the cgroup (cgroup fs is a virtual filesystem, so the files do not exist on a physical medium).
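As a rough sketch of that step, the helper below creates one directory per submodule. The function name create_cgroup and the CGROUP_ROOT variable are made up for this example; CGROUP_ROOT only exists so the function can be pointed at a scratch directory instead of the real cgroup filesystem.

```shell
#!/bin/sh
# Root of the cgroup filesystem. Overridable (hypothetical variable) so
# the sketch can be tried against a scratch directory without root.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# create_cgroup NAME SUBMODULE...
# Creates one directory per submodule. On a real cgroup filesystem the
# kernel then populates each directory with its control files.
create_cgroup() {
    name=$1
    shift
    for submodule in "$@"; do
        mkdir -p "$CGROUP_ROOT/$submodule/$name" || return 1
    done
}
```

Run as root against the real filesystem, "create_cgroup AwesomeControlGroup memory cpu" would correspond to the example in the paragraph above.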
Configuring a cgroup
To configure a cgroup, it is just a matter of writing parameters in the relevant file. For example: /sys/fs/cgroup/memory/testgroup/memory.limit_in_bytes
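Writing and reading a parameter could be wrapped as below. This is a sketch under the same assumptions as the rest of this post (one directory per submodule under /sys/fs/cgroup); the function names and the CGROUP_ROOT variable are hypothetical.

```shell
#!/bin/sh
# Overridable root (hypothetical variable), so the sketch can be tried
# against a scratch directory.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# set_cgroup_param SUBMODULE NAME FILE VALUE
# Writes VALUE into the given control file of the cgroup.
set_cgroup_param() {
    printf '%s\n' "$4" > "$CGROUP_ROOT/$1/$2/$3"
}

# get_cgroup_param SUBMODULE NAME FILE
# Prints the current content of the given control file.
get_cgroup_param() {
    cat "$CGROUP_ROOT/$1/$2/$3"
}
```

For the 4 MB limit used earlier, the call would be "set_cgroup_param memory testgroup memory.limit_in_bytes 4194304".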
Running a process in a cgroup
To run a process in a cgroup, you need to launch the process, take its PID and write it under /sys/fs/cgroup/
- The launcher creates a cgroup and sets relevant parameters.
- The launcher calls clone() instead of fork(). Now we have a parent and a child.
- The parent waits for the child to die, after which it will destroy the cgroup.
- The child writes its PID in the /sys/fs/cgroup/ / /task file for all submodules (memory, cpu, etc)
- At this point, the child runs as per the cgroup's constraints.
- The child invokes execv with the application that the user wanted to have invoked in the container.
The reason I use clone() instead of fork() is that clone() accepts the CLONE_NEWPID flag. This creates a new PID namespace that isolates the new process from the others that exist on the system. Indeed, when the cloned process queries its PID, it finds that it is assigned PID 1. Doing a "ps" does not list the other processes running on the system, since this new process is isolated.
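The launcher steps can be approximated from the shell. The sketch below is not my C++ launcher: a child "sh" stands in for clone(), so no PID namespace is created, and the names run_in_cgroup and CGROUP_ROOT are invented for the example. It also assumes the cgroup v1 "tasks" file. To get the namespace part from the shell, something like util-linux's "unshare --pid --fork" could probably be wrapped around the child, but that needs root and is not shown here.

```shell
#!/bin/sh
# Overridable root (hypothetical variable) for trying this in a scratch dir.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# run_in_cgroup NAME "SUBMODULES" COMMAND [ARGS...]
run_in_cgroup() {
    name=$1
    submodules=$2          # space-separated list, e.g. "memory cpu"
    shift 2

    # 1. Create the cgroup under every requested submodule.
    for s in $submodules; do
        mkdir -p "$CGROUP_ROOT/$s/$name" || return 1
    done

    # 2. Start a child. It adds its own PID to each cgroup, then replaces
    #    itself with the requested command (exec keeps the same PID).
    status=0
    sh -c '
        root=$1; name=$2; submodules=$3; shift 3
        for s in $submodules; do
            echo "$$" > "$root/$s/$name/tasks" || exit 1
        done
        exec "$@"
    ' sh "$CGROUP_ROOT" "$name" "$submodules" "$@" || status=$?

    # 3. The child has exited: remove the cgroup (best effort).
    for s in $submodules; do
        rmdir "$CGROUP_ROOT/$s/$name" 2>/dev/null || true
    done
    return "$status"
}
```

As root, "run_in_cgroup testgroup memory /bin/bash -i" would give an interactive shell confined to the group, with limits set beforehand or from another terminal.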
Destroying a cgroup
To destroy a cgroup, just delete the /sys/fs/cgroup/
So interfacing with cgroups from userland is just a matter of manipulating files in the cgroup file system. It is really easy to do programmatically, and also from the shell, without any special tools or libraries.
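For completeness, here is a sketch of the removal step, using the same hypothetical CGROUP_ROOT variable and an invented function name. It uses rmdir rather than a recursive delete: the control files are virtual and cannot be unlinked one by one, and the kernel refuses to remove a cgroup that still has processes in it.

```shell
#!/bin/sh
# Overridable root (hypothetical variable) for trying this in a scratch dir.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# destroy_cgroup NAME SUBMODULE...
# Removes the cgroup directory under each submodule. Fails if the cgroup
# is still in use or does not exist.
destroy_cgroup() {
    name=$1
    shift
    for submodule in "$@"; do
        rmdir "$CGROUP_ROOT/$submodule/$name" || return 1
    done
}
```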
My container application
My "container launcher" is a c++ application that chroots in a directory and runs a process under a cgroup that it creates. To use it, I only need to type "./a.out container_path". The container_path is the path to a container directory that contains a "settings" file and a "chroot" directory. The "chroot" directory contains the environment of the container (a linux distribution maybe?) and the "settings" file contains settings about the cgroup configuration and the name of the process to be launched.
You can download my code: cgroups.tar
I've extracted the slackware initrd image found in the "isolinux" folder of the dvd.
    cd /tmp/slackware/chroot
    gunzip < /usr/src/slackware64-current/isolinux/initrd.img | cpio -i --make-directories
Extracting this in /tmp/slackware/chroot gives me a small linux environment in that directory, and I've created a settings file in /tmp/slackware. Call this folder a "container": it contains a whole linux environment under the chroot folder and a settings file that indicates under what user the container should run, what process it should start, how much RAM it can use at most, etc. For this example, my settings file looks like this:
    user: 99
    group: 98
    memlimit: 4194304
    cpupercent: 5
    process:/bin/bash
    arg1:-i
And running the container gives me:
    [12:49:34 root@pat]# ./a.out /tmp/slackware
    Mem limit: 4194304 bytes
    CPU shares: 51 (5%)
    Added PID 9337 in cgroup
    Dropping privileges to 99:98
    Starting /bin/bash -i
    [17:26:10 nobody@pat:/]$ ps aux
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    nobody       1  0.0  0.0  12004  3180 ?        S    17:26   0:00 /bin/bash -i
    nobody       2  0.0  0.0  11340  1840 ?        R+   17:26   0:00 ps aux
    [17:26:14 nobody@pat:/]$ ls
    a  a.out  bin  boot  cdrom  dev  etc  floppy  init  lib  lib64  lost+found  mnt  nfs  proc  root  run  sbin  scripts  sys  tag  tmp  usr  var
    [17:26:15 nobody@pat:/]$ exit
    exit
    Exiting container
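The "CPU shares: 51 (5%)" line is consistent with scaling the percentage against 1024, the default value of cpu.shares in cgroup v1. The sketch below reads a "key: value" settings file and does that conversion. It is a shell illustration, not the code of the C++ launcher; the function names are invented, and the 1024 base is an assumption inferred from the output. Keep in mind that cpu.shares is a relative weight rather than a hard cap.

```shell
#!/bin/sh
# read_setting FILE KEY
# Prints the value stored for KEY in a "key: value" settings file
# (the space after the colon is optional, as in "process:/bin/bash").
read_setting() {
    sed -n "s/^$2:[[:space:]]*//p" "$1" | head -n 1
}

# percent_to_shares PERCENT
# Scales a percentage against 1024 shares (assumed base, integer division).
percent_to_shares() {
    echo $(( $1 * 1024 / 100 ))
}
```

With the settings file shown earlier, "percent_to_shares "$(read_setting /tmp/slackware/settings cpupercent)"" gives 51.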
When cloning with CLONE_NEWNET, the child process gets a separate network namespace. It doesn't see the host's network interfaces anymore. So in order to get networking enabled in the container, we need to create a veth pair. I am doing all network interface manipulations with libnl (which was already installed on a stock slackware installation). The veth pair acts as a kind of tunnel between the host and the container. The host side interface is added to a bridge so that it can be part of another LAN. The container side interface is assigned an IP, and the container is then able to communicate with all peers that are on the same bridge. The bridge could be used to connect the container to the LAN, or to a local network that consists only of other containers from a select group.
The launcher creates an "eth0" that appears in the container. The other end of the veth pair is added to an OVS bridge. An IP address is set on eth0 and the default route is set. I then have full networking functionality in my container.
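My launcher does this through libnl, but the same sequence can be sketched with command-line tools. The function below only prints the commands it would run, so it can be inspected safely; piping its output to "sh" as root would apply them. The function name and the interface naming scheme are made up, and the use of ip, nsenter and ovs-vsctl is my assumption about an equivalent, not what the launcher executes. The example values come from this post's networking settings and sample run.

```shell
#!/bin/sh
# plan_container_network PID BRIDGE ADDRESS GATEWAY
# Prints the commands that wire a container (identified by the PID of its
# first process) to an OVS bridge through a veth pair.
plan_container_network() {
    pid=$1
    bridge=$2
    address=$3
    gateway=$4
    host_if="veth$pid"     # hypothetical naming scheme, host side
    cont_if="vethc$pid"    # temporary name of the container side

    cat <<EOF
ip link add $host_if type veth peer name $cont_if
ip link set $cont_if netns $pid
ovs-vsctl add-port $bridge $host_if
ip link set $host_if up
nsenter -t $pid -n ip link set $cont_if name eth0
nsenter -t $pid -n ip addr add $address dev eth0
nsenter -t $pid -n ip link set eth0 up
nsenter -t $pid -n ip route add default via $gateway
EOF
}

plan_container_network 14230 br0 192.168.1.228/24 192.168.1.1
```

For a plain Linux bridge instead of OVS, the third command would become "ip link set veth14230 master br0".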
Settings for networking
    bridge: br0
    ip: 192.168.1.228/24
    gw: 192.168.1.1
    Mem limit: 4194304 bytes
    CPU shares: 51 (5%)
    Added PID 14230 in cgroup
    Dropping privileges to 0:0
    Starting /bin/bash -i
    [21:59:33 root@container:/]# ping google.com
    PING google.com (184.108.40.206): 56 data bytes
    64 bytes from 220.127.116.11: seq=0 ttl=52 time=22.238 ms
    ^C
    --- google.com ping statistics ---
    1 packets transmitted, 1 packets received, 0% packet loss
    round-trip min/avg/max = 22.238/22.238/22.238 ms
    [21:59:36 root@container:/]# exit
    exit
    Exiting container