Building Containers in Pure Bash and C

The Future of Containers in the Enterprise

February 10, 2016

New York

So yeah, I'm Jessie, I work for Docker. I'm a Docker maintainer. And I'm just gonna go over the basics of what a container is really made out of, because it's not actually a physical thing in itself. It's more a combination of different ingredients that are baked into the Linux kernel, and we use those to create this concept of a container.

Cool. So what is a container? The building blocks for a container are Linux namespaces and control groups. Namespaces limit what the process sees. There are a few different types: the pid namespace, net namespace, mnt namespace, uts namespace (I'm gonna go over each of these individually), ipc, and user. They are created with clone or unshare.

The files for them are in /proc, under the pid and then ns (so /proc/<pid>/ns), and that's kinda how you keep track of what's running. When the last process in a namespace dies, the namespace is destroyed with it. You can re-enter a namespace with nsenter, which we will also show.
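To make the /proc layout concrete, here is a minimal C sketch that reads one of those namespace files. The helper name ns_identity is made up for illustration and is not part of the talk's clone.c. The only thing it relies on is that each entry under /proc/<pid>/ns is a symbolic link whose target names the namespace type and an inode number.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Read the namespace identity of a process from /proc/<pid>/ns/<type>.
 * `pid` is a string such as "self" or "1234"; `type` is e.g. "net", "uts",
 * "ipc", "pid", "mnt" or "user". On success the link target, which looks
 * like "pid:[4026531836]", is stored in `out` and 0 is returned.
 * ns_identity is a hypothetical helper name.
 */
int ns_identity(const char *pid, const char *type, char *out, size_t len)
{
    char path[128];
    int n = snprintf(path, sizeof path, "/proc/%s/ns/%s", pid, type);
    if (n < 0 || (size_t)n >= sizeof path || len == 0)
        return -1;

    ssize_t got = readlink(path, out, len - 1);
    if (got < 0)
        return -1;          /* no such process or namespace type */
    out[got] = '\0';        /* readlink does not NUL-terminate */
    return 0;
}
```

Two processes that report the same value for a given type share that namespace; a process started in a new namespace reports a different number.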

So net namespaces are where the network interfaces are. It's like your own private network interface area. You can have like a local one, you can name it whatever you want; most people use like [unclear] and then clone them into a container. There's a separate routing table, so basically, whatever network setup you have on your host, when you create a new net namespace you get an entirely new set of interfaces that you create yourself.

So okay, this is going to be a lot of back and forth between demo and talking. The first thing we're going to need is this file that I have, and I'll just show it. It's at this short URL, I guess, if you are following along. It just gives you a C file for cloning things, and we can open that up.

I also have a cheat sheet cuz I'm going to forget things. Cool, so the first one we are going to do is a net namespace. Just to show, on my host, with ip a, I have all these interfaces: I'm obviously connected to WiFi, there is my docker bridge, and then I have this veth interface from probably some container that's running somewhere that I forgot about. Oh, Chrome, okay. So that's cool and all. And in this clone.c, it really just takes these clone flags, which right now we're going to leave blank so that it won't clone a new namespace, so that I can show you what happens.

It will call clone, wait for the pid to come up, and run a command in the namespace. It's pretty simple. This is the part where it's running the command, so it's just exec-ing. Cool. So I'll compile this and name it net. And remember, right now the only clone flag is SIGCHLD, so it's not going to create a new net namespace yet.
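The clone.c shown on screen is not reproduced in the transcript, so what follows is only a sketch of a program with the same shape, under the assumption that it does what is described: call clone with a set of flags, exec a command in the child, and wait for it. The names run_in_namespaces and child_exec are illustrative.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes clone() and the CLONE_NEW* flags */
#endif
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Runs in the child: replace it with the requested command. */
static int child_exec(void *arg)
{
    char *const *argv = arg;
    execvp(argv[0], argv);
    perror("execvp");           /* only reached if exec failed */
    return 127;
}

/*
 * Clone a child with the given namespace flags (0 means "share everything
 * with the parent"), run argv in it, and return its exit status, or -1 if
 * the clone itself failed (for example EPERM without root).
 */
int run_in_namespaces(int ns_flags, char *const argv[])
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows down on most architectures, so pass its top. */
    pid_t pid = clone(child_exec, stack + STACK_SIZE,
                      ns_flags | SIGCHLD, (void *)argv);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t w = waitpid(pid, &status, 0);
    free(stack);
    if (w == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With ns_flags set to 0 the child shares every namespace with the parent, which is the "nothing happens" baseline. Passing CLONE_NEWNET instead, as root, should make a command like ip a show only the loopback interface.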

And then the command I'm gonna run is ip a. So I see everything on my host again. Cool, we expected that cuz I didn't do anything. So now if I add the flag, CLONE_NEWNET, we'll compile it again and run it again. And, wow. Sudo. Yeah, so then I have only localhost, the loopback interface, which is cool. If I wanted to, I could use something like ip link set to move different interfaces in and set up whatever networking stack I wanted in my container, but this was more just to show that this is entirely separate now from my host. We used C to do that, but you can also use unshare, which is a really cool command-line utility. unshare takes various flags for namespaces.

You'll see them later on, but with unshare you just name the namespaces you want as flags. So we'll do that with ip a, and cool, this is the exact same thing. So you can do it two different ways: you can use just one bash command, or you can write your own C file, which is fun in itself I guess.

unshare is pretty cool though. So yeah, that's a net namespace. Moving on. Another namespace is the uts namespace, which is really just the hostname: you can change your hostname and it won't affect the hostname on your actual host. [INAUDIBLE] Okay, so I'll show this now.

If we go back into that same file, let me delete this to show the difference. You can also pass multiple clone flags, which we'll do later. So my hostname right now is debian, and I really don't want to change it on my host. So if I run this, it's not gonna do anything because we've removed the clone flag, and if I just echo out my hostname it should say debian. Cool, okay.

Sorry, so now I add the flag. Cool, so we add the flag to clone a new uts namespace, but I also want to change the hostname itself. So I'm gonna uncomment this code that I have here, which will set the hostname in the container to myhostname, or we can make it more fun. Cool, I'll compile this again, then run it again.
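The uncommented code is not visible in the transcript. A rough C equivalent, assuming it boils down to a sethostname call inside a new UTS namespace, could look like the sketch below. It uses fork plus unshare rather than clone so that the result can be passed back over a pipe; hostname_in_new_uts is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes unshare() and CLONE_NEWUTS */
#endif
#include <sched.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, move it into a fresh UTS namespace, set `name` as its
 * hostname, and copy the hostname the child then sees into `out`.
 * Returns 0 on success, -1 if any step failed (unshare needs root or
 * CAP_SYS_ADMIN, so unprivileged callers should expect -1).
 */
int hostname_in_new_uts(const char *name, char *out, size_t len)
{
    int fds[2];
    if (len == 0 || pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                         /* child */
        char seen[256] = "";
        close(fds[0]);
        if (unshare(CLONE_NEWUTS) == -1 ||
            sethostname(name, strlen(name)) == -1 ||
            gethostname(seen, sizeof seen - 1) == -1)
            _exit(1);
        /* Send the name back, including the terminating NUL. */
        if (write(fds[1], seen, strlen(seen) + 1) == -1)
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                          /* parent */
    ssize_t got = read(fds[0], out, len - 1);
    close(fds[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) == -1 || got <= 0)
        return -1;
    out[got] = '\0';
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Whatever the child sets, the hostname of the calling process is left alone, because the change happens only inside the child's own UTS namespace.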

Cool, so obviously my hostname on my host, just to prove it to you, is still debian, and the one in the container is something else. That's kinda how, if you run a Docker container, you can give it whatever hostname you want. You can also do this with unshare.

I'm gonna copy and paste this one and be lazy because it's kind of long. What we are doing is creating a new uts namespace and then running hostname to set the name && hostname to print it. I had to wrap it in bash -c because there is an && in there, and otherwise the second command would run on my host. You know how shells work.

So yeah, the one on my host is still debian. So again, two different ways to do it, same type of namespace; it's just something else that is being controlled, with a separate view from everything else. The next one we have is IPC, and this just gives you a separate view of message queues and shared memory.

So yeah, I didn't realize it was gonna be so much back and forth, actually. Cool. So we'll go back into this clone.c. I'll re-comment this out cuz we don't need to set the hostname anymore. Cool, that should be fine. I don't know why it's doing that. And then let me remove this. What I'm going to do is create a message queue in IPC on my host, just to prove that there's something in there.

So it has queue id zero. So, without any flags, which I'm pretty sure I removed, right? Okay, so we'll recompile this with nothing. Typing is so hard today. Okay, cool. We'll name this ipc. And then I'll run it to list the queues, and it should show the queue from my host. Cool.

So there's my host queue, the one that we made. But what we wanna do is obviously do this in a namespace so that we don't see the host's queue in there. So I'll add CLONE_NEWIPC. We'll recompile it and run it again, and it's empty. So the same goes for that. Pretty cool. So you can kind of imagine, or at least hopefully start to imagine, exactly how this all gets put together into a container.

[LAUGH] But yeah, you can also do this with unshare, and it's the exact same thing, which is pretty sweet. And let's see, what do we have left? User namespaces are also very cool. They allow mapping UIDs and GIDs on the host to different ones in a container. So say you have UID 0 in your container or namespace: you can map that to 92-whatever-it-is on the host, and then it's mapped to this unprivileged user.

And the actually really awesome thing about this, since we added user namespaces to Docker itself: if you are familiar, I run almost everything on my host in Docker containers. Even this Chrome, it's a container. So the problem I ran into is that I like to map /dev/snd and my video device, /dev/video0, into the container.

And they had all the wrong permissions, because obviously they were going into a user namespace, and the audio group was mapped to this anonymous user on my host. So what I ended up doing is, say 16 is audio in my container and 16 is audio on my host, I map them to each other so that I have the permissions to use the devices.
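The mapping can be pictured as a small lookup table. The sketch below models how a line of uid_map or gid_map (first ID inside, first ID outside, count) translates an ID. It is an illustration of the semantics, not kernel code, and the 100000 range and the example_gid_map table are invented to mirror the "audio is 16 on both sides" hack.

```c
#include <stddef.h>

/* One line of /proc/<pid>/uid_map or gid_map: "inside outside count". */
struct id_range {
    unsigned int inside;    /* first ID as seen inside the namespace */
    unsigned int outside;   /* first ID it corresponds to on the host */
    unsigned int count;     /* how many consecutive IDs the line covers */
};

/*
 * Translate an ID from inside the namespace to the host ID it maps to.
 * Returns 0 and stores the host ID in *host, or -1 if the ID is unmapped.
 */
int id_to_host(const struct id_range *map, size_t n,
               unsigned int id, unsigned int *host)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].inside && id - map[i].inside < map[i].count) {
            *host = map[i].outside + (id - map[i].inside);
            return 0;
        }
    }
    return -1;
}

/*
 * Hypothetical gid_map: everything is shifted to an unprivileged range
 * starting at 100000, except GID 16, which maps straight through.
 */
const struct id_range example_gid_map[] = {
    {  0, 100000,    16 },
    { 16,     16,     1 },
    { 17, 100017, 65519 },
};
```

An ID that no line covers has no host equivalent; such IDs show up as the overflow user, usually nobody.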

I wouldn't recommend doing that in production, but it did work, and it is super cool as a hack to use devices in user-namespaced containers. Cool. So for this one, I'll just remove this. Cool, so you can see I'm still the same user since I removed the flag. If I do it again with the flag (and obviously we're not using any mnt namespace, that's why you can see everything that is in this folder), cool, so it's nobody. And actually the really cool thing about user namespaces too is that, as you can see, I didn't have to sudo this one.

It can be run unprivileged. There have been some vulnerabilities, as in breakouts, with user namespaces, because technically if you are mapped to zero in the container you can sometimes escape, but they're working on it I guess. I mean, it's Linux, so what do you expect? >> [LAUGH] >> So you can do the same thing with unshare.

[INAUDIBLE] together. So yeah, it's pretty cool: now I'm nobody and have no privileges in there. So, moving on. mnt namespaces give the process its own rootfs, and you can mask different parts of the mounts, like /proc and /sys, which is what we do in Docker itself.

And I'll actually show it together with a pid namespace, where you want a process to see only itself and any child processes it creates. So basically a pid namespace will give you pid 1 for the first process in the namespace, and all the others are numbered from there. You can also nest them, and a cool example of this is Chrome.

If you run Chrome on Linux, they have the whole Chrome sandbox thing. What that actually does, if you don't know this, is each tab has its own process in Chrome. Chrome is also running in a user namespace, but then each tab is one process, and that process is in a pid namespace, just so that if anything breaks out of the tab, whether it's JavaScript or some Adobe Flash plugin, it can't see anything else running.

It's pretty cool I think. That's also why there are so many processes created by Chrome, if you've ever noticed. Usually if you're running top because Chrome is super slow, that's what you'll see: just a crap ton of Chrome processes. So yeah. Cool. So first we'll do this.

So we're gonna clone a new pid namespace, not even worried about that one right now. Okay, so, yeah. I added the flag to clone a new pid namespace, but these are all the processes on my host, so you're thinking, what happened? The thing is, in this pid namespace you can still see all the host's entries in /proc, so what we also need is a mnt namespace.

So we add a mnt namespace, and I'll show you, we're gonna mount and unmount /proc in this file as well. So we're gonna mount /proc here. Cool. And I'll recompile this. This should work. Mm-hm, okay, hang on. I was doing this on a different server before, so I might have some sort of bug.

Cool. Okay, it is weird that it wasn't working on my host, but, okay, ignore that. So now, obviously, I can only see the process itself that's running, which is really cool. And so that combines the mnt namespace and the pid namespace together, which is really the crux of containers, with everything else combined.
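A sketch of that combination in C, assuming the demo file does roughly this: clone with both CLONE_NEWPID and CLONE_NEWNS, then mount a fresh /proc in the child. Marking the mount tree private first is an addition of this sketch, not something stated in the talk; on hosts where / is a shared mount it keeps the new /proc mount from propagating back to the host. The function names are illustrative.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes clone() and the CLONE_NEW* flags */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

#define PIDNS_STACK_SIZE (1024 * 1024)

/* Child: first process in the new pid namespace, with its own mounts. */
static int pidns_child(void *arg)
{
    (void)arg;
    if (getpid() != 1)
        return 1;               /* would mean we are not in a new pid ns */

    /* Keep our mounts from propagating back to the host's mount table. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return 2;
    /* A fresh /proc only lists processes in this pid namespace. */
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        return 2;
    return 0;
}

/*
 * Returns 0 if the child saw itself as pid 1 and remounted /proc,
 * 1 or 2 if a step inside the child failed, and -1 if clone() itself
 * failed (typically EPERM when not running as root).
 */
int try_pid_and_mnt_namespace(void)
{
    char *stack = malloc(PIDNS_STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t pid = clone(pidns_child, stack + PIDNS_STACK_SIZE,
                      CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t w = waitpid(pid, &status, 0);
    free(stack);
    if (w == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A real tool would exec a command such as ps after the mount instead of returning, and it needs root to get that far.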

So, moving on. Namespaces are just one of the ingredients; another key ingredient in creating containers is control groups. If we go back: namespaces limit what the process sees, whereas control groups limit what the process can use. So it's resource metering and limiting, and they will also give you stats back, which is really cool, and that's kinda how you get the docker stats command that you know and love today.

And there's a few different kinds: there's memory, CPU, blkio, network, device, and then a new one was just added in kernels 4.3 and up, so only like five people in the world probably run that. Three of which are in this room. [LAUGH] But yeah, it's cool: there's a pids cgroup that you can get stats from as to how many pids are running in the group, and you can also set the max number of pids, which is really cool for preventing a fork bomb or something like that. And actually, going back to the story I told about how Chrome uses a different pid namespace for every tab:

It'd be cool if they limited that to one, because then you would know that nothing in that tab is actually creating more processes, so I was trying to get a patch in before this. So, the first one: memory cgroups. These can set limits for physical memory, kernel memory, or total memory.

Accounting keeps track of what's actually being used. You can set up notifications so that you won't have the kernel kill your process when it hits the limit, cuz there are soft limits and hard limits, and the hard limit is actually going to kill it if you don't set up a notification.

There's also the CPU cgroup. You can do two different things with it: you can keep track of the CPU being used, and you can use weights to set which processes get what share of CPU. You can also use the cpuset cgroup to pin to specific CPUs. So say I want one container to use just CPU 0 and another container to use CPU 2; you can do that, which is so cool.

It gets tricky trying to do that across multiple hosts with a scheduler, but most of them today, I think, work with stuff like that to use the resources on each node in the cluster. But saying "use cpuset 2" when you have like 500 nodes, it's like, okay, where? But yeah, there are some interesting things you can do with that.
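Under the hood these limits are plain files. As a sketch, assuming a cgroup v1 layout where each controller has its own directory and the group directories already exist, setting a few of the limits mentioned so far comes down to writing strings into them. The helper names and the example values are illustrative; pids.max, memory.limit_in_bytes and cpuset.cpus are the v1 control files for those controllers.

```c
#include <stdio.h>

/*
 * Write one setting into a cgroup directory, e.g. a hypothetical
 *   cgroup_set("/sys/fs/cgroup/pids/demo", "pids.max", "1");
 * Returns 0 on success, -1 on failure.
 */
int cgroup_set(const char *cgroup_dir, const char *file, const char *value)
{
    char path[512];
    int n = snprintf(path, sizeof path, "%s/%s", cgroup_dir, file);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(value, f) >= 0;
    /* The kernel reports a rejected value when the write is flushed. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Apply a few of the limits from the talk to already-created cgroups. */
int apply_demo_limits(const char *pids_dir, const char *memory_dir,
                      const char *cpuset_dir)
{
    /* At most one process in the group. */
    if (cgroup_set(pids_dir, "pids.max", "1") == -1)
        return -1;
    /* 256 MiB hard memory limit. */
    if (cgroup_set(memory_dir, "memory.limit_in_bytes", "268435456") == -1)
        return -1;
    /* Pin the group to CPU 0. */
    return cgroup_set(cpuset_dir, "cpuset.cpus", "0");
}
```

A process is placed in a group by writing its pid to the cgroup.procs file in the same directory, which cgroup_set can also do.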

The blkio cgroup is really cool too. It keeps track of I/O per group and per device. In Docker now, actually in the most recent version, we allow you to set this for, say, /dev/sda or something, and you can throttle the rate at which that device is written to.

So I had this crazy idea. It didn't actually make it into the blog post because my coworkers thought I was insane, but you could stop a tarball bomb, an archive bomb, those tars within tars that just keep going down, and when you expand one it just nukes your system. You could technically stop that with this. If you set the limit low enough, it would take like, you know, seven, ten years to actually unpack and kill your system. But it is so ridiculous, because honestly you should just use read-only instead.

But that is kind of a cool use case, I don't know. Nobody else thought it was cool but me. So you can set throttle limits for read and write, and you can set them to different values. You can also set weights, just like you can with CPU, so the share of the device's I/O that the group gets.
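The throttle files take one line per device in the form major:minor followed by a rate. A small C sketch of building that line, assuming the cgroup v1 blkio.throttle.read_bps_device and blkio.throttle.write_bps_device files; the helper names are hypothetical.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/*
 * Format one line for blkio.throttle.read_bps_device or
 * blkio.throttle.write_bps_device: "<major>:<minor> <bytes per second>".
 * Returns 0 on success, -1 if the line does not fit in `out`.
 */
int format_bps_limit(unsigned int dev_major, unsigned int dev_minor,
                     unsigned long long bytes_per_sec,
                     char *out, size_t len)
{
    int n = snprintf(out, len, "%u:%u %llu",
                     dev_major, dev_minor, bytes_per_sec);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Same, but look the device numbers up from a node such as /dev/sda. */
int format_bps_limit_for(const char *dev_path,
                         unsigned long long bytes_per_sec,
                         char *out, size_t len)
{
    struct stat st;
    if (stat(dev_path, &st) == -1 || !S_ISBLK(st.st_mode))
        return -1;              /* missing, or not a block device */
    return format_bps_limit(major(st.st_rdev), minor(st.st_rdev),
                            bytes_per_sec, out, len);
}
```

Writing the resulting line into the write throttle file of a group caps how fast processes in that group can write to the device; a separate line in the read file sets a different cap for reads.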

Oh, cool. So I actually spun right through that. We can do questions, or I can just start showing random stuff with runc too, if you wanna see that, cuz that took way less time than I thought it would. I factored in a lot of time for actually helping people do it. So did anyone actually follow along? Nice. Okay, was it doable? Okay, cool.

Is there anything specific that you want to see? I can do it. >> [LAUGH] >> That wasn't like a challenge. >> [INAUDIBLE] >> A clone what? >> [INAUDIBLE] >> Oh, if I did multiple, you mean? Yeah, it totally works. I mean, that's how you do multiple namespaces. I can do one with multiple. I might as well just do it all. So I think I got them all... oh, IPC.

Actually, this server is not configured correctly, if I remember right, for cloning a new user namespace. Let me just check. Yeah, it's not, so I'm just gonna take that out, and you can imagine it was in. So we have the one pid. Let's see, what else was there? ip a. And we have the hostname itself.

It's kinda cool. Combining them all really just gives you a container. [LAUGH] In the natural sense. It's pretty nice. Yes? >> [INAUDIBLE] >> Yeah, so actually the coolest part about containers versus VMs, in a sense, is that a container is made up of all these ingredients, so you can include what you want and leave out everything that you don't want.

So, say for example, I want two containers to share the same net namespace, and then I want this strace container to see the pid namespace of another container. You can take all these ingredients and, almost like a puzzle, say this one shares from this one, this one has its own, this one connects to this one, which you can't do with a VM, and which is, I think, the heart of what makes containers cool.

Because I've totally done that before: you have an strace container here, then you have a database and the actual application itself sharing a net namespace so they can communicate, and you can do the same with IPC stuff if your application is using that. And yeah, I honestly think that's the coolest.

The hardest thing to share between containers would be a mnt namespace. There's no way currently to do that in Docker, and I do not know how easy it would be to implement. Any other questions? Has anybody tried... oh, yeah. >> [INAUDIBLE] >> So, the pids cgroup was actually just added like a few months ago, maybe. So I think that if you have a feature or something new, you would have to suggest it to them and it could be added; it's more just their process.

I think it would probably be pretty hard to add a new namespace, but it would be cool. I really want a time namespace, cuz right now if you change the time in a container, it changes on your host. So various things are namespaced, but there are a lot that aren't, and most of those we try to block in Docker itself so that you won't run into them and actually change something on your host, which is why the seccomp addition in the latest version of Docker is so cool. But yeah, it gets complicated. A time namespace would be super sick.

Yes? >> [INAUDIBLE] >> Yeah, so the one I mentioned was an application and its database in the same net namespace; for sure that one's really nice. Sharing a pid namespace and having an strace container for debugging is super cool too. Trying to think about what other ones I've used. Let's see, I do this for something. Yes. I didn't do that the way that I thought I was. >> Oh, that's where the volumes from >> Yeah, I mean, trying to think. Like hostname: if you're gonna share the net namespace, you might as well share the UTS namespace, so that if you want to ping the other container by its hostname it would work.

Yeah, I kind of feel like that's really the only use case I can think of off the top of my head. I feel like that's pretty common. Well, with Docker you can actually do docker run --net container:<name of your container> or whatever, and you can do that with most of the namespace flags. But if you're gonna do it yourself, if you're gonna write the code for it or something, you'd have to point to the file descriptor for the namespace itself.
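Doing it yourself means opening the namespace file of a process that is already in the namespace and handing that descriptor to setns. A minimal sketch, with join_namespace as a hypothetical helper; it needs sufficient privileges (typically root) to succeed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes setns() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Join the namespace of type `type` ("net", "uts", "ipc", ...) that the
 * process `pid` ("1234", or "self") lives in. Returns 0 on success, -1 on
 * failure with errno set by open() or setns().
 */
int join_namespace(const char *pid, const char *type)
{
    char path[128];
    int n = snprintf(path, sizeof path, "/proc/%s/ns/%s", pid, type);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* 0 means "accept whatever namespace type the descriptor refers to". */
    int rc = setns(fd, 0);
    int saved = errno;          /* close() must not hide setns()'s error */
    close(fd);
    errno = saved;
    return rc;
}
```

After a successful call, anything the process execs runs inside the joined namespace, which is the same idea the nsenter utility is built on.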

>> [INAUDIBLE] >> Yeah, yeah. So, almost like allowing all and then just blacklisting some items, yeah. No, I don't know. I don't think so. [LAUGH] I don't think that would work, though. I'd have to try it, but I feel like when you're already in a namespace you can't just poke holes in it, other than the example I gave with poking holes in the user namespace for my audio and video devices, and that's technically in the mappings.

I don't think that there are really that many other holes that you can poke, unless, you know, iptables routing to get the interfaces in the container to hook up to the host or something. Could be interesting, slightly, and really hacky. [LAUGH] Who here uses Docker day to day? Just wondering.

Nice. Have you used the newest version? >> [INAUDIBLE] >> Yeah. Cool, do you like it? >> [INAUDIBLE] >> Nice. >> You worked on? >> Yeah, I am a maintainer, yeah, that's what I work on day to day so if you've hated it I would be like, no I don't know what you're talking about. >> [LAUGH] >> Yes.

[INAUDIBLE] you can have a really low client version and a high server version and it will work, but you can't do the opposite. I did test a while ago with like 0.7 of the Docker client against the API and it worked. I don't know if that holds true today, though, but it would be interesting, yeah.

And as far as contributing to the project or anything, we try to make it as easy as possible, and you could totally just reach out to me if it's not. Yes? >> [INAUDIBLE] >> Yeah, totally. Okay, so actually I have some really cool stuff. Let's see. So recently I converted all the containers on my host to runc.

Let's see, what do I have here? I feel like htop. So runc has a very simple command line interface. Let me just show it. There's start, kill, stop, all that, and really what makes it work are these config files and the rootfs. The rootfs for my actual container is all in here and you can look at it, but it's really just Alpine with htop installed. What makes it cool is these spec files, which they recently changed. It used to be made up of the runtime.json and the config.json, but now they've combined them into one; this just hasn't been upgraded yet. It basically defines the environment, the UID and GID for the container itself, the current working directory, mounts (these are all pretty standard), and capabilities.

And then this runtime part will set up the namespaces and the mounts, with the flags. You can have these hooks, which I think are the coolest thing ever. We would probably never implement this in Docker because it's like playing with fire. But I have all these hooks to set up a network inside the network namespace.

It'll create almost exactly what Docker does; I just put it into a binary of its own. It creates a bridge and hooks up veth pairs so that you have a gateway and can actually reach the Internet from your network namespace, and it will give an IP to the runc container itself.

And then I just put that in as a hook. This one technically doesn't need it. I was also testing all the seccomp stuff, so I have this hook that straces everything. That doesn't matter. But yeah, so you can set all your resource limits here; there's nothing crazy going on.

Then all the namespaces that you're gonna use for it. For this one specifically I wanted to use the host's pid namespace, because I don't care about htop if I can't see the shit that's actually running, cuz otherwise it would obviously just show htop. So here I'm doing that, and then the devices themselves. So yeah, we'll go back. So that's cool: this is everything running on my host, in htop, in a container.

If I wanted to, I could have set the pid namespace with a path to the file descriptor for the pid namespace that I wanted to share with a different container running somewhere else. And it's nice: runc really just converts this JSON config into a container. I can show another one. Let's see, so this is just standard Alpine, and it's actually gonna do all the namespaces.

So there's pid, network, IPC, UTS, mount, and user. It creates them all. And then my hooks themselves: you can see I have this binary, netns, that creates the bridge, bla, bla, bla. It's weird. Okay. Let's see. Yeah, so that one does not like me right now.

I wonder what I did. It could be one of the hooks, actually. Look at this. Okay. Let me just try something. It's like live coding here sometimes. It's probably this. Mm-hm. It's not. Yeah. This is killing my time. Yeah, it's a work in progress, the runc binary itself, cuz they are constantly changing the spec, so all the tools that I build around it break sometimes.

Let's try this one. Okay, I'm just gonna give up on that. Yeah, I do not know why that one hates me. So, I mean, [INAUDIBLE]. Yeah, there's something wrong with it, sorry. The first one worked, so use your imagination. >> [LAUGH] >> Yeah, so I think that honestly what happened is they constantly change the spec and then I do not update my files.

So it's probably something like that. I actually have this automatic generator that will take a running Docker container and convert it to the runc spec format. I would not suggest using it right now cuz I've not updated it. Until the spec is set in stone for runc, it's very hard to build tools around it, because I constantly find myself out of date. I don't know when that's gonna happen, but I hope it's soon, because I've really committed to moving all my crap to it, yeah.

So I can show that my Chrome is currently running in a Docker container itself, so that's pretty cool. It's not actually [INAUDIBLE]. It's mounting the sound and video devices. This one isn't using a user namespace, because I actually ended up switching to runc so that I could do a user namespace with my custom UID/GID mappings that I would not suggest anyone else use.

But it's still pretty cool. I mean, I run Chrome in a container every single day, so it's pretty weird. Any other questions? >> Real quick. So you've got [INAUDIBLE] >> Oh yeah, so I personally don't use SELinux because I run Debian, but I use the [INAUDIBLE] profile in all my containers, cuz I added that and it makes me happy. But I would suggest using them both. The only container that I actually broke with the seccomp default profile was my Chrome container, because Chrome itself creates new namespaces and we blocked clone from doing that, because there are some problems with new user namespaces, like having privileges when they shouldn't.

So it was more of a formality. I have a custom profile that I use for Chrome itself that allows me to do that. But I honestly think both are great and it's only gonna get better. I actually made a proposal to have these custom security profiles, which I can show you, because it's honestly very hard to write your own seccomp profile: you basically have to know all the syscalls that your app makes, and, you know, who has time for that? It took us forever just to write the default one, let alone one for an application itself.

You could technically trace it and get all the syscalls, but it's gonna leave something out. If you've ever used aa-genprof, the one that generates AppArmor profiles, it's not perfect. Everybody always complains because it leaves something out, and then someone will turn it off because it's not working. So there's this proposal to add something like what OpenBSD has with this new pledge thing, where it's more abstracted away from the user. So DNS would say allow sendto, recvfrom, the stuff with sockets, and that way you don't have to be some sort of syscall expert, know what your app is doing, or even try to trace your entire app. So this is something I'm working on, and hopefully it will get better sometime soon, because I honestly highly doubt that anyone is going to write a custom seccomp profile, and if you do, that's really cool.

We do have the default one. So say your container isn't working with the default profile: you can take this default.json and either remove or add whatever is causing your problems, and then you at least have more security than nothing. Which I would highly recommend.

That's how I made my Chrome one too. Any other questions? >> [INAUDIBLE] >> Yeah, it's cool. Oh yeah, I can diff it, actually. So let's see, there's some crap at the top about architectures. One of them is not formatted. Well, hang on, I know this is never gonna work out because it's... actually, I'll just edit one of them. It's still, yeah, it's using the int for the allow and deny versus the string version of it, so it's gonna be hard.

Really, honestly, what I did was delete the clone line. Our seccomp profile is a whitelist in itself, so it had clone with a mask on the arguments clone is called with, one that matches CLONE_NEWUSER, CLONE_NEW-anything, and so I just removed that altogether. Actually, I just added clone and took out the mask, so clone is in here if you look, and that was the only thing that I actually broke.
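A masked comparison of this kind can be sketched in plain C. This is only a model of the check, not the profile itself: the set of flags in BLOCKED_CLONE_FLAGS is an assumption based on the description of blocking anything that creates a new namespace, and clone_flags_allowed is a hypothetical name.

```c
#include <linux/sched.h>    /* CLONE_NEW* flag values */

/* Namespace-creating flags that the filter refuses to let clone() use. */
#define BLOCKED_CLONE_FLAGS                                         \
    (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |    \
     CLONE_NEWPID | CLONE_NEWNET)

/*
 * Masked-equality check in the style of a seccomp argument rule:
 * the call is allowed only when (flags & mask) == 0, i.e. when none of
 * the blocked bits are set. Returns 1 for allowed, 0 for denied.
 */
int clone_flags_allowed(unsigned long flags)
{
    return (flags & BLOCKED_CLONE_FLAGS) == 0;
}
```

An ordinary thread or fork-style clone sets none of these bits and passes, while any call that asks for a new namespace fails the comparison. Dropping the mask, as done for the custom Chrome profile, is equivalent to always returning 1.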

So the clone syscall is there with no arguments, so it's gonna allow any kind of clone. The reason we did the mask is that you can do various different things with clone. If you need to fork a new process, sometimes you're gonna use clone, so obviously that needs to be allowed in a container and can't be blocked by default. So you can create these masks for what it's allowed to do and what it's not allowed to do, and ours looks like this. It takes all these flags and then compares the masked value to zero, so the filter knows not to allow a clone that sets any of them. Really, it took me a long time to figure that one out. [LAUGH] Any other questions? >> Let's have a round of applause for Jessica. >> [APPLAUSE] >> Thank you so much.

Speaker:

Jessica Frazelle: Core Maintainer, Docker