Wow, that cgroups article over at LWN is awful. The story and Jake's comments make clear that he assumes that the technical decisions we make in systemd on cgroups were driven by politics. That's just bullshit though. Especially in the cgroups case it would be helpful to actually understand the problem first before assuming that doing that in PID 1 was not the right technical choice. I mean, it's a good idea for anyone to first figure out the technical details before assuming politics, but particularly for people from the press that'd be a smart move.

Uncool, LWN, uncool.
Sorry you don't like the article, Lennart, but somehow you seem to think it's an article about systemd?  It's not, it's about Serge's cgroup manager project.  You've taken exception to a half-sentence toward the end, the part after Jake said that not depending on a separate daemon from init made sense.

Non-subscribers who want to judge for themselves can see the article at
Always assume good! I actually liked the article (surprise, surprise) and I found it fair.
I also couldn't help noticing that the sentence "forcing anyone who wants to use cgroups to use systemd clearly isn't" was a bit strong; however, this is in fact what the team developing systemd would be doing, if alternatives such as cgmanager didn't arise. I don't think that the article is accusing you of making a political decision, it doesn't even try to guess why systemd develops the way it does.

If you re-read the article with unbiased eyes, you'll see that in fact there is absolutely no criticism of systemd.
This comment from Lennart seems to be key:

"And yes, there are tons of reasons why you want cgroup management in PID1: because it's trivially easy there and simple, and it is not if you do it outside of PID1. If you do it out-of-process, you need to replicate pretty much the entire state of your service manager in your cgroup manager, since the entities that are resource managed are in 95% of the cases exactly the same entities that are service managed. So splitting them up means you need some form of IPC that constantly replicates the entire tree of services from PID1 in that other cgroup daemon. That's fragile and messy, and a lot of unnecessary code. IPC always is. Then, there is the issue of cyclic deps: a good PID 1 knows at any time securely and reliably of each process to which service it belongs. That's easy to do with cgroups. However, if cgroup management is done outside of PID1 in a different process, then PID 1 suddenly becomes a client to that other process, while that other process is also a managed process PID 1 wants to use cgroups to manage for. To start and manage that other daemon that will allow you to deal with cgroups you need cgroups in the first place. And that's just broken."

You can turn off parts of what systemd provides, but the cgroup management is not one of those parts. So if you want an alternate cgroup manager you need an alternate PID 1 (init).  But it seems like the reason for wanting an alternate cgroup manager in the first place is that you have a non-systemd init.  So is this really only a problem for people who want to run systemd but don't want systemd to manage their cgroups for some reason?
+Jonathan Corbet well, it does reference systemd doesn't it? And those are the parts I have a problem with, since they suggest we were not playing nice, and would make life hard for people who don't do systemd. It uses the strong word "to force" in our context. But we don't force, we don't make life hard, and Jake shouldn't suggest otherwise, regardless of whether that's an article about systemd or not.
Don: The question is more why the different cgroup manager daemons (be it systemd or cgmanager) cannot provide the same DBUS interface; then all problems from the application side would be solved. But this seems to be not possible (according to Lennart):
+Jonathan Corbet
 I would have enjoyed it if Jake had spent more time focusing on the discussion between Tejun and Tim/Serge.  The discussion on the heels of Tejun's lkml cgroup status quo really speaks to what is happening now. Neither Tim nor Serge really seem to get why the central manager is needed at all.  And if they don't understand why it's needed, how can they build an alternative implementation to systemd that makes sense?  Tejun spends a lot of time going back and forth with them, even prior to Lennart commenting in defense of systemd's approach.

Look... Tim and Serge just don't seem to get what the inherent problems are. How is engineering an alternative, where the principals haven't had the light bulb go on over their heads, really going to end well?  They feel forced into building this alternative because they don't see it as a solution to a real problem. Their heart isn't in it. The status quo of wild and crazy do-what-you-want cgroup management is good enough for them.

The systemd devs seem to be on the same page as kernel side development on where cgroups needs to go to be a secure set of knobs.  Why don't the alternative cgroup manager developers understand the need for it?  That's the story.  There is far more communication between Tejun and this alternative group than with Lennart. The real story is in that communication. Lennart's perspective isn't nearly as important as the lack of core agreement on the need for the change in the cgroup management between Tejun and Tim/Serge.  
+Lennart Poettering , I did say that it made sense to put cgroup handling in with PID 1 -- in the same sentence that you are complaining about, in fact.  I'm not sure what 'cool' has to do with any of this, but you and I clearly disagree on the motivations/plans/techniques/schemes of the systemd project.  I have watched the project fairly closely from the outset and I'm actually a fan, as I think you know, but I do think you could at least try to work with the rest of the Linux ecosystem better. It made sense to me that you left the other Unixes behind because they lack the features you use in systemd, but now you are choosing to ignore any non-systemd-using Linux distro. That is most certainly your choice, but that doesn't mean I have to agree with it.  And that's what I was trying to get across, but quite possibly failed to do so, at least for you.
+Jake Edge I could try "to work with the rest of the ecosystem better"???

What do you think I do every day? How do you think the cgroup stuff in systemd came into existence? That was after we discussed this in all detail with Tejun. In fact, I initially didn't even want systemd to be the one and only manager of cgroups on a system, Tejun first had to convince me that that was a good idea. I spend all frickin day talking to people, I go to conferences more than most, we do hackfests and whatnot. Of course I work together with people from all communities, after all we build the basic building block many build their own stuff on. And you tell me I could work with the rest of the ecosystem better??? Oh, come on!

You know, a suggestion of "you should work with the rest of the ecosystem better" is so frickin' cheap. You don't appear to have any insight on how we cooperate. And all that, it's not even my baby the cgroups stuff, it's Tejun's. And we worked with him and a lot of people to make it work. Maybe the Ubuntu people should start with that too?

I will not work with people who immediately tell me how stupid I am, you can bet on that, but other than that I am very open to work with anybody. But for me technical stuff matters more than anything else. I will not give up on that just to satisfy your definition of "working with the rest of the ecosystem better".
Wow, +Lennart Poettering , I don't agree, but I don't see anything particularly useful coming out of continuing to discuss it.  I clearly touched a hot button (or more than one).  Sorry for that!
+Jake Edge ,  The problem is it comes off as systemd pushing this change instead of systemd developers responding to the needs of the kernel devs to fix very real problems.
The long term push to change is coming from the kernel side. Please, please follow up with a perspective from Tejun on why a single manager is needed. Really dig into some of the security concerns he keeps referring to as an impetus for needing to change.    The workman-devel discussion from June/July, which resulted in the closure of workman development, is full of little gems from Tejun.
+Jef Spaleta , I think you are reading some other article.  I can see nothing in mine that suggests or implies that systemd pushed for the change to cgroups. Sorry.  Your suggestion is appreciated, though.
+Jake Edge I think you're taking far too much notice of FUD on the intertubes when you make the statement "to work with the rest of the ecosystem better". I think this is actually a very insulting statement to make. Sure you can find people who make these claims, but having known Lennart for a number of years I honestly cannot understand how anyone can come to that conclusion.

Lennart specifically goes out of his way to work with other people, to cooperate, to reach out and to inform people. He has reached out to Upstart people far more than it seems you believe to keep them aware of the problems that will result from the kernel changes, but it's clear they were unable or unwilling to appreciate the impact this would have.

Making such accusations when it's very clear the opposite is the case is I think what's really hit the nerve here and rightly so.

It's like telling the guy who drives a Prius, grows his own vegetables in his garden, uses solar power and a wind turbine, recycles everything and insulates his home that he doesn't care about the environment just because he owns a smartphone and they are made by big factories.

You're just spouting the same old FUD. Next time, check your sources, perhaps ask Lennart directly about it, arrange a phone call or something to get the facts straight. LWN should be about quality journalism, not more fodder for the peanut gallery.
+Colin Guthrie asking "to work with the rest of the ecosystem better" is not an insult. If I ask you to do something better it doesn't automatically mean that I think you did wrong, it simply means that I believe that you could do better (where "better" can of course be very subjective).
Then, if one wants to fight over every sentence, trying to find some evil meaning in the other's mouth, that's another story. Then, read your own comment and see if you can find some insulting statement there as well...
+Jake Edge  I think the fact that Tejun is pushing hard from the kernel side is an important piece of the story here.  Tim is openly hostile to the idea of collapsing down to a single hierarchy, right out of the starting gate in the discussion, because it kills his current in-production system and generates a hell of a lot more work for him to re-engineer it.  That hostility infects the tone of the rest of the discussion leading to the Tim/Lennart interaction.  The really important discussion is about 50 messages ahead of where Lennart taps into the workman-devel list to suggest that systemd is already kickass enough for Google to use if they so chose.

What I don't understand is why Google isn't interested in trying to use systemd already. It is what it is.

And the thing of it is, everybody (Tejun, Serge, Tim, Lennart) seems to agree that Google's infrastructure needs with regard to cgroup container security are a very special case. Tejun comes off as very sympathetic to the pain the change is eventually going to cause for Google's infrastructure.  And Tejun repeatedly points out that Google's special case can rely on the existing cgroups API for a while longer, because Google's use case doesn't have the same security issues: they control both host and containers equally and do not have to treat their containers as potentially malicious. 

It's still not clear to me that the alternative approach will adequately solve the problems Tejun needs to see addressed and will work well for Tim as a stakeholder. Tim seems to want access to lower level knobs than Tejun would want the kernel to expose. I still think there is a communication gap between kernel side and Google's containerized infrastructure perspective.
+Alberto Mardegan sorry, mate, you lost your rights to give pieces of advice how people should cooperate upstream when you joined Canonical... ;-)
Frankly +Jonathan Corbet , +Jake Edge whilst it is always nice to have the press available to "speak to", you've made a decision to publish.

There were no factual errors and this was more of an overview / opinion piece in any case.

I thought the article was perfect.

It clearly communicated to me precisely the choices I face as an admin (go with systemd as cgroup manager, not go with systemd as cgroup manager)

It wasn't about pandering to any particular project, or personalities, or ego - it laid out the facts as they are.
+Lennart Poettering , one sidebar.  In trying to understand why the systemd API isn't acceptable to upstart/Ubuntu... I watched the vUDS session and read the etherpad notes.  There's mention of systemd not supporting "nested" cgroups as needed for lxc container situations. I'm not sure I understand the limitation with regard to "nesting". I was wondering if you understood that concern, or if it was previously brought up in systemd-devel for discussion and I just completely missed it.

I think I understand why Google falls outside of the high level API usage case. But I would fully expect upstart/Ubuntu to want something very close in terms of a high level API...targeting similar general purpose use cases. So I was wondering if the "nesting" limitation has come up in discussing development of systemd's cgroup management API. Or if it's just an external misperception of the API's capabilities.
+Jef Spaleta I am not sure what "nested" cgroups are supposed to be. That's not a term I use. If "nested" is supposed to mean "organized in a tree", then sure, we certainly do support trees of cgroups. I mean, that's kinda the whole idea of it... (which is why it isn't clear to me what "nested" is supposed to mean in this context...)

The reason why the systemd solution doesn't work for them is that they are sold on Upstart, and that's their priority.

systemd's API is bound to systemd concepts. I.e. to get a group of processes managed you create a scope unit and place it in a slice unit (which ultimately creates a cgroup one day, but you never get in contact with that). You can also place service units in slice units. Scope, Slice and Service units are all shown by systemctl and introspectable via the usual tools. You can filter logs by them, you can set all kinds of properties on them. A scope is hence much more than just a cgroup. It is a real management unit of systemd, that is relevant in many other areas too.
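To make the slice/scope/service relationship concrete, here is a minimal Python sketch of how unit names are commonly described as mapping onto a cgroup path: a slice name is a dash-separated chain of ancestors, and a scope or service sits below its slice. The function names are hypothetical, the sketch follows the naming convention only (it ignores unit-name escaping), and it is not systemd's implementation.

```python
# Illustrative model only: function names are hypothetical, not a systemd API.
SLICE_SUFFIX = ".slice"


def slice_to_cgroup_path(slice_name: str) -> str:
    """Expand a slice unit name into the chain of nested slice directories.

    "a-b.slice" is taken to live inside "a.slice"; "-.slice" is the root slice.
    """
    if not slice_name.endswith(SLICE_SUFFIX):
        raise ValueError(f"not a slice unit: {slice_name!r}")
    stem = slice_name[: -len(SLICE_SUFFIX)]
    if stem == "-":
        return ""  # root slice maps to the top of the hierarchy
    parts = stem.split("-")
    # Each prefix of the dash-separated name is one ancestor slice.
    ancestors = ["-".join(parts[: i + 1]) + SLICE_SUFFIX for i in range(len(parts))]
    return "/" + "/".join(ancestors)


def unit_cgroup_path(unit_name: str, slice_name: str) -> str:
    """Place a scope or service unit below the slice that contains it."""
    return slice_to_cgroup_path(slice_name) + "/" + unit_name


if __name__ == "__main__":
    print(unit_cgroup_path("session-1.scope", "user-1000.slice"))
```

The point of the model is that the caller only ever names units; the directory layout falls out of the names rather than being something the client manipulates.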
Ooh, some clarifications.

* Yes, I think cgroup should be managed by some entity which is trusted and privileged. There are two big reasons for this.

One is the fact that the whole thing is neither designed nor implemented for delegation to untrusted domains. As I've now mentioned multiple times, security is mostly about logistics and details, and implementing something with a huge and complex interface surface without meticulously auditing every detail and expecting it to be secure is outright stupid. I can assure everyone that delegating cgroup to !priv has always been insecure and AFAICS it will continue to be - you can create vmalloc allocations of unlimited size, you can put processes into an undefined hung state which will also drag your debugger into the same state if it tries to attach to it, and so on. Just imagine implementing this same feature with system calls and how much effort we'd have to be putting into secure design and implementation. cgroup has not done any of it. Expecting it to be securely delegatable is fairy-tale naive.

The second is that delegating to !priv users often leads to each binary growing awareness of cgroup interface and manipulating it directly. This immediately promotes cgroup interface to the status of system calls, which is completely messed up. It becomes a side channel to breach the kernel API design and review process. I don't even understand how something like this has ever been allowed to happen but this has served as an active shortcut for both kernel and userland devs and is disastrous for both in the long term.

I don't know how designed this delegation to untrusted domain "feature" is. I suspect it came about as an accident just because the interface was filesystem based. Anyways, kernel devs have been working assuming the interface is a privileged admin thing and some userland has been happily delegating them out to untrusted domains.

* Serge seems to be mostly on the same page. I don't think there were any fundamental misunderstandings. It was just me being stupid about how chown was being used in its scheme which seems fine now.

* Google folks can speak for themselves but I asked around at the recent conferences and they don't seem to have any fundamental disagreements at this point. All the contention points were quite minor, or they at least seemed that way to me.

Hope this clears up at least some of the misunderstandings. Thanks.
I've been using delegation to !priv users with the current cgroup design just fine. It works as expected - I simply give permissions to a cgroup subtree to a user and they can do whatever they want.
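As a rough illustration of what "give permissions to a cgroup subtree to a user" amounts to on a filesystem-based interface, here is a minimal Python sketch that hands ownership of a directory tree to a given uid/gid. The function name is hypothetical and the path is whatever subtree you pass in; on a real cgroup mount this would need to run as root, and the sketch says nothing about whether doing so is safe.

```python
import os


def delegate_subtree(cgroup_dir: str, uid: int, gid: int) -> int:
    """Hand ownership of cgroup_dir and everything below it to uid/gid.

    Returns the number of filesystem entries whose owner was set.
    Hypothetical helper for illustration; not part of any cgroup tool.
    """
    os.chown(cgroup_dir, uid, gid)
    changed = 1
    for root, dirs, files in os.walk(cgroup_dir):
        for name in dirs + files:
            os.chown(os.path.join(root, name), uid, gid)
            changed += 1
    return changed
```

After this, the ordinary file permission checks decide who may create child groups or write to the control files in that subtree, which is the whole mechanism being argued about here.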

I have heard the claim that cgroup delegation is uber-insecure and will eat your pets for lunch about 100^500 times by now. But somehow people always omit the exact details of the horrible gaping security holes. Except for the brain-damaged "starve your siblings" scenario and non-hierarchical cgroups (which are a mis-feature in themselves).

Let's see your technical objections:

1) "Putting processes into an undefined state and fouling up the debugger". Boohoo. So a !priv user can mess with their debugger. Horrible. And totally unprecedented, since ptrace is a nice rock-solid interface that always works flawlessly.

2) Unlimited vmalloc - totally a kernel issue that MUST be fixed anyway.

And your non-technical objections: cgroup ABI must be treated as an ABI. That ALREADY is true, systemd is NOT going to be the single user of the cgroups ABI and so it must be stable.
+Ken MacLeod  it's not clear to me that this is not allowed on a system where a host is running systemd and a set of heterogeneous containerized instances, running different init systems. I thought the whole point of the systemd approach here was to make it possible for these containers to know the host cgroup manager is systemd, and to communicate back to that host manager via the machined API.

I'm just not sure if the alternative approach I'm seeing sketched out actually meets the requirements for the goals of the "sane_behavior" rework.  I thought the point of requiring a single controller instance on the host was to make it so containers were going to be making requests to that host process instead of twiddling bits of their section of the hierarchy directly. The requests must flow through the host cgroup manager, whatever that is, via an API of some sort.  systemd has an API for containers to use..... So I'm still left confused

Next, let's see the requirements for the central cgroups daemon. It must  support delegation to other users. At minimum, to support containers running old versions of systemd or other cgroup users.

How can this be done? I envision a way to delegate a subtree of the whole cgroups tree and allow the delegated software to be limited to using it.

So you need to have a way to isolate a part of the DBUS tree and make sure only authorized users can access it.

Then there's a question of security - how do I audit cgroups delegation tree? Is there an audit API?

Can I use my favorite AppArmor (systemd is going to fully support AppArmor, right?) policies to make sure that only one process in my cgroups+namespaces based container is allowed to tinker with cgroups? How do I make sure that my browser can't ask for a delegated cgroups tree using this nice systemd-based DBUS interface?

Basically, ALL these issues are solved nicely with the classic Linux filesystem-based interfaces. And trying to solve them another way would just generate a solution isomorphic to a filesystem tree. Except it'll be stupid and buggy.

PS: personally, I think that cgroups+systemd team should be hit by a cluebat. Repeatedly.
+Lennart Poettering , One last question.... concerning the need to wait for atomic rename on the kernel side to be able to move scopes and services around from one slice to another in the systemd cgroup abstraction when sane_behavior is in place.

My understanding is that slices, scopes and services map into directory objects in the cgroupfs hierarchy, so you'll need the atomic rename to become available to move these around the hierarchy via the systemd cgroup management API.

What I'm not clear on is what you need if an admin wants to use the systemd cg API to just move individual processes from one scope or service to another. That maps effectively to removing a process id from the tasks file somewhere in the hierarchy and writing it into the tasks file in a different part of the hierarchy.  Are there any kernel side changes blocking sane behavior of that operation? Reading the systemd wiki on the ControlGroup it's not clear if that's currently allowed or if it's also waiting on the rename issue to be resolved for correct operation. Maybe I just missed the explanation of how an admin does this with the systemd API.  
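For reference, the raw filesystem operation being described can be sketched in a few lines of Python. This assumes the classic cgroup filesystem layout with a `tasks` file per group; the helper name is hypothetical, and this is the direct kernel-interface route, not the systemd API being asked about.

```python
import os


def move_task(pid: int, dest_cgroup_dir: str, filename: str = "tasks") -> None:
    """Write pid into the destination group's task file.

    On a cgroup filesystem the write itself performs the move; there is
    no separate step to remove the pid from the source group's file.
    Hypothetical helper for illustration only.
    """
    path = os.path.join(dest_cgroup_dir, filename)
    with open(path, "w") as f:
        f.write(f"{pid}\n")
```

Nothing in this sketch guards against the PID having exited and been recycled between the moment the caller looked it up and the moment of the write.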
+Jef Spaleta

There's a race condition there - if a process dies and another process takes this PID.
+Jef Spaleta That sounds like it would mean each instance of a container's cgroup manager has to know how to talk to the kernel (if it were a bare-metal instance) and every possible host cgroup manager.  So systemd, running in a container, would need to be able to talk to a cgmanager host, and cgmanager, running in a container, would need to be able to talk to a systemd host.  Since each manager's approach appears to be different, that doesn't sound like it's possible at the host manager's higher API level.

I think that's what "nesting" is asking for, the ability of a container's cgroup manager to control everything from its root down in the cgroup hierarchy, through the kernel.

+Tejun Heo Any insight here?
+Ken MacLeod  containers will have to talk to the manager running in the host regardless.  So as soon as you have two different APIs in use anywhere in the container stack, you have the problem. Every manager API will be equally nestable or unnestable with every other.

To say that systemd is any more unnestable than the yet to be finalized cgmanager is pointless..especially when the cgmanager stuff is still basically trying to figure out what they want to do. I have yet to see any discussion as to why the systemd API is not usable as a general API even by a different implementation of a manager process.   