Shutting down is hard.

I am sorry for piling on, but here's an interesting aspect of Upstart I'd like to shed some light on. Not necessarily because it is hard to fix, but simply because it is quite interesting.

Upstart (as it stands now) will eat your file system, and it has been doing so for quite a while. I am pretty sure I know why that is, and I am amazed that this has gone unfixed for so long...

Here's the long story: as one might expect, Linux doesn't permit remounting a file system read-only while a program still has a file on it open (or mapped) for writing. However, sometimes it doesn't permit the read-only remount while a program has a file open (or mapped) merely for reading either. That happens when the file was deleted after it was opened, but the deletion couldn't be completed yet because the open fd/map requires the file to continue to exist.
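To make the "deleted but still open" state a bit more tangible, here is a minimal Python sketch. The helper name read_after_unlink is made up for this illustration. It only shows that an unlinked file stays alive for as long as an fd refers to it; it does not attempt a remount.

```python
import os
import tempfile


def read_after_unlink(payload):
    """Create a file, delete it while it is still open, then read it back.

    Returns (data, still_listed): the bytes read through the open fd and
    whether the path still exists in the directory after the unlink.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                       # the name is gone ...
        still_listed = os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))      # ... but the data is still there
        return data, still_listed
    finally:
        os.close(fd)                          # only now can the file really go away


if __name__ == "__main__":
    print(read_after_unlink(b"still here"))
```

Until that final close() the file system has a pending deletion it cannot carry out, which is the state described above.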

During shutdown the init system (or some related program) needs to unmount all file systems, and remount read-only the remaining ones it cannot unmount. If it doesn't do that and shuts down anyway, this will usually be noticed on next boot and result in an fsck run. In the best case that's the only thing you will notice from it. In the worst case you lose unwritten data, because the file system wasn't in order when the machine was powered down.

Now, to make sure that the file systems can be unmounted/remounted, any self-respecting init system will enter a kill loop just before, terminating all remaining processes until none remain, with one exception: PID 1 itself. Normally that should totally suffice to make sure that the file systems can be unmounted, since after all PID 1 doesn't keep any files open/mapped for write during runtime. Well, that's the theory at least. In real life the second case pointed out above will actually happen quite frequently: when the system is upgraded during runtime (via dpkg, rpm or suchlike) and any file that is opened or mapped by the PID 1 process is replaced by a newer version, then the old version will be unlinked but still be referenced by PID 1, thus not allowing the read-only remount right before shutdown to succeed. Which files are those? Well, libraries such as libc or libdbus as used by the init system, or even the init system binary itself.
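If you want to check whether a process is in this state, the kernel marks such mappings with a " (deleted)" suffix in /proc/<pid>/maps. The following is a rough Python sketch under that assumption; deleted_mappings is a name invented here, and the function takes the file's text as input so it can be tried without root.

```python
def deleted_mappings(maps_text):
    """Return the paths of mappings whose backing file has been unlinked.

    maps_text is the content of /proc/<pid>/maps. Each line has the fields
    "address perms offset dev inode pathname"; the pathname is missing for
    anonymous mappings.
    """
    suffix = " (deleted)"
    found = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue                      # anonymous mapping, nothing to report
        path = fields[5].strip()
        if path.endswith(suffix):
            name = path[: -len(suffix)]
            if name not in found:         # one library usually has several mappings
                found.append(name)
    return found


if __name__ == "__main__":
    # Reading PID 1's maps normally requires root.
    with open("/proc/1/maps") as f:
        for name in deleted_mappings(f.read()):
            print(name)
```

Note that this also lists things like deleted memfd or shared memory segments, which do not live on a disk file system, so treat the output as a hint rather than a verdict.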

How to fix this? A naive approach could be to issue "telinit u" every time the init system or any of its libraries are updated. However, that is not really feasible in real life, simply because it is impossible to identify all those libraries and other files, since they are not necessarily explicit dependencies. For example, via NSS and similar plug-in APIs a process might map some library that ldconfig won't tell you about. Or via the locale subsystem your process might end up mapping locale files that you can't guess in advance...

The usual approach to fix this for good is to make sure that PID 1 executes another binary (or itself) right before the unmount loop, thus releasing all maps and closing all left-over fds. In the shutdown scripts of older Fedora versions, or in RHEL, you hence found a call to "telinit u" right before the unmounts/remounts. That call appears to be missing from Upstart, resulting in file system corruption every time Upstart or any of the libraries and other resources Upstart needs are updated.
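The re-execution trick itself is tiny. Here is a generic Python sketch of the pattern, not what any init system actually ships: the helper name reexec and the injectable execv parameter are assumptions made so the example can be tried without replacing the running process.

```python
import os
import sys


def reexec(binary, argv, execv=os.execv):
    """Replace the current process image with `binary`.

    A successful exec throws away the old address space, and with it every
    mapping of old (possibly deleted) libraries. File descriptors marked
    close-on-exec are closed as well. The PID stays the same.
    """
    sys.stdout.flush()
    sys.stderr.flush()
    execv(binary, [binary] + list(argv))  # does not return on success


if __name__ == "__main__":
    # Re-execute the running interpreter; the new image just exits.
    reexec(sys.executable, ["-c", "pass"])
```

The new image maps whatever versions of the binary and its libraries are currently on disk, so nothing unlinked is pinned any longer.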

In systemd we actually went two steps further. In systemd the main init process will never bother with the unmounting or the killing spree. Instead it will simply execv() a special binary called "systemd-shutdown" (which then becomes the new PID 1) when it is time to go down. That binary is relatively simple, but also very effective in bringing down the remaining file systems and storage devices. It basically is a big loop in which it tries to unmount/remount remaining file systems, kill remaining processes, detach remaining loop devices, detach remaining swap devices, and detach remaining DM devices. It does that as long as this achieves something, i.e. as long as it managed to kill at least one remaining process, or to unmount/remount at least one remaining file system, to detach at least one loop device, or to detach at least one DM device. The code is quite robust in handling arbitrary stackings of things, since it simply tries again and again until everything that can be done is done. It will not try to understand dependencies, it just tries and tries and tries until nothing goes anymore. This is simple, and robust. After all it's your data here, and we need to be quite careful with it...
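The "try until nothing goes anymore" idea can be sketched in a few lines of Python. This is not the systemd-shutdown code (which is written in C). The names shutdown_loop and make_simulated_steps are invented here, and the stack of strings stands in for real mounts and loop devices, where only the topmost item can be taken down.

```python
def shutdown_loop(steps, max_rounds=1000):
    """Run all steps over and over until one full round achieves nothing.

    Each step is a callable returning how many things it managed to take
    down. No dependency tracking: just retry. Returns the number of rounds.
    """
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        progress = sum(step() for step in steps)
        if progress == 0:
            break
    return rounds


def make_simulated_steps(stack):
    """Build fake 'unmount' and 'detach loop' steps working on `stack`."""
    def pop_if(prefix):
        def step():
            if stack and stack[-1].startswith(prefix):
                stack.pop()
                return 1
            return 0                      # busy, or nothing of this kind on top
        return step
    return [pop_if("fs:"), pop_if("loop:")]


if __name__ == "__main__":
    # A file system on a loop device whose image lives on another file system.
    stack = ["fs:/srv", "loop:0", "fs:/mnt/image"]
    print(shutdown_loop(make_simulated_steps(stack)), stack)
```

In the simulated stacking above the outer file system only becomes free after the inner one and the loop device are gone, and the loop sorts that out without knowing anything about the ordering.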

Now, our more sophisticated shutdown loop that is preceded by execv() already makes sure we can safely and reliably shut down systems, much better than we ever could before. However, there's one more thing we do to make sure your valuable data is kept safe: in many cases the root file system might be located on a complex storage setup involving volume managers, software RAID or network storage. Often, these require userspace components to be around, and cannot be safely dealt with by the loop mentioned above: since the systemd-shutdown binary is located on the root fs, the root fs can never be unmounted, only remounted read-only. And this means that any storage system used underneath the root partition cannot be disassembled before we go down either. Well, that's at least the traditional assumption here. However, when systemd is used together with dracut (the initramfs generator and framework) we go one substantial step further. Since it was the initrd that set up the storage for the root fs and mounted it in the first place, we will actually return control back to the initrd when we go down. This involves transitioning back into the initrd file system, thus allowing the root file system to be unmounted, and the backing storage devices to be detached.

If we put this all together, then this results in an extraordinarily safe way to shut down on systemd. Upstart OTOH loses your data. Booh, Upstart, booh!

(But again, it's easy to fix this to make at least the trivial cases safe, where no DM/MD/RAID/iSCSI/loop/... is used, see above)