Yet another systemd comment

At this point, the whole discussion has started to annoy me. I have started so many replies to so many comments on reddit, posts on Arch’s forums, or mails on [arch-general] – and aborted most of them. The stupidity of the whole discussion makes me sick. But in the end, you cannot change who you are, and I am a man of many angry words.

As with every discussion, there are two extremes: Morons who praise systemd as if it would save the world, and retards who think systemd is the reason the world will end this year. Of course, all of them are wrong. If you belong to one of those groups: Fuck you! Everyone else: Keep reading.

Write programs that do one thing and do it well.

I won’t pretend to know the internal design of systemd. I also won’t argue that it actually does or does not follow that principle.

Let us instead remember that this UNIX principle was formulated decades ago. Back then, as now, what was one of the biggest problems with computers? Bugs in your software. How do you avoid bugs? Divide your work into small tasks, only write the code for each task once, put it together as you need it. How? Use one tool, let it write its output to a file or the standard output, let the next tool read the output and process it. Does it sound like a good idea? Maybe, but here’s what’s wrong with it:

  • The only means of communication between tasks is strings. Every tool has to serialize its output, and the next tool has to parse it again. The communication is one-way only, and parsing can be complex and error-prone.
  • cmd1 | cmd2 | cmd3 – an error occurs in cmd2, how do you handle that? (See the sketch after this list.)
  • For performing a complex task, you need to create a shitload of processes and call a shitload of tools. Sounds efficient.
  • Syntax is only checked at runtime.
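
To illustrate the error-handling point: the exit status of a whole pipeline is that of its last command, so a failure in a middle stage is silently swallowed. A quick sketch (PIPESTATUS and pipefail are bash-specific):

$ true | false | true; echo $?
0
$ true | false | true; echo "${PIPESTATUS[@]}"
0 1 0
$ set -o pipefail
$ true | false | true || echo "some stage failed"
some stage failed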

I’m no expert, but to me it sounds like that could be improved. So, what’s changed in the last few decades? Shared libraries. Let me repeat that: shared fucking libraries. Here’s what you get:

  • Libraries contain code for common tasks that can be integrated into any program. Bug fixes in the library automatically affect all programs that use it.
  • Well-defined APIs, structured data, less data copying.
  • Proper error reporting.
  • No bugs from output and parsing errors.
  • Syntax is checked at compile time.

I’m not going to say more about this. Does systemd follow the classical UNIX principle? I don’t care. It’s the fucking 21st century. If we have better ways of doing things, let us please use them! Let us not follow a guideline that is decades old and mostly obsolete.

systemd is not easily hackable, Arch’s initscripts are more flexible

Let us have a look at the old initscripts. They boot up your system. They start your daemons. And they don’t do that very well. They don’t come close to covering all use cases. The code is complex. Extending them means changing the scripts and doing it again after the next update. There are bugs which are hard or even impossible to fix.

A lot has improved since Tom took over maintainership of the scripts. However, it has become clear that they can only be an intermediate solution: If we want them to remain simple, we will keep missing important features. If we keep extending them, they will become so complex that nobody understands them anymore.

Now, to the people who claim systemd is not hackable: Have you looked at /usr/lib/systemd/system? You can change, disable or extend almost every single detail of it. You can override almost all of its behaviour by placing files in /etc/systemd/system, and those overrides will persist across upgrades. All of that without changing a single line of C or shell code. If you want to, you can even mask all of its units and instead make it call a script very similar to our old rc.sysinit.
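
For example (a sketch – foo.service and bar.service are made-up unit names): a unit file copied to /etc/systemd/system takes precedence over the one shipped in /usr/lib/systemd/system and can be edited freely, while a symlink to /dev/null masks a unit completely:

# cp /usr/lib/systemd/system/foo.service /etc/systemd/system/foo.service
# ln -s /dev/null /etc/systemd/system/bar.service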

In short: systemd’s hackability and flexibility far exceed those of the initscripts.

systemd – why not?

I have only been running pure systemd on my machines for a few weeks. I like it. It has some rough edges, and its support in Arch Linux is incomplete. There’s room for improvement. It will not save the world. It will not eat your children either.

A final message to the people who keep complaining

You can yell all you want, yet Arch Linux will slowly move towards systemd. Initscripts will probably keep working for a long time, but they will eventually disappear. It doesn’t help if you insult us for it. It doesn’t help if you state a thousand times that you are leaving Arch over “the systemd issue”. We don’t care. You can either embrace systemd and enjoy all its advantages, provide an alternative, or use another distribution. We don’t care. We make Arch for ourselves, and for the ones that like it like we do. Whether we have a million users or one hundred – we will keep making it the distro we like. Deal with it.

Arch Linux USB Install and Rescue Media

For some time now, it has been possible to use dd to write Arch Linux install images to USB media. Such media would boot fine, but could not be used for any other purpose. In the latest 2010.05 release, we used isohybrid for the first time to create a combined image that can be used from both CD and USB.

After such an image has been written to USB, it is even possible to add a second partition and use that for data storage. I was happy it was so easy, only to find out that it wasn’t: When a friend tried to give me a few files, I realized that Windows was unable to access the second partition. So, I set out to prepare my USB drive such that:

  • I would be able to boot the Arch Linux 2010.05 Netinstall Dual from it.
  • Non-Linux devices (Windows computers for example) could read from and write to it.
  • The whole archiso mess would be invisible to any Windows user accessing it.

Before we start, there is one thing you must never forget: Windows is god damn stupid. Even the latest and greatest Windows 7 will not recognize a filesystem on USB media unless it is on the first primary partition. It will kindly ignore anything else and produce the weirdest errors in the partition manager.

Let’s begin – this process will erase all data on the drive (we could probably do without erasing, but I didn’t try). First, create a suitable partition layout (from now on assuming that the USB drive is on /dev/sdb):

# LANG=en_US.utf8 fdisk -ulc /dev/sdb

Disk /dev/sdb: 8019 MB, 8019509248 bytes
247 heads, 62 sectors/track, 1022 cylinders, total 15663104 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xc8b9659d

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048    14966783     7482368    b  W95 FAT32
/dev/sdb2   *    14966784    15663103      348160   83  Linux
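
A layout like this could, for example, be created non-interactively with sfdisk’s script mode (a sketch – the start sectors and sizes are copied from the listing above and will differ for your drive):

# sfdisk /dev/sdb << EOF
label: dos
start=2048, size=14964736, type=b
start=14966784, size=696320, type=83, bootable
EOF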

The second partition should be about 340 MB – not smaller, but also not much bigger, as we don’t want to waste space. Also notice that the bootable flag needs to be set on sdb2. Now create filesystems on the partitions:
# mkfs.vfat -n SOMELABEL /dev/sdb1
# mkfs.ext2 /dev/sdb2
# tune2fs -i0 -c0 -m0 /dev/sdb2
# e2label /dev/sdb2 ARCH_201005

Notice the label of the Linux partition: You can pick any label, but it must match the one in the bootloader configuration, which is ARCH_201005 by default. We are finished with the FAT part now: The way we set it up, even Windows will be able to recognize and use it properly. Now, mount the archiso image and the Linux partition:
# mkdir -p /mnt/{archiso,usbboot}
# mount -o loop,ro archlinux-2010.05-netinstall-dual.iso /mnt/archiso
# mount /dev/sdb2 /mnt/usbboot

Copy all the contents of the archiso image onto the USB drive and umount the ISO:
# cp -a /mnt/archiso/* /mnt/usbboot/
# umount /mnt/archiso

All that is left to do is set up a bootloader. We will use extlinux, as we will be able to reuse all existing configuration from isolinux. First, install the syslinux package, if you don’t have it already:
# pacman -S syslinux
Now, remove the old isolinux bootloader, rename the configuration file, install the extlinux bootloader and umount:
# rm /mnt/usbboot/boot/isolinux/isolinux.bin
# mv /mnt/usbboot/boot/isolinux/isolinux.cfg /mnt/usbboot/boot/isolinux/extlinux.conf
# extlinux --install /mnt/usbboot/boot/isolinux/
# umount /mnt/usbboot

For simplicity, I didn’t rename the isolinux folder here, although the bootloader is technically not isolinux anymore. Feel free to rename it to extlinux if you feel the urge. One last step is setting up an MBR that will recognize the active flag of the second partition and boot its boot sector:
# cat /usr/lib/syslinux/mbr.bin > /dev/sdb

And you are done. Your USB drive will now boot the Arch Linux i686 and x86_64 Netinstall just like the ISO does.
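
If you want to check the result without rebooting, you can boot the stick in a virtual machine (a sketch – the qemu binary name and options depend on your qemu version, and reading the raw device requires root):

# qemu-system-x86_64 -hda /dev/sdb -m 512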

Early Userspace in Arch Linux

There have been some major changes in Arch’s early userspace tools recently. So I thought I’d take the time to sit down and explain to everyone what these changes are about.

Booting Linux Systems: Why do we need Early Userspace?

Traditionally, booting a Linux system was simple: Your bootloader loaded the kernel. The kernel was extracted, initialized your hardware and your hard disk controller, found your hard drive, found the root file system, mounted it and started /sbin/init.

Nowadays, there is a shitload of different controllers out there and a huge number of file systems, and we are a good distro and want to support them all. So we build them all into one big monolithic kernel image which is now several megabytes big and supports everything and the kitchen sink. But then someone comes along and has two SATA controllers, three IDE controllers, seven hard drives, plus three external USB drives and who knows what. The Linux kernel will now detect all those asynchronously – and where is the root file system now? Is it on the first drive? Or the third? What is “the first drive” anyway? And how do I mount my root file system on the LVM volume group inside the encrypted container residing on a software RAID array? You see, this is all getting a bit ugly, and the kernel likes to pretend it is stupid – or it simply doesn’t care about your pesky needs, especially now that it has become so fat after we built every imaginable driver in the world into it.

What now? Simple: We pass control to userspace, handle hardware detection there, set up all the complicated stuff that people want, mount the root file system and launch /sbin/init ourselves. You are probably asking yourself “How do I execute userspace applications when the root file system is not mounted?”. The answer is: magic!

What is initramfs?

Okay, the answer is not magic. The answer is actually initramfs: Each Linux system has a ramfs file system that is always mounted and called rootfs. You will probably never see it, because your real file systems are mounted over it. However, the kernel can also have a compressed cpio archive attached to it that it extracts directly into rootfs during boot. Even better, you can pass an additional compressed cpio archive to the kernel from the bootloader, which is also extracted into rootfs.
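
You can easily look inside such an archive yourself – bsdtar transparently handles the compressed cpio format (a sketch; the image name depends on your setup):

$ bsdtar -tf /boot/initramfs.img | head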

Before the kernel runs the old-fashioned init code, it checks whether rootfs contains a file called /init. If it does, it skips the traditional mounting/init code and instead executes /init. This program is now responsible for doing all those complex tasks that the kernel considers too complicated. This way, we can build a kernel that has no built-in support for any hard disk controllers or filesystems at all; instead, we build them all as modules (this is actually what we do in the Arch Linux default kernel) and include the needed ones in the initramfs image.
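
To give you an idea, a minimal /init could look something like this (a deliberately simplified sketch – a real initramfs init detects hardware and handles errors instead of hard-coding module and device names):

#!/bin/sh
# make kernel interfaces and device nodes available
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
# load the drivers for the disk controller and the root filesystem
modprobe sd_mod
modprobe ext4
# mount the real root read-only and hand over to the real init
mkdir -p /new_root
mount -o ro /dev/sda2 /new_root
exec switch_root /new_root /sbin/init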

klibc – The Purgatory of the Distro Initramfs Maintainer

klibc was originally created to be a small and lightweight C library for early userspace. It comes with a number of tools to support you in setting everything up. It also comes with klcc, an ugly Perl script that calls gcc and builds binaries against klibc instead of your usual C library. When mkinitcpio was originally created in 2006 by Aaron Griffin as a replacement for the old, inflexible mkinitrd and mkinitramfs scripts, it was decided to base it on klibc. From the beginning, klibc had lots of problems:

  • The set of shipped tools was limited and the tools that were included lacked vital options.
  • Most external tools could not be built against klibc or had to be heavily patched to do so.
  • There was no dynamic linker; all binaries were linked against a specific version of klibc – a version that changed every time anything in the klibc source or the kernel headers you built against changed, requiring a rebuild of all binaries that used klibc.
  • It was not possible to create any dynamic libraries other than klibc itself.

All this resulted in a high maintenance effort to keep udev and module-init-tools working; we also had to maintain a small klibc-extras package with our own tools to replace those missing from klibc, and we had to include more advanced applications like lvm or cryptsetup as statically linked glibc binaries.

At some point, klibc stopped being compatible with the current kernel headers and we had to introduce more and more hacks to be able to rebuild it when needed. As of Linux 2.6.30, I was unable to build a working version of klibc at all, leaving us with an old binary which could not be bugfixed anymore. In the middle of 2009, upstream died completely: no more commits were made to the git repository, and the mailing list only received a handful of posts each month. That was when I started to ask myself the following question: What is the point of maintaining a separate C library and tools that are only used for a fraction of a second each time you boot? All we supposedly gained from this was a smaller initramfs and thus a faster boot time.

Keeping it simple

In 2009, I decided that in order to be able to create an initramfs environment that offers many features and much flexibility at a low maintenance effort, the following changes needed to be made:

  • Do not maintain a separate C library for it, simply use the one from the normal system
  • For basic system and scripting tools, use busybox to get a good compromise between high functionality and small binary size
  • For filesystem label, UUID and type detection, use util-linux-ng’s blkid for full and bleeding-edge support of all new and old filesystems (see the example after this list)
  • For other advanced functions, use modprobe, udev, lvm, cryptsetup, mdadm/mdassemble from the normal Arch packages
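
For example, blkid reports everything we need to know about a volume in one line (a sketch – the output will look something like this, with the UUID shortened):

# blkid /dev/sdb2
/dev/sdb2: LABEL="ARCH_201005" UUID="..." TYPE="ext2"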

This way, I would only need to maintain the mkinitcpio scripts themselves and a properly configured busybox binary. I had used busybox for quite some time on my OpenWRT router(s) and was thus familiar with how awesome it was. It also turned out that implementing NFS root support was easier if we used the nfsmount and ipconfig utilities that were shipped with klibc.

It is February 2010 now, and in the last few weeks I finally had the time to do all the work. Just a few days ago I released mkinitcpio 0.6. This version is much more stable, more flexible and less error-prone than any klibc-based version we ever had in the past. On average, the initramfs is now between 600 KB and 1 MB bigger than the klibc-based ones – I guess nobody will ever complain about that, as it is still smaller than on most other distributions. And I am glad that I hopefully never have to touch klibc again.

Using POSIX capabilities in Linux, part two

Okay, it has been over half a year since I last wrote about this topic. And I don’t want to write about what I originally intended – which was capchroot. Instead, I am going to introduce you to the concept of inheritable file capabilities and inheritable thread capabilities, and how to use them with capsudo.

If you read part one and experimented with capabilities, you probably noticed that the set of effective capabilities gets lost whenever you execute a subprocess using one of the exec* system calls. Looking at the capabilities manpage, there is an interesting formula that explains the situation:

P'(permitted) = (P(inheritable) & F(inheritable)) |
                (F(permitted) & cap_bset)

P'(effective) = F(effective) ? P'(permitted) : 0

P'(inheritable) = P(inheritable)    [i.e., unchanged]


where:

P           denotes the value of a thread capability set before the execve(2)
P'          denotes the value of a capability set after the execve(2)
F           denotes a file capability set
cap_bset    is the value of the capability bounding set

So, to be able to inherit a capability from a parent, the following must be true (a worked example follows the list):

  • The thread must have the capability in its inheritable set.
  • The executable file must have the capability in its inheritable set.
  • The executable file must have the effective bit set (this can be omitted if the executable is aware of capabilities and raises the permitted capability to an effective capability during execution).
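
To make this concrete, here is how the formula plays out for the tcpdump example below. capsudo puts CAP_NET_RAW into the thread’s inheritable set, and the file carries cap_net_raw+ei – that is, P(inheritable) = {CAP_NET_RAW}, F(inheritable) = {CAP_NET_RAW}, F(permitted) = {} and the file’s effective bit is set:

P'(permitted) = ({CAP_NET_RAW} & {CAP_NET_RAW}) | ({} & cap_bset)
              = {CAP_NET_RAW}
P'(effective) = P'(permitted) = {CAP_NET_RAW}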

For the first point, we’ll have a look at capsudo. It’s a small tool written by yours truly, which requires libcap and iniparser. Get the source, build it with make, install the config file to /etc/capsudoers and put the binary somewhere (/usr/bin in our example). Then, run setcap cap_setpcap=p /usr/bin/capsudo. The CAP_SETPCAP capability allows it to put arbitrary capabilities into the thread’s inheritable set, but does not allow them to become permitted unless you execute a program with the correct file inheritable capability.

Now we’ll use this to allow certain users to capture traffic with tcpdump and wireshark, without setuid and without root:

  • Run setcap cap_net_raw=ei /usr/sbin/tcpdump
  • Add the following section to /etc/capsudoers:
    [tcpdump]
      caps = cap_net_raw
      command = /usr/sbin/tcpdump
      allow_user_args = 1
      users = user1 user2
      groups = group1 group2

    The users user1 and user2, as well as all members of group1 and group2, are now allowed to use tcpdump with the CAP_NET_RAW capability.

  • Run capsudo tcpdump -ni wlan0 and capture traffic.

To do the same with wireshark, we need to do something slightly different: Instead of running the setcap command on /usr/bin/wireshark, run it on /usr/bin/dumpcap. This is because wireshark does not capture itself, but calls dumpcap. The beauty here is that despite the CAP_NET_RAW inheritable capability being in the thread, wireshark has no privileged rights at all until it calls dumpcap, which then only gets the capability to capture, and nothing more.

  • Run setcap cap_net_raw=ei /usr/bin/dumpcap
  • Add the following section to /etc/capsudoers:
    [wireshark]
      caps = cap_net_raw
      command = /usr/bin/wireshark
      allow_user_args = 1
      users = user1 user2
      groups = group1 group2
  • Run capsudo wireshark and capture traffic.
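
As with any file capability, you can verify the result with getcap at any time (a sketch of the expected output):

# getcap /usr/bin/dumpcap
/usr/bin/dumpcap = cap_net_raw+ei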

Another use case would be running an HTTP server on port 80 without root:

  • Run setcap cap_net_bind_service=ei /usr/bin/yourhttpserver
  • Add the following section to /etc/capsudoers:
    [yourhttpserver]
      caps = cap_net_bind_service
      command = /usr/bin/yourhttpserver
      allow_user_args = 1
      users = httpd
    
  • Start the service with capsudo yourhttpserver and open a privileged port.

That’s all for today. I hope you enjoyed it and find ways to use this to your advantage, so that we can eventually minimize the number of places where we have to use setuid or become root.

Hello from FrOSCon 2009

We’re at FrOSCon all weekend, so I thought I’d post a group picture of the whole team:

The Arch FrOSCon team

From left to right: Diana, Andy, Jens, Gerhard, Daniel, Pierre, me, Roman and Dieter.

Shit happens when you party naked … or use crappy shell scripts

For all of you eagerly waiting for part two of my libcap writeup: It’s coming – I have written some more (and less crappy) code, but have been too lazy to write the post yet, so stay tuned.

But after today’s events, I wanted to talk about another topic: Most of you probably noticed how all core, extra and testing packages disappeared from the Arch Linux mirrors. And you are probably wondering how such a thing can happen. To understand it, you have to know a few things about how packaging works; we’ll take the extra repository as an example: After building a package, the packager uploads it to our master server and runs a script /arch/db-extra, which checks whether the SVN folder matches the data in the package, adjusts the extra.db.tar.gz file and copies the package to the right folder on the FTP. So what happens to the old version of the package? The answer is: nothing. Instead, there’s a cleanup cronjob running every 3 hours that deletes files from the FTP that don’t belong there.

To do this, the script unpacks the extra.db.tar.gz file, iterates over all packages in there and checks whether each package file is present on the FTP. If one is not, it adds the file name to a list of missing files. It then takes all files with the same package name but a different version number and adds them to a list of files to be deleted. In a second run, it looks at all files named *-i686.pkg.tar.gz or *-x86_64.pkg.tar.gz and checks whether there is a corresponding package in the database. If not, it adds the file to the list of files to be deleted as well. In the end, all obsolete files are moved to a special cleanup directory and an email is sent that reports everything that happened and warns about missing files. This ensures two things:

  • For each package in the database, there is exactly one package file on the FTP.
  • If a package file is on the FTP, we can be sure it belongs to a package in the database.

Enough theory – what does this script look like? Here it is, or at least the version that we used until a few hours ago. Now look at line 61:

bsdtar xf "$ftppath/$reponame.db.tar.$DB_COMPRESSION"

Some genius (no idea who, and I won’t use “git blame” to find out) thought it would be a great idea to use the DB_COMPRESSION variable from makepkg.conf just to find out that we use ‘gz’ as the file extension here. Some other genius (probably me, as I wrote the first version of this script) thought it was unnecessary to check for the existence of the file or to verify the return value of bsdtar. Yet another genius thought that upgrading the pacman package in the middle of the night would make the world a better place. And if you look here, you’ll see that some changes were made to makepkg.conf in the new pacman version.

So in the end, this led to the following disaster (twice!):

  • The ftpdir-cleanup script used extra.db.tar. as the db filename, which didn’t exist.
  • bsdtar failed to extract the db, leaving the directory empty.
  • When iterating over the package files, the script found that none of them was in the repository, moving them all to the cleanup directory.

Luckily, the contents of the cleanup directory are only purged on very rare occasions, so we were able to move all the files back to the right places. But most mirrors had already synced, so they currently don’t provide any Arch Linux packages. Our master server will have its bandwidth maxed out for a few days while the mirrors resync, and many users will be very annoyed. But the problems have been fixed now and we will be completely alive again in a few days.

What did we learn from this episode? Always check errors in your shell scripts, or shit is going to happen.
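
In this case, the fix is as simple as hard-coding the extension and actually checking for failure – a sketch, with the variable names from the original script:

db_file="$ftppath/$reponame.db.tar.gz"
if [ ! -f "$db_file" ]; then
    echo "error: $db_file does not exist" >&2
    exit 1
fi
bsdtar xf "$db_file" || { echo "error: extracting $db_file failed" >&2; exit 1; }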

Using POSIX capabilities in Linux, part one

Last week, I was messing around with capabilities and thought that this barely known Linux feature deserves more documentation and attention. In the traditional Unix-like system, there was just one form of privilege: root. If you are root, you are allowed to do virtually anything. However, there are certain tasks that unprivileged users should be able to perform, but usually are not permitted to. There are two solutions to this problem:

  • Temporarily gaining root powers using su or sudo
  • Setting the setuid bit on an executable

In both cases, you always gain full root powers. You’re probably thinking that with sudo or setuid it is possible to only let a user execute a limited set of commands, so it can’t be that bad, right? While you are basically right, you should know that I am – like many other Linux users – paranoid. Errors in programs or libraries could potentially be exploited by an evil user to make a setuid or sudo program do things that weren’t intended – like launching a root shell. Even if you are the only user on your system, an attacker could gain access to your computer through vulnerabilities in your browser, mail client or any other application that parses complex input from untrusted sources.

Since Linux 2.2, root privileges have been split into small bits, called capabilities. When executing a system call, instead of checking whether you are root, the kernel will check whether you have the necessary capabilities. I first read about capabilities a few years ago, only to find out that there was no way of using them in any reasonable way: root always had all capabilities, unprivileged users had none, and there was no way for a user to obtain them. In Linux 2.6.24, so-called “file capabilities” were introduced. This mechanism allows the administrator to assign capabilities to certain binaries, which the user gains upon executing them.

A list of capabilities is available in the capabilities(7) man page. In the examples below, we will concentrate on the CAP_NET_RAW capability. But first of all, you need to make sure that a 2.x version of libcap is installed on your system. If you use Arch, it has probably already been installed as a dependency of syslog-ng and other packages. If you use another distribution, don’t be discouraged – any recent one should have it.

$ pacman -Q libcap
libcap 2.16-3
$ pacman -Ql libcap | grep bin/
libcap /usr/sbin/
libcap /usr/sbin/capsh
libcap /usr/sbin/getcap
libcap /usr/sbin/getpcaps
libcap /usr/sbin/setcap

We will use the setcap utility to manipulate file capabilities. This requires that your filesystem of choice supports extended attributes. Before we finally begin, the keywords permitted, inheritable and effective have to be explained. They have slightly different meanings in the context of thread capabilities and file capabilities.

For thread capabilities: the effective capabilities are the ones whose privileges are currently available, the inheritable capabilities are the ones that can be passed on to child processes across an execve, and the permitted set limits the set of effective capabilities. For file capabilities: inheritable means that the capability can be inherited from the parent upon execution, and permitted capabilities will be available in the permitted set of the thread. In this context, effective does not refer to a specific capability, but to the file as a whole: It means that each permitted capability will be available as an effective capability once the program has been executed. This may seem confusing – it also confused me at first, especially when I found out that setting a capability to inheritable in the file doesn’t mean it will be inheritable in the thread. For now we will only work with effective and permitted; inheritable will be treated in the upcoming second part.

Enough confusion, let’s do something with it. First of all, ping needs root privileges to work – or not:

# ls -l /bin/ping
-rwsr-xr-x 1 root root 33360 Oct  4  2008 /bin/ping

As you can see, /bin/ping is setuid root. We remove that:

# chmod -s /bin/ping
# ls -l /bin/ping
-rwxr-xr-x 1 root root 33360 Oct  4  2008 /bin/ping

Now you will notice that pinging does not work anymore unless you are root. We now set the effective bit on ping and add the CAP_NET_RAW capability to the permitted set:

# setcap cap_net_raw=ep /bin/ping
# getcap /bin/ping
/bin/ping = cap_net_raw+ep

Now, you will see that ping works again. Let’s do the same with traceroute:

# ls -l /bin/traceroute
-r-sr-xr-x 1 root root 23616 Oct  4  2008 /bin/traceroute
# chmod -s /bin/traceroute
# setcap cap_net_raw=ep /bin/traceroute
# getcap /bin/traceroute
/bin/traceroute = cap_net_raw+ep

Sounds easy – but sometimes it isn’t. Some programmers seem to think it is a good idea to check whether you are root before trying to execute a specific system call. One example is tcptraceroute:

$ tcptraceroute google.de
Got root?

Even after setting cap_net_raw=ep like above, this message doesn’t disappear. This is a piece of tcptraceroute’s code:

if (getuid() & geteuid())
    fatal("Got root?\n");

Note the bitwise AND: the check triggers only if both the real and the effective UID are non-root. A similar piece of code is also in libnet, which is linked statically into tcptraceroute. When I removed the code in both of them and recompiled, the above trick also worked for tcptraceroute. We could go on and add capabilities to many programs to get rid of setuid in many places. It won’t always work, depending on how the programs are written. If a program just executes system calls without asking whether it is allowed to do so, it is likely to work without modification once you add the right capabilities – it doesn’t even have to be aware of capabilities at all.

You might get the idea of adding capabilities to other programs, for example:
# setcap cap_sys_chroot=ep /usr/sbin/chroot
But wait: This will allow any user to chroot to any directory (and it works), but we don’t want that. So please, after trying it, remove the capability again:
# setcap -r /usr/sbin/chroot

Next time, we’ll see how we can be a little more selective when allowing privileged actions to unprivileged users: we’ll have a look at capchroot and see how to solve the chroot problem, we’ll see how the inheritable flag can be used to allow certain administration tasks (like configuring network interfaces) to certain users, and finally, we will enable wireshark to capture network traffic without root privileges. Eventually – maybe in a third part – we will see if and how we can integrate this into pacman packages and thus have it automated on installation and upgrades. Even if you found this introduction boring, it will definitely get more interesting. Stay tuned.