Shit happens when you party naked … or use crappy shell scripts

For all of you eagerly waiting for part two of my libcap writeup: It’s coming. I have written some more (and less crappy) code, but I have been too lazy to write the post so far, so stay tuned.

But after today’s events, I want to talk about another topic: Most of you probably noticed how all core, extra and testing packages disappeared from the Arch Linux mirrors, and you are probably wondering how such a thing can happen. To understand it, you have to know a few things about how packaging works. We’ll take the extra repository as an example: After building a package, the packager uploads it to our master server and runs a script, /arch/db-extra, which checks whether the SVN folder matches the data in the package, updates the extra.db.tar.gz file and copies the package to the right folder on the FTP. So what happens to the old version of the package? The answer is: nothing. Instead, a cleanup cronjob runs every 3 hours and deletes files from the FTP that don’t belong there.

To do this, the script unpacks the extra.db.tar.gz file, iterates over all packages in there and checks whether each package file is on the FTP. If not, it adds the file name to a list of missing files. Then it takes all files with the same package name but a different version number and adds them to a list of files to be deleted. In a second pass, it looks at all files called *-i686.pkg.tar.gz or *-x86_64.pkg.tar.gz and checks whether there is a corresponding package in the database. If not, it adds this file to the list of packages to be deleted as well. In the end, all obsolete files are moved to a special cleanup directory and an email is sent that reports what happened and warns about missing files. This ensures two things:

  • For each package in the database, there is exactly one package file on the FTP.
  • If a package file is on the FTP, we can be sure it belongs to a package in the database.
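
The second pass can be sketched roughly like this. It is a simplified illustration and not the real ftpdir-cleanup script: the function name find_orphans is made up, and it assumes the database has been unpacked into one directory per name-version-release entry.

```bash
#!/bin/bash
# Hypothetical sketch of the second pass: print every package file that has
# no matching entry in the unpacked database.
# Assumption: the db unpacks to one directory per "name-version-release".
find_orphans() {
    local dbdir=$1 ftpdir=$2 file base entry
    for file in "$ftpdir"/*-i686.pkg.tar.gz "$ftpdir"/*-x86_64.pkg.tar.gz; do
        [ -e "$file" ] || continue        # the glob matched nothing
        base=${file##*/}                  # strip the directory part
        entry=${base%-i686.pkg.tar.gz}    # strip the architecture suffix
        entry=${entry%-x86_64.pkg.tar.gz}
        # No database entry for this file: report it as obsolete.
        [ -d "$dbdir/$entry" ] || echo "$base"
    done
}
```

Calling find_orphans with the unpacked db directory and the FTP directory prints one obsolete file name per line. Note that the result is only as good as the unpacked database it is compared against.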

Enough theory, so what does this script look like? Here it is – or at least the version that we used until a few hours ago. Now look at line 61:

bsdtar xf "$ftppath/$reponame.db.tar.$DB_COMPRESSION"

Some genius (no idea who, and I won’t use “git blame” to find out) thought it would be a great idea to use the DB_COMPRESSION variable from makepkg.conf just to find out that we use ‘gz’ as file extension here. Some other genius (probably me, as I wrote the first version of this script) thought it was unnecessary to check for the existence of the file or to verify the return value of bsdtar. Yet another genius thought that upgrading the pacman package in the middle of the night would make the world a better place. And if you look here, you’ll see that some changes were made to makepkg.conf in the new pacman version.

So in the end, this led to the following disaster (twice!):

  • With DB_COMPRESSION now expanding to nothing, the ftpdir-cleanup script used extra.db.tar. as the db filename, which didn’t exist.
  • bsdtar failed extracting the db, leaving the directory empty.
  • When iterating over the package files, the script found that none of them was in the repository, moving them all to the cleanup directory.
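
A guarded version of the extraction step could look something like the sketch below. This is not the fix that actually went into the script; the function name extract_db and its two arguments are made up for illustration. The idea is simply to stop before the cleanup pass whenever the database is missing, fails to extract, or extracts to nothing.

```bash
#!/bin/bash
# Hypothetical guarded variant of the extraction step. The name extract_db
# and its arguments are illustrative and not taken from the real script.
extract_db() {
    local dbfile=$1 workdir=$2
    # Refuse to continue if the database file is not there at all.
    if [ ! -f "$dbfile" ]; then
        echo "error: database file '$dbfile' not found" >&2
        return 1
    fi
    # Check the return value of bsdtar instead of ignoring it.
    if ! bsdtar -xf "$dbfile" -C "$workdir"; then
        echo "error: failed to extract '$dbfile'" >&2
        return 1
    fi
    # An empty work directory would make every package look obsolete.
    if [ -z "$(ls -A "$workdir")" ]; then
        echo "error: '$dbfile' extracted to nothing" >&2
        return 1
    fi
}
```

The caller would then bail out on failure, for example with extract_db "$ftppath/$reponame.db.tar.gz" "$workdir" || exit 1, where the hard-coded gz extension and the workdir variable are assumptions as well.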

Luckily, the contents of the cleanup directory are only purged on very rare occasions, so we were able to move all the files back to the right places. But most mirrors had already synced and thus don’t provide any Arch Linux packages right now. Our master server will have its bandwidth maxed out for a few days while the mirrors resync, and many users will be very annoyed. But the problems have been fixed now and we will be fully back up in a few days.

What did we learn from this episode? Always check errors in your shell scripts, or shit is going to happen.

9 Comments

  1. Mike says:

    Hmm, I thought I saw some missing packages earlier :p

  2. fukawi2 says:

    Yeah, that error handling thing sucks, but is a good idea… It’s kinda like gravity in that respect.

  3. [...] it was crazy. This is by no means the biggest cockup Arch has done recently, I recommend reading http://archlinux.me/brain0/2009/08/16/shit-happens-when-you-party-naked-or-use-crappy-shell-scripts/ for a good read on how to fuck up a repo, admitedly I have done this to Ophion twice, but not quite [...]

    • brain0 says:

      Haha, you should read what that guy posted. I commented on his blog, referring to him as a troll in one place, so I doubt he’ll approve the comment.

  4. Emess says:

    I approved it. It was in no way meant as a troll and I do apologise for offending anyone. I was considerably irate at the time and that seems to be reflected in the rant. Admittedly the aim of that particular blog is to rant about things, however it does in no way excuse personally targeting anyone, in this case the Arch developers. I think I included the link to here to emphasise a point rather than specifically say “Arch is shit” because it is by no means so. Arch does remain my favorite distro and primary operating system, but little fuckups every so often are annoying for anyone with time constraints and pressure.

    tl;dr sorry I was rude, I love Arch, broken things are annoying

  5. Allan says:

    I think the real lesson here is to never rely on a config file from another piece of software. If changes in the other piece of software are not followed closely, well…. shit happens.