March 14th, 2009


ext4, application expectations and power management

Edited to add: Sorry, it turns out that ext4 does now have a heuristic to deal with this case. However, it seems that btrfs doesn't. My point remains, though - designing a filesystem in such a way that a useful behaviour is impossible is unhelpful for the vast majority of application writers, even if it does improve performance in some other use cases.

Further edit: It's mentioned in the comments that btrfs will force a flush of data when a rename is performed as of 2.6.30. That prevents the data loss, but it means that we're still stuck with disk access when we'd be happy with that being batched up for the future. It feels like we're optimising for the wrong case here.

Original text:

There's been a certain amount of discussion about behavioural differences between ext3 and ext4[1], most notably due to ext4's increased window of opportunity for files to end up empty due to both a longer commit window and delayed allocation of blocks in order to obtain a more pleasing on-disk layout. The applications that failed hardest were doing open("foo", O_TRUNC), write(), close() and then being surprised when they got zero length files back after a crash. That's fine. That was always stupid. Asking the filesystem to truncate a file and then writing to it is an invitation to failure - there's clearly no way for it to intuit the correct answer here. In the end this has been avoided by avoiding delayed allocation when writing to a file that's just been truncated, so everything's fine.
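To make the failure mode concrete, here's a minimal sketch of that fragile save pattern, using Python's os module (whose calls map one-to-one onto the C functions named above); the function name is mine:

```python
import os

def save_in_place(path, data):
    # Fragile: O_TRUNC empties the file immediately. If the machine
    # crashes after the truncate but before the written data reaches
    # disk, you're left with a zero-length file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```

There's no point at which the filesystem can know the old contents still matter, so it can't save you here.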

However, there's another case that also breaks. A common way of saving files is to open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). The mindset here is that a crash will either result in foo.tmp being zero length, foo still being the original file or foo being your new data. The important aspect of this is that the desired behaviour of this code is that foo will contain either the original data or the new data. You may suffer data loss, but you won't suffer complete data loss - the application state will be consistent.
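That write-then-rename pattern looks something like this sketch, again using Python's os module as a stand-in for the C calls (the ".tmp" suffix and function name are just for illustration):

```python
import os

def save_via_rename(path, data):
    # Write the new contents to a temporary file, then rename it over
    # the target. rename() replaces the old name atomically, so a
    # reader sees either the old file or the new one -- never a
    # half-written mixture.
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    os.rename(tmp, path)
```

On ext3 with data=ordered, the data block writes were ordered ahead of the metadata commit, which is what made this pattern safe across a crash.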

When used with its (default) data=ordered journal option, ext3 provided these semantics. ext4 doesn't. Instead, if you want to ensure that your data doesn't get trampled, it's necessary to fsync() before closing in order to make sure it hits disk. Otherwise the rename can occur before the data is written, and you're back to a zero length file. ext4 doesn't make guarantees about whether data will be flushed before metadata is written.

Now, POSIX says this is fine, so any application that expected this behaviour is already broken by definition. But this is rules lawyering. POSIX says that many things that are not useful are fine, but doesn't exist for the pleasure of sadistic OS implementors. POSIX exists to allow application writers to write useful applications. If you interpret POSIX in such a way that gains you some benefit but shafts a large number of application writers then people are going to be reluctant to use your code. You're no longer a general purpose filesystem - you're a filesystem that's only suitable for people who write code with the expectation that their OS developers are actively trying to fuck them over. I'm sure Oracle deals with this case fine, but I also suspect that most people who work on writing Oracle on a daily basis have very, very unfulfilling lives.

But anyway. We can go and fix every single piece of software that saves files to make sure that it fsync()s, and we can avoid this problem. We can probably even do it fairly quickly, thanks to us having the source code to all of it. A lot of this code lives in libraries and can be fixed up without needing to touch every application. It's not the end of the world.

So why do I still think it's a bad idea?

It's simple. open(), write(), close(), rename() and open(), write(), fsync(), close(), rename() are not semantically equivalent. One is "give me either the original data or the new data"[2]. The other is "always give me the new data". This is an important distinction. fsync() means that we've sent the data to the disk[3]. And, in general, that means that we've had to spin the disk up.
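The second variant differs from the plain rename pattern by a single call, sketched here with the same hypothetical names as before:

```python
import os

def save_durably(path, data):
    # Same write-temp-then-rename pattern, plus an fsync() before the
    # close. This forces the data out to the platters right now -- the
    # disk has to spin up -- rather than letting the OS batch the write
    # for some convenient future moment.
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(tmp, path)
```

The application didn't ask for "the new data is on disk now"; it asked for "old or new, never nothing" - but fsync() is the only tool on offer.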

So, on the one hand, we're trying to use things like relatime to batch data to reduce the amount of time a disk has to be spun up. And on the other hand, we're moving to filesystems that require us to generate more io in order to guarantee that our data hits disk, which is a guarantee we often don't want anyway! Users will be fine with losing their most recent changes to preferences if a machine crashes. They will not be fine with losing the entirety of their preferences. Arguing that applications need to use fsync() and are otherwise broken is ignoring the important difference between these use cases. It's no longer going to be possible to spin down a disk when any software is running at all, since otherwise it's probably going to write something and then have to fsync it out of sheer paranoia that something bad will happen. And then probably fsync the directory as well, because what if someone writes an even more pathological filesystem. And the disks sit there spinning gently and chitter away as they write tiny files[4] and never spin down and the polar bears all drown in the bitter tears of application developers who are forced to drink so much to forget that they all die of acute liver failure by the age of 35 and where are we then oh yes we're screwed.

So. I said we could fix up applications fairly easily. But to do that, we need an interface that lets us do the right thing. The behaviour application writers want is one which ext4 doesn't appear to provide. Can that be fixed, please?

[1] xfs behaves like ext4 in this respect, so the obvious argument is that all our applications have been broken for years and so why are you complaining now. To which the obvious response is "Approximately anyone who ever used xfs expected their data to vanish if their machine crashed so nobody used it by default and seriously who gives a shit". xfs is a wonderful filesystem for all sorts of things, but it's lousy for desktop use for precisely this reason.

[2] Yes, ok, we've just established that it actually isn't that in the same way that GMT isn't UTC and battery refers to a collection of individual cells and so you don't usually put multiple batteries in your bike lights, but the point is that this is, for all practical intents and purposes, an unimportant distinction and not one people should have to care about in their daily lives.

[3] The disk is free to sit there bored for arbitrary periods of time before it does anything, but that's fine, because the OS is behaving correctly. Sigh.

[4] Dear filesystem writers - application developers like writing lots of tiny files, because it makes a large number of things significantly easier. This is fine because sheer filesystem performance is not high on the list of priorities of a typical application developer. The answer is not "Oh, you should all use sqlite". If the only effective way to use your filesystem is to use a database instead, then that indicates that you have not written a filesystem that is useful to typical application developers who enjoy storing things in files rather than binary blobs that end up with an entirely different set of pathological behaviours. If I wanted all my data to be in oracle then I wouldn't need a fucking filesystem in the first place, would I?