Yesterday I lost several terabytes of personal data, some small portion of it irreplaceable, because I failed to observe the most basic rule of data on computers:
No matter how fancy your storage system is, always keep a wholly independent backup.
The scenario was as follows.
I had some fairly rickety hardware backing a single Btrfs filesystem with 8 fairly large disks in it, in a raid1 configuration.
By "fairly rickety", I mean a pair of 4-bay Vantec enclosures connected to a USB3 hub, connected to a laptop, with a single USB3 port, running Ubuntu, in a tiny unventilated closet.
Given how vastly more reliable these enclosures were than the hideously awful bit-flipping mess I originally got from MediaSonic for this purpose (DO NOT BUY THIS), I thought that the Vantec was actually a pretty good deal.
But the Vantec enclosures nevertheless behaved… suspiciously. In particular, in certain situations, they would start spuriously disconnecting drives all the time.
(If anyone has a suggestion for an actual reliable thing for putting non-trivial numbers of disks onto a USB3 bus, I'd like to hear about it.)
These spurious disconnections would result in xhci (USB3) flakiness, which would cause bus resets, which would cause write errors, which would cause occasional kernel backtraces or even panics. When a disk was failing (and several disks did in fact fail in the last couple of years, and the filesystem as a whole continued performing admirably), this was the most noticeable symptom. There was no nice, clean red light on the enclosure saying "please replace this disk". Just a whole bunch of random flakiness followed by an episode of Glyph Lefkowitz, Block Storage Detective where I had to figure out which disk was screwing everything up.
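For anyone who ends up starring in their own episode, the detective work mostly amounts to correlating kernel-log noise with a specific physical drive. Something along these lines, with the device name standing in for whichever disk is currently under suspicion:
# Watch the kernel log live for resets and I/O errors as they happen.
dmesg --follow | grep -iE 'xhci|reset|i/o error'
# Pull the identity (model, serial number) and SMART health of the suspect;
# some USB bridges need "-d sat" before they'll pass SMART commands through.
smartctl -i -H /dev/sdf
smartctl -d sat -i -H /dev/sdf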
However, I chalked up the unreliability to the "hot closet", "laptop not designed to be a server", and "USB hub hanging haphazardly from a clothes hanger" aspects of this situation, rather than to a manufacturing defect in the enclosures. So I recently decided to upgrade to a for-real server and move all the disks over to it, potentially putting at least some of them into actual hot-swap SATA slots rather than an ad-hoc enclosure.
Unfortunately, the enclosures interacted even more poorly with the USB3 hardware on this new server (possibly because it has a crappy USB3 component, possibly because it has a better USB3 component that is more discerning about things like voltage underflows), and were causing write errors much more regularly.
At this point I made the hubristic mistake of attempting to "fix" the filesystem.
My bright idea was that while writing to the disks within the enclosure could damage them, reading from them should be fine, right? And deleting a device is mostly a matter of reading from it; the writes would go to the disks within the nice new SATA bays, which should be fine. Testing on other disks revealed it to be so, and so I proceeded to put the newer half of the disks into the SATA bays and the other half into the slightly more reliable, slightly newer enclosure. Then,
btrfs device delete \
/dev/[disk in the enclosure] \
/btrfs-filesystem
and I was off to the races.
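With the benefit of hindsight, a couple of cheap sanity checks before that command would have been worthwhile, and a replace is often a gentler operation than a delete, since the writes go to a single new device rather than being re-striped across every remaining member. A sketch, with placeholder device names:
# Confirm the remaining devices have room to absorb the departing one's data,
# and see how much data is actually on the device about to be removed.
btrfs filesystem usage /btrfs-filesystem
btrfs device usage /btrfs-filesystem
# If a fresh disk is available, copy onto it directly instead of deleting.
btrfs replace start /dev/old-disk /dev/new-disk /btrfs-filesystem
btrfs replace status /btrfs-filesystem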
There are two problems with that particular btrfs device delete invocation.
The first problem is that despite surviving many bad writes, many broken disks, many bad sectors and many random interruptions, Btrfs does not like having "transactions" (such as balance, device add, or device delete) interrupted. There's a small risk that the filesystem may go bad.
Coupled with this, while btrfs device delete /dev/sdx does not write anything of consequence to /dev/sdx, if it's in a filesystem with /dev/sdy and /dev/sdz, it will write to both sdy and sdz.
In my case, while sdz was on the nice internal SATA bus, sdy was also on the flaky enclosure.
Hilarity ensued. (By which I mean, weeping ensued.)
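Had I thought to check which bus each member of the filesystem was actually hanging off of before typing that command, the danger would have been staring me in the face. The check is easy enough; the output below is made up, but the shape of it is right:
# List the devices that make up the filesystem...
btrfs filesystem show /btrfs-filesystem
# ...and see which of them are reached over USB versus internal SATA.
ls -l /dev/disk/by-path/
# pci-0000:00:1f.2-ata-2                      -> ../../sdz   (internal SATA)
# pci-0000:00:14.0-usb-0:3:1.0-scsi-0:0:0:0   -> ../../sdy   (the enclosure)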
The btrfs device delete command crashed. Shortly thereafter, the kernel panicked. Then, pretty much any attempt to do anything with the filesystem resulted in errors like:
parent transid verify failed on 36911460777984 wanted 371005 found 371000
followed by a segfault, an exit with an assert, or a kernel oops.
After hours and hours of frantically trying ever more dangerous modes of recovery, I eventually had to face the fact that a lot of this data would be lost.
One interesting thing about Btrfs here is that it really doesn't give you a lot of levers to pull. The idea is that the filesystem code is just supposed to work: there's no real reason to have a million little utilities to work around bugs; the only really legit configuration option is "I know that some of this data might be corrupt, so please just try to give me something from my now-broken filesystem" as opposed to "please guarantee me the maximum correctness possible using all the checksums and redundant data you've got" which is the default behavior.
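For the record, the "please just try to give me something" end of that spectrum looks roughly like this, in increasing order of desperation (option names have shifted around between kernel and btrfs-progs versions, so treat this as a sketch rather than a recipe):
# Try a read-only mount that falls back to an older copy of the tree roots.
mount -o ro,usebackuproot /dev/sdz /mnt/rescue
# If nothing will mount at all, scrape whatever files are still reachable
# directly off the raw device, without mounting it.
btrfs restore --dry-run /dev/sdz /tmp          # list what it thinks it can recover
btrfs restore /dev/sdz /mnt/recovered-files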
Beyond that, each subsequent recovery step is progressively more potentially destructive. btrfsck (or, as it's now more properly known, btrfs check) has big warnings saying that --repair might cause more harm than it fixes. There are even more destructive options, like --init-csum-tree and --init-extent-tree, which destroy redundant information that btrfs uses to implement reliability, in the hope that this will fix issues where the code that reads that information is buggy.
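For completeness, that escalation ladder looks something like this; everything past the first command is exactly the sort of thing the documentation warns you about, so if you possibly can, image the drives before running any of it:
# Read-only check: report problems, change nothing.
btrfs check /dev/sdz
# Attempt actual repairs. The man page's warning about this one is not a joke.
btrfs check --repair /dev/sdz
# Nuclear options: discard and rebuild the checksum or extent trees, on the
# theory that the metadata describing your data is what's broken, not the data.
btrfs check --init-csum-tree /dev/sdz
btrfs check --init-extent-tree /dev/sdz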
I tried all of these things. None of them worked, but they all progressively degraded the hope of eventually recovering this filesystem.
I would of course have liked to back everything up first, but not doing so was a calculated risk. The filesystem was simply too big: it would have cost me hundreds of dollars and taken several days just to be able to physically back it up first - and even if I had tried, it's not clear how I would have found an enclosure that could hold all the additional disks I purchased to make the backup.
The irony is that the operation which destroyed the filesystem was the only reasonable way I could fix the problem.
One of the things that has been reinforced for me from this episode is that the USB ecosystem is remarkably shoddy. I'm still not even exactly sure where the bug was, or which two components didn't interoperate, or why. The problematic enclosure, attached to a different machine, seems to behave fine (I'm currently doing a Time Machine backup to an AppleRAID set up on the same disks and enclosure, which has been quite happy up to 231G so far).
(Edit: although the initial backup succeeded, it appears that periodic disconnects are, in fact, still an issue even on the Mac. Looks like the enclosure may be to blame after all.)
I have also twiddled around some "XHCI mode" options in the BIOS, which may have fixed the issue - a different USB3 enclosure stopped emitting periodic resets once I turned some inscrutable nonsense from "Auto" to "Enabled". Was that the problem? Will I ever know? I guess I'll have to try again with an external volume stress test, since the internal SATA connections are (thus far, at least) 100% error-free under an hours-long stress test.
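The stress test itself is nothing exotic; something like the following, left running for a few hours while keeping an eye on the kernel log, is enough to shake a flaky bus loose (the device name is a placeholder, and badblocks in this default mode is read-only, but do still double-check which disk you're pointing it at):
# Read-only surface scan of the whole device; hours of sustained bus traffic.
badblocks -sv /dev/sdf
# Kick off the drive's own long self-test too, and come back for the results.
smartctl -t long /dev/sdf
smartctl -l selftest /dev/sdf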
It's important to note that I don't blame the Btrfs developers here. There's a bug, sure, but that just means their software's not perfect. And this software performed almost unrealistically well through literally years of continuous, deliberate abuse. The problem was a storage monoculture: I had all of my data in one place, in one format.
Now… as it happens, that's not quite completely true. When we moved to San Francisco we made paranoid backups of everything, carried multiple copies with us, and also shipped various disks with the movers. Not all of these survived the journey, but most of the data that I was seriously concerned about losing – the really old stuff that would be impossible to replace, the stuff that isn't in the working set of any of our current machines – turns out to be in fine shape.
Oh, and where the heck do I get the drivers? Some random person on NewEgg suggests that I need to download some drivers from some sketchy site unaffiliated with the manufacturer, since neither the chipset manufacturer nor the enclosure manufacturer has any downloads officially available.
It seems as though all of these 4-bay USB3 enclosures one can get on Amazon are repackagings of minor iterations of the same awful JMicron chipset. After updating the firmware a bunch of times from various sketchy sites around the Internet, nothing has improved – if anything the drives fail slightly more regularly now (although that may just be because I've been stress-testing them).
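If you want to know exactly which of these chips you're holding before you go firmware spelunking, the USB descriptors will usually tell you; 152d is JMicron's USB vendor ID, and the bcdDevice field is (more or less) the firmware revision the bridge reports:
# List attached JMicron bridges and their vendor:product IDs.
lsusb | grep -i jmicron
# Dump the product ID and reported firmware revision for each of them.
lsusb -v -d 152d: 2>/dev/null | grep -E 'idProduct|bcdDevice'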
After 48 hours of nearly continuous web searching for solutions to this problem, it appears that the most reliable guidance is to buy a discrete USB3/eSATA translation device with substantially better reviews than any of the enclosures, and then connect everything with eSATA cables.
So I guess I'm going to spend $60 to find out if that's a viable solution.
I guess I'm still going to try this experiment, but this review indicates that it still has the "occasional reset" problem – although perhaps only during a hot-swap? – and this one indicates it's still from the cursed land of JMicron.
In fact, not only is it from JMicron, but my NexStar HX4 is a "JMS 539 PM" chipset and this one is a "JMS 539 B", which suggests that they're quite similar. Nevertheless, it seems that several people with my exact issue have switched over and it's been working well for them.
The modification times on the files from various sketchy firmware sites strongly suggest that the "JMS 539 B" is a significantly more recent chip though, which gives me hope (sort of)?
After plugging in the aforementioned DATOpic adapter… I'm cautiously optimistic: it appears that this has straight-up solved the unreliability issue (at least on Linux; I haven't tested on a Mac yet).
I've got both bays plugged in, both churning away on I/O, for the better part of 6 hours now and there is no appreciable error rate or interesting log traffic or anything. It seems as though the whole problem might have been this one ever so slightly outdated JMicron chip.
In a final, ironic twist, it turns out that with both drive bays plugged in and turned on … the server won’t POST. It just sits there at a black screen forever, no BIOS logo.
Luckily, there's a workaround: plug the drives into a USB hub, so they're not directly connected to the root hub. It seems that the BIOS will helpfully disregard them in that configuration, leading to normal POSTing. Right now I'm using a fairly old USB3 hub, and it's shaving 10 megs a second off my write performance and 30 megs a second off my read performance (as measured via btrfs→openssh→ssh→osxfuse→sshfs→Blackmagic Disk Speed Test), but that's still a good 3x faster than USB 2.0, and a better hub might have better results.
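That baroque measurement path is admittedly measuring half a dozen things at once; for a rough local read on what the hub itself is costing, timing big sequential transfers on the server is enough. Paths here are placeholders:
# Rough sequential write throughput, bypassing the page cache.
dd if=/dev/zero of=/btrfs-filesystem/speedtest.bin bs=1M count=8192 oflag=direct conv=fsync status=progress
# Rough sequential read of the same file.
dd if=/btrfs-filesystem/speedtest.bin of=/dev/null bs=1M iflag=direct status=progress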
Hopefully many of the ephemera here will have been useful to readers, but by far the most important lesson to take away from this unfortunate experience is this:
Make sure you have wholly independent backups.
The issue that I was facing here was not that disk hardware is unreliable, or that consumer disks fail.
Btrfs was compensating quite nicely for both the failure rate of the physical drives (by having raid1 level redundancy) and the errors introduced by USB weirdness, resets, and general unreliability (by checksumming all reads and relying on replicas when checksums failed).
I've said it already, but I must stress that numerous physical drives had already failed in this filesystem over its lifetime, and the recovery process from those failures was as seamless as advertised.
Ideally, “wholly independent” means separate hardware, separate software and storage format, a volume that is normally disconnected from the system it protects, and a separate physical location.
If you can do all of that, then great, but cascading hardware/software failures or malware infections are both dramatically more likely than your whole house burning down or blowing up, so the most important aspect of this is to have backups on a separate volume that can be (and often is) disconnected and moved somewhere else when there's a problem.
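Concretely, "separate volume that is usually disconnected" can be as unglamorous as one big external drive and a manual ritual. A minimal sketch, assuming the data lives in a btrfs subvolume (the paths are made up, and plain rsync to any old filesystem works just as well as send/receive):
# Plug in and mount the backup drive, then take a read-only snapshot
# of the data being protected.
mount /dev/backup-disk /mnt/backup
btrfs subvolume snapshot -r /btrfs-filesystem/data /btrfs-filesystem/data-snap
# Ship the snapshot over to the backup volume.
btrfs send /btrfs-filesystem/data-snap | btrfs receive /mnt/backup/
# Unmount the drive and physically disconnect it until next time.
umount /mnt/backup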
While the advice in this post is all still quite correct – wholly independent backups are your only realistic hope of long-term data integrity – there's an interesting quirk here for anyone trying to set up a large home storage array.
Finally, after all these failures, I split my storage array up into 3 pieces so that they'd be independent. But I was still experiencing an unusually high rate of physical device failure. Even going by numbers from as long ago as 2007, hard drives fail at an annual rate of somewhere between 2% and 10%. So, with 10 drives, I should have been replacing maybe 1 of them per year. But I was replacing them at a rate of more like 60%-70% per year.
As a final hail mary, I went out and bought a UPS, and plugged all of the enclosures into it, and…
…I haven’t lost a single disk in the intervening 2 years.
After having a couple of months of good experience with this setup (i.e. after going for about 3x my previous mean time between failures with zero failures) I went back and looked at the numerous reviews on various JBOD enclosures. The bad reviews almost all list issues which are power-related; "turning off randomly" is highly correlated with data loss.
So my working hypothesis here is that most consumer-grade JBOD enclosures are simply not conditioning their power adequately to support hard disks, and require an external UPS to ensure even a baseline level of data integrity.
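If you want to check whether your own enclosure is quietly browning out its drives, the drives themselves keep a rough tally; the exact SMART attribute names vary by manufacturer, so treat these as examples rather than a definitive list:
# Unexpected power loss and emergency head retracts show up as SMART counters
# on most drives; climbing numbers here are a bad sign for your power supply.
smartctl -A /dev/sdf | grep -iE 'power-off_retract|unexpect.*power|power_cycle'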