As we become a more technologically evolved society, our reliance on data increases. E-mail, web access, electronic documents, bank accounts: you name it. The loss of any one of these can have devastating consequences, from lost productivity to loss of home, health, or even, in extreme cases, life.
Unfortunately, I got to experience this firsthand. At the beginning of the week, there was a failure on the shared system I access at work. Initially it seemed to be merely a permissions issue; we had just lost access to the files for a short time. However, as time passed, we learned that the reality of the situation was much worse.
Like most companies, we rely heavily on shared drive access for collaboration and storage. Of course, this means that the majority of our daily work exists on those shared drives, making them pretty important. Someone noticed this at some point and decided that it was a really good idea to back them up on a regular basis. Awesome, so we’re covered, right? Well, yeah… sort of, but not really.
Backups are a wonderful invention. They ensure that you don’t lose any data in the event of a critical failure. Or, at the very least, they minimize the amount of data you lose. Backups don’t run constantly, so there’s always some lag time in there, but regardless, they keep fairly up-to-date records of what was on the drive.
To make matters even better, we have a procedure for backups which includes keeping them off-site. Off-site storage ensures that we have backups in the event of something like a fire or a flood. This usually means there’s a bit of time between a failure and a restore because someone has to go get those backups, but that’s ok, it’s all in the name of disaster recovery.
So here we are with a physical drive failure on our shared drive. Well, that’s not so bad, you’d think; it’s a RAID array, right? Well, no. Apparently not. Why don’t we use RAID arrays? Not a clue, but it doesn’t much matter right now: all my work from the past year is inaccessible. What am I supposed to do for today?
No big deal, I’ll work on some little projects that don’t need shared drive access, and they’ll fix the drive and restore our files. Should only take a few hours, it’ll be finished by tomorrow. Boy, was I wrong…
Tomorrow comes and goes, as does the next day, and the next. Little details leak out as time goes on. First we have a snafu with the wrong backup tapes being retrieved. Easily fixed; they go get the correct ones. Next, we receive reports of intermittent corruption of files, but it’s nothing to worry about, it’s only a few files here and there. Of course, we still have no access to anything, so we can’t verify any of these reports. Finally, they determine that the access permissions were corrupted and they need to fix them. Once that’s completed, we regain access to our files.
A full work week passes before we finally have drive access back. Things should go back to normal now; we’ll just get on with our day-to-day business. *click* Hrm… Can’t open the file, it’s corrupt. Oh well, I’ll just have to rewrite that one. It’s ok though, the corruption was limited. *click* That’s interesting… all the files in this directory are missing. Maybe they forgot to restore that directory. I’ll have to let them know… *click* Another corrupt file… Man, my work is piling up…
Dozens of clicks later, the full reality hits me… I have lost hundreds of hours of work. Poof, gone. Maybe, just maybe, they can do something to restore it, but I don’t hold out much hope… How could something like this happen? How could I just lose all of that work? We had backups! We stored them off-site!
So, let this be a lesson to you: backups are not a perfect solution. I don’t know all the details, but I can guess what happened. Tape backup is pretty reliable; I’ve used it myself for years. I’ve since graduated to hard drive backup, but I still use tapes as a secondary backup solution. There are problems with tape, though. Tapes tend to stretch over time, which ruins them and makes them unreliable. Granted, they do last a while, but it can be difficult to determine when a tape has gone bad. Couple that with a lack of RAID on the server and you have a recipe for disaster.
In addition to all of this, I would be willing to bet that they did not test their backups on a regular basis. Random checks of data from backups are an integral part of the backup process. Sure, it seems pointless now, but imagine how pointless it’ll feel when, after hours of restoring files, you find that they’re all corrupt. Random checks aren’t so bad when you think of it that way…
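To make that concrete, here’s a rough sketch of what such a spot check could look like. This is not what my company runs (I have no idea what they use); the paths, sample size, and helper names are all hypothetical. The idea is simply to restore the backup to a scratch location, pick a random handful of files, and compare checksums against the live copies.

```python
import hashlib
import random
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def spot_check(source_root, restore_root, sample_size=50):
    """Compare a random sample of files against their restored copies.

    Returns (path, reason) pairs for files that are missing from the
    restore or whose contents differ from the originals.
    """
    source_root, restore_root = Path(source_root), Path(restore_root)
    files = [p for p in source_root.rglob("*") if p.is_file()]
    problems = []
    for original in random.sample(files, min(sample_size, len(files))):
        restored = restore_root / original.relative_to(source_root)
        if not restored.is_file():
            problems.append((original, "missing from restore"))
        elif sha256(original) != sha256(restored):
            problems.append((original, "checksum mismatch"))
    return problems

if __name__ == "__main__":
    # Hypothetical paths: the live share and a test restore of last night's tape.
    for path, reason in spot_check("/mnt/shared", "/mnt/restore-test"):
        print(f"{path}: {reason}")
```

Anything a check like that flags tells you, well before disaster strikes, that the tapes or the restore process can’t be trusted.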
So I’ve lost a ton of data, and a ton of time. Sometimes, life just sucks. Moving forward, I’ll make my own personal backup of files I deem important, and I’ll check them on a regular basis too…
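For what it’s worth, the personal routine I have in mind is nothing fancy. Something along these lines would do, where the source and destination paths are just placeholders: copy the important files into a dated folder and drop a checksum manifest next to them, so later checks are just a matter of re-hashing the copies.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path

# Placeholder locations; point these at whatever actually matters to you.
IMPORTANT = Path.home() / "important"
BACKUP_ROOT = Path("/mnt/external/backups")

def sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup():
    """Copy important files into a dated folder and record their checksums.

    The manifest makes later checks trivial: re-hash the copies and compare
    against what was recorded when the backup was made.
    """
    dest = BACKUP_ROOT / date.today().isoformat()
    dest.mkdir(parents=True, exist_ok=True)
    manifest = []
    for src in IMPORTANT.rglob("*"):
        if not src.is_file():
            continue
        target = dest / src.relative_to(IMPORTANT)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)  # copy2 preserves timestamps
        manifest.append(f"{sha256(target)}  {target.relative_to(dest)}")
    (dest / "MANIFEST.sha256").write_text("\n".join(manifest) + "\n")

if __name__ == "__main__":
    backup()
```

Run something like that on a schedule, verify the manifest now and then, and at least my own files won’t depend entirely on someone else’s tapes.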