
But Can You Failback?

Your disaster recovery plans need to address not only failing over to standby resources, but also how you’re going to make your way back when the time comes.

Many organizations practice failover testing, wherein they engage a backup / secondary site in the event of trouble with their primary location / infrastructure. They run disaster recovery tests, and if they are able to continue their processing with their backup plan, they think they’re in pretty good shape. In a way, they are. However, few organizations think much about failing back – that is, coming back from their standby location / alternate hardware and resuming production operations using their normal set-up. It is a non-trivial thing to do, as it requires reversing data flows and perhaps even adjusting operational processes, and it’s something the organization almost never gets to practice. Further, it’s almost certainly part of what you’ll have to do (eventually) in the event of a mishap. Yet, most disaster recovery plans don’t even address failback – it’s just assumed that the organization will be able to do it once the primary site is ready again. It’s worth making this return trip part of your disaster recovery plan.
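
To make that a little more concrete, here is a minimal sketch (in Python, with entirely hypothetical step names and placeholder checks) of what treating failback as its own rehearsed runbook might look like: an ordered list of steps, each with a go/no-go gate, so the return trip gets the same explicit sequencing and rehearsal as the failover itself.

```python
# A minimal sketch of a failback runbook expressed as ordered, gated steps.
# Every step name and check below is a placeholder standing in for whatever
# verifications your own environment would require.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailbackStep:
    name: str
    check: Callable[[], bool]  # go/no-go gate that must pass before proceeding


def run_failback(steps: List[FailbackStep]) -> bool:
    """Walk the runbook in order; stop at the first gate that fails."""
    for step in steps:
        if not step.check():
            print(f"NO-GO at '{step.name}' -- stay on the standby site and investigate")
            return False
        print(f"OK: {step.name}")
    print("Failback complete -- production back on the primary site")
    return True


# Placeholder checks; a real runbook would script actual verifications here.
runbook = [
    FailbackStep("Primary site certified healthy", lambda: True),
    FailbackStep("Reverse replication (standby -> primary) caught up", lambda: True),
    FailbackStep("Data reconciliation between sites passes", lambda: True),
    FailbackStep("Cutover window approved (outside the busy period)", lambda: True),
    FailbackStep("Smoke tests on primary after cutover", lambda: True),
]

if __name__ == "__main__":
    run_failback(runbook)
```

The particulars will vary, but the point is that failback deserves the same written-down, practiced sequence that failover gets – not an assumption that it will sort itself out once the primary site is ready.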

I once worked for an organization that was big on disaster recovery / business continuity. They were pretty good at the business continuity part, and could move the work done (paying people) from one location to another pretty seamlessly in the event of crippling snowstorms, fires, and the like. (If you think that sort of thing is easy, then you probably haven’t done it.) However, their ability to do this was facilitated by the fact that all sites shared a common data center in a separate location.

Given this dependency, they also spent a fair amount of time planning for data center outages. They had regular back-ups to a remote location equipped with standby hardware, and would run twice-yearly simulations of losing the primary site and continuing operations using the backup site. After a few attempts, they had most of the kinks worked out, and understood the timings and dependencies associated with such a failover pretty well. Here I should mention that failing over a data center is much more complicated than failing over a single server, just as swapping out one piece of an engine is much easier than swapping out all of its parts and making sure they still align with one another.

One weekend, the feared “big event” arrived – our primary data center was flooded by some kind of pipe failure. Operations stopped dead. Yet, we didn’t fail over to our backup site, as almost everyone would have expected, because somebody asked a simple question – how do we come back?

The secondary site was clearly a temporary arrangement – understaffed and underpowered for what we needed to do on a long-term basis. Using it was always assumed to be a sprint – a short burst of effort that would soon subside. (Imagine the cost of having it be otherwise, and the meter on said cost going up year after uneventful year.) Such an arrangement requires some plan for ending the sprint and coming back to the normal production site as quickly as possible.

We had never practiced coming back.

Fortunately for us, the outage happened at a low period in our monthly cycle, when we could skate by with a few days of downtime. (If it had happened a week earlier or a week later, millions of people wouldn’t have gotten paid, and you would have heard about it on the six o’clock news.) So, rather than move things over to a temporary platform and then move them back just in time for our big rush, we simply waited for the primary site to be made available again. In many ways, this was bad, but the higher-ups decided it was better than the risks we would run by failing back without strong, tested processes right before our busy time. (Those risks would have been nasty things like overpayments, missed payments, and erroneous payments.)

So, we sat tight, twiddled some thumbs, and waited out the outage while the people at the primary data center worked like lunatics in order to restore it in time. Our often-practiced disaster recovery plan had proven to be a failure, not because we couldn’t handle a disaster, but because we couldn’t handle returning to normal in a reliable and predictable way.

So, if your disaster recovery plans include failing over to an alternate location or even alternate hardware, make sure they also specify what happens after the clouds lift, and practice those things too. The middle of a disaster is not the time to be sorting out those kinds of details.