Disaster Recovery – Expecting the Unexpected

Shortly after our most recent DR exercise, I took a trip to the US and UK, narrowly avoiding (in both time and location) a devastating hurricane in Texas, a failed terrorist attack in London, and wildfires in California. Being close to this sort of thing really brings home how important it is to have a plan for when things go wrong.

But it’s not just the plan. The plan is just the start, and whilst it’s important, critical even, it alone might not be enough to get you through when things go wrong.

One of the TV shows I grew up watching back in the ’80s was Gerry Anderson’s (of “Thunderbirds” fame) series “Terrahawks”. The pilot episodes of this show were entitled “Expect the Unexpected”, and I can remember being fascinated by this phrase at the time, wondering how that would be possible.

Thirty-odd years on, the phrase has stuck with me. It is still powerful and thought-provoking, and it’s really relevant when talking about failures of any kind – but particularly so for business continuity planning.

It reminds us that, no matter how hard, how carefully, or how long we plan, there are always other things that may throw us off course – from a situation or failure that we hadn’t considered, to one so unlikely we thought it would never happen. The unexpected also shows up in the things you think you have covered, that you have already tested time and time again.

So how can we possibly foretell the things that really shouldn’t go wrong, but invariably do at the worst possible moment? How do we expect the unexpected?

Many years ago we used to carefully choreograph our backup plan testing – making sure that we tested everything, but in a way that disrupted the business as little as possible, such as testing with a “representative sample” of our business, or testing each component individually rather than all at the same time. We also put a lot of effort into making sure we and our partners were well prepared, so that the testing would run smoothly and we’d end up with a successful outcome.

This isn’t unusual; working with our recovery partners, it seems to be the norm in many companies – do everything possible to ensure a successful exercise – plan, prepare, test, tick the box, get back to BAU. This also helps to keep the testing as efficient and non-disruptive as possible – so win-win, right?

The thing is that this approach can really hide the unexpected situations from the testing: those things that might happen in real life because we’re under pressure, or communication is harder, or because we don’t have some information that’s back in the office we’ve just evacuated – or a combination of all of these and more – little annoyances that can build into a real issue.

Bringing out the unexpected really needs as realistic a scenario as possible, and that’s how Telnet have approached our disaster recovery testing for the last few years. We have asked our partners to “not prepare” for anything on the night – don’t cleanly shut down telco links, don’t pre-test restored server images, and don’t stay in the office waiting for us to start our testing. If roadworks mean we can’t get started because key staff can’t get there in time, then that’s something that might happen in a real emergency – and we need to know about it.

In a strange way, rather than hoping for everything to go right in our testing, we’re almost wishing for things to go wrong – for us to find those little nuggets of “unexpected” so we’ll think about them for the next test, or in a real emergency. Ticking all the boxes first time is probably the worst outcome, as it almost certainly means we’ve missed something!

For us, this time, it was something that we test regularly, that “always works”, that gave us our “unexpected”. It didn’t work, and we ultimately discovered that there was a dependency on another computer server that nobody (not even our software vendor) knew about. We only found out because our third-party support team was delayed in starting that server – the one we didn’t even think was needed.

In the end, the Telnet team solved the problem on the night and managed to recover the contact centre back to full operation in our DR environment. If we’d been “too prepared”, though, and had our recovery environment “ready to go” to save time, this critical issue would never have been discovered. We’d have ticked all the boxes, but never found something that, in a real scenario, could have caused us major issues.
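To show how a discovery like that can be folded back into the next exercise, here is a minimal sketch (in Python) of the kind of readiness check you could run at the start of a test: it simply confirms that every known dependency – including the one nobody originally knew about – is reachable. The host names, ports, and dependency list below are hypothetical, and this isn’t our actual tooling; it’s just one way of codifying the lesson.

```python
#!/usr/bin/env python3
"""Minimal sketch of a DR readiness check.

Illustrative only: the host names, ports and dependency list are
hypothetical. The idea is that once a hidden dependency is discovered
in an exercise, it gets added here so the next test (or a real
failover) checks for it automatically.
"""
import socket

# Hypothetical dependency list, including the server nobody knew about.
DEPENDENCIES = [
    ("dr-pabx.example.local", 5060),      # contact-centre telephony
    ("dr-db.example.local", 1433),        # application database
    ("dr-licence.example.local", 27000),  # the "forgotten" licence server
]


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    failures = [(h, p) for h, p in DEPENDENCIES if not is_reachable(h, p)]
    if failures:
        for host, port in failures:
            print(f"NOT READY: cannot reach {host}:{port}")
        raise SystemExit(1)
    print("All known dependencies reachable - DR environment looks ready.")
```

Of course, a check like this only covers the dependencies you already know about – the real value of an unprepared exercise is that it keeps finding the ones you don’t.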

So, as well as having that plan, you really do need to “expect the unexpected”. That means having a team who can solve problems on the fly and who communicate well. It also means making sure that you have the tools, documentation, and resources available – even those you don’t think you will need – just in case you do.

If you want any more information on our DR testing, particularly if you are one of our clients, then feel free to get in touch with me or your account manager.

Steve Hennerley
CTO – Telnet Services Ltd
October 2017