Do you have a plan for outages?

Yesterday’s outage at Amazon is a timely reminder of our need to have a contingency plan. Once services are moved to the cloud, we must consider the impact on the business when those cloud services are unavailable.

The concept of an always-on cloud too easily fools us into thinking that the cloud service will always be there.  After all, surely the cloud service has high availability, redundant systems and data centers in place; or does it?  Yesterday's Amazon outage cost me personally only a little inconvenience and frustration: when I tried to refresh my Audible library, I could not download the audiobook I had purchased, despite credits having been deducted from my account.  But what about the businesses that were running apps and other services in the Amazon cloud?

Moving to the cloud has benefits on many levels and those benefits far outweigh the risks.  However, gaining all of the benefits does not permit us to abdicate our responsibility to assess risk.  As with any risk discussion, the concepts of impact and likelihood are still relevant.

Hence, the first thing we need to determine is the impact of losing the cloud service.  How severely is the business impacted?  Have we performed a business impact analysis (BIA) and quantified any of this?  How long can the business tolerate an outage before it suffers catastrophic harm?  (Meaning the business ceases to exist.)  Do we know the cost of an outage by the minute, by the hour, by the day, and so on?

Having quantified the impact, we must then consider likelihood in our planning.  Given enough time, almost any outage becomes a certainty, and hence we need to time-bound the likelihood so that the metric makes sense in comparison to the impact.  A frequently used period is a year, since most businesses budget annually.  Estimating the annual rate of occurrence (ARO) tells us how many times we expect a given event or outage to occur during any year.  (And it can be a number less than one.)

From here we develop our plan using the impact and the likelihood metrics we have determined.  For example, if we determine that a half-day outage will have an impact of $50,000, and we expect to encounter such an outage once in any given year, then our expected annual loss is $50,000 and it makes no sense to spend more than that per year implementing a continuity solution.  Should we spend less than the $50,000 and the solution prevents the loss, then we improve our profitability by the difference; if the solution only partially mitigates the risk, the gain is the reduction in expected loss minus what we spent.  Or, we may simply decide to accept the risk and associated impact.
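The arithmetic behind that example can be sketched in a few lines of Python.  This is only an illustration under simple assumptions: the function names are hypothetical, the expected annual loss is taken to be the impact of one outage multiplied by the ARO, and the ARO is estimated naively from a count of past outages.

```python
def annual_rate_of_occurrence(events: int, years_observed: float) -> float:
    """Naive ARO estimate: outages seen divided by years observed (assumption)."""
    if years_observed <= 0:
        raise ValueError("years_observed must be positive")
    return events / years_observed


def expected_annual_loss(impact_per_outage: float, aro: float) -> float:
    """Expected loss per year: impact of one outage times its ARO."""
    return impact_per_outage * aro


def net_annual_benefit(loss_before: float, loss_after: float,
                       annual_cost: float) -> float:
    """Reduction in expected annual loss, minus the yearly cost of mitigation."""
    return (loss_before - loss_after) - annual_cost


# The example from the text: a $50,000 half-day outage expected once a year.
loss = expected_annual_loss(50_000, aro=1.0)

# Hypothetical mitigation costing $30,000 a year that removes the loss entirely.
benefit = net_annual_benefit(loss, loss_after=0.0, annual_cost=30_000)

print(f"Expected annual loss: ${loss:,.0f}")
print(f"Net annual benefit of mitigation: ${benefit:,.0f}")
```

A negative net benefit means the mitigation costs more than the loss it avoids, which is the point at which accepting the risk becomes the cheaper choice.  An outage seen once in four years gives an ARO of 0.25, which cuts the expected annual loss, and the justifiable spend, to a quarter.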

Unfortunately, the possibility of a catastrophic disaster remains, albeit with a slim likelihood.  Yes, it is still possible for lightning to strike in the same place twice.  Despite our best and most prudent planning, we will never manage to mitigate all risk; this is where insurance comes in, allowing us to transfer the risk that remains.

Yesterday’s outage encouraged me to think about cloud services which I take for granted.  Should you be reviewing your business contingency plans?
