Wednesday, July 8, was what we have dubbed ‘triple glitching’ day, as a trio of high-profile system outages caused disruption across a range of industries. The New York Stock Exchange was dogged by technical difficulties that lasted the best part of the trading day, with a complete halt in trading for approximately four hours; United Airlines saw its fleet of aircraft grounded for two hours as it struggled to get systems back online following a network outage; and The Wall Street Journal suffered an outage that left its site unavailable for almost an hour.
Although the cases are not interconnected, for those of us who work in technology they demonstrate once again the growing dependence that all industries place on IT. That is no surprise to any of us. What feels different now is that when technology fails, it is headline news in the mainstream media, not just the technology press. It puts the spotlight back on technology managers to consider the availability of their systems (see related blogs on optimising application availability and defining system requirements). Linked to availability, it also means thinking again about one of the key characteristics in managing IT operations: resilience.
Resilience is defined as something’s ability to return to its original form after being bent, compressed or stretched. The question is, how do you bestow this property on your critical business services and associated technologies? How do you ensure your technologies and processes can flex to accommodate external shocks, but return to their original form, without breaking? Or, if they do break, how do you ensure that the impact is seen as a managed “degradation of service” rather than a “business impacting outage”?
Most IT organisations focus on resilience only after something has broken, but this approach typically leads to tactical solutions, akin to adding extra disk capacity only after you have run out and crashed your platform. Instead, firms should look to instil more of a culture of resilience into their operations.
So what are the key lessons to learn when trying to make sure your IT operations are more resilient?
- Becoming resilient is not something that can be accomplished through technology alone. Resilience requires proper planning, solution design, testing, integration and proper operation. Resilience is about changing culture, operating models and technology. It encompasses disciplines such as capacity management (to make sure there is always ample headroom for growth and spikes in utilisation), as well as operating best practices, like ensuring notifications are issued to the right people when any changes are made.
- Do not assume that achieving resilience will be easy. The ROI is not clearly demonstrable so getting funding may be your first challenge. Shift the focus to opportunity cost and franchise protection, and evaluate the cost of some reasonable scenarios and their probability of occurrence. Maybe even take one of the recent examples in the press and attempt to work out the cost of a similar outage to your organisation.
- Demonstrate the value of investments through things that you can measure. Start by defining, measuring, and reporting on key resilience indicators. Set targets that are achievable. These are easier to define than you might imagine, though obtaining the data may not be as easy.
- As with anything, do not attempt to boil the ocean. Know where you are, where you want to go, and build a reasonable plan to get there. Your technology platform cannot become resilient overnight. This is a journey. Lay a solid foundation, and ensure that you create a plan that will start by ticking off the high value, low effort items first. Build goodwill, show value, and then do more.
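
To make the funding and measurement lessons above a little more concrete, here is a rough sketch in Python of how you might cost a few outage scenarios and report one simple resilience indicator. Everything in it is an assumption for illustration: the scenario names, probabilities, durations and hourly costs are invented, and the names `OutageScenario`, `expected_annual_cost` and `availability_percent` are hypothetical, not taken from any standard or tool. Treat it as a starting point for a conversation with your finance team, not as a costing model.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """One 'reasonable scenario' to put in front of budget holders (hypothetical)."""
    name: str
    annual_probability: float  # estimated chance of it happening in a year (0 to 1)
    duration_hours: float      # how long the service would be down
    cost_per_hour: float       # lost revenue plus recovery cost per hour of outage


def expected_annual_cost(scenario: OutageScenario) -> float:
    """Probability-weighted cost of a scenario over one year."""
    return (scenario.annual_probability
            * scenario.duration_hours
            * scenario.cost_per_hour)


def availability_percent(outage_minutes: float, period_minutes: float) -> float:
    """A simple key resilience indicator: share of the period the service was up."""
    return 100.0 * (1.0 - outage_minutes / period_minutes)


# Illustrative figures only; replace them with your own estimates.
scenarios = [
    OutageScenario("Network outage", 0.20, 2.0, 50_000.0),
    OutageScenario("Trading platform halt", 0.05, 4.0, 250_000.0),
]

if __name__ == "__main__":
    for s in scenarios:
        print(f"{s.name}: {expected_annual_cost(s):,.0f} expected cost per year")

    total = sum(expected_annual_cost(s) for s in scenarios)
    print(f"Total exposure: {total:,.0f} per year")

    # One hour of downtime in a 30-day month.
    month_minutes = 30 * 24 * 60
    print(f"Availability: {availability_percent(60, month_minutes):.3f}%")
```

Multiplying probability by impact is deliberately crude, but it gives you a number to set against the cost of the resilience work, and the availability figure is the kind of indicator you can report against a target month after month.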