Organisations that have suffered from catastrophic IT outages often describe the cause of the disruption as ‘unexpected’ when, in hindsight, the issue could have been identified and mitigated. Outages – those not related to poorly executed IT change – fall, largely, into two categories. Firstly, those resulting from irregular failure events (such as location, power, hardware, etc.) and secondly, those resulting from internally or externally activated malicious activity. An example of the first would be the outage at VISA in 2018, where a single data centre component failed, resulting in a 10-hour outage that impacted European VISA transactions. An example of the second would be the WannaCry ransomware which, beyond the well-publicised impact on Britain’s NHS, affected computer systems in 150 countries worldwide. In both instances, critical systems availability was impacted, even though one was caused by what might be considered a ‘traditional’ failure of IT infrastructure and the other by a cyber attack.
Many firms derive a false sense of security from their Business Continuity and Disaster Recovery capabilities. Typically, these plans focus only on the more ‘traditional’ failure scenarios and not the recoverability of services following – or even during – a cyber attack. Planning resilient IT services requires full consideration of all eventualities, regardless of the cause, and must include cyber resilience planning.
Incorporating cyber incidents into resilience planning is advised by the National Institute of Standards and Technology (NIST), whose guidance for the handling of security incidents (SP 800-61r2) should inform business continuity plans. NIST breaks down a cyber incident into phases: detection, analysis, containment, eradication and then recovery. Throughout each phase, the affected IT services may be unavailable and, depending on the attack, data may be compromised. Compromised data may need to be restored from a point in time prior to the commencement of the attack and designated as ‘safe’.
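Choosing a ‘safe’ restore point can be illustrated with a short sketch. It assumes a hypothetical list of backup timestamps and an attack start time estimated during the analysis phase. The function name `latest_safe_restore_point` and the `safety_margin` parameter are illustrative assumptions and do not come from the NIST guidance.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_safe_restore_point(
    restore_points: Sequence[datetime],
    attack_start: datetime,
    safety_margin: timedelta = timedelta(0),
) -> Optional[datetime]:
    """Return the most recent restore point taken before the attack began.

    safety_margin is a hypothetical allowance that moves the cut-off earlier
    to cover uncertainty in the estimated attack start time.
    """
    cutoff = attack_start - safety_margin
    # Only restore points strictly before the cut-off are treated as safe.
    candidates = [point for point in restore_points if point < cutoff]
    return max(candidates) if candidates else None
```

If no restore point predates the cut-off, the function returns `None`, which in practice would mean the data cannot be recovered to a known good state from the available backups.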
Broadly speaking, there are four cyber attack scenarios that impact system availability:
- Malware infection: Once malware is discovered, the first step is to isolate the infected infrastructure and prevent any further lateral infection. Typically, hosts are shut down until the malware has been identified and the firm knows that no further infection is possible. This might require a new set of signatures and agent updates to the anti-malware solutions. Sometimes, signature updates and anti-malware policy updates can take hours to retrieve from a vendor and replicate out and it’s only then that the affected services can be fully recovered. In addition to this, some servers might need to be restored, adding a further delay. Often, business data is unaffected and only the OS directories need to be restored and these can be recovered more quickly. In this type of scenario, it is easy to see full recovery and service restoration taking several hours.
- Ransomware infection: A form of malware that is successful if it can encrypt a host’s accessible data before detection. By design, it encrypts data very quickly, often in hours. Setting aside paying the ransom (which is never advisable), recovery efforts should follow the same recovery path as malware but with the added complication that business data also need to be recovered from a known good point. Restore times, using traditional recovery from virtual tapes, are directly related to the type of data and volume to be recovered. Large amounts of data comprised of many small files can take hours to restore using these technologies. In this recovery scenario, there may be an extended outage period that consists of the time to stall an attack, restore a server and then its data. If multiple hosts are impacted, then the capacity of restore services may also suffer and delay things further. It is easy to see that recovery from a large ransomware attack using traditional technologies may involve effort measured in days rather than hours.
- Denial (or Distributed Denial) of Service: An attack launched externally on the online digital services offered by an organisation. Attacks are on average ~4 hours in duration although they can last days, and it is common for the same firm to be attacked multiple times. Without DDOS protection, there is little that a firm can do besides wait for the attack to finish. With protection from a DDOS provider, recovery will be a function of how quickly an attack can be diagnosed and how effective the protection is. Invoking DDOS protection involves scrubbing all traffic to a firm, which may also impact traffic in unusual ways. In this scenario, diagnosis may require an hour or more, and invoking protection protocols may not fix all issues. As well as the initial response, some services, including web servers or backend services, may need to be fully recovered as a result of increased web traffic.
- Information disclosure by a malicious insider: While not a typical cyber attack, the response to contain an unauthorised disclosure of information is very similar to the recovery path for malware. In both cases, it is necessary to isolate the threat actor. Compromised firms might first choose to shut down or prevent access to the system from which the leak has occurred. Once the leak is contained, the firm must then take steps to identify and prevent the same insider from causing further harm. During this time the system is unavailable, and the duration of the outage is directly related to the organisation’s ability to identify the attackers and to isolate them. Additional time might be needed to follow a forensic audit trail if a crime has been committed. Securing a conviction often requires specialist skills and equipment that must be shipped to site. Typical recovery and impact to a system are likely to take hours from first discovery.
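The ransomware scenario above describes an outage made up of the time to stall the attack, restore each server and then restore its data, with restore capacity limiting how many hosts can be recovered at once. A rough sketch of that arithmetic is shown below. The function name, its parameters and the example figures are hypothetical. The model also simplifies by assuming every host holds the same volume of data and that each parallel restore achieves the stated throughput.

```python
import math


def estimate_outage_hours(
    containment_hours: float,
    hosts: int,
    server_restore_hours: float,
    data_gb_per_host: float,
    restore_gb_per_hour: float,
    parallel_restores: int = 1,
) -> float:
    """Rough outage estimate: contain the attack, then restore hosts in batches."""
    if hosts < 0 or parallel_restores < 1 or restore_gb_per_hour <= 0:
        raise ValueError("hosts must be >= 0, parallel_restores >= 1, throughput > 0")
    # Each host needs its server rebuilt and then its data restored.
    per_host_hours = server_restore_hours + data_gb_per_host / restore_gb_per_hour
    # Limited restore capacity means hosts are recovered in sequential batches.
    batches = math.ceil(hosts / parallel_restores)
    return containment_hours + batches * per_host_hours


# Hypothetical figures: 4 hours to contain, 10 hosts, 1 hour per server rebuild,
# 500 GB per host restored at 250 GB/hour, two restores running in parallel.
print(estimate_outage_hours(4, 10, 1, 500, 250, parallel_restores=2))  # 19.0
```

Even with these modest illustrative numbers the estimate approaches a full day, and it scales linearly with the number of batches, which is why large attacks can push recovery into days.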
The response to any of the above scenarios would not use traditional BCP/DR plans because it is unlikely that the failover of a service or even an entire site would improve recovery options. Specific technologies might reduce recovery times, for example, the ability to restore services quickly from snapshots. However, the reader should also note that these must themselves be assessed for vulnerabilities, so as to avoid the position CodeSpaces found itself in in 2014, when its data and cloud backups were deleted by an attacker and the firm never recovered.
Aligning BCP/DR planning and the recovery plans for cyber attacks needn’t be a difficult task. Some firms may need to connect component plans since these are often fragmented across departments such as Security, Incident Management, Capacity Management, and Business Continuity. Connecting the dots may highlight gaps between the published RTO or RPO for a service and the recovery time following a cyber attack. Once the plans have been aligned, the organisation can improve resilience capability by:
- Closing gaps in the RTO and RPO for a service where plans show that recovery following a cyber attack would likely overrun these objectives. It may be that investment is required or that a service needs re-architecting to be more resilient.
- Rehearse, rehearse, rehearse. The plan, like any plan, should be rehearsed regularly enough to be familiar to its participants and stakeholders. A great deal of time is often wasted during an incident because roles and communications channels are unclear. This was especially evident in the recent TalkTalk cyber incident, when communication between IT teams and the board was an issue that prolonged recovery.
- Threat intelligence analysis can be used to determine the types of attacks most likely to occur and which organisations are likely to be targeted. Where a firm is exposed to a higher level of risk, pre-emptive steps can be taken. Some recovery efforts can be pre-staged, such as routing through your DDOS protection provider or increasing the frequency of data recovery points.
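The gap-closing step in the list above amounts to comparing each service’s published RTO with the recovery time its cyber recovery plan implies. A minimal sketch follows, assuming the estimates already exist. The `ServicePlan` structure, the `rto_gaps` function and the sample services are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ServicePlan:
    name: str
    rto_hours: float  # published Recovery Time Objective
    estimated_cyber_recovery_hours: float  # estimate taken from the cyber recovery plan


def rto_gaps(plans: Iterable[ServicePlan]) -> List[Tuple[str, float]]:
    """Return (service, overrun in hours) where cyber recovery would overrun the RTO."""
    gaps = [
        (plan.name, plan.estimated_cyber_recovery_hours - plan.rto_hours)
        for plan in plans
        if plan.estimated_cyber_recovery_hours > plan.rto_hours
    ]
    # Largest overrun first, so investment or re-architecture can be prioritised.
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


# Hypothetical services and figures, for illustration only.
plans = [
    ServicePlan("payments", rto_hours=4, estimated_cyber_recovery_hours=19),
    ServicePlan("intranet", rto_hours=48, estimated_cyber_recovery_hours=24),
    ServicePlan("email", rto_hours=8, estimated_cyber_recovery_hours=12),
]
print(rto_gaps(plans))  # [('payments', 15), ('email', 4)]
```

The same comparison could be made for RPO by substituting the interval between recovery points for the recovery time.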
Firms are exposed to a wide range of outage risks. Planning resilience requires a holistic approach that covers significant threats to systems availability and business data. Traditional attitudes and solutions to minimising risk are no longer sufficient, and firms are strongly advised to focus on plans that are broader in their scope (i.e. to include current and emerging cyber risks), tightly integrated across siloed functions, and properly documented and rehearsed.