
What can we learn from the worst IT outage ever?

How could one piece of software cause such chaos and how can we prevent it happening again?

A passenger stranded in Atlanta as a major IT outage on Friday grounded flights around the world. Photograph: Megan Varner/Getty Images

On Friday 19th July, what has been described as ‘the worst IT outage in the world’ brought critical business, transport, health and global communications infrastructure to their knees.

Airlines were grounded and airports forced to use paper ticketing and announcements. Broadcasters such as Sky News were temporarily off the air. Vital services including healthcare, emergency services and public transport apps – including Transport for Ireland’s app – were disrupted.

How could a seemingly small software update cause widespread global disruption that could take ‘some time’ to fix? Is this a rarity, or is it the “new normal” in our modern age of cloud computing? And what can we learn from older communication systems to avoid this chaos in the future?

The global IT outage was caused by two separate but possibly related big IT failures. On Thursday 18th and Friday 19th July, the outages on Microsoft’s cloud computing platform Azure and a defective update from cybersecurity company CrowdStrike cascaded through global IT systems, causing serious IT infrastructure failures as well as rendering individual Windows computers temporarily inoperable.


The timings and sequence of what happened are still being established, but currently the main reason for the failure seems to have been a defective CrowdStrike Falcon Sensor update applied to Windows servers and computers.

Falcon Sensor is a combined firewall, antivirus and cybersecurity ecosystem designed to detect, monitor and block cybersecurity attacks in real time and is popular across different cloud computing services. The defective update that caused the problem was categorised as a minor content update, and so did not receive the standard rigorous testing before being rolled out across many of the world’s Windows servers and computers.


The result? Many “blue screens of death”, with the impacts rippling through the world. Thousands of flights and public transport journeys were cancelled or delayed. In the UK, Sky News was temporarily off air, the NHS and some GP appointment systems became inoperable and the London Stock Exchange was unable to issue news updates.


In Ireland, Dublin and Cork airports, Transport for Ireland and NCT test centres were affected.

And although this IT outage was not the result of a deliberate or malicious cyberattack, it does demonstrate the potentially catastrophic risks associated with failures in cybersecurity, including phishing and malware attacks.


One of the last big IT outages was the ‘WannaCry’ ransomware attack in 2017, which affected more than 200,000 computers in more than 100 countries, including medical devices and computers in the UK’s NHS, with impacts on people’s lives and medical treatment. The attack was mitigated by cybersecurity expert Marcus Hutchins’s rapid discovery of a ‘kill switch’ that slowed the spread of the malware.

Security and reliability have been at the heart of our telecommunications systems from the earliest days of the electrical telegraph network in the mid-1800s, right through to the development of the Internet in the atomic age.

As the electrical telegraph system developed into a vast global network over land and undersea in the late 1800s, Britain ensured the security of its messages by developing the “All-Red Line”, a network of electric telegraph cables stretching around the globe which passed only through British-controlled territories and colonies.


Crucially, the British also designed redundancy into this vast network, so if one cable was cut, a message could be sent through many other routes. In 1911, with the possibility of a war in Europe looming, the British Committee of Imperial Defence concluded that it would be essentially impossible for Britain to be isolated from the telegraph network because of this built-in redundancy: 49 cables would need to be cut for Britain to be cut off, 15 for Canada, and five for South Africa.

From the 1950s through to the 1970s, people working in universities and the military began to consider what a global information-sharing library or network might look like.

They wanted to design a decentralised system with sufficient redundancy, resilience and security to survive multiple points of failure caused by a nuclear attack. Their work and ideas inform and underpin the Internet and, later, the world wide web we still use, but some of their lessons seem to have been forgotten in today’s cloud computing era.


Major IT outages have thankfully been rare so far, the most recent example being the ‘WannaCry’ ransomware attack in 2017. But the growing adoption of cloud computing has only increased the risk. What can be done to ensure that any future – hopefully rare – outages, whether caused by accident or by design, are less damaging?

Lessons from the electric telegraph network in the Victorian Age and the advent of the Internet in the 1960s teach us that redundancy and decentralisation sit at the heart of resilient and reliable systems.

Today, that can be translated into rigorous testing and authentication of third-party updates, diversification across different platforms and cloud computing services, and something that has existed since the advent of electronic computing in the 1940s – good old-fashioned backups.

Elizabeth Bruton is a historian of technology, former curator at the Science Museum, London and an Honorary Research Fellow at the History of Science Museum, University of Oxford.