Global outage caused by CrowdStrike is a flashing red light

Someday a boring piece of technology – overloaded, neglected or poorly installed – will cause a genuine disaster

Passengers stranded during the IT outage last week. Photograph: Joe Raedle/Getty Images

For a couple of years now, the artificial intelligence community has been warning that there is a chance their work will go south and humanity will end in a conflagration worthy of a superhero movie.

Last Friday brought a pointed reminder that disaster is at least as likely to creep in quietly, perhaps from a piece of technology so mundane that hardly anyone knows it exists.

Our lives are built on systems piled on systems. As we board airplanes, cross bridges, pay bills, download updates, track our children at camp and generally try to make it through the day, we take them for granted.

Until they fail.

This past week’s global software outage, immediately proclaimed as the biggest in history, was not caused by terrorists or AI or rogue hackers demanding billions in ransom. It wasn’t even done as a lark by some off-the-charts smart teenager. Those are the Hollywood versions. Instead, it was a routine upgrade that somehow went off the rails.

CrowdStrike, a Texas company, specialises in protecting corporate clients from cyberthreats. It has been very successful at this. This time, though, the threat came from CrowdStrike itself, a problem for which it seemed unprepared.

The trouble began with a small Windows software update that CrowdStrike sent to its customers last Thursday night. For some reason, this crashed every computer it touched. “Your PC ran into a problem,” users were cheerily informed. “It looks like Windows didn’t load correctly,” messages announced. The backdrop was the colour of a perfect sky, also known as the “blue screen of death”.

Any system can fail, and usually in unexpected ways. The Great Blackout of 1965, another contender for the greatest technology stumble of all time, shut off the electrical grid for 30 million people on the US east coast. Silicon Valley couldn’t be blamed because Silicon Valley barely existed, but the culprit – a bad relay at a Canadian power station that caused a cascade of issues that broke the system – was equally mundane.

Living in the modern world is an act of faith. Most of the time, we don’t think about it. Then the airplane we’re on shakes with turbulence. Or we read about how a door blew off. Or how planes crashed. Or – and this happened to people on thousands of flights last Friday – we can’t even get on the plane. Delta grounded flights for hours. Ryanair’s online check-in was unavailable, leading to large queues building at Irish airports. NCT centres were turning people away, unable to fill appointments due to the glitch. It was worldwide pandemonium.

Planes are, for obvious reasons, a central theatre of anxiety when technology is having a breakdown. But even those who weren’t trying to travel were upset last Friday. The computers couldn’t manage to get out of the passive voice to assign responsibility for their collapse, much less fix themselves, and the humans, at least initially, were not much better.

“It’s a mess,” Brody Nisbet, an executive at CrowdStrike, wrote on social platform X as he suggested a possible workaround. “I’ve no further actionable help to provide at the minute.” He added a disappointed-face emoji.

The message was later deleted.

CrowdStrike probably failed to do its due diligence, programmers said. Trying the patch out on a variety of Windows machines before sending it out to customers could have helped detect the issue.

“They should have had a test machine to emulate some of their clients’ old boxes and they would have seen the blue screen of death,” said Matt Mitchell, a hacker and founder of CryptoHarlem, a cybersecurity education and advocacy organisation.

CrowdStrike is not some tiny start-up. Founded in 2011, it has 8,000 employees and a stock market valuation that was heading to $100 billion, at least before the outage caused some investors to jump ship. CrowdStrike shares closed down 11 per cent last Friday.

The scale of the outage was significant, affecting an estimated 8.5 million machines according to Microsoft.

“When your software or your system reaches that level of magnitude, it becomes from my perspective an essential service,” said Puneet Kukreja, cybersecurity lead for EY UK and Ireland. “You just expect the water to be in your taps, you don’t go and check the water for ion levels to see if it’s drinkable or not.”

If the company doesn’t have the name recognition of some bigger tech firms, it has its share of arrogance. A portion of its website is devoted to trash-talking its competitors. “Microsoft’s security products can’t even protect Microsoft. How can they protect you?” CrowdStrike asks. Avoid Palo Alto Networks, it demands: “Don’t settle for a high-cost platform that’s hard to use, hard to deploy, and hard to manage.”

A message last Friday from CrowdStrike CEO George Kurtz seemed to minimise the outage, calling it “a defect found in a single content update for Windows hosts”. People complained that Kurtz was slow to offer an apology. (Hours later, he said, “I want to sincerely apologise directly to all of you for today’s outage.”) CrowdStrike did not respond to a request for further comment.

Information technology workers at affected companies were faced with a choice: walk around to each offline machine and remove the bit of flawed code, or wait and hope for a solution from CrowdStrike.

“The workaround works if you can walk to every laptop, type on the keyboard and reboot it manually,” said Mikko Hypponen, a security expert and chief research officer at WithSecure, a cybersecurity company. “The problem that this poses is that, normally, large enterprises, which is what CrowdStrike customers are, maintain their fleet” with centralised controls.

In other words, the traditional way to fix a balky computer – turning it off and then turning it on again – was still the only solution, even as the computers themselves are now increasingly woven into worldwide networks. But the travellers trapped at the airport could not reboot those screens that were preventing them from flying.

What Kurtz called “a defect found in a single content update” is a modern-day threat. Only a few years ago, software updates were more complicated, more tedious. Every computer system was not linked to every other system, which meant failures were more contained.

“When it comes to cybersecurity, we talk about defence in depth – having a moat and then archers and a gate around the castle. We talk about having it set up where there is no single point of failure. But we are creating a situation where there is a single point of failure,” Mitchell said.

The most recent outage showed just how vulnerable companies are and underlined the need to have plans in place to deal with such events to help minimise disruption.

“In today’s interconnected digital world, where businesses rely heavily on seamless integration of various IT systems and networks, the potential for disruptions is a constant reality,” said Kukreja. “Systems going down due to cyber incidents, technical failures, or other unforeseen events can have widespread impacts, affecting not just one organisation but potentially its partners and customers as well across its supply chain. Therefore, IT resilience has become a fundamental aspect of business operations, enabling companies to quickly recover and maintain continuity in the face of such disruptions.”

Irish cybersecurity provider Integrity 360’s CTO Richard Ford said questions needed to be answered.

“We should also think about the current approach to security, and the unification of technologies and vendors,” he said, warning that a lot of eggs were being put in one basket. “Will the impact of this incident cause the more risk-averse organisations to distribute their security controls and risk across more vendors and segment areas of the business? Potentially, but all will definitely be acutely aware of the trust we put into vendors and our recovery plans.”

People took the 1965 blackout in stride. The CrowdStrike outage was disruptive, but it has not yet been linked to any deaths. People had the weekend to complete their interrupted journeys. If CrowdStrike is lucky, the trouble will be forgotten within days if not hours.

Someday, though, the rest of us may not be so lucky, and some piece of boring technology – overloaded, neglected or poorly installed – will cause a genuine disaster.

“It highlights the chaos that can happen when organisations within those sectors suffer an incident, be that an IT problem or as the result of a malicious attack,” said security expert Brian Honan. “There is no such thing as 100 per cent security. It’s all down to risk management, identifying the risks, and managing those risks as best you can to minimise the impact on the ledger, if something happened.”

A software breakdown that causes a societal breakdown is probably better odds than AI bringing about world peace. The more networked the world gets, the greater the danger.

It would be a stupid way to go, as the poets anticipated long ago. “This is the way the world ends / Not with a bang but a whimper,” TS Eliot wrote. These days, of course, he would add a thumbs-down emoji. – This article originally appeared in The New York Times