CrowdStrike outage: Importance of tech resilience

By Dan Byrne, 9 September 2024

It was a remarkably quick news story – making global headlines on a Friday morning and becoming old news by the Saturday evening. Still, the genuine fear and panic caused by the CrowdStrike outage cannot be denied. 

Ultimately, companies cannot let these events pass without seriously questioning their resilience in the tech world.
 
The CrowdStrike outage demonstrated that companies that fail to ask these questions will face potentially devastating disruptions, even to the point of complete business paralysis. Boards and other corporate leaders must take full responsibility for addressing this scenario.

CrowdStrike outage: a quick recap

  • The outage was caused by a faulty update distributed by cybersecurity company CrowdStrike. A single corrupt file within the update caused computer systems to crash and require manual resets.
  • It mainly affected big businesses using Windows 10 and Windows 11 operating systems. News channels focused their attention on airlines, banks and healthcare.
  • CrowdStrike released a fix within hours, but the need to manually reset machines meant that the effects would take weeks to disappear completely. 
  • Some companies, such as US airline Delta, struggled to recover from the outage. Delta cancelled up to 7,000 flights, with disruption continuing well into the following week.

What can we take from this?

The outage has been described as the biggest in modern IT history and “historic in scale”, so corporate leaders cannot let it pass without learning from it.

Here are our main takeaways:

Your company’s ability to cope will contribute to your reputation

CrowdStrike is a well-respected cybersecurity firm trusted by many big companies worldwide. Yes, its quality control failed badly in this instance, but companies will not be judged for having chosen its products in the first place.

They will, however, be judged on how they respond when things go wrong. 

Some companies were able to keep running with minimal confusion despite being badly hit. They worked around obstacles and used backup protocols to keep their business functioning as normally as possible. As the crisis unfolded, we saw images of whiteboards in airports replacing the usual departure and arrival screens. Yes, they were primitive, but they worked!

Other companies struggled long after CrowdStrike released the fix. The best example is Delta Air Lines, which was still cancelling flights days after most other companies had returned to normal.

Delta has attributed the delay to the outage affecting so many of its systems at once – specifically its crew management system, which couldn’t cope with the rescheduling required in the days after. 

Here lies the real challenge of emergencies like this. We can say, to a certain extent, that Delta simply got caught in a perfect storm: it had established IT infrastructure and a partnership with CrowdStrike, and it just so happened that the faulty update affected its systems more than most.

But try telling that to terminals full of disgruntled passengers and story-hungry news crews. 

Ultimately, whatever negative attention CrowdStrike receives, Delta will still see increased reputational risk because of what happened. Indeed, it is already the subject of an official investigation by the US Department of Transportation.

Tech requires constant questions

Boards and other corporate leaders must do their best to understand the implications of a company’s tech decisions. 

It’s true that not every director will have a tech background, but a little education won’t hurt if you’re unsure of the basics. You just need enough to enable you and your board to ask the right questions about tech strategy. 

Given CrowdStrike’s track record and global partnerships, the decision to use its services does not in itself indicate a lack of due diligence.

However, the outage should prompt boards to realise the importance of asking questions about every technology supplier. Who is this company? What is its track record? What safeguards does it have? Do we understand what it provides, what access it will have to our systems, and what risks are involved if things go wrong?

Crucially, corporate boards should become curious about the risk of everything going wrong at once – as was the case for many during the CrowdStrike outage. 

Cybersecurity is about defence, and when it comes to defence, you need backups for complete resilience. Putting all your trust in one provider doesn’t exactly scream “resilience”, especially when that provider runs into problems.

In summary

The CrowdStrike outage was primarily a quality control failure at a single but pivotal cybersecurity firm.

Realistically, however, the many affected companies will be judged on their own response to the crisis, especially if they can’t recover fast enough. 

So, this event is an essential lesson in the art of resilience and asking questions.



 

This article has been republished with permission from the Corporate Governance Institute, a global educational technology company specialising in training and certifying the next generation of company directors and board members.