CrowdStrike Meltdown: A global IT outage

By Saif Haque

London: In a stunning turn of events, a global IT outage has disrupted travel, financial services, hospitals and everyday life. The culprit is a seemingly innocuous software update from cybersecurity firm CrowdStrike.

The chaos began when CrowdStrike released an update to its Falcon antivirus software, designed to protect Microsoft Windows devices. Unfortunately, this update contained a critical flaw, a “defect” that triggered widespread crashes on Windows systems. The issue didn’t affect other operating systems, but it had a significant impact on Microsoft’s Windows ecosystem.

The fallout has been unprecedented:

Flights grounded: Airlines worldwide faced disruptions, with grounded flights and frustrated passengers.
Financial services hit: Banking services experienced downtime, affecting transactions and customer access.
Media pause: Even broadcaster Sky News had to suspend live programming due to the outages.

The Austin (Texas)-based CrowdStrike isn’t a household name, but it plays a crucial role in cybersecurity. As a provider of security services, it typically responds to hack attacks. This time, however, its own flawed software update caused the problem, and the company acknowledged that this was “not a security incident or cyberattack”.

With nearly 24,000 customers globally, many of them large organisations with millions of customers of their own, fixing the issue seems to have become a headache for IT departments worldwide.

CrowdStrike acknowledged that it was aware of reports of crashes on Windows hosts related to the Falcon Sensor. It said, however, that Mac and Linux hosts were not impacted.

Almost a couple of days on, however, the company has yet to resolve the issue. On July 18, 2024, the company said that it was “actively working with customers impacted by a defect found in a single content update for Windows hosts”. A day later, on July 19, 2024, it was still struggling with the issue. It claimed today that its engineers were “working hard to provide comprehensive and continuous updates with our global customers as quickly as possible”.

While CrowdStrike Engineering claims to have identified a content deployment related to this issue and reverted those changes, the company stated today that if hosts were still crashing and unable to stay online to receive the Channel File Changes, the following steps can be used to work around this issue:

Reboot the host to give it an opportunity to download the reverted channel file. If the host crashes again, then:

Boot Windows into Safe Mode or the Windows Recovery Environment
- NOTE: Putting the host on a wired network (as opposed to WiFi) and using Safe Mode with Networking can help remediation.

Microsoft’s Response

Microsoft, whose devices bore the brunt of the meltdown, took “mitigation action” to address the lingering impact. The disruption threatened the holiday season, impacting travel and leisure stocks, hospital appointments and scheduled operations.

There are many lessons to be learnt from the episode. Some of them include:

Thorough Testing: Software updates must undergo rigorous testing to catch defects before deployment.
Communication: Swiftly acknowledging and addressing issues prevents widespread panic.
Dependency Awareness: Understand the interconnectedness of digital infrastructure to prevent unexpected consequences.

In conclusion, the CrowdStrike meltdown serves as a stark reminder of our reliance on technology and the need for robust safeguards. Let’s learn from this incident and fortify our systems against future disruptions.