All our servers and company laptops went down at pretty much the same time. Laptops have been bootlooping to blue screen of death. It’s all very exciting, personally, as someone not responsible for fixing it.
Apparently caused by a bad CrowdStrike update.
Edit: now being told we (who almost all generally work from home) need to come into the office Monday as they can only apply the fix in-person. We’ll see if that changes over the weekend…
Wow, I didn’t realize CrowdStrike was widespread enough to be a single point of failure for so much infrastructure. Lot of airports and hospitals offline.
The Federal Aviation Administration (FAA) imposed a global ground stop for airlines including United, Delta, American, and Frontier.
Flights grounded in the US.
I see a lot of hate ITT on kernel-level EDRs, which I wouldn’t say they deserve. Sure, for your own use, an AV is sufficient and you don’t need an EDR, but they make a world of difference. I work in cybersecurity doing red teaming, so my job is mostly about bypassing such solutions and building malware/actions within the network that avoid detection as much as possible, and ever since EDRs started getting popular, my job got several leagues harder.
The advantage of EDRs in comparison to AVs is that they can catch 0-days. An AV will just look for signatures: known pieces or snippets of malware code. An EDR, on the other hand, looks for sequences of actions a process performs, by scanning memory and logs and hooking syscalls. So if, for example, you wrote an entirely custom program that allocates memory as Read-Write-Execute, then loads a crypto DLL, decrypts something into that memory, and then calls a thread-spawn syscall to start a thread in another process that runs it, an EDR would correlate those actions and get suspicious, while to a regular AV the code would probably look OK. Some EDRs even watch network packets and can catch suspicious communication, such as port scanning, large data extraction, or C2 communication.
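To make the signature vs. behavior distinction above a bit more concrete, here is a toy Python sketch. Everything in it is made up for illustration: the signature bytes, the action names, the `Event` type and both scan functions are hypothetical, and a real EDR does far more than match an ordered list of actions.

```python
from dataclasses import dataclass

# Hypothetical signature database: byte snippets of already-known malware.
KNOWN_SIGNATURES = [b"EVIL_PAYLOAD_V1"]

# Hypothetical behavior rule: these actions, in this order, by one process.
SUSPICIOUS_SEQUENCE = [
    "alloc_rwx",
    "load_crypto_dll",
    "decrypt_to_memory",
    "remote_thread_spawn",
]


@dataclass
class Event:
    pid: int      # process that performed the action
    action: str   # simplified label for what it did


def av_signature_scan(binary: bytes) -> bool:
    """AV-style check: flag only if a known snippet appears in the file."""
    return any(sig in binary for sig in KNOWN_SIGNATURES)


def edr_behavior_scan(events: list[Event]) -> set[int]:
    """EDR-style check: flag pids whose actions contain the suspicious
    sequence in order, even with unrelated actions in between."""
    progress: dict[int, int] = {}  # pid -> how many rule steps matched so far
    flagged: set[int] = set()
    for ev in events:
        idx = progress.get(ev.pid, 0)
        if idx < len(SUSPICIOUS_SEQUENCE) and ev.action == SUSPICIOUS_SEQUENCE[idx]:
            idx += 1
            progress[ev.pid] = idx
            if idx == len(SUSPICIOUS_SEQUENCE):
                flagged.add(ev.pid)
    return flagged
```

A custom binary with no known signature passes `av_signature_scan`, while the same process still trips `edr_behavior_scan` once its actions line up with the rule.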
Sure, in an ideal world, you would have users that never run malware, and a network that is impenetrable. But on average you still get a few % of people running random binaries that came from phishing attempts, or around 50% of people in your company who fall for vishing attacks. Having an EDR improves your chances of stopping such an attack almost exponentially, and I would say the advantage EDRs get from being kernel-level is well worth it.
I’m not defending CrowdStrike, they did mess up to the point where I bet that the amount of damages they caused worldwide is nowhere near the amount of damages all cyberattacks they prevented would cause in total. But hating on kernel-level EDRs in general isn’t warranted here.
Kernel-level anti-cheat, on the other hand, can go burn in hell, and I hope that something similar will eventually happen with one of them. Fuck kernel level anti-cheats.
Honestly kind of excited for the company blogs to start spitting out their crisis management stories.
I mean - this is just a giant test of crisis management plans. And while there are absolutely real-world consequences to this, the fix almost seems scriptable.
If a company uses out-of-band management such as IPMI or Intel AMT (sometimes branded vPro), and their network is intact/the devices are on their network, they ought to be able to remotely address this.
But that’s obviously predicated on them having already deployed/configured the tools.
Reading into the updates some more… I’m starting to think this might just destroy CrowdStrike as a company altogether. Between the mountain of lawsuits almost certainly incoming and the total destruction of any public trust in the company, I don’t see how they survive this. Just absolutely catastrophic on all fronts.
If all the computers stuck in boot loop can’t be recovered… yeah, that’s a lot of cost for a lot of businesses. Add to that all the immediate impact of missed flights and who knows what happening at the hospitals. Nightmare scenario if you’re responsible for it.
This sort of thing is exactly why you push updates to groups in stages, not to everything all at once.
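For anyone wondering what "in stages" means in practice, here is a minimal Python sketch. It is not how CrowdStrike or any particular vendor does it; `staged_rollout`, the stage fractions and the `deploy_and_check` callback are all hypothetical names for illustration. The idea is just to push to a small group first and stop if that group starts failing.

```python
from typing import Callable


def staged_rollout(
    hosts: list[str],
    stage_fractions: list[float],
    deploy_and_check: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy to cumulatively larger slices of `hosts`.

    `stage_fractions` are cumulative, e.g. [0.01, 0.1, 1.0].
    `deploy_and_check(host)` is a hypothetical callback that pushes the
    update and returns True if the host is still healthy afterwards.
    Returns (hosts that received the update, whether the rollout completed).
    """
    updated: list[str] = []
    start = 0
    for fraction in stage_fractions:
        # Each stage covers at least one new host and never runs past the end.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        batch = hosts[start:end]
        if not batch:
            continue
        failures = sum(0 if deploy_and_check(h) else 1 for h in batch)
        updated.extend(batch)
        if failures / len(batch) > max_failure_rate:
            # Halt here: later stages never receive the bad update.
            return updated, False
        start = end
    return updated, start == len(hosts)
```

With 100 hosts and stages of 1%, 10% and 100%, an update that breaks every machine it touches would stop after a single host instead of reaching all 100.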
Looks like the laptops are able to be recovered with a bit of finagling, so fortunately they haven’t bricked everything.
And yeah staged updates or even just… some testing? Not sure how this one slipped through.
> Not sure how this one slipped through.
I’d bet my ass this was caused by terrible practices brought on by suits demanding more “efficient” releases.
“Why do we do so much testing before releases? Have we ever had any problems before? We’re wasting so much time that I might not even be able to buy another yacht this year”
Testing in production will do that
deleted by creator
Why is it bad to do on a Friday? Based on your last paragraph, I would have thought Friday is probably the best weekday to do it.
Most companies, mine included, try to roll out updates during the middle or start of a week. That way if there are issues the full team is available to address them.
Don’t we blame MS at least as much? How does MS let an update like this push through their Windows Update system? How does an application update make the whole OS unable to boot? Blue screens on Windows have been around for decades, why don’t we have a better recovery system?
Crowdstrike runs at ring 0, effectively as part of the kernel. Like a device driver. There are no safeguards at that level. Extreme testing and diligence is required, because these are the consequences for getting it wrong. This is entirely on crowdstrike.
This didn’t go through Windows Update. It went through the CrowdStrike software directly.
Why do people run windows servers when Linux exists, it’s literally a no brainer.
They run Windows and all this third party software because they would rather pay subscriptions and give up control of their business than retain skilled staff. It has nothing to do with Linux vs Windows. Linux won’t stop doors falling off Boeing planes. It is the myopia of modern business culture.
Because all software runs from Linux right…
It could if more people just used Linux
I was quite surprised when I heard the news. I had been working for hours on my PC without any issues. It pays off not to use Windows.
It’s not a flaw with Windows causing this.
The issue is with a widely used third party security software that installs as a kernel level driver. It had an auto update that causes bluescreening moments after booting into the OS.
This same software is available for Linux and Mac, and had similar issues with specific Linux distros a month ago. It just didn’t get reported on because it didn’t have as wide of an impact.
Still a MS issue. Both testing and rollout procedures were inadequate
My Windows gaming PC is completely fine right now, because I don’t use crowd strike. Microsoft didn’t have anything to do with crowd strike’s rollout or support.
I love Linux and use it as my daily driver for everything besides some online games. There are plenty of legitimate reasons to criticize Microsoft and Windows, but crowd strike breaking stuff isn’t one of them, at least in my opinion.
A few years ago, when my org got the ask to deploy the CS agent on Linux production servers and I also saw it getting deployed on thousands of Windows and Mac desktops all across, the first thought that came to mind was “massive single point of failure and security threat”, as we were putting all the trust in a single relatively small company that will become (has become?) the favorite target of all the bad actors across the planet. How long before it gets into trouble, either because of its own doing or due to others?
I guess that we now know
No bad actors did this, and security goes in fads. Crowdstrike is king right now, just as McAfee/Trellix was in the past. If you want to run around without edr/xdr software be my guest.
> If you want to run around without edr/xdr software be my guest.
I don’t think anyone is saying that… But picking programs that your company has visibility into is a good idea. We use Wazuh. I get to control when updates are rolled out. It’s not a massive shit show when the vendor rolls out the update globally without sufficient internal testing. I can stagger the rollout as I see fit.
Irrelevant but I keep reading “crowd strike” as “counter strike” and it’s really messing with me
Think of it as ClownStrike, they will be known as a bunch of clowns after this.
My work PC is affected. Nice!
Plot twist: you’re head of IT
>Make a kernel-level antivirus
>Make it proprietary
>Don’t test updates… for some reason??
I mean I know it’s easy to be critical but this was my exact thought, how the hell didn’t they catch this in testing?
Completely justified reaction. A lot of the time tech companies and IT staff get shit for stuff that, in practice, can be really hard to detect before it happens. There are all kinds of issues that can arise in production that you just can’t test for.
But this… This has no justification. An issue this immediate, this widespread, would have instantly been caught with even the most basic of testing. The fact that it wasn’t raises massive questions about the safety and security of Crowdstrike’s internal processes.
From what I’ve heard, and to play devil’s advocate, it coincided with Microsoft pushing out a security update at basically the same time, and that caused the issue. So it’s possible they had no way to test it properly, because they didn’t have the update at hand before it rolled out. So the fault wasn’t only a bug in the CS driver, but the driver’s interaction with the new Windows update, which they didn’t have.
How sure are you about that? Microsoft very dependably releases updates on the second Tuesday of the month, and their release notes show if updates are pushed out of schedule. Their last update was on schedule, July 9th.
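The "second Tuesday of the month" schedule is easy to sanity-check. Here is a small Python sketch using only the standard library; `second_tuesday` is just a helper name made up for this example.

```python
import datetime


def second_tuesday(year: int, month: int) -> datetime.date:
    """Return the date of the second Tuesday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday(): Monday is 0, Tuesday is 1.
    days_until_first_tuesday = (1 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_until_first_tuesday + 7)
```

`second_tuesday(2024, 7)` gives 2024-07-09, which lines up with the July 9th release mentioned above.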
I’m not. I vaguely remember seeing it in some posts and comments, and it would explain it pretty well, so I kind of took it as a likely outcome. In hindsight, you are right, I shouldn’t have been spreading hearsay. Thanks for the wakeup call, honestly!
Lots of security systems are kernel level (at least partially); this includes SELinux and AppArmor, by the way. It’s a necessity for these things to actually be effective.
This is why you create restore points if using windows.
Those things never worked for me… Problems always persisted or it failed to apply the restore point. This is from the XP and Windows 7 days, never bothered with those again. To Microsoft’s credit, both W7 and W10 were a lot more stable negating the need for it.
I can’t say about XP or 7 but they’ve definitely saved my bacon on Win10 before on my home system. And the company I work for has them automatically created and it made dealing with the problem much easier as there was a restore point right before the crowdstrike update. No messing around with the file system drivers needed.
I’d really recommend at least creating one at a state when your computer is working ok, it doesn’t hurt anything even if it doesn’t work for you for whatever reason. It’s just important to understand that it’s not a cure all, it’s only designed to help with certain issues (primarily botched updates and file system trouble).
The amount of servers running Windows out there is depressing to me
I dunno, but doesn’t like a quarter of the internet kinda run on Azure?
I’ve had my PC shut down for updates three times now, while using it as a Jellyfin server from another room. And I’ve only been using it for this purpose for six months or so.
I can’t imagine running anything critical on it.
Windows server, the OS, runs differently from desktop windows. So if you’re using desktop windows and expecting it to run like a server, well, that’s on you. However, I ran windows server 2016 and then 2019 for quite a few years just doing general homelab stuff and it is really a pain compared to Linux which I switched to on my server about a year ago. Server stuff is just way easier on Linux in my experience.
The thought of a local computer being unable to boot because some remote server somewhere is unavailable makes me laugh and sad at the same time.
I don’t think that’s what’s happening here. As far as I know it’s an issue with a driver installed on the computers, not with anything trying to reach out to an external server. If that were the case you’d expect it to fail to boot any time you don’t have an Internet connection.
Windows is bad but it’s not that bad yet.
It’s just a fun coincidence that the azure outage was around the same time.
Yep, and it’s harder to fix Windows VMs in Azure that are affected because you can’t boot them into safe mode the same way you can with a physical machine.
Foof. Nightmare fuel.
A lot of people I work with were affected, I wasn’t one of them. I had assumed it was because I put my machine to sleep yesterday (and every other day this week) and just woke it up instead of booting it. I assumed it was an on startup thing and that’s why I didn’t have it.
Our IT provider already broke EVERYTHING earlier this month when they remote-installed “Nexthink Collector”, which forced a 30+ minute CHKDSK on every boot for EVERYONE, until they rolled out a fix (which they were at least able to do remotely), and I didn’t want to have to deal with that the week before I go on leave.
But it sounds like it even happened to running systems so now I don’t know why I wasn’t affected, unless it’s a windows 10 only thing?
Our IT have had some grief lately, but at least they specified Intel 12th gen on our latest CAD machines, rather than 13th or 14th, so they’ve got at least one win.
Your computer was likely not powered on during the time window between the fucked update pushing out and when they stopped pushing it out.
That makes sense, although I must have just missed it, for people I work with to catch it.
This is going to be a Big Deal for a whole lot of people. I don’t know all the companies and industries that use Crowdstrike but I might guess it will result in airline delays, banking outages, and hospital computer systems failing. Hopefully nobody gets hurt because of it.
Big chunk of New Zealand’s banks apparently run it, cos 3 of the big ones can’t do credit card transactions right now
> cos 3 of the big ones can’t do credit card transactions right now
Bitcoin is still up and running, perhaps people can use that