How Cloudflare Accidentally Broke a Quarter of the Internet — Twice in One Month
What You'll Find In This Article
- Understand why centralized internet services like Cloudflare can cause widespread outages when things go wrong
- Learn the difference between 'fail-open' and 'fail-closed' systems and why it matters for reliability
- Recognize why having safety improvements 'in progress' doesn't protect you from the next incident
- Appreciate the importance of gradual rollouts and quick rollback capabilities for any system changes
Imagine if the company that helps run a huge chunk of the internet accidentally flipped the wrong switch and knocked out service for millions of websites. That's essentially what happened to Cloudflare on December 5th, 2025, and it wasn't hackers or a sophisticated attack. It was their own internal update gone wrong. For about 20-25 minutes, roughly 28% of the web traffic flowing through Cloudflare failed, and users around the world saw error messages instead of the websites they were trying to visit. The embarrassing part? This was the second major outage in less than a month, and the safety measures promised after the first incident hadn't been fully rolled out yet. In response, Cloudflare has announced a program called 'Code Orange: Fail Small': redesigning their systems so that when something goes wrong, it breaks a little instead of breaking everything.
The Problem
Cloudflare is like a massive security guard and traffic director for the internet. About 20% of all websites use their services to load faster and stay protected from attacks. So when Cloudflare has a bad day, a significant portion of the internet has a bad day too.
On December 5th, 2025, Cloudflare's team was trying to fix a security hole — a vulnerability in something called React Server Components (don't worry about the technical name; just know it was a legitimate security concern). Their fix was supposed to make things safer. Instead, it made things much, much worse.
For about 20-25 minutes, roughly 28% of Cloudflare's traffic, which amounts to millions of website visits, hit a wall. Instead of seeing the pages they wanted, users got HTTP 500 errors: the generic "Internal Server Error" response, which is the internet's way of saying "something broke on our end, not yours."
What Went Wrong
The root cause wasn't complicated: Cloudflare pushed a change to their security filtering system (their Web Application Firewall) that affected more customers than they anticipated. Think of it like updating the locks on some doors in a building, but accidentally jamming all the doors shut instead.
What made this particularly painful was timing. Less than three weeks earlier, on November 18th, Cloudflare had another major outage caused by a similar mistake: a single change that spiraled out of control. After that incident, they promised improvements. But those improvements weren't fully in place when the December outage hit.
It's like promising to install smoke detectors after a small kitchen fire, then having another fire before you've finished putting them up.
How It Actually Works
Cloudflare's response is a program they're calling "Code Orange: Fail Small." The name tells you everything about their new philosophy: when something goes wrong, make sure it fails in a small, contained way rather than taking everything down.
Here's what they're changing:
Stricter testing before going live: Changes will go through more rigorous staging environments that better simulate real-world conditions before being pushed to all customers.
Faster rollbacks: If something does go wrong, they want to be able to undo it in seconds, not minutes. When about 20% of all websites depend on your network, every second counts.
Granular kill switches: Instead of having one big "off" button, they're building smaller switches that can turn off specific features without affecting everything else. It's like being able to shut off water to one bathroom instead of the whole building.
Fail-open behavior: This is a key concept. Currently, when their security system encounters an error, it blocks traffic ("fail-closed"). They're moving toward "fail-open," meaning if the security check breaks, traffic still flows through. Yes, this means briefly less security, but it's better than no service at all.
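For readers who like to see ideas in code, the difference between fail-closed and fail-open can be sketched in a few lines of Python. This is an illustration only, not Cloudflare's actual code: the names `run_check`, `buggy_check`, and `healthy_check` are made up for the example.

```python
from typing import Callable


def run_check(check: Callable[[str], bool], request: str, fail_open: bool) -> bool:
    """Return True if the request should be allowed through."""
    try:
        # Normal case: the security check gives a verdict on the request.
        return check(request)
    except Exception:
        # The check itself crashed, so it gave no verdict at all.
        # fail_open=True  -> let the traffic through anyway
        # fail_open=False -> block it (fail-closed)
        return fail_open


def buggy_check(request: str) -> bool:
    # Hypothetical security rule with a bug: it crashes on every request.
    raise RuntimeError("rule engine error")


def healthy_check(request: str) -> bool:
    # Hypothetical working rule: block anything containing a known-bad marker.
    return "attack" not in request
```

With the buggy rule, `run_check(buggy_check, "GET /home", fail_open=False)` blocks an innocent visitor, while the same call with `fail_open=True` lets them through. A working rule behaves the same in both modes: it only matters which mode you picked when the check itself breaks.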
Real Examples
Think of Cloudflare like a highway system. They manage the roads that connect users to websites. In this incident, they were trying to fix a pothole (the security vulnerability), but their repair crew accidentally blocked multiple lanes on a major highway during rush hour.
The "Fail Small" approach is like saying: "From now on, if a repair goes wrong, we'll only block one lane instead of the whole highway. Traffic might slow down, but it won't stop completely."
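The "one lane instead of the whole highway" idea is usually implemented as a staged rollout: a change goes to a small slice of traffic first, and only moves on if errors stay low. Below is a minimal sketch of that pattern. The stage sizes, the 1% error threshold, and the names `staged_rollout` and `RolloutResult` are assumptions for illustration, not details of Cloudflare's process.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool        # True if the change reached the final stage
    live_fraction: float   # share of traffic on the new version at the end
    max_exposed: float     # largest share of traffic that ever saw the change


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> RolloutResult:
    """Expose a change to growing shares of traffic, rolling back on errors."""
    max_exposed = 0.0
    for fraction in stages:
        max_exposed = fraction
        # error_rate_at stands in for real monitoring: it reports the error
        # rate observed while `fraction` of traffic is on the new version.
        if error_rate_at(fraction) > max_error_rate:
            # Too many errors: roll back so nobody is on the new version.
            return RolloutResult(False, 0.0, max_exposed)
    return RolloutResult(True, max_exposed, max_exposed)
```

If a broken change is rolled out through stages of 1%, 10%, 50%, and 100%, it is caught at the first stage, so at most 1% of traffic ever sees it. Pushing the same change to everyone at once would have exposed all of it.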
For everyday users, this outage meant that if you tried to visit certain websites during those 20-25 minutes on December 5th, you got an error page. For businesses running on those websites, it meant lost sales, frustrated customers, and panicked support teams — all because of an update that was supposed to make things more secure.
If your website or business depends on Cloudflare, a few simple steps can help you prepare for the next outage:
- Check whether your website or business uses Cloudflare (look for Cloudflare in your hosting or security settings)
- Sign up for Cloudflare status updates at cloudflarestatus.com to get notified of future outages
- Document which services you depend on Cloudflare for (security, speed, DNS)
- Create a simple backup plan: what will you tell customers if your site goes down due to a provider outage?
- Bookmark Cloudflare's incident reports page to understand what happened during outages
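If you are comfortable running a small script, the first check can also be done from the outside. Responses that pass through Cloudflare's network typically carry a `Server: cloudflare` header and a `CF-RAY` header, so looking for them is a reasonable hint. The sketch below is a heuristic, not a definitive test: it only detects sites proxied through Cloudflare (not, for example, DNS-only use), and the function names are made up for this example.

```python
import urllib.error
import urllib.request
from typing import Dict, Mapping


def looks_like_cloudflare(headers: Mapping[str, str]) -> bool:
    """Guess from response headers whether a site is served via Cloudflare."""
    # Header names are case-insensitive, so normalize before comparing.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    return lowered.get("server") == "cloudflare" or "cf-ray" in lowered


def fetch_headers(url: str, timeout: float = 10.0) -> Dict[str, str]:
    """Fetch only the response headers for a URL using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # Error responses (403, 500, and so on) still carry headers.
        return dict(err.headers.items())
```

To try it, call `looks_like_cloudflare(fetch_headers("https://example.com"))`, replacing the placeholder address with your own site.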