Facebook Engineering Explains "Worst Outage We’ve Had in Over Four Years"

Facebook was down for two and a half hours earlier today for many of its 500-some million users around the world, in what the company describes as “the worst outage we’ve had in over four years.” During the downtime, social plugins such as the Like button and the developer platform were also inaccessible. The site also went down yesterday, but apparently for less time and for fewer people.

As the engineering team details in a post this afternoon following the outage, a cache configuration problem cascaded into a major system failure and ultimately forced Facebook to turn off the site for many if not all users. The company tells us it doesn’t “have exact numbers, but this [was] very widespread.” From the post:

The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
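
The post does not say what the redesigned system will look like. For readers who want a concrete picture of what "dealing more gracefully with feedback loops" can mean, one widely used pattern is a circuit breaker: after repeated failures, clients stop querying the struggling backend for a cooldown period and use a fallback value instead, so a failure does not feed on itself. The sketch below is purely illustrative and is not Facebook's design; the class name, parameters, and fallback behavior are all assumptions.

```python
import time


class CircuitBreaker:
    """Stops calling a struggling backend after repeated failures.

    Hypothetical, single-threaded illustration; not Facebook's system.
    """

    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, fetch, fallback):
        # While the circuit is open and still cooling down, skip the backend.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return fallback
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            value = fetch()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open) the circuit and restart the cooldown.
                self.opened_at = self.clock()
            return fallback
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return value
```

A caller would wrap each backend read, for example `breaker.call(read_config_from_db, cached_value)`, where `read_config_from_db` and `cached_value` are placeholders. The trade-off is that clients may serve a stale value for a while, which is usually preferable to every client retrying at once against a database cluster that is already overloaded.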

While Facebook has had occasional site performance problems, in general it has managed to stay up for almost all users almost all of the time, with performance improving in recent years.