Yesterday, a Gmail outage slowed down the media industry for over an hour and annoyed everyone who relies on the free email service for personal or business communications.
In a blog post last night, Ben Treynor, Google’s VP of Engineering and “site reliability czar,” said the blackout was actually a result of recent changes meant to improve service:
“This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system ‘stop sending us traffic, we’re too slow!’. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server.”
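The feedback loop Treynor describes can be sketched as a toy simulation. This is only an illustration, not Google's actual routing logic: the function name `simulate_cascade`, the even split of traffic across healthy routers, and every capacity and load figure below are assumptions made up for the example.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a cascading overload (hypothetical, not Gmail's real system).

    capacities: max load each request router can handle (made-up units).
    total_load: total traffic, split evenly across the routers still healthy.
    Returns the number of healthy routers after each rebalancing round.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        # Assumption: traffic is shared evenly among the remaining routers.
        share = total_load / len(healthy)
        # A router whose share exceeds its capacity stops accepting traffic.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # nobody dropped out this round, so the pool is stable
        healthy = survivors
        history.append(len(healthy))
    return history


# Ten hypothetical routers with slightly different capacities.
capacities = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

print(simulate_cascade(capacities, 1000))  # load fits: no router drops out
print(simulate_cascade(capacities, 1010))  # slightly more load: full cascade
```

In this made-up setup, a 1 percent increase in load is enough to take the pool from ten healthy routers to none, because each router that drops out pushes a larger share onto the ones that remain.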
Thankfully, Treynor says the Gmail team is now working hard to make sure that outages like yesterday’s — which lasted for 100 scary minutes — remain a “rarity.”