Twitter Discusses Monday's Fail Whale Sighting

Twitter may be having its issues, but at least its executives are upfront about them. A post on the Twitter Blog titled “Reliability,” by spokesman Matt Graves, directed readers to a post on the Twitter Engineering Blog by engineering program manager Jean-Paul Cozzatti, “Twitter & Performance: An update,” which attempts to explain the recurring visits by the fail whale.
Graves wrote:

When you can’t update your profile photo, send a Tweet, or even sign on to Twitter, it’s frustrating. We know that, and we’ve had too many of these issues recently.

As we said last month, we are working on long-term solutions to make Twitter a more reliable and stable platform. It’s our No. 1 priority. The bulk of our engineering efforts are currently focused on this issue, and we have moved resources from other projects to focus on it.

For much more background, J.P. Cozzatti from our engineering team discusses our efforts and recent issues today in a post on the Twitter Engineering blog. In a separate, but closely related post on the Engineering blog, we discuss something we’ve been working toward for some time: We’re moving into our own dedicated data center this fall. This will be a big step forward.

Highlights from Cozzatti’s post:

On Monday, a fault in the database that stores Twitter user records caused problems on both Twitter.com and our API. The short, non-technical explanation is that a mistake led to some problems that we were able to fix without losing any data.

As we said last month, keeping pace with record growth in Twitter’s user base and activity presents some unique and complex engineering challenges. We frequently compare the tasks of scaling, maintaining, and tweaking Twitter to building a rocket in midflight.

While we’re continuously improving the performance, stability, and scalability of our infrastructure and core services, there are still times when we run into problems unrelated to Twitter’s capacity. That’s what happened this week.

On Monday, our users database, where we store millions of user records, got hung up running a long-running query; as a result, most of the table became locked. The locked users table manifested itself in many ways: Users were unable to sign up, sign in, update their profile or background images, and responses from the API were malformed, rendering the response unusable to many of the API clients. In the end, this affected most of the Twitter ecosystem: our mobile, desktop, and Web-based clients, the Twitter support and help system, and Twitter.com.
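
For readers curious how a long-running query of this kind might be caught before it ties up a table, here is a minimal sketch in Python. It is not taken from Cozzatti's post and does not describe Twitter's tooling: the RunningQuery record, the find_runaway_queries function, the sample SQL, and the 30-second threshold are all hypothetical, chosen only to illustrate flagging queries that have run past a time limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningQuery:
    """One row of a hypothetical 'currently running queries' listing."""
    query_id: int
    seconds_running: float
    sql: str


def find_runaway_queries(running: List[RunningQuery],
                         max_seconds: float = 30.0) -> List[int]:
    """Return the IDs of queries that have run longer than max_seconds.

    The 30-second default is an assumption for illustration only.
    """
    return [q.query_id for q in running if q.seconds_running > max_seconds]


if __name__ == "__main__":
    # A made-up snapshot: one quick lookup and one query stuck for 90 minutes.
    snapshot = [
        RunningQuery(1, 0.2, "SELECT name FROM users WHERE id = 42"),
        RunningQuery(2, 5400.0, "UPDATE users SET flag = 1"),
    ]
    for query_id in find_runaway_queries(snapshot):
        print(f"query {query_id} exceeded the time limit")
```

In practice, an operator or automated job would feed a function like this from the database's own list of running queries and then decide whether to cancel the offenders; that step depends on the database in use and is left out here.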

To remedy the locked table, we force-restarted the database server in recovery mode, a process that took more than 12 hours (the database covers records for over 124 million users; that’s a lot of records). During the recovery, the users table and related tables remained unavailable. Unfortunately, even after the recovery process completed, the table remained in an unusable state. Finally, yesterday morning, we replaced the partially locked user db with a copy that was fully available (in the parlance of database admins everywhere, we promoted a slave to master), fixing the database and all of the related issues.

David Cohen is editor of Adweek's Social Pro Daily.