Details on Facebook Chat Architecture

For those interested in building scalable systems, Eugene Letuchy, lead engineer on Facebook Chat, has posted details on many of the key engineering decisions his team made in designing the Chat back-end infrastructure.

Letuchy writes, “When your feature’s userbase will go from 0 to 70 million practically overnight, scalability has to be baked in from the start.” That’s an understatement! While we’ve gotten several reports of Facebook Chat breaking under Firefox 3.0 RC1, on the whole the rollout has been very, very solid.

Some highlights from the report:

  • The most resource-intensive operation performed in a chat system is not sending messages. It is rather keeping each online user aware of the online-idle-offline states of their friends, so that conversations can begin.
  • Another challenge is ensuring the timely delivery of the messages themselves. The method we chose to get text from one user to another involves loading an iframe on each Facebook page, and having that iframe’s JavaScript make an HTTP GET request over a persistent connection that doesn’t return until the server has data for the client. The request gets reestablished if it’s interrupted or times out. This isn’t by any means a new technique: it’s a variation of Comet, specifically XHR long polling and/or BOSH. (A browser-side sketch of this pattern appears after these highlights.)
  • Having a large number of long-running concurrent requests makes the Apache part of the standard LAMP stack a dubious implementation choice.
  • For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users’ conversations in memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely lightweight user-space “processes”, share-nothing message-passing semantics, built-in distribution, and a “crash and recover” philosophy proven by two decades of deployment on large soft-realtime production systems. (A server-side sketch of the long-poll handling also appears after these highlights.)
  • Having Thrift available freed us to split up the problem of building a chat system and use the best available tool to approach each sub-problem.
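
In spirit, the client side of that long-polling loop looks something like the sketch below. This is a hypothetical TypeScript illustration, not Facebook’s code: a modern fetch call stands in for the XHR the post describes, and the /chat/updates endpoint and message shape are invented.

    // Minimal long-polling loop: the GET stays open until the server has data
    // for this client, then is immediately reissued. The endpoint name and
    // message shape are invented for illustration.
    type ChatMessage = { from: string; text: string };

    async function pollForMessages(handle: (msg: ChatMessage) => void): Promise<void> {
      while (true) {
        try {
          // The server holds this request open until it has messages to deliver
          // (or until it times out and returns an empty list).
          const response = await fetch("/chat/updates", { cache: "no-store" });
          if (response.ok) {
            const messages: ChatMessage[] = await response.json();
            messages.forEach(handle);
          }
          // Fall through and immediately reestablish the connection.
        } catch {
          // Interrupted or failed connection: back off briefly, then reconnect,
          // as the post describes.
          await new Promise((resolve) => setTimeout(resolve, 1000));
        }
      }
    }

    void pollForMessages((msg) => console.log(`${msg.from}: ${msg.text}`));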
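
On the server side, the core trick of the Erlang web server described above is parking each user’s pending long-poll request in memory and completing it the moment a message arrives. The single-process Node/TypeScript sketch below only illustrates that request lifecycle; the real system is written in Erlang, clustered, and partitioned, and the endpoints and message shapes here are made up.

    import { createServer, ServerResponse } from "node:http";

    // Park each user's pending GET until a message arrives for them, then
    // complete it immediately. Endpoints and payloads are illustrative only.
    const waiting = new Map<string, ServerResponse>(); // user -> parked long-poll response
    const queued = new Map<string, string[]>();        // user -> messages awaiting the next poll

    function deliver(user: string, text: string): void {
      const parked = waiting.get(user);
      if (parked) {
        // A long-poll request is already open for this user: answer it now.
        waiting.delete(user);
        parked.end(JSON.stringify([text]));
      } else {
        // Otherwise hold the message until the user's next poll arrives.
        queued.set(user, [...(queued.get(user) ?? []), text]);
      }
    }

    createServer((req, res) => {
      const url = new URL(req.url ?? "/", "http://localhost");
      const user = url.searchParams.get("user") ?? "";

      if (url.pathname === "/chat/updates") {
        const pending = queued.get(user);
        if (pending && pending.length > 0) {
          queued.delete(user);
          res.end(JSON.stringify(pending)); // data already waiting: return at once
        } else {
          waiting.set(user, res);           // nothing yet: hold the request open
          req.on("close", () => waiting.delete(user));
        }
      } else if (url.pathname === "/chat/send" && req.method === "POST") {
        let body = "";
        req.on("data", (chunk) => (body += chunk));
        req.on("end", () => {
          const { to, text } = JSON.parse(body);
          deliver(to, text);
          res.end("ok");
        });
      } else {
        res.statusCode = 404;
        res.end();
      }
    }).listen(8080);

Because every parked request is just an entry in a map rather than a blocked thread or Apache worker, a single machine can keep many thousands of idle connections open, which is exactly why the team calls stock Apache a dubious choice for this kind of Comet workload.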