Unless you’ve been hiding in a cave, you know Facebook announced a social inbox today, promising a real-time conversation stream that combines the messaging already available on the social network with texting, instant messaging, email and chat. The new service will tie together more than 20 additional infrastructure services.
Now you can see why the company needed to buy its own data center. Even before today’s announcement, the sheer volume of data Facebook handles has required continually expanding infrastructure, and the rate at which the company needs to increase database capacity almost defies description. One of the company’s database software engineering experts, Kannan Muthukkaruppan — not surprisingly, he’s an Oracle alumnus — quantifies it in a post today:
The current Messages infrastructure handles over 350 million users sending over 15 billion person-to-person messages per month. Our chat service supports over 300 million users who send over 120 billion messages per month.
All that data falls into two distinct sets, one considerably larger than the other. The smaller set holds the more essential and time-sensitive data, while the larger one is what Chief Executive Officer Mark Zuckerberg called “noise.” That contrast ultimately informed Facebook’s decisions on exactly how to expand its storage capacity and accommodate new features.
Testing of several open source, distributed database architectures resulted in what might seem an ironic choice: HBase, which is modeled after Google’s BigTable. It’s non-relational and programmed in Java. Wikipedia describes the technology:
HBase features compression, in-memory operation, and Bloom filters on a per-column basis… Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API but also through REST or Thrift gateway APIs.
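To make that last point concrete, here is a minimal sketch of what talking to HBase through its REST gateway looks like. The gateway expects row keys, column names and cell values base64-encoded inside a JSON envelope; the code below only builds that payload (no live cluster is assumed), and the table schema — messages keyed by user, with an `msg` column family — is purely illustrative, not Facebook’s actual design.

```python
import base64
import json


def b64(s: str) -> str:
    """HBase's REST gateway expects keys, columns, and values base64-encoded."""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


def make_put_payload(row_key: str, column: str, value: str) -> str:
    """Build the JSON body for a PUT to /<table>/<row> on the REST gateway."""
    return json.dumps({
        "Row": [{
            "key": b64(row_key),
            "Cell": [{"column": b64(column), "$": b64(value)}],
        }]
    })


# Hypothetical schema: one row per user, a "msg" column family.
payload = make_put_payload("user123", "msg:subject", "Hello")
```

A Java client would accomplish the same thing with `Put` objects against the native API; the REST and Thrift gateways exist so non-Java services can reach the same tables.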
Adopting HBase didn’t mean ripping out the databases Facebook already runs. Apache Cassandra, which Facebook developed and released to the open source community in 2008, handles inbox search, and MySQL houses core data such as log-ins. But both have limits in scalability and performance, said Muthukkaruppan:
MySQL proved to not handle the long tail of data well; as indexes and data sets grew large, performance suffered. We found Cassandra’s eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.
Facebook is already running an open source release of HBase, which includes:
- very good scalability and performance
- a simpler consistency model than Cassandra
- auto load balancing and failover
- compression support
- multiple shards per server
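Two of those bullets — auto load balancing and multiple shards per server — come from the way HBase range-partitions each table into regions and splits a region automatically when it grows too large. The sketch below models only that key-range bookkeeping (real HBase also reassigns regions across servers, which this toy version ignores):

```python
import bisect


class RangeShardedTable:
    """Toy model of HBase-style regions: rows are partitioned by key range,
    and a region splits at its median key once it exceeds max_rows.
    Real HBase splits by on-disk size and rebalances regions across
    servers; this sketch only tracks the key ranges."""

    def __init__(self, max_rows: int = 4):
        self.max_rows = max_rows
        self.starts = [""]      # sorted start key of each region
        self.regions = [{}]     # region i holds keys >= starts[i]

    def _index(self, key: str) -> int:
        """Find the region whose key range contains this key."""
        return bisect.bisect_right(self.starts, key) - 1

    def put(self, key: str, value):
        i = self._index(key)
        self.regions[i][key] = value
        if len(self.regions[i]) > self.max_rows:
            self._split(i)

    def _split(self, i: int):
        """Split region i at its median key into two adjacent regions."""
        keys = sorted(self.regions[i])
        mid = keys[len(keys) // 2]
        upper = {k: v for k, v in self.regions[i].items() if k >= mid}
        self.regions[i] = {k: v for k, v in self.regions[i].items() if k < mid}
        self.starts.insert(i + 1, mid)
        self.regions.insert(i + 1, upper)

    def get(self, key: str):
        return self.regions[self._index(key)].get(key)
```

Because splitting is automatic, operators never pre-shard the table by hand — capacity grows with the data, which is exactly the property a service adding 15 billion messages a month needs.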
HBase’s underlying file system, HDFS, includes:
- end-to-end checksums
- automatic rebalancing
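"End-to-end checksums" means the writing client computes a checksum over each small chunk of data (512 bytes by default in HDFS) and the reading client re-verifies it, so corruption anywhere along the path — disk, network, memory — is caught at read time. A minimal sketch of that idea, using CRC32 as HDFS does:

```python
import zlib

CHUNK = 512  # HDFS checksums data in 512-byte chunks by default


def write_with_checksums(data: bytes):
    """Split data into chunks and store a CRC32 alongside each chunk,
    the way an HDFS client checksums data as it writes."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return [(chunk, zlib.crc32(chunk)) for chunk in chunks]


def read_with_verification(stored) -> bytes:
    """Recompute each chunk's CRC on read; a mismatch means the bytes
    were corrupted somewhere between writer and reader."""
    out = bytearray()
    for chunk, crc in stored:
        if zlib.crc32(chunk) != crc:
            raise IOError("checksum mismatch: corrupt chunk")
        out.extend(chunk)
    return bytes(out)
```

In real HDFS a failed verification also triggers a re-read from another replica of the block, which is how a three-replica cluster self-heals from silent disk corruption.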
Facebook’s staff already had experience developing and running HDFS from data processing with Hadoop. The social network’s engineers continue to polish HBase and share improvements with the open source community.
So, the new social inbox announced today uses a home-grown application server that interfaces with many other services running at Facebook. For instance, deciding whether a message should travel over SMS or chat requires a whole service in itself that the engineers refer to as delivery. Other components include:
- Haystack, which stores attachments
- a user discovery service written on top of Apache ZooKeeper
- a service that handles email account verification
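Facebook hasn’t published the internals of that ZooKeeper-based discovery service, but the characteristic ZooKeeper pattern is ephemeral registration: a server registers the users it serves under an entry tied to its session, and the entry vanishes automatically if the server dies. Here is a toy, in-memory stand-in for that behavior — the names and structure are illustrative assumptions, not Facebook’s actual service:

```python
class EphemeralRegistry:
    """Toy stand-in for a ZooKeeper-backed user discovery service.
    Registrations are tied to a server's session, so — like ZooKeeper's
    ephemeral nodes — they disappear when the session expires."""

    def __init__(self):
        self.sessions = {}  # session_id -> set of user ids it registered
        self.owners = {}    # user id -> session_id of the serving node

    def register(self, session_id: str, user_id: str):
        """A server announces it is currently serving this user."""
        self.sessions.setdefault(session_id, set()).add(user_id)
        self.owners[user_id] = session_id

    def lookup(self, user_id: str):
        """Find which server session currently serves this user."""
        return self.owners.get(user_id)

    def session_expired(self, session_id: str):
        """Server crashed or disconnected: its registrations vanish."""
        for user_id in self.sessions.pop(session_id, set()):
            if self.owners.get(user_id) == session_id:
                del self.owners[user_id]
```

The payoff of the ephemeral pattern is that discovery never serves a dead server for long: liveness is implied by the session itself rather than by a separately maintained health check.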
Readers, what are your expectations for social inbox based on the underlying technology described above?