Twitter is one more of the many tech companies turning to custom data management solutions to cope with the high volume of user-generated data. To ease the fairly common practice of ‘sharding’ a database (splitting it up over several computers), Twitter came up with a framework they call Gizzard, which they’ve made available to other developers.
Because of the massive amount of data collected daily, several social media-related websites are producing custom data-store solutions. SimpleGeo, for example, uses Giselle DB for their geodata infrastructure services. Giselle DB is a fork of the Cassandra solution, one of the growing number of “NoSQL” data management tools. (Note: Twitter does have Cassandra-related code available amongst their roughly 30 other open source projects.) Some Web application architects believe such solutions are the most efficient way to handle scaling, for example in social games, which also amass user-generated data.
Some of the NoSQL data-store solutions use a process called “sharding” which, as mentioned above, involves partitioning data across multiple physical locations, usually over several computers. Sometimes portions of data are replicated, sometimes not. (This is in essence similar to various RAID disk array configurations.) As Nick Kallen of Twitter points out, sharding is not easy to do efficiently, especially for systems that are continually acquiring new data and have to scale up at the same time. If data is being replicated in several places, the copies also have to remain consistent.
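To make the idea concrete, here is a minimal Python sketch of sharding with replication. It is only an illustration under simple assumptions: each "computer" is an in-memory dictionary, keys are assigned by a stable hash, and every name in it (`shard_for`, `replicas_for`, `ShardedStore`) is hypothetical rather than taken from any of the systems mentioned here.

```python
import hashlib
from typing import Any, Dict, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replicas_for(key: str, num_shards: int, copies: int) -> List[int]:
    """Return the primary shard followed by the next (copies - 1) shards."""
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(copies)]


class ShardedStore:
    """Toy store: each shard is a dict standing in for a separate machine."""

    def __init__(self, num_shards: int, copies: int = 2) -> None:
        if not 1 <= copies <= num_shards:
            raise ValueError("copies must be between 1 and num_shards")
        self.num_shards = num_shards
        self.copies = copies
        self.shards: List[Dict[str, Any]] = [dict() for _ in range(num_shards)]

    def put(self, key: str, value: Any) -> None:
        # Write the value to every replica so the copies stay consistent.
        for idx in replicas_for(key, self.num_shards, self.copies):
            self.shards[idx][key] = value

    def get(self, key: str) -> Any:
        # Read from the first replica that holds the key.
        for idx in replicas_for(key, self.num_shards, self.copies):
            if key in self.shards[idx]:
                return self.shards[idx][key]
        raise KeyError(key)
```

Note that changing `num_shards` in this sketch would reassign most keys to different shards, which hints at why scaling a sharded system while it keeps taking in new data is hard.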
Twitter has created other custom distributed data-stores. However, as Kallen puts it, “[M]ost of the available open-source projects are either too immature or too limited to deal with the variety of problems that exist on the web.” Hence Gizzard, a “Scala framework” for creating custom, fault-tolerant distributed databases. Gizzard acts as a template that, while not able to solve all data storage problems, will help solve some. It sits between a database and Web client code, and manages how data should be split up and replicated. Read Kallen’s post for an in-depth explanation of the Gizzard middleware framework. The Gizzard code is available on GitHub, as is a data storage example called Rowz.
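As a rough picture of what a layer between client code and the databases does, the Python sketch below routes each request through a lookup table that maps ranges of ids to groups of replicas. This is not Gizzard's API (Gizzard is written in Scala); the `RoutingLayer` class, its methods, and the choice of a range-based table are all assumptions made for illustration, with plain dictionaries standing in for the backing databases.

```python
import bisect
from typing import Any, Dict, List


class RoutingLayer:
    """Hypothetical middleware between client code and backing stores."""

    def __init__(self, boundaries: List[int],
                 replica_sets: List[List[Dict[int, Any]]]) -> None:
        # boundaries[i] is the lowest id handled by replica_sets[i].
        if len(boundaries) != len(replica_sets):
            raise ValueError("one replica set is needed per boundary")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted in ascending order")
        self.boundaries = list(boundaries)
        self.replica_sets = replica_sets

    def _replicas(self, item_id: int) -> List[Dict[int, Any]]:
        # Find the last boundary that is <= item_id.
        idx = bisect.bisect_right(self.boundaries, item_id) - 1
        if idx < 0:
            raise KeyError(item_id)
        return self.replica_sets[idx]

    def write(self, item_id: int, value: Any) -> None:
        # Fan the write out to every replica responsible for this id.
        for store in self._replicas(item_id):
            store[item_id] = value

    def read(self, item_id: int) -> Any:
        # Try each replica in turn, so one missing copy is tolerated.
        for store in self._replicas(item_id):
            if item_id in store:
                return store[item_id]
        raise KeyError(item_id)


# Usage: ids below 1000 go to one replica group, the rest to another.
low = [{}, {}]
high = [{}, {}]
layer = RoutingLayer([0, 1000], [low, high])
layer.write(5, "first")
layer.write(1500, "second")
```

The client only ever calls `write` and `read`; where the data lives and how many copies exist are decided by the table, which is the kind of concern the article says Gizzard takes on.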
If you’re a third-party developer of Twitter apps or large-scale applications in general, make sure to register for our upcoming Social Developer Summit. If you’re curious about where to focus your Twitter platform development skills, read our earlier post. [ReadWriteWeb, GigaOm]