Facebook Using Hadoop for Large Scale Internal Analytics

Facebook’s engineering team has posted some details on the tools it’s using to analyze the huge data sets it collects. One of the main tools it uses is Hadoop, an open source project that makes it easier to analyze vast amounts of data.

Some interesting tidbits from the post:

  • Some of these early projects have matured into publicly released features (like the Facebook Lexicon) or are being used in the background to improve user experience on Facebook (by improving the relevance of search results, for example).
  • Facebook has multiple Hadoop clusters deployed now – with the biggest having about 2500 cpu cores and 1 PetaByte of disk space. We are loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running each day against these data sets. The list of projects that are using this infrastructure has proliferated – from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality.
  • Over time, we have added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and we are looking forward to releasing an open source version of this project in the near future.