How Pinterest Manages 20 Terabytes of Data a Day

Handling such a huge data load is expensive, but Pinterest puts the data to work, so the investment essentially pays for itself.


It’s not enough to just be a social network anymore; just ask Myspace and Bebo. Now social networking startups need to be data companies in addition to connecting people and serving as an advertising platform for brands. To that end, Pinterest has released some information about how it runs its big data operation.

A major part of Pinterest’s success, both in scaling its network and in maintaining stability, comes from its use of an open source project called Hadoop. According to the project’s official website, Hadoop is “a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”
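For readers curious what a “simple programming model” looks like in practice, here is a minimal, single-machine sketch of the map, shuffle and reduce pattern that Hadoop popularized. It is an illustration only and not Pinterest’s code: the event records, the `category` field and the function names are all hypothetical, and a real Hadoop job would spread each phase across many machines rather than run in one Python process.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, 1) pair per record; the key is a hypothetical pin category."""
    for record in records:
        yield record["category"], 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_category(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample events, invented purely for illustration.
events = [
    {"pin_id": 1, "category": "recipes"},
    {"pin_id": 2, "category": "travel"},
    {"pin_id": 3, "category": "recipes"},
]

if __name__ == "__main__":
    print(count_by_category(events))
```

The appeal of the model is that the map and reduce steps work on independent pieces of data, so, at least in principle, the same logic can be spread over more machines as the daily volume grows.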

This system allows Pinterest to handle the 20 terabytes of new data the network generates every day. By adding layers of flexibility and tools such as ‘ephemeral clusters’ (banks of computing power that come online and go offline as needed), Pinterest has mastered its data load. Most of the information in Pinterest’s blog post on the subject is technically dense, but the message is clear: Pinterest doesn’t want to miss out on a single data point.

The blog post notes that all this back-end work is done with the goal of serving Pinners with “the most relevant and recent content… through features such as Related Pins, Guided Search and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.”

If Pinterest were to simply process this data with minimal analysis, the company would be essentially leaving money on the table.

Through rigorous analysis, social companies can discover what users want, and maybe even how users are using the product in ways its creators never intended. Pinterest, and others like Foursquare, realize that big data is now a core concern of any online business model.

Handling data is an expensive proposition, especially when it’s 20 terabytes a day. But if a company can put that data back to work effectively, the investment more than pays for itself.