Facebook introduced Audience Insights in May as a way for brands to learn more about the audiences they target with ads on the social network and refine their strategies. But what technical challenges did the feature face, and how were they overcome? Software engineers Deniz Demir, Islam AbdelRahman, Liang He and Yingsheng Gao answered those questions in a post on Facebook’s engineering blog.
Please see the blog post for all of the details, but the basics follow:
Audience Insights (AI) needs to process queries over tens of terabytes of data in a few hundred milliseconds, and those queries can include complex analytical computations. For example, to better engage people between 18 and 21 years old in the U.S., you may want to learn what they are interested in. AI can analyze all of your audience’s page likes, as well as the other pages each of those pages has liked, in order to come up with a set of pages that have high affinity with your audience. AI often needs to go through billions of page likes to compute affinity analytics for an audience.
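The blog post does not spell out the affinity formula, but the idea of ranking pages by how over-represented they are among an audience's likes can be sketched with a simple "lift" score. Everything here, including the function names and the scoring formula, is an illustrative assumption, not Facebook's actual method:

```python
from collections import Counter

def affinity_scores(audience_likes, global_likes, global_total):
    """Rank pages by lift: the share of the audience liking a page
    divided by the share of the overall population liking it.
    A lift well above 1 suggests high affinity with the audience."""
    audience_total = len(audience_likes)
    # Count how many people in the audience like each page.
    counts = Counter(page for likes in audience_likes for page in likes)
    scores = {}
    for page, c in counts.items():
        p_audience = c / audience_total                # share of audience liking the page
        p_global = global_likes[page] / global_total   # share of everyone liking it
        scores[page] = p_audience / p_global           # lift score
    # Highest-affinity pages first.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

For instance, if everyone in a two-person audience likes a cooking page that only 10 percent of all users like, that page scores a lift of 10 and ranks above a page the audience likes at the same rate as everyone else.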
AI is powered by a query engine with a hybrid integer store that organizes data in memory and on flash disks, so a query can process terabytes of data in real time across a distributed set of machines. The query engine is a fan-out distributed system with an aggregation tier and a leaf (data) tier: the aggregator sends a query request to all data nodes, each of which executes the query locally and sends its results back to be aggregated.
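The scatter/gather pattern described above can be sketched in a few lines. This is a minimal single-process illustration, assuming a toy count-by-country query; the row layout, the predicate, and the merge-by-Counter choice are all assumptions, and in production the fan-out would be parallel RPC rather than a loop:

```python
from collections import Counter

def leaf_execute(local_rows, predicate):
    """A data (leaf) node filters its own rows and returns local counts."""
    return Counter(row["country"] for row in local_rows if predicate(row))

def aggregate(leaves, predicate):
    """The aggregation tier fans the query out to every leaf
    and sums the partial results into one answer."""
    total = Counter()
    for rows in leaves:  # sequential here; parallel RPC in a real system
        total += leaf_execute(rows, predicate)
    return total
```

A query such as `aggregate(leaves, lambda r: 18 <= r["age"] <= 21)` would return the per-country counts of 18-to-21-year-olds across all leaves.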
AI has two types of data: attributes of people, such as age, gender and country; and social connections, such as page likes, custom audiences and interest clusters. These data sets are made up of anonymized integers; no real user IDs are kept in the system. Data is sharded by user, and there is a total of 1,024 shards distributed across 168 data nodes in one production cluster. For each shard, 35 gigabytes of raw data are copied to a data node to build the shard, whose indices take 8 GB in memory and 4 GB in embedded storage on flash disk. The total amount of data copied to one cluster is more than 35 TB, whose indices take a total of 8 TB in memory and 4 TB in embedded storage. A small amount of metadata is replicated to all nodes, and the query processor accesses the data in memory much more frequently than the data in embedded storage.
The AI query engine is implemented in C++ and uses Facebook’s internal frameworks and libraries. Looking forward, the engineers wrote, the next step is to support time-series data sets and to organize data so the engine can run real-time queries on data sets an order of magnitude larger. They hope to increase the ratio of data kept in embedded storage to keep the total memory requirement under control as the system grows to petabyte-scale data sets.
Wenrui Zhao, Maxim Sokolov, Scott Straw and Ajoy Frank were also credited with working on the Audience Insights query engine.