VIDEOS: Data @Scale Boston

Facebook hosted Data @Scale Boston at the Liberty Hotel last week, and the social network provided videos of the event’s presentations in the Data @Scale Boston group and on YouTube.

DataAtScaleBoston650Facebook hosted Data @Scale Boston at the Liberty Hotel last week, and the social network provided videos of the event’s presentations in the Data @Scale Boston group and on YouTube.

The videos are:

Welcome: Ryan Mack, Facebook Boston site lead.

@Scale is a series of technical conferences for engineers who build or maintain systems that are designed for scale.

Query Evaluation Using Dynamic Code Generation: Magnus Bjornsson, senior director of engineering, Oracle.

Typical query evaluation in a database uses a static evaluator, which is built to handle all types of queries. For performance reasons, it has become more and more common in recent years to dynamically build the evaluator based on the query itself (using JIT compilation). In this talk, I’ll talk about the approach that we at Oracle/Endeca took in our own columnar, in-memory data store to dynamically generate the query evaluator at query time.

Data Movement for Distributed Execution: Derrick Rice, software engineer, HP Vertica.

Networked data-movement techniques are critical to scalability in distributed computing. As data sets have grown and analytics have increased in complexity, traditional approaches have run into some surprising problems. Exhaustion of ephemeral ports and OS buffers. Deadlock and unexpected performance degradations. Extraordinary overhead costs. Congestion control challenges and firmware corner cases.

This talk will introduce HP Vertica’s data-transmission layer and the challenges encountered in the context of its distributed execution engine. We will share our journey from a naive implementation to a topology-aware data flow. We will also look at what can be learned from other technologies and ask, “What’s next?” Looking forward, operating at scale will continue to reveal problems and require new techniques.

Scaling to Over 1,000,000 Requests per Second: Beth Logan, senior director of optimization, DataXu.

DataXu’s decisioning technology handles more than 1 million ad requests per second. To put this into context, Google Search handles 5,000 to 10,000 transactions per second, and Twitter handles 5,000 to 7,000 transactions per second. Behind this statistic is an incredible architecture that has enabled us to scale. We use a blend of open-source and homegrown tools to place ads, record their impact and learn and deploy our decisioning models automatically, all while running 24×7 in over 30 countries worldwide. In this presentation, we will dive into some of these tools and discuss the challenges we faced and the tradeoffs we made.

Cold Storage at Facebook: Ritesh Kumar, software engineer, Facebook.

Cold storage is an internally used Exabyte-scale archival storage system developed completely in-house at Facebook. We discuss some of the salient design features of the cold-storage stack and how it fits into the specific low-power hardware requirements for cold storage and its unique workload characteristics. We will discuss multiple aspects of the software stack including methods to practically keep storage very durable and highly efficient, and handling realistic operations such as handling incremental cluster growth and tolerating a myriad of hardware failures at scale.

Fractal Tree Indexing in MySQL and MongoDB: Tim Callaghan, vice president of engineering, Tokutek.

As transactional and indexed reporting data sets continue to grow, traditional B-tree indexing struggles to keep up, especially when the working set of data cannot fit in RAM (random-access memory). Fractal tree indexes were purpose built to overcome this limitation, while retaining the read properties we expect for our queries. We’ll start by covering the theoretical differences between the two indexing technologies. We’ll end the talk by discussing the benefits that fractal tree indexes bring to MySQL (TokuDB) and MongoDB (TokuMX). “Benchmarks or it doesn’t count,” so expect to see a few.

Scalable Collaborative Filtering on Top of Apache Giraph: Maja Kabiljo, software engineer, Facebook.

Apache Giraph is a highly performant distributed platform for doing graph and iterative computations. Collaborative filtering is a well-known recommendation technique that is often solved with matrix-factorization based algorithms. This talk will detail our scalable implementation of SGD and ALS methods for collaborative filtering on top of Giraph. We will describe our novel methods for distributing the problem and the related Giraph extensions that allows us to scale to more than 1 billion people and tens of millions of items. We will also review various additions required for handling Facebook’s data (for example, implicit and skewed item data). Finally, to complete our easy-to-use and holistic approach to scalable recommendations at Facebook, we detail our approach to quickly finding top-k recommendations per user.