How Big Data Spawned Thousands of Cambridge Analyticas

Opinion: A look at the big picture reveals a more unsettling reality

It’s innate in human nature to expect a basic degree of protection—from bodily harm, from emotional injury, from assaults on our livelihoods. In free societies, citizens also expect to be protected from wrongful intrusions into their private lives, whether by governmental bodies, multinational corporations or social networks.

At the heart of the Facebook/Cambridge Analytica firestorm is an aggrieved sense that this fundamental right was violated—by an unscrupulous consulting firm, by a social network that failed to prevent the theft of millions of users’ data and by government regulators who failed to prevent such abuses.

Some 270,000 people voluntarily divulged their personal data when they installed the Cambridge Analytica-developed “This Is Your Digital Life” application, but Cambridge Analytica then used that information to access the personal data of up to 87 million more unwitting Facebook users.

How could the social network have allowed what CEO Mark Zuckerberg now admits was “a major breach of trust”? For all of the company’s current contrition, critics have castigated Facebook for its failure to enforce clear protocols preventing user data from being accessed by third parties, particularly when the latter are looking to leverage that data to influence users’ behavior: their shopping habits, where they’ll take their next vacation and even whom they’ll vote for.

While it comes from a place of outrage, this narrative offers its proponents a measure of comfort: All we need to do is ratchet up our vigilance against those seeking to exploit our data, and similar fiascoes will be averted. But a look at the big picture reveals a more unsettling reality.

The widening reach of data science

Data science excels at supervised learning: using labeled data from a small population sample to build a model that predicts the same attributes for the general population. The problem does not lie solely with Facebook neglecting the safety of our data. It is much bigger, and it is rooted in the “data hasty,” those who willingly share just about everything about themselves and, in the process, allow any data scientist on the planet to use that data for social engineering.

Take location data: Suppose you are vigilant about your privacy and always turn off your mobile device’s GPS feature. Once you establish a connection, whether through a nearby cell tower or a public Wi-Fi network, inference systems can still closely estimate your position from the other people at the same location who had GPS enabled.
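To make the mechanism concrete, here is a minimal sketch of that kind of deduction. It assumes a log of which devices were seen on which access point, and it estimates a GPS-off device’s position as the average of the GPS fixes reported by others on the same network. The device names, access point names, coordinates and the infer_position function are all hypothetical, and real systems are far more elaborate.

```python
from statistics import mean

# Hypothetical observations: (device_id, access_point_id, gps_fix).
# A gps_fix is a (latitude, longitude) pair; None means GPS was switched off.
observations = [
    ("alice", "cafe-wifi", (40.7410, -73.9897)),
    ("bob", "cafe-wifi", (40.7412, -73.9895)),
    ("carol", "cafe-wifi", None),  # privacy-minded user, GPS off
    ("dave", "library-wifi", (40.7532, -73.9822)),
]


def infer_position(device_id, observations):
    """Estimate a device's position as the centroid of the GPS fixes
    reported by other devices seen on the same access point(s)."""
    access_points = {ap for dev, ap, _ in observations if dev == device_id}
    fixes = [
        fix
        for dev, ap, fix in observations
        if dev != device_id and ap in access_points and fix is not None
    ]
    if not fixes:
        return None  # nobody nearby shared a location
    return (mean(lat for lat, _ in fixes), mean(lon for _, lon in fixes))


# Carol never shared her location, yet her neighbors' fixes place her.
estimate = infer_position("carol", observations)
```

In this toy data, Carol is placed between Alice and Bob, while Dave cannot be located by this method because nobody else on his network shared a fix.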

For those who wonder how Facebook almost always succeeds in “guessing” where you want to check into, it’s not a matter of luck. Each time a user checks into a place, the location is mapped in 3D (including altitude) in the background, and the venue type is also catalogued. IDs from nearby routers are also traced, along with their exact signal strength. Even if you do not check in, Facebook can easily monitor where you are.

Meanwhile, consider the mobile apps that you use only at home, such as remote controls for air conditioning or TVs. Tagged data indicating that a certain location is someone’s home can help detect the behavioral patterns of people at home and predict others’ home locations.

The same goes for fitness and dating apps: People tend to provide personal information without hesitation, since the app has a legitimate need for it. Age, gender, ethnicity and marital status are all provided willingly and later fed into models that predict those attributes for the general population with a simple supervised classification algorithm. Global information and data companies use such methods to corroborate the data they hold and ensure its accuracy.
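As a rough illustration of how little is needed, the sketch below trains nothing more than a k-nearest-neighbors vote on a handful of volunteered records and then labels users who never disclosed anything. The features (weekly workouts, late-night sessions), the age brackets, the user names and the predict function are invented for this example and are not drawn from any real app.

```python
import math
from collections import Counter

# Hypothetical labeled sample volunteered by a willing minority:
# (weekly_workouts, late_night_sessions) -> self-reported age bracket.
labeled = [
    ((5, 6), "under_30"),
    ((4, 7), "under_30"),
    ((6, 5), "under_30"),
    ((2, 1), "30_plus"),
    ((1, 2), "30_plus"),
    ((3, 0), "30_plus"),
]


def predict(features, labeled, k=3):
    """Label a user by majority vote among the k closest labeled users."""
    nearest = sorted(labeled, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Users who never disclosed their age, described only by behavior.
unlabeled = {"user_a": (5, 5), "user_b": (2, 2)}
predictions = {user: predict(f, labeled) for user, f in unlabeled.items()}
```

Here user_a is assigned to the “under_30” bracket and user_b to “30_plus” purely from behavior that resembles the volunteers’ behavior, which is the dynamic the paragraph above describes.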

Now, what’s more dangerous: a data breach that is quickly fixed and guaranteed not to recur, or a de facto situation in which a willing minority provides high-quality tagged data and, in the process, puts the privacy of the privacy-focused majority at risk?

The implications of these questions are vast. While much of the post-Cambridge Analytica conversation has revolved around regulatory measures to ensure user consent and remedy breaches of privacy, getting to the heart of the matter will require a deeper discussion of regulatory steps to address companies’ ability to use data for making smart deductions. Full transparency and real protection will trump ex post facto remedies.

Recent advancements in artificial intelligence, especially in the field of artificial neural networks (deep learning), are taking this to record heights, enabling the accurate modeling of almost any type of behavior. Do you want to predict someone’s first choice for a vacation destination? All you need is correctly tagged data from a small population in order to build a reliable model.

“Give me a place to stand and, with a lever, I will move the whole world,” Archimedes once proclaimed. In the data science world, accurately tagged data serves as that lever. Access to such data allows models to be precise and perform predictions on a huge scale.

While Facebook was caught breaching users’ trust, the truth is that the evolution of data science is going to make it even easier to do so in the future. All conversation about regulation should go well beyond this isolated incident and focus on best practices for all tagged data that ensure the privacy of the majority is protected.

Itai Blitzer is head of data at data-driven advertising technology company Matomy.