AI Doesn't Have to Be Trained on Data Scraping: Synthetic Data Explainer

The cost of testing and implementing synthetic data begins at $100,000 and goes into the low millions


Generative artificial intelligence models are only as strong as the data they are trained on.

However, much of the high-quality, human-created data on the open web that is needed to train these models is either copyrighted or tainted by racial biases and misinformation.

AI firms are negotiating million-dollar deals with publishers or resorting to scraping the open web, frustrating publishers, some of which have filed lawsuits.

To counter this, AI firms such as Anthropic (maker of the chatbot Claude), Meta, Google and Microsoft are turning to synthetic data, in which AI models learn from real data to produce additional or different data.

“If you do it right with just a little bit of additional information, it may be possible to get an infinite data generation engine,” Dario Amodei, Anthropic’s CEO, told CNBC’s Squawk Box.

By 2030, most of the data used in AI will be artificially generated by rules, statistical models, simulations or other techniques, per a Gartner report. Here’s your primer.

Ok, so what is synthetic data?

Synthetic data is artificial data created by AI systems that mimics the statistical characteristics of real data, such as customer purchases, without revealing anyone’s identity.

“It doesn’t contain any real-world measurements or observations,” said Jason Snyder, chief technology officer at Momentum Worldwide.
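
To make that concrete, here is a minimal Python sketch of the statistical-mimicry idea. Everything in it is an illustrative assumption: the purchase amounts are invented, and a real generator would model far richer structure than a single mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" purchase amounts; stand-ins, not actual customer data.
real_purchases = np.array([12.5, 48.0, 7.2, 95.0, 33.3, 61.8, 25.4, 19.9])

# Estimate simple statistical properties of the real data.
mu, sigma = real_purchases.mean(), real_purchases.std()

# Sample synthetic purchases that share those statistics but map back
# to no actual customer or transaction.
synthetic_purchases = rng.normal(loc=mu, scale=sigma, size=1_000)
synthetic_purchases = synthetic_purchases.clip(min=0)  # no negative amounts

print(f"real mean: {mu:.2f}  synthetic mean: {synthetic_purchases.mean():.2f}")
```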

Synthetic data isn’t a novel concept—it’s been around for decades and was used in the 1980s for simulating road conditions to train autonomous vehicles.

And what’s new about this?

Now, gen AI has made synthetic data generation more accessible and user-friendly, letting more people create synthetic datasets with far less effort.

Synthetic data aims to mimic what’s already out there and create new datasets that can address gaps and avoid bias and privacy concerns. Or, if you’re working with a small dataset to train models, you can generate larger synthetic datasets based on real data to introduce new variations for better model training.
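
As a hedged sketch of that second use case, the snippet below inflates a tiny, invented dataset by resampling it with small random perturbations. Real augmentation pipelines are considerably more sophisticated; the jitter scale here is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# A deliberately tiny "real" training sample (illustrative values only).
small_real = np.array([1.0, 1.4, 0.9, 1.2, 1.1])

def augment(data, n, jitter=0.05):
    """Bootstrap-resample `data` to size n, adding noise scaled to its std."""
    base = rng.choice(data, size=n, replace=True)
    return base + rng.normal(0.0, jitter * data.std(), size=n)

synthetic = augment(small_real, n=500)  # 100x more points, with new variation
print(len(synthetic), round(float(synthetic.mean()), 3))
```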

“It focuses on creating new datasets of structured information, like tables, medical records or financial transactions,” said Snyder.

Sounds great! So it can keep AI firms from undermining publisher business models, right?

Sort of. For synthetic data to exist, models still need access to real data.

This means AI firms still rely on publishers’ data to keep training their models, even when synthetic data is in the mix, said Andrew Frank, distinguished VP analyst at Gartner.

Is there a debate?

As with any disruptive technology, the growing adoption of synthetic data has sparked debate.

On one side, advocates see it as a crucial tool in safeguarding privacy amid heightened concerns.

However, some AI experts are raising alarms about the risks of these techniques, warning of model collapse, a feedback loop in which models trained on AI-generated output gradually degrade.

“The models that are trained exclusively on synthetic data, they’re not going to perform well on unpredictable real-world scenarios ever,” said Snyder.

Synthetic data can also inadvertently amplify existing biases or introduce new ones if the original datasets contain biases, creating a vicious cycle that churns out datasets rife with the same troublesome issues.
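
A toy simulation makes the model-collapse worry tangible. Below, a "model" that is just a fitted Gaussian is repeatedly retrained on samples from its previous generation; because each refit inherits the sampling error of the last, the estimated spread tends to shrink over generations. The distribution, sample size and generation count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mu, sigma = 0.0, 1.0        # the original "real world" distribution
n, generations = 50, 100    # small samples make the effect visible sooner

for g in range(generations + 1):
    if g % 25 == 0:
        print(f"generation {g:3d}: mean={mu:+.3f}  std={sigma:.3f}")
    samples = rng.normal(mu, sigma, n)          # train on the previous model's output
    mu, sigma = samples.mean(), samples.std()   # refit; sampling error compounds
```

Mixing fresh real-world data back in at each generation damps this effect, which is why practitioners blend synthetic and real data rather than relying on synthetic output alone.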

Sounds nuanced. Who is already using synthetic data?

One common use case: creating artificial focus groups that let brands simulate conversations with potential customers to inform campaigns, saving time and money.

“Any marketing agency or brand that’s building AI models or training AI models is most likely using or needing to use synthetic data,” said Michael Olaye, senior vice president and managing director of strategy and innovation at R/GA.

Financial services firm SIX is working with synthetic data platform Syntheticus, which uses gen AI to generate high-quality datasets with statistical properties similar to SIX’s original data. This addresses compliance concerns by offering substitutes for original data that may contain sensitive information. It also generates new revenue streams by extracting valuable insights for informed business decisions while maintaining data security and privacy standards.

Momentum Worldwide is using synthetic data to create virtual audiences based on a brand’s pseudonymized first-party data to reach missing audience segments, such as minority groups.

By training its gen AI model with a blend of synthetic and real-world data, Momentum Worldwide can develop new narratives and audience personas.

To minimize bias, a cross-departmental team of strategists, DEI (diversity, equity and inclusion) experts, data scientists and analysts, and third-party stakeholders such as academics and influencers generates synthetic datasets based on campaign KPIs (key performance indicators).

So, what’s the catch?

The cost of testing and implementing synthetic data begins at $100,000 and goes into the low millions, scaling with the scope and complexity of the project, said Chirag Shah, professor in the Information School at the University of Washington.

The scientists who create synthetic datasets make conscious, calculated decisions, regularly testing and checking the quality of the output, said Olaye.

Is it here to stay?

Given the evolving privacy landscape and increasing regulation of data usage, the adoption of synthetic datasets looks increasingly inevitable.

“Everyone going forward at some point will need to use synthetic data just because of the privacy laws that are coming in,” said Olaye.
