Publications Are Using Realistic Voice AI to Narrate Content

Across the websites of local newspapers such as the Miami Herald, The Sacramento Bee and The Kansas City Star, a new embedded narration option now offers to read each article aloud with the cadence, inflection and sometimes even the emotional tone one would expect from a real human speaker.

But the voice reading the articles on these McClatchy media properties is no human, though it sometimes imitates one convincingly enough to fool focus group participants. Rather, it is the product of a neural network trained on thousands of hours of human speech, a text-to-speech tool that marks a turning point for a technology long known for its stilted monotone and awkward pacing and pronunciation.

McClatchy, which partnered with startup Trinity Audio on the tool in April, is not the only newspaper company turning to this technology. The largest U.S. newspaper chain, Gannett, and Canadian national newspaper The Globe and Mail have embarked on similar projects through partnerships with Amazon Polly. Trinity Audio has built its audio production and monetization software around Amazon Polly's AI.

The newspapers are betting that machine learning will let them do more with audio at a larger scale and a lower cost. While the notion of a text-to-speech tool might call to mind robotic recitation, deep learning has advanced the technology well past Microsoft Sam in the last couple of years, spurred by advances in processing power and compression that make training voice models easier, analysts say.

"Text-to-speech has dramatically progressed in the last two to three years," said Holger Mueller, a principal analyst and vp at Constellation Research. "The benefits are, of course, more ways of consumption for people and faster, easier and cheaper production."

Jeff Moriarty, Gannett's svp of consumer products, said his company has been exploring the possibilities of text-to-speech for years, but the latest AI advancements feel like a major milestone for the technology. "The quality is good enough that people actually have a harder time distinguishing the difference than ever," Moriarty said.

Other applications of AI text-to-speech

Google has launched a tool called Duplex, an AI voice that calls restaurants and hair salons to book reservations on behalf of users. The tool sounds so realistic that it initially faced controversy for not disclosing upfront that service staff were speaking with AI.

Jay-Z's record label Roc Nation recently served a YouTube channel with a cease-and-desist order for using neural networks to deepfake the rapper's voice. The channel, Vocal Synthesis, specializes in using AI to realistically imitate the voices of celebrities.

Research group OpenAI released a tool called Jukebox that is able to generate songs complete with lyrics and backing instrumentals in any of dozens of genres instantaneously.

The quality was not at that level when Greg Doufas, chief technology and digital officer at The Globe and Mail, first looked into text-to-speech tech a few years ago, when Amazon's virtual assistant Alexa was beginning to take off. "[The technology] wasn't there yet. It wasn't something we could put our brand behind," Doufas said.

Doufas was still skeptical when Amazon pitched the paper's executives on its AI narration about a year ago, until they heard it for themselves. The newspaper has since embedded an audio player in most of its online articles that can read the story in English, French or Mandarin in a male or female voice. The tool also lets readers organize playlists of articles to hear from a car radio or headphones while on the go, or through a smart speaker while doing chores at home.

Doufas said the paper recently began to push the software more aggressively and expand it to new formats, such as newsletters, after seeing a 20% average lift in pageviews and 70% more time spent on articles with the player in the first year of implementation.
McClatchy has also expanded to a companywide implementation after a pilot test at The Sacramento Bee and Raleigh News & Observer found a 168% increase in time spent on the news site, an 89% boost in story pageviews and a 95% increase in visits per user.

Now, the company is looking to parlay that engagement into programmatically placed midroll ads. McClatchy senior director of audio and video Jon Forsythe said the company experimented with 15- and 30-second mid- and preroll ads and saw little effect on how people used the tool.

Aside from McClatchy, Trinity Audio claims to have about 500 publisher clients signed on to use its Amazon Polly-based text-to-speech and monetization software, mostly in the U.S. but also across Latin America and Europe. Trinity Audio co-founder and CEO Ron Jaworski said a particular focus for the company is multicultural sites that need to serve up translations of content to multilingual audiences. Indeed, McClatchy saw some of its highest engagement rates with the tool from Spanish-language properties and in Spanish-speaking markets.

But while text-to-speech tech has come a long way, there are still areas where it needs to improve in order to sound more realistic.

"Text-to-speech will become indistinguishable from any given human reading text," said Mike Gualtieri, a Forrester analyst who specializes in AI. "There are a huge number of factors that make [speech] human: pauses, intonation, volume and many other properties of sound. Another key factor is the subtext."


News Outlets Are Using Realistic AI to Convey Emotions and Narrate News

The technology comes as media companies double down on audio strategies

Patrick Kulp
