Across the websites of local newspapers such as the Miami Herald, The Sacramento Bee and The Kansas City Star, a new embedded narration option now offers to read each article aloud with a cadence, inflection and sometimes even an emotional tone that one would expect from a real human speaker.
But the voice reading the articles on these McClatchy media properties is not human, though it sometimes imitates one convincingly enough to fool focus group participants. Rather, it is a neural network trained on thousands of hours of human speech to yield a text-to-speech tool, one that marks a turning point for a technology long known for its stilted monotone and awkward pacing and pronunciation.
McClatchy, which partnered with startup Trinity Audio on the tool in April, is not the only newspaper publisher turning to this technology. The largest U.S. newspaper chain, Gannett, and Canadian national newspaper The Globe and Mail have embarked on similar projects through partnerships with Amazon Polly. Trinity Audio has also built its audio production and monetization software around Amazon Polly’s AI.
The newspapers are betting that machine learning will let them do more at a larger scale and a lower cost. While the notion of a text-to-speech tool might call to mind robotic recitation, deep learning has advanced the technology well past Microsoft Sam in the last couple of years, spurred by advances in processing power and compression that make training voice models easier, analysts say.
“Text-to-speech has dramatically progressed in the last two to three years,” said Holger Mueller, a principal analyst and vp at Constellation Research. “The benefits are, of course, more ways of consumption for people and faster, easier and cheaper production.”
Jeff Moriarty, Gannett’s svp of consumer products, said his company has been exploring the possibilities of text-to-speech for years, but the latest AI advancements feel like a major milestone for the technology. “The quality is good enough that people actually have a harder time distinguishing the difference than ever,” Moriarty said.
Other applications of AI text-to-speech
- Google has launched a tool called Duplex, an AI voice that calls restaurants and hair salons to book reservations on behalf of users. The tool sounds so realistic that it initially faced controversy for not disclosing upfront that service staff were speaking with AI.
- Jay-Z’s record label Roc Nation recently served a YouTube channel with a cease-and-desist order for using neural networks to deepfake the rapper’s voice. The channel, Vocal Synthesis, specializes in using AI to realistically imitate the voices of celebrities.
- Research group OpenAI released a tool called Jukebox that can generate songs, complete with lyrics and backing instrumentals, in any of dozens of genres.
The quality was not yet at that level when Greg Doufas, chief technology and digital officer at The Globe and Mail, first looked into text-to-speech tech a few years ago, when Amazon’s virtual assistant Alexa was beginning to take off. “[The technology] wasn’t there yet. It wasn’t something we could put our brand behind,” Doufas said.
Doufas was skeptical when Amazon first pitched the paper’s executives on the tech giant’s AI narration about a year ago, until they heard it for themselves.
The newspaper has since embedded an audio player in most of its online articles that can read the story in English, French or Mandarin in a male or female voice. The tool also lets readers organize playlists of articles they want to hear from a car radio or headphones while on the go or through a smart speaker while doing chores at home.