Half of Top News Sites Blocked OpenAI's Crawlers in 2023

Reuters Institute study looked at print, digital-native publishers and broadcasters across 10 countries


At the end of 2023, nearly one-half (48%) of the top news websites, based on reach, across 10 countries blocked OpenAI’s crawlers, while nearly one-quarter (24%) blocked Google’s AI crawler, according to a study by Reuters Institute.

Reuters Institute analyzed the robots.txt files of the 15 online news sources with the widest reach in each country, including titles like The New York Times, BuzzFeed News, The Wall Street Journal, The Washington Post, CNN and NPR. The countries covered include Germany, India, Spain, the U.K. and the U.S.

In the absence of clear regulatory frameworks governing generative artificial intelligence’s use of copyrighted material, many large publishers have taken matters into their own hands, taking AI firms to court, updating terms of service, blocking crawlers or making deals to protect premium content, data and revenues.

The study grouped outlets into three categories: legacy print publications, television and radio broadcasters and digital-born outlets.

Over one-half (57%) of the websites of legacy print publications, such as The New York Times, blocked OpenAI’s crawlers by the end of 2023, compared with 48% of television and radio broadcasters and 31% of digital-born outlets.

Similarly, 32% of print outlets blocked Google’s crawlers, while 19% of broadcasters and 17% of digital-born outlets did the same.

“The Reuters study highlights a fundamental challenge for generative AI: its dependence on authentic content generated by real people who see it as a threat to their livelihoods,” said Gartner VP distinguished analyst Andrew Frank.

Meanwhile, a recent study by Cornell University found that when new AI models are trained on data derived from prior models rather than human input, they tend to undergo “model collapse,” or degenerate, leading to increased errors and misinformation in the generated output.

“This suggests that large language model developers need to find ways to compensate people who create or report true content, not just for the sake of society, but also for their own commercial interests,” said Frank.

Website crawlers are deployed for many reasons. Crawlers like Google’s Googlebot index publisher websites for the tech giant’s search results. OpenAI’s crawler, GPTBot, by contrast, collects data across the internet to train the large language models behind products such as ChatGPT. This helps AI tools generate accurate, up-to-date output, and news publishers are uniquely positioned to supply that kind of material: LLMs overweight premium publishers’ content by a factor of between 5 and 100. AI-powered solutions are also emerging as alternatives to traditional search engines.
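For readers who want to see what a block looks like in practice, the sketch below checks a robots.txt file for rules that shut out AI crawlers. It is an illustration only, not the Reuters Institute's method. GPTBot is the OpenAI crawler named above; "Google-Extended" is the token Google publishes for opting out of AI training and is not named in the study coverage here, so treat it as an assumption. The function name `blocked_agents` and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens to test. "Google-Extended" is an assumption here:
# it is Google's published AI opt-out token, not a name from the article.
AI_AGENTS = ("GPTBot", "Google-Extended")


def blocked_agents(robots_txt, agents=AI_AGENTS, path="/"):
    """Return {agent: True} for each agent that may not fetch `path`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, path) for agent in agents}


# Hypothetical robots.txt that shuts out GPTBot and allows everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for agent, blocked in blocked_agents(SAMPLE).items():
        print(agent, "blocked" if blocked else "allowed")
```

The check covers only the site root. A publisher that disallows specific sections rather than the whole site would need a per-path test, and robots.txt is a voluntary convention that crawlers can ignore.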

No reversal once blocked

News outlets in the Global North were more inclined to block AI crawlers than those in the Global South, per the study.

In the U.S., 79% of the top online news websites blocked OpenAI, while only 20% did so in Mexico and Poland. Meanwhile, 60% of the news sites in Germany blocked Google’s crawlers, while 7% did so in Poland and Spain.

Almost every website that blocked Google AI also blocked OpenAI (97%). Although the study does not provide a definitive explanation for this trend, it suggests that OpenAI’s release of its crawler before Google’s may have contributed to it.

Meanwhile, in most countries, some publishers blocked both sets of crawlers immediately upon their release. OpenAI launched its AI crawlers in early August 2023, followed by Google in September. Once the decision to block was made, no website reversed its stance by unblocking either an OpenAI or Google AI crawler, per the study.