Twitter Just Released Two Huge Datasets Related to Election Interference Activity on Its Platform

3,841 accounts affiliated with Russia’s Internet Research Agency are covered, along with 770 that likely originated in Iran

Just part of the 1.24 gigabytes of tweet information and 296 GB of media that is now available. Image: Twitter
By David Cohen

Twitter has released examples in the past of accounts and content used by the Internet Research Agency, the Russian government-linked organization that used social media posts to influence the 2016 U.S. presidential election. On Wednesday, Twitter released everything.

The social network made two sizable datasets available Wednesday—one regarding the IRA, and the other related to accounts that potentially originated in Iran and were engaging in coordinated behavior aimed at the upcoming midterm elections in the U.S. Those accounts were removed in August.

Twitter legal, policy and trust and safety lead Vijaya Gadde and head of site integrity Yoel Roth revealed in a blog post that the datasets comprise 3,841 IRA-affiliated accounts and 770 that potentially originated in Iran, and combined, they contain more than 10 million tweets, as well as over 2 million images, GIFs, videos and Periscope broadcasts, with some account activity dating back to 2009.

The IRA dataset contains 1.24 gigabytes of tweet information and 296 GB of media in 302 archives, while the one attributed to Iran contains 168 megabytes of tweet information and 65.7 GB of media in 52 archives. All of it is available for download from Twitter.

More content from the datasets released by Twitter

Twitter said the datasets include all public, nondeleted tweets and media. Deleted tweets, which are not included, make up less than 1 percent of the overall activity of the accounts in the datasets.

The company added that not all accounts it identified as being connected to these efforts actively tweeted, so the number of accounts represented in the datasets may be lower than the totals reported in the blog post.

Twitter said identifying fields such as user ID and screen name were hashed for accounts with fewer than 5,000 followers in order to reduce the potential impact on authentic or compromised accounts, adding that those seeking access to the unhashed versions of the datasets can apply via a form the company provides.
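
For readers who want a sense of how that kind of threshold-based hashing could work, here is a minimal sketch in Python. Twitter has not said in this article which hash function it used, so SHA-256, the salt value and the field names `user_id`, `screen_name` and `follower_count` are all assumptions made for illustration. Only the 5,000-follower cutoff comes from Twitter's description.

```python
import hashlib

# From the article: identifying fields are hashed below this follower count.
FOLLOWER_THRESHOLD = 5000


def pseudonymize(account: dict, salt: str = "example-salt") -> dict:
    """Return a copy of the account, hashing identifying fields for small accounts.

    The salt, the field names and the choice of SHA-256 are hypothetical;
    they are not taken from Twitter's release.
    """
    result = dict(account)  # copy so the original record is left untouched
    if account["follower_count"] < FOLLOWER_THRESHOLD:
        for field in ("user_id", "screen_name"):
            raw = (salt + str(account[field])).encode("utf-8")
            result[field] = hashlib.sha256(raw).hexdigest()
    return result


# Made-up example records, not real accounts from the datasets.
accounts = [
    {"user_id": "12345", "screen_name": "small_account", "follower_count": 120},
    {"user_id": "67890", "screen_name": "large_account", "follower_count": 48000},
]
released = [pseudonymize(a) for a in accounts]
```

In this sketch the first record comes out with hashed identifiers while the second, which is above the cutoff, keeps its user ID and screen name in the clear. Because the same input always produces the same hash, a researcher could still group tweets by account without learning who the account belonged to.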

For people who believe their account or accounts were erroneously included in these datasets, Twitter suggested that they log into their accounts and file an appeal.

Gadde and Roth wrote that the data is being made available to help encourage “open research and investigation of these behaviors from researchers and academics around the world,” adding that early access was provided to “a small group of researchers with specific expertise in these issues.”

David Cohen is editor of Adweek's Social Pro Daily.