If Twitter is sitting on a goldmine of data, someone needs to create a better method of digging it up. The Library of Congress was given access to the entire firehose of tweets in 2010, but researchers at the institution are still scratching their heads over how to organize and display the collection, which currently totals 170 billion tweets dating back to 2006.
Deputy Librarian of Congress Robert Dizard Jr. told the Washington Post:
“People expect fully indexed — if not online searchable — databases, and that’s very difficult to apply to massive digital databases in real time. The technology for archival access has to catch up with the technology that has allowed for content creation and distribution on a massive scale. Twitter is focused on creating and distributing content; that’s the model. Our focus is on collecting that data, archiving it, stabilizing it and providing access; a very different model.”
Currently, the library uses Gnip, a Colorado-based data company, to manage the transfer of tweets to the archive. That transfer does nothing to filter the tweets or make their contents searchable.
Gnip is one of three companies, along with DataSift and Topsy, that have access to the firehose for commercial use. Twitter has stipulated that the public must view the library’s collection in person so that the archive does not compete with these vendors. And for privacy reasons, the library has chosen not to archive deleted tweets.
Despite its limitations, a publicly available archive of tweets is a valuable gift, and a tremendous public service, as the social network grows in cultural significance.
Public figures including President Barack Obama and Pope Benedict XVI have used the microblogging site to make announcements, share pictures, and engage with the public. Journalists and citizens alike have used the forum to quickly and openly share information, as with the Arab Spring activists who drew attention to civil unrest in Egypt. The vast stream of tweets can be analyzed for cultural trends and public sentiment surrounding historical events. More recently, Nielsen started pairing Twitter data with its current system for rating television programs.
But without a third party to organize the data, the cache of tiny historical records may never see the light of day.
Image by Verticalarray.