Data centers, APIs and what they mean to journalism

For journalists, creating a database means sifting through tons of raw, often unorganized data, presenting it in an indexable way and sometimes finding the stories buried deep in the data. This is part of the long tradition of journalism: synthesizing information before it is presented to the public. The latest trend of posting raw data to the web means the public can examine news and statistics without a filter and find their own stories, without waiting for a group of journalists to figure it out for them.

The online presentation of raw data has taken many forms. Mainstream news organizations like the New York Times, the Guardian and Advertising Age have created online data centers where large collections of numbers and statistics are available to the public to peruse at their leisure or, better yet, to mash up into their own databases and visualizations.

This is, of course, part of a larger trend on the web of making data available for anyone who wants to view or use it. Data.gov, a project of the US government, houses data on everything from tax information to natural disaster statistics and makes the information available in various digital formats, including CSV and XML. The recently announced DataSF, a collection of data published by the city and county of San Francisco, California, has more than 100 datasets available for public use — everything from bridge locations and bodies of water to crime statistics and public works projects.
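For anyone curious what working with these portals looks like in practice, the CSV format mentioned above is the easiest entry point. Here is a minimal sketch, using a made-up crime-statistics file (the column names and figures are illustrative, not an actual DataSF dataset), of tallying records with Python's standard csv module:

```python
import csv
import io

# Hypothetical sample in the style of a city open-data CSV export;
# the columns and numbers are invented for illustration.
SAMPLE_CSV = """district,category,incidents
Mission,Theft,412
Richmond,Theft,128
Mission,Assault,97
"""

def incidents_by_district(csv_text):
    """Sum incident counts per district from CSV text."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        district = row["district"]
        totals[district] = totals.get(district, 0) + int(row["incidents"])
    return totals

print(incidents_by_district(SAMPLE_CSV))
# prints {'Mission': 509, 'Richmond': 128}
```

A few lines like this are often all that separates a downloaded dataset from a publishable summary, which is why the CSV-plus-script workflow has become a staple of data journalism.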

Posting raw data has its advantages over traditional journalism in that it gets the public involved and uncovers stories that even a team of journalists could not discover on their own. Earlier this year, the Guardian posted more than 450,000 pages of data on UK government officials' expenses and asked the public for help in finding interesting tidbits or information. Based on the public's findings, the staff created a series of stories that detailed outlandish expenses like £2,000 to dredge a moat at a private estate.

The datasets presented by news organizations are often publicly available numbers and statistics that can be found on- or offline. The difference is that the data has been cleaned up and made available in a digital format that takes less time to sift through and understand. Datasets aren't limited to third-party information, either: NPR made more than 80,000 of its transcripts available via its recently announced Transcript API. The API allows developers to mash up the transcripts in ways that are yet to be seen.
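APIs like NPR's generally return structured documents (often XML) that a developer's script parses before doing anything interesting with the content. As a rough sketch of that step — the response shape and element names below are invented for illustration and are not NPR's actual schema — here is how a mashup might pull speaker-attributed lines out of a transcript document with Python's standard XML parser:

```python
import xml.etree.ElementTree as ET

# Hypothetical transcript-API response; the element and attribute
# names are illustrative, not NPR's actual schema.
SAMPLE_RESPONSE = """<transcript storyId="12345">
  <paragraph speaker="HOST">Welcome to the program.</paragraph>
  <paragraph speaker="GUEST">Thanks for having me.</paragraph>
</transcript>"""

def extract_lines(xml_text):
    """Return (speaker, text) pairs from a transcript document."""
    root = ET.fromstring(xml_text)
    return [(p.get("speaker"), p.text.strip())
            for p in root.iter("paragraph")]

for speaker, line in extract_lines(SAMPLE_RESPONSE):
    print(f"{speaker}: {line}")
# prints:
# HOST: Welcome to the program.
# GUEST: Thanks for having me.
```

Once transcripts are reduced to structured pairs like this, they can be searched, counted or combined with other datasets — the raw material for the kinds of mashups the article describes.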

But is posting raw data journalism? Where are the editing, the reporting and all the values that are the bedrock of newsrooms everywhere? The core of a journalist's job is to spread the news and to inform the public. While posting raw data may not involve some of the traditional practices of journalism, it is still sharing the news and telling the story. Even better, this system for sharing content lets the public decide for themselves what is news, without the filter of a news outlet deciding for them. This process encapsulates the core values of online journalism: collaboration, openness and stepping outside of traditional means of delivering the news.