Data Journalism: Collecting Data for Your Story

By Lillian Pierson

A data-journalism piece is only as good as the data that supports it. To publish a compelling story, you must find compelling data on which to build. That isn’t always easy, but it’s easier if you know how to use scraping and autofeeds to your advantage.

Scraping data

Web scraping means setting up automated programs that scour the Internet and extract the exact, custom datasets you need, so you don’t have to collect them by hand. The data that this process produces is commonly called scraped data. Most data journalists scrape source data for their stories because it’s the most efficient way to get datasets for unique stories. Datasets that are easily accessible have usually already been exploited and mined by teams of data journalists who were looking for stories. To generate unique data sources for your data-driven story, scrape the data yourself.

If you find easy-to-access data, be aware that most of the stories in that dataset have probably already been told by a journalist who discovered the data before you.

To illustrate how you’d use data scraping in data journalism, imagine the following example: You’re a data journalist living in a U.S. state that directly borders Mexico. You’ve heard rumors that the local library’s selection of Spanish-language children’s books is woefully inadequate. You call the library, but its staff fear negative publicity and won’t share any statistics with you about the topic.

Because the library won’t budge on its data-sharing, you’re forced to scrape the library’s online catalog to get the source data you need to support this story. Your scraping tool is customized to iterate over all possible searches and keep track of the results. After scraping the site, you discover that 25 percent of children’s books at the library are Spanish-language books. Spanish-speakers make up 45 percent of the primary-school population; is this difference significant enough to form the basis of a story? Maybe, maybe not.
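The share calculation in this scenario can be sketched in a few lines of Python. Treat it as an illustration only: the markup (a list item with class "book" and a data-language attribute), the names CatalogParser and language_share, and the sample pages are all assumptions. A real catalog has its own page structure, and you’d fetch each results page over the network rather than read it from hard-coded strings.

```python
from collections import Counter
from html.parser import HTMLParser


class CatalogParser(HTMLParser):
    """Counts catalog entries by language.

    Assumes (hypothetically) that each book appears as
    <li class="book" data-language="...">.
    """

    def __init__(self):
        super().__init__()
        self.languages = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "book":
            self.languages[attrs.get("data-language", "unknown")] += 1


def language_share(pages, language):
    """Return the fraction of scraped books written in `language`."""
    parser = CatalogParser()
    for html in pages:
        parser.feed(html)
    parser.close()
    total = sum(parser.languages.values())
    return parser.languages[language] / total if total else 0.0


# Made-up result pages standing in for the scraped catalog.
SAMPLE_PAGES = [
    '<ul><li class="book" data-language="English">A</li>'
    '<li class="book" data-language="Spanish">B</li></ul>',
    '<ul><li class="book" data-language="English">C</li>'
    '<li class="book" data-language="English">D</li></ul>',
]

share = language_share(SAMPLE_PAGES, "Spanish")
print(f"Spanish-language share: {share:.0%}")
```

With one Spanish title out of four, the sample yields the same 25 percent share used in the example; you would then set that figure against the 45 percent of primary-school students who speak Spanish.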

To dig a little deeper and possibly discover a reason behind this difference, you decide to scrape the catalog once a week for several weeks, and then compare patterns of borrowing based on which titles the catalog lists as checked out. If you find that a larger proportion of the Spanish-language books than of the English-language books is checked out, this indicates that there is, indeed, a high demand for children’s books in Spanish. This finding, coupled with the results from your previous site scrape, gives you all the support you need to craft a compelling article around the issue.
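Comparing borrowing patterns across weekly scrapes might look something like the following sketch. It assumes each scrape has already been reduced to a list of records with a language and a checked_out flag; those field names, the function names, and the sample records are hypothetical and exist only to show the arithmetic.

```python
from collections import defaultdict


def checkout_rates(snapshot):
    """Return the share of titles checked out, per language, for one scrape."""
    totals = defaultdict(int)
    checked_out = defaultdict(int)
    for book in snapshot:
        totals[book["language"]] += 1
        if book["checked_out"]:
            checked_out[book["language"]] += 1
    return {lang: checked_out[lang] / totals[lang] for lang in totals}


def average_rates(weekly_snapshots):
    """Average the per-language checkout rate across all weekly scrapes."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for snapshot in weekly_snapshots:
        for lang, rate in checkout_rates(snapshot).items():
            sums[lang] += rate
            counts[lang] += 1
    return {lang: sums[lang] / counts[lang] for lang in sums}


# Made-up records standing in for two weekly scrapes.
WEEKS = [
    [
        {"language": "Spanish", "checked_out": True},
        {"language": "Spanish", "checked_out": True},
        {"language": "English", "checked_out": True},
        {"language": "English", "checked_out": False},
    ],
    [
        {"language": "Spanish", "checked_out": True},
        {"language": "Spanish", "checked_out": False},
        {"language": "English", "checked_out": False},
        {"language": "English", "checked_out": False},
    ],
]

rates = average_rates(WEEKS)
for lang, rate in sorted(rates.items()):
    print(f"{lang}: {rate:.0%} of titles checked out on average")
```

Averaging the weekly rates, rather than pooling every record, keeps one unusually large or small scrape from dominating the comparison. Which approach suits your story is a judgment call.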

Setting up data alerts

To generate hot stories, data journalists must have access to the freshest data releases from the most credible organizations. To keep track of which datasets are being released and where, data journalists subscribe to alert systems that notify them every time potentially important data is released. These alert systems often issue notifications via RSS feeds or email. It’s also possible to set up a custom application like DataStringer to send push notifications when significant modifications or updates are made to source databases.
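To give a feel for how an RSS-based alert works under the hood, here is a minimal sketch that reports only the feed items it hasn’t seen before. It is not how DataStringer or any particular alert service is implemented. The feed content, the release IDs, and the new_items function are invented for illustration, and a real setup would download the feed on a schedule and store the seen IDs between runs.

```python
import xml.etree.ElementTree as ET


def new_items(rss_xml, seen_ids):
    """Return (id, title) pairs for feed items not yet in `seen_ids`.

    `seen_ids` is updated in place so that a later call skips these items.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Prefer the item's guid; fall back to its link if no guid is given.
        item_id = item.findtext("guid") or item.findtext("link")
        if item_id and item_id not in seen_ids:
            seen_ids.add(item_id)
            fresh.append((item_id, item.findtext("title", default="")))
    return fresh


# Made-up feed standing in for an agency's data-release feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Hypothetical agency data releases</title>
<item><title>Quarterly report Q1</title><guid>release-001</guid></item>
<item><title>Quarterly report Q2</title><guid>release-002</guid></item>
</channel></rss>"""

seen = {"release-001"}  # IDs already reported in an earlier run
alerts = new_items(SAMPLE_FEED, seen)
for item_id, title in alerts:
    print(f"New data release: {title} ({item_id})")
```

Calling new_items a second time with the same feed and the same seen set returns nothing, which is the behavior you want from an alert that should fire only once per release.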

After you subscribe to data alerts and form a solid idea about the data-release schedule, you can begin planning for data releases in advance. For example, if you’re doing data journalism in the business analytics niche and know that a particularly interesting quarterly report is to be released in one week, you can use the time you have before its release to formulate a plan on how you’ll analyze the data when it does become available.

Often, even after you’re alerted to an important new data release, you still need to scrape the source site to get that data. This is especially likely if you’re pulling data from a government department. Although most government organizations in Western countries are legally obligated to release data, they aren’t required to release it in a format that’s readily consumable. Don’t expect them to make it easy for you to get the data you need to tell a story about their operations.
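When a release arrives only as an HTML table on a web page, one way to make it consumable is to convert the table to CSV. The sketch below does that with Python’s standard library. The TableParser and table_to_csv names and the sample table are assumptions, and the code handles only a single simple table with no nested tables or merged cells.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Convert the rows of a simple HTML table to CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Made-up table standing in for a published government page.
SAMPLE_TABLE = """<table>
<tr><th>Department</th><th>Budget</th></tr>
<tr><td>Parks</td><td>1,200</td></tr>
</table>"""

csv_text = table_to_csv(SAMPLE_TABLE)
print(csv_text)
```

The csv module quotes any cell that contains a comma, so a figure such as 1,200 survives as a single field instead of being split in two.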