10 Phenomenal Resources for Open Data

By Lillian Pierson

Open data is part of a larger trend toward a less restrictive, more open understanding of the idea of intellectual property, a trend that’s been gaining tremendous popularity over the past decade. Open data is data that has been made publicly available and is permitted to be used, reused, built on, and shared with others. Open data is part of the open movement.

Beyond open data, this aptly named open movement also includes open-source software, open hardware, open-content creative work, open access to scientific journals, and open science — all committed to the notion that content (including raw data from experiments) should be shared freely.

The distinguishing feature of open licenses is that they have copyleft instead of copyright. With copyleft, the only restriction is that the source of the work must be identified, sometimes with the caveat that derivative works can’t be copyrighted with a more restrictive license than the original. If the second condition is in force, successfully commercializing the work itself becomes difficult, although people often find plenty of other indirect, creative avenues of commercialization.

Be aware that work labeled as open doesn’t always fit the accepted definition. You’re responsible for checking the licensing rights and restrictions of the open data you use.

People often confuse open licenses with Creative Commons licenses. Creative Commons is a not-for-profit organization that’s dedicated to encouraging and spreading creative works by offering a legal framework through which usage permissions can be granted and obtained, so that sharing parties are safe from legal risks when building on and using work and knowledge that’s been openly shared. Some Creative Commons licenses are open, and some explicitly forbid derivative works and/or commercialization.

As part of more recent open government initiatives, governments around the world began releasing open government data. Governments generally provide this data so that it can be used by volunteer analysts and civic hackers — programmers who work collaboratively to build open-source solutions that use open data to solve social problems — in an effort to benefit society at large. In 2013, the G8 nations (France, the United States, the United Kingdom, Russia, Germany, Japan, Italy, and Canada) signed a charter committing themselves to open data, prioritizing the areas of national statistics, election results, government budgets, and national maps.

The open government movement promotes government transparency and accountability, nurtures a well-informed electorate, and encourages public engagement. To put it in computing terms, open government facilitates a read/write relationship between a government and its citizenry.

Digging through data.gov

The Data.gov program was started by the Obama administration to provide open access to nonclassified U.S. government data. Its data comes from all departments in the executive branch (the White House and all Cabinet-level departments), along with datasets from other levels of government. By mid-2014, you could search more than 100,000 datasets by using the Data.gov search. The website is an unparalleled resource if you’re looking for U.S. government-derived data on the following indicators:

  • Economic
  • Environmental
  • STEM industry
  • Quality of life
  • Legal

Checking Out Canada Open Data

If you take a look at the Canada Open Data website, the nation’s strong commitment to open data is evident: The site offers over 200,000 datasets. Among the 25 most popular offerings are datasets that cover the following indicators:

  • Environmental
  • Citizenship
  • Quality of life

Diving into data.gov.uk

The United Kingdom got off to a late start in the open government movement. Data.gov.uk was started in 2010, and by mid-2014, only about 20,000 datasets were available. Like Data.gov, data.gov.uk is powered by the CKAN data catalog.

Although data.gov.uk is still playing catch-up, it has an impressive collection of Ordnance Survey maps old enough — 50 years or more — to be out of copyright. If you’re looking for world-renowned, free-to-use survey maps, data.gov.uk is an incredible place for you to explore. Beyond its stellar survey maps, data.gov.uk is a useful source for data on the following indicators:

  • Environmental (data.gov.uk’s most prolific theme)
  • Government spending
  • Societal
  • Health
  • Education
  • Business and economic
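Because both portals are described above as running on the CKAN data catalog, a dataset search can be sketched against CKAN's Action API, whose `package_search` action takes a query string and a row limit. The following is a minimal sketch, not either portal's documented client: the base URL `https://catalog.example.org` is a hypothetical placeholder, the helper names are invented for illustration, and the sample response is made-up data trimmed to the fields the code reads.

```python
import json
from urllib.parse import urlencode


def ckan_search_url(base_url, query, rows=5):
    """Build a CKAN Action API package_search URL for a given portal."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Return (total match count, list of titles) from a package_search reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    result = payload["result"]
    return result["count"], [dataset["title"] for dataset in result["results"]]


# Made-up response for illustration only; a real reply carries many more fields.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {"title": "Example air quality readings"},
            {"title": "Example spending records"},
        ],
    },
})

# Hypothetical portal address; substitute the catalog you actually use.
url = ckan_search_url("https://catalog.example.org", "air quality")
count, titles = dataset_titles(SAMPLE_RESPONSE)
print(url)
print(count, titles)
```

To run this against a live catalog, fetch the URL with any HTTP client and pass the response body to `dataset_titles`.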

Checking Out U.S. Census Bureau Data

The U.S. Census is held every ten years, and since 2010, the data has been made freely available. Statistics are available down to the level of the census block, which covers about 30 people on average. The demographics data provided by the U.S. Census Bureau can be extremely helpful if you’re doing marketing or advertising research and need to target your audience according to the following classifications:

  • Age
  • Average annual income
  • Household size
  • Gender or race
  • Level of education

Knowing NASA Data

Since its inception in 1958, NASA has made public all its nonclassified project data. It has been in the open-data game so long that NASA has tons of data! NASA datasets have been growing even faster with recent improvements in satellite and communication technology. In fact, NASA now generates 4 terabytes of new earth-science data per day — that’s equivalent to over a million MP3 files. Many of NASA’s projects have accumulated data into the petabyte range.
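The MP3 comparison can be checked with a quick calculation. This sketch assumes decimal units (1 terabyte = 10^12 bytes) and an average MP3 size of about 4 megabytes. Both figures are assumptions, not numbers from NASA, and the function name is invented for illustration.

```python
def mp3_equivalents(terabytes, mp3_megabytes=4.0):
    """Estimate how many MP3 files of a given size fit in a data volume."""
    bytes_total = terabytes * 10**12        # decimal terabytes to bytes
    bytes_per_mp3 = mp3_megabytes * 10**6   # decimal megabytes to bytes
    return round(bytes_total / bytes_per_mp3)


print(mp3_equivalents(4))  # 1000000
```

At 4 megabytes per file, the daily volume works out to a million files. Any smaller average file size pushes the count above a million.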

NASA’s open data portal is called data.nasa.gov. This portal is a source of all kinds of wonderful data, including data about

  • Astronomy and space (of course!)
  • Climate
  • Life sciences
  • Geology
  • Engineering

Wrangling World Bank Data

The World Bank is an international financial institution affiliated with the United Nations. It provides loans to developing countries to pay for capital investment that will lead (one hopes) to poverty reduction and some surplus so that the recipient nations can repay the loan amounts over time. Because World Bank officers need to make well-informed decisions about which countries would be more likely to repay their loans, they’ve gathered an enormous amount of data on member nations. They’ve made this data available to the public at the World Bank Open Data page.

If you’re looking for data to buttress your argument in a truly interesting data-journalism piece that’s supported by global statistics, the World Bank should be your go-to source. No matter the scope of your project, if you need data about what’s happening in developing nations, the World Bank is the place to go. You can use the website to download entire datasets or simply view the data visualizations online. You can also use the World Bank’s Open Data API to access what you need.
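As a rough illustration of the API route, the sketch below builds a request URL for the World Bank's v2 indicators endpoint and parses the JSON it returns, which arrives as a two-element array of paging metadata followed by a list of records. The helper names are invented for illustration, the indicator code `SP.POP.TOTL` (total population) and country code `BR` are example choices, and the sample response holds placeholder values rather than real statistics. The parser also assumes annual data, where each record's `date` is a plain year.

```python
import json

API_ROOT = "https://api.worldbank.org/v2"


def indicator_url(country_code, indicator, per_page=100):
    """Build a URL that requests one indicator for one country as JSON."""
    return (f"{API_ROOT}/country/{country_code}/indicator/{indicator}"
            f"?format=json&per_page={per_page}")


def parse_observations(response_text):
    """Map year -> value, skipping years for which no value was reported."""
    _metadata, records = json.loads(response_text)
    return {
        int(record["date"]): record["value"]
        for record in (records or [])
        if record["value"] is not None
    }


# Placeholder response for illustration; the values are not real statistics.
SAMPLE_RESPONSE = json.dumps([
    {"page": 1, "pages": 1, "per_page": 100, "total": 3},
    [
        {"date": "2012", "value": 110.0},
        {"date": "2011", "value": None},
        {"date": "2010", "value": 100.0},
    ],
])

print(indicator_url("BR", "SP.POP.TOTL"))
print(parse_observations(SAMPLE_RESPONSE))
```

Fetching the printed URL with any HTTP client and passing the body to `parse_observations` would give a year-indexed series ready for charting or further analysis.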

World Bank Open Data supplies data on the following indicators (and many, many more):

  • Agriculture and rural development
  • Economy and growth
  • Environment
  • Science and technology
  • Financial sector
  • Poverty and income

World Bank Data also includes microdata — sample surveys of households and businesses in developing countries. You can use microdata to explore variations in your datasets.

Getting to Know Knoema Data

Knoema (pronounced “no-mah”) purports to be the largest repository of public data on the web. The Knoema platform houses a staggering 500+ databases, in addition to its 150 million time series — 150 million collections of data on attribute values over time, in other words. Knoema includes, but isn’t limited to, all these data sources:

  • Government data from industrial nations
  • National public data from developing nations
  • United Nations data
  • International organization data
  • Corporate data from global corporations

Knoema is an outstanding resource if you’re looking for international data on agriculture, crime statistics, demographics, economy, education, energy, environment, food security, foreign trade, health, land use, national defense, poverty, research and development, telecommunications, tourism, transportation, or water.

In addition to being an incredible data source, Knoema is a multipurpose platform. You can use the Knoema platform to make dashboards that automatically track all your favorite datasets. You can use the platform’s data visualization tools to quickly and easily see your data in a tabular or map format. You can use the Knoema Data Atlas to drill down among categories or geographic regions and quickly access the specific datasets you need. As an individual, you can upload your own data and use Knoema as a free hosting service. Above and beyond all of this, Knoema even offers the Knoema Market, a place where you can go to get paid just for being part of data-driven projects.

Although a lot of Knoema’s data is pretty general, you can still find some surprisingly specific data as well. If you’re having a hard time locating data on a specific topic, you might have luck finding it on the Knoema platform. The figure illustrates just how specific Knoema data can be.

The index of insect records in Knoema’s search.

Queuing Up with Quandl Data

Quandl is a Toronto-based website that aims to be a search engine for numeric data. Unlike most search engines, however, its database isn’t automatically generated by spiders that crawl the web. Rather, it focuses on linked data that’s updated via crowdsourcing — updated manually via human curators, in other words.

Because most financial data is in numeric format, Quandl is an excellent tool for staying up-to-date on the latest business informatics. As the figure shows, a search for Apple returns over 4,700 datasets from 11 different sources, with time series at the daily, weekly, monthly, quarterly, or annual level. Many of these results are related to the United Nations’ agricultural data. If you’re looking for data on Apple, the company, you can narrow the scope of your search by replacing the Apple search term with the company’s stock ticker symbol, AAPL.

The index of Apple records in Quandl search.

The Quandl database includes links to over 10 million datasets (although it uses a generous metric in declaring what distinguishes one dataset from another). Quandl links to 2.1 million UN datasets and many other sources, including datasets in the Open Financial Data Project, the central banks, real estate organizations, and well-known think tanks.

Exploring Exversion Data

Modeled after GitHub (the cloud-hosted platform on which programmers collaboratively share and review code), Exversion aims to provide the same collaborative functionality around data that GitHub provides around code. The Exversion platform offers version control functionality and hosting services to which you can upload and share your data. To illustrate how Exversion works, imagine a platform that allows you to first fork (or copy) a dataset and then make the changes you want. Exversion keeps track of what has changed from the original set and of every change that you make to it. Exversion also allows users to rate, review, and comment on datasets.
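The fork-and-track idea can be sketched in a few lines. This is a conceptual illustration only, not Exversion's API: the `fork` and `diff` helpers are hypothetical, the dataset is modeled as a dictionary of rows keyed by an ID, and the values are made up.

```python
import copy


def fork(dataset):
    """Return an independent copy of a dataset so edits leave the original intact."""
    return copy.deepcopy(dataset)


def diff(original, modified):
    """Report which rows were added, removed, or changed relative to the original."""
    added = {key: modified[key] for key in modified.keys() - original.keys()}
    removed = {key: original[key] for key in original.keys() - modified.keys()}
    changed = {
        key: (original[key], modified[key])
        for key in original.keys() & modified.keys()
        if original[key] != modified[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Made-up rows for illustration.
original = {
    "row1": {"city": "Toronto", "stations": 10},
    "row2": {"city": "Ottawa", "stations": 4},
}

forked = fork(original)
forked["row1"]["stations"] = 12                        # edit an existing row
forked["row3"] = {"city": "Halifax", "stations": 2}    # add a row
del forked["row2"]                                     # remove a row

changes = diff(original, forked)
print(changes)
```

A real version control system would store a history of such differences rather than computing a single comparison, but the before-and-after bookkeeping is the same in spirit.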

Datasets hosted on the Exversion platform are either provided by a user or created by a spider that crawls and indexes open data to make it searchable from a single application programming interface (API). Like GitHub, with a free user account, all the data you upload to Exversion is public. If you’re willing to pay for an account, you can create your own, private data repositories. Also, with the paid account, you get the option to share your data with selected users for collaborative projects.

Mapping OpenStreetMap Spatial Data

OpenStreetMap (OSM) is an open, crowd-sourced alternative to commercial mapping products such as Google Maps and ESRI ArcGIS Online. In OSM, users create, upload, or digitize geographic data into the central repository.

The OSM platform is quite robust. Governments and private companies have started contributing to, and pulling from, the shared datasets. Even corporations as big as Apple are relying on OSM data. OSM now has over 1 million registered users. To illustrate how a person can create data in OSM, imagine that someone links the GPS on her mobile phone to the OSM application. Because of this authorization, OSM can automatically trace the routes of roads while the person travels. Later, this person (or another OSM user) can go into the OSM online platform to verify and label the routes.

The data in OSM isn’t stored as maps, but as geometric and text representations — points, lines, polygons, and map annotation — so all of OSM’s data can be quickly downloaded from the website and easily assembled into a cartographic representation via a desktop application.
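To make the geometric-and-text representation concrete, the sketch below reads a tiny hand-written fragment in OSM's XML export format, where `node` elements carry coordinates, `way` elements reference nodes in order, and `tag` elements hold the text annotations. The coordinates and IDs are made up, the `classify` helper is hypothetical, and treating every closed way as a polygon is a simplification (in real OSM data, some closed ways are loops of road rather than areas).

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the OSM XML layout; IDs and coordinates are made up.
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="51.5000" lon="-0.1000"><tag k="amenity" v="cafe"/></node>
  <node id="2" lat="51.5010" lon="-0.1010"/>
  <node id="3" lat="51.5020" lon="-0.1010"/>
  <node id="4" lat="51.5020" lon="-0.1020"/>
  <way id="10"><nd ref="2"/><nd ref="3"/><tag k="highway" v="residential"/></way>
  <way id="11"><nd ref="2"/><nd ref="3"/><nd ref="4"/><nd ref="2"/><tag k="building" v="yes"/></way>
</osm>"""


def read_tags(element):
    """Collect an element's key/value annotations into a dictionary."""
    return {tag.get("k"): tag.get("v") for tag in element.findall("tag")}


def classify(osm_xml):
    """Return (kind, coordinates, tags) for tagged nodes and for every way."""
    root = ET.fromstring(osm_xml)
    coords = {
        node.get("id"): (float(node.get("lat")), float(node.get("lon")))
        for node in root.iter("node")
    }
    features = []
    for node in root.iter("node"):
        tags = read_tags(node)
        if tags:  # untagged nodes only serve as vertices of ways
            features.append(("point", [coords[node.get("id")]], tags))
    for way in root.iter("way"):
        refs = [nd.get("ref") for nd in way.findall("nd")]
        closed = len(refs) >= 4 and refs[0] == refs[-1]
        kind = "polygon" if closed else "line"
        features.append((kind, [coords[ref] for ref in refs], read_tags(way)))
    return features


features = classify(SAMPLE_OSM)
for kind, points, tags in features:
    print(kind, len(points), tags)
```

A desktop mapping application performs the same kind of assembly at scale, turning the stored nodes, ways, and tags into styled points, lines, and polygons.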