10 Public Datasets and Where to Find Them

By Jason Williamson

Public datasets are very large datasets that are freely available for you to either download or connect to via the cloud. Several well-curated websites keep up-to-date information on public datasets and how to use them.

There are all kinds of datasets to sort through, from genome projects to weblogs to emails from notorious corporations. Here are ten public datasets and where you can go to get started:

  • 1000 Genomes Project (200TB): The 1000 Genomes Project is sponsored by Amazon and the National Center for Biotechnology Information. This dataset contains genome data from over 2,600 people from 26 different populations around the world.

  • Complete Genomics Public Data (50TB): This is sequenced genome data from Complete Genomics, a company that provides genome sequencing services.

  • Earth Observing-1 Mission (80.5TB): NASA has opened up its bird’s-eye view of the Earth. This data, gathered by the Advanced Land Imager (ALI), is used to better understand how events such as volcanic eruptions, wildfires, and floods evolve over time and affect our planet.

  • Common Crawl Corpus (541TB): Have you ever wanted to get your hands on crawl data for billions of web pages with trillions of links? Here’s your chance. The Common Crawl Corpus provides a rich set of tools, examples, and projects you can jump into today.

  • Marvel Universe Social Graph (1GB): This is a fun look at the social connectedness of the Marvel world of characters. The founders claim that, on analysis, this social world is remarkably close to our own.

  • Enron Emails (210GB): These emails (all 1.2 million, with almost 500,000 attachments) were released as part of the Federal Energy Regulatory Commission’s investigation into the infamous firm.

  • Million Song Sample Dataset (500GB): Are you looking for data on a million popular songs? Look no further. The Million Song Dataset contains audio features and metadata for a million popular songs.

  • Project Gutenberg (742GB): Project Gutenberg makes over 46,000 books available for analysis. These books are now in the public domain because their copyrights have expired.

  • U.S. Census Datasets (1.8TB): Every ten years, the United States must take a census. The main purpose of this is to ensure proper allocation of congressional seats.

  • NOAA National Climatic Data Center (3.3TB): Don’t believe in global warming or climate change? Validate it (or invalidate it) yourself. This dataset contains data on over 150 years of weather from many sources ranging from weather stations to airport readings to satellite data.

    You can look at things like dew points, wind speed, and temperature. It may be interesting to look for correlations between this dataset and the Million Song Sample. Is there a link between weather and hit records? Sounds like a great big data question for someone to answer. . . .
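
If you want to take a first swing at a question like that, one simple starting point is the Pearson correlation coefficient between two series. The sketch below assumes you have already boiled each dataset down to one number per year; the function name, the variable names, and the sample values are all hypothetical placeholders invented to show the call, not real figures from either dataset.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values, made up purely for illustration (not real measurements).
avg_temperature_by_year = [10.0, 11.0, 12.0, 13.0]
hit_songs_by_year = [20.0, 22.0, 24.0, 26.0]

r = pearson(avg_temperature_by_year, hit_songs_by_year)
print(f"Pearson r = {r:.3f}")
```

A value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests little or none. Keep in mind that a correlation by itself does not show that weather causes hit records.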