Scraping, Collecting, and Handling Data Science Tools

By Lillian Pierson

Whether you need data to support a business analysis or an upcoming journalism piece, web scraping can help you track down interesting and unique data sources. In web scraping, you set up automated programs and then let them scour the web for the data you need. Here are some free tools that you can use to scrape data or images, including import.io, ImageQuilts, and DataWrangler.

Scraping data with import.io

Have you ever tried to copy and paste a table from the web into a Microsoft Office document and then not been able to get the columns to line up correctly? Frustrating, right? This is exactly the pain point that import.io was designed to address.

import.io — pronounced “import-eye-oh” — is a free desktop application that you can use to painlessly copy, paste, clean, and format any part of a web page with only a few clicks of the mouse. You can even use import.io to automatically crawl and extract data from multi-page lists.

Using import.io, you can scrape data from a simple or complicated series of web pages:

  • Simple: The pages are reachable through plain hyperlinks, such as links to Page 1, Page 2, and Page 3.

  • Complicated: Fill in a form or choose from a drop-down list, then submit your scraping request to the tool.
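To make the "simple" case concrete, here is a minimal hand-written sketch in Python of what following page links and collecting table rows involves. It is not import.io's implementation or API. The class name TableAndNextLinkParser, the function scrape_all_pages, the page names, and the sample values are all hypothetical, and the pages are held in memory so the example runs without a network connection. It also assumes that each page marks the following page with a link whose text is "Next".

```python
from html.parser import HTMLParser


class TableAndNextLinkParser(HTMLParser):
    """Collects table rows and the href of a link whose text is 'Next'."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self.next_href = None   # target of the "Next" link, if one was found
        self._row = None        # row currently being built
        self._cell = None       # text pieces of the cell currently being read
        self._link_href = None  # href of the <a> tag currently open
        self._link_text = []    # text pieces inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a":
            self._link_href = dict(attrs).get("href")
            self._link_text = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
        if self._link_href is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "a":
            text = "".join(self._link_text).strip().lower()
            if self._link_href is not None and text == "next":
                self.next_href = self._link_href
            self._link_href = None


def scrape_all_pages(fetch, start_url, max_pages=50):
    """Follow 'Next' links from start_url and gather every table row."""
    rows, url, seen = [], start_url, set()
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = TableAndNextLinkParser()
        parser.feed(fetch(url))  # fetch returns the page's HTML as a string
        rows.extend(parser.rows)
        url = parser.next_href
    return rows


# Made-up pages standing in for a multi-page list on a website.
PAGES = {
    "page1.html": (
        "<table><tr><th>Item</th><th>Count</th></tr>"
        "<tr><td>A</td><td>10</td></tr></table>"
        '<a href="page2.html">Next</a>'
    ),
    "page2.html": "<table><tr><td>B</td><td>20</td></tr></table>",
}

all_rows = scrape_all_pages(PAGES.__getitem__, "page1.html")
print(all_rows)
```

To run this against a live site, you would swap the fetch argument for a function that downloads a URL. The "complicated" case (forms and drop-down lists) needs more than this sketch covers.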

import.io’s most impressive feature is its capability to observe your mouse clicks to learn what you want, and then offer you ways that it can automatically complete your tasks for you. Although import.io learns and suggests tasks, it doesn’t take action on those tasks until after you’ve marked the suggestion as correct. Because a person confirms each suggestion, there is less risk that the tool will act on a wrong guess.

Collecting images with ImageQuilts

ImageQuilts is a Chrome extension developed in part by the legendary Edward Tufte, one of the first great pioneers in data visualization. He popularized the data-ink ratio as a way to judge the effectiveness of charts.

The task ImageQuilts performs is deceptively simple to describe but very complex to implement. ImageQuilts takes dozens of images and pieces them together into one “quilt” that’s composed of multiple rows of equal height. This task can be complex because the source images are almost never the same height. ImageQuilts scrapes and resizes the images before stitching them together into one output image.
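The resizing step can be illustrated with a little arithmetic. The following Python sketch is not how ImageQuilts itself is written. It assumes a simple left-to-right packing rule, and the function name layout_quilt, its parameters, and the sample image sizes are all hypothetical. Each image is scaled so that its height matches the row height while keeping its aspect ratio, and a new row starts whenever the next image would overflow the row width.

```python
def layout_quilt(sizes, row_height, max_row_width):
    """Scale each (width, height) to row_height and pack rows left to right."""
    rows, current, current_width = [], [], 0
    for width, height in sizes:
        # Keep the aspect ratio: the width scales by the same factor as the height.
        scaled_width = round(width * row_height / height)
        # Start a new row if this image would not fit in the current one.
        if current and current_width + scaled_width > max_row_width:
            rows.append(current)
            current, current_width = [], 0
        current.append((scaled_width, row_height))
        current_width += scaled_width
    if current:
        rows.append(current)
    return rows


# Made-up source sizes in pixels, given as (width, height).
sizes = [(400, 200), (300, 300), (100, 50), (600, 400)]
quilt = layout_quilt(sizes, row_height=100, max_row_width=450)
for row in quilt:
    print(row)
```

Every image in the result has the same height, so the rows line up even though the source images started at different sizes.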

The image quilt shown was derived from a “Labeled for Reuse” Google Images search of the term data science.

[Figure: image quilt built from a Google Images search for "data science"]

ImageQuilts even allows you to choose the order of images or to randomize them. You can use the tool to drag and drop any image to any place, remove an image, zoom all images at the same time, or zoom each image individually.

You can even use the tool to convert between image colors: from color to grayscale or inverted color (which is handy for making contact sheets of negatives, if you’re one of those rare people who still processes analog photography).

Wrangling data with DataWrangler

DataWrangler is an online tool that’s supported by the University of Washington Interactive Data Lab (at the time DataWrangler was developed, this group was called the Stanford Visualization Group). This same group developed Lyra, an interactive data visualization environment that you can use to create complex visualizations without programming experience.

If your goal is to sculpt your dataset, cleaning it up by moving pieces around the way a sculptor would (split this part in two, slice off that bit and move it over there, push this down so that everything below it gets shifted to the right, and so on), DataWrangler is the tool for you.

You can do manipulations with DataWrangler similar to what you can do in Excel using Visual Basic. For example, you can use DataWrangler or Excel with Visual Basic to copy, paste, and format information from lists on the Internet.

DataWrangler even suggests actions based on your dataset and can repeat complex actions across entire datasets — actions such as eliminating skipped rows, splitting data from one column into two, or turning a header into column data. DataWrangler can also show you where your dataset is missing data.

Missing data can indicate a formatting error that needs to be cleaned up.
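For readers who want to see what these operations amount to, here is a minimal sketch in Python of three of them: eliminating blank rows, splitting one column into two, and locating missing values. It does not use DataWrangler or reproduce its interface. The function names drop_blank_rows, split_column, and find_missing and the sample rows are hypothetical, and the table is assumed to be a list of rows where every cell is a string.

```python
def drop_blank_rows(rows):
    """Remove rows in which every cell is empty or whitespace."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def split_column(rows, index, separator):
    """Split the column at `index` into two columns at the first separator."""
    result = []
    for row in rows:
        left, _, right = row[index].partition(separator)
        result.append(row[:index] + [left.strip(), right.strip()] + row[index + 1:])
    return result


def find_missing(rows):
    """Return (row, column) positions of empty cells."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, cell in enumerate(row)
        if not cell.strip()
    ]


# Made-up sample data: a combined name column, a blank row, and a missing value.
raw = [
    ["Smith, Ann", "42"],
    ["", ""],
    ["Jones, Bob", ""],
]

cleaned = split_column(drop_blank_rows(raw), index=0, separator=",")
missing = find_missing(cleaned)
print(cleaned)
print(missing)
```

The positions in missing point at cells worth checking by hand, because an empty cell may be a real gap in the data or a sign that an earlier formatting step went wrong.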