Build on Basic Scatterplots

By Meta S. Brown

Data miners often take advantage of special features to pack more information into simple charts. Labels, overlays, and interactive selection are hallmarks of data-mining applications, special features that allow you to be more productive.

Mileage decreases as horsepower increases, as seen in the following figure.


Mileage also increases with time, as you can see in a scatterplot of mileage versus model year. It would be helpful to get these two ideas into one graph.


Common data-mining approaches for integrating more than two variables in a graph include

  • Labels: Labels are values of a string or categorical variable that have been superimposed on the scatterplot. The following figure shows a scatterplot labeled with the model year of the car.


    Datasets with many points or long labels can make these charts unreadable, though! The solution is to use only a sample of the data. Setup for this kind of sampling is shown in the following figure.


  • Overlays: With overlays, values of a categorical variable define the points’ shape or color. The following figure shows the setup for a scatterplot to overlay model year on the mileage-versus-horsepower scatterplot.


    The exported overlay scatterplot appears in the following image. Color overlays may be easier to read than point-shape overlays; the setup for either is usually much the same.
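
If your tool is scriptable, both ideas in the list above come down to a few lines of logic. Here's a minimal sketch in plain Python, not the method of any particular data-mining product. The car values, the function names `sample_for_labels` and `overlay_colors`, and the color palette are all made up for illustration.

```python
import random

# Hypothetical rows: (horsepower, mpg, model_year). Values are invented.
cars = [
    (130, 18.0, 70), (165, 15.0, 70), (150, 14.0, 70), (95, 24.0, 70),
    (112, 20.0, 76), (75, 31.0, 76), (70, 29.0, 76), (90, 28.0, 76),
    (65, 34.1, 82), (84, 32.0, 82), (52, 44.0, 82), (88, 27.0, 82),
]

def sample_for_labels(rows, k, seed=0):
    """Return a reproducible random sample of at most k rows to label."""
    rng = random.Random(seed)  # fixed seed so the same points are labeled each run
    return rng.sample(rows, min(k, len(rows)))

def overlay_colors(rows, palette):
    """Assign one color to each distinct model year (the overlay variable)."""
    years = sorted({year for _, _, year in rows})
    if len(years) > len(palette):
        raise ValueError("palette needs one color per category")
    return dict(zip(years, palette))

labeled = sample_for_labels(cars, k=5)
colors = overlay_colors(cars, ["red", "green", "blue"])

for hp, mpg, year in labeled:
    print(f"label '{year}' at ({hp}, {mpg}), drawn in {colors[year]}")
```

Labeling only the sampled rows keeps the chart readable, while the color mapping still applies to every point.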


Another thing to keep in mind with scatterplots: You may have multiple points falling on the very same spot! If so, you may not be able to tell a point that represents one case from a point that represents 100 cases. The remedy is to check for an option that makes multiple instances visible. Look for point size or jitter options. (Jitter moves points slightly off their true locations to make all of them visible.)
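
To see what jitter does behind the scenes, here's a small sketch in plain Python. It assumes uniform random offsets, which is one common choice; your tool may do it differently. The points and the function names `overplotted` and `jitter` are invented for this example.

```python
import random
from collections import Counter

# Made-up (x, y) points; several cases share the same location.
points = [(4, 20), (4, 20), (4, 20), (6, 15), (8, 12), (8, 12)]

def overplotted(pts):
    """Return each location that holds more than one case, with its count."""
    return {loc: n for loc, n in Counter(pts).items() if n > 1}

def jitter(pts, amount=0.2, seed=0):
    """Shift each point by a small uniform random offset in x and y."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in pts
    ]

print(overplotted(points))   # locations where cases hide behind one another
jittered = jitter(points)    # plot these instead of the raw points
```

Keep the offsets small relative to the spacing of your data, and remember that jittered positions are for display only; do your analysis on the original values.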

Interactive scatterplots are great time-savers for data miners.

Say that you see an interesting group of cases in a graph, and you want to further investigate just those cases. If you’re looking at just one or two points, you might get the information you want by hovering, but that’s not satisfactory when you are interested in more than a couple of points.

Data selection tools in interactive scatterplots give you more power to select data. The following figure shows the same graph setup, but with a group of points selected by clicking and dragging the mouse around them. This is not just a visual feature.


You can export the selected points as a new dataset. This is very handy and fast!
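
Conceptually, a rectangular selection is just a filter on two variables, and the export writes the matching rows to a new dataset. The following is a rough sketch in plain Python under that assumption; the rows, the rectangle bounds, and the helper names `select_rectangle` and `to_csv` are hypothetical.

```python
import csv
import io

# Invented rows standing in for the car data.
rows = [
    {"horsepower": 130, "mpg": 18.0, "model_year": 70},
    {"horsepower": 165, "mpg": 15.0, "model_year": 70},
    {"horsepower": 95, "mpg": 24.0, "model_year": 76},
    {"horsepower": 70, "mpg": 29.0, "model_year": 76},
    {"horsepower": 84, "mpg": 32.0, "model_year": 82},
    {"horsepower": 52, "mpg": 44.0, "model_year": 82},
]

def select_rectangle(data, x, y, x_min, x_max, y_min, y_max):
    """Keep the rows whose (x, y) values fall inside the dragged rectangle."""
    return [
        r for r in data
        if x_min <= r[x] <= x_max and y_min <= r[y] <= y_max
    ]

def to_csv(data, fieldnames):
    """Write the selected rows as CSV text, ready to save as a new dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data)
    return buffer.getvalue()

# Low-horsepower, high-mileage corner of the plot.
selected = select_rectangle(rows, "horsepower", "mpg", 50, 90, 28, 45)
export = to_csv(selected, ["horsepower", "mpg", "model_year"])
print(export)
```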


If the points you need don’t fit nicely into a rectangular selection, you have other options. Refer to the Zoom/Select area. You can see a button with a rectangle for rectangular selection and another with a roundish shape for free-form selection.

Here’s a free-form selection example using data on the nicotine content of cigarettes sold in different parts of the world. This scatterplot shows nicotine per cigarette for samples from the six United Nations regions. (This is a nontraditional use of a scatterplot, because region is not a continuous variable; it’s categorical. Data miners often use traditional tools in nontraditional ways.)


The points within a region don’t fall in a perfect vertical line. Small shifts (jitter) to the left and right are made for readability and appearance only. A few cigarettes have exceptionally high levels of nicotine, and you want to select those cases.

A drop-down menu offers selection options. Polygon selection lets you mark a free-form area on the scatterplot.


To mark, click on the graph to make a starting point, and then click again and again around the group of points you want until you have made the shape you need.


A right-click indicates that you have completed the selection; the selected points are then highlighted on the graph.
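
Under the hood, a polygon selection has to decide which points lie inside the shape you clicked. One standard way to do that is the ray-casting test, sketched below in plain Python. This is an illustration, not necessarily how your tool implements it. The sample values, the polygon vertices, and the function name `point_in_polygon` are invented for this example.

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test: True if (px, py) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the shape
        # Does this edge straddle the horizontal line through the point?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside

# Made-up samples: (jittered region position, nicotine per cigarette).
samples = [
    (1.05, 1.1), (0.95, 2.4), (2.1, 0.9), (2.9, 2.8),
    (3.05, 1.2), (4.0, 1.4), (5.1, 1.0), (5.9, 1.3),
]

# Vertices in the order you would click them around the high-nicotine cases.
polygon = [(0.5, 2.0), (3.5, 2.0), (4.5, 2.6), (3.0, 3.5), (0.5, 3.0)]

selected = [(x, y) for x, y in samples if point_in_polygon(x, y, polygon)]
print(selected)
```

The test counts how many polygon edges a horizontal ray from the point crosses; an odd count means the point is inside the shape.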