Find the Outliers in Your Infographics Data
When you analyze data for your infographics, be aware that some data points, known as outliers, lie so far outside the norm that they call attention to themselves. In the most severe cases, they can skew the data and create a misleading picture of the subject. You need to recognize when you have an outlier and then decide what to do about it.
The following table contains a simple example to demonstrate this idea. The two datasets represent a student’s grades over eight weeks on two weekly exams; each number is the percentage of correct answers on that exam. The dataset on the left (the first exam) doesn’t contain an outlier, but the dataset on the right (the second exam) does. The one outlier is shown in bold.
| Week | Grades (no outlier) | Grades (one outlier) |
The average in the middle column paints quite an accurate picture of that student’s achievement in regular testing. The single (bold) outlier (50%) in the dataset on the right throws a wrench into the works, though, dropping the student’s average by four percentage points and skewing the data.
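The arithmetic behind that shift can be sketched in a few lines of Python. The grades below are hypothetical stand-ins, not the values from the table; they were chosen only so that one week scores 50% and the averages land near the figures quoted in this article. The helper name `average_without` is likewise an illustration.

```python
from statistics import mean

# Hypothetical weekly grades (percent correct) for eight weeks.
# These are NOT the article's table values; week 5 holds the 50% outlier.
grades_with_outlier = [85, 88, 90, 86, 50, 89, 87, 87]
OUTLIER = 50


def average_without(values, excluded):
    """Return the mean of values after dropping every entry equal to excluded."""
    kept = [v for v in values if v != excluded]
    return mean(kept)


avg_all = mean(grades_with_outlier)
avg_trimmed = average_without(grades_with_outlier, OUTLIER)

print(f"Average with outlier:    {avg_all:.1f}%")
print(f"Average without outlier: {avg_trimmed:.1f}%")
```

With these stand-in numbers, the average is roughly 83% when the outlier is included and roughly 87% when it is left out, so a single bad week moves the headline figure by about four points.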
What does a data journalist do in such a case? Here are a few options:
Throw out the outlier. If you’re using only the average in your graphic and are concerned that it’s misleading, eliminate the outlier as an aberration and then calculate the average without that week, as shown in the figure.
In this example, throwing out the outlier raises the student’s average test score to 87%, which (as the first exam’s grades show) is a better representation of achievement over the term.
If you go with this option, be sure to add a footnote explaining everything: in this case, the deletion of a data point. Always be as transparent as possible.
Show the data as-is. Whether you’re using just the average in your graphic or plotting all the data in a chart, you can always present the data exactly as it came to you, as shown in the following figure.
In this case, you should add a footnote calling out the outlier so that your reader is fully aware of it.
Construct a “line of best fit.” This option applies only if you’re going to create a chart showing all the data. A line of best fit, which is found by a technique called linear regression, is a visual average of your data: literally the line that represents your scattered data points best.
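As a rough illustration of how a line of best fit can be computed, here is a minimal sketch that uses ordinary least squares, the most common form of linear regression. The function name `line_of_best_fit` and the grades are assumptions made for this example; the grades are hypothetical and are not taken from the table above.

```python
def line_of_best_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns the pair (slope, intercept).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Covariance-style and variance-style sums (the shared 1/n factor cancels).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: week numbers and grades, with a 50% outlier in week 5.
weeks = [1, 2, 3, 4, 5, 6, 7, 8]
grades = [85, 88, 90, 86, 50, 89, 87, 87]

slope, intercept = line_of_best_fit(weeks, grades)

# Points on the fitted line, one per week, ready to draw over the scatter.
fitted = [slope * w + intercept for w in weeks]
print(f"grade = {slope:.2f} * week + {intercept:.2f}")
```

In a chart, you would plot the original points as a scatter and draw `fitted` as a straight line over them. The outlier stays visible, and the line shows the overall trend.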