You will run across situations where a distribution seems to go on forever. Income data is like this. The vast majority of households have incomes in a fairly narrow range, and households making less than a million dollars account for almost everyone. But no matter how high you go, \$10 million, \$100 million, even \$500 million, you will still not have accounted for every single household.

This situation is commonly called a long-tailed distribution. These distributions make averages extremely misleading. The reason is that a data point way out in the tail contributes a lot more to the average than a data point at the bottom.

A simple calculation will illustrate this point. Suppose you have 100 people making \$50K and 1 person making \$10 million. This gives a total income of \$15 million across 101 people. That comes out to an average income of just over \$148,500. This is about three times what 100 of those 101 people really make. And this misrepresentation is being caused by one data point.
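The arithmetic above is easy to check for yourself. Here is a minimal Python sketch using the same figures. The median is included only as a comparison point; it is an addition here, not something the calculation above relies on.

```python
from statistics import mean, median

# 100 people making $50K plus 1 person making $10 million (figures from the text)
incomes = [50_000] * 100 + [10_000_000]

total = sum(incomes)        # total income across all 101 people
average = mean(incomes)     # pulled far upward by the single large value
typical = median(incomes)   # middle value, added here for comparison

print(f"total:  ${total:,}")         # total:  $15,000,000
print(f"mean:   ${average:,.2f}")    # mean:   $148,514.85
print(f"median: ${typical:,}")       # median: $50,000
```

The median stays at \$50,000 because it depends only on the middle of the sorted list, so the one extreme value cannot drag it around the way it drags the average.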

A long-tailed distribution is one instance where ignoring data is a good idea. When performing analysis on these types of distributions, it’s all right to throw out the extreme data points, called outliers. If you don’t want to throw them out completely, then at least cap them at some reasonable level so they don’t muddy up the works.
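Both options, throwing outliers out and capping them, can be sketched in a few lines of Python. The function names and the \$250,000 ceiling below are hypothetical, chosen only for illustration; what counts as a "reasonable level" depends on your data.

```python
def drop_above(values, cap):
    """Throw out every value above the cap."""
    return [v for v in values if v <= cap]


def cap_values(values, cap):
    """Keep every value, but replace anything above the cap with the cap itself."""
    return [min(v, cap) for v in values]


# Same illustrative incomes as the 100-plus-1 example
incomes = [50_000] * 100 + [10_000_000]
CAP = 250_000  # hypothetical ceiling, an assumption for this sketch

trimmed = drop_above(incomes, CAP)
capped = cap_values(incomes, CAP)

mean_trimmed = sum(trimmed) / len(trimmed)
mean_capped = sum(capped) / len(capped)

print(f"outlier dropped: {len(trimmed)} rows, mean ${mean_trimmed:,.2f}")
print(f"outlier capped:  {len(capped)} rows, mean ${mean_capped:,.2f}")
```

Dropping the outlier removes that household from the analysis entirely, while capping keeps it in the count and just limits its influence. Either way, the average lands close to what most people in the data actually make.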

You’ll find that very wide distributions arise frequently in behavioral data. In the annual pass data for one entertainment company, for example, some people used the pass only once. The vast majority used it fewer than ten times, but there were passes that were used well over 200 times.

In situations like that, it’s impossible to graph the whole distribution in a meaningful way. If you group the data into wide ranges, you don’t see the meaningful variation at the bottom.

The better alternative is to cap the distribution at some fairly early value and create a bar for “everything else”.
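Here is one way that capped graph might be built, as a rough Python sketch. The usage counts are made up for illustration and are not the entertainment company's data, and the function name and the cap of 10 are assumptions as well.

```python
from collections import Counter


def capped_histogram(uses, cap):
    """Count each value from 1 to cap on its own; lump everything above cap into one bar."""
    overflow_label = f"{cap + 1}+"
    counts = Counter(str(u) if u <= cap else overflow_label for u in uses)
    labels = [str(n) for n in range(1, cap + 1)] + [overflow_label]
    return [(label, counts[label]) for label in labels]


# Made-up pass usage counts, one number per pass (illustration only)
sample_uses = [1, 1, 1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 12, 40, 215]

# Print a simple text bar chart, one row per bar
for label, count in capped_histogram(sample_uses, cap=10):
    print(f"{label:>3} | {'#' * count}")
```

The last row, labeled `11+`, is the “everything else” bar. It tells you how many passes fall past the cap without stretching the chart out to 215.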

With the data graphed this way, you can see that there is actually a bimodal distribution at the lower end. Lots of customers use their pass only once, and there’s another spike centered around five uses.

The bubble on the right isn’t really a bubble. If you continued to graph out the entire distribution, it would go on for several pages, and no page would have more than a handful of customers represented on it. But this not-really-a-bubble does give you a sense of how many customers are using their passes a lot.

This distribution suggests that, if you were this entertainment company, you’d have two different marketing opportunities. First, you’d want to get the single-use customers to come back. You’d need to figure out why these folks aren’t returning and try to overcome those barriers. Second, you’d want to maximize your revenues from the second group.

You might do this by communicating special events or keeping them informed of what’s new. The high-use customers probably don’t need a lot of additional database marketing attention.

In this example, you’ve identified three distinct groups of customers. And you’ve done that by looking at only one variable. Now you can dig deeper into the data about each group separately and develop marketing campaigns to address each one.
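Splitting customers into those three groups takes only one rule on one variable. The sketch below shows the idea in Python. The cutoff of 10 uses, the group names, and the pass IDs are all hypothetical; you would pick the cutoff by looking at where your own distribution thins out.

```python
from collections import defaultdict


def segment(uses, heavy_threshold=10):
    """Assign a pass to one of three groups based on how often it was used."""
    if uses <= 1:
        return "single-use"
    if uses <= heavy_threshold:
        return "moderate-use"
    return "high-use"


# Made-up pass IDs and usage counts (illustration only)
pass_uses = {"P001": 1, "P002": 5, "P003": 1, "P004": 7, "P005": 215, "P006": 4}

groups = defaultdict(list)
for pass_id, uses in pass_uses.items():
    groups[segment(uses)].append(pass_id)

for name in ("single-use", "moderate-use", "high-use"):
    print(f"{name}: {groups[name]}")
```

Once each customer carries a group label, you can pull the rest of the data for that group and analyze it separately.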

Understanding the way data varies among your customers or over time helps you to identify marketing opportunities. It allows you to group your customers together in meaningful ways.