Basics of Data Clusters in Predictive Analysis - dummies

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

In predictive analysis, a dataset (or data collection) is a set of items. For instance, a set of documents is a dataset where the data items are documents. A set of social network users’ information (name, age, list of friends, photos, and so on) is a dataset where the data items are profiles of social network users.

Data clustering is the task of dividing a dataset into subsets of similar items. Items can also be referred to as instances, observations, entities, or data objects. In most cases, a dataset is represented in table format as a data matrix. A data matrix is a table of numbers, documents, or expressions, represented in rows and columns as follows:

  • Each row corresponds to a given item in the dataset.

    Rows are sometimes referred to as items, objects, instances, or observations.

  • Each column represents a particular characteristic of an item.

    Columns are referred to as features or attributes.

Applying data clustering to a dataset generates groups of similar data items. These groups are called clusters — collections of similar data items.

Similar items have a strong, measurable relationship among them — fresh vegetables, for example, are more similar to each other than they are to frozen foods — and clustering techniques use that relationship to group the items.

The strength of a relationship between two or more items can be quantified with a similarity measure: a mathematical function that computes how alike two data items are. The results of that computation, called similarity values, let you compare a particular data item to all other items in the dataset. Each of those other items will be more similar or less similar to that specific item.
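
As a rough illustration of a similarity measure, the sketch below turns Euclidean distance into a similarity value between 0 and 1. The choice of Euclidean distance, the `similarity` formula, and the item names and feature values are all assumptions made for this example; many other measures exist.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map a distance to a similarity value in (0, 1]; identical items score 1.0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Hypothetical items, each described by two numeric features.
items = {
    "item_a": [1.0, 2.0],
    "item_b": [1.5, 2.0],
    "item_c": [8.0, 9.0],
}

# Compare one item to all the others, most similar first.
target = "item_a"
ranked = sorted(
    ((name, similarity(items[target], vec))
     for name, vec in items.items() if name != target),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{target} vs {name}: similarity {value:.3f}")
```

Here `item_b` ranks above `item_c` because its feature values are closer to those of `item_a`.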

Computed similarities play a major role in assigning items to groups (clusters). Each group has an item that best represents it; this item is referred to as a cluster representative.

Consider a dataset that consists of several types of fruits in a basket. The basket has fruits of different types such as apples, bananas, lemons, and pears. In this case, fruits are the data items. The data clustering process extracts groups of similar fruits out of this dataset (basket of different fruits).


The first step in a data clustering process is to translate this dataset into a data matrix. One way to model this dataset is to have the rows represent the items in the dataset (fruits) and the columns represent the characteristics, or features, that describe the items.

For instance, a fruit feature can be the fruit type (such as a banana or apple), weight, color, or price. In this example dataset, the items have three features: fruit type, color, and weight.
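
A minimal sketch of such a data matrix, using a plain list of rows, might look like the following. The three column names follow the features named above; the individual values and the `column` helper are made up for illustration.

```python
# Column names (features) follow the example; the values are hypothetical.
columns = ["fruit_type", "color", "weight_g"]

# Each row is one item (one fruit in the basket).
data_matrix = [
    ["apple",  "red",    182],
    ["apple",  "green",  170],
    ["banana", "yellow", 118],
    ["lemon",  "yellow",  58],
    ["pear",   "green",  178],
]

def column(matrix, names, name):
    """Return every value of one feature, that is, one column of the matrix."""
    idx = names.index(name)
    return [row[idx] for row in matrix]

weights = column(data_matrix, columns, "weight_g")
print(weights)
```

Reading across a row gives everything known about one fruit; reading down a column gives one feature for every fruit.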

In most cases, applying a data clustering technique to the fruit dataset as described above allows you to

  • Retrieve groups (clusters) of similar items. You can tell that your fruits fall into N groups. After that, if you pick a fruit at random, you can state which of the N groups it belongs to.

  • Retrieve cluster representatives of each group. In this example, finding a cluster representative amounts to picking one fruit from each group and putting it aside: the fruit whose characteristics best represent the cluster it belongs to.
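
The two outcomes above can be sketched with a small k-means-style procedure in plain Python. This is only one of many clustering techniques and is not necessarily the one the article has in mind. The numeric features (weight in grams, length in centimeters), the data values, the deterministic farthest-first seeding, and all function names are assumptions chosen to keep the example short and repeatable.

```python
import math

def mean(points):
    """Component-wise average of a list of equal-length points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then keep adding
    the point that is farthest from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, max_iter=100):
    """Assign each point to its nearest centroid, recompute the centroids,
    and repeat until the assignments stop changing."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

def cluster_representatives(points, labels, centroids):
    """For each cluster, return the index of the item closest to its centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            reps[c] = min(members, key=lambda i: math.dist(points[i], centroid))
    return reps

# Hypothetical basket: (weight in grams, length in cm) for each fruit.
names = ["apple"] * 3 + ["banana"] * 3 + ["lemon"] * 3 + ["pear"] * 3
fruits = [
    (180, 8.0), (175, 7.5), (185, 8.0),
    (120, 19.0), (118, 20.0), (125, 18.0),
    (60, 7.0), (58, 6.5), (62, 7.0),
    (230, 11.0), (225, 10.5), (235, 11.0),
]

labels, centroids = kmeans(fruits, k=4)
reps = cluster_representatives(fruits, labels, centroids)
for c, i in sorted(reps.items()):
    print(f"cluster {c}: representative is {names[i]} {fruits[i]}")
```

The fruit names are never shown to the algorithm; they are used only to label the printed result. In practice, features measured on very different scales (grams versus centimeters) are usually rescaled first so that one feature does not dominate the distance.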

When you’re done clustering, your dataset is organized and divided into natural groupings.

Data clustering reveals structure in the data by extracting natural groupings from a dataset. Therefore, discovering clusters is an essential step toward formulating ideas and hypotheses about the structure of your data and deriving insights to better understand it.

Data clustering can also be a way to model data: It represents a larger body of data by clusters or cluster representatives.

In addition, your analysis may seek simply to partition the data into groups of similar items — as when market segmentation partitions target-market data into groups such as

  • Consumers who share the same interests (such as Mediterranean cooking)

  • Consumers who have common needs (for example, those with specific food allergies)

Identifying clusters of similar customers can help you develop a marketing strategy that addresses the needs of specific clusters.

Moreover, data clustering can help you identify, learn, or predict the nature of new data items, which is how clustering ties into making predictions. For example, in pattern recognition, analyzing patterns in the data (such as buying patterns in particular regions or age groups) can help you develop predictive analytics: in this case, predicting the nature of future data items that fit well with established patterns.

The fruit basket example uses data clustering to distinguish between different data items. Suppose your business assembles custom fruit baskets, and a new, unknown fruit is introduced to the market. You want to learn or predict which cluster the new item will belong to if you add it to the fruit basket.

Because you’ve already applied data clustering to the fruit dataset, you have four clusters (one per type of fruit), which makes it easier to predict which cluster is appropriate for the new item. All you have to do is compare the unknown fruit to the four clusters’ representatives and identify which cluster is the best match.
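
That comparison step can be sketched as a nearest-representative lookup. The representative values, the two numeric features (weight in grams, length in centimeters), the `best_match` helper, and the unknown fruit's measurements are all hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical cluster representatives: (weight in grams, length in cm).
representatives = {
    "apple":  (180.0, 8.0),
    "banana": (120.0, 19.0),
    "lemon":  (60.0, 7.0),
    "pear":   (230.0, 11.0),
}

def best_match(item, reps):
    """Return the name of the cluster whose representative is closest to the item."""
    return min(reps, key=lambda name: math.dist(item, reps[name]))

# Made-up measurements for the new, unknown fruit.
unknown_fruit = (70.0, 6.0)
cluster = best_match(unknown_fruit, representatives)
print(f"The unknown fruit best matches the {cluster} cluster.")
```

Only four comparisons are needed, one per representative, no matter how many fruits were in the original basket.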

Although this process may seem obvious for a person working with a small dataset, it’s not so obvious at a larger scale, when you have to cluster millions of items without examining each one. The amount of work grows rapidly when the dataset is large, diverse, and relatively incoherent, which is why clustering algorithms exist: computers do that type of work best.
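
To give a rough sense of scale, the sketch below counts how many item-to-item comparisons a naive approach would make if it compared every item with every other item once. The all-pairs assumption and the `pairwise_comparisons` helper are illustrative only; practical clustering algorithms use various strategies to avoid doing this much work.

```python
import math

def pairwise_comparisons(n_items):
    """Number of distinct pairs among n_items, that is, n * (n - 1) / 2."""
    return math.comb(n_items, 2)

for n in (4, 1_000, 1_000_000):
    print(f"{n:>9,} items -> {pairwise_comparisons(n):,} comparisons")
```

Under this assumption, the count grows roughly with the square of the number of items, so a million items already imply about half a trillion comparisons.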