How to Convert Raw Data into a Predictive Analysis Matrix

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Before you can extract groups of similar data items from your dataset for your predictive analysis project, you might need to represent your data in a tabular format known as a data matrix. This is a preprocessing step that comes before data clustering.

How to create a predictive analysis matrix of terms in documents

Suppose the dataset that you’re about to analyze is contained in a set of Microsoft Word documents. The first thing you need to do is convert the set of documents into a data matrix. Several commercial and open-source tools can handle that task, producing a matrix in which each row corresponds to a document in the dataset. Examples of these tools include RapidMiner and R text-mining packages.

A document is, in essence, a set of words. A term consists of one or more words.

Every term that a document contains is mentioned either once or several times in the same document. The number of times a term is mentioned in a document can be represented by term frequency (TF), a numerical value.

We construct the matrix of terms in the document as follows:

  • The terms selected from the document collection are listed across the top row.

  • Document titles are listed down the leftmost column.

  • The numbers that appear inside the matrix cells correspond to each term’s frequency.

For instance, Document A is represented as the row of numbers (5, 16, 0, 19, 0, 0), where 5 is the number of times the term predictive analytics appears, 16 is the number of times computer science appears, and so on. This is the simplest way to convert a set of documents into a matrix.

Document      Predictive Analytics   Computer Science   Learning   Clustering   2013   Anthropology
Document A    5                      16                 0          19           0      0
Document B    8                      6                  2          3            0      0
Document C    0                      5                  2          3            3      9
Document D    1                      9                  13         4            6      7
Document E    2                      16                 16         0            2      13
Document F    13                     0                  19         16           4      2
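As a rough illustration of how such a matrix could be built by hand, the following Python sketch counts how often each term occurs in each document. The two toy documents, the term list, and the helper names count_term and build_term_matrix are hypothetical and not part of the original example. The sketch assumes the text has already been extracted from the Word files and uses simple case-insensitive matching. Tools such as RapidMiner or the R text-mining packages handle tokenization and cleanup far more thoroughly.

```python
import re


def count_term(term, text):
    """Count case-insensitive, whole-word occurrences of a term.

    A term may contain several words; any run of whitespace between
    its words is accepted.
    """
    words = [re.escape(word) for word in term.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def build_term_matrix(documents, terms):
    """Map each document title to its row of term frequencies."""
    return {
        title: [count_term(term, text) for term in terms]
        for title, text in documents.items()
    }


# Hypothetical toy documents (already extracted to plain text).
documents = {
    "Document X": "Predictive analytics uses clustering. "
                  "Clustering groups similar items.",
    "Document Y": "Computer science students study clustering "
                  "and predictive analytics.",
}
terms = ["predictive analytics", "computer science", "clustering"]

matrix = build_term_matrix(documents, terms)
for title, row in matrix.items():
    print(title, row)
```

Each printed row plays the same role as a row of the matrix above: one document, with one count per term in a fixed column order.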

Basics of predictive analysis term selection

One challenge in clustering text documents is determining how to select the best terms to represent all documents in the collection. How important a term is in a collection of documents can be calculated in different ways.

If, for example, you count the number of times a term is repeated in a document and compare that total with how often it recurs in the whole collection, you get a sense of the term’s importance relative to other terms.

Basing the relative importance of a term on its frequency in a collection is often known as weighting. The weight you assign can be based on two principles:

  • Terms that appear several times in a document are favored over terms that appear only once.

  • Terms that are used in relatively few documents are favored over terms that are mentioned in all documents.

If (for example) the term century is mentioned in all documents in your dataset, it does little to distinguish one document from another, so you might not assign it enough weight to earn a column of its own in the matrix.
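One widely used weighting scheme that follows both principles is TF-IDF (term frequency multiplied by inverse document frequency). The text here does not name a specific formula, so treat this choice as an assumption. The sketch below applies a basic variant, weight = TF × ln(N / DF), to the term-frequency matrix shown earlier, where N is the number of documents and DF is the number of documents that contain the term. The function name tf_idf is hypothetical.

```python
import math

terms = ["Predictive Analytics", "Computer Science", "Learning",
         "Clustering", "2013", "Anthropology"]

# Term frequencies taken from the matrix shown earlier.
tf = {
    "Document A": [5, 16, 0, 19, 0, 0],
    "Document B": [8, 6, 2, 3, 0, 0],
    "Document C": [0, 5, 2, 3, 3, 9],
    "Document D": [1, 9, 13, 4, 6, 7],
    "Document E": [2, 16, 16, 0, 2, 13],
    "Document F": [13, 0, 19, 16, 4, 2],
}


def tf_idf(tf_rows):
    """Return (weights, df) for a dict of term-frequency rows.

    Assumed formula: weight = tf * ln(N / df). A term that occurs in
    every document gets ln(1) = 0, so its weight is zero.
    """
    n_docs = len(tf_rows)
    n_terms = len(next(iter(tf_rows.values())))
    # Document frequency: how many documents mention each term at all.
    df = [sum(1 for row in tf_rows.values() if row[j] > 0)
          for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weights = {title: [count * idf[j] for j, count in enumerate(row)]
               for title, row in tf_rows.items()}
    return weights, df


weights, df = tf_idf(tf)
print(dict(zip(terms, df)))
for title, row in weights.items():
    print(title, [round(w, 2) for w in row])
```

Under this formula, a term such as century that occurs in every document would receive a weight of zero in every row, which matches the intuition that it does not deserve a column of its own.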

The same tabular representation works for data other than text. If you’re dealing with a dataset of users of an online social network, for example, you can easily convert that dataset into a matrix: user IDs or names occupy the rows, and the columns list the features that best describe those users.
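A minimal sketch of that idea follows. The user records, the feature names (age, friends, posts_per_week), and the helper build_user_matrix are all hypothetical and chosen only for illustration. Filling a missing feature with 0 is likewise an assumption that may not suit real data.

```python
# Hypothetical user records; real data would come from the network's export.
users = [
    {"user_id": "u001", "age": 34, "friends": 120, "posts_per_week": 5},
    {"user_id": "u002", "age": 27, "friends": 450},
    {"user_id": "u003", "age": 41, "friends": 80, "posts_per_week": 1},
]
features = ["age", "friends", "posts_per_week"]


def build_user_matrix(users, features, missing=0):
    """Return (row_labels, matrix) with one row per user.

    Columns follow the order of `features`; a feature absent from a
    user's record is filled with `missing`.
    """
    row_labels = [user["user_id"] for user in users]
    matrix = [[user.get(name, missing) for name in features]
              for user in users]
    return row_labels, matrix


row_labels, matrix = build_user_matrix(users, features)
for label, row in zip(row_labels, matrix):
    print(label, row)
```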