Leveraging Singular Value Decomposition for Predictive Analytics

By Dr. Anasse Bari, Mohamed Chaouchi, Tommy Jung

You can use singular value decomposition (SVD) for predictive analytics. SVD represents a dataset by eliminating its less important parts and producing an accurate approximation of the original dataset. In this regard, SVD and principal component analysis (PCA) are both methods of data reduction.

SVD takes a matrix as input and decomposes it into a product of three simpler matrices.

An m by n matrix M can be represented as a product of three other matrices as follows:

M = U * S * V^T

Here U is an m by r matrix, V is an n by r matrix, and S is an r by r diagonal matrix that holds the singular values of M, where r is the rank of M. The * represents matrix multiplication, and T indicates matrix transposition.
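To make the shapes concrete, here is a minimal sketch using NumPy's np.linalg.svd. The 4 by 3 matrix is made up for illustration (its rows are chosen so that the rank is 2), and the 1e-10 cutoff for treating a singular value as zero is an assumption, not a fixed rule.

```python
import numpy as np

# Hypothetical 4-by-3 data matrix (m = 4, n = 3).
# Row 2 is twice row 1, and row 1 = row 3 + 2 * row 4, so the rank is 2.
M = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
])

# full_matrices=False returns the compact factors; s holds the singular
# values in descending order and Vt is already the transpose of V.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the r singular values that are not numerically zero.
# The 1e-10 threshold is an illustrative choice.
r = int(np.sum(s > 1e-10))
U_r = U[:, :r]          # m by r
S_r = np.diag(s[:r])    # r by r, diagonal
Vt_r = Vt[:r, :]        # r by n, the transpose of the n by r matrix V

# Multiplying the three factors back together recovers M.
M_rebuilt = U_r @ S_r @ Vt_r

print("rank r:", r)
print("shapes:", U_r.shape, S_r.shape, Vt_r.shape)
print("reconstruction matches M:", np.allclose(M, M_rebuilt))
```

Note that NumPy returns the singular values as a one-dimensional array rather than as the matrix S, which is why the sketch builds the diagonal matrix explicitly with np.diag.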

When a small number of concepts can describe the data in a matrix, or can relate the matrix’s columns to its rows, SVD is a very useful tool for extracting those concepts. For example, a dataset might contain book ratings, where the reviewers are the rows and the books are the columns. The books can be grouped by type or domain, such as literature and fiction, history, biographies, and children’s or teen books. Those are the concepts that SVD can help extract.

These concepts must be meaningful and conclusive. If you keep only a few concepts or dimensions to describe a larger dataset, your approximation will be less accurate. That is why it’s important to eliminate only the concepts that are less important and less relevant to the overall dataset.
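The trade-off between the number of concepts kept and the accuracy of the approximation can be sketched as follows. The ratings matrix and the helper rank_k_approximation are hypothetical, introduced only for illustration; rows stand for reviewers and columns for books, and 0 is assumed to mean "no rating", which is a simplification.

```python
import numpy as np

# Hypothetical ratings: 5 reviewers (rows) by 4 books (columns).
# The first two books could be fiction, the last two history.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 1, 5, 4],
    [0, 0, 4, 5],
    [5, 5, 1, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

def rank_k_approximation(k):
    """Rebuild the matrix from only the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative error (Frobenius norm) for each number of kept concepts.
errors = []
for k in range(1, len(s) + 1):
    approx = rank_k_approximation(k)
    rel_err = np.linalg.norm(ratings - approx) / np.linalg.norm(ratings)
    errors.append(rel_err)
    print(f"k={k}: relative error {rel_err:.3f}")
```

The error can only shrink or stay the same as k grows, and it reaches zero (up to rounding) once every singular value is kept. In practice you would choose the smallest k whose error is acceptable for your application.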

Latent semantic indexing is a data mining and natural language processing technique that is used in document retrieval and word similarity. It employs SVD to map documents to concepts, where each concept can consist of different words found in those documents. The universe of words can be very large, and various words can be grouped into a single concept. SVD helps reduce the noisy correlation between those words and their documents, and it gives you a representation of that universe using far fewer dimensions than the original dataset.

It is easy to see that documents discussing similar topics can use different words to describe those same topics. A document describing lions in Zimbabwe and another document describing elephants in Kenya should be grouped together. So you rely on concepts (wildlife in Africa, in this case), not words, to group these documents. The relation between documents and their words is established through those concepts or topics.
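A toy version of this idea can be sketched with a small term-document matrix. The four documents, the choice of k = 2 concepts, and the cosine helper are all assumptions made for illustration; a real latent semantic indexing pipeline would typically add tokenization, stop-word removal, and term weighting.

```python
import numpy as np

# Hypothetical mini-corpus: two wildlife documents, two finance documents.
docs = [
    "lions zimbabwe wildlife",
    "elephants kenya wildlife",
    "stocks market",
    "bonds market",
]

# Term-document count matrix: one row per word, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
A = np.array([[doc.split().count(word) for doc in docs] for word in vocab],
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep k = 2 concepts and express each document in concept space.
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row per document

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

raw_sim = cosine(A[:, 0], A[:, 1])                  # word space
concept_sim = cosine(doc_vecs[0], doc_vecs[1])      # concept space
cross_sim = cosine(doc_vecs[0], doc_vecs[2])        # wildlife vs. finance

print(f"lions vs. elephants, raw words: {raw_sim:.2f}")
print(f"lions vs. elephants, concepts:  {concept_sim:.2f}")
print(f"lions vs. stocks, concepts:     {cross_sim:.2f}")
```

In word space the two wildlife documents share only one of their three words, so their similarity is modest. In the two-concept space they fall on the same axis and look nearly identical, while the wildlife and finance documents remain unrelated.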

SVD and PCA have both been used in classification and clustering; generating those concepts is itself a form of classifying and grouping the data. Both have also been used for collaborative filtering.