Machine Learning For Dummies
You can put text processing into use for machine learning with classification tasks. When you classify texts, you assign a document to a class because of the topics it discusses.

You can discover the topics in a document in different ways. The simplest approach is prompted by the idea that if a group of people talks or writes about a topic, the people tend to use words from a limited vocabulary because they refer or relate to the same topic. When you share some meaning or are part of the same group, you tend to use the same language.

Consequently, if you have a collection of texts and don’t know what topics they reference, you can reverse the previous reasoning: look for groups of words that tend to occur together. The groups that dimensionality reduction forms from those words may hint at the topics you’d like to know about. This is a typical unsupervised learning task.

This learning task is a perfect application for the singular value decomposition (SVD) family of algorithms: as they reduce the number of columns, the features (which, in a document, are the words) gather into dimensions, and you can discover the topics by checking the high-scoring words. SVD and Principal Components Analysis (PCA) allow features to relate both positively and negatively to the newly created dimensions.

So a resulting topic may be expressed by the presence of a word (high positive value) or by the absence of it (high negative value), making interpretation both tricky and counterintuitive for humans. The Scikit-learn package includes the Non-Negative Matrix Factorization (NMF) decomposition class, which allows an original feature to relate only positively with the resulting dimensions.
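The sign difference between the two families can be seen on a tiny example. The following is a minimal sketch, assuming a made-up document-term matrix (the counts are illustrative and not taken from the newsgroup data); it fits Scikit-learn's TruncatedSVD and NMF classes and checks the signs of the resulting components.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

# Toy document-term counts (rows: documents, columns: words); made-up data
X = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 1, 0, 1],
    [0, 1, 3, 2, 1],
    [0, 0, 2, 3, 1],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=101).fit(X)
nmf = NMF(n_components=2, init='nndsvda', random_state=101,
          max_iter=1000).fit(X)

# SVD components mix positive and negative word weights;
# NMF components contain only non-negative weights
svd_has_negative = bool((svd.components_ < 0).any())
nmf_all_nonnegative = bool((nmf.components_ >= 0).all())

print("SVD components:\n", np.round(svd.components_, 2))
print("NMF components:\n", np.round(nmf.components_, 2))
print("SVD has negative weights:", svd_has_negative)
print("NMF weights all non-negative:", nmf_all_nonnegative)
```

With NMF, every word either contributes to a dimension or is absent from it, which is why the topics are usually easier to read.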

This example starts by loading the 20newsgroups dataset, a collection of newsgroup postings scraped from the web, selecting only the posts about objects for sale and automatically removing headers, footers, and quotes. When you run this code, you may see a warning to the effect of WARNING:sklearn.datasets.twenty_newsgroups:Downloading dataset from …, followed by the URL of the site used for the download.

import warnings
from sklearn.datasets import fetch_20newsgroups

# 'misc.forsale' is the 20newsgroups category for items offered for sale
dataset = fetch_20newsgroups(shuffle=True,
                             categories=['misc.forsale'],
                             remove=('headers', 'footers', 'quotes'),
                             random_state=101)
print('Posts: %i' % len(dataset.data))

Posts: 585

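The TfidfVectorizer class is imported and set up to remove stop words (common words such as the or and) and keep only distinctive words, producing a matrix whose columns correspond to distinct words. Setting max_df=0.95 drops words that appear in more than 95 percent of the posts, and min_df=2 drops words that appear in fewer than two posts.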

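from sklearn.feature_extraction.text import TfidfVectorizer

# Build the TF-IDF matrix: one row per post, one column per retained word
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                             stop_words='english')
tfidf = vectorizer.fit_transform(dataset.data)

from sklearn.decomposition import NMF

# Factor the TF-IDF matrix into five non-negative topics
n_topics = 5
nmf = NMF(n_components=n_topics, random_state=101).fit(tfidf)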

Term frequency-inverse document frequency (TF-IDF) is a simple calculation based on the frequency of a word in a document, weighted by the rarity of that word across all the available documents. Weighting words this way is an effective means of discounting words that can’t help you classify or identify a document. For example, common parts of speech and other words that appear in nearly every document receive low weights.
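To make the weighting concrete, here is a minimal sketch of the textbook TF-IDF formula (term frequency times the logarithm of the number of documents divided by the document frequency) on three made-up posts. The tf_idf helper and the sample sentences are illustrative assumptions; TfidfVectorizer uses a smoothed variant of the formula and normalizes each row by default, so its numbers differ.

```python
import math
from collections import Counter

# Three made-up posts standing in for a corpus
docs = [
    "hard drive for sale",
    "floppy drive for sale",
    "comics for sale",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    counts = Counter(tokens)
    return {word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}


weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, "->", {word: round(score, 3) for word, score in w.items()})
```

Words such as for and sale appear in every post, so their weight is zero, whereas hard, which appears in a single post, scores higher than drive, which appears in two.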

As with other algorithms from the sklearn.decomposition module, the n_components parameter indicates the number of desired components. If you’d like to look for more topics, you use a higher number. As the number of topics increases, the reconstruction_err_ attribute of the fitted model reports a lower reconstruction error. It’s up to you to decide when to stop, given the trade-off between more time spent on computations and more topics.
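The trade-off can be checked by fitting several models and comparing their errors. The following is a minimal sketch, assuming a small made-up matrix with three obvious word groups rather than the newsgroup data; the matrix values and the init and max_iter settings are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up non-negative matrix with three clear blocks of co-occurring words
X = np.array([
    [4, 3, 0, 0, 0, 0],
    [3, 4, 0, 0, 0, 0],
    [0, 0, 5, 4, 0, 0],
    [0, 0, 4, 5, 0, 0],
    [0, 0, 0, 0, 2, 3],
    [0, 0, 0, 0, 3, 2],
], dtype=float)

# Fit one model per candidate number of topics and record its error
errors = {}
for k in (1, 2, 3):
    model = NMF(n_components=k, init='nndsvda', random_state=101,
                max_iter=1000).fit(X)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print("n_components=%d  reconstruction error=%.3f" % (k, err))
```

On real text the error rarely drops this sharply, so plotting it against the number of topics and looking for the point where the improvement flattens is a common heuristic.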

The last part of the script outputs the resulting five topics. By reading the printed words, you can decide on the meaning of the extracted topics, thanks to product characteristics (for instance, the words drive, hard, card, and floppy refer to computers) or the exact product (for instance, comics, car, stereo, games).

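# Use get_feature_names() in scikit-learn releases before 1.0
feature_names = vectorizer.get_feature_names_out()
n_top_words = 15

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % (topic_idx + 1))
    # argsort is ascending, so the reversed slice picks the top-weighted words
    print(" ".join([feature_names[i] for i in
                    topic.argsort()[:-n_top_words - 1:-1]]))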

Topic #1:
drive hard card floppy monitor meg ram disk motherboard vga scsi brand color internal modem
Topic #2:
00 50 dos 20 10 15 cover 1st new 25 price man 40 shipping comics
Topic #3:
condition excellent offer asking best car old sale good new miles 10 000 tape cd
Topic #4:
email looking games game mail interested send like thanks price package list sale want know
Topic #5:
shipping vcr stereo works obo included amp plus great volume vhs unc mathes gibbs radley

You can explore the resulting model by looking into the components_ attribute of the trained NMF model. It consists of a NumPy ndarray holding non-negative values, with one row per topic and one column per word, where larger values mark words more strongly connected to the topic. By using the argsort method, you can get the indexes of the highest values, which identify the most representative words.

# Gets the indexes of the top words for topic 0
print(nmf.components_[0, :].argsort()[:-n_top_words - 1:-1])

[1337 1749 889 1572 2342 2263 2803 1290 2353 3615 3017 806 1022 1938
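The slicing idiom applied to argsort is easier to follow in isolation. Here is a minimal sketch on a toy array; the word list and weights are hypothetical and chosen only to illustrate the mechanics.

```python
import numpy as np

# Hypothetical weights of six words for a single topic
words = np.array(["drive", "the", "card", "price", "floppy", "car"])
topic = np.array([0.9, 0.0, 0.7, 0.1, 0.8, 0.05])

n_top = 3
# argsort returns indexes in ascending order of weight;
# [:-n_top - 1:-1] walks backward from the end and stops after n_top items
top_idx = topic.argsort()[:-n_top - 1:-1]
top_words = words[top_idx].tolist()

print(top_idx.tolist())  # [0, 4, 2]
print(top_words)         # ['drive', 'floppy', 'card']
```

An equivalent and arguably more readable form is topic.argsort()[::-1][:n_top], which reverses the full ordering first and then keeps the first n_top indexes.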


You can decode the indexes into readable strings by looking them up in the array returned by the get_feature_names_out method (get_feature_names in scikit-learn releases before 1.0) of the previously fitted TfidfVectorizer.

# Transforms index 1337 back to text
# (use get_feature_names() in scikit-learn releases before 1.0)
print(vectorizer.get_feature_names_out()[1337])


About This Article

About the book authors:

John Paul Mueller is a prolific freelance author and technical editor. He's covered everything from networking and home security to database management and heads-down programming.

Luca Massaron is a data scientist who specializes in organizing and interpreting big data, turning it into smart data with data mining and machine learning techniques.