Performing Classification Tasks for Machine Learning
You can put text processing to use in machine learning through classification tasks. When you classify texts, you assign a document to a class based on the topics it discusses.
You can discover the topics in a document in various ways. The simplest approach starts from the observation that when a group of people talks or writes about a topic, its members tend to draw words from a limited vocabulary because they refer and relate to the same subject. When you share some meaning or belong to the same group, you tend to use the same language.
Consequently, if you have a collection of texts and don’t know what topics they reference, you can reverse the previous reasoning: simply look for groups of words that tend to appear together, because the groups formed by dimensionality reduction may hint at the topics you’d like to know about. This is a typical unsupervised learning task.
This learning task is a perfect application for the singular value decomposition (SVD) family of algorithms because, by reducing the number of columns, the features (which, in a document, are the words) gather into dimensions, and you can discover the topics by checking the high-scoring words in each dimension. SVD and Principal Components Analysis (PCA) allow features to relate both positively and negatively to the newly created dimensions.
A resulting topic may therefore be expressed by the presence of a word (a high positive value) or by its absence (a high negative value), which makes interpretation tricky and counterintuitive for humans. The Scikit-learn package also includes the Non-Negative Matrix Factorization (NMF) decomposition class, which allows an original feature to relate only positively to the resulting dimensions.
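The difference is easy to see on a toy matrix. This sketch uses synthetic counts (not the newsgroup data): TruncatedSVD typically produces components with mixed signs, while NMF keeps every loading non-negative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, NMF

# Toy document-term matrix (4 documents x 5 terms), all counts non-negative
X = np.array([[3, 0, 1, 0, 2],
              [2, 0, 0, 1, 3],
              [0, 4, 0, 2, 0],
              [0, 3, 1, 2, 0]], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
nmf = NMF(n_components=2, random_state=0, max_iter=500).fit(X)

# SVD loadings can be negative, which makes topics harder to read
print((svd.components_ < 0).any())

# NMF components are constrained to be non-negative
print((nmf.components_ >= 0).all())  # True
```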
This example starts a new experiment by loading the
20newsgroups dataset, a collection of newsgroup postings scraped from the web, selecting only the posts about objects for sale and automatically removing headers, footers, and quotes. When working with this code, you may receive a warning message to the effect of
WARNING:sklearn.datasets.twenty_newsgroups:Downloading dataset from …, with the URL of the site used for the download.
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True,
categories = ['misc.forsale'],
remove=('headers', 'footers', 'quotes'), random_state=101)
print ('Posts: %i' % len(dataset.data))
The TfidfVectorizer class is then imported and set up to remove stop words (common words such as the or and) and keep only distinctive words, producing a matrix whose columns correspond to distinct words.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                             stop_words='english')
tfidf = vectorizer.fit_transform(dataset.data)
from sklearn.decomposition import NMF
n_topics = 5
nmf = NMF(n_components=n_topics, random_state=101).fit(tfidf)
Term frequency-inverse document frequency (TF-IDF) is a simple calculation based on the frequency of a word in a document, weighted by the rarity of that word across all the available documents. Weighting words is an effective way to rule out words that can’t help you classify or identify a document when processing text. For example, you can eliminate common parts of speech or other common words.
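You can see the effect on a tiny made-up corpus: a word that appears in every document (car) receives a lower TF-IDF weight than a word confined to a single document (floppy).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus: "car" appears in every document,
# "floppy" in only one, so TF-IDF weights "floppy" more heavily
docs = ["car for sale",
        "used car parts",
        "floppy disk and car manuals"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = vec.vocabulary_

# Compare the two weights within the third document
row = X[2].toarray().ravel()
print(row[vocab['floppy']] > row[vocab['car']])  # True: the rarer word scores higher
```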
As with other algorithms from the
sklearn.decomposition module, the
n_components parameter indicates the number of desired components. If you’d like to look for more topics, you use a higher number. As the required number of topics increases, the
reconstruction_err_ attribute reports lower error rates. It’s up to you to decide when to stop, given the trade-off between more time spent on computation and more topics.
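A quick way to see the trade-off is to fit NMF with increasing values of n_components and watch reconstruction_err_ shrink. This sketch uses a small random non-negative matrix as a stand-in for the TF-IDF matrix, so the exact error values are illustrative only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative matrix standing in for the TF-IDF matrix
rng = np.random.RandomState(101)
X = rng.rand(50, 30)

# reconstruction_err_ is set after fitting; it shrinks as n_components
# grows, at the cost of longer fits and more topics to interpret
errors = [NMF(n_components=k, random_state=101,
              max_iter=500).fit(X).reconstruction_err_
          for k in (2, 5, 10)]
print(errors)  # decreasing sequence of reconstruction errors
```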
The last part of the script outputs the resulting five topics. By reading the printed words, you can decide on the meaning of the extracted topics, thanks to product characteristics (for instance, the words drive, hard, card, and floppy refer to computers) or the exact product (for instance, comics, car, stereo, games).
feature_names = vectorizer.get_feature_names_out()
n_top_words = 15
for topic_idx, topic in enumerate(nmf.components_):
print ("Topic #%d:" % (topic_idx+1),)
print (" ".join([feature_names[i] for i in
topic.argsort()[:-n_top_words - 1:-1]]))
Topic #1: drive hard card floppy monitor meg ram disk motherboard vga scsi brand color internal modem
Topic #2: 00 50 dos 20 10 15 cover 1st new 25 price man 40 shipping comics
Topic #3: condition excellent offer asking best car old sale good new miles 10 000
Topic #4: email looking games game mail interested send like thanks price package list sale want know
Topic #5: shipping vcr stereo works obo included amp plus great volume vhs unc mathes
You can explore the resulting model by looking into the attribute
components_ from the trained NMF model. It consists of a NumPy
ndarray holding positive values for words connected to the topic. By using the
argsort method, you can get the indexes of the top associations; the highest values mark the most representative words.
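The slice [:-n_top_words - 1:-1] used in the loop above is simply a way of reading the argsort result backward so that the top-scoring indexes come first. A minimal NumPy illustration:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7])

# argsort returns indexes ordered from the lowest to the highest score
print(scores.argsort())                  # [0 2 3 1]

# Reading it backward yields the top-N indexes, highest score first
n_top = 2
print(scores.argsort()[:-n_top - 1:-1])  # [1 3]
```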
# Gets top words for topic 0
print(nmf.components_[0, :].argsort()[:-n_top_words - 1:-1])
[1337 1749 889 1572 2342 2263 2803 1290 2353 3615 3017 806 1022 1938
Decoding the word indexes into readable strings means looking them up in the array returned by the
get_feature_names_out method applied to the
TfidfVectorizer that was previously fitted.
# Transforms index 1337 back to text
print(feature_names[1337])