Performing Sentiment Analysis on Twitter

By John Paul Mueller, Luca Massaron

It seems as though everyone is using Twitter to make their sentiments known today. Of course, the problem is that no one really knows how common those sentiments are — that is, whether anyone could derive a trend from all those tweets out there.

The following example shows how to classify tweets as positive or negative sentiments automatically. The example uses specific sentiments that you can change to see different results.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

positive_tweets = [('Flowers smell good', 'positive'),
                   ('Birds are beautiful', 'positive'),
                   ('This is going to be a great day', 'positive'),
                   ('I love my bff', 'positive'),
                   ('This restaurant has great food', 'positive')]
negative_tweets = [('Onions smell bad', 'negative'),
                   ('Garbage is ugly', 'negative'),
                   ('Nothing has gone right today', 'negative'),
                   ('I hate my boss', 'negative'),
                   ('This restaurant has terrible food', 'negative')]
test_tweets = [('Singing makes me happy', 'positive'),
               ('Blue skies are nice', 'positive'),
               ('I love spring', 'positive'),
               ('Coughing makes me sad', 'negative'),
               ('Cloudy skies are depressing', 'negative'),
               ('I hate winter', 'negative')]

# Separate the tweet text from the sentiment labels.
X_train, y_train = zip(*(positive_tweets + negative_tweets))
X_test, y_test = zip(*test_tweets)

# Convert the raw text into TF-IDF feature vectors.
tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)

# Train a logistic regression classifier on the vectorized tweets.
sentiment_classifier = LogisticRegression()
sentiment_classifier.fit(vectorized, y_train)

# Vectorize the test tweets and predict their sentiment.
vectorized_test = tfidfvec.transform(X_test)
prediction = list(sentiment_classifier.predict(vectorized_test))
# Column 1 of predict_proba holds the probability of 'positive'.
probabilities = list(
    sentiment_classifier.predict_proba(vectorized_test)[:, 1])

print('correct labels : %s %s %s %s %s %s' % y_test)
print('prediction     : %s %s %s %s %s %s' % tuple(prediction))
print('proba positive : %0.6f %0.6f %0.6f %0.6f %0.6f %0.6f'
      % tuple(probabilities))

The example begins by creating training data that consists of positive and negative tweets, from which the code later builds X_train and y_train. To evaluate the resulting model, you also need test data, which comes in the form of test_tweets; the code splits these into X_test and y_test. Together, these lists simulate a sample of tweets for the example.
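The zip(*pairs) idiom used to split tweets from labels can be sketched in isolation (the two sample pairs below are chosen just for illustration):

```python
# zip(*pairs) transposes a list of (text, label) tuples
# into one tuple of texts and one tuple of labels.
pairs = [('Flowers smell good', 'positive'),
         ('Onions smell bad', 'negative')]
texts, labels = zip(*pairs)
print(texts)   # ('Flowers smell good', 'Onions smell bad')
print(labels)  # ('positive', 'negative')
```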

Before the code can do anything with the tweets, it must vectorize them and then classify them. Vectorization breaks the text into tokens — the smallest pieces the algorithm can use — and turns them into numeric TF-IDF features; classification then uses those features to identify commonality between the training data and the testing data.
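As a rough sketch of what TfidfVectorizer does with the tweets (the two sample documents here are assumptions for illustration, not part of the example's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['Flowers smell good', 'Onions smell bad']
vec = TfidfVectorizer(lowercase=True)
matrix = vec.fit_transform(docs)

# Each document becomes a row; each distinct token becomes a column.
print(sorted(vec.vocabulary_))  # ['bad', 'flowers', 'good', 'onions', 'smell']
print(matrix.shape)             # (2, 5)
```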

At this point, the trained algorithm is tested to ensure that the training has worked as expected. The following output shows the results of the training and testing process:

correct labels : positive positive positive negative negative negative
prediction     : negative positive positive negative positive negative
proba positive : 0.497926 0.555813 0.561923 0.497926 0.555813 0.434459