Using AI for Sentiment Analysis

By John Paul Mueller, Luca Massaron

Sentiment analysis computationally derives the writer’s attitude (whether positive, negative, or neutral) toward the text topic from a written text. This kind of analysis proves useful for people working in marketing and communication because it helps them understand what customers and consumers think of a product or service and thus act appropriately (for instance, trying to recover unsatisfied customers or deciding to use a different sales strategy). Everyone performs sentiment analysis. For example, when reading text, people naturally try to determine the sentiment that moved the person who wrote it. However, when the number of texts to read and understand is too large and the text constantly accumulates, as in social media and customer e-mails, automating sentiment analysis is important.

The upcoming example is a test run of RNNs using Keras and TensorFlow that builds a sentiment analysis algorithm capable of classifying the attitudes expressed in a film review. The data is a sample of the IMDb dataset that contains 50,000 reviews (split in half between train and test sets) of movies accompanied by a label expressing the sentiment of the review (0=negative, 1=positive). IMDb is a large online database containing information about films, TV series, and video games. Originally maintained by a fan base, it’s now run by an Amazon subsidiary. On IMDb, people find the information they need about their favorite show as well as post their comments or write a review for other visitors to read.

Keras offers a downloadable wrapper for IMDb data. You prepare, shuffle, and arrange this data into a train and a test set. In particular, the IMDb textual data offered by Keras is cleansed of punctuation, normalized into lowercase, and transformed into numeric values. Each word is coded into a number representing its ranking in frequency. Most frequent words have low numbers; less frequent words have higher numbers.
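To make the frequency-ranking idea concrete, here is a minimal sketch in plain Python. It is not how Keras builds its index; the three toy reviews and the alphabetical tie-breaking rule are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and stripped of punctuation.
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
]

# Count how often each word appears across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank 1 goes to the most frequent word; ties are broken alphabetically
# (an assumption of this sketch) so the result is deterministic.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

# Replace every word by its frequency rank.
encoded = [[word_rank[w] for w in review.split()] for review in reviews]
print(word_rank)
print(encoded)
```

Frequent words such as "the" and "was" end up with the lowest numbers, while words that appear only once receive the highest ones.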

As a starting point, the code imports the imdb module from Keras and uses it to retrieve the data from the Internet (about a 17.5MB download). The parameters that the example uses keep just the top 10,000 words and tell Keras to shuffle the data using a specific random seed. (Knowing the seed makes it possible to reproduce the shuffle as needed.) The function returns two sets, train and test, each made of text sequences and the sentiment outcomes.

from keras.datasets import imdb
top_words = 10000  # keep only the 10,000 most frequent words
((x_train, y_train),
 (x_test, y_test)) = imdb.load_data(num_words=top_words,
                                    seed=21)

After the previous code completes, you can check the number of examples using the following code:

print("Training examples: %i" % len(x_train))
print("Test examples: %i" % len(x_test))

After inquiring about the number of cases available for use in the training and test phase of the neural network, the code outputs an answer of 25,000 examples for each phase. (This dataset is a relatively small one for a language problem; clearly the dataset is mainly for demonstration purposes.) In addition, the code determines whether the dataset is balanced, which means it has an almost equal number of positive and negative sentiment examples.

import numpy as np
print(np.unique(y_train, return_counts=True))

The result, (array([0, 1]), array([12500, 12500])), confirms that the dataset is split evenly between positive and negative outcomes. Such a balance between the response classes is exclusively because of the demonstrative nature of the dataset. In the real world, you seldom find balanced datasets. The next step creates some Python dictionaries that can convert between the code used in the dataset and the real words. In fact, the dataset used in this example is preprocessed and provides sequences of numbers representing the words, not the words themselves. (LSTM and GRU algorithms that you find in Keras expect sequences of numbers as inputs.)

# Shift every word code by 3 to leave room for the special tags 0, 1, and 2
word_to_id = {w:i+3 for w,i in imdb.get_word_index().items()}
id_to_word = {0:'<PAD>', 1:'<START>', 2:'<UNK>'}
id_to_word.update({i+3:w for w,i in imdb.get_word_index().items()})
def convert_to_text(sequence):
    # Skip the special tags and join the remaining words
    return ' '.join([id_to_word[s] for s in sequence if s>=3])
print(convert_to_text(x_train[8]))

The previous code snippet defines two conversion dictionaries (from words to numeric codes and vice versa) and a function that translates the dataset examples into readable text. As an example, the code prints the ninth example: “this movie was like a bad train wreck as horrible as it was …”. From this excerpt, you can easily anticipate that the sentiment for this movie isn’t positive. Words such as bad, wreck, and horrible convey a strong negative feeling, and that makes guessing the correct sentiment easy.

In this example, you receive the numeric sequences and turn them back into words, but the opposite is common. Usually, you get phrases made up of words and turn them into sequences of integers to feed to a layer of RNNs. Keras offers a specialized class, Tokenizer, which can do that for you. It uses the methods fit_on_texts, to learn how to map words to integers from training data, and texts_to_sequences, to transform text into sequences of integers.

However, in other phrases, you may not find such revealing words for the sentiment analysis. The feeling is expressed in a more subtle or indirect way, and understanding the sentiment early in the text may not be possible because revealing phrases and words may appear much later in the discourse. For this reason, you also need to decide how much of the phrase you want to analyze.

Conventionally, you take an initial part of the text and use it as representative of the entire review. Sometimes you just need a few initial words (for instance, the first 50 words) to get the sense; sometimes you need more. Especially long texts don’t reveal their orientation early. It is therefore up to you to understand the type of text you are working with and decide how many words to analyze using deep learning. This example considers only 200 words of each review, which should suffice.

You may have noticed that the conversion dictionaries shift every word code by 3, thus leaving codes from 0 to 2 free. Lower numbers are used for special tags, such as signaling the start of the phrase, filling empty spaces to have the sequence fixed at a certain length, and marking the words that are excluded because they’re not frequent enough. This example picks up only the most frequent 10,000 words. Using tags to point out start, end, and notable situations is a trick that works with RNNs, especially for machine translation.
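The following sketch shows, in plain Python, how a phrase could be turned into integer codes while respecting the reserved tags. The tiny word_rank dictionary, the encode function, and the small top_words value are hypothetical stand-ins, not part of Keras or of the IMDb index; the shift by 3 mirrors the conversion dictionaries defined earlier.

```python
# Hypothetical frequency ranks (1 = most frequent); not the real IMDb index.
word_rank = {"the": 1, "was": 2, "movie": 3, "bad": 4, "wreck": 5}

PAD, START, UNK = 0, 1, 2   # reserved codes for the special tags
OFFSET = 3                  # word codes are shifted to leave 0-2 free

def encode(text, word_rank, top_words):
    """Turn a phrase into integer codes, starting with the START tag."""
    ids = [START]
    for word in text.lower().split():
        rank = word_rank.get(word)
        code = rank + OFFSET if rank is not None else UNK
        # Words outside the kept vocabulary collapse into the UNK tag.
        ids.append(code if code < top_words else UNK)
    return ids

print(encode("The movie was a bad wreck", word_rank, top_words=8))
```

Here "a" is unknown to the toy dictionary and "wreck" falls outside the kept vocabulary, so both become the code 2.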

from keras.preprocessing.sequence import pad_sequences
max_pad = 200
x_train = pad_sequences(x_train,
                        maxlen=max_pad)
x_test = pad_sequences(x_test,
                       maxlen=max_pad)
print(x_train[0])
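As a rough illustration of what happens to sequences of different lengths, the sketch below mimics padding and truncation in plain Python. The pad_or_truncate helper is hypothetical and only imitates the default behavior of pad_sequences (zeros added in front, extra elements dropped from the front); it is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, truncating="pre", value=0):
    """Return a list of exactly maxlen elements built from seq."""
    if len(seq) > maxlen:
        # "pre" drops elements from the start, "post" drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    # Short sequences are filled with the padding value at the front.
    return [value] * (maxlen - len(seq)) + list(seq)

short_review = [5, 6, 7]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, 5))                    # zeros come first
print(pad_or_truncate(long_review, 5))                     # keeps the last five
print(pad_or_truncate(long_review, 5, truncating="post"))  # keeps the first five
```

Whether you keep the beginning or the end of a long review is a design choice that depends on where you expect the revealing words to appear.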

By using the pad_sequences function from Keras with max_pad set to 200, the code keeps two hundred words of each review. Note that with its default settings (truncating='pre'), pad_sequences drops words from the beginning of a longer review, so it retains the last two hundred words; pass truncating='post' if you want the first two hundred instead. In case the review contains fewer than two hundred words, as many zero values as necessary precede the sequence to reach the required number of sequence elements. Cutting the sequences to a certain length and filling the voids with zero values is called input padding, an important processing activity when using deep learning algorithms like RNNs. Now the code designs the architecture:

from keras.models import Sequential
from keras.layers import Bidirectional, Dense, Embedding
from keras.layers import GlobalMaxPool1D, LSTM
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words,
                    embedding_vector_length,
                    input_length=max_pad))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(GlobalMaxPool1D())
model.add(Dense(16, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

The previous code snippet defines the shape of the deep learning model, which uses a few specialized layers for natural language processing from Keras. The example also requests a summary of the model (the model.summary() command) to show how the different neural layers make up the architecture.

You have the Embedding layer, which transforms the numeric sequences into a dense word embedding. That type of word embedding is more suitable for being learned by a layer of RNNs. Keras provides an Embedding layer, which, apart from necessarily having to be the first layer of the network, can accomplish two tasks:

  • Applying pretrained word embedding (such as Word2vec or GloVe) to the sequence input. You just need to pass the matrix containing the embedding to its parameter weights.
  • Creating a word embedding from scratch, based on the inputs it receives.

In this second case, Embedding just needs to know:

  • input_dim: The size of the vocabulary expected from data
  • output_dim: The size of the embedding space that will be produced (the so-called dimensions)
  • input_length: The sequence size to expect

After you determine the parameters, Embedding will find the best weights to transform the sequences into a dense matrix during training. The dense matrix size is given by the length of sequences and the dimensionality of the embedding.

If you use the Embedding layer provided by Keras, you have to remember that the function provides only a weight matrix of the size of the vocabulary by the dimension of the desired embedding. It maps the words to the rows of the matrix and then tunes the matrix weights to the provided examples. This solution, although practical for nonstandard language problems, is not analogous to the word embeddings discussed previously, which are trained in a different way and on millions of examples.
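Conceptually, an embedding is a lookup table: each word code selects one vector of weights. The sketch below illustrates the idea in plain Python with a tiny vocabulary and randomly initialized weights; the sizes, the weights table, and the embed function are assumptions for illustration, and no training takes place here.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4   # tiny stand-ins for 10,000 and 32

# One vector of embedding_dim weights per word code, randomly initialized.
# In a real network, training would tune these values.
weights = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]

def embed(sequence):
    """Replace each word code with its vector of weights."""
    return [weights[token] for token in sequence]

padded_review = [0, 0, 4, 6, 5]     # hypothetical padded sequence
dense = embed(padded_review)

# The dense matrix is (sequence length) x (embedding dimension).
print(len(dense), "x", len(dense[0]))
```

Identical codes always map to identical vectors, which is why the two padding zeros at the start share the same values.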

The example wraps an LSTM layer of 64 cells in a Bidirectional layer. Bidirectional transforms a normal LSTM layer by doubling it: On the first side, it applies the normal sequence of inputs you provide; on the second, it passes the reverse of the sequence. You use this approach because sometimes you use words in a different order, and building a bidirectional layer will catch any word pattern, no matter the order. The Keras implementation is indeed straightforward: You just apply it as a function on the layer you want to render bidirectional.

The bidirectional LSTM is set to return sequences (return_sequences=True); that is, for each cell, it returns the result provided after seeing each element of the sequence. The result, for each sequence, is an output matrix of 200 x 128, where 200 is the number of sequence elements and 128 is the number of LSTM cells used in the layer (64 for each of the two directions). This technique prevents the RNN from taking only the last result of each LSTM cell. Hints about the sentiment of the text could actually appear anywhere in the embedded words sequence.
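The sketch below illustrates the bidirectional idea with a deliberately simple recurrent function in place of an LSTM. The toy cell (a decaying running sum) and the function names are assumptions for illustration only; what matters is that the sequence is read in both directions and that one result per direction is returned for every sequence element.

```python
def run_toy_rnn(inputs, decay=0.5):
    """A stand-in for a recurrent cell: returns its state after every step."""
    state, outputs = 0.0, []
    for x in inputs:
        state = decay * state + x
        outputs.append(state)
    return outputs

def bidirectional(inputs):
    forward = run_toy_rnn(inputs)
    # Read the sequence backward, then flip the results so that position i
    # in both lists refers to the same input element.
    backward = run_toy_rnn(inputs[::-1])[::-1]
    # One row per sequence element, one column per direction.
    return [[f, b] for f, b in zip(forward, backward)]

signal = [1.0, 0.0, 0.0, 2.0]
for step in bidirectional(signal):
    print(step)
```

With one toy cell per direction, the output has four rows and two columns, just as 64 LSTM cells per direction produce 128 columns for each of the 200 sequence elements in the example.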

In short, it’s important not to take the last result of each cell, but rather its best result. The code therefore relies on the following layer, GlobalMaxPool1D, to check each sequence of results provided by each LSTM cell and retain only the maximum result. That should ensure that the example picks the strongest signal from each LSTM cell, which is hopefully specialized by its training to pick some meaningful signals.
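Global max pooling over a sequence can be sketched in a couple of lines of plain Python. The numbers below are made up, and the global_max_pool_1d helper is a hypothetical stand-in for the Keras layer: each row is one sequence element and each column is the output of one cell.

```python
def global_max_pool_1d(outputs):
    """Keep, for every cell (column), the largest value seen over the sequence."""
    return [max(column) for column in zip(*outputs)]

# Hypothetical outputs: 4 sequence elements (rows) x 3 cells (columns).
outputs = [
    [0.1, -0.3, 0.0],
    [0.9, -0.1, 0.2],
    [0.4, -0.8, 0.7],
    [0.2, -0.2, 0.1],
]

pooled = global_max_pool_1d(outputs)
last_step = outputs[-1]
print("max over the sequence:", pooled)
print("last element only:    ", last_step)
```

Comparing the two printed lines shows the difference: the strongest signals of the first and third cells occur in the middle of the sequence and would be lost if only the last element were kept.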

After the neural signals are filtered, the example has a layer of 128 outputs, one for each LSTM cell. The code reduces and mixes the signals using a successive dense layer of 16 neurons with ReLU activation (thus making only positive signals pass through). The architecture ends with a final node using sigmoid activation, which will squeeze the results into the 0–1 range and make them look like probabilities.
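The two activation functions mentioned here are easy to write down. This is a minimal sketch in plain Python, with made-up input values, of what ReLU and sigmoid do to a single number.

```python
import math

def relu(x):
    """Let positive signals through unchanged and block negative ones."""
    return max(0.0, x)

def sigmoid(x):
    """Squeeze any real number into the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-x))

hidden_inputs = [-1.5, 0.0, 2.0]            # hypothetical pre-activation values
hidden = [relu(v) for v in hidden_inputs]   # negative values become 0.0
print(hidden)

for score in (-4.0, 0.0, 4.0):
    print(score, "->", round(sigmoid(score), 3))
```

Strongly negative scores land near 0, strongly positive scores land near 1, and a score of 0 maps to exactly 0.5, which is why the output can be read like a probability of positive sentiment.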

Having defined the architecture, you can now train the network to perform sentiment analysis. Three epochs (passing the data three times through the network to have it learn the patterns) will suffice. The code uses batches of 256 reviews each time, which allows the network to see enough variety of words and sentiments each time before updating its weights using backpropagation. Finally, the code focuses on the results provided by the validation data (which isn’t part of the training data). Getting a good result from the validation data means the neural net is processing the input correctly. The code reports on validation data just after each epoch finishes.

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=3, batch_size=256)
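A quick back-of-the-envelope calculation shows how many weight updates these settings imply. The sketch assumes one update per batch and that the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

train_examples = 25000   # size of the training set reported earlier
batch_size = 256
epochs = 3

# One weight update per batch; the last batch of an epoch may be smaller.
updates_per_epoch = math.ceil(train_examples / batch_size)
last_batch_size = train_examples - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print("updates per epoch:", updates_per_epoch)
print("size of the last batch:", last_batch_size)
print("total updates:", total_updates)
```

Under these assumptions, the network adjusts its weights fewer than 300 times in total, which helps explain why training is so short.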

Getting the results takes a while, but if you are using a GPU, it will complete in the time you take to drink a cup of coffee. At this point, you can evaluate the results, again using the validation data. (The results shouldn’t have any surprises or differences from what the code reported during training.)

loss, metric = model.evaluate(x_test, y_test, verbose=0)
print("Test accuracy: %0.3f" % metric)

The final accuracy, which is the percentage of correct answers from the deep neural network, will be a value of around 85 to 86 percent. The result will change slightly each time you run the experiment because of randomization when building your neural network. That’s perfectly normal given the small size of the data you are working with. If you happen to start with lucky initial weights, the learning will be easier in such a short training session.
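To see how an accuracy figure comes out of the network's probabilities, consider this small sketch. The probabilities and labels are invented, and the 0.5 cutoff for calling a review positive is an assumption of the sketch.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical sigmoid outputs and true sentiments (0=negative, 1=positive).
probs = [0.91, 0.12, 0.55, 0.40, 0.73, 0.08, 0.66, 0.30]
labels = [1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy: %0.3f" % accuracy(probs, labels))
```

In this toy case, six of the eight predictions match their labels. The two misses are the examples whose probabilities sit closest to the cutoff, which is typical of reviews with mixed or subtle wording.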

In the end, your network is a sentiment analyzer that can guess the sentiment expressed in a movie review correctly about 85 percent of the time. Given even more training data and more sophisticated neural architectures, you can get results that are even more impressive. In marketing, a similar tool is used to automate many processes that require reading text and taking action. Again, you could couple a network like this with a neural network that listens to a voice and turns it into text. (This is another application of RNNs, now powering Alexa, Siri, Google Voice, and many other personal assistants.) The transition allows the application to understand the sentiment even in vocal expressions, such as a phone call from a customer.