Deep Learning and Natural Language Processing - dummies

By John Paul Mueller, Luca Massaron

As a simplification, you can view language as a sequence of words made of letters (as well as punctuation marks, symbols, emoticons, and so on). Deep learning processes language best by using layers of RNNs, such as LSTM or GRU. However, knowing to use RNNs doesn’t tell you how to use sequences as inputs; you need to determine the kind of sequences. In fact, deep learning networks accept only numeric input values. Computers encode letter sequences that you understand into numbers according to a protocol, such as Unicode Transformation Format-8 bit (UTF-8). UTF-8 is the most widely used encoding.
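To make the letters-to-numbers step concrete, here is a minimal sketch using only Python's built-in string encoding. The sample words are arbitrary choices for illustration, not part of the example used later.

```python
# Encode a word into UTF-8; each byte is an integer between 0 and 255.
word = "dog"
utf8_bytes = list(word.encode("utf-8"))
print(utf8_bytes)  # [100, 111, 103]

# Characters outside the ASCII range take more than one byte in UTF-8.
accented = "caf\u00e9"  # the word "café", written with an escape for the é
accented_bytes = list(accented.encode("utf-8"))
print(len(accented), len(accented_bytes))  # 4 5
```

These byte values are only the storage format. A neural network still needs a representation built on tokens, which is what the rest of the pipeline provides.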

Deep learning can also process textual data using Convolutional Neural Networks (CNNs) instead of RNNs by representing sequences as matrices (similar to image processing). Keras supports CNN layers, such as Conv1D, which can operate on ordered features in time (that is, sequences of words or other signals). The 1D convolution output is usually followed by a MaxPooling1D layer that summarizes the outputs. CNNs applied to sequences are limited by their insensitivity to the global order of the sequence. (They tend to spot local patterns.) For this reason, they're best used in sequence processing in combination with RNNs, not as their replacement.
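The following plain-Python sketch illustrates what a 1D convolution followed by max pooling computes. It is not the Keras implementation: the function names, the signal, and the kernel are hypothetical, and a real Conv1D layer learns its kernel values during training.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (no padding, stride of 1)."""
    width = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

def max_pool1d(values, pool_size=2):
    """Keep the largest value in each non-overlapping window."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]

signal = [0, 0, 1, 2, 1, 0, 0, 0]
kernel = [1, 2, 1]  # hypothetical filter that responds to a small bump

features = conv1d(signal, kernel)
pooled = max_pool1d(features)
print(features)  # [1, 4, 6, 4, 1, 0]
print(pooled)    # [4, 6, 1]
```

The kernel sees only three neighboring values at a time, which is why this kind of layer spots local patterns rather than the overall order of the sequence.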

Natural Language Processing (NLP) consists of a series of procedures that improve the processing of words and phrases for statistical analysis, machine learning algorithms, and deep learning. NLP owes its roots to computational linguistics that powered AI rule-based systems, such as expert systems, which made decisions based on a computer translation of human knowledge, experience, and way of thinking. NLP digested textual information, which is unstructured, into more structured data so that expert systems could easily manipulate and evaluate it.

Deep learning has taken the upper hand today, and expert systems are limited to specific applications in which interpretability and control of decision processes are paramount (for instance, in medical applications and driving behavior decision systems on some self-driving cars). Yet, the NLP pipeline is still quite relevant for many deep learning applications.

Natural Language Processing: Defining understanding as tokenization

In an NLP pipeline, the first step is to obtain raw text. Usually you store it in memory or access it from disk. When the data is too large to fit in memory, you maintain a pointer to it on disk (such as the directory name and the filename). In the following example, you use three documents (represented by string variables) stored in a list (the document container is called the corpus in natural language processing):

import numpy as np

# The corpus: three short documents stored in a list
texts = ["My dog gets along with cats",
         "That cat is vicious",
         "My dog is happy when it is lunch"]

After obtaining the text, you process it. As you process each phrase, you extract the relevant features from the text (you usually create a bag-of-words matrix) and pass everything to a learning model, such as a deep learning algorithm. During text processing, you can use different transformations to manipulate the text (with tokenization being the only mandatory transformation):

  • Normalization: Remove capitalization.
  • Cleaning: Remove nontextual elements such as punctuation and numbers.
  • Tokenization: Split a sentence into individual words.
  • Stop word removal: Remove common, uninformative words that don’t add meaning to the sentence, such as the articles the and a. Removing negations such as not could be detrimental if you want to guess the sentiment.
  • Stemming: Reduce a word to its stem (which is the word form before adding inflectional affixes). An algorithm, called a stemmer, can do this based on a series of rules.
  • Lemmatization: Transform a word into its dictionary form (the lemma). It’s an alternative to stemming, but it’s more complex because you don’t use an algorithm. Instead, you use a dictionary to convert every word into its lemma.
  • POS-tagging: Tag every word in a phrase with its grammatical role in the sentence (such as tagging a word as a verb or as a noun). POS stands for part of speech.
  • N-grams: Associate every word with a certain number of following words (the n in n-gram) and treat them as a unique set. Usually, bi-grams (a series of two adjacent elements or tokens) and tri-grams (a series of three adjacent elements or tokens) work the best for analysis purposes.
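Several of the transformations in this list can be sketched with the Python standard library alone. The code below is a simplified illustration, not a replacement for a dedicated package: the function names are hypothetical, and the stop word list is a tiny hand-picked assumption.

```python
import string

# Tiny illustrative stop word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "is", "it"}

def normalize(text):
    """Normalization: remove capitalization."""
    return text.lower()

def clean(text):
    """Cleaning: drop punctuation marks and digits."""
    return "".join(ch for ch in text
                   if ch not in string.punctuation and not ch.isdigit())

def tokenize(text):
    """Tokenization: split a sentence into individual words."""
    return text.split()

def remove_stop_words(tokens):
    """Stop word removal: keep only the words not in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]

def ngrams(tokens, n=2):
    """N-grams: group every token with the n - 1 tokens that follow it."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize(clean(normalize("My dog is happy when it is lunch!")))
content_words = remove_stop_words(tokens)
bigrams = ngrams(content_words, n=2)

print(tokens)         # ['my', 'dog', 'is', 'happy', 'when', 'it', 'is', 'lunch']
print(content_words)  # ['my', 'dog', 'happy', 'when', 'lunch']
print(bigrams)        # [('my', 'dog'), ('dog', 'happy'), ('happy', 'when'), ('when', 'lunch')]
```

Stemming, lemmatization, and POS-tagging are left out because they rely on language-specific rules or dictionaries.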

To achieve these transformations, you may need a specialized Python package such as NLTK or scikit-learn. When working with deep learning and a large number of examples, you need only basic transformations: normalization, cleaning, and tokenization. The deep learning layers can determine what information to extract and process. When working with few examples, you do need to provide as much NLP processing as possible to help the deep learning network determine what to do in spite of the little guidance provided by the few examples.

Keras offers a class, keras.preprocessing.text.Tokenizer, that normalizes (using the lower parameter set to True), cleans (the filters parameter contains a string of the characters to remove, usually these: '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n'), and tokenizes (splitting the text on the character given by the split parameter, a space by default).
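To see what those three parameters do together, here is a plain-Python approximation. The function name to_word_sequence is hypothetical and the code only mimics the behavior described above, so treat it as a sketch rather than the Keras source.

```python
# The filter characters quoted above, written as a Python string literal
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase the text, turn filtered characters into separators, then split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Discard the empty strings produced by consecutive separators
    return [token for token in text.split(split) if token]

print(to_word_sequence("That cat is vicious!"))
# ['that', 'cat', 'is', 'vicious']
print(to_word_sequence("Well-known, isn't it?"))
# ['well', 'known', "isn't", 'it']
```

Note that the apostrophe isn't in the filter string, so a contraction such as isn't survives as a single token.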

Natural Language Processing: Putting all the documents into a bag

After processing the text, you have to extract the relevant features, which means transforming the remaining text into numeric information for the neural network to process. This is commonly done using the bag-of-words approach, which is obtained by frequency encoding or binary encoding the text. This process equates to transforming each word into a column of a matrix that is as wide as the number of words you need to represent. The following example shows how to achieve this process and what it implies. As a first step, you prepare a basic normalization and tokenization using a few Python commands to determine the word vocabulary size for processing:

# Lowercase every word and keep one copy of each
unique_words = set(word.lower() for phrase in texts
                   for word in phrase.split(" "))
print(f"There are {len(unique_words)} unique words")

The code reports 14 words. You now proceed to load the Tokenizer class from Keras and set it to process the text by providing the expected vocabulary size:

from keras.preprocessing.text import Tokenizer
vocabulary_size = len(unique_words) + 1
tokenizer = Tokenizer(num_words=vocabulary_size)

Using a vocabulary_size that's too small may exclude important words from the learning process. One that's too large may uselessly consume computer memory. You need to provide Tokenizer with a correct estimate of the number of distinct words contained in the list of texts. You also always add 1 to the vocabulary_size because Tokenizer numbers words starting from 1 and reserves index 0, which is never assigned to a word (it's conventionally used for padding). When you set num_words, Tokenizer keeps only the words whose index is lower than num_words, so without the extra slot, the least frequent word would be dropped. At this point, Tokenizer maps the words present in the texts to indexes, which are numeric values representing the words in text:

tokenizer.fit_on_texts(texts)
print(tokenizer.index_word)

The resulting indexes are as follows:

{1: 'is', 2: 'my', 3: 'dog', 4: 'gets', 5: 'along',
6: 'with', 7: 'cats', 8: 'that', 9: 'cat', 10: 'vicious',
11: 'happy', 12: 'when', 13: 'it', 14: 'lunch'}
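The ordering of these indexes follows word frequency: is appears three times in the corpus, so it gets index 1. The sketch below reproduces that numbering with the standard library and shows why the extra slot matters. The helper kept_words is hypothetical and only imitates a cutoff that keeps indexes below num_words.

```python
from collections import Counter

texts = ["My dog gets along with cats",
         "That cat is vicious",
         "My dog is happy when it is lunch"]

# Count every lowercased word across the corpus
counts = Counter(word for phrase in texts for word in phrase.lower().split())

# Most frequent words get the lowest indexes; numbering starts at 1, not 0
word_index = {word: i
              for i, (word, _) in enumerate(counts.most_common(), start=1)}

def kept_words(word_index, num_words):
    """Words that survive a cutoff keeping only indexes below num_words."""
    return {word for word, i in word_index.items() if i < num_words}

print(word_index["is"], word_index["lunch"])   # 1 14
print("lunch" in kept_words(word_index, 14))   # False
print("lunch" in kept_words(word_index, 15))   # True
```

With a cutoff of 14 (the raw vocabulary size), the word at index 14 is lost. Adding 1 keeps all 14 words.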

The indexes represent the column number that houses the word information:

print(tokenizer.texts_to_matrix(texts))

Here’s the resulting matrix:

[[0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
[0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]]

The matrix consists of 15 columns (14 words plus the extra column at index 0, which Keras reserves and never assigns to a word) and three rows, representing the three processed texts. This is the text matrix to process using a shallow neural network (RNNs require a different format, as discussed later), and its shape is always the number of texts by vocabulary_size. By default, texts_to_matrix uses binary encoding, so each value marks whether a word is present in the phrase: The word is appears twice in the third phrase, yet its column (index 1) holds a 1. This isn't the only representation possible, though. Here are the main encodings:

  • Frequency encoding: Counts the number of word appearances in the phrase (mode='count' in texts_to_matrix).
  • One-hot encoding or binary encoding: Notes the presence of a word in a phrase, no matter how many times it appears (mode='binary', the default).
  • Term Frequency-Inverse Document Frequency (TF-IDF) score: Weighs how often a word appears in a document against how many documents in the corpus contain that word (mode='tfidf'). (Words with higher scores are more distinctive; words with lower scores are less informative.)
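The difference between binary and frequency encoding is easy to check by hand. The following sketch rebuilds both matrices in plain Python, reusing the word order from the index printed earlier. The function to_matrix is a hypothetical stand-in, not the Keras method.

```python
texts = ["My dog gets along with cats",
         "That cat is vicious",
         "My dog is happy when it is lunch"]

# Word order copied from the index shown earlier (position 1 is 'is')
vocabulary = ["is", "my", "dog", "gets", "along", "with", "cats", "that",
              "cat", "vicious", "happy", "when", "it", "lunch"]
word_index = {word: i for i, word in enumerate(vocabulary, start=1)}

def to_matrix(texts, word_index, mode="binary"):
    """Build one row per text; column 0 stays empty because no word uses it."""
    width = len(word_index) + 1
    matrix = []
    for phrase in texts:
        row = [0] * width
        for word in phrase.lower().split():
            if mode == "count":
                row[word_index[word]] += 1   # frequency encoding
            else:
                row[word_index[word]] = 1    # binary encoding
        matrix.append(row)
    return matrix

binary = to_matrix(texts, word_index, mode="binary")
count = to_matrix(texts, word_index, mode="count")

# 'is' (column 1) appears twice in the third text
print(binary[2][1], count[2][1])  # 1 2
```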

You can use the TF-IDF transformation from Keras directly. The Tokenizer offers a method, texts_to_matrix, that by default encodes your text and transforms it into a matrix in which the columns are your words, the rows are your texts, and the values mark the presence of each word within a text (binary encoding). If you apply the transformation by specifying mode='tfidf', the transformation uses TF-IDF scores instead of binary values to fill the matrix:

print(np.round(tokenizer.texts_to_matrix(texts, mode='tfidf'), 1))
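To get a feeling for why distinctive words score higher, here is a sketch of a common textbook TF-IDF variant. It is an assumption made for illustration: Keras applies its own smoothing, so the numbers it prints differ from these, but the ranking idea is the same.

```python
import math

texts = ["My dog gets along with cats",
         "That cat is vicious",
         "My dog is happy when it is lunch"]
documents = [phrase.lower().split() for phrase in texts]

def tf_idf(word, document, documents):
    """Textbook variant: term frequency times log inverse document frequency.

    The word must appear in at least one document of the corpus.
    """
    tf = document.count(word) / len(document)
    df = sum(1 for doc in documents if word in doc)
    idf = math.log(len(documents) / df)
    return tf * idf

first = documents[0]
# 'cats' appears in one document, 'my' in two, so 'cats' is more distinctive
print(round(tf_idf("cats", first, documents), 3))  # 0.183
print(round(tf_idf("my", first, documents), 3))    # 0.068
```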

Note that by using a matrix representation, no matter whether you use binary, frequency, or the more sophisticated TF-IDF, you have lost any sense of word ordering that exists in the phrase. During processing, the words scatter in different columns, and the neural network can’t guess the word order in a phrase. This lack of order is why you call it a bag-of-words approach.

The bag-of-words approach is used in many machine learning algorithms, often with results ranging from good to fair, and you can apply it to a neural network using dense architecture layers. Transformations of words encoded into n-grams (discussed earlier as an NLP processing transformation) provide some more information, but again, you can't relate the words across the phrase.

RNNs keep track of sequences, so they still use one-hot encoding, but they don't encode the entire phrase at once. Instead, they individually encode each token (which could be a word, a character, or even a bunch of characters). For this reason, they expect a sequence of indexes representing the phrase:

print(tokenizer.texts_to_sequences(texts))

As each phrase passes to a neural network input as a sequence of index numbers, the number is turned into a one-hot encoded vector. The one-hot encoded vectors are then fed into the RNN’s layers one at a time, making them easy to learn. For instance, here’s the transformation of the first phrase in the matrix:

[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]

In this representation, you get a distinct matrix for each piece of text. Each matrix represents the individual texts as distinct words using columns, but now the rows represent the word appearance order. (The first row is the first word, the second row is the second word, and so on.)
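The two steps (phrase to index sequence, then index sequence to one-hot rows) can be sketched in plain Python as follows. The helper names are hypothetical, and the word order is copied from the index printed earlier.

```python
# Word order copied from the index shown earlier (position 1 is 'is')
vocabulary = ["is", "my", "dog", "gets", "along", "with", "cats", "that",
              "cat", "vicious", "happy", "when", "it", "lunch"]
word_index = {word: i for i, word in enumerate(vocabulary, start=1)}

def text_to_sequence(phrase):
    """Replace every word with its index, preserving word order."""
    return [word_index[word] for word in phrase.lower().split()]

def one_hot(sequence, width):
    """One row per token, with a single 1 in the column of the token's index."""
    return [[1 if column == index else 0 for column in range(width)]
            for index in sequence]

sequence = text_to_sequence("My dog gets along with cats")
matrix = one_hot(sequence, width=len(word_index) + 1)

print(sequence)            # [2, 3, 4, 5, 6, 7]
print(len(matrix))         # 6
print(matrix[0].index(1))  # 2
```

Unlike the bag-of-words row, this matrix has as many rows as the phrase has tokens, so phrases of different lengths produce matrices of different heights.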

Using this basic approach, data scientists are able to use deep learning for Natural Language Processing.