How to Split Strings in R - dummies

By Andrie de Vries, Joris Meys

A collection of combined letters and words is called a string. Whenever you work with text, you need to be able to concatenate words (string them together) and split them apart. In R, you use the paste() function to concatenate and the strsplit() function to split. In this section, we show you how to use both functions.

First, create a character vector called pangram, and assign it the value “The quick brown fox jumps over the lazy dog”, as follows:

> pangram <- "The quick brown fox jumps over the lazy dog"
> pangram
[1] "The quick brown fox jumps over the lazy dog"

To split this text at the word boundaries (spaces), you can use strsplit() as follows:

> strsplit(pangram, " ")
[[1]]
[1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"

Notice that the unusual first line of strsplit()’s output consists of [[1]]. Similar to the way that R displays vectors, [[1]] means that R is showing the first element of a list. Lists are extremely important concepts in R; they allow you to combine all kinds of variables.

In the preceding example, this list has only a single element. Yes, that’s right: The list has one element, but that element is a vector.
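The reason for the list becomes clearer when you split more than one string at a time: strsplit() returns one list element per input string. The following is a minimal sketch; the vector sentences and its contents are made up for illustration and are not part of the original example.

```r
# Hypothetical input: a character vector holding two strings
sentences <- c("The quick brown fox", "jumps over the lazy dog")

# strsplit() returns a list with one element per input string
pieces <- strsplit(sentences, " ")

length(pieces)    # 2: one list element for each string
pieces[[2]]       # the words of the second string
lengths(pieces)   # 4 5: the number of words in each string
```

With a single string such as pangram, the same rule applies, so you get a list of length one.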

To extract an element from a list, you have to use double square brackets. Split your pangram into words, and assign the first element to a new variable called words, using double-square-brackets ([[]]) subsetting, as follows:

> words <- strsplit(pangram, " ")[[1]]
> words
[1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"
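To see why double square brackets matter here, you can compare them with single square brackets. This short sketch uses a helper variable named split_result (an illustrative name, not from the original text) and also shows that paste() with the collapse argument can reverse the split.

```r
pangram <- "The quick brown fox jumps over the lazy dog"
split_result <- strsplit(pangram, " ")

# Single brackets keep the list wrapper; double brackets return the vector
class(split_result[1])     # "list"
class(split_result[[1]])   # "character"

words <- split_result[[1]]
length(words)              # 9

# paste() with collapse joins the words back into a single string
rejoined <- paste(words, collapse = " ")
identical(rejoined, pangram)   # TRUE
```

Because words is a plain character vector rather than a list, you can pass it directly to functions such as paste(), tolower(), and unique().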

To find the unique elements of a vector, including a vector of text, you use the unique() function. In the variable words, “the” appears twice: once in lowercase and once with the first letter capitalized. Because unique() treats “The” and “the” as different values, first convert words to lowercase with tolower() and then apply unique() to get a vector of the unique words:

> unique(tolower(words))
[1] "the"   "quick" "brown" "fox"   "jumps" "over"  "lazy"  "dog"
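If you want to check the effect of tolower() for yourself, the following sketch compares the two approaches. It rebuilds words by hand so that it runs on its own, and it uses duplicated(), a base R function not covered in the text above, to locate the repeated word.

```r
words <- c("The", "quick", "brown", "fox", "jumps",
           "over", "the", "lazy", "dog")

# Without tolower(), "The" and "the" count as two different words
length(unique(words))            # 9

# After converting to lowercase, the repeat is dropped
length(unique(tolower(words)))   # 8

# duplicated() flags the second occurrence of a repeated value
which(duplicated(tolower(words)))   # 7
```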