How to Search for Individual Words in R - dummies

By Andrie de Vries, Joris Meys

When you’re working with text, you can often solve problems by finding words or patterns inside that text. R makes this easy to do. Imagine you have a list of the states in the United States, and you want to find out which of these states contain the word New.

To investigate this problem, you can use the built-in dataset state.name, which contains (you guessed it) the names of the states of the United States:

> head(state.name)
[1] "Alabama"    "Alaska"     "Arizona"
[4] "Arkansas"   "California" "Colorado"

Broadly speaking, you can find substrings in text in two ways:

  • By position: For example, you can tell R to get three letters starting at position 5.

  • By pattern: For example, you can tell R to get substrings that match a specific word or pattern.

    A pattern is a bit like a wildcard. In some card games, you may use the Joker card to represent any other card. Similarly, a pattern in R can contain words or certain symbols with special meanings.

Search by position in R

If you know the exact position of a subtext inside a text element, you can use the substr() function to return the value. To extract the subtext that starts at the third position and stops at the sixth position of each element of state.name, use the following:

> head(substr(state.name, start=3, stop=6))
[1] "abam" "aska" "izon" "kans" "lifo" "lora"
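
As a small sketch of searching by position, the following combines substr() with nchar() to take the first three and the last three letters of each state name. The variable names first_three, n, and last_three are made up for illustration and do not come from the original text.

```r
# Sketch: extract substrings by position from the built-in state.name vector
first_three <- substr(state.name, start = 1, stop = 3)

# nchar() gives the length of each name, so positions can be counted
# from the end of each string; start and stop are applied element by element
n <- nchar(state.name)
last_three <- substr(state.name, start = n - 2, stop = n)

head(first_three)
head(last_three)
```

Counting from the end with nchar() is only one option; it is used here because the names differ in length, so a fixed start position would not work.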

Search by pattern in R

To find substrings, you can use the grep() function, which takes two essential arguments:

  • pattern: The pattern you want to find.

  • x: The character vector you want to search.

Suppose you want to find all the states that contain the pattern New. Do it like this:

> grep("New", state.name)
[1] 29 30 31 32

The result of grep() is a numeric vector with the positions of the elements that match the pattern. In other words, the 29th element of state.name contains the word New.

> state.name[29]
[1] "New Hampshire"

Phew, that worked! But typing in the position of each matching text is going to be a lot of work. Fortunately, you can use the results of grep() directly to subset the original vector:

> state.name[grep("New", state.name)]
[1] "New Hampshire" "New Jersey"
[3] "New Mexico"    "New York"
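
As an aside, subsetting by position is not the only way to get at the matching names. The sketch below shows two alternatives from base R: the value argument of grep(), which returns the matching elements themselves, and grepl(), which returns a logical vector you can subset with. The variable names are chosen for illustration only.

```r
# Positions of the matching elements, as in the text above
idx <- grep("New", state.name)

# Three ways to get the matching names themselves
by_index   <- state.name[idx]                        # subset by position
by_value   <- grep("New", state.name, value = TRUE)  # ask grep() for the values
by_logical <- state.name[grepl("New", state.name)]   # subset by TRUE/FALSE

by_value
```

All three approaches should give the same four state names; which one reads best depends on whether you also need the positions later.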

The grep() function is case sensitive — it only matches text in the same case (uppercase or lowercase) as your search pattern. If you search for the pattern “new” in lowercase, your search results are empty: