How to Search for Individual Words in R

When you’re working with text, you can often solve a problem by finding words or patterns inside that text, and R has built-in functions for doing so. Imagine you have a list of the states in the United States, and you want to find out which of these states contain the word New.

To investigate this problem, you can use the built-in dataset state.name, which contains (you guessed it) the names of the states of the United States:

> head(state.name)
[1] "Alabama"    "Alaska"     "Arizona"
[4] "Arkansas"   "California" "Colorado"

Broadly speaking, you can find substrings in text in two ways:

  • By position: For example, you can tell R to get three letters starting at position 5.

  • By pattern: For example, you can tell R to get substrings that match a specific word or pattern.

    A pattern is a bit like a wildcard. In some card games, you may use the Joker card to represent any other card. Similarly, a pattern in R can contain words or certain symbols with special meanings.
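As a small illustration of symbols with special meanings, the sketch below uses two common regular-expression symbols with grep(), the function described later in this article. The variable names starts_with_a and new_then_e are made up for this example.

```r
# "^" anchors the pattern to the start of the text,
# so "^A" matches only names that begin with a capital A.
starts_with_a <- grep("^A", state.name, value = TRUE)
print(starts_with_a)
# [1] "Alabama"  "Alaska"   "Arizona"  "Arkansas"

# "." behaves like the Joker card: it stands for any single character.
# "New .e" matches "New ", then any one character, then an "e".
new_then_e <- grep("New .e", state.name, value = TRUE)
print(new_then_e)
# [1] "New Jersey" "New Mexico"
```

Here value = TRUE asks grep() to return the matching names themselves rather than their positions.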

Search by position in R

If you know the exact position of a substring inside a text element, you can use the substr() function to extract it. To extract the substring that starts at the third position and stops at the sixth position of each element of state.name, use the following:

> head(substr(state.name, start=3, stop=6))
[1] "abam" "aska" "izon" "kans" "lifo" "lora"
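The earlier example of getting three letters starting at position 5 can be sketched the same way. The helper variables first_pos, n_chars, and three_letters are illustrative names, not part of R.

```r
# Extract n_chars characters beginning at first_pos.
first_pos <- 5
n_chars <- 3

# substr() takes a start and a stop position, so the stop is
# first_pos + n_chars - 1.
three_letters <- substr(state.name, start = first_pos, stop = first_pos + n_chars - 1)

print(head(three_letters, 3))
# [1] "ama" "ka"  "ona"
```

Note that "Alaska" has only six letters, so substr() returns just the two characters that exist from position 5 onward instead of raising an error.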

Search by pattern in R

To find the elements of a character vector that contain a given pattern, you can use the grep() function, which takes two essential arguments:

  • pattern: The pattern you want to find.

  • x: The character vector you want to search.

Suppose you want to find all the states that contain the pattern New. Do it like this:

> grep("New", state.name)
[1] 29 30 31 32

The result of grep() is a numeric vector with the positions of each of the elements that contain the matching pattern. In other words, the 29th element of state.name contains the word New.

> state.name[29]
[1] "New Hampshire"

Phew, that worked! But typing in the position of each matching text is going to be a lot of work. Fortunately, you can use the results of grep() directly to subset the original vector:

> state.name[grep("New", state.name)]
[1] "New Hampshire" "New Jersey"
[3] "New Mexico"    "New York"
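Two closely related variations may be handy here. The sketch below shows the value argument of grep() and the companion function grepl(), both part of base R. The variable names new_states and is_new are made up for this example.

```r
# value = TRUE returns the matching elements instead of their positions,
# which saves the separate subsetting step.
new_states <- grep("New", state.name, value = TRUE)
print(new_states)

# grepl() returns one TRUE/FALSE per element, which is convenient
# for counting matches or for subsetting with a logical vector.
is_new <- grepl("New", state.name)
print(sum(is_new))
# [1] 4
print(state.name[is_new])
```

All three approaches (subsetting with positions, value = TRUE, and a logical vector from grepl()) select the same four states.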

The grep() function is case sensitive — it only matches text in the same case (uppercase or lowercase) as your search pattern. If you search for the pattern "new" in lowercase, your search results are empty:

> state.name[grep("new", state.name)]
character(0)
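If you want the search to ignore case, one option is the ignore.case argument of grep(). The following is a minimal sketch; the variable names any_case and via_lower are illustrative only.

```r
# ignore.case = TRUE makes "new" match "New", "NEW", and so on.
any_case <- grep("new", state.name, ignore.case = TRUE, value = TRUE)
print(any_case)
# [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"

# An alternative is to convert the text to lowercase before searching,
# then use the resulting positions to subset the original vector.
via_lower <- state.name[grep("new", tolower(state.name))]
print(via_lower)
```

Both approaches return the original, correctly capitalized state names, because only the comparison ignores case.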