Data Science Programming All-in-One For Dummies
Pattern matching in computers is as old as the computers themselves. In looking at various sources, you can find different starting points for pattern matching, such as editors. However, the fact is that you can’t really do much with a computer system without having some sort of pattern matching occur.

For example, the mere act of stopping certain kinds of loops requires that a computer match a pattern between the existing state of a variable and the desired state. Likewise, user input requires that the application match the user’s input to a set of acceptable inputs.


Using pattern matching in data analysis

Developers recognize that function declarations also form a kind of pattern and that to call the function successfully, the caller must match the pattern. Sending the wrong number or types of variables as part of the function call causes the call to fail. Data structures also form a kind of pattern because the data must appear in a certain order and be of a specific type.

Where you choose to set the beginning for pattern matching depends on how you interpret the act. Certainly, pattern matching isn’t the same as counting, as in a for loop in an application. However, someone could argue that testing for a condition in a while loop matches the definition of pattern matching to some extent.

Many people look at editors as the first use of pattern matching because editors were the first kinds of applications to use pattern matching to perform a search, such as to locate a name in a document. Searching is most definitely part of the act of analysis because you must find the data before you can do anything with it.

The act of searching is just one aspect, however, of a broader application of pattern matching in analysis. The act of filtering data also requires pattern matching. A search is a singular approach to pattern matching in that the search succeeds the moment that the application locates a match.

Filtering is a batch process that accepts all the matches in a document and discards anything that doesn’t match, enabling you to see all the matches without doing anything else. Filtering can also vary from searching in that searching generally employs static conditions, while filtering can employ some level of dynamic condition, such as locating the members of a set or finding a value within a given range.
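The difference between searching and filtering can be sketched in a few lines of Python using the re library that the article covers further on. This is only an illustration: the sample lines, the name Ann, and the order-number range are made up for the example.

```python
import re

# Hypothetical sample data, invented for this illustration.
lines = [
    "Order 17 shipped to Ann",
    "Order 42 shipped to Bob",
    "Refund issued to Ann",
    "Order 99 shipped to Ann",
]

name = re.compile(r"\bAnn\b")

# Search: succeed (and stop) at the first line that matches.
first_hit = next((line for line in lines if name.search(line)), None)

# Filter: keep every line that matches and discard the rest.
all_hits = [line for line in lines if name.search(line)]

# Filter with a range condition: order numbers from 20 through 100.
order_number = re.compile(r"Order (\d+)")
in_range = [
    line for line in lines
    if (m := order_number.search(line)) and 20 <= int(m.group(1)) <= 100
]

print(first_hit)
print(all_hits)
print(in_range)
```

The search yields a single line, while both filters return every line that satisfies their condition.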

Filtering is the basis for many of the analysis features in declarative languages, such as SQL, when you want to locate all the instances of a particular data structure (a record) in a large data store (the database). Filtering in SQL is much more dynamic than simple filtering because you can apply conditional sets and limited algorithms to the process of locating particular data elements.

Regular expressions, although not the most advanced of modern pattern-matching techniques, offer a good view of how pattern matching works in modern applications. You can check for ranges and conditional situations, and you can even apply a certain level of dynamic control. Even so, the current master of pattern matching is the algorithm, which can be fully dynamic and incredibly responsive to particular conditions.

Working with pattern matching

Pattern matching in Python closely matches the functionality found in many other languages. Python provides robust pattern-matching capabilities through the regular expression (re) library, and the official documentation for that library offers a good overview of them. The sections below detail Python functionality using a number of examples.

Performing simple Python matches

All the functionality you need for employing Python in basic RegEx tasks appears in the re library. The following code shows how to use this library:
import re
vowels = "[aeiou]"
print(re.search(vowels, "This is a test sentence.").group())
The search() function locates only the first match, so you see the letter i as output because it’s the first vowel that appears in the sentence. You need the group() function call to output an actual value because search() returns a match object.

When you look at the Python documentation, you find quite a few functions devoted to working with regular expressions, some of them not entirely clear in their purpose. For example, you have a choice between performing a search or a match. A match works only at the beginning of a string. Consequently, this code:

print(re.match(vowels, "This is a test sentence."))
returns a value of None because none of the vowels appears at the beginning of the sentence. However, this code:
print(re.match("a", "abcde").group())
returns a value of a because the letter a appears at the beginning of the test string.

Neither search nor match will locate all occurrences of the pattern in the target string. To locate all the matches, you use findall or finditer instead. For example, this code:

print(re.findall(vowels, "This is a test sentence."))
returns a list like this:
['i', 'i', 'a', 'e', 'e', 'e', 'e']
Because this is a list, you can manipulate it as you would any other list.
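As a small sketch of what that list manipulation might look like, the following code counts the matches and tallies each vowel. The use of collections.Counter is just one convenient option chosen for this illustration, not something the original example requires.

```python
import re
from collections import Counter

vowels = "[aeiou]"
matches = re.findall(vowels, "This is a test sentence.")

# Ordinary list operations work on the result.
total = len(matches)
distinct = sorted(set(matches))

# Counter tallies how often each vowel was matched.
counts = Counter(matches)

print(total)
print(distinct)
print(counts.most_common(1))
```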

Match objects are useful in other ways. For example, you can create a more complete search by using the start() and end() functions, as shown in the following code:

testSentence = "This is a test sentence."
m = re.search(vowels, testSentence)
while m:
    print(testSentence[m.start():m.end()])
    testSentence = testSentence[m.end():]
    m = re.search(vowels, testSentence)
This code keeps performing searches on the remainder of the sentence after each search until it no longer finds a match, as shown here:
i
i
a
e
e
e
e
Using the finditer() function would be easier, but this code points out that Python does provide everything needed to create relatively complex pattern-matching code.
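For comparison, here is a minimal sketch of the same task using finditer(), which yields one match object per match without the need to slice the string. The variable names are chosen for this illustration, and the positions are offsets into the original, unsliced sentence.

```python
import re

vowels = "[aeiou]"
test_sentence = "This is a test sentence."

# finditer() returns an iterator of match objects, one per match.
found = [(m.start(), m.group()) for m in re.finditer(vowels, test_sentence)]

for position, letter in found:
    print(position, letter)
```

Because the sentence is never modified, each start() value refers to the vowel's position in the full string, which the while loop version does not preserve.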

Doing more than pattern matching

Python’s regular expression library makes it quite easy to perform a wide variety of tasks that don’t strictly fall into the category of pattern matching. One of the most commonly used is splitting strings. For example, you might use the following code to split a test string using a number of whitespace characters:
testString = "This is\ta test string.\nYippee!"
whiteSpace = r"[\s]"  # raw string keeps the backslash intact for the re library
print(re.split(whiteSpace, testString))
The escape sequence \s stands for any whitespace character, which includes the set [ \t\n\r\f\v]. The split() function can split content on any pattern that a regular expression can describe, so it’s an extremely powerful data manipulation function. The output from this example looks like this:
['This', 'is', 'a', 'test', 'string.', 'Yippee!']
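The split example above works cleanly because each gap contains exactly one whitespace character. When the input may contain runs of whitespace, splitting on a single character leaves empty strings in the result. The following sketch, which uses a made-up test string, compares that behavior with the \s+ pattern, which treats each run as one separator.

```python
import re

# Hypothetical input with doubled spaces, tabs, and newlines.
messy = "This  is\t\ta test\n\nstring."

# One whitespace character per split: runs produce empty strings.
single = re.split(r"\s", messy)

# One or more whitespace characters per split: runs collapse.
runs = re.split(r"\s+", messy)

print(single)
print(runs)
```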
Performing substitutions using the sub() function is another forte of Python. Rather than perform common substitutions one at a time, you can perform them all simultaneously, as long as the replacement value is the same in all cases. Consider the following code:
testString = "Stan says hello to Margot from Estoria."
pattern = "Stan|hello|Margot|Estoria"
replace = "Unknown"
print(re.sub(pattern, replace, testString))
The output of this example is
Unknown says Unknown to Unknown from Unknown.
You can create a pattern of any complexity and use a single replacement value to represent each match. This is handy when performing certain kinds of data manipulation for tasks such as dataset cleanup prior to analysis.
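When the matches need different replacement values, one possible approach is to pass a function to sub() instead of a string, because sub() also accepts a callable that receives each match object. The sketch below assumes a hypothetical lookup table; the labels PERSON, PLACE, and GREETING are invented for the illustration.

```python
import re

test_string = "Stan says hello to Margot from Estoria."

# Hypothetical mapping from matched text to its replacement.
replacements = {
    "Stan": "PERSON",
    "Margot": "PERSON",
    "Estoria": "PLACE",
    "hello": "GREETING",
}

# Build an alternation from the keys, escaping any special characters.
pattern = "|".join(re.escape(key) for key in replacements)

# The lambda looks up the replacement for whatever text was matched.
result = re.sub(pattern, lambda m: replacements[m.group()], test_string)

print(result)
```

The whole string is still processed in a single pass, as in the single-value example.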

About This Article

This article is from the book:

About the book authors:

John Mueller has published more than 100 books on technology, data, and programming. John has a website and blog where he writes articles on technology and offers assistance alongside his published books.

Luca Massaron is a data scientist specializing in insurance and finance. A Google Developer Expert in machine learning, he has been involved in quantitative analysis and algorithms since 2000.

