1.7 Text Processing and Regular Expressions
The learning objectives for this section are to:
- Transform non-tidy data into tidy data
- Manipulate and transform a variety of data types, including dates, times, and text data
Most common types of data are encoded in text, even if that text is representing numerical values, so being able to manipulate text as a software developer is essential. R provides several built-in tools for manipulating text, and there is a rich ecosystem of R packages for text-based analysis. First, let’s concentrate on some basic text manipulation functions.
1.7.1 Text Manipulation Functions in R
Text in R is represented as a string object, which looks like a phrase
surrounded by quotation marks in the R console. For example, "Hello!" and
'Strings are fun!' are both strings. You can tell whether an object is a
string using the is.character() function. Strings are also known as characters
in R.
You can combine several strings using the paste() function:
paste("Square", "Circle", "Triangle")
[1] "Square Circle Triangle"
By default the paste() function inserts a space between each word. You can
insert a different string between each word by specifying the sep argument:
paste("Square", "Circle", "Triangle", sep = "+")
[1] "Square+Circle+Triangle"
A shortcut for combining all of the string arguments without any characters
in between each of them is to use the paste0() function:
paste0("Square", "Circle", "Triangle")
[1] "SquareCircleTriangle"
You can also provide a vector of strings as an argument to paste(). For
example:
shapes <- c("Square", "Circle", "Triangle")
paste("My favorite shape is a", shapes)
[1] "My favorite shape is a Square"   "My favorite shape is a Circle"
[3] "My favorite shape is a Triangle"

two_cities <- c("best", "worst")
paste("It was the", two_cities, "of times.")
[1] "It was the best of times."  "It was the worst of times."
As you can see, paste() is vectorized: when you provide a vector of strings as
an argument, the single-string arguments are recycled and combined with each
element of the vector in turn. You can also collapse all of the elements of a
vector of strings into a single string by specifying the collapse argument:
paste(shapes, collapse = " ")
[1] "Square Circle Triangle"
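One detail worth seeing in code: when paste() receives two vectors of the same length, it pairs their elements position by position rather than producing every pairing. The sketch below uses two made-up vectors (sizes and forms are hypothetical names, not from the examples above), and outer() is shown only as one possible way to build every combination if that is what you need.

```r
# Hypothetical example vectors, chosen only for illustration
sizes <- c("small", "large")
forms <- c("Square", "Circle")

# Element-wise pairing: first with first, second with second
paired <- paste(sizes, forms)
paired
# [1] "small Square" "large Circle"

# Every combination, built with outer() and flattened into a vector
all_pairs <- as.vector(outer(sizes, forms, paste))
all_pairs
# [1] "small Square" "large Square" "small Circle" "large Circle"
```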
Besides pasting strings together, there are a few other basic string
manipulation functions you should be aware of. The nchar() function
counts the number of characters in a string:
nchar("Supercalifragilisticexpialidocious")
[1] 34
The toupper() and tolower() functions make strings all uppercase or
lowercase respectively:
cases <- c("CAPS", "low", "Title")
tolower(cases)
[1] "caps"  "low"   "title"
toupper(cases)
[1] "CAPS"  "LOW"   "TITLE"
1.7.2 Regular Expressions
Now that we’ve covered the basics of string manipulation in R, let’s discuss the more advanced topic of regular expressions. A regular expression is a string that defines a pattern that could be contained within another string. A regular expression can be used for searching for a string, searching within a string, or replacing one part of a string with another string. In this section I might refer to a regular expression as a regex; just know that they’re the same thing.
Regular expressions use characters to define patterns of other characters. Although that approach may seem problematic at first, we’ll discuss metacharacters (characters that describe other characters) and how you can use them to create powerful regular expressions.
One of the most basic functions in R that uses regular expressions is the
grepl() function, which takes two arguments: a regular expression and a
string to be searched. If the string contains the specified regular expression
then grepl() will return TRUE, otherwise it will return FALSE. Let’s take
a look at one example:
regular_expression <- "a"
string_to_search <- "Maryland"

grepl(regular_expression, string_to_search)
[1] TRUE
In the example above we specify the regular expression "a" and store it in a
variable called regular_expression. Remember that regular expressions are just
strings! We also store the string "Maryland" in a variable called
string_to_search. The regular expression "a" represents a single occurrence
of the character "a". Since "a" is contained within "Maryland", grepl()
returns the value TRUE. Let’s try another simple example:
regular_expression <- "u"
string_to_search <- "Maryland"

grepl(regular_expression, string_to_search)
[1] FALSE
The regular expression "u" represents a single occurrence of the character
"u", which is not a sub-string of "Maryland", therefore grepl() returns
the value FALSE. Regular expressions can be much longer than single
characters. You could for example search for smaller strings inside of a larger
string:
grepl("land", "Maryland")
[1] TRUE
grepl("ryla", "Maryland")
[1] TRUE
grepl("Marly", "Maryland")
[1] FALSE
grepl("dany", "Maryland")
[1] FALSE
Since "land" and "ryla" are sub-strings of "Maryland", grepl() returns
TRUE. However, when a regular expression like "Marly" or "dany" is searched,
grepl() returns FALSE because neither is a sub-string of "Maryland".
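Keep in mind that these matches are case-sensitive by default. The short sketch below illustrates this with the ignore.case argument of grepl(); the lowercase pattern "mary" is just an illustrative choice.

```r
# A lowercase pattern does not match the capitalized "Mary" by default
lower_match <- grepl("mary", "Maryland")
lower_match
# [1] FALSE

# ignore.case = TRUE makes the match case-insensitive
any_case_match <- grepl("mary", "Maryland", ignore.case = TRUE)
any_case_match
# [1] TRUE
```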
There’s a dataset that comes with R called state.name, which is a vector of
strings, one for each state in the United States of America. We’re going to use
this vector in several of the following examples.
head(state.name)
[1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
[6] "Colorado"
Let’s build a regular expression for identifying several strings in this vector,
specifically a regular expression that will match names of states that both
start and end with a vowel. The state name could start and end with any vowel,
so we won’t be able to match exact sub-strings like in the previous examples.
Thankfully we can use metacharacters to look for vowels and other parts of
strings. The first metacharacter that we’ll discuss is ".". The
metacharacter that only consists of a period represents any character other
than a new line (we’ll discuss new lines soon). Let’s take a look at some
examples using the period regex:
grepl(".", "Maryland")
[1] TRUE
grepl(".", "*&2[0+,%<@#~|}")
[1] TRUE
grepl(".", "")
[1] FALSE
As you can see, the period metacharacter is very liberal. This metacharacter is most useful when you don’t care about a set of characters in a regular expression. For example:
grepl("a.b", c("aaa", "aab", "abb", "acadb"))
[1] FALSE  TRUE  TRUE  TRUE
In the case above grepl() returns TRUE for all strings that contain an a
followed by any other character followed by a b.
You can specify a regular expression that contains a certain number of
characters or metacharacters using the enumeration metacharacters. The +
metacharacter indicates that one or more of the preceding expression should be
present and * indicates that zero or more of the preceding expression is
present. Let’s take a look at some examples using these metacharacters:
# Does "Maryland" contain one or more of "a" ?
grepl("a+", "Maryland")
[1] TRUE

# Does "Maryland" contain one or more of "x" ?
grepl("x+", "Maryland")
[1] FALSE

# Does "Maryland" contain zero or more of "x" ?
grepl("x*", "Maryland")
[1] TRUE
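A related enumeration metacharacter is ?, which indicates zero or one of the preceding expression. Here is a small sketch; the spellings are chosen purely for illustration.

```r
# "u?" makes the "u" optional, so both spellings match,
# while "colr" fails because the second "o" is still required
optional_u <- grepl("colou?r", c("color", "colour", "colr"))
optional_u
# [1]  TRUE  TRUE FALSE
```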
You can also specify exact numbers of expressions using curly brackets {}.
For example "a{5}" specifies “a exactly five times,” "a{2,5}" specifies
“a between 2 and 5 times,” and "a{2,}" specifies “a at least 2 times.” Let’s
take a look at some examples:
# Does "Mississippi" contain exactly 2 adjacent "s" ?
grepl("s{2}", "Mississippi")
[1] TRUE

# This is equivalent to the expression above:
grepl("ss", "Mississippi")
[1] TRUE

# Does "Mississippi" contain between 2 and 3 adjacent "s" ?
grepl("s{2,3}", "Mississippi")
[1] TRUE

# Does "Mississippi" contain between 2 and 3 adjacent "i" ?
grepl("i{2,3}", "Mississippi")
[1] FALSE

# Does "Mississippi" contain 2 adjacent "iss" ?
grepl("(iss){2}", "Mississippi")
[1] TRUE

# Does "Mississippi" contain 2 adjacent "ss" ?
grepl("(ss){2}", "Mississippi")
[1] FALSE

# Does "Mississippi" contain the pattern of an "i" followed by
# 2 of any character, with that pattern repeated three times adjacently?
grepl("(i.{2}){3}", "Mississippi")
[1] TRUE
In the last three examples I used parentheses () to create a capturing group.
A capturing group allows you to use quantifiers on other regular expressions.
In the last example I first created the regex "i.{2}" which matches i
followed by any two characters (“iss” or “ipp”). I then used a capture group
to wrap that regex, and to specify exactly three adjacent occurrences of
that regex.
You can specify sets of characters with regular expressions, some of which come
built in, but you can build your own character sets too. First we’ll discuss
the built-in character sets: words ("\\w"), digits ("\\d"), and whitespace
characters ("\\s"). Words specify any letter, digit, or an underscore, digits
specify the digits 0 through 9, and whitespace specifies line breaks, tabs, or
spaces. Each of these character sets has its own complement: not words
("\\W"), not digits ("\\D"), and not whitespace characters ("\\S"). Each
specifies all of the characters not included in its corresponding character
set. Let’s take a look at a few examples:
grepl("\\w", "abcdefghijklmnopqrstuvwxyz0123456789")
[1] TRUE

grepl("\\d", "0123456789")
[1] TRUE

# "\n" is the metacharacter for a new line
# "\t" is the metacharacter for a tab
grepl("\\s", "\n\t ")
[1] TRUE

grepl("\\d", "abcdefghijklmnopqrstuvwxyz")
[1] FALSE

grepl("\\D", "abcdefghijklmnopqrstuvwxyz")
[1] TRUE

grepl("\\w", "\n\t ")
[1] FALSE
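These built-in sets combine naturally with the curly bracket quantifiers from earlier. As a rough sketch, the pattern below looks for three digits, a hyphen, and four digits; the sample strings are invented for illustration.

```r
# Hypothetical sample strings
samples <- c("555-1234", "5551234", "call 555-1234 now")

# Three digits, a literal hyphen, then four digits, anywhere in the string
has_number <- grepl("\\d{3}-\\d{4}", samples)
has_number
# [1]  TRUE FALSE  TRUE
```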
You can also specify specific character sets using straight brackets []. For
example a character set of just the vowels would look like: "[aeiou]". You can
find the complement of a character set by putting a caret ^ after the
first bracket. For example "[^aeiou]" matches all characters except the
lowercase vowels. You can also specify ranges of characters using a hyphen -
inside of the brackets. For example "[a-m]" matches all of the lowercase
characters between a and m, while "[5-8]" matches any digit between 5 and
8 inclusive. Let’s take a look at some examples using custom character sets:
grepl("[aeiou]", "rhythms")
[1] FALSE

grepl("[^aeiou]", "rhythms")
[1] TRUE

grepl("[a-m]", "xyz")
[1] FALSE

grepl("[a-m]", "ABC")
[1] FALSE

grepl("[a-mA-M]", "ABC")
[1] TRUE
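Ranges and negation can be combined in a single set. The sketch below, using made-up strings, flags any string that contains a character other than an ASCII letter or digit.

```r
# Hypothetical sample strings
candidates <- c("abc123", "abc 123", "abc_123")

# TRUE when at least one character is NOT in A-Z, a-z, or 0-9
has_other <- grepl("[^A-Za-z0-9]", candidates)
has_other
# [1] FALSE  TRUE  TRUE
```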
You might be wondering how you can use regular expressions to match a particular
punctuation mark since many punctuation marks are used as metacharacters!
Putting two backslashes before a punctuation mark that is also a metacharacter
indicates that you are looking for the symbol and not the metacharacter meaning.
For example "\\." indicates you are trying to match a period in a string.
Let’s take a look at a few examples:
grepl("\\+", "tragedy + time = humor")
[1] TRUE

grepl("\\.", "http://www.jhsph.edu/")
[1] TRUE
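To see why the backslashes matter, compare an escaped and an unescaped period on a string that contains no period at all. The sketch also shows the fixed argument of grepl(), which treats the whole pattern as a literal string and is one alternative to escaping; the test sentence is made up.

```r
no_period <- "no period here"

# Unescaped "." is the any-character metacharacter, so it matches
unescaped <- grepl(".", no_period)
unescaped
# [1] TRUE

# Escaped "\\." only matches a literal period
escaped <- grepl("\\.", no_period)
escaped
# [1] FALSE

# fixed = TRUE turns off regex interpretation entirely
literal <- grepl(".", c(no_period, "a.b"), fixed = TRUE)
literal
# [1] FALSE  TRUE
```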
There are also metacharacters for matching the beginning and the end of a string
which are "^" and "$" respectively. Let’s take a look at a few examples:
grepl("^a", c("bab", "aab"))
[1] FALSE  TRUE

grepl("b$", c("bab", "aab"))
[1] TRUE TRUE

grepl("^[ab]+$", c("bab", "aab", "abc"))
[1]  TRUE  TRUE FALSE
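Anchors also change what the curly bracket quantifiers mean in practice. Without anchors, "a{2}" only asks whether two adjacent a characters appear somewhere, so longer runs match too. A minimal sketch with made-up strings:

```r
# Unanchored: "aaaa" contains two adjacent "a", so this matches
unanchored <- grepl("a{2}", "aaaa")
unanchored
# [1] TRUE

# Anchored: the whole string must be exactly two "a" characters
anchored <- grepl("^a{2}$", c("aa", "aaaa"))
anchored
# [1]  TRUE FALSE
```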
The last metacharacter we’ll discuss is the OR metacharacter ("|"). The OR
metacharacter matches either the regex on the left or the regex on the right
side of this character. A few examples:
grepl("a|b", c("abc", "bcd", "cde"))
[1]  TRUE  TRUE FALSE

grepl("North|South", c("South Dakota", "North Carolina", "West Virginia"))
[1]  TRUE  TRUE FALSE
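The OR metacharacter is often wrapped in parentheses so that it can be combined with other pieces of a pattern. As a sketch, the regex below anchors the alternatives to the start of the string and requires a space after them; it is applied to the built-in state.name vector.

```r
# Names that begin with "North " or "South "
directional <- state.name[grepl("^(North|South) ", state.name)]
directional
# [1] "North Carolina" "North Dakota"   "South Carolina" "South Dakota"
```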
Finally we’ve learned enough to create a regular expression that matches all state names that both begin and end with a vowel:
- We match the beginning of a string.
- We create a character set of just capitalized vowels.
- We specify one instance of that set.
- Then any number of characters until:
- A character set of just lowercase vowels.
- We specify one instance of that set.
- We match the end of a string.
start_end_vowel <- "^[AEIOU]{1}.+[aeiou]{1}$"
vowel_state_lgl <- grepl(start_end_vowel, state.name)
head(vowel_state_lgl)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

state.name[vowel_state_lgl]
[1] "Alabama"  "Alaska"   "Arizona"  "Idaho"    "Indiana"  "Iowa"     "Ohio"
[8] "Oklahoma"
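Since a character set already matches exactly one character, the {1} quantifiers are optional. The sketch below checks that a shorter pattern (my own simplification, not the pattern used above) selects the same states.

```r
original_pattern <- "^[AEIOU]{1}.+[aeiou]{1}$"
shorter_pattern  <- "^[AEIOU].*[aeiou]$"

# Both patterns flag the same elements of state.name
same_result <- identical(
  grepl(original_pattern, state.name),
  grepl(shorter_pattern, state.name)
)
same_result
# [1] TRUE
```

Note that ".*" would also accept a two-letter string such as "Aa", which ".+" rejects; no state name is that short, so the results agree here.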
Below is a table of several important metacharacters:
Metacharacter | Meaning |
---|---|
. | Any Character |
\w | A Word Character |
\W | Not a Word Character |
\d | A Digit |
\D | Not a Digit |
\s | Whitespace |
\S | Not Whitespace |
[xyz] | A Set of Characters |
[^xyz] | Negation of Set |
[a-z] | A Range of Characters |
^ | Beginning of String |
$ | End of String |
\n | Newline |
+ | One or More of Previous |
* | Zero or More of Previous |
? | Zero or One of Previous |
| | Either the Previous or the Following |
{5} | Exactly 5 of Previous |
{2,5} | Between 2 and 5 of Previous |
{2,} | At Least 2 of Previous |
1.7.3 RegEx Functions in R
So far we’ve been using grepl() to see if a regex matches a string. There are
a few other built-in regex functions you should be aware of. First we’ll review
our workhorse of this chapter, grepl(), which stands for “grep logical.”
grepl("[Ii]", c("Hawaii", "Illinois", "Kentucky"))
[1]  TRUE  TRUE FALSE
Then there’s old fashioned grep(), which returns the indices of the vector
that match the regex:
grep("[Ii]", c("Hawaii", "Illinois", "Kentucky"))
[1] 1 2
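If you would rather get the matching strings than their positions, grep() accepts a value argument. A short sketch using the same three state names:

```r
# value = TRUE returns the matching elements instead of their indices
matched <- grep("[Ii]", c("Hawaii", "Illinois", "Kentucky"), value = TRUE)
matched
# [1] "Hawaii"   "Illinois"
```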
The sub() function takes as arguments a regex, a “replacement,” and a vector
of strings. This function will replace the first instance of that regex found
in each string.
sub("[Ii]", "1", c("Hawaii", "Illinois", "Kentucky"))
[1] "Hawa1i"   "1llinois" "Kentucky"
The gsub() function is nearly the same as sub() except it will replace
every instance of the regex that is matched in each string.
gsub("[Ii]", "1", c("Hawaii", "Illinois", "Kentucky"))
[1] "Hawa11"   "1ll1no1s" "Kentucky"
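The replacement string can also refer back to capturing groups in the regex: "\\1" stands for whatever the first pair of parentheses matched, "\\2" for the second, and so on. The sketch below swaps two words; the input string is made up for illustration.

```r
# Capture two words separated by a space, then emit them in reverse order
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "Dakota North")
swapped
# [1] "North Dakota"
```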
The strsplit() function will split up strings according to the provided regex.
If strsplit() is provided with a vector of strings it will return a list of
string vectors.
two_s <- state.name[grep("ss", state.name)]
two_s
[1] "Massachusetts" "Mississippi"   "Missouri"      "Tennessee"
strsplit(two_s, "ss")
[[1]]
[1] "Ma"        "achusetts"

[[2]]
[1] "Mi"   "i"    "ippi"

[[3]]
[1] "Mi"   "ouri"

[[4]]
[1] "Tenne" "ee"
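Because strsplit() always returns a list, you usually index into the result with [[ ]] when you split a single string. The sketch below splits a made-up comma-separated string on a comma followed by optional whitespace, and uses lengths() to count the pieces for two state names.

```r
# Split one string; [[1]] pulls the character vector out of the list
parts <- strsplit("apple, banana,cherry", ",\\s*")[[1]]
parts
# [1] "apple"  "banana" "cherry"

# Count how many pieces each string is split into
piece_counts <- lengths(strsplit(c("Mississippi", "Missouri"), "ss"))
piece_counts
# [1] 3 2
```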
1.7.4 The stringr Package
The stringr package, written by Hadley Wickham, is part of the Tidyverse
group of R packages. This package takes a “data first” approach to functions
involving regex, so usually the string is the first argument and the regex is
the second argument. The majority of the function names in stringr begin with
str_.
The str_extract() function returns the sub-string of a string that matches the
provided regular expression.
library(stringr)
state_tbl <- paste(state.name, state.area, state.abb)
head(state_tbl)
[1] "Alabama 51609 AL"     "Alaska 589757 AK"     "Arizona 113909 AZ"
[4] "Arkansas 53104 AR"    "California 158693 CA" "Colorado 104247 CO"
str_extract(state_tbl, "[0-9]+")
 [1] "51609"  "589757" "113909" "53104"  "158693" "104247" "5009"   "2057"
 [9] "58560"  "58876"  "6450"   "83557"  "56400"  "36291"  "56290"  "82264"
[17] "40395"  "48523"  "33215"  "10577"  "8257"   "58216"  "84068"  "47716"
[25] "69686"  "147138" "77227"  "110540" "9304"   "7836"   "121666" "49576"
[33] "52586"  "70665"  "41222"  "69919"  "96981"  "45333"  "1214"   "31055"
[41] "77047"  "42244"  "267339" "84916"  "9609"   "40815"  "68192"  "24181"
[49] "56154"  "97914"
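The same approach can pull out the abbreviation at the end of each string. The sketch below assumes the stringr package is installed; it rebuilds state_tbl so that it runs on its own, and also shows str_detect(), the stringr counterpart of grepl().

```r
library(stringr)

state_tbl <- paste(state.name, state.area, state.abb)

# Two capital letters anchored to the end of each string
abbs <- str_extract(state_tbl, "[A-Z]{2}$")
head(abbs)
# [1] "AL" "AK" "AZ" "AR" "CA" "CO"

# str_detect() returns TRUE or FALSE for each string, like grepl()
is_alaska <- str_detect(state_tbl[1:3], "AK$")
is_alaska
# [1] FALSE  TRUE FALSE
```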
The str_order() function returns a numeric vector that corresponds to the
alphabetical order of the strings in the provided vector.
head(state.name)
[1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
[6] "Colorado"
str_order(state.name)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

head(state.abb)
[1] "AL" "AK" "AZ" "AR" "CA" "CO"
str_order(state.abb)
 [1]  2  1  4  3  5  6  7  8  9 10 11 15 12 13 14 16 17 18 21 20 19 22 23 25 24
[26] 26 33 34 27 29 30 31 28 32 35 36 37 38 39 40 41 42 43 44 46 45 47 49 48 50
The str_pad() function pads strings with other characters, which is often
useful when the string is going to be eventually printed for a person to read.
str_pad("Thai", width = 8, side = "left", pad = "-")
[1] "----Thai"
str_pad("Thai", width = 8, side = "right", pad = "-")
[1] "Thai----"
str_pad("Thai", width = 8, side = "both", pad = "-")
[1] "--Thai--"
The str_to_title() function acts just like tolower() and toupper() except
it puts strings into Title Case.
cases <- c("CAPS", "low", "Title")
str_to_title(cases)
[1] "Caps"  "Low"   "Title"
The str_trim() function deletes whitespace from both sides of a string.
to_trim <- c(" space", "the ", " final frontier ")
str_trim(to_trim)
[1] "space"          "the"            "final frontier"
The str_wrap() function inserts newlines in strings so that when the string
is printed each line’s length is limited.
pasted_states <- paste(state.name[1:20], collapse = " ")

cat(str_wrap(pasted_states, width = 80))
Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida
Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine
Maryland
cat(str_wrap(pasted_states, width = 30))
Alabama Alaska Arizona
Arkansas California Colorado
Connecticut Delaware Florida
Georgia Hawaii Idaho Illinois
Indiana Iowa Kansas Kentucky
Louisiana Maine Maryland
The word() function allows you to index each word in a string as if it were
a vector.
a_tale <- "It was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness"

word(a_tale, 2)
[1] "was"

word(a_tale, end = 3)
[1] "It was the"

word(a_tale, start = 11, end = 15)
[1] "of times it was the"
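Negative positions in word() count backwards from the last word, which is handy when you do not know how long the string is. A minimal sketch, assuming stringr is installed and using a shortened sentence so that it runs on its own:

```r
library(stringr)

# A shortened, stand-alone sentence for illustration
opening <- "It was the best of times"

# -1 refers to the last word
last_word <- word(opening, -1)
last_word
# [1] "times"

# The last two words
last_two <- word(opening, start = -2, end = -1)
last_two
# [1] "of times"
```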
1.7.5 Summary
String manipulation in R is useful for data cleaning, plus it can be fun! For prototyping your first regular expressions I highly recommend checking out http://regexr.com/. If you’re interested in what some people call a more “humane” way of constructing regular expressions you should check out the rex package by Kevin Ushey and Jim Hester. If you’d like to find out more about text analysis I highly recommend reading Tidy Text Mining in R by Julia Silge and David Robinson.