What is a Regular Expression?
A regular expression is a construct in Perl that can be used to determine whether a string of text contains a substring that has a particular form (or pattern, as they’re called more formally). You’ve actually already seen a sneak preview of regular expressions (or regexes for short). Recall this line from the getNumberInput function in the post on subroutines, which we used to determine whether the user input was actually a number:
In this post, we’ll look more closely at what that means and how it works, as well as other examples of regexes. I will refer back to this regex throughout the post as the “number example.”
The Basics
Regexes are delimited by forward slashes / /. The operator =~ is used to test whether a string contains a substring that matches the specified regex, and the operator !~ tests whether a string does not contain a substring that matches the specified regex.
The simplest regexes are literal text. For example, $string =~ /foo/ tests whether $string contains foo as a substring. Placing a lowercase i after the closing forward slash makes the search case-insensitive. So, for example, $string =~ /foo/i will be true if $string contains any of the following as substrings:
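A few such variants, together with the !~ operator described above, can be sketched as follows. The helper name contains_foo_ci and the sample strings are illustrative assumptions, not part of any standard library.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string contains "foo" in any letter case.
sub contains_foo_ci {
    my ($string) = @_;
    return $string =~ /foo/i ? 1 : 0;
}

# Sample strings chosen for illustration; "seafood" contains foo as a substring.
for my $candidate ("foo", "Foo", "fOo", "FOO", "seafood", "bar") {
    my $verdict = contains_foo_ci($candidate) ? "matches" : "does not match";
    print "$candidate $verdict /foo/i\n";
}

# !~ is true when the string does NOT contain a match.
print "bar contains no foo\n" if "bar" !~ /foo/;
```

Only bar fails the case-insensitive test; every other sample contains some capitalization of foo.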
Like double-quoted strings, regexes are interpolated. This means that variable names appearing in a regex will be replaced with the variable’s contents. So, for example,
would output t, since fooBar is a substring of fooBarBaz.
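A minimal sketch of interpolation along these lines, assuming hypothetical variable names $needle and $haystack and a hypothetical helper contains:

```perl
use strict;
use warnings;

my $needle   = "fooBar";     # hypothetical variable holding the pattern text
my $haystack = "fooBarBaz";  # hypothetical string to search

# Returns 1 if $text contains a match for the interpolated $pattern.
sub contains {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

# $needle is replaced by its contents, so this tests $haystack =~ /fooBar/.
print contains($haystack, $needle) ? "t\n" : "f\n";   # prints "t"
```

Because the variable's contents become part of the pattern, any special characters inside it are treated as regex syntax rather than literal text.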
Note that the following fourteen characters have special meaning within a regex. In order to use any of these as literal characters, they must be preceded by a backslash \:
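Assuming the fourteen characters are the standard metacharacter set listed in the perlretut tutorial, { } [ ] ( ) ^ $ . | * + ? \, escaping can be sketched as below. The helper name matches_literally is hypothetical; quotemeta is Perl's built-in function for backslash-escaping non-word characters.

```perl
use strict;
use warnings;

# Assumed list: the fourteen metacharacters documented in perlretut.
my @metachars = ('{', '}', '[', ']', '(', ')', '^', '$', '.', '|', '*', '+', '?', '\\');

# Hypothetical helper: escapes one character with a backslash and checks
# that the resulting regex matches that character and nothing else.
sub matches_literally {
    my ($char) = @_;
    my $escaped = "\\" . $char;
    return $char =~ /^$escaped$/ ? 1 : 0;
}

for my $char (@metachars) {
    # The built-in quotemeta produces the same backslash-escaped form.
    my $same = quotemeta($char) eq "\\" . $char ? "yes" : "no";
    print "$char matches literally: ", matches_literally($char),
          " (same as quotemeta: $same)\n";
}
```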
Anchors
What if we want to look for something at the very start or very end of the string? We use an anchor. The caret (^) means “match the beginning of the string,” and the dollar sign ($) means “match the end of the string” (or just before a newline that ends the string). So $string =~ /^foo/ matches only if the first three characters of the string are foo.
Anchoring can also be seen in the number example. Note that the first character inside the opening forward slash is ^ and the last character before the closing forward slash is $. By using both the beginning-of-string and end-of-string anchors, I tell Perl that I want to check for an exact match with the entire string, rather than checking for a substring that matches my regex.
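The effect of each anchor, alone and combined, can be sketched with three small helpers. The sub names are hypothetical and chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per anchoring style.
sub starts_with_foo { my ($s) = @_; return $s =~ /^foo/  ? 1 : 0; }
sub ends_with_foo   { my ($s) = @_; return $s =~ /foo$/  ? 1 : 0; }
sub is_exactly_foo  { my ($s) = @_; return $s =~ /^foo$/ ? 1 : 0; }

for my $candidate ("foo", "foobar", "barfoo", "barfoobar") {
    printf "%-10s start=%d end=%d exact=%d\n",
        $candidate,
        starts_with_foo($candidate),
        ends_with_foo($candidate),
        is_exactly_foo($candidate);
}
```

Only foo passes all three tests; barfoobar passes none of them even though it contains foo, because the substring touches neither end of the string.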
Character Classes
This is where the true power of regexes starts to make itself known. Instead of matching one specific character, I can specify that I want to match one of several possible characters by defining a character class. A character class is defined by placing the desired characters in square brackets [ ]. For example, $string =~ /[bcr]at/ will match if $string contains bat, cat, or rat as a substring. Character classes can also specify ranges of characters; for example, [A-Z] represents any uppercase letter. A character class can also be negated by placing a caret ^ immediately inside the opening square bracket. For example, [^A-Z] represents any character other than an uppercase letter.
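The three kinds of character class described above (an explicit list, a range, and a negated range) can be sketched as follows. The helper names and test words are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per kind of character class.
sub has_bat_cat_or_rat { my ($s) = @_; return $s =~ /[bcr]at/ ? 1 : 0; }  # explicit list
sub has_uppercase      { my ($s) = @_; return $s =~ /[A-Z]/   ? 1 : 0; }  # range
sub has_non_uppercase  { my ($s) = @_; return $s =~ /[^A-Z]/  ? 1 : 0; }  # negated range

print "cat:     ", has_bat_cat_or_rat("cat"), "\n";      # 1
print "scratch: ", has_bat_cat_or_rat("scratch"), "\n";  # 1, contains "rat"
print "hat:     ", has_bat_cat_or_rat("hat"), "\n";      # 0, h is not in [bcr]

print "Hello has uppercase:     ", has_uppercase("Hello"), "\n";      # 1
print "HELLO has non-uppercase: ", has_non_uppercase("HELLO"), "\n";  # 0
```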
Perl also provides the following predefined character classes that can be accessed via shorthand notation:
| Shorthand | Represents |
|---|---|
| \d | A digit (0-9) |
| \D | Anything other than a digit |
| \w | A “word character”: any character that can be validly used in a Perl identifier (letter, digit, or underscore _) |
| \W | Anything other than a word character |
| \s | A whitespace character |
| \S | Anything other than a whitespace character |
| . | Any character except a newline (any character at all when the /s modifier is used) |
Note the use of the \d character class in the number example above. This allows me to match any digit 0-9, which is good because I want to accept any validly formatted real number regardless of what digits it uses. Also note that, although a . appears in the number example, I am not using the “any character” class; by preceding the . with a backslash, I’ve told Perl to interpret it as a literal character. The regex will thus be looking for the actual . character in $usrInput.
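To see the shorthand classes side by side, and the difference between . and \., here is a small sketch. The helper names classify_char, dot_matches, and literal_dot_matches are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: names the first shorthand class a single character fits.
# \d is tested before \w because every digit is also a word character.
sub classify_char {
    my ($char) = @_;
    return "digit"      if $char =~ /^\d$/;
    return "word"       if $char =~ /^\w$/;
    return "whitespace" if $char =~ /^\s$/;
    return "other";
}

# Unescaped dot: the "any character" class. Escaped dot: a literal period.
sub dot_matches         { my ($char) = @_; return $char =~ /./  ? 1 : 0; }
sub literal_dot_matches { my ($char) = @_; return $char =~ /\./ ? 1 : 0; }

print "7   is ", classify_char("7"),  "\n";   # digit
print "_   is ", classify_char("_"),  "\n";   # word
print "tab is ", classify_char("\t"), "\n";   # whitespace
print "-   is ", classify_char("-"),  "\n";   # other

print "x against /./:  ", dot_matches("x"), "\n";           # 1
print "x against /\\./: ", literal_dot_matches("x"), "\n";  # 0
print "newline against /./: ", dot_matches("\n"), "\n";     # 0
```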
Quantifiers
Quantifiers specify how many times a particular element may appear in the matching substring. There are seven quantifier forms, shown in the table below applied to the letter a. In these examples, x and y represent non-negative integers. Note that a{,x} is recognized as a quantifier only in Perl 5.34 and later; on earlier versions, write a{0,x} instead.
| Quantifier | Meaning |
|---|---|
| a? | Zero or one occurrences of a |
| a* | Zero or more occurrences of a |
| a+ | One or more occurrences of a |
| a{x} | Exactly x occurrences of a |
| a{x,} | At least x occurrences of a |
| a{,x} | At most x occurrences of a |
| a{x,y} | At least x but no more than y occurrences of a |
Note that the quantifiers bind only to the immediately preceding character or character class by default. For example, ab+ will match ab, abb, abbb, and so on. To apply a quantifier to more than one character or character class, we must define the characters we want the quantifier to apply to as a group by enclosing them in parentheses ( ). So, for example, (ab)+ will match ab, abab, ababab, and so on.
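The difference between ab+ and (ab)+ can be checked directly. In this sketch the patterns are anchored at both ends so that the whole string must fit the pattern; the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers. Anchors force the whole string to fit the pattern.
sub is_a_then_bs     { my ($s) = @_; return $s =~ /^ab+$/     ? 1 : 0; }  # + binds to b only
sub is_repeated_ab   { my ($s) = @_; return $s =~ /^(ab)+$/   ? 1 : 0; }  # + binds to the group
sub is_two_or_three_a { my ($s) = @_; return $s =~ /^a{2,3}$/ ? 1 : 0; }  # counted range

for my $candidate ("ab", "abbb", "abab", "aa", "aaaa") {
    printf "%-5s ab+=%d (ab)+=%d a{2,3}=%d\n",
        $candidate,
        is_a_then_bs($candidate),
        is_repeated_ab($candidate),
        is_two_or_three_a($candidate);
}
```

Here abbb fits only ab+, abab fits only (ab)+, and ab fits both.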
Also note that the quantifiers are greedy by default—they will attempt to match as large a substring as they possibly can while still allowing the entire pattern to produce a match. This is of particular concern with the constructs .* and .+, which will match the largest consecutive sequence of “anything at all” that they can get their grubby paws on. Any of the quantifiers can be made reluctant by following them with a question mark ?. This will cause them to match the shortest sequence they can that will still allow the entire pattern to produce a match. For example, consider my $string = "abracadabra";. Matching it against the regex /a\w*a/, using the greedy quantifier, the initial a in the regex will match the first a in the string, the \w* will consume the bracadabr, and the last a in the regex will match the last a in the string, so the regex matches the entire string, abracadabra. Matching it against the regex /a\w*?a/, using the reluctant quantifier, the initial a in the regex will again match the first a in the string. The reluctantly quantified \w*?, however, will match only the br, stopping as soon as it gets to the second a, which gets matched to the last a in the regex. Thus, the version of the regex using the reluctant quantifier finds only abra as its match.
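The abracadabra walkthrough can be reproduced with a short sketch. It relies on Perl's built-in $& variable, which holds the text matched by the last successful regex; the helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: return the matched substring, or undef on no match.
# $& is Perl's built-in "text of the last successful match" variable.
sub greedy_match {
    my ($s) = @_;
    return $s =~ /a\w*a/ ? $& : undef;
}

sub reluctant_match {
    my ($s) = @_;
    return $s =~ /a\w*?a/ ? $& : undef;
}

my $string = "abracadabra";
print "Greedy:    ", greedy_match($string), "\n";     # abracadabra
print "Reluctant: ", reluctant_match($string), "\n";  # abra
```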
We see several quantifiers in the number example. Firstly, a ? quantifier is bound to the - character immediately following the start-of-string anchor. This makes the - character optional. Both occurrences of the \d character class have a + quantifier bound to them; this means there can be one or more digits, which is good since we don’t know how big a number the user might give us. Finally, we have another ? quantifier, this time bound to the group (\.\d+). This makes the entire group optional; however, we cannot have only part of the group present. The group must either be able to match in its entirety or be completely absent.
We now know enough to describe the number example in full. The pattern matched by the number example is: the beginning of the string, optionally followed by a negative sign, followed by one or more digits, optionally followed by a decimal point and one or more additional digits, followed by the end of the string. By checking whether the user input matches this regex, we are assuring ourselves that the user has input a validly formatted real number.
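Assembled from that description, the pattern would read /^-?\d+(\.\d+)?$/. The sketch below wraps it in a hypothetical helper named is_number; it is a reconstruction from the prose above and is not the original getNumberInput code.

```perl
use strict;
use warnings;

# Hypothetical helper built from the description above:
#   ^        beginning of the string
#   -?       optional negative sign
#   \d+      one or more digits
#   (\.\d+)? optional decimal point followed by one or more digits
#   $        end of the string
sub is_number {
    my ($usrInput) = @_;
    return $usrInput =~ /^-?\d+(\.\d+)?$/ ? 1 : 0;
}

for my $candidate ("42", "-3.14", "3.", ".5", "--1", "1e5", "abc") {
    printf "%-6s %s\n", $candidate, is_number($candidate) ? "accepted" : "rejected";
}
```

Only 42 and -3.14 are accepted. The inputs 3. and .5 are rejected because the pattern requires at least one digit on each side of the decimal point.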
Backreferences
Groups have another purpose besides just having quantifiers bound to them. They can also be used to extract a portion of the matched string to be looked at later. The extracted groups are stored in special backreference variables, which begin with $1 for the first extracted group, $2 for the second extracted group, and so on. By using groups to extract portions of our matched string into the backreference variables, we can write a program that more clearly demonstrates the difference between the greedy and reluctant quantifiers:
When run, this program produces the output
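A program along these lines can be sketched as follows. The input string abc12345 and the sub names are illustrative assumptions, not the author's original listing.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the two extracted groups ($1, $2).
sub split_greedy {
    my ($s) = @_;
    return $s =~ /(.*)(\d+)/ ? ($1, $2) : ();
}

sub split_reluctant {
    my ($s) = @_;
    return $s =~ /(.*?)(\d+)/ ? ($1, $2) : ();
}

my $string = "abc12345";

my ($greedy_first, $greedy_second) = split_greedy($string);
print "Greedy:    \$1 = $greedy_first, \$2 = $greedy_second\n";
# Greedy:    $1 = abc1234, $2 = 5

my ($reluctant_first, $reluctant_second) = split_reluctant($string);
print "Reluctant: \$1 = $reluctant_first, \$2 = $reluctant_second\n";
# Reluctant: $1 = abc, $2 = 12345
```

The greedy .* swallows as much as it can and leaves \d+ only the single digit it needs, while the reluctant .*? stops at the first point where \d+ can take over.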
Note that group numbering depends only on the order in which the groups appear in the regex, not on which groups actually took part in the match. So, for example, in the regex /([A-Za-z]{3})?(\d+)/, the substring matched by the group (\d+) will always be in backreference variable $2, even if the optional group preceding it was not matched (in which case $1 is left undefined). If we only intend to extract some of the defined groups, and the other sets of parentheses are being used only to define a group for a quantifier to bind to, we can place the sequence ?: immediately inside the opening parenthesis of a group to tell Perl not to extract that group into a backreference variable. So, for example, using the regex /(?:[A-Za-z]{3})?(\d+)/, the optional first group will not be extracted into a backreference variable. The group we actually care about extracting, (\d+), is thus placed in backreference variable $1, since the preceding group is no longer being extracted.
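Both behaviors can be checked with a short sketch. In list context a match returns its extracted groups in order, with undef standing in for any group that did not take part; the helper names and sample strings are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the list of extracted groups.
sub capturing_groups     { my ($s) = @_; my @groups = $s =~ /([A-Za-z]{3})?(\d+)/;   return @groups; }
sub non_capturing_groups { my ($s) = @_; my @groups = $s =~ /(?:[A-Za-z]{3})?(\d+)/; return @groups; }

# Render undef visibly so it can be printed without a warning.
sub show { my ($value) = @_; return defined $value ? $value : "(undef)"; }

for my $candidate ("abc123", "12345") {
    my @both = capturing_groups($candidate);
    my @only = non_capturing_groups($candidate);
    print "$candidate with two groups: \$1 = ", show($both[0]), ", \$2 = ", show($both[1]), "\n";
    print "$candidate with (?:...):    \$1 = ", show($only[0]), "\n";
}
```

For 12345 the two-group regex leaves $1 undefined and puts the digits in $2, while the non-capturing version puts them in $1.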