Regular Expressions

Section 21.15 Regular Expressions

Regular expressions (regexes) are a powerful tool for matching patterns in text. They are widely used in programming, text processing, and data validation.

In this section, we will look at the syntax of regular expressions. Then, in the next section we will explore how to make use of them.

Subsection 21.15.1 What is a Regular Expression?

Imagine you have a text file that contains a list of phone numbers entered by users. They appear in a wide variety of formats:

123-456-7890
456.7890
123.456.7890
+1 123 456 7890
456-7890 ext. 123

Now picture trying to write code that can identify each of these as a phone number and extract the digits. Writing code using selection logic to handle each possible format would be tedious and error-prone.

What you really want to express is something like “I’m looking for a group of 3 digits, optionally followed by a single separator symbol, and then 4 more digits.”

A regular expression allows you to express patterns like this.

Subsection 21.15.2 Basic Syntax

Regular expressions use a special syntax to define patterns. Here are some basic elements:

- Literal characters: Match themselves. For example, the regex abc matches the string "abc".
- Wildcards: The dot . matches any single character except a newline.
- Bracket groups: Square brackets indicate a set of characters to match. For example, [abc] matches a single character that is either ’a’, ’b’, or ’c’.
- Ranges: Inside brackets, - indicates a range of characters. For example, [A-F] matches any single uppercase letter from A to F. You can combine multiple ranges, as in [A-Fa-f], to match both uppercase and lowercase letters.
- Character classes: Special symbols stand for common sets of characters: \d for digits, \w for word characters (letters, digits, and the underscore), and \s for whitespace characters (spaces, tabs, and newlines).
- Quantifiers: Specify how many times an element can occur.
  - {n} matches exactly n occurrences of the preceding element.
  - * matches zero or more occurrences of the preceding element.
  - ? makes the preceding element optional (zero or one occurrence).

  For example, a{3} matches exactly three ’a’s, while a* matches zero or more ’a’s. In ab?, the ? applies only to the ’b’, so the pattern matches either "a" or "ab".
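
The elements listed above can be tried out in most regex engines. Here is a minimal sketch using Python's built-in re module. The choice of Python is an assumption made for illustration, and the helper name matches is hypothetical. re.fullmatch succeeds only when the entire string fits the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text fits pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Literal characters match themselves.
print(matches(r"abc", "abc"))       # True
# The dot matches any single character except a newline.
print(matches(r"a.c", "a-c"))       # True
print(matches(r"a.c", "a\nc"))      # False
# A bracket group matches one character from the set.
print(matches(r"[abc]", "b"))       # True
# Ranges: one letter from A-F in either case.
print(matches(r"[A-Fa-f]", "e"))    # True
print(matches(r"[A-Fa-f]", "g"))    # False
# Character classes: a digit, then a word character, then whitespace.
print(matches(r"\d\w\s", "7_\t"))   # True
# Quantifiers.
print(matches(r"a{3}", "aaa"))      # True
print(matches(r"a*", ""))           # True
print(matches(r"ab?", "a"))         # True
print(matches(r"ab?", "abb"))       # False
```

The patterns are written as raw strings (the r prefix) so that Python passes backslashes such as \d to the regex engine unchanged.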

We will not attempt to cover every aspect of regular expression syntax here. Instead, you are encouraged to explore a resource like RegexOne if you want to learn more.

Subsection 21.15.3 Patterns and Groups

We can use ( ) to form groups in our patterns. This allows us to apply quantifiers to entire groups. For instance, to say “there may be a group of 3 digits”, we could write (\d{3})? so that the question mark applies to the whole group rather than to a single character.
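
To see the difference that grouping makes to a quantifier, here is a small sketch, again assuming Python's re module purely for illustration. The variable name optional_prefix is hypothetical.

```python
import re

# Without parentheses, * applies only to the 'b'.
print(re.fullmatch(r"ab*", "abbb") is not None)      # True
print(re.fullmatch(r"ab*", "ababab") is not None)    # False
# With parentheses, * applies to the whole group "ab".
print(re.fullmatch(r"(ab)*", "ababab") is not None)  # True

# An optional group of three digits followed by four required digits.
optional_prefix = re.compile(r"(\d{3})?\d{4}")
print(optional_prefix.fullmatch("7890") is not None)     # True
print(optional_prefix.fullmatch("4567890") is not None)  # True
print(optional_prefix.fullmatch("567890") is not None)   # False
```

The last case fails because six digits can be read neither as four digits alone nor as three digits plus four.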

Groups also allow you to capture parts of the match for further processing. For example, the regex (\d{3})-(\d{4}) matches a pattern like "456-7890", capturing the prefix (456) and the local number (7890) as separate groups.
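
As a rough illustration of capturing, the sketch below applies that pattern with Python's re module (an assumed choice of language). The names prefix and line_number are hypothetical. Groups are numbered from 1 in the order their opening parentheses appear, and group 0 is the entire match.

```python
import re

match = re.fullmatch(r"(\d{3})-(\d{4})", "456-7890")
if match:
    prefix = match.group(1)       # text captured by the first ( )
    line_number = match.group(2)  # text captured by the second ( )
    print(prefix, line_number)    # 456 7890
    print(match.group(0))         # 456-7890
```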

To return to our phone number example, we need to write a regular expression that matches the following:

- An optional group of 3 digits for the area code. A question mark after the group makes it optional: (\d{3})?
- An optional separator symbol such as a dash, space, or period. To keep things simple, we allow any character by using the . symbol and make it optional with a question mark: .?
- A required group of 3 digits. We want to capture this group, so we wrap it in parentheses: (\d{3})
- Another optional separator symbol: .?
- Finally, a required group of 4 digits, which we also capture: (\d{4})

Our final regex looks like this:

(\d{3})?.?(\d{3}).?(\d{4})

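This isn’t a perfect regex for phone numbers. Because . accepts any character as a separator, it would also match a string like "123x4567", and it ignores the country code and extension in the examples above. Still, when used to search within each of those examples, it finds the phone number. It also illustrates the fact that regexes can get complex quickly and end up being quite difficult to read.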
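
To check the pattern against the sample phone numbers from the start of this section, one possible sketch in Python follows. The language, the constant PHONE, and the function extract_digits are all assumptions made for illustration. It uses search rather than a whole-string match, so text such as "+1 " or " ext. 123" around the number is simply skipped.

```python
import re

PHONE = re.compile(r"(\d{3})?.?(\d{3}).?(\d{4})")

examples = [
    "123-456-7890",
    "456.7890",
    "123.456.7890",
    "+1 123 456 7890",
    "456-7890 ext. 123",
]

def extract_digits(text):
    """Return the digits of the first phone number found in text, or None."""
    match = PHONE.search(text)
    if match is None:
        return None
    # An optional group that took no part in the match is reported as
    # None by default; default="" turns it into an empty string instead.
    area, prefix, line = match.groups(default="")
    return area + prefix + line

for text in examples:
    print(f"{text!r} -> {extract_digits(text)}")
```

For the inputs without an area code, the first group is empty and the function returns only the seven local digits.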
