Regular Expressions: Part 1


Courtesy Ondrej Balas, Visual Studio Magazine

Regular expressions — those scary strings that might as well be written in Klingon to the average person — can be a vast time-saver. They help in one of the most common tasks of programming: string manipulation. The .NET Framework has an excellent, built-in regular expressions engine that’s relatively straightforward to use.

The use and understanding of regular expressions (hereafter referred to as RegEx) traditionally comes in three parts: testing a string to see if a pattern exists within it (pattern matching); reading strings and extracting useful information; and manipulating and making changes to those strings.

Pattern Matching
The first thing you’ll likely want to use a RegEx for is to do pattern matching, the simplest example of which is determining whether a particular set of characters exists within a string. In fact, if you’ve ever used the “Find” feature in any text editor, you’ve effectively used a regular expression composed only of the most basic pieces: literals. In a RegEx pattern, a literal is a character that must be matched exactly. Searching for the pattern “BCD” within the string “ABCDE” would result in a match, for example. In this case, the search pattern would read as, “The literal B, immediately followed by the literal C, immediately followed by the literal D.”
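
As a quick illustration of literal matching, the sketch below runs the “BCD” search described above through Regex.IsMatch. The module name LiteralDemo and the extra “BDC” and “bcd” patterns are made up for this example.

```vbnet
Imports System.Text.RegularExpressions

Module LiteralDemo
  Sub Main()
    ' B, then C, then D: the three literals appear together inside ABCDE.
    Console.WriteLine(Regex.IsMatch("ABCDE", "BCD"))   ' True
    ' Literals must appear adjacent and in order, so BDC is not found.
    Console.WriteLine(Regex.IsMatch("ABCDE", "BDC"))   ' False
    ' Literals are case-sensitive by default.
    Console.WriteLine(Regex.IsMatch("ABCDE", "bcd"))   ' False
  End Sub
End Module
```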

As beneficial as that is, there are certainly better and simpler ways of accomplishing a simple search without the use of regular expressions. The power of RegEx doesn’t become apparent until you start using elements like character classes, which are sets of characters, any one of which is accepted as a match. Character classes are denoted by the square brackets surrounding them. Take the character class [0123456789] for example, which instructs the RegEx engine to match any single character that is a digit from 0 to 9. Alternatively, you could use [0-9], which has the same meaning to the engine, but is more readable to a human. RegEx additionally has some helpful shortcuts for commonly used classes like this one. [0-9] can be further shortened by using \d, the shortcut for “Match any numeric character.” (In .NET, \d also matches decimal digits from other Unicode scripts unless RegexOptions.ECMAScript is specified, so the two are interchangeable only for ASCII input.)
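
To make the equivalence concrete, here is a minimal sketch that tries all three spellings of the digit class against the same inputs. The module name and sample strings are invented for illustration. The last three lines probe a non-ASCII digit, which is the one place where \d and [0-9] can differ in .NET.

```vbnet
Imports System.Text.RegularExpressions

Module CharacterClassDemo
  Sub Main()
    ' Three spellings of the same idea: match one digit.
    Dim patterns() As String = {"[0123456789]", "[0-9]", "\d"}
    For Each pattern As String In patterns
      ' "abc7" contains a digit and "abc" does not, so each line ends True, False.
      Console.WriteLine("{0}: {1}, {2}", pattern,
                        Regex.IsMatch("abc7", pattern),
                        Regex.IsMatch("abc", pattern))
    Next

    ' U+0663 is the Arabic-Indic digit three, a decimal digit outside ASCII.
    Dim arabicThree As String = Char.ConvertFromUtf32(&H663)
    Console.WriteLine(Regex.IsMatch(arabicThree, "[0-9]"))                        ' False
    Console.WriteLine(Regex.IsMatch(arabicThree, "\d"))                           ' True
    Console.WriteLine(Regex.IsMatch(arabicThree, "\d", RegexOptions.ECMAScript))  ' False
  End Sub
End Module
```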

The Regex Object

Using RegEx in your code requires the Regex class, located within the System.Text.RegularExpressions namespace. For a usage example, see Listing 1 below. The code in this example reads a string as input from the console, then shows a response depending on whether the string contains a numeric character anywhere within it (using \d as the pattern). You can experiment with different things in the pattern, such as \d\d, which will match only if two digits are next to each other somewhere in the string (e.g., “12ab” will match, but “1a2” will not).

Listing 1: Test To See If the String Entered Contains a Number

Imports System.Text.RegularExpressions

Module Module1
  Sub Main()
    ' \d matches a single numeric character anywhere in the input.
    Dim pattern As String = "\d"
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop on Q (any case) or when input ends (ReadLine returns Nothing).
      If input Is Nothing OrElse input.ToUpper() = "Q" Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a number")
      Else
        Console.WriteLine("String does NOT contain a number")
      End If
    Loop
  End Sub
End Module
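
If you would rather not type each test string by hand, a non-interactive variant along these lines checks the \d and \d\d patterns against a few fixed inputs. The module name and the “abc” input are additions for this sketch. The other two inputs come from the text above.

```vbnet
Imports System.Text.RegularExpressions

Module AdjacentDigitsDemo
  Sub Main()
    Dim candidates() As String = {"12ab", "1a2", "abc"}
    For Each candidate As String In candidates
      ' \d needs one digit anywhere; \d\d needs two digits side by side.
      Console.WriteLine("{0}: \d = {1}, \d\d = {2}", candidate,
                        Regex.IsMatch(candidate, "\d"),
                        Regex.IsMatch(candidate, "\d\d"))
    Next
    ' Prints:
    ' 12ab: \d = True, \d\d = True
    ' 1a2: \d = True, \d\d = False
    ' abc: \d = False, \d\d = False
  End Sub
End Module
```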

Wildcards
In RegEx, wildcards are really just shortcuts to character classes. Just like \d is a shortcut for [0-9], there are other shortcuts, too. See Table 1 below for a list of the most commonly used character classes. Note that the period is special in that it matches any character except the line break by default.

Table 1: A List of Character Class Shortcuts

Shortcut   Class             Description
\d         [0-9]             numeric character
\D         [^0-9]            NOT a numeric character
\w         [a-zA-Z0-9_]      “word” character
\W         [^a-zA-Z0-9_]     NOT a “word” character
\s         [ \f\n\r\t\v]     whitespace character
\S         [^ \f\n\r\t\v]    NOT a whitespace character
.          [^\n]             Any character except line break
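
One way to get a feel for the table is to test every shortcut against a handful of single-character strings. The sketch below does that. The sample characters and the module name are arbitrary choices for this example.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ShortcutDemo
  Sub Main()
    Dim shortcuts() As String = {"\d", "\D", "\w", "\W", "\s", "\S", "."}
    ' A digit, a letter, an underscore, a space and a hyphen.
    Dim samples() As String = {"7", "x", "_", " ", "-"}
    For Each shortcut As String In shortcuts
      ' Collect the samples this shortcut accepts.
      Dim hits As New List(Of String)
      For Each sample As String In samples
        If Regex.IsMatch(sample, shortcut) Then hits.Add("'" & sample & "'")
      Next
      Console.WriteLine("{0} matches: {1}", shortcut, String.Join(", ", hits))
    Next
  End Sub
End Module
```

With these samples, \d accepts only '7', \w accepts '7', 'x' and '_', \s accepts only the space, each uppercase shortcut accepts exactly what its lowercase partner rejects, and the period accepts all five.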

Validation
One common use of the Regex.IsMatch() method is in validation. For example, if you wanted to validate that a phone number was entered, a simple approach to that would be a pattern of “\d\d\d-\d\d\d\d” — in this case, the hyphen (-) acts as a literal hyphen and nothing else, so the pattern reads as, “Three digits, followed directly by a hyphen, followed directly by four more digits.” Of course, this requires the hyphen to exist; if it’s omitted from the input, there won’t be a match. To solve this, you can add a question mark after the hyphen, as “\d\d\d-?\d\d\d\d” shows. The question mark acts as a quantifier in this case, meaning “Match zero or one” of the preceding character, the hyphen.
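
The effect of the question mark can be seen by running both patterns over the same inputs, as in this small sketch. The variable names, module name and the “123--5555” input are made up for the example.

```vbnet
Imports System.Text.RegularExpressions

Module OptionalHyphenDemo
  Sub Main()
    Dim requiredHyphen As String = "\d\d\d-\d\d\d\d"
    Dim optionalHyphen As String = "\d\d\d-?\d\d\d\d"
    Dim candidates() As String = {"123-5555", "1235555", "123--5555"}
    For Each candidate As String In candidates
      Console.WriteLine("{0}: required = {1}, optional = {2}", candidate,
                        Regex.IsMatch(candidate, requiredHyphen),
                        Regex.IsMatch(candidate, optionalHyphen))
    Next
    ' Prints:
    ' 123-5555: required = True, optional = True
    ' 1235555: required = False, optional = True
    ' 123--5555: required = False, optional = False
  End Sub
End Module
```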

Quantifiers
There are a few other quantifiers aside from the question mark. The most commonly used is the asterisk, which means, “Match zero or more, as many as possible.” In the preceding example, an asterisk could have been used in place of the question mark like this: “\d\d\d-*\d\d\d\d”; this would result in a match regardless of how many hyphens were entered. For example, “123---5555” would result in a match.

Another quantifier almost the same as the asterisk is the plus sign, which means, “Match as many as possible, with a minimum of one.” If you were to use the plus instead of the asterisk, like this:

"\d\d\d-+\d\d\d\d"

a string of “123-5555” would result in a match, but “1235555” would not.
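
The three quantifiers seen so far can be compared side by side. This is a rough sketch with an invented module name. It uses the patterns from the text and three inputs that differ only in how many hyphens they contain.

```vbnet
Imports System.Text.RegularExpressions

Module QuantifierDemo
  Sub Main()
    ' Zero or one hyphen, zero or more hyphens, one or more hyphens.
    Dim patterns() As String = {"\d\d\d-?\d\d\d\d", "\d\d\d-*\d\d\d\d", "\d\d\d-+\d\d\d\d"}
    Dim candidates() As String = {"1235555", "123-5555", "123---5555"}
    For Each pattern As String In patterns
      For Each candidate As String In candidates
        Console.WriteLine("{0} on {1}: {2}", pattern, candidate,
                          Regex.IsMatch(candidate, pattern))
      Next
    Next
  End Sub
End Module
```

The question mark version accepts the first two inputs, the asterisk version accepts all three, and the plus version accepts only the two that contain at least one hyphen.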

There’s also a quantifier that allows you to be more precise in how many times something must be repeated for it to trigger a match. You can set these by using curly brackets, like this: “\d{3}-\d{4}”. This reads as, “Match exactly three digits, followed by one hyphen, followed by exactly four digits.” An alternative form allows a range to be specified instead. An example would be “\d{3,5}”, a pattern matching anywhere from three to five digits. Note that this pattern will also match if the input is “123456”, because the RegEx engine successfully finds the required number of digits within it.
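
To see why “123456” still counts as a match, it helps to print the text that was actually matched, not just whether a match occurred. The sketch below uses Regex.Match for that. The module name and the sample inputs other than “123456” are assumptions for this example.

```vbnet
Imports System.Text.RegularExpressions

Module RepetitionDemo
  Sub Main()
    ' Exact counts: three digits, a hyphen, four digits.
    Console.WriteLine(Regex.IsMatch("123-5555", "\d{3}-\d{4}"))  ' True
    Console.WriteLine(Regex.IsMatch("12-5555", "\d{3}-\d{4}"))   ' False

    ' A range: between three and five digits.
    Dim candidates() As String = {"12", "123", "12345", "123456"}
    For Each candidate As String In candidates
      Dim m As Match = Regex.Match(candidate, "\d{3,5}")
      Console.WriteLine("{0}: {1} '{2}'", candidate, m.Success, m.Value)
    Next
    ' Prints:
    ' 12: False ''
    ' 123: True '123'
    ' 12345: True '12345'
    ' 123456: True '12345'
  End Sub
End Module
```

For “123456” the engine takes the first five digits and stops, which is enough for the match to succeed even though a sixth digit follows.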

Grouping
As mentioned above, quantifiers always act on the preceding character or group. To allow for repetitions of more than one character, you can create a group by wrapping part of the pattern in parentheses. Back to the phone number example, try using the pattern “(\d\d\d-){1,2}\d\d\d\d” to allow for area codes to be entered. With this pattern, both “123-5555” and “444-123-5555” would match.

In Listing 2 below, I bring it all together and create a somewhat more robust phone validation pattern.

Listing 2: A More Robust Phone Validation Pattern

Imports System.Text.RegularExpressions

Module Module1
  Sub Main()
    ' One or two groups of "three digits and a hyphen", then four digits.
    Dim pattern As String = "(\d\d\d-){1,2}\d\d\d\d"
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop on Q (any case) or when input ends (ReadLine returns Nothing).
      If input Is Nothing OrElse input.ToUpper() = "Q" Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a valid phone number")
      Else
        Console.WriteLine("String does NOT contain a valid phone number")
      End If
    Loop
  End Sub
End Module
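
For a quick check of the grouped pattern without the console loop, something like the following sketch can be used. The module name and all inputs other than “123-5555” and “444-123-5555” are made up here. The last input shows that a match can be found inside a longer string.

```vbnet
Imports System.Text.RegularExpressions

Module GroupingDemo
  Sub Main()
    Dim pattern As String = "(\d\d\d-){1,2}\d\d\d\d"
    Dim candidates() As String = {"123-5555", "444-123-5555", "1235555", "999-444-123-5555"}
    For Each candidate As String In candidates
      ' Match.Value holds the portion of the input that the pattern matched.
      Dim m As Match = Regex.Match(candidate, pattern)
      Console.WriteLine("{0}: {1} '{2}'", candidate, m.Success, m.Value)
    Next
    ' Prints:
    ' 123-5555: True '123-5555'
    ' 444-123-5555: True '444-123-5555'
    ' 1235555: False ''
    ' 999-444-123-5555: True '444-123-5555'
  End Sub
End Module
```

Because the engine looks for the pattern anywhere in the input, the four-part number still reports a match on its last three parts, which is why the messages in Listing 2 say the string “contains” a valid phone number.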

Getting Started
Regular expressions are a deep topic, and this article only scratches the surface. While I’ve only shown their use in validation thus far, they’re also quite handy in many scenarios involving parsing or string manipulation. With a solid understanding, regular expressions will become an invaluable tool to have in your tool belt. Check back next month for the next article in this series, in which I go deeper into the internals of regular expressions and how you can get even more out of them.
