.net programming, computers and assorted technology rants

Posts tagged “Regular Expressions”

Regular Expressions: Part 3

Courtesy Ondrej Balas, VisualStudioMagazine.com

In Part 1 and Part 2 in this series about regular expressions, I went over some of the key features of regular expressions and how to use them in your code.

Regular expressions can be useful in other places, too, such as the find/replace feature of your favorite IDE or text editor, or even in business intelligence (BI).

Advanced Find/Replace
One often overlooked use of regular expressions is in the Find window of your IDE. Let’s say for example you had Visual Studio open and were tasked with ensuring all private fields in some C# code were prefixed with an underscore. You could find all private fields NOT prefixed by an underscore by searching for:

private (\w+) ([^_]\w*)

This pattern reads as “the word private, followed by a space, followed by one or more word characters, followed by a space, followed by any character that is not an underscore, followed by zero or more word characters.”

To then add the underscore to each of those instances, set the replacement string to:

private $1 _$2

This will replace any matches with the word private, followed by a space, followed by the first capture (the type of variable), followed by a space, then an underscore, and finally the second capture (the original variable name). To see what this looks like in the Visual Studio Find and Replace window, see Figure 1. Don’t forget to make sure you check the box for Use Regular Expressions.


Figure 1. The Visual Studio Find and Replace Window
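
The same pattern and replacement string also work in code through Regex.Replace. Here is a minimal sketch, assuming a few made-up C# field declarations as input (the sample lines and the module name are hypothetical, not taken from the article):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FieldRenameSketch
  Sub Main()
    ' Hypothetical lines of C# source; any text would do.
    Dim lines As String() = {"private int count;", "private string _name;", "private bool isReady;"}

    ' Same pattern and replacement string as in the Find and Replace window.
    Dim pattern As String = "private (\w+) ([^_]\w*)"
    Dim replacement As String = "private $1 _$2"

    For Each line As String In lines
      Console.WriteLine(Regex.Replace(line, pattern, replacement))
    Next
    ' Prints:
    '   private int _count;
    '   private string _name;    (already prefixed, so left alone)
    '   private bool _isReady;
  End Sub
End Module
```

Be aware that this pattern also matches other private members, such as “private void Save”, so it is worth reviewing each replacement rather than applying them all blindly.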

Other Uses

Another area in which regular expressions can come in handy is in simple BI. One of my clients regularly asks me to pull data from various sources, do some aggregation and put it into a spreadsheet. In doing this, I’ve seen a multitude of data sources: HTML, documents using custom markup languages, social network feeds, and even a zip file full of unrelated text files. And while many tools exist for parsing specific things such as HTML or XML, I usually turn to using regular expressions first, before switching to a more specific tool if necessary.

I generally use a free tool called Expresso, which allows me to quickly build the expression I’ll use for parsing.
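
To give a rough idea of what such parsing and aggregation can look like in code, here is a minimal sketch. The input text, the Region and Amount group names, and the module name are all made up for illustration; a real data source would need its own pattern.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text.RegularExpressions

Module AggregateSketch
  Sub Main()
    ' Hypothetical input: free-form text mentioning a region and a sale amount.
    Dim text As String = "North: $120.50; South: $80.00; North: $30.25"

    ' Named groups pull the two pieces out of each entry.
    Dim pattern As String = "(?<Region>\w+): \$(?<Amount>\d+\.\d\d)"

    ' Sum the amounts per region.
    Dim totals As New Dictionary(Of String, Decimal)
    For Each m As Match In Regex.Matches(text, pattern)
      Dim region As String = m.Groups("Region").Value
      Dim amount As Decimal = Decimal.Parse(m.Groups("Amount").Value, CultureInfo.InvariantCulture)
      Dim current As Decimal = 0D
      totals.TryGetValue(region, current) ' leaves current at 0 for a new region
      totals(region) = current + amount
    Next

    For Each entry As KeyValuePair(Of String, Decimal) In totals
      Console.WriteLine(entry.Key & ": " & entry.Value.ToString(CultureInfo.InvariantCulture))
    Next
  End Sub
End Module
```

From here the totals could be written out as CSV and opened in a spreadsheet.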

Read More…

http://visualstudiomagazine.com/articles/2014/03/01/regular-expressions-part-3.aspx


Regular Expressions: Part 2

Courtesy Ondrej Balas, Visual Studio Magazine

In my last column, I left off with an explanation of how groups can be used to divide a pattern into smaller pieces, or sub-expressions, allowing for repeating subsets of the pattern. But groups have other benefits as well, such as extracting information from a string. Take the following code, for example:

Dim input As String = "555-123-4567"
Dim pattern As String = "(\d\d\d)-\d\d\d-\d\d\d\d"
Dim match As Match = Regex.Match(input, pattern)
Dim areaCode As String = match.Groups(1).Value

In this code, there’s a set of parentheses around what would be the area code portion of the pattern, forming a capturing group. When the Regex engine returns the Match object, it puts all those groups into a Groups property on that Match object. You can then use the indexer on the property to get the group you want to access. Notice that I used an index of 1 to get the area code group. While the Groups property is zero-based, the first group (Groups(0)) is always a match of the entire regular expression. Then, each left parenthesis in the pattern gets a subsequent number; the first one will be in Groups(1), the second in Groups(2) and so on.

Using numeric indexers like this is fine when working with simple patterns, but complexity increases quickly as the Regex grows in size. It can also be problematic when changing the expression, as the groups may end up with different numbers as the pattern is changed. To solve this problem, you can optionally name the groups in the pattern, and then retrieve them by name rather than by number. Named groups have a special syntax, as shown here:

Dim input As String = "555-123-4567"
Dim pattern As String = "(?<AreaCode>\d\d\d)-\d\d\d-\d\d\d\d"
Dim match As Match = Regex.Match(input, pattern)
Dim areaCode = match.Groups("AreaCode").Value

The group that was (\d\d\d) is now (?<AreaCode>\d\d\d). In this case, the question mark directly after the left parenthesis tells the Regex engine that the group should follow special rules. Because it’s followed by a name within angle brackets, it knows to treat it as a named group. Now the group can be referenced by the name “AreaCode” instead of number (though it can still be accessed by number as well).

You may have noticed that I’ve been using the Value property of the Group object to get the matched text. The Group object also has a few other helpful properties, as listed in Table 1.

Table 1: The Properties of the Group Object

Property   Description
Captures   A collection containing sub-captures within the group
Index      The position at which this group matches within the input string
Length     The length of the captured string
Success    A Boolean that specifies whether the group matched or not
Value      The matching text
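
Here is a short sketch of these properties in use. The input strings, the optional Ext group and the module name are assumptions added for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupPropertiesSketch
  Sub Main()
    Dim input As String = "Call 555-123-4567 today"
    ' The Ext group is optional, so it may not take part in the match.
    Dim pattern As String = "(?<AreaCode>\d\d\d)-\d\d\d-\d\d\d\d(?<Ext> x\d+)?"
    Dim match As Match = Regex.Match(input, pattern)

    Dim areaCode As Group = match.Groups("AreaCode")
    Console.WriteLine(areaCode.Value)    ' 555
    Console.WriteLine(areaCode.Index)    ' 5 (zero-based position in the input)
    Console.WriteLine(areaCode.Length)   ' 3
    Console.WriteLine(areaCode.Success)  ' True

    Dim ext As Group = match.Groups("Ext")
    Console.WriteLine(ext.Success)       ' False: the optional group did not match
    Console.WriteLine(ext.Value = "")    ' True: an unmatched group has an empty Value

    ' A group under a quantifier can capture more than once.
    Dim repeated As Group = Regex.Match("444-123-5555", "(\d\d\d-){1,2}\d\d\d\d").Groups(1)
    Console.WriteLine(repeated.Value)           ' 123- (Value holds the last capture)
    Console.WriteLine(repeated.Captures.Count)  ' 2
  End Sub
End Module
```

Checking Success before reading Value is a good habit whenever a group is optional.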


Position
Up to this point I’ve glossed over positioning and the difference between matching a character and matching the position between two characters (also known as an anchor). The two most commonly used position characters are the caret (^), which matches the position before the first character in the string, and the dollar sign ($), which matches the position at the very end of the string (in .NET, $ also matches just before a single trailing newline; use \z to require the absolute end). Again, revisiting the phone number example, a pattern of \d\d\d-\d\d\d\d will match if a phone number appears anywhere within the input string. Even the string “abc123-4567def” would successfully be matched by that pattern. A better pattern would be “^\d\d\d-\d\d\d\d$”, which reads as: “The position before the first character, immediately followed by three digits, then a hyphen, then four more digits, and then immediately followed by the position after the last character.” The following code snippet demonstrates this:

Dim invalidInput As String = "abc123-4567def"
Dim validInput As String = "123-4567"
Dim pattern As String = "^\d\d\d-\d\d\d\d$"
Dim invalidMatchResult As Boolean = Regex.Match(invalidInput, pattern).Success 'False
Dim validMatchResult As Boolean = Regex.Match(validInput, pattern).Success 'True
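
One subtlety matters for validation: in .NET, the dollar sign also matches just before a single line feed at the end of the string, while \z matches only at the absolute end. Here is a small sketch of the difference (the input values and module name are made up for illustration):

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module AnchorSketch
  Sub Main()
    ' A value with a trailing line feed, as can happen with file or form input.
    Dim input As String = "123-4567" & vbLf

    Console.WriteLine(Regex.IsMatch(input, "^\d\d\d-\d\d\d\d$"))       ' True: $ tolerates a final line feed
    Console.WriteLine(Regex.IsMatch(input, "^\d\d\d-\d\d\d\d\z"))      ' False: \z requires the absolute end
    Console.WriteLine(Regex.IsMatch("123-4567", "^\d\d\d-\d\d\d\d\z")) ' True
  End Sub
End Module
```

If stray newlines must be rejected, \z is the stricter choice. Otherwise $ is usually fine.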

Another positional character is \b, or word boundary. Word boundary (\b) matches successfully for a position between a word character and a non-word character (or between a word character and the start or end of the string), where a word character is defined as any alphanumeric character or underscore. An example of this would be a search for how often the word “bot” appears within a log file. With a pattern of “bot”, words such as “robot” would match as well, leading to an inaccurate word count. A pattern that would avoid this would be “\bbot\b”, which reads as: “A word boundary, immediately followed by the word bot, immediately followed by another word boundary.” Here’s a usage example:

Dim input = "search bot | robots.txt"
Dim simplePattern = "bot"
Dim betterPattern = "\bbot\b"
Dim simpleCount As Integer = Regex.Matches(input, simplePattern).Count '2
Dim betterCount As Integer = Regex.Matches(input, betterPattern).Count '1

Using the simple pattern, the engine returns a count of 2 instances of the word “bot,” when one instance is just its occurrence within the word “robots.” By requiring word boundaries before and after the word “bot,” the engine returns the correct count of 1.

Greedy or Lazy
In my last column, I showed off many of the quantifiers that Regex offers, such as the asterisk (*). To get the most out of quantifiers, it’s important to understand the distinction between greedy and lazy behaviors. By default, quantifiers behave in a “greedy” fashion, meaning they consume as many characters as possible. It’s possible to individually change that behavior to “lazy” by following them with a question mark. Consider this example:

Dim input As String = "http://www.example.com/samples/demo.html"
Dim greedyPattern As String = "http://(.*)/"
Dim lazyPattern As String = "http://(.*?)/"
Dim greedyMatch As String = Regex.Match(input, greedyPattern).Value 'http://www.example.com/samples/
Dim lazyMatch As String = Regex.Match(input, lazyPattern).Value 'http://www.example.com/

The patterns are almost identical, with the difference being that the greedy pattern uses a group of (.*) while the lazy pattern uses (.*?). The results are quite different, however. When using the greedy pattern, the resulting match was “http://www.example.com/samples/”, but with the lazy pattern it was “http://www.example.com/”. The difference is in the way the Regex engine steps through the input string to find a match.

When parsing the greedy expression, the engine matches the http:// and starts stepping through the input, matching the period (any character) as many times as it can. It will do this until it reaches the end of the string, and then attempt to match the slash. Because there’s no slash after the end of the string, the engine will start backtracking, giving up one character at a time, until it finds a slash. See Figure 1 for a simplified example of how this might be parsed by the Regex engine.

Figure 1. Simplified example of behavior when matching the greedy expression.

The engine deals with the lazy expression much differently. Instead of matching as much as it can, it matches as little as it can get away with while still matching the slash (/) following the quantifier. Figure 2 shows a simplified example of this behavior.

Figure 2. Simplified example of behavior when matching the lazy expression.
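
The difference is easiest to see by looking at what the group itself captured. The sketch below reuses the same input and patterns. The third pattern, which uses a negated character class, is an alternative I've added for comparison rather than something covered above, and the module name is hypothetical:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyLazySketch
  Sub Main()
    Dim input As String = "http://www.example.com/samples/demo.html"

    ' Greedy: the group runs up to the LAST slash it can still match before.
    Console.WriteLine(Regex.Match(input, "http://(.*)/").Groups(1).Value)      ' www.example.com/samples

    ' Lazy: the group stops before the FIRST slash.
    Console.WriteLine(Regex.Match(input, "http://(.*?)/").Groups(1).Value)     ' www.example.com

    ' Alternative: "anything that is not a slash" gives the same result as
    ' the lazy version here without relying on lazy matching.
    Console.WriteLine(Regex.Match(input, "http://([^/]*)/").Groups(1).Value)   ' www.example.com
  End Sub
End Module
```

When the stopping character is known, the negated class often states the intent more directly than a lazy quantifier.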

Tools
Regular expressions can be difficult both to write and to read; fortunately, there are some great tools that can help. To jump-start your understanding of more complex expressions, I recommend a free tool called Expresso.

If you’re interested in a deeper understanding of how the engine handles your expressions, or just want to debug a complex expression, try out RegExpose, an open source tool written by Brian Friesen and available on GitHub.

Advanced Scenarios
In the next part of this series, I’ll be exploring some advanced scenarios for regular expressions, such as using them as part of the Find & Replace feature in Visual Studio, or in applying them to business intelligence. I hope you find as much value in regular expressions as I continue to, year after year.

 

 

 


Regular Expressions: Part 1

Courtesy Ondrej Balas, Visual Studio Magazine

Regular expressions — those scary strings that might as well be written in Klingon to the average person — can be a vast time-saver. They help in one of the most common tasks of programming: string manipulation. The .NET Framework has an excellent, built-in regular expressions engine that’s relatively straightforward to use.

The use and understanding of regular expressions (hereafter referred to as RegEx) traditionally comes in three parts: testing a string to see if a pattern exists within it (pattern matching); reading strings and extracting useful information; and manipulating and making changes to those strings.

Pattern Matching
The first thing you’ll likely want to use a RegEx for is to do pattern matching, the simplest example of which is determining whether a particular set of characters exists within a string. In fact, if you’ve ever used the “Find” feature in any text editor, you’ve effectively used a regular expression composed only of the most basic pieces: literals. In a RegEx pattern, a literal is a character that must be matched exactly. Searching for the pattern “BCD” within the string “ABCDE” would result in a match, for example. In this case, the search pattern would read as, “The literal B, immediately followed by the literal C, immediately followed by the literal D.”

As beneficial as that is, there are certainly better and simpler ways of accomplishing a simple search without the use of regular expressions. The power of RegEx doesn’t become apparent until you start using elements like character classes, or groups of characters that can be accepted as a match. Character classes are denoted by the brackets surrounding them. Take the character class [0123456789] for example, which instructs the RegEx engine to match any character that is a digit from 0 to 9. Alternatively, you could use [0-9], which has the same meaning to the engine, but is more readable to a human. RegEx additionally has some helpful shortcuts for commonly used classes like this one. [0-9] can be further shortened by using \d, the shortcut for “Match any numeric character.” (Strictly speaking, \d in .NET matches any Unicode decimal digit, not only 0 through 9, unless the RegexOptions.ECMAScript option is used.)

The Regex Object

Using RegEx in your code requires the Regex class, located within the System.Text.RegularExpressions namespace. For a usage example, see Listing 1 below. The code in this example reads a string as input from the console, then shows a response depending on whether the string contains a numeric character anywhere within it (using \d as the pattern). You can experiment with different things in the pattern, such as \d\d, which will match only if two digits are next to each other somewhere in the string (e.g., “12ab” will match, but “1a2” will not).

Listing 1: Test To See If the String Entered Contains a Number

Imports System
Imports System.Text.RegularExpressions

Module Module1
  Sub Main()
    Dim pattern As String = "\d" ' matches any single digit
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop at end of input, or when the user types Q (in either case)
      If input Is Nothing OrElse input.Equals("Q", StringComparison.OrdinalIgnoreCase) Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a number")
      Else
        Console.WriteLine("String does NOT contain a number")
      End If
    Loop
  End Sub
End Module

Wildcards
In RegEx, wildcards are really just shortcuts to character classes. Just like \d is a shortcut for [0-9], there are other shortcuts, too. See Table 1 below for a list of the most commonly used character classes. Note that the period is special in that it matches any character except the line break by default.

Table 1: A List of Character Class Shortcuts

Shortcut   Class             Description
\d         [0-9]             numeric character
\D         [^0-9]            NOT a numeric character
\w         [a-zA-Z0-9_]      “word” character
\W         [^a-zA-Z0-9_]     NOT a “word” character
\s         [ \f\n\r\t\v]     whitespace character
\S         [^ \f\n\r\t\v]    NOT a whitespace character
.          [^\n]             Any character except line break
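
To get a feel for these shortcuts, here is a small sketch to experiment with. The sample strings and module name are made up. The last two lines show a .NET detail worth knowing: by default \d also accepts digits from other Unicode scripts, and RegexOptions.ECMAScript narrows it to 0 through 9.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module ShortcutSketch
  Sub Main()
    Console.WriteLine(Regex.IsMatch("a_1", "\w\w\w"))   ' True: letter, underscore and digit are all word characters
    Console.WriteLine(Regex.IsMatch("a b", "a\sb"))     ' True: the space is whitespace
    Console.WriteLine(Regex.IsMatch("a-b", "a\Wb"))     ' True: the hyphen is not a word character
    Console.WriteLine(Regex.IsMatch("abc", "\d"))       ' False: no digit anywhere

    ' The period skips line breaks unless RegexOptions.Singleline is set.
    Dim twoLines As String = "a" & vbLf & "b"
    Console.WriteLine(Regex.IsMatch(twoLines, "a.b"))                           ' False
    Console.WriteLine(Regex.IsMatch(twoLines, "a.b", RegexOptions.Singleline))  ' True

    ' In .NET, \d covers every Unicode decimal digit; ECMAScript mode narrows it to 0-9.
    Dim arabicThree As String = ChrW(&H663) ' ARABIC-INDIC DIGIT THREE
    Console.WriteLine(Regex.IsMatch(arabicThree, "\d"))                          ' True
    Console.WriteLine(Regex.IsMatch(arabicThree, "\d", RegexOptions.ECMAScript)) ' False
  End Sub
End Module
```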

Validation
One common use of the Regex.IsMatch() method is in validation. For example, if you wanted to validate that a phone number was entered, a simple approach to that would be a pattern of “\d\d\d-\d\d\d\d” — in this case, the hyphen (-) acts as a literal hyphen and nothing else, so the pattern reads as, “Three digits, followed directly by a hyphen, followed directly by four more digits.” Of course, this requires the hyphen to exist; if it’s omitted from the input, there won’t be a match. To solve this, you can add a question mark after the hyphen, as “\d\d\d-?\d\d\d\d” shows. The question mark acts as a quantifier in this case, meaning “Match zero or one” of the preceding character, the hyphen.

Quantifiers
There are a few other quantifiers aside from the question mark. The most commonly used is the asterisk, which means, “Match zero or more, as many as possible.” In the preceding example, an asterisk could have been used in place of the question mark like this: “\d\d\d-*\d\d\d\d”; this would result in a match regardless of how many hyphens were entered. For example, “123----5555” would result in a match.

Another quantifier almost the same as the asterisk is the plus sign, which means, “Match as many as possible, with a minimum of one.” If you were to use the plus instead of the asterisk, like this:

"\d\d\d-+\d\d\d\d"

a string of “123-5555” would result in a match, but “1235555” would not.

There’s also a quantifier that allows you to be more precise in how many times something must be repeated for it to trigger a match. You can set these by using curly brackets, like this: “\d{3}-\d{4}”. This reads as, “Match exactly three digits, followed by one hyphen, followed by exactly four digits.” Another alternative usage allows a range to be specified instead. An example would be “\d{3,5}”, a pattern matching anywhere from three to five digits. Note that this pattern will also match if the input is “123456”, because the RegEx engine successfully finds the required number of digits within it: it matches “12345” and simply leaves the final “6” unmatched.
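
The quantifiers covered so far can be compared side by side. The following is just a sketch to experiment with, and the sample inputs and module name are made up:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierSketch
  Sub Main()
    ' ? = zero or one hyphen
    Console.WriteLine(Regex.IsMatch("1235555", "\d\d\d-?\d\d\d\d"))    ' True
    Console.WriteLine(Regex.IsMatch("123--5555", "\d\d\d-?\d\d\d\d"))  ' False: two hyphens are too many

    ' * = zero or more hyphens
    Console.WriteLine(Regex.IsMatch("123--5555", "\d\d\d-*\d\d\d\d"))  ' True

    ' + = one or more hyphens
    Console.WriteLine(Regex.IsMatch("1235555", "\d\d\d-+\d\d\d\d"))    ' False: at least one hyphen is required

    ' {n} = exactly n, {n,m} = between n and m
    Console.WriteLine(Regex.IsMatch("123-5555", "\d{3}-\d{4}"))        ' True
    Console.WriteLine(Regex.Match("123456", "\d{3,5}").Value)          ' 12345 (the trailing 6 is left over)
  End Sub
End Module
```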

Grouping
As mentioned above, quantifiers always act on the preceding character or group. To allow for repetitions of more than one character, you can create something called a grouping. Back to the phone number example, try using the pattern “(\d\d\d-){1,2}\d\d\d\d” to allow for area codes to be entered. With this pattern, both “123-5555” and “444-123-5555” would match.

In Listing 2 below, I bring it all together and create a somewhat more robust phone validation pattern.

Listing 2: A More Robust Phone Validation Pattern

Imports System
Imports System.Text.RegularExpressions

Module Module1
  Sub Main()
    ' One or two groups of "three digits and a hyphen", then four digits
    Dim pattern As String = "(\d\d\d-){1,2}\d\d\d\d"
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop at end of input, or when the user types Q (in either case)
      If input Is Nothing OrElse input.Equals("Q", StringComparison.OrdinalIgnoreCase) Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a valid phone number")
      Else
        Console.WriteLine("String does NOT contain a valid phone number")
      End If
    Loop
  End Sub
End Module

Getting Started
Regular expressions are a deep topic, and this article only scratches the surface. While I’ve only shown their use in validation thus far, they’re also quite handy in many scenarios involving parsing or string manipulation. With a solid understanding, regular expressions will become an invaluable tool to have in your tool belt. Check back next month for the next article in this series, in which I go deeper into the internals of regular expressions and how you can get even more out of them.