.net programming, computers and assorted technology rants

Archive for February, 2014

360 million recently compromised passwords for sale online

Courtesy Dan Goodin, Ars Technica

Underscoring the insecurity of many online dating, job, and e-mail services, security researchers said that they have tracked almost 360 million compromised login credentials for sale in underground crime forums over the past three weeks.

The haul, which included an additional 1.25 billion records containing only e-mail addresses, came from multiple breaches, according to a statement posted Tuesday by Hold Security. The biggest single list contained 105 million details, making it among the bigger online finds, the firm told Reuters. The cache included e-mail addresses that most likely served as user names and corresponding passwords. It remains unclear what service the account credentials unlock.

Hold Security is the same firm that in October discovered the circulation of 153 million user names and passwords stolen during a massive breach of Adobe’s corporate network. A month later, the security firm uncovered 42 million plaintext passwords taken during a hack on niche dating service Cupid Media.

At 360 million, Hold Security’s latest find is big enough that it likely also came from hacks on poorly secured Web service servers that store large caches of user credentials. The risk from these types of attacks is greatest for users who choose the same password for multiple services. Once an attacker has someone’s e-mail address and password for one site, the credentials can be used to compromise any other account that uses the same user name and password. Ars has long advised readers to use a long, randomly generated password that’s unique for each online account. You can find a much more detailed how-to here.

Tor develops its own anonymous IM tool

Courtesy Sean Gallagher, Ars Technica

The Tor Foundation is moving forward with a plan to provide its own instant messaging service. Called the Tor Instant Messaging Bundle, the tool will allow people to communicate in real time while preserving anonymity by using chat servers concealed within Tor’s hidden network.

In planning since last July—as news of the National Security Agency’s broad surveillance of instant messaging traffic emerged—the Tor Instant Messaging Bundle (TIMB) should be available in experimental builds by the end of March, based on a roadmap published in conjunction with the Tor Project’s Winter Dev meeting in Iceland.

TIMB will connect to instant messaging servers configured as Tor “hidden services” as well as to commercial IM services on the open Internet.

The effort, which is funded by an anonymous donor organization, was originally called Attentive Otter. To ensure the anonymity of the user, TIMB will force all instant messaging traffic through the Tor network, regardless of whether it’s aimed at a server on the Tor network or not. TIMB will be based on Instantbird, an open source instant messaging tool which is itself based on Mozilla’s XULRunner cross-platform runtime environment.

Instantbird was chosen after the TIMB team decided against using Pidgin or libpurple, the GPL open-source instant messaging library used by Pidgin and Adium, mostly because of the amount of effort that would have been required to audit and maintain the library, and also because of some concerns about how seriously Pidgin’s developers took security concerns. The TIMB project will remove libpurple from Instantbird, a task that the Mozilla and Instantbird team were already working toward as they move the software to a pure JavaScript implementation.

The first experimental release of TIMB won’t include “off the record” (OTR) capability. OTR mode encrypts traffic further and uses an exchange of digital signatures to verify the identity of each party. But the signatures can’t be checked by anyone outside the instant messaging session and can’t be used to prove identity outside the session. The Tor team is hoping to develop OTR components for Instantbird and get them merged into future versions of the main Instantbird code line.


Service Bus for Windows Server: A Primer, Part II

Courtesy Lei Zhong, blog.appliedis.com/

In my last post on Service Bus for Windows, we covered the overview, installation and configuration of SBWS. Now it’s time to dive into the API and see it in action. The code snippets in this post will focus on Service Bus queues. Beyond queues, Service Bus also supports topics and subscriptions, which let each subscriber independently retrieve a filtered view of the published message stream.

Project References

To set up the development environment, right-click References in Solution Explorer, then click Add Library Package Reference. Search for the “ServiceBus.v” keyword to narrow the results, as there are many Service Bus related NuGet packages. As of this writing, the latest version is 1.1. Make sure to download the version that matches your Service Bus installation.

A reference to System.Runtime.Serialization.dll is also needed for the code snippets below.

Queue Operations

In the following code snippets, we are going to show detailed steps on how a queue is created, and how messages are sent to and retrieved from the queue.

1. Get the user credential.

// Populate NetworkCredential (user name, password, domain)
NetworkCredential credential = new NetworkCredential("lei.zhong", "mypassword", "appliedis.com");

2. Create service bus and security token service end points.

// Service bus (sb) endpoint, used by the MessagingFactory
var sbUriList = new List<Uri>() { new UriBuilder { Scheme = "sb", Host = "mylaptop.appliedis.com", Path = "ServiceBusDefaultNamespace" }.Uri };

// Security token service (https) endpoint, used by the token provider
var httpsUriList = new List<Uri>() { new UriBuilder { Scheme = "https", Host = "mylaptop.appliedis.com", Path = "ServiceBusDefaultNamespace", Port = 9355 }.Uri };

3. Create the MessagingFactory. The factory will be used later to create the queue client that sends and receives messages. Note that two different endpoints are involved: the token provider uses the HTTPS endpoint, while the MessagingFactory uses the sb endpoint.

TokenProvider tokenProvider = TokenProvider.CreateOAuthTokenProvider(httpsUriList, credential);

MessagingFactory messagingFactory = MessagingFactory.Create(sbUriList, tokenProvider);

4. Create a queue. This involves a few steps.

First, create ServiceBusConnectionStringBuilder.

ServiceBusConnectionStringBuilder connBuilder = new ServiceBusConnectionStringBuilder { ManagementPort = 9355, RuntimePort = 9354 };
connBuilder.Endpoints.Add(sbUriList[0]);
connBuilder.StsEndpoints.Add(httpsUriList[0]);

Next, create a NamespaceManager.

NamespaceManager namespaceManager = NamespaceManager.CreateFromConnectionString(connBuilder.ToString());
namespaceManager.Settings.TokenProvider = tokenProvider;

const string newQueueName = "MyQueue";

// Nothing more to do if the queue already exists
if (namespaceManager.QueueExists(newQueueName)) return;

Finally, let’s create the queue. Queue parameters are wrapped in the QueueDescription class.

var queueDescription = new QueueDescription(newQueueName);
queueDescription.LockDuration = new TimeSpan(0, 1, 0); // 1 minute
queueDescription.MaxDeliveryCount = 3;

namespaceManager.CreateQueue(queueDescription);

Note: LockDuration is the duration of a peek lock on a message, that is, the amount of time the message stays locked and hidden from other receivers. The maximum value for LockDuration is five minutes. MaxDeliveryCount is the maximum delivery count after which a message is automatically dead-lettered. We’ll revisit the dead letter issue later.

5. Let’s send a message to the queue. The BrokeredMessage class can wrap any serializable object type, but for demo purposes let’s just send a string. I strongly recommend storing the same type of object in a given queue to make retrieval easier, which makes sense from a business perspective anyway.

Since we already have the MessagingFactory configured, creating a queue client from it takes just one line of code. You can ignore the dead letter client for now, but it is also straightforward.

QueueClient queueClient = messagingFactory.CreateQueueClient(newQueueName, ReceiveMode.PeekLock);
QueueClient deadLetterQueueClient = messagingFactory.CreateQueueClient(QueueClient.FormatDeadLetterPath(queueClient.Path), ReceiveMode.ReceiveAndDelete);

queueClient.Send(new BrokeredMessage("Hello, service bus!"));

Here, ReceiveMode.PeekLock keeps the message locked until the receiver completes or abandons it (or the lock expires), while ReceiveAndDelete deletes the message as soon as it is received. The first mode allows a message to be received (and thus processed) more than once. I use this mode in scenarios where the processing of a message may fail and the message needs to be returned to the queue for another retrieval.

6. Now, let’s receive the message. The body of the BrokeredMessage will be deserialized to the correct object type (string in our case).

BrokeredMessage message = queueClient.Receive();
string messageBody = message.GetBody<string>();
Console.WriteLine(messageBody);
Console.Read();

As you might have expected, there is a batch receive method, ReceiveBatch, which returns IEnumerable<BrokeredMessage>. It has three overloaded flavors, documented here.

At this point, the message can be marked completed by calling message.Complete(), or be returned to the queue by calling message.Abandon(), depending on whether your business logic requires the message to be consumed/processed again.
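To make the difference between the two receive modes concrete, here is a toy in-memory model written as C# top-level statements. It does not call the Service Bus API: the queue is a plain Queue<string>, and the helper ReceiveOnce and the message names are made up for illustration. Treat it as a sketch of the semantics only, under the assumption that processing fails on the first receive.

```csharp
using System;
using System.Collections.Generic;

// Toy model (NOT the Service Bus API): returns what is left in the queue
// after a single receive attempt in the given mode.
static List<string> ReceiveOnce(bool peekLock, bool processingSucceeds)
{
    var queue = new Queue<string>(new[] { "order-1", "order-2" });

    if (peekLock)
    {
        // PeekLock: the message stays in the queue (locked) while it is processed.
        queue.Peek();
        if (processingSucceeds)
        {
            queue.Dequeue();   // like Complete(): the message is removed for good
        }
        // Otherwise, like Abandon(): the lock is released and the message stays.
    }
    else
    {
        // ReceiveAndDelete: the message is removed as soon as it is received,
        // so a failure during processing loses it.
        queue.Dequeue();
    }

    return new List<string>(queue);
}

var afterPeekLock = ReceiveOnce(peekLock: true, processingSucceeds: false);
var afterReceiveAndDelete = ReceiveOnce(peekLock: false, processingSucceeds: false);

Console.WriteLine(string.Join(", ", afterPeekLock));          // order-1, order-2
Console.WriteLine(string.Join(", ", afterReceiveAndDelete));  // order-2
```

With PeekLock the failed message is still available for another attempt; with ReceiveAndDelete it is gone, which is why that mode only suits messages you can afford to lose or process exactly once.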

Here’s a more complete code block:

// Namespaces used by this snippet
using System;
using System.Collections.Generic;
using System.Net;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

// Populate NetworkCredential (user name, password, domain)
NetworkCredential credential = new NetworkCredential("lei.zhong", "mypassword", "appliedis.com");

// Service bus (sb) and security token service (https) endpoints
var sbUriList = new List<Uri>() { new UriBuilder { Scheme = "sb", Host = "mylaptop.appliedis.com", Path = "ServiceBusDefaultNamespace" }.Uri };
var httpsUriList = new List<Uri>() { new UriBuilder { Scheme = "https", Host = "mylaptop.appliedis.com", Path = "ServiceBusDefaultNamespace", Port = 9355 }.Uri };

TokenProvider tokenProvider = TokenProvider.CreateOAuthTokenProvider(httpsUriList, credential);
MessagingFactory messagingFactory = MessagingFactory.Create(sbUriList, tokenProvider);

ServiceBusConnectionStringBuilder connBuilder = new ServiceBusConnectionStringBuilder { ManagementPort = 9355, RuntimePort = 9354 };
connBuilder.Endpoints.Add(sbUriList[0]);
connBuilder.StsEndpoints.Add(httpsUriList[0]);

NamespaceManager namespaceManager = NamespaceManager.CreateFromConnectionString(connBuilder.ToString());
namespaceManager.Settings.TokenProvider = tokenProvider;

// Create the queue if it does not exist yet
const string newQueueName = "MyQueue";
if (!namespaceManager.QueueExists(newQueueName))
{
    var queueDescription = new QueueDescription(newQueueName);

    queueDescription.LockDuration = new TimeSpan(0, 1, 0); // 1 minute
    queueDescription.MaxDeliveryCount = 3;
    namespaceManager.CreateQueue(queueDescription);
}

// Send a message to the queue
QueueClient queueClient = messagingFactory.CreateQueueClient(newQueueName, ReceiveMode.PeekLock);
QueueClient deadLetterQueueClient = messagingFactory.CreateQueueClient(QueueClient.FormatDeadLetterPath(queueClient.Path), ReceiveMode.ReceiveAndDelete);

queueClient.Send(new BrokeredMessage("Hello, service bus!"));

// Receive the message and read its body
BrokeredMessage message = queueClient.Receive();

string messageBody = message.GetBody<string>();
Console.WriteLine(messageBody);

// Mark the peek-locked message as processed so it is not delivered again
message.Complete();

Console.Read();

Dead Letter

The dead letter queue can be considered an internal, shadow queue of a normal queue. It is automatically created when a queue is created. The dead letter queue is where a message eventually ends up if its delivery count exceeds the specified maximum delivery count.

When creating a queue, we can specify the maximum delivery count. This value is immutable once a queue is created.

queueDescription.MaxDeliveryCount = 3;
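The effect of MaxDeliveryCount can be sketched with another toy in-memory model in C#. Again, this is not the Service Bus API: the tuple queue, the deadLetterQueue list and the "poison message" body are illustrative, and the sketch assumes a message is dead-lettered once it has been delivered MaxDeliveryCount times without being completed.

```csharp
using System;
using System.Collections.Generic;

// Toy model (NOT the Service Bus API) of a queue with MaxDeliveryCount = 3.
const int maxDeliveryCount = 3;
var queue = new Queue<(string Body, int DeliveryCount)>();
var deadLetterQueue = new List<string>();
queue.Enqueue(("poison message", 0));

int attempts = 0;
while (queue.Count > 0)
{
    var (body, deliveryCount) = queue.Dequeue();
    deliveryCount++;   // every receive counts as one delivery
    attempts++;

    bool processedSuccessfully = false;   // this receiver always fails
    if (processedSuccessfully) continue;  // like Complete(): the message is gone

    if (deliveryCount >= maxDeliveryCount)
    {
        deadLetterQueue.Add(body);             // the broker gives up on the message
    }
    else
    {
        queue.Enqueue((body, deliveryCount));  // like Abandon(): back in the queue
    }
}

Console.WriteLine($"Delivery attempts: {attempts}");             // Delivery attempts: 3
Console.WriteLine($"Dead-lettered: {deadLetterQueue.Count}");    // Dead-lettered: 1
```

A low MaxDeliveryCount keeps a message that can never be processed from cycling through the queue forever, at the cost of needing something to drain the dead letter queue.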

You may explicitly put a message into the dead letter queue:

message.DeadLetter();

The dead letter queue client is created like this:

deadLetterQueueClient = messagingFactory.CreateQueueClient(QueueClient.FormatDeadLetterPath(queueClient.Path), ReceiveMode.ReceiveAndDelete);

Note here that ReceiveMode.ReceiveAndDelete is used simply because I only want to take one shot at processing the message.

Service Bus Explorer

In my first post on SBWS I mentioned Service Bus Explorer as a helpful administrative tool. The source code can be found here. The code is well laid out, but the UI leaves a lot to be desired (for instance, there is no way to refresh all queues with a single keystroke). Feel free to tweak it to meet your needs.

Note: If you are using SBWS 1.0, use the source code version 1.8 included in the download.

One feature I wanted was to view the content of a message upon peek or receive from the popup menu.

The relevant code is in the GetMessageAndProperties method in ServiceBusHelper.cs. Basically, you supply the correct generic type here:

T content = message.GetBody<T>();

Once you get hold of the strongly typed content, the detailed information of the object can be viewed.

Summary

In this post we have covered the basic operations of Service Bus queues to get you started. In the third (and final) post of the series, we’ll discuss some real world technical issues. Stay tuned.


Service Bus for Windows Server: A Primer

Courtesy Lei Zhong, blog.appliedis.com

An Enterprise Service Bus (ESB) is a shared messaging layer that gives you a consistent, scalable and flexible means of coordinating across disparate, loosely-connected services to execute business processes. Over the years, Microsoft has developed several service bus technologies:

BizTalk: A messaging and workflow orchestration platform to build ESB behaviors and capabilities. The BizTalk ESB toolkit provides a set of guidelines, patterns and tools.

Windows Azure Service Bus (ASB): This provides the messaging infrastructure for applications that live in the cloud, in your data center, and across devices and PCs.

Service Bus for Windows Server (SBWS):  SBWS is based on ASB and shares many of the same capabilities, such as queue, topic and subscription support.  A distinct design goal is to ensure symmetry between SBWS and ASB and allow for a single set of code to be leveraged across both deployment environments.

Installation

Unlike ASB, which runs in the cloud and requires no local software installation, SBWS runs on-premises and must be installed and configured on a Windows Server 2012 or Server 2008 R2 machine. (It can also be installed on Win7 or Win8 for individual developer usage.)  For message persistence, it requires SQL Server 2012 or 2008, unlike Microsoft Message Queuing (MSMQ), which is file-based.

To install, you may start from a contextual link or from Web Platform Installer. The contextual link is here. Here’s how to install with Web Platform Installer:

  1. Launch Web Platform Installer, and search for “Service Bus 1.0”
  2. Click Add for Service Bus 1.0.  We’ll install the Cumulative Update after this.
  3. Follow the steps. You may be prompted to install .NET 4.5 if it’s not already installed. Reboot the machine if directed to.
  4. Install the latest Service Bus 1.0 Cumulative Update. (As of 4/16/2013, Update 1 is available.)

Configuration

  1. In Windows All Programs, select “Service Bus Configuration”
  2. Under Create a New Farm, select “Using Default Settings”
  3. Follow the steps to enter SQL Server Name and Test Connection, Service Account UserID and Password, and Certificate Generation Key. Leave “Enable firewall rules on this computer” checked.
  4. Click the “Next” arrow to continue, and an installation summary page should be displayed. Click the button with a check icon to commit the configuration.
  5. Verify installation was successful and save a copy of the installation log using the “View Log” feature. This includes the Primary Symmetric Key and endpoint connection strings.
  6. In addition, go to the SQL instance you have chosen and verify that three databases have been created:
    • SbGatewayDatabase
    • SbManagementDB
    • SBMessageContainer01

Administration

PowerShell Cmdlets

PowerShell Cmdlets are used to manage SBWS and ASB.  Developers and system administrators should get familiar with these commands. The complete command reference is here.

Service Bus Explorer

To make it easy to manage the messages and namespaces via GUI, Microsoft created a desktop application called Service Bus Explorer. This is a very convenient tool for developers and system administrators alike. It works for both ASB and SBWS. I use it on a daily basis when developing service bus code.

Setup for Working with Service Bus Remotely

Client Certificate

Developers can work with a local service bus without certificates.  However, to connect to a service bus hosted remotely, the client machine should have the proper client certificates. The certificates must be exported from the host machine. Follow the steps here.

In the Local Computer Certificate console, simply import these two files into the Personal and Trusted Root Certification Authorities stores, respectively.

ManageUsers Setting

To create a queue or send messages on a remotely hosted service bus, the client account should also be part of the ManageUsers group in the farm namespace.  To get the current ManageUsers, run this cmdlet:

Get-SBNamespace -Name ServiceBusDefaultNamespace

To add users to the namespace, run this cmdlet:

Set-SBNamespace -Name ServiceBusDefaultNamespace -ManageUsers mydomain\username1,mydomain\username2

In my next post, we are going to dive into coding.


Watson Going Mobile with Developer Challenge

Courtesy

IBM chief Ginni Rometty addresses attendees at Mobile World Congress 2014 in Barcelona.

(Credit: IBM)

IBM wants to bring its Jeopardy-winning cognitive computing system Watson to the mobile industry.

During a keynote address at Mobile World Congress 2014, IBM CEO Ginni Rometty announced the IBM Watson Mobile Developer Challenge, a global competition to promote the development of mobile consumer and business apps powered by Watson.

During the next three months, IBM is calling on software developers who are willing to develop and bring to market a commercial application that leverages Watson capabilities, such as the ability to answer complex questions posed in natural language with speed, accuracy, and confidence. Three winners will receive IBM support to further develop their apps and bring them to market.

Rometty explained that Watson, first developed by IBM researchers to show what was possible in combining cognitive computing and natural language processing, has become far more than the novelty and headline-grabbing artificially intelligent computer system that competed against Jeopardy champions on TV a few years ago.

Since then, the company has created a Watson division, and IBM has been pouring more money into the developments to commercialize the technology. But in addition to continuing its own research and commercializing elements of Watson, IBM is also reaching out to a broader ecosystem of customers, partners, and developers to come up with their own creative applications for Watson.

IBM’s Watson during its 2011 appearance on Jeopardy.

(Credit: Screenshot by Marguerite Reardon/CNET)

The technology is already being used in several industries, including banking, health care, and retail. For instance, at Memorial Sloan-Kettering Cancer Center in New York City, oncologists are using the technology to help diagnose and treat cancer patients. Using the Watson “cloud” service, the doctors feed Watson data on clinical trials; information regarding treatments; and personal statistics on patients, which the cognitive computing engine uses to provide feedback on treatments. IBM showed a video in which a doctor at Sloan-Kettering asked Watson for a revised course of action for treatment of a patient, speaking in natural language to make the request. And then Watson answered with options for an individualized treatment plan.

Watson isn’t replacing the need for a doctor, the oncologist in the video pointed out. Instead it presents more options to help the real doctors make more-informed decisions.

In an onstage interview with tech journalist David Kirkpatrick, Rometty talked about how Watson is being used in retail. She described how the outdoor clothing company The North Face is using it to help customers buy equipment and apparel.

She demonstrated the service by telling Watson about a trip she planned to Patagonia. Watson answered with recommendations for the type of clothing she needed and the backpack she should use. It also told her to get an ABS. She said she wasn’t sure what that was and looked on a typical search engine for an explanation. She said the Web search request brought back dozens of explanations about antilock braking systems on cars.

Clearly this was not the ABS that the North Face Watson application was recommending. She asked Watson what ABS was. And she was told in plain spoken language that it is a special emergency airbag system used by hikers and skiers during an avalanche.

“Watson knew I wasn’t asking about antilock brakes,” she said. She explained that the service was intelligent enough to put her request in the context of her discussion regarding what to bring on a trip to Patagonia.

“It had to know where Patagonia was, what the climate is like, and that I might encounter an avalanche,” she said.

With the new developer challenge, Rometty said, IBM wants to bring Watson to the mobile industry to see what types of applications mobile developers will come up with to leverage the intelligence service.

While other technology companies, such as Apple, have tried to offer a similar voice-activated intelligent system for mobile phones, those systems haven’t even come close to the cognitive ability Watson has achieved. Initially, Watson’s technology was too big to cram into a mobile device. When it first appeared in 2011 on the Jeopardy TV show, the system of servers took up an entire room. But IBM has worked aggressively to shrink the technology, and now it can be delivered as a cloud-based service, Rometty said.

Of course, IBM and Apple aren’t the only companies working on artificial intelligence technology that uses natural language as an input. Google recently bought London-based artificial intelligence company DeepMind for $500 million. And other tech giants, such as Facebook and Yahoo, are making forays into the world of artificial intelligence.

Still, Rometty thinks IBM has a leg up compared with the rest of the industry.

“Every major invention in data and analytics has come from IBM,” she said.


Regular Expressions: Part 2

Courtesy Ondrej Balas, Visual Studio Magazine

In my last column, I left off with an explanation of how groups can be used to divide a pattern into smaller pieces, or sub-expressions, allowing for repeating subsets of the pattern. But groups have other benefits as well, such as extracting information from a string. Take the following code, for example:

Dim input As String = "555-123-4567"
Dim pattern As String = "(\d\d\d)-\d\d\d-\d\d\d\d"
Dim match As Match = Regex.Match(input, pattern)
Dim areaCode As String = match.Groups(1).Value

In this code, there’s a set of parentheses around what would be the area code portion of the pattern, forming a capturing group. When the Regex engine returns the Match object, it puts all those groups into a Groups property on that Match object. You can then use the indexer on the property to get the group you want to access. Notice that I used an index of 1 to get the area code group. While the Groups property is zero-based, the first group (Groups(0)) is always a match of the entire regular expression. Then, each left parenthesis in the pattern gets a subsequent number; the first one will be in Groups(1), the second in Groups(2) and so on.
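The numbering rule is easiest to see with nested groups. The following C# sketch (the listings in this column are VB.NET, but the Regex class is the same) uses a pattern with extra parentheses that I added purely for illustration; the variable names are likewise my own.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "555-123-4567";

// Groups are numbered by the position of their opening parenthesis:
// 1 = ((\d\d\d)-(\d\d\d)), 2 = first (\d\d\d), 3 = second (\d\d\d), 4 = (\d\d\d\d)
string pattern = @"((\d\d\d)-(\d\d\d))-(\d\d\d\d)";
Match match = Regex.Match(input, pattern);

string whole = match.Groups[0].Value;       // always the entire match
string prefix = match.Groups[1].Value;      // outer group
string areaCode = match.Groups[2].Value;    // first nested group
string exchange = match.Groups[3].Value;    // second nested group
string lineNumber = match.Groups[4].Value;

Console.WriteLine(whole);       // 555-123-4567
Console.WriteLine(prefix);      // 555-123
Console.WriteLine(areaCode);    // 555
Console.WriteLine(exchange);    // 123
Console.WriteLine(lineNumber);  // 4567
Console.WriteLine(match.Groups.Count);  // 5 (Groups[0] plus four capturing groups)
```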

Using numeric indexers like this is fine when working with simple patterns, but complexity increases quickly as the Regex grows in size. It can also be problematic when changing the expression, as the groups may end up with different numbers as the pattern is changed. To solve this problem, you can optionally name the groups in the pattern, and then retrieve them by name rather than by number. Named groups have a special syntax, as shown here:

Dim input As String = "555-123-4567"
Dim pattern As String = "(?<AreaCode>\d\d\d)-\d\d\d-\d\d\d\d"
Dim match As Match = Regex.Match(input, pattern)
Dim areaCode = match.Groups("AreaCode").Value

The group that was (\d\d\d) is now (?<AreaCode>\d\d\d). In this case, the question mark directly after the left parenthesis tells the Regex engine that the group should follow special rules. Because it’s followed by a name within angle brackets, it knows to treat it as a named group. Now the group can be referenced by the name “AreaCode” instead of number (though it can still be accessed by number as well).
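Here is a short C# sketch of why names survive pattern changes better than numbers. The group names Exchange, Line and Country, and the optional country-code prefix, are illustrative additions and not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

string phonePattern = @"(?<AreaCode>\d\d\d)-(?<Exchange>\d\d\d)-(?<Line>\d\d\d\d)";
Match match = Regex.Match("555-123-4567", phonePattern);

string byName = match.Groups["AreaCode"].Value;
string byNumber = match.Groups[1].Value;   // same group, addressed by position
Console.WriteLine(byName);    // 555
Console.WriteLine(byNumber);  // 555

// Prepend an optional country code group: every number shifts by one,
// but the names still point at the same pieces.
string longerPattern = @"(?<Country>\+\d)?\s*" + phonePattern;
Match longerMatch = Regex.Match("+1 555-123-4567", longerPattern);

string areaCodeByName = longerMatch.Groups["AreaCode"].Value;
string firstGroup = longerMatch.Groups[1].Value;
Console.WriteLine(areaCodeByName);  // 555
Console.WriteLine(firstGroup);      // +1
```

Code that reads Groups(1) would silently start returning the country code after the change, while code that reads Groups("AreaCode") keeps working.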

You may have noticed that I’ve been using the Value property of the Group object to get the matched text. The Group object also has a few other helpful properties, as listed in Table 1.

Table 1: The Properties of the Group Object 

Property   Description
Captures   A collection containing sub-captures within the group
Index      The position at which this group matches within the input string
Length     The length of the captured string
Success    A Boolean that specifies whether the group matched
Value      The matching text
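A small C# sketch of these properties in action. The input strings and patterns are my own examples, chosen so that one group captures several times and another does not participate in the match at all.

```csharp
using System;
using System.Text.RegularExpressions;

// The group (\d) is repeated by +, so it captures once per digit.
Match match = Regex.Match("ab123", @"(\d)+");
Group digit = match.Groups[1];

Console.WriteLine(digit.Success);            // True
Console.WriteLine(digit.Captures.Count);     // 3 (one capture per digit)
Console.WriteLine(digit.Captures[0].Value);  // 1
Console.WriteLine(digit.Value);              // 3 (Value reflects the last capture)
Console.WriteLine(digit.Index);              // 4 (position of that last capture)
Console.WriteLine(digit.Length);             // 1

// With alternation, only one branch takes part in the match.
Match alternation = Regex.Match("b", "(a)|(b)");
Group unusedGroup = alternation.Groups[1];
Group usedGroup = alternation.Groups[2];
Console.WriteLine(unusedGroup.Success);  // False
Console.WriteLine(usedGroup.Success);    // True
```

Checking Success before reading Value is a cheap way to tell "the group matched an empty string" apart from "the group did not take part in the match".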

Position
Up to this point I’ve glossed over positioning and the difference between matching a character and matching the position between two characters (also known as an anchor). The two most commonly used position characters are the caret (^), which matches the position before the first character in the string, and the dollar sign ($), which matches the position at the very end of the string. Again, revisiting the phone number example, a pattern of \d\d\d-\d\d\d\d will match if a phone number appears anywhere within the match string. Even the string “abc123-4567def” would successfully be matched by that pattern. A better pattern would be “^\d\d\d-\d\d\d\d$,” which reads as: “The position before the first character, immediately followed by three digits, then a hyphen, then four more digits, and then immediately followed by the position after the last character.” The following code snippet demonstrates this:

Dim invalidInput As String = "abc123-4567def"
Dim validInput As String = "123-4567"
Dim pattern As String = "^\d\d\d-\d\d\d\d$"
Dim invalidMatchResult As Boolean = Regex.Match(invalidInput, pattern).Success 'False
Dim validMatchResult As Boolean = Regex.Match(validInput, pattern).Success 'True

Another positional character is \b, or word boundary. Word boundary (\b) matches successfully for a position between a word character and non-word character, where a word character is defined as any alphanumeric character or underscore. An example of this would be a search for how often the word “bot” appears within a log file. With a pattern of “bot,” words such as “robot” would match as well, leading to an inaccurate word count. A pattern that would avoid this would be “\bbot\b,” which reads as: “A word boundary, immediately followed by the word bot, immediately followed by another word boundary.” Here’s a usage example:

Dim input = "search bot | robots.txt"
Dim simplePattern = "bot"
Dim betterPattern = "\bbot\b"
Dim simpleCount As Integer = Regex.Matches(input, simplePattern).Count '2
Dim betterCount As Integer = Regex.Matches(input, betterPattern).Count '1

Using the simple pattern, the engine returns a count of 2 instances of the word “bot,” when one instance is just its occurrence within the word “robots.” By requiring word boundaries before and after the word “bot,” the engine returns the correct count of 1.

Greedy or Lazy
In my last column, I showed off many of the quantifiers that Regex offers, such as the asterisk (*). To get the most out of quantifiers, it’s important to understand the distinction between greedy and lazy behaviors. By default, quantifiers behave in a “greedy” fashion, meaning they consume as many characters as possible. It’s possible to individually change that behavior to “lazy” by following them with a question mark. Consider this example:

Dim input As String = "http://www.example.com/samples/demo.html"
Dim greedyPattern As String = "http://(.*)/"
Dim lazyPattern As String = "http://(.*?)/"
Dim greedyMatch As String = Regex.Match(input, greedyPattern).Value 'http://www.example.com/samples/
Dim lazyMatch As String = Regex.Match(input, lazyPattern).Value 'http://www.example.com/

The patterns are almost identical, with the difference being that the greedy pattern uses a group of (.*) while the lazy pattern uses (.*?). The results are quite different, however. When using the greedy pattern, the resulting match was “http://www.example.com/samples/”, but with the lazy pattern it was “http://www.example.com/”. The difference is in the way the Regex engine steps through the input string to find a match.

When parsing the greedy expression, it matches the http:// and starts stepping through the input, matching as many characters as it can (the period matches any character). It will do this until it reaches the end of the string, and then attempt to match the slash. Because there’s no slash after the end of the string, the engine will start back-tracking until it finds a slash. See Figure 1 for a simplified example of how this might be parsed by the Regex engine.

Figure 1. Simplified example of behavior when matching the greedy expression.

The engine deals with the lazy expression much differently. Instead of matching as much as it can, it matches as little as it can get away with while still matching the slash (/) following the quantifier. Figure 2 shows a simplified example of this behavior.

Figure 2. Simplified example of behavior when matching the lazy expression.
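The same two patterns can also be compared through the captured group rather than the whole match. This C# sketch reuses the URL from the example above; the digit strings at the end are my own addition, to show that the trailing question mark works on other quantifiers as well.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "http://www.example.com/samples/demo.html";

// Greedy: .* runs to the end of the string, then backtracks to the LAST slash.
string greedyGroup = Regex.Match(input, "http://(.*)/").Groups[1].Value;
// Lazy: .*? grows one character at a time and stops at the FIRST slash.
string lazyGroup = Regex.Match(input, "http://(.*?)/").Groups[1].Value;

Console.WriteLine(greedyGroup);  // www.example.com/samples
Console.WriteLine(lazyGroup);    // www.example.com

// The question mark makes any quantifier lazy, not just the asterisk.
string greedyDigits = Regex.Match("12345", @"\d+").Value;
string lazyDigits = Regex.Match("12345", @"\d+?").Value;

Console.WriteLine(greedyDigits);  // 12345
Console.WriteLine(lazyDigits);    // 1
```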

Tools
Regular expressions can be difficult both to write and to read; fortunately, there are some great tools that can help. To jump-start your understanding of more complex expressions, I recommend a free tool called Expresso.

If you’re interested in a deeper understanding of how the engine handles your expressions, or just want to debug a complex expression, try out RegExpose, an open source tool written by Brian Friesen and available on GitHub.

Advanced Scenarios
In the next part of this series, I’ll be exploring some advanced scenarios for regular expressions, such as using them as part of the Find & Replace feature in Visual Studio, or in applying them to business intelligence. I hope you find as much value in regular expressions as I continue to, year after year.


Regular Expressions: Part 1

Courtesy Ondrej Balas, Visual Studio Magazine

Regular expressions — those scary strings that might as well be written in Klingon to the average person — can be a vast time-saver. They help in one of the most common tasks of programming: string manipulation. The .NET Framework has an excellent, built-in regular expressions engine that’s relatively straightforward to use.

The use and understanding of regular expressions (hereafter referred to as RegEx) traditionally comes in three parts: testing a string to see if a pattern exists within it (pattern matching); reading strings and extracting useful information; and manipulating and making changes to those strings.

Pattern Matching
The first thing you’ll likely want to use a RegEx for is to do pattern matching, the simplest example of which is determining whether a particular set of characters exists within a string. In fact, if you’ve ever used the “Find” feature in any text editor, you’ve effectively used a regular expression composed only of the most basic pieces: literals. In a RegEx pattern, a literal is a character that must be matched exactly. Searching for the pattern “BCD” within the string “ABCDE” would result in a match, for example. In this case, the search pattern would read as, “The literal B, immediately followed by the literal C, immediately followed by the literal D.”

As beneficial as that is, there are certainly better and simpler ways of accomplishing a simple search without the use of regular expressions. The power of RegEx doesn’t become apparent until you start using elements like character classes: sets of characters, any one of which can be accepted as a match. Character classes are denoted by the brackets surrounding them. Take the character class [0123456789] for example, which instructs the RegEx engine to match any single character that is a digit between 0 and 9. Alternatively, you could use [0-9], which has the same meaning to the engine, but is more readable to a human. RegEx additionally has some helpful shortcuts for commonly used classes like this one. [0-9] can be further shortened by using \d, the shortcut for “Match any numeric character.”
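To illustrate, here is a small sketch that runs the three spellings of the digit class against a few sample strings and confirms they give the same answer. It assumes plain ASCII input; the AllAgree helper and the sample strings are my own illustrations, not part of the original article.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassDemo
  ' Hypothetical helper: True when all three digit patterns give the same result.
  Function AllAgree(input As String) As Boolean
    Dim longForm As Boolean = Regex.IsMatch(input, "[0123456789]")
    Dim rangeForm As Boolean = Regex.IsMatch(input, "[0-9]")
    Dim shortcut As Boolean = Regex.IsMatch(input, "\d")
    Return longForm = rangeForm AndAlso rangeForm = shortcut
  End Function

  Sub Main()
    For Each sample As String In New String() {"abc", "abc7", "42"}
      Console.WriteLine("{0}: contains a digit = {1}, patterns agree = {2}",
                        sample, Regex.IsMatch(sample, "[0-9]"), AllAgree(sample))
    Next
  End Sub
End Module
```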

The Regex Object

Using RegEx in your code requires the Regex class, located within the System.Text.RegularExpressions namespace. For a usage example, see Listing 1 below. The code in this example reads a string as input from the console, then shows a response depending on whether the string contains a numeric character anywhere within it (using \d as the pattern). You can experiment with different things in the pattern, such as \d\d, which will match only if two digits are next to each other somewhere in the string (e.g., “12ab” will match, but “1a2” will not).

Listing 1: Test To See If the String Entered Contains a Number

Imports System.Text.RegularExpressions

Module Module1
  Sub Main()
    ' \d matches a single numeric character anywhere in the input.
    Dim pattern As String = "\d"
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop on Q, or when the input stream ends (ReadLine returns Nothing).
      If input Is Nothing OrElse input.ToUpper() = "Q" Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a number")
      Else
        Console.WriteLine("String does NOT contain a number")
      End If
    Loop
  End Sub
End Module
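
For a quick, non-interactive check of the \d\d experiment suggested above, something like the following sketch could be used. The HasTwoAdjacentDigits helper is a name I made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module AdjacentDigitsDemo
  ' Hypothetical helper: True when two digits sit side by side somewhere in the input.
  Function HasTwoAdjacentDigits(input As String) As Boolean
    Return Regex.IsMatch(input, "\d\d")
  End Function

  Sub Main()
    ' "12ab" and "a12b" contain adjacent digits; in "1a2" the digits are separated.
    For Each sample As String In New String() {"12ab", "1a2", "a12b"}
      Console.WriteLine("{0}: {1}", sample, HasTwoAdjacentDigits(sample))
    Next
  End Sub
End Module
```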

Wildcards
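In RegEx, wildcards are really just shortcuts to character classes. Just like \d is a shortcut for [0-9], there are other shortcuts, too. See Table 1 below for a list of the most commonly used character classes. The classes shown are the ASCII equivalents; by default, the .NET engine also accepts the corresponding Unicode characters (letters and digits from other scripts, for example). Note that the period is special in that it matches any character except the line break (\n) by default.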

Table 1: A List of Character Class Shortcuts

Shortcut   Class              Description
\d         [0-9]              numeric character
\D         [^0-9]             NOT a numeric character
\w         [a-zA-Z0-9_]       “word” character
\W         [^a-zA-Z0-9_]      NOT a “word” character
\s         [ \f\n\r\t\v]      whitespace character
\S         [^ \f\n\r\t\v]     NOT a whitespace character
.          [^\n]              Any character except line break
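
One way to get a feel for the shortcuts in Table 1 is to test each against a handful of single characters. The sketch below does that; the Accepted helper and the choice of sample characters (a digit, a letter, an underscore, a space and a hyphen) are my own assumptions for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ShortcutDemo
  ' Hypothetical helper: lists which of the sample characters the pattern accepts.
  Function Accepted(pattern As String) As String
    Dim samples() As String = {"7", "a", "_", " ", "-"}
    Dim hits As New List(Of String)
    For Each sample As String In samples
      If Regex.IsMatch(sample, pattern) Then hits.Add("[" & sample & "]")
    Next
    Return String.Join(" ", hits.ToArray())
  End Function

  Sub Main()
    For Each shortcut As String In New String() {"\d", "\D", "\w", "\W", "\s", "\S", "."}
      Console.WriteLine("{0,-3} {1}", shortcut, Accepted(shortcut))
    Next
    ' The period accepts all five samples above, but not a line feed.
    Console.WriteLine(Regex.IsMatch(vbLf, "."))
  End Sub
End Module
```

Each shortcut and its uppercase counterpart split the samples between them: whatever \d, \w or \s accepts, \D, \W or \S rejects, and vice versa.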

Validation
One common use of the Regex.IsMatch() method is in validation. For example, if you wanted to validate that a phone number was entered, a simple approach to that would be a pattern of “\d\d\d-\d\d\d\d” — in this case, the hyphen (-) acts as a literal hyphen and nothing else, so the pattern reads as, “Three digits, followed directly by a hyphen, followed directly by four more digits.” Of course, this requires the hyphen to exist; if it’s omitted from the input, there won’t be a match. To solve this, you can add a question mark after the hyphen, as “\d\d\d-?\d\d\d\d” shows. The question mark acts as a quantifier in this case, meaning “Match zero or one” of the preceding character, the hyphen.
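
The difference between the two patterns can be sketched as follows. The Matches helper, variable names and sample inputs are illustrative assumptions rather than anything from the original article.

```vb
Imports System
Imports System.Text.RegularExpressions

Module OptionalHyphenDemo
  ' Hypothetical helper wrapping Regex.IsMatch.
  Function Matches(input As String, pattern As String) As Boolean
    Return Regex.IsMatch(input, pattern)
  End Function

  Sub Main()
    Dim hyphenRequired As String = "\d\d\d-\d\d\d\d"
    Dim hyphenOptional As String = "\d\d\d-?\d\d\d\d"
    ' Only the second pattern accepts the input without a hyphen;
    ' neither accepts a space where the hyphen would go.
    For Each sample As String In New String() {"555-1234", "5551234", "555 1234"}
      Console.WriteLine("{0,-10} required: {1,-6} optional: {2}",
                        sample, Matches(sample, hyphenRequired), Matches(sample, hyphenOptional))
    Next
  End Sub
End Module
```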

Quantifiers
There are a few other quantifiers aside from the question mark. The most commonly used is the asterisk, which means, “Match zero or more, as many as possible.” In the preceding example, an asterisk could have been used in place of the question mark like this: “\d\d\d-*\d\d\d\d”; this would result in a match regardless of how many hyphens were entered, including none. For example, “123----5555” would result in a match.

Another quantifier almost the same as the asterisk is the plus sign, which means, “Match as many as possible, with a minimum of one.” If you were to use the plus instead of the asterisk, like this:

"\d\d\d-+\d\d\d\d"

a string of “123-5555” would result in a match, but “1235555” would not.
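
A side-by-side sketch of the asterisk and the plus sign, using inputs with zero, one and several hyphens, might look like this. The helper name and sample inputs are my own.

```vb
Imports System
Imports System.Text.RegularExpressions

Module StarPlusDemo
  ' Hypothetical helper wrapping Regex.IsMatch.
  Function Matches(input As String, pattern As String) As Boolean
    Return Regex.IsMatch(input, pattern)
  End Function

  Sub Main()
    Dim zeroOrMore As String = "\d\d\d-*\d\d\d\d"
    Dim oneOrMore As String = "\d\d\d-+\d\d\d\d"
    ' The asterisk accepts all three inputs; the plus rejects the one with no hyphen.
    For Each sample As String In New String() {"1235555", "123-5555", "123----5555"}
      Console.WriteLine("{0,-12} *: {1,-6} +: {2}",
                        sample, Matches(sample, zeroOrMore), Matches(sample, oneOrMore))
    Next
  End Sub
End Module
```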

There’s also a quantifier that allows you to be more precise in how many times something must be repeated for it to trigger a match. You can set these by using curly brackets, like this: “\d{3}-\d{4}”. This reads as, “Match exactly three digits, followed by one hyphen, followed by exactly four digits.” Another alternative usage allows a range to be specified instead. An example would be “\d{3,5}”, a pattern matching anywhere from three to five digits. Note that this pattern will also match if the input is “123456”, because IsMatch only requires the pattern to be found somewhere in the input, and the RegEx engine successfully finds five digits within it.
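
The sketch below tries both forms of the curly-bracket quantifier. To show which part of “123456” the range pattern picks up, it uses Regex.Match and its Value property, which this article has not covered; the FirstMatch helper is a hypothetical name of mine.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RepetitionDemo
  ' Hypothetical helper: returns the text of the first match, or an empty string if none.
  Function FirstMatch(input As String, pattern As String) As String
    Return Regex.Match(input, pattern).Value
  End Function

  Sub Main()
    ' Exactly three digits, a hyphen, then exactly four digits.
    Console.WriteLine(Regex.IsMatch("123-5555", "\d{3}-\d{4}"))
    ' Only two digits before the hyphen, so there is no match.
    Console.WriteLine(Regex.IsMatch("12-5555", "\d{3}-\d{4}"))
    ' Two digits are too few for \d{3,5}.
    Console.WriteLine(Regex.IsMatch("12", "\d{3,5}"))
    ' Six digits still match: the engine takes the first five and stops.
    Console.WriteLine(FirstMatch("123456", "\d{3,5}"))
  End Sub
End Module
```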

Grouping
As mentioned above, quantifiers always act on the preceding character or group. To allow for repetitions of more than one character, you can create something called a grouping by wrapping part of the pattern in parentheses. Back to the phone number example, try using the pattern “(\d\d\d-){1,2}\d\d\d\d” to allow for area codes to be entered. With this pattern, both “123-5555” and “444-123-5555” would match.

In Listing 2 below, I bring it all together and create a somewhat more robust phone validation pattern.

Listing 2: A More Robust Phone Validation Pattern

Imports System.Text.RegularExpressions

Module Module1
  Sub Main()
    ' One or two groups of "three digits and a hyphen", then four digits.
    Dim pattern As String = "(\d\d\d-){1,2}\d\d\d\d"
    Do
      Console.WriteLine("Enter a string, or Q to exit.")
      Dim input As String = Console.ReadLine()
      ' Stop on Q, or when the input stream ends (ReadLine returns Nothing).
      If input Is Nothing OrElse input.ToUpper() = "Q" Then Exit Do
      If Regex.IsMatch(input, pattern) Then
        Console.WriteLine("String contains a valid phone number")
      Else
        Console.WriteLine("String does NOT contain a valid phone number")
      End If
    Loop
  End Sub
End Module
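
To see how the pattern from Listing 2 behaves without typing inputs by hand, a sketch along these lines could be run. The LooksLikePhone helper and the sample inputs are my own; the last sample is there to show that, as with “\d{3,5}” earlier, IsMatch is satisfied by finding the pattern anywhere inside a longer string.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GroupingDemo
  ' Hypothetical helper using the same pattern as Listing 2.
  Function LooksLikePhone(input As String) As Boolean
    Return Regex.IsMatch(input, "(\d\d\d-){1,2}\d\d\d\d")
  End Function

  Sub Main()
    ' With and without an area code: both match.
    Console.WriteLine(LooksLikePhone("123-5555"))
    Console.WriteLine(LooksLikePhone("444-123-5555"))
    ' No "three digits and a hyphen" group at all: no match.
    Console.WriteLine(LooksLikePhone("5555"))
    ' Extra characters around a valid number do not prevent a match.
    Console.WriteLine(LooksLikePhone("99999-123-5555-0000"))
  End Sub
End Module
```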

Getting Started
Regular expressions are a deep topic, and this article only scratches the surface. While I’ve only shown their use in validation thus far, they’re also quite handy in many scenarios involving parsing or string manipulation. With a solid understanding, regular expressions will become an invaluable tool to have in your tool belt. Check back next month for the next article in this series, in which I go deeper into the internals of regular expressions and how you can get even more out of them.