Monday, August 11, 2014

Express Yourself, Regularly

The Basics of Regular Expressions

Introduction

Unix administrators have long used regular expressions to help them locate files, modify data, and manage system configurations. Tools like grep and sed are designed to process regular expressions to provide the administrator with exactly the information he wants. While versions of these tools have been ported to Windows, most Windows administrators are unaware that they exist. Because of the limitations of the Windows command shell, Windows administrators typically stick with slower, more complicated graphical tools to manage the operating system.
Enter PowerShell. PowerShell is designed to be a replacement for the standard Windows shell, and it is far more powerful and flexible than its predecessor. Among the many command enhancements PowerShell offers is built-in support for regular expressions. It borrows this capability heavily from Perl, a scripting language that was developed specifically for processing text.
Regular expressions are used to search for character sequences inside text strings or files. Programs that process regular expressions look for text that matches a given pattern. The components of a regular expression are not complicated, but the available combinations are many and varied, making it possible to perform some very sophisticated matches. Whether you’re administering Windows, Linux, or the Unix-based Mac OS X, you should invest some time learning the cryptic syntax of regular expressions so that you can manage systems and automate common tasks.
This tutorial will introduce regular expressions. It is not aimed at a particular operating system. Students of both Linux and PowerShell will come away with a basic knowledge of how regular expressions work and how to craft their own. Specific tools such as Linux’s grep command and PowerShell’s -match operator are covered in those respective classes at Centriq Training.

Overview

Whenever we begin to learn a new technology we get excited about the possibilities. We catch a glimpse of all kinds of nifty things that we can do with this knowledge. We often forget, though, that each technology has its limitations. Regular expressions are cool and powerful and flexible and a lot of other things, but there are some things that they’re not—some things that they cannot do. Like all things new, regular expressions come with a learning curve that is best overcome with practice. To avoid getting frustrated as you begin to learn regular expressions you must always keep three rules in mind.

Rule 1. Regular expressions match text, not numbers.

Regular expressions can represent any sequence of characters that you can find on a typical keyboard, and even some that you can’t, but they can’t express any other kind of data. They don’t understand numbers.
This confuses people at first, because regular expressions are frequently employed to determine things like whether a user’s input is a numeric age or zip code or year. But in these cases the regular expression is used to test the characters and ensure that they are numerals. The quantities that these numerals stand for are completely lost on the regular expression: it can’t identify a numeral’s value.
Remember, even numbers are written with characters. Regular expressions can be used to recognize those characters, but they can’t be used to determine their values, so they don’t work with actual numbers.
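Rule 1 can be sketched in a few lines of shell. This is a minimal illustration, assuming a POSIX shell and a grep that accepts the -E option; the helper name is_numeral_string is made up for this example, and the pattern syntax it uses is explained later in the tutorial. The pattern can confirm that a string is written entirely in numerals, but comparing the values those numerals stand for is left to the shell's arithmetic test.

```shell
# Hypothetical helper: succeeds when the argument consists only of numerals.
is_numeral_string() {
    printf '%s\n' "$1" | grep -E -q '^[0-9]+$'
}

for input in 42 007 4x2; do
    if is_numeral_string "$input"; then
        echo "$input: all numerals"
    else
        echo "$input: contains something other than numerals"
    fi
done

# The pattern accepts both "9" and "10", but it cannot tell which is larger.
# That comparison is done by the shell's numeric test, not by the pattern.
if is_numeral_string 9 && is_numeral_string 10 && [ 9 -lt 10 ]; then
    echo "9 is less than 10"
fi
```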

Rule 2. Regular expressions are made from three components: characters, anchors, and modifiers.

Every regular expression must have at least one character. This provides the basis for any match that will be performed. Anchors may be used to establish that the characters belong in certain positions in the text. Modifiers may be used to match repeated instances of a character or change a character’s meaning.
Characters are pretty simple, but the anchors and modifiers are what make regular expressions so powerful—and difficult to read. No matter how complex the expression, though, it always begins with at least one character.

Rule 3. Each program is a little different.

There are many programs and programming languages that can process regular expressions. While there is a standard definition of characters, anchors, and modifiers, individual programs have sometimes extended and customized their definition of a regular expression. Generally, this is done to make expressions easier for us humans to use, but it can lead to confusion for the student who is just learning the syntax. In this tutorial I’m going to stick mostly with standard syntax, but at the end I’ll provide specific examples with grep and PowerShell.
So if you run across a regular expression that looks unusual or that doesn’t work in your specific tool, that doesn’t necessarily mean that it’s wrong. Each program is a little different.

A First Look at Characters

The group of characters that make up a regular expression’s search string is called a pattern. Patterns can be very simple. For example, d is a valid pattern. It matches any string of text that contains a “d” character, like “Dave”, “dude”, or “all work and no play makes Jack a dull boy”. Whether the matching is case-sensitive depends on the processor that’s performing the match. Remember Rule 3: each program is a little different.
When it comes to matching characters, you should know that most regular expression processors are going to match only the first instance that they find. In that last string, the expression matches the “d” in “and”, but not the one in “dull.” Once a program finds the first match, it usually stops processing the string altogether.
Characters can be combined to make whole words or phrases. The pattern error will match any string that contains the word “error” in any form. This includes “errors”, as well as other words like “terror”. Remember that the pattern you’re searching for is just a string of characters. If your pattern includes spaces or other special characters, you usually have to enclose it in quotation marks.
Sometimes we don’t want to match a particular character, just any character. For this we use a wildcard, which in regular expression syntax is a period. The pattern d.d would match “dad”, “dude”, and “katydid”, because there must be exactly one character between the two d’s. The pattern d...d will match “David”, “domed”, “android”, and even “my good buddy Steve” because in each case there are d’s separated by three characters. Note that in the last example those three characters include a space. That’s okay: a wildcard matches any character—even numerals, spaces, and punctuation marks.
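The literal and wildcard patterns above can be tried out with a short shell sketch, assuming a POSIX shell and a grep that accepts the -E option. Each candidate string is fed to grep on its own line, and grep prints the lines that contain a match. Because grep is case-sensitive by default, "David" only matches when the -i option is added, which is Rule 3 in action.

```shell
# One wildcard between two d's.
printf '%s\n' dad dude katydid door | grep -E 'd.d'
# prints dad, dude, and katydid; "door" has no second d

# Three wildcards between two d's; the wildcard also matches a space.
printf '%s\n' domed android 'my good buddy Steve' dad | grep -E 'd...d'
# prints domed, android, and "my good buddy Steve"; "dad" is too short

# The capital D is not matched unless case is ignored with -i.
printf '%s\n' David | grep -E -i 'd...d'
# prints David
```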

Anchors Put It Where You Want It

An anchor is a special character that ties a part of your pattern to the beginning or end of a text string. The caret symbol, ^, can be created by pressing Shift-6 on most US keyboards. It anchors a pattern to the beginning of a text string. The pattern ^car will match “caret”, but not “lascar”. The caret symbol at the beginning of the pattern tells a regular expression processor that no character may precede the pattern.
The dollar symbol, $, can be created by pressing Shift-4 on most US keyboards. It anchors the pattern to the end of a text string. The pattern ave$ will match “Dave” but not “avenue”. The dollar symbol at the end of the pattern tells the regular expression processor that no character may follow the pattern.
A pattern that includes both anchors can be used to search for an exact match of the pattern to the text string. The pattern ^error$ will exactly match the string “error”, but not “terror” or “errors”. It will not match a string that contains the word “error” among other things, like “error in module fuse.ko”—it is always an exact match.
Of course, you can combine wildcards with anchors. The pattern ^d...d$ will match “David” and “dared”, but not “android” or “dreaded”. And if you want to match on a blank line, you can use the pattern ^$.
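The anchor examples above can be checked with a short shell sketch, assuming a POSIX shell and a grep that accepts the -E option. Since grep tests one line at a time, each line of input plays the role of a text string.

```shell
printf '%s\n' caret lascar | grep -E '^car'
# prints caret

printf '%s\n' Dave avenue | grep -E 'ave$'
# prints Dave

printf '%s\n' error terror errors 'error in module fuse.ko' | grep -E '^error$'
# prints error

printf '%s\n' dared android dreaded | grep -E '^d...d$'
# prints dared

# Count the blank lines in some text; -c reports the number of matching lines.
printf 'first\n\nsecond\n\n' | grep -E -c '^$'
# prints 2
```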

Expanding a Pattern With Modifiers

The Great Escape

You’ve learned that, in regular expression syntax, a period is a wildcard. But sometimes we want to search for a literal period. Because the period stands for any character, the pattern ^169.254. would match the IP address “169.254.14.2”, but it would also match the string “1698254a”, which is not what we’re looking for. What we want to do is modify the period and change its meaning.
The backslash character, \, is a special modifier called the “escape character”. It changes the meaning of the character that immediately follows it, “escaping” from the normal interpretation of the pattern. When it precedes a period, the backslash takes away the period’s meaning as a wildcard so that it becomes a normal period. So the pattern ^169\.254\. will match “169.254.14.2” but not “1698254a”. Since the periods have been “escaped”, there are no wildcards in this pattern. Likewise, the pattern ^\$ will look for a string that begins with a dollar sign, and ^4\^2$ will match a string that contains exactly “4^2”.
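The difference between an escaped and an unescaped period can be seen in a short shell sketch, assuming a POSIX shell and a grep that accepts the -E option. The patterns are enclosed in single quotes so that the shell passes the backslashes and dollar signs to grep untouched.

```shell
# Unescaped periods are wildcards, so both strings match.
printf '%s\n' 169.254.14.2 1698254a | grep -E '^169.254.'
# prints both lines

# Escaped periods match only literal periods.
printf '%s\n' 169.254.14.2 1698254a | grep -E '^169\.254\.'
# prints 169.254.14.2

# Escaping an anchor turns it into an ordinary character.
printf '%s\n' '$100' '100$' | grep -E '^\$'
# prints $100

printf '%s\n' '4^2' '4^22' | grep -E '^4\^2$'
# prints 4^2
```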

Time and Time Again

It is often necessary to look for repeated characters in a text string. Multipliers are modifiers that allow you to extend your expression to include repetition in your searches. They are special characters that follow some other character in a pattern. They multiply the character that appears immediately before them by some value. There are several multipliers, so you’ll need to commit them to memory.
The question mark, ?, can be produced on most US keyboards by pressing Shift-/. It multiplies the preceding character by zero or one. The pattern ^d.?d$ will match both “dd” and “did”, requiring zero or one characters between the d’s.
The asterisk, *, can be produced on most US keyboards by pressing Shift-8. The asterisk multiplies the preceding character by zero or more. The pattern ^d.*d$ will match “dd”, “did”, “dreamed”, and “drumming in your head”, because it allows any number of characters to exist between the d’s.
The plus symbol, +, can be produced on most US keyboards by pressing Shift-=. It multiplies the preceding character by one or more. The pattern ^d.+d$ matches “did”, “dreamed”, and “drumming in your head”, but it does not match “dd”. The plus symbol requires at least one character to appear between the d’s.
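The three basic multipliers can be compared side by side in a short shell sketch, assuming a POSIX shell and a grep that accepts the -E option. The variable name words is just a convenience for this example; the same four strings are tested against each pattern.

```shell
words='dd
did
dreamed
drumming in your head'

# Zero or one character between the d's.
printf '%s\n' "$words" | grep -E '^d.?d$'
# prints dd and did

# Zero or more characters between the d's.
printf '%s\n' "$words" | grep -E '^d.*d$'
# prints all four lines

# One or more characters between the d's.
printf '%s\n' "$words" | grep -E '^d.+d$'
# prints every line except dd
```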
Advanced multiplication is not supported by all regular expression processors, and it is not implemented consistently among those that do support it. Remember, each program is a little different. However, because many programs and programming languages support it to some extent, you should get to know it and its variations.
Advanced multipliers contain values inside curly braces, the { and } characters. These can be produced on most US keyboards by pressing Shift-[ and Shift-] respectively.
Placing a single value in the braces multiplies the preceding character by exactly that number. The pattern ^d.{3}d$ matches “David” and “druid”, requiring exactly three characters between the d’s. Note that this pattern could have been written ^d...d$, but as we learn about more characters we’ll see that the advanced multiplier can be much easier to read.
Enclosing two values separated by a comma within the braces, we get a specific range of multipliers. The pattern ^d.{3,5}d$ multiplies the wildcard by three, four, and five. It matches “druid”, “darned”, and “dreamed”, requiring three to five characters between the d’s.
If the braces contain a single number followed by a comma, the range of multipliers has no upper limit. The pattern ^d.{3,}d$ matches any string with three or more characters between the d’s.
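The three forms of the advanced multiplier can be sketched in shell, assuming a POSIX shell and a grep that accepts the -E option. The variable name words is only a convenience for this example. The last command shows the same pattern written for plain grep, where the braces have to be escaped.

```shell
words='did
druid
darned
dreamed
disappeared'

# Exactly three characters between the d's.
printf '%s\n' "$words" | grep -E '^d.{3}d$'
# prints druid

# Three to five characters between the d's.
printf '%s\n' "$words" | grep -E '^d.{3,5}d$'
# prints druid, darned, and dreamed

# Three or more characters between the d's.
printf '%s\n' "$words" | grep -E '^d.{3,}d$'
# prints druid, darned, dreamed, and disappeared

# Plain grep (without -E) expects the braces to be escaped.
printf '%s\n' "$words" | grep '^d.\{3\}d$'
# prints druid
```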

Advanced Characters

There’s more to characters than literal symbols and wildcards. While the advanced characters may not look like single characters to you, to the regular expression processor they are indeed just a character. Sometimes these are called “meta-characters”, because they are a group of symbols that stand for a single character. Combined with multipliers, these meta-characters make it possible to create sophisticated searches.
Square brackets allow you to specify one from a group of characters that you want to match. The pattern ^d[aiu]d$ represents three characters. The symbols between the brackets are applied in turn to the search, so that this pattern matches “dad”, “did”, and “dud”. The pattern requires that there must be nothing but an a, i, or u between the d’s.
If you want to search for a string that begins with a vowel you can use the pattern ^[aeiou]. You can negate the pattern with a caret symbol inside the brackets. The pattern ^[^aeiou] matches all strings that begin with any character that is not a vowel. Don’t let the use of the caret confuse you. At the beginning of a pattern the caret is an anchor. Inside square brackets it reverses the meaning of the group. This could be read as “not a, e, i, o, or u”, so inside brackets the caret symbol means “not”.
The brackets can also contain a range, two values separated by a hyphen. The pattern [0-9] represents all numerals. The pattern ^169\.254\.[0-9]{1,3}\.[0-9]{1,3}$ uses escape sequences, ranges, and multipliers to match IP addresses that begin with “169.254.”.
Since a range of characters is just a character in regular expression syntax, ranges can be grouped. The pattern ^[0-9a-z] will match all strings that begin with a numeral or a lowercase letter. Many shell scripts use the pound sign at the beginning of a line to identify comments. The pattern ^[#a-zA-Z] matches all the lines in a script that begin with a pound sign or a letter.
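The bracket patterns above can be exercised with a short shell sketch, assuming a POSIX shell and a grep that accepts the -E option. The last address in the IP example is deliberately impossible: the pattern accepts it anyway, because a regular expression checks the characters and not the values they stand for (Rule 1).

```shell
# One character from a group.
printf '%s\n' dad did dud ded | grep -E '^d[aiu]d$'
# prints dad, did, and dud

# A group and its negation.
printf '%s\n' apple banana | grep -E '^[aeiou]'
# prints apple
printf '%s\n' apple banana | grep -E '^[^aeiou]'
# prints banana

# Escapes, ranges, and multipliers combined.
printf '%s\n' 169.254.14.2 169.254.1 192.168.1.1 169.254.999.999 |
    grep -E '^169\.254\.[0-9]{1,3}\.[0-9]{1,3}$'
# prints 169.254.14.2 and 169.254.999.999

# Lines that begin with a pound sign or a letter.
printf '%s\n' '# a comment' 'echo hello' '  indented' | grep -E '^[#a-zA-Z]'
# prints "# a comment" and "echo hello"
```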
Many common character groups have special classes defined for them. The range 0-9 can also be written using the class [:digit:]. Other classes include [:alpha:] for all letters, [:lower:] for lowercase letters, [:upper:] for uppercase letters, [:alnum:] for letters and numbers, [:space:] for white space characters like space and tab, [:cntrl:] for non-printable control characters, and [:xdigit:] for characters used to represent hexadecimal numbers, equivalent to 0-9a-fA-F. Like a range, a class is written inside square brackets, so [0-9] becomes [[:digit:]] and [0-9a-fA-F] becomes [[:xdigit:]].
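A few of these classes can be tried in a short shell sketch, assuming a POSIX shell and a grep that accepts the -E option. Each class is placed inside square brackets, which gives the double-bracket form that grep expects. The variable name samples is only a convenience for this example.

```shell
samples='12345
abcde
a b
ABC'

printf '%s\n' "$samples" | grep -E '^[[:digit:]]+$'
# prints 12345

printf '%s\n' "$samples" | grep -E '^[[:alpha:]]+$'
# prints abcde and ABC

printf '%s\n' "$samples" | grep -E '^[[:upper:]]+$'
# prints ABC

printf '%s\n' "$samples" | grep -E '[[:space:]]'
# prints "a b"

# Every character in 12345, abcde, and ABC is a valid hexadecimal digit.
printf '%s\n' "$samples" | grep -E '^[[:xdigit:]]+$'
# prints 12345, abcde, and ABC
```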
Some advanced regular expression processors such as those found in Perl and PowerShell can also use escape sequences to represent character classes. Some common ones are: \d to represent any digit; \w to represent any word character, meaning letters, numerals, and the underscore; and \s to represent white space characters. A capital letter in the escape sequence negates it, so \D represents any character that is not a digit, and \S represents any character that is not white space.
Parentheses, ( and ), can be produced on most US keyboards by pressing Shift-9 and Shift-0 respectively. They can be used to combine multiple patterns together. The pipe symbol, |, is produced by pressing Shift-\ on most US keyboards. It ties together multiple patterns within the parentheses using “or” logic. The expression (^[[:digit:]]{5}$|^[[:digit:]]{5}-[[:digit:]]{4}$) contains two patterns, and will process a string until one pattern or the other matches some text. The classes are written in the double-bracket form that grep expects, and there are no spaces around the pipe: a space in a pattern is a literal character that would have to appear in the text. This regular expression matches a US zip code written in either the five-digit or five-plus-four-digit format, such as “02134” and “64119-4105”.
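The zip code expression can be tested with a short shell sketch, assuming a POSIX shell and a grep that accepts the -E option. The pattern is written in the form that grep -E accepts: each class sits inside square brackets, and there are no spaces around the pipe, since a space would be treated as a literal character. The variable name zip_pattern is made up for this example.

```shell
zip_pattern='(^[[:digit:]]{5}$|^[[:digit:]]{5}-[[:digit:]]{4}$)'

printf '%s\n' 02134 64119-4105 6411 64119-41 '02134 ' | grep -E "$zip_pattern"
# prints 02134 and 64119-4105; the other strings are too short,
# have an incomplete suffix, or carry a trailing space
```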
Finally come the angle brackets, < and >, which can be produced on most US keyboards by pressing Shift-, and Shift-. respectively. Preceded by backslashes, these identify word boundaries, so whatever is enclosed within is considered to be a whole word. The pattern \<error\> matches “error”, but not “errors” or “terror”. The word-boundary markers require that no letter, numeral, or underscore appear immediately on either side of the enclosed pattern; only white space, punctuation, or the beginning or end of the string may appear there.
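Please note that regular expression processors differ in how they treat these symbols. Plain grep requires that curly braces be preceded by a backslash, otherwise they are treated as literal braces, while egrep uses them as written. Angle brackets mark word boundaries only when they are preceded by backslashes, and some processors, such as Perl and PowerShell, use the escape sequence \b for word boundaries instead. You may have to experiment or read your program’s documentation to determine what it will support.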
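Word boundaries can be tried in a short shell sketch. This one assumes GNU grep, which accepts the escaped angle brackets \< and \> as word-boundary markers; other versions of grep may not. The -w option, which asks grep to match whole words only, is shown as an alternative. The variable name lines is only a convenience for this example.

```shell
lines='error
errors
terror
an error occurred
error: disk full'

# Escaped angle brackets mark the start and end of a word.
printf '%s\n' "$lines" | grep -E '\<error\>'
# prints error, "an error occurred", and "error: disk full"

# The -w option requests the same whole-word matching for the entire pattern.
printf '%s\n' "$lines" | grep -w 'error'
# prints the same three lines
```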

Summary

Regular expressions consist of character patterns that are matched against text strings. Each pattern must contain at least one character, but its matching capabilities can be enhanced with anchors, modifiers, and advanced meta-characters.
Regular expression patterns can be written to match almost any kind of text, but they don’t assign any meaning to that text. A regular expression recognizes no numeric values, it doesn’t understand what to do with punctuation marks, and it’s limited to matching on one line of a file at a time. All that the pattern can represent is text characters.
Regular expression processors are programs and programming language constructs that use patterns to find and work with text. Some of these have very advanced capabilities, such as extending a search to include multiple lines of a file or performing pattern matches backward as well as forward on the text. The beginner will need to practice with each of the regular expression tools that he intends to use to gain an understanding of its features, but the fundamental concepts covered in this tutorial will be applicable.
Regular expressions provide the administrator with tools to search for any kinds of text within files. They can be added to scripts to check for patterns within user input. They are often used to identify important information from log files, email servers, and web sites. Any process that works with text can be improved by the judicious use of regular expressions.

PowerShell Examples

When the PowerShell -match operator finds the pattern in a string it returns “True”. It returns “False” if the pattern is not found.
PS C:\> $value = "This is a test. 1234"
PS C:\> $value -match "a.t"
True
PS C:\> $value -match "^t"
True
Note that the -match operator is not case-sensitive. Use the -cmatch operator to perform case-sensitive matches.
PS C:\> $value -match "\st\w{3}"
True
PS C:\> $value -match "\d$"
True
PS C:\> $value -match "t[aeiou]s"
True
PS C:\> $value -match "t[^aeiou]s"
False
PS C:\> $value -match "\w+\. \d+$"
True
PS C:\> $value -match "\w+\. \d?$"
False
PS C:\> $value -match "(tisk|test)"
True

Grep Examples

The grep command is intended to work with files, so these examples pass a test string to the command through standard input. The command prints each line that contains a match for the pattern, or nothing if there is no match.
$ value="This is a test. 1234"
$ echo $value | egrep a.t
This is a test. 1234
$ echo $value | egrep ^t
$ echo $value | egrep -i ^t
This is a test. 1234
Note that grep is case-sensitive. Use the -i parameter switch to enable case-insensitive matches.
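$ echo $value | egrep '[[:space:]]t[[:alpha:]]{3}'
This is a test. 1234
With grep, character classes must be contained within square brackets, which produces the double-bracket form shown here. Note that egrep uses curly braces as written, while plain grep requires them to be escaped, as in \{3\}. Quoting the pattern keeps the shell from interpreting its special characters.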
$ echo $value | egrep [[:digit:]]$
This is a test. 1234
$ echo $value | egrep t[aeiou]s
This is a test. 1234
$ echo $value | egrep t[^aeiou]s
$ echo $value | egrep '[[:alpha:]]+\. [[:digit:]]+$'
This is a test. 1234
$ echo $value | egrep '[[:alpha:]]+\. [[:digit:]]?$'
$ echo $value | egrep '(tisk|test)'
This is a test. 1234
$ echo $value | egrep '\<test\>'
This is a test. 1234