The Basics of Regular Expressions

Introduction

Unix administrators have long used regular expressions to help them locate files, modify data, and manage system configurations. Tools like grep and sed are designed to process regular expressions to provide the administrator with exactly the information he wants. While versions of these tools have been ported to Windows, most Windows administrators are unaware that they exist. Because of the limitations of the Windows command shell, Windows administrators typically stick with slower, more complicated graphical tools to manage the operating system.

Enter PowerShell. PowerShell is designed to be a replacement for the standard Windows shell, and it is far more powerful and flexible than its predecessor. Among the many command enhancements PowerShell offers is built-in support for regular expressions. It borrows this capability heavily from Perl, a scripting language that was developed specifically for processing text.

Regular expressions are used to search for character sequences inside text strings or files. Programs that process regular expressions look for text that matches a given pattern. The components of a regular expression are not complicated, but the available combinations are many and varied, making it possible to perform some very sophisticated matches. Whether you’re administering Windows, Linux, or the Unix-based Mac OS X, you should invest some time learning the cryptic syntax of regular expressions so that you can manage systems and automate common tasks.

This tutorial will introduce regular expressions. It is not aimed at a particular operating system. Students of both Linux and PowerShell will come away with a basic knowledge of how regular expressions work and how to craft their own. Specific tools such as Linux’s grep command and PowerShell’s -match operator are covered in those respective classes at Centriq Training.

Overview

Whenever we begin to learn a new technology we get excited about the possibilities. We catch a glimpse of all kinds of nifty things that we can do with this knowledge. We often forget, though, that each technology has its limitations. Regular expressions are cool and powerful and flexible and a lot of other things, but there are some things that they’re not—some things that they cannot do. Like all things new, regular expressions come with a learning curve that is best overcome with practice. To avoid getting frustrated as you begin to learn regular expressions you must always keep three rules in mind.

Rule 1. Regular expressions match text, not numbers.

Regular expressions can represent any sequence of characters that you can find on a typical keyboard, and even some that you can’t, but they can’t express any other kind of data. They don’t understand numbers.

This confuses people at first, because regular expressions are frequently employed to determine things like whether a user’s input is a numeric age or zip code or year. But in these cases the regular expression is used to test the characters and ensure that they are numerals. The quantities that these numerals stand for are completely lost on the regular expression: it can’t identify a numeral’s value.

Remember, even numbers are written with characters. Regular expressions can be used to recognize those characters, but they can’t be used to determine their values, so they don’t work with actual numbers.
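Here is a minimal Python sketch of Rule 1, assuming Python's standard re module as the regular expression processor. The function name is_numeral_string is made up for this illustration, and the pattern uses syntax that is explained later in this tutorial. The pattern can confirm that the input is made of numerals, but comparing values takes ordinary arithmetic after the text has been converted to a number.

```python
import re

def is_numeral_string(text):
    """Return True if text consists only of the characters 0 through 9."""
    return re.search(r"^[0-9]+$", text) is not None

age = "42"
looks_numeric = is_numeral_string(age)          # True: every character is a numeral
not_numeric = is_numeral_string("forty-two")    # False: letters and a hyphen

# The pattern cannot tell whether 42 is larger than 18.
# That comparison needs an actual number, so convert the text first.
is_adult = looks_numeric and int(age) >= 18
```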

Rule 2. Regular expressions are made from three components: characters, anchors, and modifiers.

Every regular expression must have at least one character. This provides the basis for any match that will be performed. Anchors may be used to establish that the characters belong in certain positions in the text. Modifiers may be used to match repeated instances of a character or change a character’s meaning.

Characters are pretty simple, but the anchors and modifiers are what make regular expressions so powerful—and difficult to read. No matter how complex the expression, though, it always begins with at least one character.

Rule 3. Each program is a little different.

There are many programs and programming languages that can process regular expressions. While there is a standard definition of characters, anchors, and modifiers, individual programs have sometimes extended and customized their definition of a regular expression. Generally, this is done to make expressions easier for us humans to use, but it can lead to confusion for the student who is just learning the syntax. In this tutorial I’m going to stick mostly with standard syntax, but at the end I’ll provide specific examples with grep and PowerShell.

So if you run across a regular expression that looks unusual or that doesn’t work in your specific tool, that doesn’t necessarily mean that it’s wrong. Each program is a little different.

A First Look at Characters

The group of characters that make up a regular expression’s search string is called a pattern. Patterns can be very simple. For example, d is a valid pattern. It matches any string of text that contains a “d” character, like “Dave”, “dude”, or “all work and no play makes Jack a dull boy”. Whether the matching is case-sensitive depends on the processor that’s performing the match. Remember Rule 3: each program is a little different.

When it comes to matching characters, you should know that most regular expression processors are going to match only the first instance that they find. In that last string, the expression matches the “d” in “and”, but not the one in “dull.” Once a program finds the first match, it usually stops processing the string altogether.
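As a rough illustration of first-match behavior, here is a sketch that assumes Python's re module. Its search function stops at the first match, while findall is a separate function that has to be asked for every match. Other processors offer their own switches for this.

```python
import re

text = "all work and no play makes Jack a dull boy"

# search() stops at the first match it finds.
first = re.search("d", text)
first_position = first.start()   # 11, the "d" in "and"

# findall() must be called explicitly to collect every match.
every = re.findall("d", text)    # ['d', 'd'], from "and" and "dull"
```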

Characters can be combined to make whole words or phrases. The pattern error will match any string that contains the word “error” in any form. This includes “errors”, as well as other words like “terror”. Remember that the pattern you’re searching for is just a string of characters. If your pattern includes spaces or other special characters, you usually have to enclose it in quotation marks.

Sometimes we don’t want to match a particular character, just any character. For this we use a wildcard, which in regular expression syntax is a period. The pattern d.d would match “dad”, “dude”, and “katydid”, because there must be exactly one character between the two d’s. The pattern d...d will match “David”, “domed”, “android”, and even “my good buddy Steve” because in each case there are d’s separated by three characters. Note that in the last example those three characters include a space. That’s okay: a wildcard matches any character—even numerals, spaces, and punctuation marks.
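The examples in this section can be tried out with a short sketch like the following. It assumes Python's re module, and the helper name matches is invented for the illustration. Python is case-sensitive by default, so the helper asks for case-insensitive matching in order to let “David” match.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# A literal pattern matches wherever those characters appear.
literal = [matches("error", s) for s in ("errors", "terror", "success")]
# [True, True, False]

# One wildcard requires exactly one character between the d's.
one_dot = [matches("d.d", s) for s in ("dad", "dude", "katydid", "dd")]
# [True, True, True, False]

# Three wildcards; a space counts as a character too.
three_dots = [matches("d...d", s)
              for s in ("David", "domed", "android", "my good buddy Steve")]
# [True, True, True, True]
```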

Anchors Put It Where You Want It

An anchor is a special character that ties a part of your pattern to the beginning or end of a text string. The caret symbol, ^, can be created by pressing Shift-6 on most US keyboards. It anchors a pattern to the beginning of a search string. The pattern ^car will match “caret”, but not “lascar”. The caret symbol at the beginning of the pattern tells a regular expression processor that no character may precede the pattern.

The dollar symbol, $, can be created by pressing Shift-4 on most US keyboards. It anchors the pattern to the end of a text string. The pattern ave$ will match “Dave” but not “avenue”. The dollar symbol at the end of the pattern tells the regular expression processor that no character may follow the pattern.

A pattern that includes both anchors can be used to search for an exact match of the pattern to the text string. The pattern ^error$ will exactly match the string “error”, but not “terror” or “errors”. It will not match a string that contains the word “error” among other things, like “error in module fuse.ko”—it is always an exact match.

Of course, you can combine wildcards with anchors. The pattern ^d...d$ will match “David” and “dared”, but not “android” or “dreaded”. And if you want to match on a blank line, you can use the pattern ^$.
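The anchor examples above can be checked with a small sketch, again assuming Python's re module and a made-up helper named matches that ignores case.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found in text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

starts = [matches("^car", s) for s in ("caret", "lascar")]        # [True, False]
ends = [matches("ave$", s) for s in ("Dave", "avenue")]           # [True, False]

# Both anchors together demand an exact match of the whole string.
exact = [matches("^error$", s)
         for s in ("error", "terror", "errors", "error in module fuse.ko")]
# [True, False, False, False]

# Wildcards and anchors combine freely.
framed = [matches("^d...d$", s)
          for s in ("David", "dared", "android", "dreaded")]
# [True, True, False, False]

blank_line = matches("^$", "")                                    # True
```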

Expanding a Pattern With Modifiers

The Great Escape

You’ve learned that, in regular expression syntax, a period is a wildcard. But sometimes we want to search for a literal period. Because the period stands for any character, the pattern ^169.254. would match the IP address “169.254.14.2”, but it would also match the string “1698254a”, which is not what we’re looking for. What we want to do is modify the period and change its meaning.

The backslash character, \, is a special modifier called the “escape character”. It changes the meaning of the character that immediately follows it, “escaping” from the normal interpretation of the pattern. When it precedes a period, the backslash takes away the period’s meaning as a wildcard so that it becomes a normal period. So the pattern ^169\.254\. will match “169.254.14.2” but not “1698254a”. Since the periods have been “escaped”, there are no wildcards in this pattern. Likewise, the pattern ^\$ will look for a string that begins with a dollar sign, and ^4\^2$ will match a string that contains exactly “4^2”.
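A short sketch of escaping, assuming Python's re module. The patterns are written as raw strings (the r prefix) so that Python itself leaves the backslashes alone. The re.escape function shown at the end is a Python convenience that adds the backslashes for you; other tools may not have an equivalent.

```python
import re

loose = r"^169.254."      # the periods are wildcards
strict = r"^169\.254\."   # the periods are escaped, so they are literal

loose_hits = [re.search(loose, s) is not None
              for s in ("169.254.14.2", "1698254a")]    # [True, True]
strict_hits = [re.search(strict, s) is not None
               for s in ("169.254.14.2", "1698254a")]   # [True, False]

# Anchors can be escaped the same way when you need the literal symbol.
starts_with_dollar = re.search(r"^\$", "$5.00") is not None   # True
four_squared = re.search(r"^4\^2$", "4^2") is not None        # True

# re.escape() escapes the special characters in a literal string.
escaped = re.escape("169.254.")
```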

Time and Time Again

It is often necessary to look for repeated characters in a text string. Multipliers (most documentation calls them quantifiers) allow you to extend your expression to include repetition in your searches. They are special characters that follow some other character in a pattern. They multiply the character that appears immediately before them by some value. There are several multipliers, so you’ll need to commit them to memory.

The question mark, ?, can be produced on most US keyboards by pressing Shift-/. It multiplies the preceding character by zero or one. The pattern ^d.?d$ will match both “dd” and “did”, requiring zero or one characters between the d’s.

The asterisk, *, can be produced on most US keyboards by pressing Shift-8. The asterisk multiplies the preceding character by zero or more. The pattern ^d.*d$ will match “dd”, “did”, “dreamed”, and “drumming in your head”, because it allows any number of characters to exist between the d’s.

The plus symbol, +, can be produced on most US keyboards by pressing Shift-=. It multiplies the preceding character by one or more. The pattern ^d.+d$ matches “did”, “dreamed”, and “drumming in your head”, but it does not match “dd”. The plus symbol requires at least one character to appear between the d’s.
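To compare the three basic multipliers side by side, here is a sketch that assumes Python's re module. The word list and the helper name matching_words are just for illustration.

```python
import re

words = ["dd", "did", "dreamed", "drumming in your head"]

def matching_words(pattern):
    """Return the words from the list above that the pattern matches."""
    return [w for w in words if re.search(pattern, w)]

zero_or_one = matching_words(r"^d.?d$")    # ['dd', 'did']
zero_or_more = matching_words(r"^d.*d$")   # all four strings
one_or_more = matching_words(r"^d.+d$")    # everything except 'dd'
```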

Advanced multiplication is not supported by all regular expression processors, and not consistently among those that do. Remember, each program is a little different. However, because many programs and programming languages support it to some extent you should get to know it and its variations.

Advanced multipliers contain values inside curly braces, the { and } characters. These can be produced on most US keyboards by pressing Shift-[ and Shift-] respectively.

Placing a single value in the braces multiplies the preceding character by exactly that number. The pattern ^d.{3}d$ matches “David” and “druid”, requiring exactly three characters between the d’s. Note that this pattern could have been written ^d...d$, but as we learn about more characters we’ll see that the advanced multiplier can be much easier to read.

Enclosing two values separated by a comma within the braces, we get a specific range of multipliers. The pattern ^d.{3,5}d$ multiplies the wildcard by three, four, or five. It matches “druid”, “darned”, and “dreamed”, requiring three to five characters between the d’s.

If the braces contain a single number followed by a comma, the range of multipliers has no upper limit. The pattern ^d.{3,}d$ matches any string with three or more characters between the d’s.
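The three brace forms can be compared with a sketch like this one. It assumes Python's re module, which supports braces without any escaping, and uses an invented helper that ignores case so that “David” matches.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found in text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

candidates = ("did", "David", "druid", "darned", "dreamed", "discarded")

# Exactly three characters between the d's.
exactly_three = [w for w in candidates if matches(r"^d.{3}d$", w)]
# ['David', 'druid']

# Three, four, or five characters between the d's.
three_to_five = [w for w in candidates if matches(r"^d.{3,5}d$", w)]
# ['David', 'druid', 'darned', 'dreamed']

# Three or more characters, with no upper limit.
three_or_more = [w for w in candidates if matches(r"^d.{3,}d$", w)]
# ['David', 'druid', 'darned', 'dreamed', 'discarded']
```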

Advanced Characters

There’s more to characters than literal symbols and wildcards. While the advanced characters may not look like single characters to you, to the regular expression processor they are indeed just a character. Sometimes these are called “meta-characters”, because they are a group of symbols that stand for a single character. Combined with multipliers, these meta-characters make it possible to create sophisticated searches.

Square brackets allow you to specify one from a group of characters that you want to match. The pattern ^d[aiu]d$ represents three characters. The symbols between the brackets are applied in turn to the search, so that this pattern matches “dad”, “did”, and “dud”. The pattern requires that there must be nothing but an a, i, or u between the d’s.

If you want to search for a string that begins with a vowel you can use the pattern ^[aeiou]. You can negate the pattern with a caret symbol inside the brackets. The pattern ^[^aeiou] matches all strings that begin with any character that is not a vowel. Don’t let the use of the caret confuse you. At the beginning of a pattern the caret is an anchor. Inside square brackets it reverses the meaning of the group. This could be read as “not a, e, i, o, or u”, so inside brackets the caret symbol means “not”.

The brackets can also contain a range, two values separated by a hyphen. The pattern [0-9] represents all numerals. The pattern ^169\.254\.[0-9]{1,3}\.[0-9]{1,3}$ uses escape sequences, ranges, and multipliers to match IP addresses that begin with “169.254.”.

Since a range of characters is just a character in regular expression syntax, ranges can be grouped. The pattern ^[0-9a-z] will match all strings that begin with a numeral or a lowercase letter. Many shell scripts use the pound sign at the beginning of a line to identify comments. The pattern ^[#a-zA-Z] matches all the lines in a script that begin with a pound sign or a letter.
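Here is a sketch of bracket groups and ranges, assuming Python's re module and case-sensitive matching. The last check is a reminder of Rule 1: the IP pattern only counts numerals, so it happily accepts an address whose values are out of range.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found in text (case-sensitive)."""
    return re.search(pattern, text) is not None

group = [matches(r"^d[aiu]d$", s) for s in ("dad", "did", "dud", "dod")]
# [True, True, True, False]

vowel_first = [matches(r"^[aeiou]", s) for s in ("apple", "pear")]     # [True, False]
not_vowel_first = [matches(r"^[^aeiou]", s) for s in ("apple", "pear")] # [False, True]

comment_or_letter = [matches(r"^[#a-zA-Z]", s)
                     for s in ("# comment", "echo hi", "  indented", "42")]
# [True, True, False, False]

ip_pattern = r"^169\.254\.[0-9]{1,3}\.[0-9]{1,3}$"
ip_hits = [matches(ip_pattern, s)
           for s in ("169.254.14.2", "169.254.14", "10.0.0.1")]
# [True, False, False]

# The pattern checks characters, not values, so this also matches.
out_of_range = matches(ip_pattern, "169.254.999.999")   # True
```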

Many common character groups have special named classes defined for them. These classes are only recognized inside square brackets, so the range [0-9] can also be written [[:digit:]]. Other classes include [:alpha:] for all letters, [:lower:] for lowercase letters, [:upper:] for uppercase letters, [:alnum:] for letters and numbers, [:space:] for white space characters like space and tab, [:cntrl:] for non-printable control characters, and [:xdigit:] for characters used to represent hexadecimal numbers, equivalent to [0-9a-fA-F]. Because a class is just another member of a bracket group, it can be combined with other characters: [[:alpha:]_] matches a letter or an underscore.

Some advanced regular expression processors such as those found in Perl and PowerShell can also use escape sequences to represent character classes. Some common ones are: \d to represent any digit; \w to represent any word character, meaning letters, numbers, and the underscore; and \s to represent white space characters. A capital letter in the escape sequence negates it, so \D represents any character that is not a digit, and \S represents any character that is not white space.
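Python's re module is one of the processors that understands these escape sequences. It does not recognize the named classes such as digit and alpha, so the sketch below sticks to the escapes. The sample strings are only for illustration.

```python
import re

text = "This is a test. 1234"

digits = re.findall(r"\d", text)     # ['1', '2', '3', '4']
words = re.findall(r"\w+", text)     # ['This', 'is', 'a', 'test', '1234']
spaces = re.findall(r"\s", text)     # four single spaces

# \w includes the underscore but not the hyphen.
underscore_ok = re.search(r"^\w+$", "log_file_2") is not None   # True
hyphen_ok = re.search(r"^\w+$", "log-file") is not None         # False

# A capital letter negates the class: remove everything that is not a digit.
digits_only = re.sub(r"\D", "", "555-0100")                     # '5550100'
```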

Parentheses, ( and ), can be produced on most US keyboards by pressing Shift-9 and Shift-0 respectively. They can be used to combine multiple patterns together. The pipe symbol, |, is produced by pressing Shift-\ on most US keyboards. It ties together multiple patterns within the parentheses using “or” logic. The expression (^[[:digit:]]{5}$|^[[:digit:]]{5}-[[:digit:]]{4}$) contains two patterns, and will process a string until one pattern or the other matches some text. This regular expression matches a US zip code written in either the five-digit or five-plus-four-digit format, such as “02134” and “64119-4105”. Note that there are no spaces around the pipe: a space inside a pattern is a literal character that must appear in the text.
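The zip code test might look like this in Python. This is only a sketch: it assumes Python's re module, which does not recognize the named digit class, so the range [0-9] stands in for it. The last two lines show how a stray space around the pipe changes what a pattern matches.

```python
import re

zip_pattern = r"(^[0-9]{5}$|^[0-9]{5}-[0-9]{4}$)"

def is_zip(text):
    """Return True if text looks like a five-digit or ZIP+4 code."""
    return re.search(zip_pattern, text) is not None

zip_hits = [is_zip(s) for s in
            ("02134", "64119-4105", "6411", "64119-41", "64119 4105")]
# [True, True, False, False, False]

# Spaces inside the parentheses are literal characters.
with_spaces = re.search(r"(cat | dog)", "dog") is not None      # False
without_spaces = re.search(r"(cat|dog)", "dog") is not None     # True
```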

Finally come the angle brackets, < and >, which can be produced on most US keyboards by pressing Shift-, and Shift-. respectively. These identify word boundaries, so whatever is enclosed within is considered to be a whole word. In grep they are written with backslashes: the pattern \<error\> matches “error”, but not “errors” or “terror”. The angle brackets require that no word character (a letter, numeral, or underscore) appear on either side of the enclosed pattern, so only white space, punctuation, or the beginning or end of the text may be there.

Please note that many regular expression processors will require that curly braces and angle brackets be preceded by a backslash to escape them, otherwise they are treated as literal brackets. You may have to experiment or read your program’s documentation to determine what it will support.
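Not every processor uses angle brackets for word boundaries. Python's re module, like Perl and PowerShell, writes a word boundary as \b on either side of the word. The sketch below assumes Python and shows the same “error” example in that notation.

```python
import re

def has_word(word, text):
    """Return True if word appears in text as a whole word."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

whole_word = [has_word("error", s) for s in
              ("error", "fatal error: disk full", "errors", "terror", "error_log")]
# [True, True, False, False, False]
# The underscore counts as a word character, so "error_log" is one long word.
```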

Summary

Regular expressions consist of character patterns that are matched against text strings. Each pattern must contain at least one character, but its matching capabilities can be enhanced with anchors, modifiers, and advanced meta-characters.

Regular expression patterns can be written to match almost any kind of text, but they don’t assign any meaning to that text. A regular expression recognizes no numeric values, it doesn’t understand what to do with punctuation marks, and it’s limited to matching on one line of a file at a time. All that the pattern can represent is text characters.

Regular expression processors are programs and programming language constructs that use patterns to find and work with text. Some of these have very advanced capabilities, such as extending a search to include multiple lines of a file or performing pattern matches backward as well as forward on the text. The beginner will need to practice with each of the regular expression tools that he intends to use to gain an understanding of its features, but the fundamental concepts covered in this tutorial will be applicable.

Regular expressions provide the administrator with tools to search for any kinds of text within files. They can be added to scripts to check for patterns within user input. They are often used to identify important information from log files, email servers, and web sites. Any process that works with text can be improved by the judicious use of regular expressions.

PowerShell Examples

When the PowerShell -match operator finds the pattern in a string it returns “True”. It returns “False” if the pattern is not found.

PS C:\> $value = "This is a test. 1234"
PS C:\> $value -match "a.t"
True
PS C:\> $value -match "^t"
True

Note that the -match operator is not case-sensitive. Use -cmatch when you need a case-sensitive match.

PS C:\> $value -match "\st\w{3}"
True
PS C:\> $value -match "\d$"
True
PS C:\> $value -match "t[aeiou]s"
True
PS C:\> $value -match "t[^aeiou]s"
False
PS C:\> $value -match "\w+\. \d+$"
True
PS C:\> $value -match "\w+\. \d?$"
False
PS C:\> $value -match "(tisk|test)"
True

Grep Examples

The grep command is intended to work with files, so these examples pass a test string to the command through standard input. The command prints each input line that contains a match for the pattern, and prints nothing if there is no match.

$ value="This is a test. 1234"
$ echo $value | egrep a.t
This is a test. 1234
$ echo $value | egrep ^t
$ echo $value | egrep -i ^t
This is a test. 1234

Note that grep is case-sensitive. Use the -i parameter switch to enable case-insensitive matches.

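$ echo $value | egrep '[[:space:]]t[[:alpha:]]{3}'
This is a test. 1234

With grep, character classes must be contained within square brackets, which is why the double-bracket form is the most common. Quote the pattern so that the shell does not interpret its special characters before grep sees them. Note that egrep accepts curly braces as written, while plain grep requires them to be escaped as \{ and \}.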

$ echo $value | egrep [[:digit:]]$
This is a test. 1234
$ echo $value | egrep t[aeiou]s
This is a test. 1234
$ echo $value | egrep t[^aeiou]s
$ echo $value | egrep '[[:alpha:]]+\. [[:digit:]]+$'
This is a test. 1234
$ echo $value | egrep '[[:alpha:]]+\. [[:digit:]]?$'
$ echo $value | egrep '(tisk|test)'
This is a test. 1234
$ echo $value | egrep '\<test\>'
This is a test. 1234

i was working on a little project that required me to read an xml file from the internet. since i like to have some useful examples to refer to when i'm doing something that i don't often do (and so don't remember how to) i decided to dig into the three scripting languages that i'm most likely to use and learn how to work with xml data.

you can grab the sample xml file here: tech_books.xml. i recommend that you look it over so you can see what the scripts are doing. we'll be writing scripts to list all the book titles and authors, then search for a specific title and fetch its isbn and its parent publisher's name attribute. all three should provide the same output, which you can see below:


Title: The Revolutionary Guide to Assembly Language
Author: Vitaly Maljugin
Author: Jacov Izrailevich
Author: A. Sopin
Author: S. Lavin
- - - - - - - - - - - - - - - - - - - - - - - - - 
Title: Writing Solid Code
Subtitle: Microsoft's Techniques for Developing Bug-Free C Programs
Author: Steve Maguire
- - - - - - - - - - - - - - - - - - - - - - - - - 
Title: Windows PowerShell Scripting Guide
Subtitle: Automating Administration of Windows Vista and Windows Server 2008
Author: Ed Wilson
- - - - - - - - - - - - - - - - - - - - - - - - - 
Title: Practical C Programming
Author: Steve Oualline
- - - - - - - - - - - - - - - - - - - - - - - - - 
Title: Programming Python
Author: Mark Lutz
- - - - - - - - - - - - - - - - - - - - - - - - - 
Title: The Ruby Way, Second Edition
Author: Hal Fulton
- - - - - - - - - - - - - - - - - - - - - - - - - 
Title: LaTeX: A Document Preparation System
Author: Leslie Lamport
- - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - - - - - - - - - - - - 
Writing Solid Code [ISBN 1-556-15551-4] published by Microsoft Press
- - - - - - - - - - - - - - - - - - - - - - - - -

the first language i turned to is python, an old friend of mine. reading through the xml document object model api made my head swim, but with a little research i came across what used to be a third-party module, elementtree, that has now been incorporated into the core python installation. (this is one of the reasons that i love open source software.)

elementtree uses parent.findall(child) to fetch a list of all nodes of type child under parent. the parent.find(child) method returns only a single element object, or None if nothing matches. parent.findtext(child) returns the child node's text. it is similar to parent.find(child).text, but when the element is not found the former returns None while the latter raises an exception (an element that exists but has no text yields an empty string). element.get(attribute) returns a node's attribute value.
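to see those calls side by side, here is a small sketch that parses an inline string instead of the real file. the xml layout below, including the catalog root element, is my guess at a minimal structure that fits the scripts; it is not copied from tech_books.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the element layout is assumed, not taken
# from the real tech_books.xml file.
SAMPLE = """<catalog>
  <publisher name="Microsoft Press">
    <book>
      <title>Writing Solid Code</title>
      <author>Steve Maguire</author>
      <isbn>1-556-15551-4</isbn>
    </book>
  </publisher>
</catalog>"""

root = ET.fromstring(SAMPLE)

book = root.find("publisher/book")                  # first matching element
title = book.findtext("title")                      # 'Writing Solid Code'
authors = [a.text for a in book.findall("author")]  # ['Steve Maguire']
publisher_name = root.find("publisher").get("name") # 'Microsoft Press'

# This book has no subtitle element.
missing = book.findtext("subtitle")                 # None
try:
    book.find("subtitle").text                      # find() returned None
    raised = False
except AttributeError:
    raised = True
```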


# xmlsearch.py

import xml.etree.ElementTree


# Load the sample catalog from the current directory.
tree = xml.etree.ElementTree.ElementTree()
tree.parse('tech_books.xml')

# List all the books by title and author from the XML data.
books = tree.findall('publisher/book')
for book in books:
    print("Title: %s" % book.findtext('title'))
    if book.findtext('subtitle'):
        print("Subtitle: %s" % book.findtext('subtitle'))
    authors = book.findall('author')
    for author in authors:
        print("Author: %s" % author.text)
    print('- ' * 25)

# Get the ISBN and publisher for a book called "Writing Solid Code."
print('- ' * 25)
search = 'Writing Solid Code'
publishers = tree.findall('publisher')
for publisher in publishers:
    books = publisher.findall('book')
    for book in books:
        if book.findtext('title') == search:
            print("%s [ISBN %s] published by %s" % (search,
                book.findtext('isbn'), publisher.get('name')))
print('- ' * 25)

next i investigated xml under powershell. i was certain that it would be no more difficult than python, and what i found was that basic xml searches worked in a very similar way.

powershell has xml processing built-in, so no modules have to be included in the source file to make it work. it uses parent.SelectNodes(child) to get a list of all nodes called child under the node parent. the parent.SelectSingleNode(child) method returns only the first node child that is found under parent. note the plural SelectNodes and singular SelectSingleNode. element.GetAttribute(attribute) returns a node's attribute value.


# xmlsearch.ps1

[xml] $tree = Get-Content tech_books.xml

# List all the books by title and author from the XML data.
$books = $tree.SelectNodes('//publisher//book')
foreach ($book in $books) {
    "Title: {0}" -f $book.title
    if ($book.subtitle) {
        "Subtitle: {0}" -f $book.subtitle
    }
    $authors = $book.SelectNodes('author')
    foreach ($author in $authors) {
        "Author: {0}" -f $author.'#text'
    }
    '- ' * 25
}

# Get the ISBN and publisher for a book called "Writing Solid Code."
'- ' * 25
$search = 'Writing Solid Code'
$publishers = $tree.SelectNodes('//publisher')
foreach ($publisher in $publishers) {
    $books = $publisher.SelectNodes('book')
    foreach ($book in $books) {
        if ($book.title -eq $search) {
            "{0} [ISBN {1}] published by {2}" -f $search, $book.isbn,
                $publisher.GetAttribute('name')
        }
    }
}
'- ' * 25

finally i dusted off ruby, a language that i haven't really touched in more than a year, to see what it would do. again, i wasn't impressed with the general xml documentation, but google searches led me to a ruby gem called nokogiri. i installed it and was very pleased with the result.

nokogiri uses parent.xpath(child) to get an array of child elements, parent.at_xpath(child) to get a single element, and element.attr(attribute) to get a node's attribute value. one nifty property that the python example can't call on is element.parent, which returns the parent node of the current element. (elementtree elements don't keep a reference to their parent. .net's xml nodes do have an equivalent ParentNode property, though the powershell script above doesn't use it.) it saves us from having to define multiple “books” arrays in this example.


# xmlsearch.rb

require 'rubygems'
require 'nokogiri'


tree = Nokogiri.XML(open('tech_books.xml'))

# List all the books by title and author from the XML data.
books = tree.xpath('//publisher//book')
books.each do |book|
  puts "Title: %s" % book.at_xpath('title').content
  if book.at_xpath('subtitle')
    puts "Subtitle: %s" % book.at_xpath('subtitle').content
  end
  authors = book.xpath('author')
  authors.each do |author|
    puts "Author: %s" % author.content
  end
  puts '- ' * 25
end

# Get the ISBN and publisher for a book called "Writing Solid Code."
puts '- ' * 25
search = 'Writing Solid Code'
books.each do |book|
  if book.at_xpath('title').content == search
    puts "%s [ISBN %s] published by %s" % [search,
      book.at_xpath('isbn').content, book.parent.attr('name')]
  end
end
puts '- ' * 25

so after doing a little research i find that it's not really too difficult to read xml in my favorite scripting languages, and now i have some nice little examples to look at the next time i need a refresher. until next time.

a traveller from an antique land

Monday, August 11, 2014

Express Yourself, Regularly

Sunday, July 25, 2010

reading xml 1: powershell, python and ruby, oh my!
