Thursday, July 29, 2010

xml 2: and now, a word from our compilers...


last time i demonstrated how to read xml data from a file, looking at the xml support in a few scripting languages. so that we might round out this discussion i will now show how to do the exact same thing with the exact same data using a handful of compiled languages. refer to my previous post to get the xml file and view what the program output should be.

the first language we'll look at is the one of the three that i am least familiar with: c#. despite the fact that i've never used the language outside the classroom and never written a line of “real” production code with it, this was the easiest to use. c# reads and writes much like the scripting languages we looked at last time.


// xmlsearch.cs

// Demonstrates how to parse data in an XML file using C-sharp's XPath
// implementation. The file is a collection of books, tech_books.xml.

using System;
using System.Xml;


class XmlSearch {

    static void Main(string[] args) {
        String separator = "- - - - - - - - - - - - - - - - - - - - - - - - -";
        XmlDocument tree = new XmlDocument();
        tree.Load("tech_books.xml");

        // List all the books by title and author from the XML data.
        XmlNodeList books = tree.SelectNodes("//publisher//book");
        foreach(XmlNode book in books) {
            Console.WriteLine("Title: {0}",
                book.SelectSingleNode("title").InnerText);
            if(book.SelectSingleNode("subtitle") != null) {
                Console.WriteLine("Subtitle: {0}",
                    book.SelectSingleNode("subtitle").InnerText);
            }
            foreach(XmlNode author in book.SelectNodes("author")){
                Console.WriteLine("Author: {0}", author.InnerText);
            }
            Console.WriteLine(separator);
        }

        // Get the ISBN and publisher for a book called "Writing Solid Code."
        Console.WriteLine(separator);
        String search = "Writing Solid Code";
        foreach(XmlNode book in books) {
            if(book.SelectSingleNode("title").InnerText == search) {
                // "ISBN" prefix added to match the output shown in part 1.
                Console.WriteLine("{0} [ISBN {1}] published by {2}", search,
                    book.SelectSingleNode("isbn").InnerText,
                    book.ParentNode.Attributes["name"].Value);
            }
        }
        Console.WriteLine(separator);
    } // end method Main

} // end class XmlSearch

the next example is in java. while i have worked with java, i've never had to make it read xml before. i was a little disappointed. with factory classes and old-fashioned for loops it's more like c than c#.


/* XmlSearch.java
**
** Demonstrates how to parse data in an XML file using Java's XPath API.
** The file is a collection of books, tech_books.xml.
*/

import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;


public class XmlSearch {

    public static void main(String[] args)
            throws ParserConfigurationException, XPathExpressionException,
                   org.xml.sax.SAXException, java.io.IOException {

        String separator = "- - - - - - - - - - - - - - - - - - - - - - - - -";
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document tree = factory.newDocumentBuilder().parse("tech_books.xml");
        XPath xp = XPathFactory.newInstance().newXPath();

        // List all the books by title and author from the XML data.
        NodeList books = (NodeList)xp.evaluate("//publisher//book", tree,
            XPathConstants.NODESET);
        for(int i = 0; i < books.getLength(); i++) {
            Node book = books.item(i);
            System.out.printf("Title: %s\n",
                xp.evaluate("title", book, XPathConstants.STRING));
            if((xp.evaluate("subtitle", book, XPathConstants.NODE)) != null) {
                System.out.printf("Subtitle: %s\n",
                    xp.evaluate("subtitle", book, XPathConstants.STRING));
            }
            NodeList authors = (NodeList)xp.evaluate("author", book,
                XPathConstants.NODESET);
            for(int j = 0; j < authors.getLength(); j++) {
                System.out.printf("Author: %s\n",
                    authors.item(j).getTextContent());
            }
            System.out.println(separator);
        }

        // Get the ISBN and publisher for a book called "Writing Solid Code."
        System.out.println(separator);
        String search = "Writing Solid Code";
        for(int i = 0; i < books.getLength(); i++) {
            Node book = books.item(i);
            if(search.equals(
                    xp.evaluate("title", book, XPathConstants.STRING))) {
                // "ISBN" prefix added to match the output shown in part 1.
                System.out.printf("%s [ISBN %s] published by %s\n", search,
                    xp.evaluate("isbn", book, XPathConstants.STRING),
                    ((Element)book.getParentNode()).getAttribute("name"));
            }
        }
        System.out.println(separator);
    } // end method main

} // end class XmlSearch

for the sake of completeness, here is the code to do the same thing in c. i'm using libxml2 for its dom and xpath capabilities. since this is a short example, i have simplified it by leaving out all pointer checks and memory management. don't ever do that in real life. be sure to clean up after yourself (in libxml2 that means xmlFree for the strings, xmlXPathFreeObject and xmlXPathFreeContext for the xpath results, and xmlFreeDoc for the document): your parent object doesn't work here.


/* xmlsearch.c
 *
 * Demonstrates how to parse data in an XML file using LIBXML2 and XPath in C.
 * The file is a collection of books, tech_books.xml.
 */

#include <stdio.h>
#include <stdlib.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>


int main(int argc, char **argv) {
    xmlDocPtr doc;
    xmlXPathContextPtr tree;
    xmlXPathObjectPtr result;
    xmlNodeSetPtr nodeList;
    xmlNodePtr book, cursor;
    xmlChar *title, *subtitle, *author, *search, *isbn, *publisher;
    char *separator = "- - - - - - - - - - - - - - - - - - - - - - - - -";
    int found = 0;

    doc = xmlParseFile("tech_books.xml");
    tree = xmlXPathNewContext(doc);
    result = xmlXPathEvalExpression((xmlChar *)"//publisher//book", tree);
    nodeList = result->nodesetval;

    // List all the books by title and author from the XML data.
    for(int i = 0; i < nodeList->nodeNr; i++) {
        book = nodeList->nodeTab[i];
        cursor = book->xmlChildrenNode;
        while(cursor != NULL) {
            if((!xmlStrcmp(cursor->name, (xmlChar *)"title"))) {
                title = xmlNodeListGetString(doc, cursor->xmlChildrenNode, 1);
                printf("Title: %s\n", title);
            }
            if((!xmlStrcmp(cursor->name, (xmlChar *)"subtitle"))) {
                subtitle = xmlNodeListGetString(doc,
                    cursor->xmlChildrenNode, 1);
                printf("Subtitle: %s\n", subtitle);
            }
            if((!xmlStrcmp(cursor->name, (xmlChar *)"author"))) {
                author = xmlNodeListGetString(doc, cursor->xmlChildrenNode, 1);
                printf("Author: %s\n", author);
            }
            cursor = cursor->next;
        }
        puts(separator);
    }

    // Get the ISBN and publisher for a book called "Writing Solid Code."
    puts(separator);
    search = (xmlChar *)"Writing Solid Code";
    result = xmlXPathEvalExpression((xmlChar *)"//publisher", tree);
    nodeList = result->nodesetval;

    for(int i = 0; i < nodeList->nodeNr; i++) {
        cursor = nodeList->nodeTab[i];
        publisher = xmlGetProp(cursor, (xmlChar *)"name");
        book = cursor->xmlChildrenNode;
        book = book->next;
        while(book != NULL) {
            cursor = book->xmlChildrenNode;

            while(cursor != NULL) {
                if((!xmlStrcmp(cursor->name, (xmlChar *)"title"))) {
                    title = xmlNodeListGetString(doc,
                        cursor->xmlChildrenNode, 1);
                }
                if((!xmlStrcmp(cursor->name, (xmlChar *)"isbn"))) {
                    isbn = xmlNodeListGetString(doc,
                        cursor->xmlChildrenNode, 1);
                }
                cursor = cursor->next;
            }
            if((!xmlStrcmp(title, search))) {
                found = 1;
                break;
            }
            book = book->next;
        }
        if(found) {
            // "ISBN" prefix added to match the output shown in part 1.
            printf("%s [ISBN %s] published by %s\n", title, isbn, publisher);
            break;
        }
    }
    puts(separator);

    return 0;
} // end method main

Sunday, July 25, 2010

reading xml 1: powershell, python and ruby, oh my!


i was working on a little project that required me to read an xml file from the internet. since i like to have some useful examples to refer to when i'm doing something that i don't often do (and so don't remember how to) i decided to dig into the three scripting languages that i'm most likely to use and learn how to work with xml data.

you can grab the sample xml file here: tech_books.xml. i recommend that you look it over so you can see what the scripts are doing. we'll be writing scripts to list all the book titles and authors, then search for a specific title and fetch its isbn and its parent publisher's name attribute. all three should provide the same output, which you can see below:

Title: The Revolutionary Guide to Assembly Language
Author: Vitaly Maljugin
Author: Jacov Izrailevich
Author: A. Sopin
Author: S. Lavin
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Writing Solid Code
Subtitle: Microsoft's Techniques for Developing Bug-Free C Programs
Author: Steve Maguire
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Windows PowerShell Scripting Guide
Subtitle: Automating Administration of Windows Vista and Windows Server 2008
Author: Ed Wilson
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Practical C Programming
Author: Steve Oualline
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Programming Python
Author: Mark Lutz
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: The Ruby Way, Second Edition
Author: Hal Fulton
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: LaTeX: A Document Preparation System
Author: Leslie Lamport
- - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
Writing Solid Code [ISBN 1-556-15551-4] published by Microsoft Press
- - - - - - - - - - - - - - - - - - - - - - - - -

the first language i turned to is python, an old friend of mine. reading through the xml document object model api made my head swim, but with a little research i came across what used to be a third-party module, elementtree, that has now been incorporated into the core python installation. (this is one of the reasons that i love open source software.)

elementtree uses parent.findall(child) to fetch a list of all nodes of type child under parent. the parent.find(child) method returns only the first matching element object. parent.findtext(child) returns the child node's text. it is similar to parent.find(child).text, except that when the element is not found the former returns None (or a default value that you supply), while the latter raises an exception. element.get(attribute) returns a node's attribute value.
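to make the difference between those calls concrete, here is a small sketch that runs them against an inline document instead of the real file. the xml below is a made-up two-book sample (the root element name, titles and authors are all placeholders), shaped only roughly like the publisher/book layout the scripts in this post expect.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; this is NOT the real tech_books.xml.
SAMPLE = """
<library>
  <publisher name="Example Press">
    <book>
      <title>First Book</title>
      <subtitle>With a Subtitle</subtitle>
      <author>A. Writer</author>
      <author>B. Writer</author>
    </book>
    <book>
      <title>Second Book</title>
      <author>C. Writer</author>
    </book>
  </publisher>
</library>
"""

root = ET.fromstring(SAMPLE.strip())
publisher = root.find('publisher')       # first matching child element
books = publisher.findall('book')        # list of every matching child
second = books[1]                        # this book has no subtitle

name = publisher.get('name')             # attribute value
authors = [a.text for a in books[0].findall('author')]
missing_text = second.findtext('subtitle')         # None: element not found
fallback = second.findtext('subtitle', 'n/a')      # caller-supplied default
missing_element = second.find('subtitle')          # None, so .text would raise

print(name, authors, missing_text, fallback, missing_element)
```

because findtext hands back None for a missing element, a plain truth test like the one used for the subtitle in the script below is enough to skip optional fields.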


# xmlsearch.py

import xml.etree.ElementTree


tree = xml.etree.ElementTree.ElementTree()
tree.parse('tech_books.xml')

# List all the books by title and author from the XML data.
books = tree.findall('publisher/book')
for book in books:
    # print(...) with a single argument runs under both Python 2 and 3.
    print("Title: %s" % book.findtext('title'))
    if book.findtext('subtitle'):
        print("Subtitle: %s" % book.findtext('subtitle'))
    authors = book.findall('author')
    for author in authors:
        print("Author: %s" % author.text)
    print('- ' * 25)

# Get the ISBN and publisher for a book called "Writing Solid Code."
print('- ' * 25)
search = 'Writing Solid Code'
publishers = tree.findall('publisher')
for publisher in publishers:
    books = publisher.findall('book')
    for book in books:
        if book.findtext('title') == search:
            print("%s [ISBN %s] published by %s" % (search,
                book.findtext('isbn'), publisher.get('name')))
print('- ' * 25)

next i investigated xml under powershell. i was certain that it would be no more difficult than python, and what i found was that basic xml searches worked in a very similar way.

powershell has xml processing built-in, so no modules have to be included in the source file to make it work. it uses parent.SelectNodes(child) to get a list of all nodes called child under the node parent. the parent.SelectSingleNode(child) method returns only the first node child that is found under parent. note the plural SelectNodes and the singular SelectSingleNode. element.GetAttribute(attribute) returns a node's attribute value.


# xmlsearch.ps1

[xml] $tree = Get-Content tech_books.xml

# List all the books by title and author from the XML data.
$books = $tree.SelectNodes('//publisher//book')
foreach ($book in $books) {
    "Title: {0}" -f $book.title
    if ($book.subtitle) {
        "Subtitle: {0}" -f $book.subtitle
    }
    $authors = $book.SelectNodes('author')
    foreach ($author in $authors) {
        "Author: {0}" -f $author.'#text'
    }
    '- ' * 25
}

# Get the ISBN and publisher for a book called "Writing Solid Code."
'- ' * 25
$search = 'Writing Solid Code'
$publishers = $tree.SelectNodes('//publisher')
foreach ($publisher in $publishers) {
    $books = $publisher.SelectNodes('book')
    foreach ($book in $books) {
        if ($book.title -eq $search) {
            "{0} [ISBN {1}] published by {2}" -f $search, $book.isbn,
                $publisher.GetAttribute('name')
        }
    }
}
'- ' * 25

finally i dusted off ruby, a language that i haven't really touched in more than a year, to see what it would do. again, i wasn't impressed with the general xml documentation, but google searches led me to a ruby gem called nokogiri. i installed it and was very pleased with the result.

nokogiri uses parent.xpath(child) to get an array of child elements, parent.at_xpath(child) to get a single element, and element.attr(attribute) to get a node's attribute value. one nifty property that elementtree lacks is element.parent, which returns the parent node of the current element (powershell's xml nodes offer the equivalent ParentNode property, though i didn't use it above). it saves us from having to define multiple “books” arrays in this example.
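for comparison, elementtree elements keep no reference to their parent, but a lookup table built in one pass over the tree gives a similar effect. the sketch below is only an illustration of that idiom: the inline xml, the isbn values and the name parent_of are all made up, and it is not how the python script in this post was written.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data with placeholder ISBNs; not the real tech_books.xml.
SAMPLE = """<library>
  <publisher name="Example Press">
    <book><title>First Book</title><isbn>0-000-00000-0</isbn></book>
  </publisher>
  <publisher name="Sample House">
    <book><title>Second Book</title><isbn>1-111-11111-1</isbn></book>
  </publisher>
</library>"""

root = ET.fromstring(SAMPLE)

# Map every child element to its parent by walking the whole tree once.
parent_of = {child: parent for parent in root.iter() for child in parent}

search = 'Second Book'
matches = []
for book in root.iter('book'):
    if book.findtext('title') == search:
        publisher = parent_of[book]      # stands in for nokogiri's book.parent
        matches.append((search, book.findtext('isbn'), publisher.get('name')))

for title, isbn, publisher_name in matches:
    print("%s [ISBN %s] published by %s" % (title, isbn, publisher_name))
```

with the table in hand a single loop over the books is enough, at the cost of one extra pass over the document to build it.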


# xmlsearch.rb

require 'rubygems'
require 'nokogiri'


tree = Nokogiri.XML(open('tech_books.xml'))

# List all the books by title and author from the XML data.
books = tree.xpath('//publisher//book')
books.each do |book|
  puts "Title: %s" % book.at_xpath('title').content
  if book.at_xpath('subtitle')
    puts "Subtitle: %s" % book.at_xpath('subtitle').content
  end
  authors = book.xpath('author')
  authors.each do |author|
    puts "Author: %s" % author.content
  end
  puts '- ' * 25
end

# Get the ISBN and publisher for a book called "Writing Solid Code."
puts '- ' * 25
search = 'Writing Solid Code'
books.each do |book|
  if book.at_xpath('title').content == search
    puts "%s [ISBN %s] published by %s" % [search,
      book.at_xpath('isbn').content, book.parent.attr('name')]
  end
end
puts '- ' * 25

so after doing a little research i find that it's not really too difficult to read xml in my favorite scripting languages, and now i have some nice little examples to look at the next time i need a refresher. until next time.

Wednesday, July 14, 2010

powershell: my 3 favorite functions (2 of 2)


you may have noticed that when you print a really, really long string in powershell, the text is wrapped at the edge of the console. if your screen is 80 characters wide, your string gets wrapped at 80 characters. you may also have noticed that this wrapping isn't very intelligent. your string will be wrapped at exactly 80 characters, even if that means that the line gets broken in the middle of a word. just look through the powershell help to see what i mean.

textwrap

python has a nifty little module called textwrap that wraps a string the smart way. i don't know a lot about it—i only use the parts i need. it is not my intent to re-write the python module for powershell, only to borrow the concept with my own textwrap function. the idea is to take a long string and a maximum width in characters and return a string formatted so that each line does not exceed the width. whole words should be preserved at the end of each line so that the text is readable. because this is a more complex function, we will build it incrementally and talk about it along the way.
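for reference, this is roughly what the python module does with a sentence like the one wrapped in the sample run further down. treat it as a sketch: the sentence is an assumed stand-in for the test string (single spaces between words), and textwrap.fill is the standard-library call that wraps a string and joins the resulting lines with newlines.

```python
import textwrap

# Assumed stand-in for the long test string used in this post.
long_string = ("This is a really long string. It is wrapped in a way "
               "that doesn't chop up words.")

# fill() breaks only at whitespace (or after hyphens), so words stay whole.
wrapped = textwrap.fill(long_string, width=25)
lines = wrapped.split("\n")

print("-" * 25)
print(wrapped)
```

every line comes back no longer than the requested width, which is the behavior the powershell function below sets out to imitate.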


C:\> function textwrap ($longstring, $width) {
>>>     $word_list = -split $longstring
>>>     $out_strings = @("")
>>>     $i = 0
>>>

the simple way to begin is to take the long string and convert it into an array of words using the string split() method or, if you're using powershell 2, the split operator as i did here. from those words we're going to build an array of strings, each no longer than the desired width. we'll store this in $out_strings. notice we initialized the array with an empty string. that way we can write to $out_strings[0] without getting errors about non-existent array elements. we'll start indexing our array at zero with our good buddy $i.


>>>     foreach ($word in $word_list) {
>>>         if ($out_strings[$i].Length + $word.Length -gt $width) {
>>>             $i++
>>>             $out_strings += ""
>>>         }
>>>         $out_strings[$i] += $word
>>>

what we want to do is concatenate each word to the end of the current string in our array as long as that wouldn't make the string longer than our maximum width. we use the foreach loop to fetch each word from the list. in the loop we test to see if we can safely add the word to the end of the string in the current array element. once we've added as many words to the string as we can, we increment the array index and append a new empty string to the array. then we add the next word to the end of the new string.


>>>         if ($out_strings[$i].Length + 1 -le $width) {
>>>             $out_strings[$i] += " "
>>>         }
>>>     }
>>>     return ($out_strings -join "`n")
>>> }
>>>

before we leave the foreach loop we need to add a space to the end of the word we just concatenated onto our string. that is, as long as it doesn't put us above our width. the if statement above adds a space if there is room to do so. finally, we return our array of strings as a single string with the join operator. we separate them with a newline character so that the new string prints on multiple lines. given a long string stored in $ls a sample run of this function might look like this:


C:\> "-" * 25
-------------------------
C:\> textwrap $ls 25
This is a really long
string. It is wrapped in
a way that doesn't chop
up words.
C:\>
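the same greedy idea can be sketched in python to make the loop easier to follow. this is not a line-for-line port: the name greedy_wrap is made up for illustration, it joins words with single spaces instead of appending trailing spaces, it leaves out any hyphen handling, and the test sentence is an assumed stand-in for $ls.

```python
def greedy_wrap(long_string, width):
    """Wrap long_string so no line exceeds width, keeping words whole."""
    lines = []
    current = ""
    for word in long_string.split():
        # What the current line would look like with this word appended.
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            # The word fits, or the line is empty and the word is simply
            # too long to fit anywhere, so it gets a line of its own.
            current = candidate
        else:
            # The word does not fit: close this line and start a new one.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return "\n".join(lines)


sample = ("This is a really long string. It is wrapped in a way "
          "that doesn't chop up words.")
result = greedy_wrap(sample, 25)
print(result)
```

the one deliberate difference from the powershell loop is the check for an empty line, which keeps an overlong word from leaving a blank line behind it.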

the finished function appears below. i've added code so that it can even deal with hyphenated words and do the right thing. i'll let you examine that to see how it works. until next time.


# textwrap:
# expects a string and an int
# returns a string
# Intelligently wraps a long string to a maximum width.

function textwrap ($long_string, $width) {
    $word_list = -split $long_string
    $out_strings = @("")
    $i = 0

    foreach ($word in $word_list) {
        # If adding the word would make the final string too long, start a
        # new string:
        if ($out_strings[$i].Length + $word.Length -gt $width) {
            # If the word is hyphenated, try just the first part:
            if (([regex] "\w+\-\w+").IsMatch($word)) {
                if ($out_strings[$i].Length + 1 +
                        $word.Split("-")[0].Length -le $width) {
                    $out_strings[$i] += $word.Split("-")[0]
                    $out_strings[$i] += "-"
                    $word = $word.SubString($word.IndexOf("-") + 1)
                }
            }
            $i++
            $out_strings += ""
        }
        $out_strings[$i] += $word

        # Add a space if it doesn't make the string too long.
        if ($out_strings[$i].Length + 1 -le $width) {
            $out_strings[$i] += " "
        }
    }

    return ($out_strings -join "`n")
} # end function textwrap

Monday, July 12, 2010

powershell: my 3 favorite functions (1 of 2)


i have every intention of continuing our lessons in x86 assembler and nasm, but right now there is something else on my mind. let's pause our look at assembler for a little while and discuss powershell.

powershell is a very powerful scripting language for managing windows hosts. in fact, somewhere else i've gone on record saying, “if you administer a windows network, you must learn powershell.” i meant that.

powershell is very flexible and much easier to customize than dos batch files of yore. the powershell profile, reminiscent of the .profile used in the unix bourne shell, allows you to collect your personal preferences and code where they will always be available when you run a script. to that end there are three functions that, for me and my profile, are “must-haves.” i'll share them with you here.

pause

the first is the simple dos pause command. i'm not sure why powershell doesn't have such a useful command. it has proven its worth through decades of batch files. nevertheless, the powershell team has left it out. no matter. we can write our own with hardly any effort at all.

# pause:
# accepts a string
# Prints a message on the screen, then waits until a key is pressed before
# proceeding. If a string is not passed to the function the default string
# will be used. Simulates the "pause" command in DOS.

function pause ($msg = "Press any key to continue...") {
    $msg
    $Host.UI.RawUI.ReadKey("NoEcho, IncludeKeyDown") | Out-Null
} # end function pause

this function will accept a string to print as your pause message, but will default to the old dos standard “press any key” in absence of one. it uses the console's ui.rawui.readkey method to capture a raw key press. the noecho option keeps the key's character from being echoed on screen, and piping the result to the out-null cmdlet discards the key information that readkey returns so that it isn't printed either. a sample run of this function should look like this:

C:\> pause
Press any key to continue...
C:\> pause "I can't do that, Dave."
I can't do that, Dave.
C:\>

test

the second function is test, borrowed from the linux bourne-again shell, or bash. the test command takes an expression and determines whether the result is true or false. now, the bash command isn't really intended to be user-friendly: the result of the test is stored in the program's exit code. this is useful in a shell script, but i like to use test to help me debug scripts and check that my expressions are giving me the results i expect, so i've expanded it a little for powershell.

# test:
# expects an expression
# Accepts any valid expression and evaluates it for truth. Prints a message
# indicating that the expression evaluated true or false.

function test ($expr) {
    if ($expr) {
        Write-Host -ForegroundColor Green "True"
    }
    else {
        Write-Host -ForegroundColor Red "False"
    }
} # end function test

we can use this function from the interactive prompt to determine if some logical expression does what we think it does. a sample run of the function would look like this:

C:\> test (3 -gt 4)
False
C:\> test ("Bill" -is [string])
True
C:\> test (42 - 6 -lt 30 -or 56 + 9 -gt 70)
False
C:\>

now we can easily test even a complex expression to see whether it should be true or false when we use it in a script.

be sure to check back next time for the coolest of my three favorite functions.