Sunday, July 25, 2010

reading xml 1: powershell, python and ruby, oh my!


i was working on a little project that required me to read an xml file from the internet. since i like to have some useful examples to refer to when i'm doing something that i don't often do (and so don't remember how to) i decided to dig into the three scripting languages that i'm most likely to use and learn how to work with xml data.

you can grab the sample xml file here: tech_books.xml. i recommend that you look it over so you can see what the scripts are doing. we'll be writing scripts to list all the book titles and authors, then search for a specific title and fetch its isbn and its parent publisher's name attribute. all three should provide the same output, which you can see below:

Title: The Revolutionary Guide to Assembly Language
Author: Vitaly Maljugin
Author: Jacov Izrailevich
Author: A. Sopin
Author: S. Lavin
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Writing Solid Code
Subtitle: Microsoft's Techniques for Developing Bug-Free C Programs
Author: Steve Maguire
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Windows PowerShell Scripting Guide
Subtitle: Automating Administration of Windows Vista and Windows Server 2008
Author: Ed Wilson
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Practical C Programming
Author: Steve Oualline
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: Programming Python
Author: Mark Lutz
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: The Ruby Way, Second Edition
Author: Hal Fulton
- - - - - - - - - - - - - - - - - - - - - - - - -
Title: LaTeX: A Document Preparation System
Author: Leslie Lamport
- - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - -
Writing Solid Code [ISBN 1-556-15551-4] published by Microsoft Press
- - - - - - - - - - - - - - - - - - - - - - - - -

the first language i turned to is python, an old friend of mine. reading through the xml document object model api made my head swim, but with a little research i came across what used to be a third-party module, elementtree, that has now been incorporated into the core python installation. (this is one of the reasons that i love open source software.)

elementtree uses parent.findall(child) to fetch a list of all elements named child directly under parent. the parent.find(child) method returns only the first matching element, or None if there is no match. parent.findtext(child) returns the text of the first matching child. it is similar to parent.find(child).text, but when the element is not found the former returns None (or a default value that you supply), while the latter raises an AttributeError because find has returned None. element.get(attribute) returns an element's attribute value.


# xmlsearch.py

import xml.etree.ElementTree


tree = xml.etree.ElementTree.ElementTree()
tree.parse('tech_books.xml')

# List all the books by title and author from the XML data.
books = tree.findall('publisher/book')
for book in books:
    print("Title: %s" % book.findtext('title'))
    # Not every book has a subtitle, so only print it when present.
    if book.findtext('subtitle'):
        print("Subtitle: %s" % book.findtext('subtitle'))
    authors = book.findall('author')
    for author in authors:
        print("Author: %s" % author.text)
    print('- ' * 25)

# Get the ISBN and publisher for a book called "Writing Solid Code."
print('- ' * 25)
search = 'Writing Solid Code'
publishers = tree.findall('publisher')
for publisher in publishers:
    books = publisher.findall('book')
    for book in books:
        if book.findtext('title') == search:
            print("%s [ISBN %s] published by %s" % (search,
                book.findtext('isbn'), publisher.get('name')))
print('- ' * 25)
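
to make the difference between findtext and find(...).text concrete, here is a minimal sketch. it parses a tiny stand-in document from a string rather than tech_books.xml. the library root element name is an assumption made for illustration, since the script above only tells us about the publisher/book layout underneath the root.

```python
import xml.etree.ElementTree as ET

# Stand-in for tech_books.xml. The <library> root name is an assumption;
# the publisher/book/title/isbn layout follows the script above.
SAMPLE = """
<library>
  <publisher name="Microsoft Press">
    <book>
      <title>Writing Solid Code</title>
      <isbn>1-556-15551-4</isbn>
    </book>
  </publisher>
</library>
"""

root = ET.fromstring(SAMPLE)
book = root.find('publisher/book')

# findtext returns the child's text, or a default (None unless one is given).
title = book.findtext('title')
missing = book.findtext('subtitle')
fallback = book.findtext('subtitle', default='(none)')

# find returns None for a missing child, so reading .text raises AttributeError.
try:
    book.find('subtitle').text
    raised = False
except AttributeError:
    raised = True

# get reads an attribute value from an element.
publisher_name = root.find('publisher').get('name')

print(title, missing, fallback, raised, publisher_name)
```

this is why the script tests book.findtext('subtitle') before printing: a missing subtitle simply comes back as a false value instead of blowing up.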

next i investigated xml under powershell. i was certain that it would be no more difficult than python, and what i found was that basic xml searches worked in a very similar way.

powershell has xml processing built in, so no modules have to be included in the source file to make it work. it uses parent.SelectNodes(xpath) to get a list of all nodes under the node parent that match an xpath expression. the parent.SelectSingleNode(xpath) method returns only the first matching node. note the plural SelectNodes and the singular SelectSingleNode. element.GetAttribute(attribute) returns a node's attribute value.


# xmlsearch.ps1

[xml] $tree = Get-Content tech_books.xml

# List all the books by title and author from the XML data.
$books = $tree.SelectNodes('//publisher//book')
foreach ($book in $books) {
    "Title: {0}" -f $book.title
    if ($book.subtitle) {
        "Subtitle: {0}" -f $book.subtitle
    }
    $authors = $book.SelectNodes('author')
    foreach ($author in $authors) {
        "Author: {0}" -f $author.'#text'
    }
    '- ' * 25
}

# Get the ISBN and publisher for a book called "Writing Solid Code."
'- ' * 25
$search = 'Writing Solid Code'
$publishers = $tree.SelectNodes('//publisher')
foreach ($publisher in $publishers) {
    $books = $publisher.SelectNodes('book')
    foreach ($book in $books) {
        if ($book.title -eq $search) {
            "{0} [ISBN {1}] published by {2}" -f $search, $book.isbn,
                $publisher.GetAttribute('name')
        }
    }
}
'- ' * 25
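
the powershell script leans on xpath expressions such as //publisher//book. for comparison, python's elementtree understands a limited subset of xpath, including the .//tag form and simple [child='text'] predicates. the sketch below uses that subset to let the library do the title match. the library root element, the second publisher, and the find_book helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in data. The <library> root and the "Placeholder" entries are
# assumptions, not taken from tech_books.xml.
SAMPLE = """
<library>
  <publisher name="Microsoft Press">
    <book><title>Writing Solid Code</title><isbn>1-556-15551-4</isbn></book>
  </publisher>
  <publisher name="Placeholder Press">
    <book><title>Placeholder Title</title><isbn>0-000-00000-0</isbn></book>
  </publisher>
</library>
"""

root = ET.fromstring(SAMPLE)


def find_book(root, search):
    """Return (publisher name, isbn) for the first book titled `search`."""
    for publisher in root.findall('publisher'):
        # [title='...'] keeps only <book> elements whose <title> text matches.
        book = publisher.find("book[title='%s']" % search)
        if book is not None:
            return publisher.get('name'), book.findtext('isbn')
    return None


# .//book finds <book> elements at any depth, much like //book in full XPath.
all_books = root.findall('.//book')

print(len(all_books), find_book(root, 'Writing Solid Code'))
```

one caveat: the predicate is built by plain string formatting, so a title that contains a single quote would need different handling.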

finally i dusted off ruby, a language that i haven't really touched in more than a year, to see what it would do. again, i wasn't impressed with the general xml documentation, but google searches led me to a ruby gem called nokogiri. i installed it and was very pleased with the result.

nokogiri uses parent.xpath(child) to get a set of child elements (a NodeSet, which behaves much like an array), parent.at_xpath(child) to get a single element, and element.attr(attribute) to get a node's attribute value. one nifty property that python's elementtree lacks is element.parent, which returns the parent node of the current element. (powershell's xml nodes do offer an equivalent ParentNode property, although the script above doesn't use it.) it saves us from having to define multiple "books" arrays in this example.


# xmlsearch.rb

require 'rubygems'
require 'nokogiri'


# Read the whole file and hand the XML text to Nokogiri.
tree = Nokogiri::XML(File.read('tech_books.xml'))

# List all the books by title and author from the XML data.
books = tree.xpath('//publisher//book')
books.each do |book|
  puts "Title: %s" % book.at_xpath('title').content
  if book.at_xpath('subtitle')
    puts "Subtitle: %s" % book.at_xpath('subtitle').content
  end
  authors = book.xpath('author')
  authors.each do |author|
    puts "Author: %s" % author.content
  end
  puts '- ' * 25
end

# Get the ISBN and publisher for a book called "Writing Solid Code."
puts '- ' * 25
search = 'Writing Solid Code'
books.each do |book|
  if book.at_xpath('title').content == search
    puts "%s [ISBN %s] published by %s" % [search,
      book.at_xpath('isbn').content, book.parent.attr('name')]
  end
end
puts '- ' * 25
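
elementtree elements don't carry a reference to their parent, but a child-to-parent dictionary gives the python version much the same convenience as nokogiri's element.parent. this is a small sketch of that idea, not a change to the script above. the library root element and the describe helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for tech_books.xml; the <library> root name is an assumption.
SAMPLE = """
<library>
  <publisher name="Microsoft Press">
    <book>
      <title>Writing Solid Code</title>
      <isbn>1-556-15551-4</isbn>
    </book>
  </publisher>
</library>
"""

root = ET.fromstring(SAMPLE)

# Map every element to its parent. The root has no parent, so it has no entry.
parent_of = {child: parent for parent in root.iter() for child in parent}


def describe(search):
    """Return the summary line for the book titled `search`, or None."""
    for book in root.iter('book'):
        if book.findtext('title') == search:
            publisher = parent_of[book]
            return "%s [ISBN %s] published by %s" % (
                search, book.findtext('isbn'), publisher.get('name'))
    return None


print(describe('Writing Solid Code'))
```

with the map in hand, a single loop over the books is enough, just as in the ruby script.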

so after doing a little research i found that it's not really too difficult to read xml in my favorite scripting languages, and now i have some nice little examples to look at the next time i need a refresher. until next time.
