Taking Applications to the Next Level with XML, Part 3: The Toolbox of XML APIs



Last Modified On :   September 17, 2008 4:40 PM PDT


Introduction

So far in this series I have emphasized how one of the great strengths of XML lies in how richly it adds to the developer's toolbox. Now we move to a closer look at many of these tools. One of the most important areas is in Application Programming Interfaces (APIs) that bind XML technologies to other programming and run-time environments. In this article, we shall look closely at the two preeminent APIs for XML: Simple API for XML (SAX) and Document Object Model (DOM). They represent two strongly contrasting processing models for XML, and as such have quite complementary sets of advantages and disadvantages. A basic understanding of XML is required, as well as familiarity with some object-oriented programming language.

Prospecting the Markup Stream

SAX is a very interesting species. It was essentially created on a marathon thread on the XML-DEV mailing list, which has long been the prime habitat for XML experts. David Megginson led the discussion and the result was one of the most successful XML initiatives, with no large company or standards-body sponsorship.

SAX is an event-driven API. That is, the developer registers handler code for specific events triggered by different parts of XML mark-up (e.g. start and end tags, text, entities). The parser then sends a stream of these events based on the input XML, which the handler code processes in turn. Figure 1 is an illustration of this process.

 

The XML parser engine is tied to the SAX system through an appropriate driver, which causes SAX events to be fired in a stream as the parsing progresses. The developer will typically register handlers to capture these events and take appropriate action. This style of processing might be familiar to you if you have done user interface programming using popular systems such as Microsoft* Foundation Classes or many XWindows* APIs. If you are not, however, familiar with this style, such event-based processing might represent a somewhat novel way of thinking. Let us proceed with an example.

Listing 1 is a small SAX program that draws a crude graph of the tree structure of an XML document. It is written in Python*, so users of almost any OO language should be able to follow along.

# This is a special form of string literal that can span multiple lines
xml_source = """<?xml version='1.0'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the age's works.
</body>
</memo>
"""

from xml import sax

# We subclass from ContentHandler in order to gain default behaviors
class TreePrintHandler(sax.ContentHandler):
    # A class that handles XML events and draws a tree from the document
    # structure

    def __init__(self):
        # In Python, self is conventionally used as a name of the value
        # that represents the instance on which the method is being invoked.
        # This means that you handle instance variables as attributes of self
        sax.ContentHandler.__init__(self)
        self.depth = 0
        self.increment = " "

    def startDocument(self):
        print("--- [Document]")
        self.depth = self.depth + 1

    def startElement(self, name, attributes):
        # In Python, multiplying a string by an integer repeats the string
        # that many times
        print(self.increment*self.depth + "+-- [Element <" + name + ">]")
        self.depth = self.depth + 1

    def endElement(self, name):
        self.depth = self.depth - 1

    def characters(self, text):
        # Only print out any information on this string
        # if it has non-space characters
        if not text.isspace():
            print(self.increment*self.depth + "+-- [Text <" + text + ">]")

# A little Python magic that allows this program to run as a stand-alone
# script
if __name__ == '__main__':
    # Create an instance of our handler class, which will be registered
    # to receive SAX events
    handler = TreePrintHandler()
    # Pass a string to be parsed, and pass the handler to be registered
    # to receive SAX events. parseString creates a parser instance for us,
    # using the default XML engine SAX provides
    sax.parseString(xml_source, handler)
    # At this point, the parser has completed processing, and all events
    # have been dispatched. We're done.

 

Listing 1

In object-oriented languages, event-based interfaces are usually implemented by registering objects whose classes have special methods for each event. We define such a class, TreePrintHandler. If you compare with the event names in Figure 1, you will see that the methods in this class are similar to the event names. This is no coincidence: the SAX engine works by calling these particular methods whenever the corresponding event is dispatched. We derive TreePrintHandler from a built-in class, sax.ContentHandler, which provides a default in case we don't define a method to handle a particular event. Note that we don't implement every SAX event. We don't need to do anything special for the "end document" event, for example.
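To make the point about default behaviors concrete, here is a minimal sketch, separate from Listing 1. The class name EventLogger and the tiny input document are made up for illustration. The handler defines only two event methods and leaves every other event to the do-nothing versions inherited from sax.ContentHandler.

```python
from xml import sax

class EventLogger(sax.ContentHandler):
    # Hypothetical handler: records the name of each event it handles
    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attributes):
        self.events.append("startElement:" + name)

    def endElement(self, name):
        self.events.append("endElement:" + name)

    # startDocument, endDocument and characters are deliberately not
    # defined here, so the inherited do-nothing versions are called.

logger = EventLogger()
sax.parseString(b"<memo><to>The Art World</to></memo>", logger)
for event in logger.events:
    print(event)
# Prints:
# startElement:memo
# startElement:to
# endElement:to
# endElement:memo
```

The text "The Art World" still triggers a characters event; it simply lands in the inherited method and is ignored.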

Very often in SAX programming, you need to keep track of where you are in the flow, or of values that will be needed at some later point. Managing such details is known as managing state. The information comes strictly in the order that the XML parser follows, which is not always the natural order for processing, so the SAX developer must be familiar with state management. Luckily, since we are using a class for event handling, this is not so difficult: you define instance variables that keep track of state. In the example program, the state to be managed is the depth of indentation reached.
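As a further illustration of state management, the sketch below pulls the text of one element out of the event stream. The class name TitleGrabber and its instance variables are assumptions made for this example, not part of SAX. A flag records whether the parser is currently inside a title element, and a list accumulates text, because a parser is free to deliver one run of text as several characters events.

```python
from xml import sax

class TitleGrabber(sax.ContentHandler):
    # Hypothetical handler: remembers the text of the <title> element
    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.in_title = False   # state: are we inside <title> right now?
        self.chunks = []        # state: text pieces seen so far
        self.title = None

    def startElement(self, name, attributes):
        if name == "title":
            self.in_title = True
            self.chunks = []

    def characters(self, text):
        # Text may arrive in several pieces, so collect rather than assign
        if self.in_title:
            self.chunks.append(text)

    def endElement(self, name):
        if name == "title":
            self.in_title = False
            self.title = "".join(self.chunks)

doc = b"<memo><title>Usura &amp; Stone</title><to>The Art World</to></memo>"
grabber = TitleGrabber()
sax.parseString(doc, grabber)
print(grabber.title)   # prints: Usura & Stone
```

Without the flag, the text of the to element would be collected as well, since the characters event itself carries no information about which element the text belongs to.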

The following is an example of running this program using Python, assuming you've copied Listing 1 to a file named listing1.py:

> python listing1.py
--- [Document]
 +-- [Element <memo>]
  +-- [Element <title>]
   +-- [Text <With Usura Hath no Man a House of Good Stone>]
  +-- [Element <date>]
   +-- [Text <1936-04-03>]
  +-- [Element <to>]
   +-- [Text <The Art World>]
  +-- [Element <body>]
   +-- [Text <It has come to our attention that the basis for art production>]
   +-- [Text <Has shifted from keen patronage to vulgar commercial measure.>]
   +-- [Text <Management is concerned this will erode the lasting value of the age's works.>]

 

Don't Treat me like an Object just because I'm a Model

The SAX example shows how an object-oriented language is used to deal with event-based processing. However, the W3C also developed an object model for XML documents that can be used more directly. The Document Object Model (DOM) is the result of this effort. DOM is well known to Web developers as the way to manipulate forms and the like in an HTML Web browser. The XML aspect of DOM is much the same as the HTML aspect. The basic idea is that a document is decomposed into a tree model, where each component of the XML syntax is represented by a node. For instance, the following XML document can be represented by the tree illustrated in Figure 2.

<?xml version = "1.0"?>
<ADDRBOOK>
  <ENTRY ID="pa">
    <NAME>Pieter Aaron</NAME>
    <ADDRESS>404 Error Way</ADDRESS>
    <PHONENUM DESC="Work">404-555-1234</PHONENUM>
    <PHONENUM DESC="Fax">404-555-4321</PHONENUM>
    <EMAIL>paaron@inter.net</EMAIL>
  </ENTRY>
  <ENTRY ID="en">
    <NAME>Emeka Ndubuisi</NAME>
    <ADDRESS>42 Spam Blvd</ADDRESS>
    <PHONENUM DESC="Work">767-555-7676</PHONENUM>
    <PHONENUM DESC="Home">767-555-5555</PHONENUM>
    <EMAIL>endu@spamtron.com</EMAIL>
  </ENTRY>
</ADDRBOOK>

Figure 2

Each part of the document is a separate node in the tree. Notice the root node, labeled ("/"). This is a node that in effect represents the document as a whole. It has one element child, known as the document element (ADDRBOOK in our case). Elements can have other elements as children, as well as text nodes. Attributes are not quite considered children of their elements, so I represent them using a dashed line to what is known as their owner element. Attribute names are marked with an @ sign.

Imagine an API that allows you to navigate this tree, moving from parent to child node, to siblings, and other steps, taking advantage of special properties of certain types of nodes (for instance, elements can have attributes and text nodes have the text data). If you can imagine this, then you already have a basic handle on the DOM.
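Here is a rough sketch of that kind of navigation, using the minidom module from the Python standard library on a cut-down version of the address book. The variable names are made up for this example. The document is written without whitespace between tags, because such whitespace would otherwise appear in the tree as extra text nodes.

```python
from xml.dom.minidom import parseString

doc = parseString(
    '<ADDRBOOK>'
    '<ENTRY ID="pa"><NAME>Pieter Aaron</NAME></ENTRY>'
    '<ENTRY ID="en"><NAME>Emeka Ndubuisi</NAME></ENTRY>'
    '</ADDRBOOK>'
)

book = doc.documentElement               # the document element
first_entry = book.firstChild            # parent to child
second_entry = first_entry.nextSibling   # sibling to sibling
name = second_entry.firstChild           # down one more level

print(name.firstChild.data)                  # prints: Emeka Ndubuisi
print(name.parentNode.tagName)               # prints: ENTRY
print(second_entry.attributes["ID"].value)   # prints: en
```

Each step uses a property that every node has (firstChild, nextSibling, parentNode) or one that only certain node types have (data on text nodes, attributes on elements).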

The DOM, like SAX, is designed to be language-neutral. In the case of the DOM, the generic interface definition language (IDL) standardized by the Object Management Group (OMG) is used to express the tree node interfaces. For instance, the interface for retrieving a particular attribute from an element is defined as follows:

DOMString getAttribute(in DOMString name); 

This uses the special type DOMString to represent a string based on the rules for XML strings. This is typically translated into a method of the same name in the implementation language.
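As one illustration of such a translation, Python's minidom exposes this interface as a method called getAttribute that takes and returns ordinary Python strings. The sketch below uses a single element from the address book as input.

```python
from xml.dom.minidom import parseString

doc = parseString('<PHONENUM DESC="Work">404-555-1234</PHONENUM>')
phone = doc.documentElement

print(phone.getAttribute("DESC"))           # prints: Work
# Asking for an attribute that is not present gives an empty string
print(repr(phone.getAttribute("MISSING")))  # prints: ''
# hasAttribute tells a missing attribute apart from an empty one
print(phone.hasAttribute("MISSING"))        # prints: False
```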

Again, an example should help illustrate. Listing 2 is a program that navigates and mutates an XML document using DOM.

xml_source = "<top><outer><inner>center</inner></outer></top>"

def ReadDom():
    # This function uses proprietary routines to create a DOM tree from
    # XML text, since the standard API for this is still in development
    from xml.dom.minidom import parseString
    tree = parseString(xml_source)
    return tree

def Invert():
    tree = ReadDom()
    top_element = tree.documentElement
    # Simple traversal from parent to child
    outer_element = top_element.firstChild
    inner_element = outer_element.firstChild
    # Get all the children of the outer element (only one in this case);
    # indexing the list reaches the same node as firstChild above
    inner_element = outer_element.childNodes[0]
    # Swap outer and inner elements
    top_element.removeChild(outer_element)
    outer_element.removeChild(inner_element)
    # Move all the children from inner to outer. We loop over a copy of
    # the list, because appendChild removes each node from inner_element
    for node in list(inner_element.childNodes):
        outer_element.appendChild(node)
    top_element.appendChild(inner_element)
    inner_element.appendChild(outer_element)
    # Now print out the result of mutating the tree
    # Note: toxml() is also a proprietary method soon to be replaced
    # with a standard form in DOM Level 3
    print("Modified XML")
    print(top_element.toxml())
    # Just for kicks, navigate through again and print out the
    # sole text node
    inner_element = top_element.firstChild
    outer_element = inner_element.firstChild
    center_text = outer_element.firstChild
    print("Center text:")
    print(center_text.data)

if __name__ == '__main__':
    Invert()

 

Listing 2

Notice that some very important functions, such as converting XML source into a tree ready for DOM processing, and converting such a tree back to text, are not yet covered in the DOM standard. DOM is evolving through several "levels," each of which builds on the prior one. Level 1 covered the basics; Level 2 added namespace support, iterators, and so forth. Namespaces are a way to avoid clashes in XML element and attribute names; I'll discuss them in more detail in the next article in this series. Iterators are mechanisms for walking over the DOM tree, invoking some action for each node in turn. DOM Level 3 adds loading and saving trees, and other tools, but is still in development as of my writing. Be sure that the DOM library you choose has the support for XML features you need.
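To give a feel for what walking over the tree means, here is a small sketch using Python's minidom. It is not the DOM Level 2 iterator interface itself, which minidom does not provide; the walk function is a plain Python generator written for this example, and it visits a node and then each of its descendants in document order.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def walk(node):
    # Hypothetical helper: yield this node, then everything below it
    yield node
    for child in node.childNodes:
        for descendant in walk(child):
            yield descendant

doc = parseString("<top><outer><inner>center</inner></outer></top>")

visited = []
for node in walk(doc.documentElement):
    # The action taken for each node: record a short description of it
    if node.nodeType == Node.ELEMENT_NODE:
        visited.append("element: " + node.tagName)
    elif node.nodeType == Node.TEXT_NODE:
        visited.append("text: " + node.data)

for line in visited:
    print(line)
# Prints:
# element: top
# element: outer
# element: inner
# text: center
```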

As you can see from the example, DOM trees can be manipulated and not just read. One must do this manipulation in small steps, though--by dissociating parents from children and attaching them to other parents. Using the DOM, you can also store arbitrary pointers to any nodes in the tree, and quickly jump to exactly the part you wish to process. If you copy Listing 2 to a file named listing2.py, you can try it out as follows:

> python listing2.py 
Modified XML
<top><inner><outer>center</outer></inner></top>
Center text:
center


Conclusion

DOM and SAX are quite different approaches to XML processing, and both are valuable tools to have around. In deciding when to use one or the other, there are several factors to consider.

  • For many people, DOM is easier to understand than SAX at first, because it involves explicit, step-by-step navigation of a tree, a structure that can be readily grasped.
  • DOM, however, generally keeps all XML nodes in memory, which can be very inefficient for larger documents.
  • State management in SAX can be very complex to set up. If your processing doesn't flow naturally with the order of the document, it can be hard work.
  • SAX tends to use much less memory because only the current bits of the event stream need be instantiated.
  • Once you have dealt with any initial difficulties setting up SAX handlers, they tend to be more robust.
It is not unusual to use both in processing, for instance, in skimming over a large document using SAX and then building a small DOM tree for the portion of the document one wishes to process. Whether you need speed of development or speed of execution, regardless of what processing patterns you employ for each particular XML task, one or the other is probably the key to getting things done among the tags.
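One way to combine the two in Python is the pulldom module in the standard library, which reads a document as a stream of events driven by a SAX parser and builds a DOM subtree only when asked. The sketch below is a minimal example under that assumption, not a recommendation of a particular design. It skims a cut-down address book and expands just the entry whose ID is "en"; the variable names are made up for illustration.

```python
from xml.dom import pulldom

doc = ('<ADDRBOOK>'
       '<ENTRY ID="pa"><NAME>Pieter Aaron</NAME></ENTRY>'
       '<ENTRY ID="en"><NAME>Emeka Ndubuisi</NAME></ENTRY>'
       '</ADDRBOOK>')

wanted = None
events = pulldom.parseString(doc)
for event, node in events:
    # Skim: nothing is kept for entries we are not interested in
    if (event == pulldom.START_ELEMENT and node.tagName == "ENTRY"
            and node.getAttribute("ID") == "en"):
        # Build a small DOM tree for this one element and its children
        events.expandNode(node)
        wanted = node
        break

# From here on, ordinary DOM navigation works on the expanded node
name = wanted.getElementsByTagName("NAME")[0]
print(name.firstChild.data)   # prints: Emeka Ndubuisi
```

The first entry is read and discarded as events, so only the second entry ever exists as a tree in memory.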

 

Resources

The XML-DEV mailing list home page:
http://www.xml.org/xml/xmldev.shtml*

Simple API for XML (SAX) project home page:
http://www.saxproject.org/*

The W3C's home page for DOM:
http://www.w3.org/DOM/*

Read other articles in this series: Part 1, Part 2, Part 4, Part 5




