Taking the Document Object Model (DOM) Approach to XML Documents

By Barry Burd

The difference between SAX and DOM is the difference between linear thinking and holistic thinking.

SAX (Simple API for XML) treats an XML document linearly, working through a document piece by piece, from beginning to end. But with DOM (the Document Object Model), you jump in and look at the whole document. A bit later, you zoom in on the root element, and then focus more closely on an element within the root element. In some situations, jumping in is exactly what you need to do.
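
To see the two styles side by side, here is a minimal sketch in Python, using the standard library’s xml.sax and xml.dom.minidom modules. The choice of Python and the tiny sample document are assumptions made for illustration only. The SAX half records events in the order the parser reports them; the DOM half loads the whole document and then zooms in one level at a time.

```python
import xml.sax
from xml.dom import minidom

# A small sample document, made up for this sketch (no whitespace between tags).
SAMPLE = b"<Club><Member><Standing>Founder</Standing></Member></Club>"


class EventLogger(xml.sax.ContentHandler):
    """Records SAX events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)


# SAX: the parser works through the document from beginning to end
# and calls the handler for each piece it meets.
logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
print(logger.events)

# DOM: the whole document is loaded first, then you zoom in step by step.
document = minidom.parseString(SAMPLE)
root = document.documentElement   # the root element
member = root.firstChild          # an element within the root element
standing = member.firstChild      # an element within that one
print(root.tagName, member.tagName, standing.tagName)
```

With SAX, your code reacts to whatever the parser reaches next. With DOM, your code decides where to look.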

DOM nodes

With DOM, you think of an XML document as having several nodes. Examples of nodes include elements, attributes, comments, and the characters between a pair of start and end tags. An entire XML document is itself a node. All in all, an XML document can have 12 different types of nodes.
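
If you want to see the node types for yourself, the following sketch lists them using Python’s standard xml.dom package, which is one DOM implementation among many. The one-line sample document is made up for this example; it contains five of the twelve node types.

```python
from xml.dom import Node, minidom

# The node-type constants that Python's xml.dom.Node class defines, keyed by name.
NODE_TYPES = {
    name: value
    for name, value in vars(Node).items()
    if name.endswith("_NODE")
}
for name, value in sorted(NODE_TYPES.items(), key=lambda item: item[1]):
    print(value, name)

# A made-up document: a comment, an element, an attribute, and some text.
document = minidom.parseString(
    b'<!--note--><Member firstname="Herbert">Hi</Member>'
)
comment = document.firstChild
member = document.documentElement
attribute = member.getAttributeNode("firstname")
text = member.firstChild

# Every one of these objects is a node, and each reports its own type.
for node in (document, comment, member, attribute, text):
    print(node.nodeName, node.nodeType)
```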

The nodes of a tree

Taken together, all the DOM nodes in an XML document form a tree. Take, for instance, the document in Listing 1. This document’s DOM tree looks like the tree shown in Figure 1.

Listing 1: The Anchovy Lovers Club

<?xml version="1.0" encoding="UTF-8"?>
<!--AnchovyLoversClub.xml -->
<AnchovyLoversClub>
   <Member firstname="Herbert">
      <Standing>Founder, President, Secretary, Publicity Manager</Standing>
   </Member>
</AnchovyLoversClub>

The tree has eleven nodes. To count them, start by counting the tree’s branches (conveniently displayed in Figure 1). Then count the Member firstname="Herbert" branch a second time. (This branch has two DOM nodes on it. The element named Member is a node, and the element’s attribute firstname="Herbert" is a node.)

Figure 1: A tree representing the document in Listing 1.
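
You can check the count of eleven with a few lines of code. The sketch below uses Python’s xml.dom.minidom, and the XML string is a reconstruction of Listing 1 based on the element names and whitespace described in this article, so treat its exact layout as an assumption. The count_nodes function is a hypothetical helper written for this example.

```python
from xml.dom import Node, minidom

# Listing 1 as described in the text; the exact layout is an assumption.
LISTING_1 = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!--AnchovyLoversClub.xml -->\n'
    '<AnchovyLoversClub>\n'
    '   <Member firstname="Herbert">\n'
    '      <Standing>Founder, President, Secretary, Publicity Manager</Standing>\n'
    '   </Member>\n'
    '</AnchovyLoversClub>\n'
).encode("utf-8")


def count_nodes(node):
    """Count this node, its attributes, and everything beneath it."""
    total = 1
    if node.nodeType == Node.ELEMENT_NODE:
        total += len(node.attributes)  # each attribute is a node too
    for child in node.childNodes:
        total += count_nodes(child)
    return total


document = minidom.parseString(LISTING_1)
node_count = count_nodes(document)
print(node_count)
```

Under these assumptions the script should report eleven: the document, the comment, three elements, one attribute, and five pieces of text (four of them nothing but whitespace).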

A treatise on trees

There are a few things you’ll discover by staring at the tree in Figure 1.

  • Some nodes are children of other nodes.
    For example, the Member node is a child of the AnchovyLoversClub node. That’s because, in Listing 1, the Member element is nested inside the AnchovyLoversClub element.
    In a similar way, the Standing node is a child of the Member node. This family analogy goes on and on. The Member node is the parent of the Standing node, and the AnchovyLoversClub node is the parent of the Member node.
  • The entire document is a node.
    This is an important point, and it’s easy to forget. In Listing 1, the document’s root element is AnchovyLoversClub. But in Figure 1, the name AnchovyLoversClub isn’t at the top of the tree. Instead, the word #document is at the top of the tree.
    A DOM tree’s topmost node represents an entire XML document. Errors occur when programmers think that the document’s root element starts the tree. (It doesn’t.)
  • Comments and pieces of text are nodes.
    In Figure 1, the comment <!--AnchovyLoversClub.xml --> is a child node of the document node. That’s because, in Listing 1, the comment is part of the document. The comment isn’t nested inside any of the document’s elements.
    Once again, we play genealogy. We say that the #document node has two children — a comment node and an AnchovyLoversClub node. These two nodes — the comment and the AnchovyLoversClub — are called siblings.
    Also in Figure 1, the text Founder, President, Secretary, Publicity Manager is part of a node. In Listing 1, the text Founder, President, Secretary, Publicity Manager is inside the Standing element. So, in Figure 1, this text node is a child of the Standing node.
  • Even ignorable text is part of a node.
    According to Figure 1, the AnchovyLoversClub node has three direct child nodes: two nodes labeled #text, and another node labeled Member. That’s because, as far as DOM is concerned, the AnchovyLoversClub node has three things in it:
      • A carriage return and three blanks
      • The Member element
      • A carriage return
    The situation is illustrated in Figure 2.
Figure 2: Two text nodes in Listing 1.
    The three children of the Member node — two pieces of whitespace and one Standing element — are all siblings.
    Now, notice the dots and the [cr] in Figures 1 and 2. In the tree diagram, a dot represents a blank space, and [cr] represents a carriage return. With DOM, all the ignorable whitespace between the AnchovyLoversClub start tag and the Member start tag forms a node. Starting with the angle bracket that terminates the AnchovyLoversClub start tag, you go to the next line, and then you have three blank spaces before the angle bracket that opens the Member start tag. All that stuff is a DOM node.
  • End tags aren’t nodes.
    With SAX, you may be thinking in terms of starting the Member element, and later ending the Member element. In DOM, you don’t think this way. Instead, you visit the Member element just once. Within that visit, you visit the Standing element and some text. DOM has no method corresponding to the SAX endElement method.
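
The observations in this list can be checked in code. The following sketch uses Python’s xml.dom.minidom to print an outline of the tree and then tests a few of the family relationships. As before, the XML string is a reconstruction of Listing 1 and its exact layout is an assumption; the outline function and the @ prefix for attributes are hypothetical conventions invented for this example.

```python
from xml.dom import Node, minidom

# Listing 1 as described in the text; the exact layout is an assumption.
LISTING_1 = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!--AnchovyLoversClub.xml -->\n'
    '<AnchovyLoversClub>\n'
    '   <Member firstname="Herbert">\n'
    '      <Standing>Founder, President, Secretary, Publicity Manager</Standing>\n'
    '   </Member>\n'
    '</AnchovyLoversClub>\n'
).encode("utf-8")


def outline(node, depth=0, lines=None):
    """Visit each node exactly once and describe it on one indented line."""
    if lines is None:
        lines = []
    label = node.nodeName
    if node.nodeType in (Node.TEXT_NODE, Node.COMMENT_NODE):
        label += " " + repr(node.data)  # repr makes whitespace visible
    lines.append("   " * depth + label)
    if node.nodeType == Node.ELEMENT_NODE:
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            lines.append("   " * (depth + 1) + "@" + attr.name + "=" + repr(attr.value))
    for child in node.childNodes:
        outline(child, depth + 1, lines)  # no separate "end" step is needed
    return lines


document = minidom.parseString(LISTING_1)
tree_lines = outline(document)
print("\n".join(tree_lines))

root = document.documentElement
member = root.getElementsByTagName("Member")[0]
standing = member.getElementsByTagName("Standing")[0]

print(document.nodeName)                        # the top of the tree
print(member.parentNode is root)                # Member is a child of the root
print(standing.parentNode is member)            # Standing is a child of Member
print(document.firstChild.nextSibling is root)  # the comment and the root are siblings
print([child.nodeName for child in root.childNodes])
```

Notice that outline handles each element in a single call. Everything inside the element is dealt with during that one visit, which is why the code has nothing resembling an end-tag handler.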