About the Document Processing System

This document contains notes about the PIA's Document Processing System (DPS), as implemented in the Java module org.risource.dps.

Philosophy

The DPS has several ways of looking at a document:

Perhaps the simplest way to view an active document is as a program that expands into a character stream. The easy way to implement this would be to use the toString method on Node. Active nodes would simply expand appropriately by overriding the toString method, and everything would be cool.

There are several problems with this:

  1. In some cases the document is represented as a stream (file), and we want to expand it ``on the fly''.
  2. In other cases, we want the result to be another parse tree.
  3. In yet other cases, we want the result to be something else altogether, for example nothing (evaluation for side effects), a bitmap (rendering on the screen or for print), or a totally transformed document in a file.
  4. Sometimes we want to do all three: expand into a stream, modify the original parse tree itself, and store the modified parse tree back into the original file.
  5. What's worse, we want to be able to process the same document in different ways (e.g. expanding it or pretty-printing it), using different definitions for its tags.

As a result, we prefer to view an active document as an ActiveDocument object composed of Nodes, some of which implement the ActiveNode interface. Active nodes simply expand into more nodes. The classes whose names start with ``ParseTree'' in org.risource.dps.active implement the ActiveNode interface.

The efficient way to accomplish this kind of processing is to use a parse tree traverser as an Input, and a constructor of some sort for an Output. This avoids actually constructing nodes or nodelists until they are really needed. (Attribute lists, in particular, will usually have to be constructed, but the attributes themselves can be shared so the overhead is small. Nodes in child lists cannot be shared, but we use traversal to avoid constructing them in most cases.)

The default action for a node is to copy it (recursively, if necessary) from the current Input to the current Output. Active nodes have an action routine that performs any necessary computation, and puts the results on the Output.

Local Transformations

Active nodes in the DPS all have the characteristic that they implement local functions: at most they will be replaced by a NodeList.

If it can be determined at parse time that a node contains no children with side effects, all the children of that node can be expanded in parallel.

In any case, a document in which all active nodes are local can be processed in a single depth-first, left-to-right traversal.

Implementation:

Processing Overview

The fundamental implementation is to traverse an active document (ParseTree) using a kind of iterator called an Input. An Input is basically a specialized version of the DOM's TreeIterator, with a few differences. (Note that TreeIterator has been dropped from recent drafts of DOM Level 1, but it's still a good idea.)

Similarly, the output is constructed by means of an Output. This combines traversal with construction, so that it is possible to add a node to the document under construction, then go on to add its children.

This means that Output needs startNode and endNode as well as putNode (which adds a complete subtree).
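
As a hedged illustration, the three construction operations might be declared like this in Java (the names follow the text, but the actual org.risource.dps.Output interface is richer; a node here is reduced to a tag name):

```java
public class OutputDemo {
    // Illustrative sketch only: the real org.risource.dps.Output differs
    // in detail. A "node" here is just a tag name.
    interface Output {
        void startNode(String name); // open a node; its children will follow
        void endNode(String name);   // close the matching startNode
        void putNode(String name);   // add a complete (childless) subtree
    }

    // A trivial Output that serializes the construction calls as markup
    // text rather than building a tree.
    static class TextOutput implements Output {
        final StringBuilder sb = new StringBuilder();
        public void startNode(String name) { sb.append('<').append(name).append('>'); }
        public void endNode(String name)   { sb.append("</").append(name).append('>'); }
        public void putNode(String name)   { sb.append('<').append(name).append("/>"); }
    }

    public static void main(String[] args) {
        TextOutput out = new TextOutput();
        out.startNode("p");
        out.putNode("img");
        out.endNode("p");
        System.out.println(out.sb); // <p><img/></p>
    }
}
```

Note that the same call sequence could just as well drive a tree constructor; only the Output implementation changes.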

A Parser is simply a specialized Input that implements the standard operations by parsing an input stream. It does not have to build an actual tree. Similarly, an Output need not be a tree constructor; it might simply output a text representation of the node to an output stream.

An action routine will normally obtain its ``arguments'' (i.e. the attributes and children of the node that invoked it) by way of a suitable Output. This could then be returned to the action routine's calling context as a continuation. In any case, the Output is strongly typed, so the action routine can obtain the relevant objects (e.g. ActiveElement, NodeList, etc.) knowing their exact types.

Algorithms:

The basic main loop, without evaluation, is:

copyNodes(input, output)

  1. input.hasChildren?
    yes:
    output.startNode
    1. input.toFirstChild
    2. copyNodes(input, output)
    3. input.toParent
    output.endNode
    no:
    output.putNode
  2. input.toNext (advance to a new node)
  3. repeat until no more siblings

This can be done iteratively by just keeping track of nesting depth, although it's probably simpler to do it recursively.
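
The loop above can be sketched as a small runnable Java program, with toy Node, Input, and Output classes standing in for the real org.risource.dps types (the real interfaces differ in detail):

```java
import java.util.*;

// A minimal sketch of copyNodes; names and signatures are illustrative.
public class CopyNodesDemo {
    // A toy tree node: just a tag name and a child list.
    static final class Node {
        final String name; final List<Node> children;
        Node(String name, Node... kids) { this.name = name; this.children = List.of(kids); }
    }

    // Input: a cursor over the tree, modeled on the DOM's TreeIterator.
    static final class Input {
        private final Deque<Node> parents = new ArrayDeque<>();
        private final Deque<Iterator<Node>> iters = new ArrayDeque<>();
        private Node current;
        Input(Node root) { current = root; }
        Node getNode() { return current; }
        boolean hasChildren() { return !current.children.isEmpty(); }
        void toFirstChild() {
            parents.push(current);
            iters.push(current.children.iterator());
            current = iters.peek().next();
        }
        void toParent() { current = parents.pop(); iters.pop(); }
        boolean toNext() { // advance to the next sibling, if any
            if (iters.isEmpty() || !iters.peek().hasNext()) return false;
            current = iters.peek().next();
            return true;
        }
    }

    // Output: records the construction calls as markup text.
    static final class Output {
        final StringBuilder sb = new StringBuilder();
        void startNode(Node n) { sb.append('<').append(n.name).append('>'); }
        void endNode(Node n)   { sb.append("</").append(n.name).append('>'); }
        void putNode(Node n)   { sb.append('<').append(n.name).append("/>"); }
    }

    // The main loop from the text: copy each node, recursing into children.
    static void copyNodes(Input in, Output out) {
        do {
            Node n = in.getNode();
            if (in.hasChildren()) {
                out.startNode(n);
                in.toFirstChild();
                copyNodes(in, out);
                in.toParent();
                out.endNode(n);
            } else {
                out.putNode(n);
            }
        } while (in.toNext());
    }

    public static void main(String[] args) {
        Node doc = new Node("doc", new Node("p", new Node("b")), new Node("hr"));
        Output out = new Output();
        copyNodes(new Input(doc), out);
        System.out.println(out.sb); // <doc><p><b/></p><hr/></doc>
    }
}
```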

The basic main loop for evaluation reduces to copyNodes if no active nodes are present:

processNodes(input, context, output)

  1. node = input.getNode (get a new node)
  2. node.hasAction?
    yes:
    node.action(input, context, output)
    no:
    (copy and process)
    1. input.hasChildren?
      yes:
      output.startNode
      1. input.toFirstChild
      2. processNodes(input, context, output)
      3. input.toParent
      output.endNode
      no:
      output.putNode
  3. input.toNext(); node = input.getNode(); (get a new node)
  4. if (node != null) processNodes(input, context, output)
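
A minimal runnable sketch of this loop, with an inline function standing in for the node's separate handler object (all names are illustrative, not the actual org.risource.dps API):

```java
import java.util.*;
import java.util.function.*;

// processNodes sketch: nodes may carry an "action" that writes its
// replacement directly to the Output; passive nodes are copied.
public class ProcessNodesDemo {
    static final class Node {
        final String name; final List<Node> children;
        final BiConsumer<Node, Output> action; // null for passive nodes
        Node(String name, BiConsumer<Node, Output> action, Node... kids) {
            this.name = name; this.action = action; this.children = List.of(kids);
        }
    }

    static final class Input {
        private final Deque<Node> parents = new ArrayDeque<>();
        private final Deque<Iterator<Node>> iters = new ArrayDeque<>();
        private Node current;
        Input(Node root) { current = root; }
        Node getNode() { return current; }
        boolean hasChildren() { return !current.children.isEmpty(); }
        void toFirstChild() {
            parents.push(current);
            iters.push(current.children.iterator());
            current = iters.peek().next();
        }
        void toParent() { current = parents.pop(); iters.pop(); }
        boolean toNext() {
            if (iters.isEmpty() || !iters.peek().hasNext()) return false;
            current = iters.peek().next();
            return true;
        }
    }

    static final class Output {
        final StringBuilder sb = new StringBuilder();
        void startNode(Node n) { sb.append('<').append(n.name).append('>'); }
        void endNode(Node n)   { sb.append("</").append(n.name).append('>'); }
        void putNode(Node n)   { sb.append('<').append(n.name).append("/>"); }
    }

    static void processNodes(Input in, Output out) {
        do {
            Node n = in.getNode();
            if (n.action != null) {
                n.action.accept(n, out);    // active: the handler takes over
            } else if (in.hasChildren()) {  // passive: copy and process
                out.startNode(n);
                in.toFirstChild();
                processNodes(in, out);
                in.toParent();
                out.endNode(n);
            } else {
                out.putNode(n);
            }
        } while (in.toNext());
    }

    public static void main(String[] args) {
        // <greet/> is active: it expands to text instead of being copied.
        Node greet = new Node("greet", (n, o) -> o.sb.append("hello"));
        Node doc = new Node("doc", null, greet, new Node("hr", null));
        Output out = new Output();
        processNodes(new Input(doc), out);
        System.out.println(out.sb); // <doc>hello<hr/></doc>
    }
}
```

With no active nodes present, this reduces exactly to the copyNodes loop.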

processNodes is actually a method of the interface org.risource.dps.Process. Process in turn extends Context, and has input and output as instance variables. This makes the code as implemented somewhat less clear than that given above. An additional complication is the fact that a node's ``action'' is actually contained in a separate object (the ``handler'') which the node points to.

It is the responsibility of the node's action handler to obtain and process its contents if necessary. This is done by constructing a ``subProcess'' with the original input, and a nodelist as output. It is possible to further simplify the processing, at the expense of efficiency, by insisting that every ``passive'' node have an action that simply performs the ``copy and process'' step. (In fact, every passive node does have such an action, but we test for several common cases using the actionCode method.)

A node's action is either a ``primitive'' (which executes at this point), or is specified by an ``action list'' of nodes (some of which may be active). In the latter case, the original node's attribute list and processed content list are bound to the local variables (entities) attributes and content respectively, and a new subProcess is constructed with the action list as input and the original output as output.

The typical node action, then, is:

action(input, context, output)

  1. atts = getProcessedAttributes(input, context)
  2. input.hasChildren?
      (process and collect children)
    1. content = new NodeList;
    2. input.toFirstChild
    3. process(input, context, content)
    4. input.toParent
  3. (Construct new context; bind entities)
    1. p = context.subProcess;
    2. p.set("attributes", atts)
    3. p.set("content", content)
  4. process(actions, p, output) (process action list)

Some (primitive) node actions may require the content in a different form, for example a string. In this case they simply use a different type of object as their output.
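
The ``bind content, then run the action list'' pattern can be sketched with Strings standing in for node lists (a deliberate simplification: the real code binds entities in a subProcess Context, and repeatAction is a hypothetical handler, not part of the DPS):

```java
import java.util.*;

// Sketch of a typical action: collect the node's processed content, bind it
// as a local "entity", then run the action list against that binding.
public class ActionDemo {
    // Stand-in for running an action list in a subProcess whose entity
    // bindings include &content;. Here the "action list" just replays
    // the bound content a fixed number of times.
    static String runActionList(Map<String, String> entities, int times) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < times; i++) out.append(entities.get("content"));
        return out.toString();
    }

    // Hypothetical handler for a <repeat times="..."> tag: step 2 of the
    // pseudocode (collect processed content) is assumed already done and
    // passed in; steps 3-4 (bind entities, process the action list) follow.
    static String repeatAction(String processedContent, int times) {
        Map<String, String> entities = new HashMap<>();
        entities.put("content", processedContent); // bind &content;
        return runActionList(entities, times);
    }

    public static void main(String[] args) {
        System.out.println(repeatAction("ab", 3)); // ababab
    }
}
```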

Again, there are several complications in the actual code. For example, NodeList cannot (presently) be used as either an input or an output; it has to be wrapped in an iterator.

Structure Sharing and Copying

Because of the way the DOM is designed, with nodes linked to their parent and siblings, it is essentially impossible to share structure. If a node is part of an existing document tree and needs to be added to a new one, it has to be recursively copied. This is inefficient, so the DPS only copies nodes when it is actually building structure.

The main exception to this at present is attribute lists -- these are represented as instances of ParseTreeAttrs, which is essentially an array. Hence attributes can be shared. Attribute lists, however, have to be copied in order to prevent changes made to the attributes of one element from being reflected in those of its copies. Fortunately, attribute lists are fairly light-weight.

A major advantage of this implementation is that, if a document consists primarily of text and ordinary ``passive'' markup, the entire parse tree is never actually constructed. Passive nodes can flow directly from the input to the output without having to be copied. This is partly because an Input can be queried about the ``effective'' structure (i.e. whether a node has children), while actually handing out an ``orphan'' node with no parent and no children, ready to be inserted into a new tree by the Output.


Alternative Implementations

There are two main alternative implementation techniques.

Recursive Tree Expansion

One alternative implementation would be to generate a complete parse tree, with each node having an ``expand'' method that recursively expands the node and its children, putting the results on an Output. This is similar to the standard OO technique of having every object know how to ``print itself'' on an output stream.

The disadvantage of this approach is that a document's entire parse tree has to reside in memory. The advantage is that most, if not all, context can be kept on the execution stack.

An extreme variation on this technique is to compile the original document into an equivalent program, and then execute it.

Event Driven

Many parsers (including all SGML parsers conforming to the SAX interface standard) operate by calling an ``action routine'' for each of the following ``events'':

  1. encountering a start tag
  2. encountering an end tag
  3. encountering some other node

They are, in other words, calling on the equivalent of an Output to perform their processing. It would be possible to rewrite our processing algorithm so that an Output could do the necessary processing without needing an Input. It would then be operating in ``push mode''.

There are several variations on this technique:

  1. The first is to separate the action for each active node into a ``start action'' and an ``end action''. In between, a processed parse tree is constructed for the content. Most of the work happens in the end action; the processor's state consists mainly of a stack of partially-constructed trees.
  2. The second is to have the ``start action'' return a new Output as a continuation, pushing the previous output on a stack. Performing an end tag event on such an output would result in the end action being performed and the previous output being popped. This can be significantly more efficient if special processing has to be done on an active tag's content.
  3. The third, which we actually implemented for our first ``InterForm Interpretor'', is to have the event routines construct a sequence of ``token'' objects that are then fed to the processor (which associates separate actions with them). This has the disadvantage that objects are constructed for the end tags, then immediately thrown away.
  4. The fourth, by far the simplest to derive from the present code, is to put the parser and processor in separate threads. The ``event-driven'' Input simply blocks until the parser thread provides it with a node.
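
The fourth variation can be sketched with java.util.concurrent, assuming a String stands in for a node and an explicit sentinel marks end of input (names are illustrative, not the actual org.risource.dps API):

```java
import java.util.concurrent.*;

// Sketch of the threaded variation: the "parser" runs in its own thread and
// feeds an event-driven Input through a blocking queue; getNode (here,
// queue.take) simply blocks until the parser provides the next node.
public class ThreadedInputDemo {
    static final String EOF = "<eof>"; // sentinel marking end of input

    // The processor side: consume nodes until the sentinel arrives.
    static String drain(BlockingQueue<String> queue) throws InterruptedException {
        StringBuilder doc = new StringBuilder();
        for (String node = queue.take(); !node.equals(EOF); node = queue.take())
            doc.append(node);
        return doc.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

        // "Parser" thread: pushes nodes into the queue as it finds them.
        Thread parser = new Thread(() -> {
            try {
                for (String tok : new String[] {"<p>", "text", "</p>", EOF})
                    queue.put(tok);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        parser.start();

        String doc = drain(queue);
        parser.join();
        System.out.println(doc); // <p>text</p>
    }
}
```

A bounded queue also gives the pipeline natural backpressure: the parser blocks when the processor falls behind.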

Copyright © 1998-2000 Ricoh Innovations, Inc.
$Id: about.html,v 1.10 2001-01-11 23:37:08 steve Exp $