This document contains notes about the PIA's Document Processing System
(DPS), as implemented in the Java module org.risource.dps
.
to-do
file
./handle/
../site/
../pia/
Perhaps the simplest way to view an active documet is as a program
that expands into a character stream. The easy way to implement this
would be to use the toString
method on
Node
. Active nodes would simply expand appropriately by
overriding the toString
method, and everything would be cool.
There are several problems with this:
ParseTree
'' in org.risource.dps.active
implement the ActiveNode
interface.
The efficient way to accomplish this kind of processing is to use a parse tree
traverser as an Input
, and a constructor of
some sort for an Output
. This avoids actually constructing nodes
or nodelists until they are really needed. (Attribute lists, in particular,
will usually have to be constructed, but the attributes themselves can be
shared so the overhead is small. Nodes in child lists cannot be
shared, but we use traversal to avoid constructing them in most cases.)
The default action for a node is to copy it (recursively, if necessary) from
the current Input
to the current Output
. Active
nodes have an action
routine that performs any necessary
computation, and puts the results on the Output
.
If it can be determined at parse time that a node contains no children with side effects, all the children of that node can be expanded in parallel.
In any case, a document in which all active nodes are local can be processed in a single depth-first, left-to-right traversal.
Input
An
Input
is basically a specialized version of the DOM's
TreeIterator
except:
toNext
toParent
toFirstChild
input.hasChildren()
returns true
but
input.getNode().hasChildren()
returns false
,
we know that input
is a parser, and that
the current node's children have not been seen yet.
getNode
-- a reference to the actual current node
in the parse tree. It is possible that not all of its children
will be present.
TreeIterator
has been dropped from recent
versions of the DOM Level 1, but it's still a good idea.)
Similarly, the output is constructed by means of an Output
. This
combines traversal with construction, so that it is possible to add a node to
the document under construction, then go on to add its children.
This means that Output needs startNode
and endNode
as well as putNode
(which adds a complete subtree).
A Parser
is simply a specialized Input
that
implements the standard operations by parsing an input stream. It does
not have to build an actual tree. Similarly, an Output
need not be a tree constructor; it might simply output a text representation
of the node to an output stream.
An action routine will normally obtain its ``arguments'' (i.e. the attributes and children of the node that invoked it) by way of a suitable Output. This could then be returned to the action routine's calling context as a continuation. In any case, the Output is strongly typed, so the action routine can obtain the relevant objects (e.g. ActiveElement, NodeList, etc.) knowing their exact types.
The basic main loop for evaluation reduces to copyNodes
if no
active nodes are present:
processNodes
is actually a method of the interface
org.risource.dps.Process
. Process
in turn extends
Context
, and has input
and output
as
instance variables. This makes the code as implemented somewhat less clear
than that given above. An additional complication is the fact that a node's
``action'' is actually contained in a separate object (the ``handler'') which
the node points to.
It is the responsibility of the node's action handler to obtain and process
its contents if necessary. This is done by constructing a ``subProcess'' with
the original input, and a nodelist as output. It is possible to further
simplify the processing, at the expense of efficiency, by insisting that every
``passive'' node have an action that simply performs the ``copy and process''
step. (In fact, every passive node does have such an action, but
we test for several common cases using the actionCode
method.)
A node's action is either a ``primitive'' (which executes at this point), or
is specified by an ``action list'' of nodes (some of which may be active). In
the latter case, the original node's attribute list and processed content list
are bound to the local variables (entities) attributes
and
content
respectively, and a new subProcess is constructed with
the action list as input and the original output as output.
The typical node action, then, is:
Again, there are several complications in the actual code. For example, NodeList cannot (presently) be used as either an input or an output; it has to be wrapped in an iterator.
Because of the way the DOM is designed, with nodes linked to their parent and siblings, it is essentially impossible to share structure. If a node is part of an existing document tree and needs to be added to a new one, it has to be recursively copied. This is inefficient, so the DPS only copies nodes when it is actually building structure.
The main exception to this at present is attribute lists -- these are
represented as instances of ParseTreeAttrs
, which is essentially
an array. Hence attributes can be shared. Attribute lists, however,
have to be copied in order to prevent changes made to the attributes of one
element from being reflected in those of its copies. Fortunately, attribute
lists are fairly light-weight.
A major advantage of this implementation is that, if a document consists
primarily of text and ordinary ``passive'' markup, the entire parse tree is
never actually constructed. Passive nodes can flow directly from the input to
the output without having to be copied. This is partly because an
Input
can be queried about the ``effective'' structure
(i.e. whether a node has children), while actually handing out an ``orphan''
node with no parent and no children, ready to be inserted into a new
tree by the Output
.
There are two main alternative implementation techniques.
One alternative implementation would be to generate a complete parse tree,
with each node having an ``expand
'' method that recursively
expands the node and its children, putting the results on an
Output
. This is similar to the standard OO technique of having
every object know how to ``print itself'' on an output stream.
The disadvantage of this approach is that a document's entire parse tree has to reside in memory. The advantage is that most, if not all, context can be kept on the execution stack.
An extreme variation on this technique is to compile the original document into an equivalent program, and then execute it.
Many parsers (including all SGML parsers conforming to the SAX interface standard) operate by calling an ``action routine'' for each of the following ``events'':
They are, in other words, calling on the equivalent of an
Output
to perform their processing. It would be possible to
rewrite our processing algorithm so that an Output
could do the
necessary processing without needing an Input
. It would then be
operating in ``push mode''.
There are several variations on this technique:
Output
as a continuation, pushing the previous
output on a stack. Performing an end tag event on such an output
would result in the end action being performed and the previous
output being popped. This can be significantly more efficient if
special processing has to be done on an active tag's content.
Input
simply blocks until the parser thread provides it
with a node.