Several times I have alluded to the idea of representing documents as compiled code in some programming language such as C. Such a representation is one of the extrema on the time-space continuum, and close to an extremum on the efficiency-portability continuum. (The corresponding executable is at the extremum.)
There are actually two distinct ways of representing documents as programs:
The first technique is significantly more versatile, and avoids problems that arise when trying to compile constructs like <expand> that are most easily implemented if one has a parse tree representation of the document in the first place. Given a document representation as compiled parse trees, the corresponding program is essentially an interpretor.
Note also that the distinction disappears in interpreted languages such as LISP, in which trees are used as the base representation of both programs and data.
Even though the W3C's Document Object Model (DOM) specifies a representation for document parse trees, it is not really a particularly good internal representation for documents, for the following reasons:
There are three plausible organizations for documents:
There is a separate but related decision about the extent to which contiguous blocks of text ought to be broken up into words. This trades space overhead for efficiency in operations that split their text content on whitespace. One possible compromise is to make this decision on an element-by-element basis. It's also possible (likely, even) that structure sharing (i.e., the flyweight pattern) will gain back more space than it costs.
There are two alternatives for representing the children; both have their uses.
nextSibling
and firstChild
pointers
would take the same amount of space, but precludes structure sharing
within content. On the other hand, it permits on-the-fly
construction since it would not be necessary to count children.
Combined with a special "end-of-list" node that points back to the
parent, this would allow stackless tree traversal (as in the DOM).
Number the elements in the document, and use these numbers in the node
identifiers. Use, e.g., eNNN
for the element,
aNNN
for its attribute list array, and
cNNN
for its content array.
Alternatively, keep a separate count for each type of element, and
incorporate the tagname into the identifier, e.g.
h1_NNN
. This would be a little more complex, but
would result in more readable code.
Tagnames will normally be represented as indices into a table of
handlers or as references to descriptor structures. Here the use of the
tag name in the corresponding identifier or #define
symbol is
almost obligatory for debugging purposes.