Compiled Document Representations

Introduction

Several times I have alluded to the idea of representing documents as compiled code in some programming language such as C. Such a representation is one of the extrema on the time-space continuum, and close to an extremum on the efficiency-portability continuum. (The corresponding executable is at the extremum.)

There are actually two distinct ways of representing documents as programs:

  1. as an initialized data structure that corresponds to a representation of its parse tree
  2. as a program that, when executed, outputs the document as a character stream, with all active constructs expanded.

The first technique is significantly more versatile, and avoids problems that arise when trying to compile constructs like <expand> that are most easily implemented if one has a parse tree representation of the document in the first place. Given a document representation as compiled parse trees, the corresponding program is essentially an interpretor.

Note also that the distinction disappears in interpreted languages such as LISP, in which trees are used as the base representation of both programs and data.

Document Data Structures

Even though the W3C's Document Object Model (DOM) specifies a representation for document parse trees, it is not really a particularly good internal representation for documents, for the following reasons:

There are three plausible organizations for documents:

  1. Represent the entire document as a single array of identical Node structures. This has the advantage of being very simple to construct. The drawbacks are that it may require more space (since text, elements, attributes, and end tags all need the same size node), and that some compilers break when confronted with enormous arrays.
  2. Represent each Node as a separate, named static variable (or group of variables, in some cases). This is definitely the most straightforward method.
  3. Represent the document as a list of pointers to, or indices of, nodes. This corresponds roughly to a ``threaded-code'' or ``byte-code'' representation of the document.

There is a separate but related decision about the extent to which contiguous blocks of text ought to be broken up into words. This trades space overhead for efficiency in operations that split their text content on whitespace. One possible compromise is to make this decision on an element-by-element basis. It's also possible (likely, even) that structure sharing (i.e., the flyweight pattern) will gain back more space than it costs.

Design Sketch

There are two alternatives for representing the children; both have their uses.

  1. Represent a nodelist (either an attribute list or the content of an element) as an array of pointers to (or indices of) nodes. This permits structure sharing but requires a complete parse tree for construction.
  2. Using nextSibling and firstChild pointers would take the same amount of space, but precludes structure sharing within content. On the other hand, it permits on-the-fly construction since it would not be necessary to count children. Combined with a special "end-of-list" node that points back to the parent, this would allow stackless tree traversal (as in the DOM).

Number the elements in the document, and use these numbers in the node identifiers. Use, e.g., eNNN for the element, aNNN for its attribute list array, and cNNN for its content array.

Alternatively, keep a separate count for each type of element, and incorporate the tagname into the identifier, e.g. h1_NNN. This would be a little more complex, but would result in more readable code.

Tagnames will normally be represented as indices into a table of handlers or as references to descriptor structures. Here the use of the tag name in the corresponding identifier or #define symbol is almost obligatory for debugging purposes.


Copyright © 1997-1999 Ricoh Innovations, Inc.
$Id: compiled.html,v 1.2 2001-01-11 23:36:53 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>