Document Processing in the PIA

    1: Overview
        1.1: Document-Oriented Computing
        1.2: Examples
    2: Tags All the Way Down
        2.1: How it works
        2.2: Multiple Views
        2.3: Combining Information and Processing
        2.4: Separating Input, Processing, and Output
    3: The DPS and Standards
        3.1: Choice of Syntax: HTML or XML
        3.2: Application Programming Interfaces

1: Overview

The PIA's Document Processing System (DPS) is fundamentally a document formatting system masquerading as a macro language. (Programmers may be more comfortable thinking of it as a simple programming language optimized for document processing). The most unusual feature of the system is that the input documents, output documents, and processing instructions all have the same syntax -- XML (or a less strict superset that includes HTML the way most people write it).

1.1: Document-Oriented Computing

In many ways, the PIA represents a new way of using computers on the Web, which we call ``Document-Oriented Computing.'' It reflects the realization that, not only does the Web consist entirely of documents, but that documents and document fragments have become the basic objects of computation for web applications. They are stored on servers, passed from servers to clients, operated on by style sheets (which are also documents), cached in files, and so on.

Just as numbers are the basic objects of computation for a spreadsheet or a ``number-crunching'' application, just as images and pixels are the basic objects of games and presentation programs, documents and their components are the basic objects of computation for the PIA. The PIA goes further than traditional applications, however: because the PIA allows documents to be active, it is possible to build complete applications using nothing but documents.

A ``style sheet'' is a simple example of an active document: it specifies how some other document is supposed to be formatted for presentation in a browser, or for printing. An ``active server page'' is another example: it consists of an ordinary HTML document with some fragments of program embedded in it, which are interpreted by the server.

The difference in the PIA is that active documents don't contain pieces of code in some programming language. Instead, the PIA simply associates actions with some of the document's ``tags.'' Some tags (about two dozen) are predefined, others can be defined as needed by the application designer.

1.2: Examples

For example, this is all one needs to do in order to create a ``footer'' tag to go at the bottom of every document on your web site:

<define element="footer" empty="yes">
  <action>
    <b><i>Copyright 1999 
        <a href="http://RiSource.org/">RiSource.org</a></i></b><br>
  </action>
</define>

Notice the way this definition mixes ordinary HTML tags with a small number of ``active tags'' -- <define> and <action> in this case.

The document being processed need not be a local file -- it can come from anywhere on the Web, and processing need not be confined to simple substitution. For example, here is a fragment of PIA code that extracts all of the links from a web page and presents them as a bulleted list:

<ul>
  <repeat>
    <foreach entity="link">
      <extract><from><include src="http://RiSource.org/"/></from>
	       <name all="yes">a</name>
      </extract>
    </foreach>
    <li> &link; </li>
  </repeat>
</ul>

Something only slightly more complicated could be used to prepare a site index.

2: Tags All the Way Down

``The world is supported on the backs of four elephants standing on the shell of a giant turtle.''
``And what is the turtle standing on?''
``You're very clever, young man, but it's turtles all the way down.''
-- told of William James, Bertrand Russell, and others; see here and here.

The PIA's document processing is entirely XML-based; there are no snippets with other syntax embedded in attributes or text. All of the actions of the document processing system are performed by associating actions with tags.

This makes the PIA completely compatible with existing XML toolsets. But because the parser used is also capable of dealing with standard HTML and many other SGML-based markup languages, it can also be used to process documents from a wide variety of sources, including those generated by programs or created using simple text editors.

2.1: How it works

The DPS works by making a single pass over a document, performing the actions that are associated with the tags in the document. The default action for any unknown tag is simply to copy it. (More correctly, to copy the start tag and its corresponding end tag, and to process the contents.)

The action associated with a tag can either be a definition -- a document fragment that simply replaces the tag in the document; or a primitive -- an action defined by the implementation, in some programming language. The set of primitives is small, but sufficiently powerful that any possible document transformation can be performed. The set of tag definitions used to process a document is called a tagset.

The DPS is implemented as a ``processor'' situated between an ``input'' which functions as a parser or parse-tree traverser, and an ``output'' which functions as a tree builder. In most cases the input and output tree structures can be entirely virtual.

This approach has several advantages:

All transformations on the document are local: a tag's action can affect only the element and its contents.
A simple consequence of this locality is that, once a part of a document has been processed, it never needs to be looked at again. This means that the DPS can handle documents that are too large to fit in the computer's memory, making the DPS useable in embedded applications.
Because the DPS operates on ``parse trees'' rather than on strings, it is impossible to produce a document that is syntactically incorrect -- the output document is always well-formed.
But because the DPS is not restricted to XML parsers, it is not necessary for the input to be well-formed XML -- it might be HTML or even something entirely different.
Because the set of primitives is small and the syntax of the language is pure XML or HTML, the DPS is easily implemented in any programming language. Although the current implementation language is Java, an earlier version was written in Perl, and a version in C is being contemplated.
Because the set of tag definitions is separate from the document being processed, it is possible to process the same document in multiple ways. (We will examine this further in the next section.)
Again because the definitions are separate from the document, the DPS is inherently secure: by restricting the set of primitives available when processing a document, it is easy to prevent a document ``imported'' from some other web site from affecting local files or accessing other sites. At the same time, documents that are part of a PIA-based application need not be so restricted.
On the other hand, by making a more complete set of primitives available to it, a single document can be made to specify both an HTML form, and the associated processing. The advantages of this can be easily seen in the PIA, which allows complete web-based applications to be built entirely without the use of ``traditional'' programming languages or scripts.

2.2: Multiple Views

One of the major advantages of the DPS is that a document can be processed using different tagsets for different purposes. There are several applications of this in the PIA; others will no doubt spring to mind.

Style. Because any tag can have a definition, it's easy to apply ``styles'' to ordinary HTML documents. A document can be a mixture of invented tags (for example, <header> and <footer>) and ordinary HTML tags.
Formatting. Something we aren't doing at the moment, but could do rather easily, is formatting HTML pages (for a printer, for example).
Documentation. One of the things we do in the PIA is to automatically generate documentation for our tagsets. We do this by means of a specialized tagset, tsdoc.ts, that redefines the <define> tag.
Extraction. It's easy in the DPS to extract information (all images, for example, or all headings) from a document.

2.3: Combining Information and Processing

As we have seen, the PIA allows information and processing to be mixed in the same document, using the same XML-derived syntax. This has several benefits:

Processing and document never become separated, making them less likely to get out of sync.
Because processing has the same syntax as the document, only one set of editing tools is needed.
Files that contain primarily processing instructions, for example tagsets and configuration files, are nevertheless perfectly valid documents and can be treated as such. This allows software and its documentation to stay together. Tagsets and configuration files can easily be turned into readable HTML or printed documentation.

2.4: Separating Input, Processing, and Output

The PIA has separate interfaces (API's) for input, output, and processor objects.

Input
is essentially a ``document traverser'' -- it includes the methods necessary for making a single left-to-right, top-down pass through a document's parse tree. An important feature of the Input interface, however, is that the tree never has to exist in its complete form.
Output
defines the interface for a ``document constructor.'' As with the Input interface, the document produced need not be a DOM tree; it might be a character stream, a string, or a SAX application.
Processor
defines the interface for the classes that actually process documents. A Processor implementation can operate in either of two modes: ``pull mode,'' in which the document is supplied by an Input, and ``push mode,'' in which the Processor is ``driven'' by an Output. The latter is necessary for interfacing to ``event-driven'' parsers such as those that use the popular SAX interface.

3: The DPS and Standards

3.1: Choice of Syntax: HTML or XML

The DPS's parser is capable of handling either XML or HTML syntax -- the parser's degree of ``strictness'' in handling things like omitted end tags or minimized attributes is specified in the tagset.

Unlike a ``pure XML'' system, this means that HTML documents can be ``imported'' from other web sites and processed using the DPS. It also means that standard HTML authoring tools can be used to create documents.

But the option of using XML means that the DPS can create documents that are pure XML. This is true even if the input format is HTML, so that the DPS is easily integrated into any XML-based system.

3.2: Application Programming Interfaces

There are two major API's in use in the Java/XML community: SAX and the DOM. The DPS works well with both.

SAX is the Simple API for XML. It is designed as the interface between a parser or ``driver'' and an ``application.'' It is ``event-driven,'' in the sense that an ``event handler'' in the application is called by the driver for every ``node'' in the document's parse tree. The DPS can use the SAX interface ``at both ends'' -- the processor can be driven by a SAX driver, and it can drive a SAX application.
This versatility means that the PIA can be easily inserted into a SAX ``tool chain'' to perform special-purpose processing between the parser and the application.
DOM is the Document Object Model -- a language-independent set of interfaces for manipulating document parse trees. The PIA uses its own, extended implementation of the DOM as its internal representation, but because of the extensions it is not really suitable for use in other applications. However, any DOM implementation can be used as an input or output for the PIA, making the PIA suitable for use with other DOM-based software such as XML databases and document managers.

Stephen R. Savitzky < steve@rii.ricoh.com>