Document Processing in the PIA

Table of Contents


    1: Overview
        1.1: Document-Oriented Computing
        1.2: Examples
    2: Tags All the Way Down
        2.1: How it works
        2.2: Multiple Views
        2.3: Combining Information and Processing
        2.4: Separating Input, Processing, and Output
    3: The DPS and Standards
        3.1: Choice of Syntax: HTML or XML
        3.2: Application Programming Interfaces

1: Overview

The PIA's Document Processing System (DPS) is fundamentally a document formatting system masquerading as a macro language. (Programmers may be more comfortable thinking of it as a simple programming language optimized for document processing.) The most unusual feature of the system is that the input documents, output documents, and processing instructions all have the same syntax -- XML (or a less strict superset that includes HTML the way most people write it).

1.1: Document-Oriented Computing

In many ways, the PIA represents a new way of using computers on the Web, which we call ``Document-Oriented Computing.'' It reflects the realization not only that the Web consists entirely of documents, but also that documents and document fragments have become the basic objects of computation for web applications. They are stored on servers, passed from servers to clients, operated on by style sheets (which are also documents), cached in files, and so on.

Just as numbers are the basic objects of computation for a spreadsheet or a ``number-crunching'' application, just as images and pixels are the basic objects of games and presentation programs, documents and their components are the basic objects of computation for the PIA. The PIA goes further than traditional applications, however: because the PIA allows documents to be active, it is possible to build complete applications using nothing but documents.

A ``style sheet'' is a simple example of an active document: it specifies how some other document is supposed to be formatted for presentation in a browser, or for printing. An ``active server page'' is another example: it consists of an ordinary HTML document with some fragments of program embedded in it, which are interpreted by the server.

The difference in the PIA is that active documents don't contain pieces of code in some programming language. Instead, the PIA simply associates actions with some of the document's ``tags.'' Some tags (about two dozen) are predefined; others can be defined as needed by the application designer.

1.2: Examples

For example, this is all it takes to create a ``footer'' tag to go at the bottom of every document on your web site:

<define element="footer" empty="yes">
  <action>
    <b><i>Copyright 1999 
        <a href="http://RiSource.org/">RiSource.org</a></i></b><br>
  </action>
</define>

Notice the way this definition mixes ordinary HTML tags with a small number of ``active tags'' -- <define> and <action> in this case.

The document being processed need not be a local file -- it can come from anywhere on the Web, and processing need not be confined to simple substitution. For example, here is a fragment of PIA code that extracts all of the links from a web page and presents them as a bulleted list:

<ul>
  <repeat>
    <foreach entity="link">
      <extract><from><include src="http://RiSource.org/"/></from>
               <name all="yes">a</name>
      </extract>
    </foreach>
    <li> &link; </li>
  </repeat>
</ul>  

Something only slightly more complicated could be used to prepare a site index.

2: Tags All the Way Down


``The world is supported on the backs of four elephants standing on the shell of a giant turtle.''
``And what is the turtle standing on?''
``You're very clever, young man, but it's turtles all the way down.''
-- told of William James, Bertrand Russell, and others.

The PIA's document processing is entirely XML-based; there are no snippets with other syntax embedded in attributes or text. Everything the document processing system does is done by actions associated with tags.

This makes the PIA completely compatible with existing XML toolsets. But because the parser used is also capable of dealing with standard HTML and many other SGML-based markup languages, it can also be used to process documents from a wide variety of sources, including those generated by programs or created using simple text editors.

2.1: How it works

The DPS works by making a single pass over a document, performing the actions that are associated with the tags in the document. The default action for any unknown tag is simply to copy it. (More correctly, to copy the start tag and its corresponding end tag, and to process the contents.)

The action associated with a tag can be either a definition (a document fragment that simply replaces the tag in the document) or a primitive (an action defined by the implementation, in some programming language). The set of primitives is small, but sufficiently powerful that any possible document transformation can be performed. The set of tag definitions used to process a document is called a tagset.
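
The single pass described above can be illustrated with a minimal sketch in Python. This is not the PIA's own code or API: the TAGSET dictionary, the process and expand functions, and the representation of a definition as a plain string are all assumptions made for illustration. Tags with a definition are replaced by their fragment; any other tag is copied and its contents are processed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical tagset: maps a tag name to the fragment that replaces it.
# Illustrative only; this is not the PIA's actual tagset format.
TAGSET = {
    "footer": '<b><i>Copyright 1999 '
              '<a href="http://RiSource.org/">RiSource.org</a></i></b>',
}

def process(node, tagset, out):
    """Visit each element once, appending output text to the list `out`."""
    if node.tag in tagset:
        # Definition: the fragment replaces the whole element.
        out.append(tagset[node.tag])
        return
    # Default action: copy the start tag, process the contents,
    # then copy the matching end tag.
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attrib.items())
    out.append(f"<{node.tag}{attrs}>")
    out.append(escape(node.text or ""))
    for child in node:
        process(child, tagset, out)
        out.append(escape(child.tail or ""))  # text following the child
    out.append(f"</{node.tag}>")

def expand(document, tagset=TAGSET):
    """Parse `document` and return the processed result as a string."""
    out = []
    process(ET.fromstring(document), tagset, out)
    return "".join(out)

result = expand('<body><p class="x">Hello</p><footer/></body>')
print(result)
```

The sketch covers definitions only; primitives would be functions rather than strings, and a fuller version would also process the replacement fragment itself.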

The DPS is implemented as a ``processor'' situated between an ``input'' which functions as a parser or parse-tree traverser, and an ``output'' which functions as a tree builder. In most cases the input and output tree structures can be entirely virtual.

This approach has several advantages:

2.2: Multiple Views

One of the major advantages of the DPS is that a document can be processed using different tagsets for different purposes. There are several applications of this in the PIA; others will no doubt spring to mind.
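
As a rough illustration of the idea, the Python sketch below renders one document through two different tagsets. Everything in it is hypothetical: the <note> element, the HTML_VIEW and TEXT_VIEW tagsets, and the render function are invented for the example and are not part of the PIA. For brevity it ignores attributes and does not escape text.

```python
import xml.etree.ElementTree as ET

def render(node, tagset):
    """Apply the tagset's action for node.tag, or copy the tag by default."""
    # Process the contents first: leading text, then each child and its tail.
    inner = (node.text or "") + "".join(
        render(child, tagset) + (child.tail or "") for child in node
    )
    action = tagset.get(node.tag)
    if action is not None:
        return action(inner)
    return f"<{node.tag}>{inner}</{node.tag}>"

# Two hypothetical tagsets giving two views of the same <note> element.
HTML_VIEW = {"note": lambda body: f"<em>Note: {body}</em>"}
TEXT_VIEW = {"note": lambda body: f"[{body}]", "p": lambda body: body}

doc = ET.fromstring("<p>Save often. <note>Disks fail.</note></p>")
html_view = render(doc, HTML_VIEW)
text_view = render(doc, TEXT_VIEW)
print(html_view)
print(text_view)
```

The document itself never changes; only the tagset chosen at processing time decides whether the result is marked-up HTML or plain text.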

2.3: Combining Information and Processing

As we have seen, the PIA allows information and processing to be mixed in the same document, using the same XML-derived syntax. This has several benefits:

2.4: Separating Input, Processing, and Output

The PIA has separate interfaces (APIs) for input, output, and processor objects.
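
The same three-way separation can be sketched with Python's standard SAX classes. This is an analogy, not the PIA's interfaces: the parser plays the role of the input, an XMLFilterBase subclass plays the processor, and an XMLGenerator plays the output. The RenameProcessor class and its renaming rule are hypothetical.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class RenameProcessor(XMLFilterBase):
    """Processor stage: renames <b> to <strong>, passes everything else on."""

    RENAMES = {"b": "strong"}  # hypothetical rule, for illustration only

    def startElement(self, name, attrs):
        super().startElement(self.RENAMES.get(name, name), attrs)

    def endElement(self, name):
        super().endElement(self.RENAMES.get(name, name))

def run(source_text):
    """Wire input -> processor -> output and return the generated XML."""
    sink = io.StringIO()
    output = XMLGenerator(sink, encoding="utf-8")        # output stage
    processor = RenameProcessor(xml.sax.make_parser())   # parser is the input stage
    processor.setContentHandler(output)
    processor.parse(io.BytesIO(source_text.encode("utf-8")))
    return sink.getvalue()

result = run("<p>a <b>bold</b> word</p>")
print(result)
```

Because each stage only sees events from its neighbor, any of the three can be swapped (for example, a tree walker as input or a tree builder as output) without touching the others, and no complete tree has to be built along the way.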

3: The DPS and Standards

3.1: Choice of Syntax: HTML or XML

The DPS's parser is capable of handling either XML or HTML syntax -- the parser's degree of ``strictness'' in handling things like omitted end tags or minimized attributes is specified in the tagset.

This means that, unlike in a ``pure XML'' system, HTML documents can be ``imported'' from other web sites and processed using the DPS. It also means that standard HTML authoring tools can be used to create documents.

But the option of using XML means that the DPS can create documents that are pure XML. This is true even if the input format is HTML, so that the DPS is easily integrated into any XML-based system.
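
To make the HTML-in, XML-out idea concrete, here is a small Python sketch built on the standard html.parser module. It is not the DPS parser and makes no claim about how the DPS decides strictness: the HtmlToXml class, the VOID and AUTO_CLOSE sets, and the handful of rules they encode are assumptions chosen to cover only omitted end tags, empty elements, and minimized attributes. Comments and doctype declarations are dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

VOID = {"br", "hr", "img", "input", "meta", "link"}  # never have end tags
AUTO_CLOSE = {"li", "p"}  # a new one closes an open one directly above it

class HtmlToXml(HTMLParser):
    """Reads loosely written HTML and emits well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Simplified omitted-end-tag rule: only the innermost open tag is checked.
        if tag in AUTO_CLOSE and self.open_tags and self.open_tags[-1] == tag:
            self._close_through(tag)
        # A minimized attribute such as "checked" arrives with value None.
        text = "".join(f" {k}={quoteattr(k if v is None else v)}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{text}/>")
        else:
            self.out.append(f"<{tag}{text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data))

    def _close_through(self, tag):
        """Close open elements, innermost first, up to and including `tag`."""
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def result(self):
        self.close()
        while self.open_tags:  # close anything still left open
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)

def html_to_xml(text):
    parser = HtmlToXml()
    parser.feed(text)
    return parser.result()

xml_text = html_to_xml("<ul><li>one<li>two <input type=checkbox checked><br></ul>")
print(xml_text)
```

The output can be handed to any XML parser, which is the property that lets an HTML-fed stage sit inside an otherwise XML-based system.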

3.2: Application Programming Interfaces

There are two major APIs in use in the Java/XML community: SAX and the DOM. The DPS works well with both.


Copyright © 1997-1999 Ricoh Innovations, Inc.
$Id: wp-dps.html,v 1.3 2001/01/12 01:45:39 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>