PIA Internals Manual

This manual describes the internal workings of the PIA, including the Application Programming Interface.


Implementation

The PIA is implemented in Java. Complete documentation of the Java packages and classes can be found in the javadoc subdirectory of this Manual. This documentation also includes the JavaDoc descriptions of the standard Java packages used by the PIA.

There are two classes with useable main programs (several others have main programs for test purposes):

  1. org.risource.pia.Pia
    the main program of the PIA itself.
  2. org.risource.dps.Filter
    a stand-alone document processor for use as a stream filter, document processor, or CGI script. This replaces the earlier interpreter.
Both are accessible via small wrapper programs in the pia/bin directory:
  1. bin/pia (and pia.exe for Windows)
  2. bin/process (and process.exe for Windows)
Note that all are stand-alone applications. None of these programs can be used inside a browser as applets; since the PIA is basically a server, an applet would not be particularly useful.

Packages

The PIA is implemented using the following packages (NOTE: the links may be broken depending on which version of the JDK the documentation was compiled with. Try the index in javadoc/ to be sure):
  org.risource.ds
    generic Data Structures. Some were originally written with an interface that makes it easy to translate similar constructs in PERL. In particular, most return null on out-of-range keys or indices instead of throwing an exception.
  JP.ac.osaka_u.ender.util.regex
    A regular-expression package. Copyright (C) 1997 Shugo Maeda under the GNU General Public License.
  org.risource.dom
    The Document Object Model. This is an implementation of the W3C's DOM draft specification. While the interfaces should be up to date with the 1.0 spec, the current implementation has a few differences (related to efficiently handling collections of nodes).
  org.risource.dps
    The Document Processing System, which is used for processing executable markup. It uses interfaces in org.risource.dom for representing active parse trees; these are extended to include syntax and semantic handlers.
  org.risource.dps.handle
    Classes that handle particular active elements.
  org.risource.dps.Tagset
    Classes and tagset definition files that implement the various tagsets.
  org.risource.pia
    The PIA itself, and in general everything that deals with HTTP Transactions.
  org.risource.pia.agent
    ``Handle'' classes that implement individual Agents.
  org.risource.tf
    Classes that implement Transaction Features for Agents to match.
  org.risource.util
    Utility classes, mainly to implement operations that belong on standard Java objects, but have to be implemented statically because many standard classes can't be extended.

Handler classes in org.risource.dps.handle, org.risource.pia.agent, and org.risource.tf are loaded by name when needed.
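
Loading a handler class by name can be sketched with standard Java reflection. This is only an illustration of the general technique, not the PIA's actual loader: the Handler interface, the EchoHandler class, and the HandlerLoader class are hypothetical names invented for the example. Only Class.forName and getDeclaredConstructor().newInstance() are standard Java calls.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical handler interface; the PIA's real handler interfaces differ.
interface Handler {
    String handle(String input);
}

// Hypothetical handler that exists only to demonstrate loading by name.
class EchoHandler implements Handler {
    public EchoHandler() {}
    public String handle(String input) { return "echo:" + input; }
}

// Assumed helper: resolves a class name to a handler instance on first use.
class HandlerLoader {
    // Cache so that each class is looked up and instantiated only once.
    private final Map<String, Handler> cache = new HashMap<>();

    // Returns the handler for className, or null if it cannot be loaded.
    Handler load(String className) {
        Handler cached = cache.get(className);
        if (cached != null) return cached;
        try {
            Class<?> c = Class.forName(className);
            Object o = c.getDeclaredConstructor().newInstance();
            if (!(o instanceof Handler)) return null;
            Handler h = (Handler) o;
            cache.put(className, h);
            return h;
        } catch (ReflectiveOperationException e) {
            // Unknown class, missing no-argument constructor, and so on.
            return null;
        }
    }
}
```

In this sketch a failed lookup returns null rather than throwing, which mirrors the convention described above for org.risource.ds; whether the real loader behaves this way is an assumption.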

Note:

This manual is still under construction. Suggestions about what else to include would be greatly appreciated.


The Resolver

The heart of the PIA is a class called the ``Resolver'' which makes the association between HTTP ``Transaction''s and the Agents that operate on them.

The Resolver operates in two phases:

  1. In the first phase, Agents are matched with the Transaction's Features according to their match Criteria, and are allowed to act on the matched Transaction.
  2. In the second, any Agents that have been registered (during the first phase) as ``handlers'' are invoked to ``satisfy'' the Transaction, usually by forwarding it to a client or server.
Clients and servers are represented in the PIA by proxy objects called ``Machine''s which contain the streams that connect to the client or server, and possibly some associated information. A Machine is responsible for the communication protocol required by the client or server to which it is connected.

The Resolver Algorithm

Given a resolver R:
  1. Input:
    Check for incoming messages. For each message M:
    1. Push M onto R's queue Q.
  2. Next Transaction:
    Shift the next transaction T from Q.
  3. Resolution:
    For each agent A:
    1. match A's criteria against T's features.
    2. If A and T match, call A's actOn method with T and R.
  4. Satisfaction:
    For each object S on T's satisfiers queue:
    1. call S's handle method with T and R.
  5. Finishing:
    If no S returned true,
    1. if T is a request, push an error response onto Q.
    2. if T is a response, forward it to its requestor.
  6. repeat from step 1.
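
The steps above can be sketched in Java roughly as follows. All of the types here (Transaction, Satisfier, Agent, Resolver, and their fields) are simplified stand-ins written for this example and are not the real org.risource.pia classes; in particular, the forwarded list merely records what would be sent to a requestor, and feature matching is left to each Agent's matches method.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for an HTTP transaction (assumed shape).
class Transaction {
    final boolean isRequest;
    final String label;
    final Set<String> features = new HashSet<>();           // consulted by Agent.matches
    final Deque<Satisfier> satisfiers = new ArrayDeque<>(); // filled during resolution

    Transaction(boolean isRequest, String label) {
        this.isRequest = isRequest;
        this.label = label;
    }
}

interface Satisfier {
    // Returns true if the transaction was satisfied.
    boolean handle(Transaction t, Resolver r);
}

interface Agent {
    boolean matches(Transaction t);         // criteria against features
    void actOn(Transaction t, Resolver r);  // may register satisfiers on t
}

class Resolver {
    final Deque<Transaction> queue = new ArrayDeque<>();
    final List<Agent> agents = new ArrayList<>();
    final List<String> forwarded = new ArrayList<>(); // stands in for real output

    // Step 1: incoming messages are pushed onto the queue.
    void push(Transaction t) { queue.addLast(t); }

    // Steps 2-5 for one transaction; returns false when the queue is empty.
    boolean step() {
        Transaction t = queue.pollFirst();                // step 2
        if (t == null) return false;
        for (Agent a : agents) {                          // step 3
            if (a.matches(t)) a.actOn(t, this);
        }
        boolean satisfied = false;
        Satisfier s;
        while ((s = t.satisfiers.pollFirst()) != null) {  // step 4
            if (s.handle(t, this)) satisfied = true;
        }
        if (!satisfied) {                                 // step 5
            if (t.isRequest) {
                push(new Transaction(false, "error for " + t.label));
            } else {
                forwarded.add(t.label);
            }
        }
        return true;
    }

    // Step 6: repeat until no transactions remain.
    void run() { while (step()) { } }
}
```

With no agents registered, a request falls through to step 5, an error response is queued, and that response is in turn forwarded because nothing satisfies it either.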

Agents

An Agent in the PIA is represented internally by a class that implements the Agent interface. In practice, all of them descend from GenericAgent. Many Agents are, in fact, implemented directly by GenericAgent, the main exceptions at the moment being Agency, Dofs, and Logo. Cache, when we implement it, will probably also require its own class.

The main reason for implementing an Agent as a separate class is so that the Java code can grab control before dispatching a URL to an active document. Efficiency (Agency) and the ability to handle non-HTML data (Logo) also play a part, although in fact Logo is currently implemented by dispatching to a PERL program.


The Document Processing System

The Document Processing Algorithm

The document processing algorithm has been totally rewritten and greatly simplified. It used to be a variant of the PIA's Resolver algorithm, and operated on parse trees that had been flattened into a stream of ``tokens'', with each element represented as ``start tag'' and ``end tag'' tokens.

The current version is much more efficient, and is based on a subset of the DOM's TreeIterator interface: the Input is used for a unidirectional, depth-first traversal of a parse tree. An Output is used for constructing a parse tree, generic DOM Document, or stream. A Processor sits between them; its state consists of the current Input, Output, and entity binding table. (This is considerably simpler than the old Processor state, which also included action bindings and traversal state.)

The Input and Output interfaces are designed so that one can copy a document by traversing it with an Input and passing the resulting nodes to an Output. Parse trees extend the DOM by decorating each Node with an Action object (strategy pattern). Operations implemented by this technique are local -- an active node in the input document is replaced by one or more nodes in the output document. The resulting ``interpreter'' is simple enough to fit on a single page.
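
The idea of decorating nodes with Action objects can be sketched as below. The Node, Output, Action, Processor, and StringOutput types are simplified stand-ins invented for this example; the real interfaces in org.risource.dps differ in detail. For brevity the traversal is done by direct recursion over child lists rather than through a separate Input object.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified parse-tree node (assumed shape): an element or a text node.
class Node {
    final String tag;    // null for a text node
    final String text;   // null for an element
    final List<Node> children = new ArrayList<>();
    Action action;       // null means the node is passive

    Node(String tag, String text) { this.tag = tag; this.text = text; }

    static Node element(String tag, Node... kids) {
        Node n = new Node(tag, null);
        for (Node k : kids) n.children.add(k);
        return n;
    }

    static Node text(String s) { return new Node(null, s); }
}

interface Output {
    void startNode(Node n);
    void endNode(Node n);
}

interface Action {
    // Fully handles n and its children, writing any replacement to out.
    void act(Node n, Processor p, Output out);
}

class Processor {
    // Active nodes are delegated to their Action; passive nodes are copied.
    void process(Node n, Output out) {
        if (n.action != null) {
            n.action.act(n, this, out);
        } else {
            out.startNode(n);
            for (Node c : n.children) process(c, out);
            out.endNode(n);
        }
    }
}

// An Output that writes characters directly and constructs no new nodes.
class StringOutput implements Output {
    final StringBuilder sb = new StringBuilder();

    public void startNode(Node n) {
        if (n.tag == null) sb.append(n.text);
        else sb.append('<').append(n.tag).append('>');
    }

    public void endNode(Node n) {
        if (n.tag != null) sb.append("</").append(n.tag).append('>');
    }
}
```

An Action that replaces its node with a single text node would call out.startNode and out.endNode on that text node; everything else in the document is copied through unchanged.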

Actions are associated with nodes at parse time; if it can be determined that a node and its children are all passive, they can be copied without invoking an action. Operations with no non-local side-effects (also determinable at parse time) can be parallelized.

Note that an Output, for example to a character stream, need not actually construct new nodes, so the new traversal-based interpreter is significantly more efficient than the old Token-based one. Copying could be done iteratively, i.e. without recursion, by keeping track of the current depth in the input tree. An Input does this, so this will be a trivial extension.
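
Iterative copying with a depth counter might look like the following sketch. TreeNode, Cursor, and IterativeCopier are hypothetical names used only here; Cursor plays the role of an Input in the sense that it moves through the tree depth-first and tracks its current depth, but its method names are assumptions and not the actual Input interface.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal tree node with a parent link (assumed shape).
class TreeNode {
    final String name;
    final List<TreeNode> children = new ArrayList<>();
    TreeNode parent;

    TreeNode(String name) { this.name = name; }

    TreeNode add(TreeNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }
}

// A depth-first cursor that tracks how far below the root it is.
class Cursor {
    private TreeNode current;
    private int depth = 0;

    Cursor(TreeNode root) { current = root; }

    TreeNode node() { return current; }

    boolean toFirstChild() {
        if (current.children.isEmpty()) return false;
        current = current.children.get(0);
        depth++;
        return true;
    }

    boolean toNextSibling() {
        if (depth == 0) return false;  // never leave the subtree being copied
        List<TreeNode> sibs = current.parent.children;
        int i = sibs.indexOf(current); // linear scan; acceptable for a sketch
        if (i + 1 >= sibs.size()) return false;
        current = sibs.get(i + 1);
        return true;
    }

    boolean toParent() {
        if (depth == 0) return false;
        current = current.parent;
        depth--;
        return true;
    }
}

class IterativeCopier {
    // Emits start and end tags for the whole tree without recursion.
    static String copy(TreeNode root) {
        StringBuilder out = new StringBuilder();
        Cursor in = new Cursor(root);
        while (true) {
            out.append('<').append(in.node().name).append('>');
            if (in.toFirstChild()) continue;
            // Leaf: close it, then climb until some ancestor has a next sibling.
            out.append("</").append(in.node().name).append('>');
            while (!in.toNextSibling()) {
                if (!in.toParent()) return out.toString();
                out.append("</").append(in.node().name).append('>');
            }
        }
    }
}
```

The depth counter is what lets the loop know when it has climbed back to the root and the copy is complete, so no call stack is needed.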

At the moment processing is done recursively; calling an action is expected to fully handle the corresponding node and its children. It would be simple to extend handlers to include start and end actions, and allow them to perform push and pop operations on the Processor that calls them. The result would actually be rather similar to the old Token-based interpreter. This is left for later.


Copyright © 1997 Ricoh Innovations, Inc.
$Id: internals.html,v 1.8 2001-01-11 23:36:47 steve Exp $