PIA Internals Manual
This manual describes the internal workings of the PIA, including the
Application Programming Interface.
The PIA is implemented in Java. Complete documentation of the Java packages
and classes can be found in the javadoc subdirectory of this Manual. This
documentation also includes the JavaDoc descriptions of the standard Java
packages used by the PIA.
There are two classes with usable main programs (several others have main
programs for test purposes):
- org.risource.pia.Pia
the main program of the PIA itself.
- org.risource.dps.Filter
a stand-alone document processor for use as a stream filter, document
processor, or CGI script. This replaces the earlier interpreter.
Both are accessible via small wrapper programs in the pia/bin directory:
- bin/pia (and pia.exe for Windows)
- bin/process (and process.exe for Windows)
Note that all are stand-alone applications. None of these programs can be
used inside a browser as applets; since the PIA is basically a server, an
applet would not be particularly useful.
Packages
The PIA is implemented using the following packages
(NOTE: the links may be broken depending on which version of the JDK the
documentation was compiled with. Try the index in javadoc/ to be sure):
- org.risource.ds
Generic data structures. Some were originally written with an
interface that makes it easy to translate similar constructs from Perl.
In particular, most return null on out-of-range keys or indices instead
of throwing an exception.
- JP.ac.osaka_u.ender.util.regex
A regular-expression package. Copyright (C) 1997 Shugo Maeda
under the GNU General Public License.
- org.risource.dom
The Document Object Model. This is an implementation of the
W3C's DOM draft specification. While the interfaces should be up to
date with the 1.0 spec, the current implementation has a few differences
(related to efficiently handling collections of nodes).
- org.risource.dps
The Document Processing System, which is used for processing executable
markup. It uses interfaces in org.risource.dom for representing
active parse trees; these are extended to include syntax and semantic
handlers.
- org.risource.dps.handle
Classes that handle particular active elements.
- org.risource.dps.Tagset
Classes and tagset definition files that implement the various tagsets.
- org.risource.pia
The PIA itself, and in general everything that deals with HTTP
Transactions.
- org.risource.pia.agent
``Handle'' classes that implement individual Agents.
- org.risource.tf
Classes that implement Transaction Features for Agents to match.
- org.risource.util
Utility classes, mainly to implement operations that belong on
standard Java objects, but have to be implemented statically because
many standard classes can't be extended.
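The null-on-miss convention described for org.risource.ds can be illustrated
with a minimal sketch. The class LenientList and its methods are hypothetical
and are not taken from the PIA packages; the sketch uses current Java syntax
and only shows the Perl-like behaviour of returning null for an out-of-range
index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not a class from org.risource.ds. */
final class LenientList<T> {
    private final List<T> items = new ArrayList<>();

    public void push(T item) {
        items.add(item);
    }

    /** Returns the item at index, or null when index is out of range. */
    public T at(int index) {
        if (index < 0 || index >= items.size()) {
            return null; // Perl-style: a miss is not an error
        }
        return items.get(index);
    }

    public static void main(String[] args) {
        LenientList<String> list = new LenientList<>();
        list.push("first");
        System.out.println(list.at(0)); // prints first
        System.out.println(list.at(5)); // prints null
    }
}
```

The trade-off is that callers must check for null themselves, since a
misspelled key or bad index no longer fails loudly.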
Handler classes in
org.risource.dps.handle,
org.risource.pia.agent, and
org.risource.tf are loaded by name when
needed.
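Loading a class by name can be done with standard Java reflection. The sketch
below is an assumption about how such loading might look, not the PIA's actual
loader: HandlerLoader is a hypothetical name, and java.lang.StringBuilder
merely stands in for a handler class so that the example runs anywhere.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical loader; the PIA's real mechanism may differ. */
final class HandlerLoader {
    private final Map<String, Object> cache = new HashMap<>();

    /**
     * Loads packageName.className on first use and creates an instance with
     * its no-argument constructor. Returns null if that is not possible.
     */
    public Object load(String packageName, String className) {
        String fullName = packageName + "." + className;
        if (cache.containsKey(fullName)) {
            return cache.get(fullName);
        }
        Object handler = null;
        try {
            handler = Class.forName(fullName).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Class missing or not instantiable: leave handler as null.
        }
        cache.put(fullName, handler); // remember failures too
        return handler;
    }

    public static void main(String[] args) {
        HandlerLoader loader = new HandlerLoader();
        Object found = loader.load("java.lang", "StringBuilder");
        Object missing = loader.load("no.such.pkg", "Missing");
        System.out.println(found != null); // prints true
        System.out.println(missing);       // prints null
    }
}
```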
Note:
This manual is still under construction. Suggestions about what else to
include would be greatly appreciated.
The heart of the PIA is a class called the ``Resolver'', which makes the
association between HTTP ``Transaction''s and the Agents that operate on
them. The Resolver operates in two phases:
- In the first phase, Agents are matched with the Transaction's Features
according to their match Criteria, and are allowed to act on the matched
Transaction.
- In the second, any Agents that have been registered (during the first
phase) as a ``handler'' are invoked to ``satisfy'' the Transaction, usually
by forwarding it to a client or server.
Clients and servers are represented in the PIA by proxy objects called
``Machine''s which contain the
streams that connect to the client or server, and possibly some associated
information. A Machine is
responsible for the communication protocol required by the client or server to
which it is connected.
The Resolver Algorithm
Given a resolver R:
1. Input:
Check for incoming messages. For each message M:
- Push M onto R's queue Q.
2. Next Transaction:
Shift the next transaction T from Q.
3. Resolution:
For each agent A:
- Match A's criteria against T's features.
- If A and T match, call A's actOn method with T and R.
4. Satisfaction:
For each object S on T's satisfiers queue:
- Call S's handle method with T and R.
5. Finishing:
If no S returned true,
- if T is a request, push an error response onto Q.
- if T is a response, forward it to its requestor.
6. Repeat from step 1.
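The algorithm above can be sketched in Java as follows. This is a simplified
illustration, not the PIA's Resolver: all class and method names other than
actOn and handle are assumptions, features are reduced to plain strings, and
the loop stops when the queue is empty instead of waiting for new messages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified resolver loop. */
final class ResolverSketch {
    /** An object that may satisfy a transaction; returns true on success. */
    interface Satisfier {
        boolean handle(Transaction t, ResolverSketch r);
    }

    /** An agent: match criteria plus an action. */
    interface Agent {
        boolean matches(Set<String> features);
        void actOn(Transaction t, ResolverSketch r);
    }

    /** A transaction reduced to a feature set and a satisfier queue. */
    static final class Transaction {
        final boolean isRequest;
        final Set<String> features = new HashSet<>();
        final Deque<Satisfier> satisfiers = new ArrayDeque<>();

        Transaction(boolean isRequest, String... features) {
            this.isRequest = isRequest;
            this.features.addAll(Arrays.asList(features));
        }
    }

    final Deque<Transaction> queue = new ArrayDeque<>();
    final List<Agent> agents = new ArrayList<>();
    final List<String> log = new ArrayList<>(); // records what happened

    /** Input step: queue an incoming message. */
    public void push(Transaction t) {
        queue.addLast(t);
    }

    /** Runs the remaining steps for every queued transaction. */
    public void run() {
        while (!queue.isEmpty()) {
            Transaction t = queue.pollFirst();   // next transaction
            for (Agent a : agents) {             // resolution
                if (a.matches(t.features)) {
                    a.actOn(t, this);
                }
            }
            boolean satisfied = false;           // satisfaction
            for (Satisfier s : t.satisfiers) {
                if (s.handle(t, this)) {
                    satisfied = true;
                }
            }
            if (!satisfied) {                    // finishing
                if (t.isRequest) {
                    push(new Transaction(false, "error"));
                    log.add("error response queued");
                } else {
                    log.add("response forwarded to requestor");
                }
            }
        }
    }

    public static void main(String[] args) {
        ResolverSketch r = new ResolverSketch();
        r.agents.add(new Agent() {
            public boolean matches(Set<String> features) {
                return features.contains("proxy");
            }
            public void actOn(Transaction t, ResolverSketch res) {
                // Register a handler during the first phase.
                t.satisfiers.add((tx, rs) -> {
                    rs.log.add("handled by agent");
                    return true;
                });
            }
        });
        r.push(new Transaction(true, "proxy"));   // matched and satisfied
        r.push(new Transaction(true, "unknown")); // no agent matches
        r.run();
        System.out.println(r.log);
        // prints [handled by agent, error response queued, response forwarded to requestor]
    }
}
```

The unmatched request shows the default path: it produces an error response,
which is itself a transaction and is forwarded on the next pass.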
An Agent in the PIA is represented internally by a class that implements the
Agent interface. In practice, all of them descend from GenericAgent. Many
Agents are, in fact, implemented directly by GenericAgent, the main exceptions
at the moment being Agency, Dofs, and Logo. Cache, when we implement it, will
probably also require its own class.
The main reason for implementing an Agent as a separate class is so that the
Java code can grab control before dispatching a URL to an active document.
Efficiency (Agency) and the ability to handle non-HTML data (Logo) also play a
part, although in fact Logo is currently implemented by dispatching to a Perl
program.
The Document Processing Algorithm
The algorithm has been totally rewritten and greatly simplified. It used to be a
variant of the PIA's Resolver algorithm, and operated on parse trees that had
been flattened into a stream of ``tokens'', with each element represented as
``start tag'' and ``end tag'' tokens.
The current version is much more efficient, and is based on a subset of the
DOM's TreeIterator interface: the Input is used for a unidirectional,
depth-first traversal of a parse tree. An Output is used for
constructing a parse tree, generic DOM Document, or stream. A
Processor sits between them; its state consists of the current Input, Output,
and entity binding table. (This is considerably simpler than the old
Processor state, which also included action bindings and traversal state.)
The Input and Output interfaces are designed so that one can copy a document
by traversing it with an Input and passing the resulting nodes to an Output.
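A minimal sketch of such a copy is shown below. The Node, Input, and Output
types here are hypothetical stand-ins with assumed method names; they are not
the org.risource.dps interfaces. The Input is a one-way, depth-first cursor and
the Output happens to build a new tree.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-ins for the DPS Input and Output interfaces. */
final class CopySketch {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) children.add(k);
        }
    }

    /** Receives nodes in document order. */
    interface Output {
        void startNode(Node n);
        void endNode(Node n);
    }

    /** Depth-first cursor; keeps the path from the root to the current node. */
    static final class Input {
        private final Deque<Node> path = new ArrayDeque<>();
        private final Deque<Integer> index = new ArrayDeque<>();
        Input(Node root) { path.push(root); }
        Node getNode() { return path.peek(); }
        boolean toFirstChild() {
            Node n = path.peek();
            if (n.children.isEmpty()) return false;
            path.push(n.children.get(0));
            index.push(0);
            return true;
        }
        boolean toNextSibling() {
            if (index.isEmpty()) return false;     // the root has no siblings
            Node current = path.pop();
            List<Node> siblings = path.peek().children;
            int next = index.peek() + 1;
            if (next >= siblings.size()) {
                path.push(current);                // stay where we are
                return false;
            }
            index.pop();
            index.push(next);
            path.push(siblings.get(next));
            return true;
        }
        void toParent() { path.pop(); index.pop(); }
    }

    /** An Output that constructs a fresh tree. */
    static final class TreeOutput implements Output {
        private final Deque<Node> open = new ArrayDeque<>();
        Node result;
        public void startNode(Node n) {
            Node copy = new Node(n.name);
            if (!open.isEmpty()) open.peek().children.add(copy);
            open.push(copy);
        }
        public void endNode(Node n) { result = open.pop(); }
    }

    /** Copies the subtree under the cursor by passing each node to out. */
    static void copy(Input in, Output out) {
        Node n = in.getNode();
        out.startNode(n);
        if (in.toFirstChild()) {
            do { copy(in, out); } while (in.toNextSibling());
            in.toParent();
        }
        out.endNode(n);
    }

    static String render(Node n) {
        StringBuilder sb = new StringBuilder("<" + n.name + ">");
        for (Node c : n.children) sb.append(render(c));
        return sb.append("</").append(n.name).append(">").toString();
    }

    public static void main(String[] args) {
        Node doc = new Node("a", new Node("b"), new Node("c", new Node("d")));
        TreeOutput out = new TreeOutput();
        copy(new Input(doc), out);
        System.out.println(render(out.result)); // prints <a><b></b><c><d></d></c></a>
    }
}
```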
Parse trees extend the DOM by decorating each Node with an Action object
(strategy pattern). Operations implemented by this technique are
local -- an active node in the input document is replaced by one or
more nodes in the output document. The resulting ``interpreter'' is simple
enough to fit on a single page.
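The idea of decorating nodes with Action objects can be sketched as follows.
This is an illustration under assumptions, not the DPS code: the Action
signature, the Node class, and the ``twice'' element are all hypothetical. A
passive node (no action) is copied; an active node is replaced by whatever its
action emits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of actions as strategies on parse-tree nodes. */
final class ActionSketch {
    /** Strategy attached to a node at parse time. */
    interface Action {
        /** Appends the replacement for n to out. */
        void act(Node n, List<Node> out);
    }

    static final class Node {
        final String name;
        final Action action;                 // null marks a passive node
        final List<Node> children = new ArrayList<>();
        Node(String name, Action action, Node... kids) {
            this.name = name;
            this.action = action;
            for (Node k : kids) children.add(k);
        }
    }

    /** The whole interpreter: dispatch to the action, or copy and recurse. */
    static void process(Node n, List<Node> out) {
        if (n.action != null) {
            n.action.act(n, out);            // active: the action emits the replacement
            return;
        }
        Node copy = new Node(n.name, null);  // passive: copy, then process children
        for (Node c : n.children) process(c, copy.children);
        out.add(copy);
    }

    static String render(Node n) {
        if (n.children.isEmpty()) return n.name;
        List<String> parts = new ArrayList<>();
        for (Node c : n.children) parts.add(render(c));
        return n.name + "(" + String.join(",", parts) + ")";
    }

    public static void main(String[] args) {
        // Hypothetical "twice" element: replaced by two copies of its processed content.
        Action twice = (n, out) -> {
            for (int i = 0; i < 2; i++) {
                for (Node c : n.children) process(c, out);
            }
        };
        Node doc = new Node("doc", null,
                new Node("p", null),
                new Node("twice", twice, new Node("x", null)));
        List<Node> result = new ArrayList<>();
        process(doc, result);
        System.out.println(render(result.get(0))); // prints doc(p,x,x)
    }
}
```

Because each action only writes to the output position it is given, the
replacement stays local to the active node, as described above.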
Actions are associated with nodes at parse time; if it can be determined that
a node and its children are all passive, they can be copied without invoking
an action. Operations with no non-local side-effects (also determinable at
parse time) can be parallelized.
Note that an Output, for example to a character stream, need not
actually construct new nodes, so the new traversal-based interpreter is
significantly more efficient than the old Token-based one. Copying could be
done iteratively, i.e. without recursion, by keeping track of the current
depth in the input tree. An Input does this, so this will be a trivial
extension.
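One way the iterative copy might look is sketched below. The Node and Input
classes are hypothetical stand-ins with assumed method names (the real Input
interface may differ); the point is only that a cursor which knows its own
depth is enough to copy a subtree to a character stream without recursion and
without constructing new nodes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a non-recursive copy driven by the cursor's depth. */
final class IterativeCopySketch {
    static final class Node {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<>();
        Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) {
                k.parent = this;
                children.add(k);
            }
        }
    }

    /** Cursor that tracks its depth relative to the starting node. */
    static final class Input {
        private Node current;
        private int depth = 0;
        Input(Node root) { current = root; }
        Node getNode() { return current; }
        int getDepth() { return depth; }
        boolean toFirstChild() {
            if (current.children.isEmpty()) return false;
            current = current.children.get(0);
            depth++;
            return true;
        }
        boolean toNextSibling() {
            if (depth == 0) return false;    // never leave the starting subtree
            List<Node> siblings = current.parent.children;
            int next = siblings.indexOf(current) + 1;
            if (next >= siblings.size()) return false;
            current = siblings.get(next);
            return true;
        }
        void toParent() {
            current = current.parent;
            depth--;
        }
    }

    /** Writes the subtree under the cursor as tags, using loops only. */
    static String copy(Input in) {
        StringBuilder out = new StringBuilder();
        out.append('<').append(in.getNode().name).append('>');
        while (true) {
            if (in.toFirstChild()) {
                out.append('<').append(in.getNode().name).append('>');
                continue;
            }
            // No children left: close nodes until a sibling turns up
            // or the cursor is back at the starting depth.
            while (true) {
                out.append("</").append(in.getNode().name).append('>');
                if (in.getDepth() == 0) return out.toString();
                if (in.toNextSibling()) {
                    out.append('<').append(in.getNode().name).append('>');
                    break;
                }
                in.toParent();
            }
        }
    }

    public static void main(String[] args) {
        Node doc = new Node("a", new Node("b"), new Node("c", new Node("d")));
        System.out.println(copy(new Input(doc))); // prints <a><b></b><c><d></d></c></a>
    }
}
```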
At the moment processing is done recursively; calling an action is expected to
fully handle the corresponding node and its children. It would be simple to
extend handlers to include start and end actions, and allow them to perform
push and pop operations on the Processor that calls them. The result would
actually be rather similar to the old Token-based interpreter. This is left
for later.
Copyright © 1997 Ricoh Innovations, Inc.
$Id: internals.html,v 1.8 2001-01-11 23:36:47 steve Exp $