Roadmap

This document describes, at a very high level, where I think the PIA ought to be going over the next few months.

See also: projects.html: Goes into considerable depth (including design details) on some of the projects mentioned here.
See also: Design/ (subdirectory): Contains detailed architectural and design notes on individual projects.

The Immediate Future

This section enumerates my plans for the immediate future. Numbers in parentheses give the current priority for the first few changes. (x) indicates an ongoing activity, (!) a completed one.

Goals

Support PIA application developers: by continuing to improve the functionality, usability, and understandability of the PIA and DPS.
Leverage existing technology (making PIA a sustainable technology): by adhering to existing standards, including XML, HTML, SAX, and DOM, as much as possible.
Make PIA useful in other contexts: by providing simple APIs that allow PIA technology to be incorporated into existing standards-based applications, including Apache and Servlet-based servers.

Top Priorities

(!) New directory structure -- basically a single tree with possible virtual entries, but no agents.
(!) Event-driven interface to Processor -- using the DPS as a SAX application falls out rather directly from this. It also leads to...
C-DPS -- using the DPS as an Apache module falls out from this.
SAX and DOM interfaces -- The only thing that's not trivial today is taking input from a SAX parser; that requires an event-driven interface to Processor.

New Directory Structure

See naming.html

(!) New ``single tree'' directory structure: Instead of the present complex system of search paths, mounted agents, etc., make the directory structure seen from the DPS match the directory structure seen from the browser. Directories will contain virtual entries, but this will be basically invisible inside the PIA: Java's mechanisms for locating files and reading directories will not be visible at the XML level.
(!) Decouple agents from URLs: Agents operate only on transactions; they don't get involved in serving files. The objects that represent virtual files and directories will be different.
Support WebDAV: The new directory structure will be a good match for WebDAV (which uses extensions to HTTP, and XML metadata).

Language

(!) Make it unnecessary to use entities as local variables: The idea is to define a constructor for elements, so that we don't have to use entities to refer to variables. This will require a fair amount of grunt work to fix all the .xh files. An alternative requiring more work would have been to add an attribute to any element that indicates that more attributes are contained in the content, or to use a special sub-element to set attributes.
Enforce correct entity processing: In any case, it will still be possible to use entities defined before processing the document, since the parser will be using the TopContext as its entity manager. We can enforce this behavior by always looking up entities at parse time, and always in the top context. This will cover most of the more common cases, and will make the processor more efficient as well.
#FIXED attributes: XML allows attribute values to be specified in the DTD. This is used for namespaces, for example. In the DPS, we could use this as a way of decorating ordinary DOM nodes with the information needed by the DPS; e.g. a handler hash table with keys in a DPS:handler attribute.
XML Namespaces: Along the way we need to support XML namespaces; in particular DPS and PIA namespaces. The PIA's current namespaces should properly be called ``entity namespaces'' or ``variable namespaces'', and the documentation should be updated to reflect this.

DPS

(x) General interface cleanup: The plan is to clean up the DPS interfaces (especially Action and Processor, which have accumulated a certain amount of cruft). It might be desirable to move some of the interfaces used internally (e.g. Action) down into subdirectories.
(3) Standard interfaces: ToSAX, FromDOM, ToDOM. Add SAXProxy wrapper for Processor. We will use the packages org.risource.sax and org.risource.dom (which already exists) for these, in order to give a clean separation between the ``user-level'' APIs and the internal ones. Ideally there should also be a stream interface at this level; it's not clear what to call it.
(4) Cursor cleanup: Improve the Cursor interface to the point where it can be used without creating Node instances in most cases. getNode(level) is only used to implement retainTree in BasicParser, and there are other ways to do it.
(!) Processor cleanup / event interface: The idea is not only to simplify Processor, but to add an event-driven interface to it, the better to handle SAX parsers and similar push-driven interfaces. The technique used is a specialized Output implementation called ToProcessor.
Event-driven Action interface: Add operations to the Action interface to improve efficiency when the Processor is being driven from an event interface like SAX.

Apache/C

See porting.html and Design/c-port.html

mod_pia: Implement a PIA module, probably using mod_java as a base. This would be a ``quick-and-dirty'' implementation just to see how the PIA fits into the Apache environment.
(2) C-DPS: Implement a C++ port of the DPS, possibly using Jade as the base. This would be easiest to do after giving Processor an event interface.
Apache/C-DPS integration: This would be greatly simplified if we skipped mod_java and started with C-DPS.

Other

(!) PIA command line: The PIA's command-line processing needed to be simplified; it used to be a kludge. It is now possible to do all configuration from the command line.
(x) Pia rationalization: Currently the org.risource.pia.Pia class is a nightmare. It needs to be transformed into something like a namespace or hashtable, and all the access functions that are still useful should be made static. There should be static caches for the most frequently-used variables (e.g. verbosity and log).
Binary file format: Devise a simple binary file format that maps well into C data structures, DOM trees, and SAX events.

PIA, DPS and Standard Interfaces

DPS: DOM and SAX

I see the transition of the DPS to standard interfaces (DOM and SAX) as proceeding in these stages:

Extend Output so that it can take a Cursor instead of a Node, or even node components (node name, node type, attribute list, value string, and content). It would be possible for a SAX parser or other event-driven parser to drive an Output almost directly at this point; an adapter would be needed to handle the differences in method names and parameter types.
Add an event-driven interface to Processor -- essentially, make a new kind of Output that drives a TopProcessor, rather than being driven by it. Equivalently, one could simply have TopContext extend Output.
In either case, this would require constructing subtrees for some (perhaps not all) active nodes. An alternative, event-oriented action method would probably be needed in the Action interface, or (more likely) a pair of them. In the default case these would simply pass the event (in the form of a cursor) to the Output. For active nodes, these would construct a subtree and call the original 3-argument action routine when the end tag is seen, but eventually many handlers would be rewritten for better performance.
It would be possible at this point to switch Parser from being an Input to being something that merely drives an Output.
Add the Input and Output implementations FromDocument, ToDocument, and ToSAX. (Document, of course, refers to the corresponding DOM class.) ToDocument already exists as a stub. FromSAX would require the use of two threads and a queue, so instead we implement DocumentProcessor, the DPS equivalent of the SAX DocumentHandler.
Extend Tagset to implement DocumentType (it's almost there now). This would probably be sufficient to allow our own DOM to be used in some other applications. (Not all -- in particular, getElementsByTagName is still unimplemented.) Replace all operations that create Nodes by calls on a Document's creation methods.
Extend the Cursor implementation so that it doesn't require the presence of an actual ActiveNode. (A previous implementation involved a DOM Node, but you shouldn't need that, either.) It may even be reasonable to virtualize Text nodes as strings. See Design/dom.html for details. There are fewer than 50 calls on getNode and getActive. More work needs to be done in order to replace all (most?) uses of NodeList with cursors as well.
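
The first two stages can be pictured as a small adapter sitting between a SAX parser and an Output. The following is only a sketch under stated assumptions: the nested Output interface and its three methods are hypothetical stand-ins (the real DPS Output interface is different), and the code uses the SAX2 DefaultHandler that ships with the JDK rather than the SAX1 DocumentHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxOutputSketch {
    /** Hypothetical stand-in for the DPS Output interface; method names are assumptions. */
    interface Output {
        void startNode(String name, Attributes attrs);
        void text(String value);
        void endNode(String name);
    }

    /** The adapter: it only renames methods and converts parameter types. */
    static class Adapter extends DefaultHandler {
        private final Output out;
        Adapter(Output out) { this.out = out; }

        @Override public void startElement(String uri, String local, String qName, Attributes attrs) {
            out.startNode(qName, attrs);
        }
        @Override public void characters(char[] ch, int start, int length) {
            out.text(new String(ch, start, length));
        }
        @Override public void endElement(String uri, String local, String qName) {
            out.endNode(qName);
        }
    }

    /** An Output that records what it is told, so the push-driven flow can be inspected. */
    static class Recorder implements Output {
        final List<String> events = new ArrayList<>();
        public void startNode(String name, Attributes attrs) {
            events.add("start " + name + " attrs=" + attrs.getLength());
        }
        public void text(String value) { events.add("text " + value); }
        public void endNode(String name) { events.add("end " + name); }
    }

    /** Parses the XML string and returns the events delivered to the Output. */
    static List<String> run(String xml) {
        Recorder recorder = new Recorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new Adapter(recorder));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return recorder.events;
    }

    public static void main(String[] args) {
        run("<doc a=\"1\"><p>hi</p></doc>").forEach(System.out::println);
    }
}
```

Here the parser drives the Output and nothing pulls from an Input, which is the inversion described in the second stage. A real adapter would also have to deal with entities and with building subtrees for active nodes.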

Note that if Output extends SAX, and Processor extends Output, it becomes possible to use the DPS to process XML delivered by a SAX parser. You couldn't use it on documents with embedded DPS control structures or entities, but you could use it for other kinds of processing. Actually, you could use it for XXML as well if you had a way to avoid entities; this would have other advantages. There are three possibilities:

Add a constructor for elements, so that an attribute list could be built dynamically. We need this anyway.
Define sub-elements as alternatives to every attribute. We might not want this in all cases, but a small number of standard elements would cover the common ones (e.g. name, href, ...).
Define a general-purpose sub-element to replace any attribute. This could be handled in BasicHandler and GenericHandler, so it's much simpler than the previous alternative. <o> (for ``option'') comes to mind.
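
The third possibility might look something like the sketch below, which folds <o> sub-elements back into attributes of their parent using the JDK's DOM classes. The name attribute on <o>, the foldOptions helper, and the example URL are all assumptions made for illustration; in the DPS this would presumably be done by the handlers, after the content of each <o> has been expanded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class OptionElementSketch {
    /** Turns each direct <o name="..."> child of el into an attribute of el, then recurses. */
    static void foldOptions(Element el) {
        // Snapshot the children first, because nodes are removed during the walk.
        List<Node> children = new ArrayList<>();
        for (Node n = el.getFirstChild(); n != null; n = n.getNextSibling()) {
            children.add(n);
        }
        for (Node n : children) {
            if (!(n instanceof Element)) continue;
            Element child = (Element) n;
            if (child.getTagName().equals("o") && child.hasAttribute("name")) {
                // The content of <o> becomes the attribute value.
                el.setAttribute(child.getAttribute("name"), child.getTextContent());
                el.removeChild(child);
            } else {
                foldOptions(child);
            }
        }
    }

    /** Parses an XML string and returns its root element. */
    static Element parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Element a = parse("<a><o name=\"href\">http://example.com/</o>link text</a>");
        foldOptions(a);
        System.out.println(a.getAttribute("href") + " | " + a.getTextContent());
    }
}
```

Because the attribute value arrives as ordinary content, it can be computed by any other element, so no entity reference is needed inside the start tag.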

The event-driven Processor is somewhat related to the old Token-oriented ``interform interpreter'' enshrined in pia/src/java/crc/interform -- not that it's a particularly good example.

XP is a fast XML parser written in Java by James Clark; it provides almost everything we need except that it doesn't seem to report external entities in attribute values. (All of the DPS's entities are effectively external.)

PIA: Servlet and Apache

The choices for the PIA are Servlet (which seems pretty obvious), EJB, and an Apache module. There are two different choices within Servlet.

The entire PIA (more specifically the Resolver) is a Servlet. This allows the PIA's own Agent name resolution and proxying to work cleanly. It reduces the parent server to a mere shell.
Each Agent is a Servlet. This bypasses the Resolver, but has the potential to greatly clean up the interface between the server portion of the PIA and the Agents, and probably gets rid of a lot of junk on Transaction as well.

The best is probably a hybrid approach: the Resolver is a Servlet in its own right, but delegates the actual response (respond method) to Agents that are also Servlets. This gives the option of wrapping ordinary third-party Servlets as Agents. Essentially, both Resolver.push and Agent.respond would get replaced by Servlet.service.
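
The shape of the hybrid can be sketched without the Servlet package itself. In the code below, Service is a hypothetical stand-in for Servlet.service (the real method takes ServletRequest and ServletResponse objects), and the agent name "History" and the rule that the first path segment names the agent are assumptions made only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HybridResolverSketch {
    /** Stand-in for Servlet.service(request, response); names and types are assumptions. */
    interface Service {
        void service(String path, StringBuilder response);
    }

    /** The Resolver is a Service in its own right, but delegates the response to an agent. */
    static class Resolver implements Service {
        private final Map<String, Service> agents = new LinkedHashMap<>();

        void register(String name, Service agent) {
            agents.put(name, agent);
        }

        @Override public void service(String path, StringBuilder response) {
            // Hypothetical convention: "/History/today" -> ["", "History", "today"].
            String[] parts = path.split("/", 3);
            Service agent = parts.length > 1 ? agents.get(parts[1]) : null;
            if (agent == null) {
                response.append("404 no agent for ").append(path);
            } else {
                agent.service(path, response);   // takes the place of Agent.respond
            }
        }
    }

    public static void main(String[] args) {
        Resolver resolver = new Resolver();
        // Anything with the Service shape can be registered, which is what would
        // allow an ordinary third-party Servlet to be wrapped as an Agent.
        resolver.register("History", (path, resp) -> resp.append("History handled ").append(path));
        StringBuilder resp = new StringBuilder();
        resolver.service("/History/today", resp);
        System.out.println(resp);
    }
}
```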

The Servlet package contains ServletRequest and ServletResponse interfaces that include most of the functionality of the PIA's Machine and Transaction classes. The thread machinery of Transaction and much of Content would still be required, and it's not clear whether Servlet is adequate for dealing with proxying. It's likely that we would have to retain Transaction in some form in order to support threads, XML representations, and some DPS features.

An Apache module is most easily implemented using mod_java, but any design should allow a C version of the PIA to be dropped in instead. Apache uses a unified transaction structure that contains both request and response information in one place. (Of course, if the servlet interface is used, it becomes even easier to interface the PIA to mod_java.)

C Port

See porting.html

We need to produce a sufficiently complete design sketch of a C/C++ port that some other group can take over the job.

Agents and Tagsets in XML

This is in progress. See Design/xml-world.html

The goal here is to represent as many PIA internal objects as possible in XML; in other words, to be able to write and read them as XML documents. (Agent and Namespace have been completed; Tagset is less critical: tagsets are already defined in XML, so writing them out is less necessary.)

Improved Document Representation

This is in progress. See Design/dom.html

The goal here is to make the internal representation of parse trees conform completely to the W3C's Document Object Model (this part has been completed as of 1999-04-15), and then to make the DPS almost completely independent of the DOM implementation by moving it to a cursor-based access model (still in progress).

Language Improvements

`When I use a word,' Humpty Dumpty said in rather a scornful tone, `it means just what I choose it to mean -- neither more nor less.'
`The question is,' said Alice, `whether you can make words mean so many different things.'
`The question is,' said Humpty Dumpty, `which is to be master -- that's all.'
-- Lewis Carroll

XML compliance

For better or for worse, XML has moved farther apart from HTML. Its proponents would like to drag HTML along for the ride, and use strict XML for everything on the Web. I don't necessarily hold with this view, but as long as the PIA has been touted as an XML-based system we have to go along at least part of the way, and make our language as strictly XML-compliant as we can.

At this point there are really only two places where we differ from XML:

Our use of namespaces diverges a great deal from XML's, to the point where an XML application that is aware of namespaces might not be able to deal with our .xh files. On the other hand it might; XML doesn't appear to recognize namespaces in entity names.
Our language was designed with SGML attribute minimization in mind. The resulting constructs (e.g. ``foo="foo"'') are often particularly ugly in XML.

Namespaces

In order to accommodate XML namespaces, we may want to use a different character as our namespace separator. Dash, period, and underscore are available; period would seem to be the best substitute.

Attributes

SGML allows attributes to be minimized in two different ways:

an attribute with no value specified evaluates to its name
by requiring that enumerated values be unique across all the attributes of an element, an enumerated value can be assigned to its proper attribute

HTML makes use only of the first but, as we will see below, we can take advantage of the second in writing a cleanup tagset.
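
As a rough illustration of what such a cleanup has to do, the sketch below expands a bare token from a start tag into an explicit name="value" pair, using both kinds of minimization. The expand helper and the table of enumerated values are hypothetical; only the op and sum names come from the examples in this document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MinimizedAttributeSketch {
    /**
     * Expands a bare token into a {name, value} pair. The enumerated map goes from
     * attribute name to allowed values, and relies on those values being unique
     * across all the attributes of the element.
     */
    static String[] expand(String token, Map<String, List<String>> enumerated) {
        for (Map.Entry<String, List<String>> e : enumerated.entrySet()) {
            if (e.getValue().contains(token)) {
                return new String[] { e.getKey(), token };   // second kind: sum -> op="sum"
            }
        }
        return new String[] { token, token };                // first kind: zero -> zero="zero"
    }

    public static void main(String[] args) {
        Map<String, List<String>> numeric = new LinkedHashMap<>();
        // Hypothetical value list for the op attribute.
        numeric.put("op", Arrays.asList("sum", "difference", "product"));
        for (String token : new String[] { "sum", "zero" }) {
            String[] pair = expand(token, numeric);
            System.out.println(pair[0] + "=\"" + pair[1] + "\"");
        }
    }
}
```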

There are basically five ways to ``pretty-up'' attributes:

Move the offending attributes into tag names. In other words, use <test.zero> instead of <test zero>.
Turn them from attribute names to values. For example, transform <numeric sum> into <numeric op="sum">.
Give their value a useful meaning.
Move from two attributes, one of which is boolean, to two attributes with values, one of which is optional. An example from the old language was <read file=filename> vs. <read href=url>.
Move data from attributes to sub-elements. We did that when we moved from <repeat list=list> to <repeat><foreach>list </foreach></repeat>.

Most likely, some combination of all of these will be required in order to redesign the language so that it is both fully XML-compliant and easily human-readable.

Implementation Note

GenericHandler already has a method called dispatch that can handle the first three methods above: it will recognize a keyword either as an attribute name, the value of a specified attribute, or a suffix to the tag name.
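
The idea can be sketched as follows. This is not the actual GenericHandler code: the signature, the use of a plain Map for the attribute list, and the order in which the three forms are checked are all assumptions.

```java
import java.util.Collections;
import java.util.Map;

class KeywordDispatchSketch {
    /**
     * Returns the first keyword found as a suffix of the tag name, as an attribute
     * name, or as the value of the attribute called selector; null if none matches.
     */
    static String dispatch(String tag, Map<String, String> attrs, String selector, String... keywords) {
        for (String k : keywords) {
            if (tag.endsWith("." + k)) return k;             // <numeric.sum>
            if (attrs.containsKey(k)) return k;              // <numeric sum>
            if (k.equals(attrs.get(selector))) return k;     // <numeric op="sum">
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> none = Collections.emptyMap();
        Map<String, String> minimized = Collections.singletonMap("sum", "sum");
        Map<String, String> explicit = Collections.singletonMap("op", "sum");
        // All three spellings select the same keyword.
        System.out.println(dispatch("numeric.sum", none, "op", "sum", "product"));
        System.out.println(dispatch("numeric", minimized, "op", "sum", "product"));
        System.out.println(dispatch("numeric", explicit, "op", "sum", "product"));
    }
}
```

Accepting all three forms in one place would let old and new spellings coexist while the .xh files are converted.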

Stephen R. Savitzky <steve@rii.ricoh.com>