This document describes, at a very high level, where I think the PIA ought to be going over the next few months.
This section enumerates my plans for the immediate future. Numbers in parentheses give the current priority for the first few changes. (x) indicates an ongoing activity, (!) a completed one.
See naming.html
.xh files. An alternative requiring more work would have been to add an attribute to any element that indicates that more attributes are contained in the content, or to use a special sub-element to set attributes.
#FIXED
attributes
DPS:handler
attribute.
XML Namespaces
org.risource.sax and org.risource.dom (which already exists) for these, in order to give a clean separation between the ``user-level'' APIs and the internal ones. Ideally there should also be a stream interface at this level; it's not clear what to call it.
getNode(level) is only used to implement retainTree in BasicParser, and there are other ways to do it.
ToProcessor
.
See porting.html and Design/c-port.html
mod_pia
mod_java as a base. This would be a ``quick-and-dirty'' implementation just to see how the PIA fits into the Apache environment.
mod_java and started with C-DPS.
The org.risource.pia.Pia class is a nightmare. It needs to be transformed into something like a namespace or hashtable, and all the access functions that are still useful should be made static. There should be static caches for the most frequently-used variables (e.g. verbosity and log).
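As a rough illustration of the namespace-plus-static-cache idea, here is a minimal sketch. The class name PiaProperties and all of its methods are hypothetical; nothing here is taken from the existing Pia class, and the choice of "verbosity" as a property key is an assumption.

```java
import java.io.PrintStream;
import java.util.Hashtable;

/** Hypothetical replacement for the Pia class: a static namespace, never instantiated. */
final class PiaProperties {
    // All settings live in one table, so the class behaves like a namespace.
    private static final Hashtable<String, String> table = new Hashtable<>();

    // Static caches for the most frequently-used variables.
    private static volatile int verbosity = 0;
    private static volatile PrintStream log = System.err;

    private PiaProperties() { }   // no instances

    /** Stores a property and refreshes the cache if it is a cached variable. */
    static synchronized void set(String name, String value) {
        table.put(name, value);
        if (name.equals("verbosity")) verbosity = Integer.parseInt(value);
    }

    /** Looks a property up in the table, falling back to a default. */
    static String get(String name, String dflt) {
        String v = table.get(name);
        return v == null ? dflt : v;
    }

    /** Cached accessor: no table lookup on the hot path. */
    static int verbosity() { return verbosity; }

    static void setLog(PrintStream stream) { log = stream; }

    /** Writes a message only if the cached verbosity is high enough. */
    static void debug(int level, String message) {
        if (verbosity >= level) log.println(message);
    }
}
```

Callers would then write PiaProperties.verbosity() or PiaProperties.debug(1, "...") without needing a reference to any Pia instance.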
I see the transition of the DPS to standard interfaces (DOM and SAX) as proceeding in these stages:
In either case, this would require constructing subtrees for some (perhaps not all) active nodes. An alternative, event-oriented action method would probably be needed in the Action interface, or (more likely) a pair of them. In the default case these would simply pass the event (in the form of a cursor) to the Output. For active nodes, these would construct a subtree and call the original 3-argument action routine when the end tag is seen, but eventually many handlers would be rewritten for better performance.
FromDocument, ToDocument, ToSAX. (Document, of course, refers to the corresponding DOM class.)
ToDocument already exists as a stub. FromSAX would require the use of two threads and a queue, so instead we implement DocumentProcessor, the DPS equivalent of SAX DocumentHandler.
getElementsByTagName is still unimplemented.) Replace all operations that create Nodes by calls on a Document's creation methods.
getNode and getActive. More work needs to be done in order to replace all (most?) uses of NodeList with cursors as well.
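The pair of event-oriented action methods mentioned in the stages above might look something like the following sketch. Every type in it (Output, Context, Tree, TreeBuilder, Action, ActiveAction) is a simplified, hypothetical stand-in rather than the real DPS interface, and events are passed as plain tag names instead of cursors to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Every type below is a hypothetical stand-in, not the real org.risource.dps API.

interface Output {
    void startElement(String tag);
    void text(String data);
    void endElement(String tag);
}

final class Context { }                        // placeholder for the processing context

final class Tree {                             // minimal parse-tree node
    final String tag;
    final List<Object> children = new ArrayList<>();   // each child is a Tree or a String
    Tree(String tag) { this.tag = tag; }
}

/** An Output that builds a subtree instead of passing events along. */
final class TreeBuilder implements Output {
    final Tree root;
    private final Deque<Tree> open = new ArrayDeque<>();
    TreeBuilder(String tag) { root = new Tree(tag); open.push(root); }
    public void startElement(String tag) {
        Tree t = new Tree(tag);
        open.peek().children.add(t);
        open.push(t);
    }
    public void text(String data) { open.peek().children.add(data); }
    public void endElement(String tag) { open.pop(); }
}

interface Action {
    /** Stand-in for the original tree-oriented, three-argument routine. */
    void action(Tree node, Context cxt, Output out);

    /** Start-tag event; returns the Output that receives the element's content. */
    default Output startAction(String tag, Context cxt, Output out) {
        out.startElement(tag);                 // default: pass the event through
        return out;
    }

    /** End-tag event; content is whatever startAction returned. */
    default void endAction(String tag, Context cxt, Output content, Output out) {
        out.endElement(tag);                   // default: pass the event through
    }
}

/** Active nodes: collect a subtree, then call the tree-oriented routine at the end tag. */
abstract class ActiveAction implements Action {
    @Override public Output startAction(String tag, Context cxt, Output out) {
        return new TreeBuilder(tag);
    }
    @Override public void endAction(String tag, Context cxt, Output content, Output out) {
        action(((TreeBuilder) content).root, cxt, out);
    }
}

final class EventActionDemo {
    static String run() {
        StringBuilder sb = new StringBuilder();
        Output out = new Output() {
            public void startElement(String tag) { sb.append('<').append(tag).append('>'); }
            public void text(String data) { sb.append(data); }
            public void endElement(String tag) { sb.append("</").append(tag).append('>'); }
        };
        Action passive = (node, cxt, o) -> { };          // never needs a subtree
        Action upper = new ActiveAction() {              // upper-cases its text content
            public void action(Tree node, Context cxt, Output o) {
                for (Object child : node.children) {
                    if (child instanceof String) o.text(((String) child).toUpperCase());
                }
            }
        };
        Context cxt = new Context();
        // Events for: <p>hi <upper>there</upper></p>
        Output inP = passive.startAction("p", cxt, out);
        inP.text("hi ");
        Output inUpper = upper.startAction("upper", cxt, inP);
        inUpper.text("there");
        upper.endAction("upper", cxt, inUpper, inP);
        passive.endAction("p", cxt, inP, out);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());                       // prints <p>hi THERE</p>
    }
}
```

Having startAction return the Output for the element's content is one possible design choice: it lets passive nodes stream straight through while active nodes quietly divert their content into a tree, without the caller needing to know which is which.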
Note that if Output extends SAX, and Processor extends Output, it becomes possible to use the DPS to process XML delivered by a SAX parser. You couldn't use it on documents with embedded DPS control structures or entities, but you could use it for other kinds of processing. Actually, you could use it for XXML as well if you had a way to avoid entities; this would have other advantages. There are three possibilities:
<o> (for ``option'') comes to mind.
The event-driven Processor is somewhat related to the old Token-oriented ``interform interpreter'' enshrined in pia/src/java/crc/interform -- not that it's a particularly good example.
XP is a fast XML parser written in Java by James Clark; it provides almost everything we need except that it doesn't seem to report external entities in attribute values. (All of the DPS's entities are effectively external.)
The choices for the PIA are Servlet (which seems pretty obvious), EJB, and an Apache module. There are two different choices within Servlet.
The best is probably a hybrid approach: the Resolver is a Servlet in its own right, but delegates the actual response (respond method) to Agents that are also Servlets. This gives the option of wrapping ordinary third-party Servlets as Agents. Essentially, both Resolver.push and Agent.respond would get replaced by Servlet.service.
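The hybrid approach could be sketched roughly as follows. To keep the example compilable with only the JDK, MiniServlet, MiniRequest and MiniResponse are hypothetical stand-ins for the real Servlet, ServletRequest and ServletResponse interfaces; the prefix-based lookup and the agent name are likewise assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-ins for the Servlet package, so the sketch needs nothing beyond the JDK.
interface MiniRequest { String path(); }
interface MiniResponse { void write(String text); }
interface MiniServlet { void service(MiniRequest req, MiniResponse res); }

/** Hypothetical Resolver: a servlet that hands each request to an agent servlet. */
final class ResolverServlet implements MiniServlet {
    private final Map<String, MiniServlet> agents = new LinkedHashMap<>();

    /** Any servlet can be registered as an agent, including a third-party one. */
    void register(String prefix, MiniServlet agent) { agents.put(prefix, agent); }

    /** Plays the role of Resolver.push: find an agent and let it respond. */
    @Override public void service(MiniRequest req, MiniResponse res) {
        for (Map.Entry<String, MiniServlet> e : agents.entrySet()) {
            if (req.path().startsWith(e.getKey())) {
                e.getValue().service(req, res);      // plays the role of Agent.respond
                return;
            }
        }
        res.write("no agent for " + req.path());
    }
}

final class ResolverDemo {
    static String handle(String path) {
        ResolverServlet resolver = new ResolverServlet();
        resolver.register("/History",
                          (req, res) -> res.write("history agent: " + req.path()));
        StringBuilder sb = new StringBuilder();
        resolver.service(() -> path, sb::append);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(handle("/History/today"));
    }
}
```

Because the Resolver and its agents share one interface, an agent cannot tell whether it was invoked by the Resolver or directly by a servlet container.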
The Servlet package contains ServletRequest and ServletResponse interfaces that include most of the functionality of the PIA's Machine and Transaction classes. The thread machinery of Transaction and much of Content would still be required, and it's not clear whether Servlet is adequate for dealing with proxying. It's likely that we would have to retain Transaction in some form in order to support threads, XML representations, and some DPS features.
An Apache module is most easily implemented using mod_java, but any design should allow a C version of the PIA to be dropped in instead. Apache uses a unified transaction structure that contains both request and response information in one place. (Of course, if the servlet interface is used, it becomes even easier to interface the PIA to mod_java.)
See porting.html
We need to produce a sufficiently complete design sketch of a C/C++ port that some other group can take over the job.
This is in progress. See Design/xml-world.html
The goal here is to represent as many PIA internal objects as possible in XML; in other words, to be able to write and read them as XML documents. (Agent and Namespace have been completed; Tagset is less critical since they are already defined in XML, so writing them out is less necessary.)
This is in progress. See Design/dom.html
The goal here is to make the internal representation of parse trees conform completely to the W3C's Document Object Model (this part has been completed as of 1999-04-15), and then to make the DPS almost completely independent of the DOM implementation by moving it to a cursor-based access model (still in progress).
`When I use a word,' Humpty Dumpty said in rather a scornful tone, `it means just what I choose it to mean -- neither more nor less.'
`The question is,' said Alice, `whether you can make words mean so many different things.'
`The question is,' said Humpty Dumpty, `which is to be master -- that's all.'
-- Lewis Carroll
For better or for worse, XML has moved farther apart from HTML. Its proponents would like to drag HTML along for the ride, and use strict XML for everything on the Web. I don't necessarily hold with this view, but as long as the PIA has been touted as an XML-based system we have to go along at least part of the way, and make our language as strictly XML-compliant as we can.
At this point there are really only two places where we differ from XML:
.xh files. On the other hand it might; XML doesn't appear to recognize namespaces in entity names.
foo="foo"'' are often particularly ugly in XML.
In order to accommodate XML namespaces, we may want to use a different character as our namespace separator. Dash, period, and underscore are available; period would seem to be the best substitute.
SGML allows attributes to be minimized in two different ways:
HTML makes use only of the first but, as we will see below, we can take advantage of the second in writing a cleanup tagset.
There are basically five ways to ``pretty-up'' attributes:
Most likely, some combination of all of these will be required in order to redesign the language so that it is both fully XML-compliant and easily human-readable.
GenericHandler already has a method called dispatch that can handle the first three methods above: it will recognize a keyword either as an attribute name, the value of a specified attribute, or a suffix to the tag name.
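The three ways of recognizing a keyword can be illustrated with a small sketch. This is not the actual dispatch method from GenericHandler; the class KeywordDispatch, its signature, the separator argument, and the sample tag and attribute names are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical sketch of keyword dispatch; not the real GenericHandler.dispatch. */
final class KeywordDispatch {
    /**
     * Returns true if keyword selects this element in any of three ways:
     * as an attribute name, as the value of the attribute named attr,
     * or as a suffix of the tag name following the separator sep.
     */
    static boolean matches(String tag, Map<String, String> attrs,
                           String attr, String keyword, char sep) {
        if (attrs.containsKey(keyword)) return true;          // e.g. <test zero>
        if (keyword.equals(attrs.get(attr))) return true;     // e.g. <test op="zero">
        return tag.endsWith(sep + keyword);                   // e.g. <test.zero>
    }

    public static void main(String[] args) {
        Map<String, String> none = Collections.emptyMap();
        System.out.println(matches("test", Collections.singletonMap("zero", "zero"), "op", "zero", '.'));
        System.out.println(matches("test", Collections.singletonMap("op", "zero"), "op", "zero", '.'));
        System.out.println(matches("test.zero", none, "op", "zero", '.'));
        System.out.println(matches("test", none, "op", "zero", '.'));
    }
}
```

The first three calls in main correspond to the three forms in turn, and the last shows an element that does not carry the keyword at all.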