This note discusses the status of the PIA with respect to the various standards it uses, follows, and in some cases fails to follow. In particular, it discusses the relationship between the PIA and:
Active documents in the PIA have two document types: the document type of the input, before processing, and the document type of the output, after processing.
It is our intention to ensure that output documents, i.e. the documents seen by a browser or other client, validate against whatever DTD their document type specifies. At the moment, we are using:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
Unfortunately, we have not actually validated any of our documents as of Release 2.0.2.
Active documents, e.g. .xh files, should really have their output document type generated automatically, with their input document type designating the tags actually contained in the document, i.e. the extended document type defined by the tagset.
As of Release 2.0.2, unfortunately, we don't do this. The document type, like other declarations, is simply passed through. It isn't even parsed, which is a (different) problem. We do plan to fix this.
Our intention is to generate SGML DTD's automatically from tagsets (using a tagset, naturally). We will have to add a few new constructs in order to make this happen.
We also intend to provide a utility to generate ``skeleton'' tagsets from DTD's. Naturally, all of the elements so defined would be passive, but it would be a good starting point for anyone developing a custom tagset.
First and foremost, what we call XHTML is not what the W3C calls XHTML. Theirs is a totally reworked DTD, forcing all of the original HTML tags into XML syntax. Ours is the original HTML DTD, with our own active-document tags (the Basic tagset basic.ts plus a few extensions) added in a way that conforms to HTML syntax, not XML.
It is our expectation that any parser that can handle HTML and recognize the XML `/>' convention for empty tags will be able to handle our extended HTML. After we devise a DTD for it, any SGML parser should be able to handle it.
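As a rough illustration of the `/>' convention, the sketch below treats a start tag as empty either when it ends in `/>' or when its name is on a list of elements declared empty. The class EmptyTagCheck, its methods, and the list of names are hypothetical stand-ins, not PIA code; in the PIA the list would presumably come from the tagset.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the PIA. */
class EmptyTagCheck {
    // Assumed stand-in for "elements that the tagset declares as being empty".
    private final Set<String> declaredEmpty = new HashSet<>();

    EmptyTagCheck(String... names) {
        for (String n : names) {
            declaredEmpty.add(n.toLowerCase(Locale.ROOT));
        }
    }

    /** Extracts the lower-cased element name from a raw start tag such as "<br/>". */
    static String tagName(String rawTag) {
        int i = 1; // skip the leading '<'
        while (i < rawTag.length()) {
            char c = rawTag.charAt(i);
            if (Character.isWhitespace(c) || c == '/' || c == '>') break;
            i++;
        }
        return rawTag.substring(1, i).toLowerCase(Locale.ROOT);
    }

    /** True if the tag uses the XML "/>" form or names a declared-empty element. */
    boolean isEmpty(String rawTag) {
        String t = rawTag.trim();
        return t.endsWith("/>") || declaredEmpty.contains(tagName(t));
    }
}
```

With "br", "hr", and "img" declared empty, both `<br>` and `<br/>` are reported as empty, while `<p>` is not. HTML element names are case-insensitive, which is why the sketch lower-cases them before comparing.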
Our XHTML is eXtended HTML, not XML! We should probably rename it. Like HTML and SGML, and unlike XML, our syntax is designed for human creation and editing. The differences are:
/> in elements that the tagset declares as being empty.
It is important to note, however, that it is trivial to make the output of the PIA's document processor conform to the W3C's XHTML DTD if that is considered desirable, and it probably will be at some point. With an appropriate command-line option the stand-alone process command would make an excellent HTML clean-up tool.
As noted above, our extended HTML documents are not XML. Our extensions are somewhat in the XML style, but they are not (yet) XML.
Nevertheless, it would be desirable in the future to move toward converting all of our sample code to XML, so as to make it easier to process using (other) standard tools. It is our intention to do so.
It is not our intention to abandon support for general SGML, however. We intend always to
XSL (the Extensible Stylesheet Language) is very similar in intent to the PIA's document-processing language. The unfortunate problem is that XSL drops back into JavaScript to perform any operations except the few that XSL defines elements for. We don't need to do that.
We should be able to define a tagset that uses the XSL tags ``on the outside'' but uses the PIA's tags for processing in situations where XSL drops into JavaScript. It would be significantly harder to convert between these two formats. This might make a good project for someone in the developer community.
Note that the PIA makes an excellent alternative to XSL in situations where a JavaScript interpreter is not present.
A useful near-term step would be to use the SAX interface to allow drop-in replacement of our (ad-hoc and rather buggy) parser with any standard SGML or XML parser.
Unfortunately, our current interface to the parser is totally different from SAX. SAX is ``event-driven'' -- it reads an input stream and calls a callback function whenever a new token is encountered. Our interface treats the parser as a ``virtual tree iterator'' -- the processor calls on the parser to traverse the input document's parse tree depth first.
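The difference between the two styles can be sketched in Java. In the push (SAX) half the parser owns the loop and calls back into a handler; in the pull half the caller owns the loop and asks a cursor for each step of a depth-first walk. The SAX and DOM calls are the standard JDK ones; the class PushVersusPull and its DepthFirstCursor are hypothetical and are not the PIA's parser interface. For brevity the pull half walks an in-memory DOM tree, which a true ``virtual'' iterator would not need to do.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PushVersusPull {

    /** Push style (SAX): the parser owns the loop and calls back into the handler. */
    static List<String> pushEvents(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start:" + qName);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    /** Pull style: the caller owns the loop and asks the cursor for each step. */
    static final class DepthFirstCursor {
        private final Node root;
        private Node current;
        private boolean entering = true; // is the next step a "start" of current?

        DepthFirstCursor(Node root) { this.root = root; this.current = root; }

        /** Returns the next "start:" or "end:" step, or null when the walk is finished. */
        String next() {
            if (current == null) return null;
            if (entering) {
                String step = "start:" + current.getNodeName();
                Node child = firstElement(current.getFirstChild());
                if (child != null) current = child; else entering = false;
                return step;
            }
            String step = "end:" + current.getNodeName();
            if (current == root) {
                current = null;
            } else {
                Node sibling = firstElement(current.getNextSibling());
                if (sibling != null) { current = sibling; entering = true; }
                else current = current.getParentNode();
            }
            return step;
        }

        /** Skips text, comments, and so on; returns the first element at or after n. */
        private static Node firstElement(Node n) {
            while (n != null && n.getNodeType() != Node.ELEMENT_NODE) n = n.getNextSibling();
            return n;
        }
    }

    static List<String> pullEvents(String xml) {
        Node root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<String> events = new ArrayList<>();
        DepthFirstCursor cursor = new DepthFirstCursor(root);
        for (String step = cursor.next(); step != null; step = cursor.next()) {
            events.add(step);
        }
        return events;
    }
}
```

Both methods report the same sequence of element boundaries for the same input; what differs is which side drives the loop, and that is the mismatch any adapter between the two interfaces has to bridge.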
There are three ways of interfacing the PIA's document processor with a SAX parser:
Action interface and its implementations, and some changes to Context and Processor (all in the org.risource.dps package).
Our intention is to make the SAX-style parsers an alternative to our own technique, not to abandon tree iterators entirely: they're far too attractive when you have enough memory to cache whole parse trees. Things will get a little more complicated, but not excessively so.
It's worth noting that our Output interface is very similar to SAX. Making the PIA's document processor look like a SAX parser will be trivial.
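To suggest why a push-style output maps so directly onto SAX, here is a minimal sketch. SimpleOutput is a hypothetical, much-simplified interface invented for this example; it is not the PIA's actual Output interface, whose methods are not shown in this note. Only the org.xml.sax types are real. Attributes, namespaces, and the startDocument/endDocument calls are left out.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, simplified push-style output; NOT the PIA's actual Output interface. */
interface SimpleOutput {
    void startElement(String name);
    void text(String s);
    void endElement(String name);
}

/** Forwards each SimpleOutput call to the matching SAX ContentHandler callback. */
class SaxForwardingOutput implements SimpleOutput {
    private final ContentHandler handler;

    SaxForwardingOutput(ContentHandler handler) { this.handler = handler; }

    @Override public void startElement(String name) {
        try {
            // No namespaces and no attributes in this sketch.
            handler.startElement("", name, name, new AttributesImpl());
        } catch (SAXException e) { throw new IllegalStateException(e); }
    }

    @Override public void text(String s) {
        try { handler.characters(s.toCharArray(), 0, s.length()); }
        catch (SAXException e) { throw new IllegalStateException(e); }
    }

    @Override public void endElement(String name) {
        try { handler.endElement("", name, name); }
        catch (SAXException e) { throw new IllegalStateException(e); }
    }
}

/** A SAX consumer that simply records what it is told, for demonstration. */
class RecordingHandler extends DefaultHandler {
    final StringBuilder log = new StringBuilder();

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        log.append('<').append(qName).append('>');
    }
    @Override public void characters(char[] ch, int start, int length) {
        log.append(ch, start, length);
    }
    @Override public void endElement(String uri, String local, String qName) {
        log.append("</").append(qName).append('>');
    }
}
```

A processor writing to a SaxForwardingOutput looks, from the handler's side, just like a SAX parser delivering events, so any existing SAX consumer could be attached without changes.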
Also see Design/dom.html.
As of Release 2.0.2 the PIA's Document Object Model (as defined in the org.risource.dom package) conforms to an earlier draft of the W3C's Document Object Model specification, not the recommendation as finally approved. Unfortunately, some classes on which we relied heavily (e.g. TreeIterator) disappeared in the interim, and a number of behavioral specifications (e.g. live nodelists) appeared that make an efficient implementation difficult.
Moreover, it was never the DOM's intention to be a representation for SGML documents, only HTML and XML documents and only in the context of a client-side scripting language like JavaScript. The DOM is totally unable to deal with documents that do not fit in memory all at once, or with applications that want to move nodes between documents.
The following steps are being taken:
Cursor and Output interfaces so that they do not expose the current Node -- all components of the underlying Node would then be accessible exclusively through the Cursor, with appropriate Input and Output objects returned where necessary instead of Nodelist's and so on.
Input and Output that traverse and build DOM trees explicitly.
active and tree packages under org.risource.dps) with a vastly simplified scheme that does not extend the DOM, making them completely self-contained.
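The idea of a cursor that never hands out the current Node can be sketched as follows. SimpleCursor and DomCursor are hypothetical names invented for this example and are not the PIA's Cursor interface; the DOM calls are the standard org.w3c.dom ones. The sketch covers navigation and reading only, and leaves out the Input and Output objects mentioned above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical cursor: callers navigate and read, but never receive a Node. */
interface SimpleCursor {
    String name();            // name of the current node
    String text();            // text content of the current node
    boolean toFirstChild();   // each move returns false, without moving, if impossible
    boolean toNextSibling();
    boolean toParent();
}

/** One possible implementation, backed by a DOM tree that stays private. */
class DomCursor implements SimpleCursor {
    private final Node root;
    private Node current;

    DomCursor(Node root) { this.root = root; this.current = root; }

    @Override public String name() { return current.getNodeName(); }
    @Override public String text() { return current.getTextContent(); }

    @Override public boolean toFirstChild() { return moveTo(current.getFirstChild()); }

    // The cursor is confined to the subtree under root.
    @Override public boolean toNextSibling() {
        return current != root && moveTo(current.getNextSibling());
    }
    @Override public boolean toParent() {
        return current != root && moveTo(current.getParentNode());
    }

    private boolean moveTo(Node target) {
        if (target == null) return false;
        current = target;
        return true;
    }

    /** Convenience for the example: parse a string and start at the document element. */
    static DomCursor parse(String xml) {
        try {
            return new DomCursor(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because client code sees only SimpleCursor, the DOM-backed implementation could later be swapped for one that reads directly from a parser, which is the kind of independence at the interface level that the plan aims for.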
That said, I expect that the near-term plan should be a mixed strategy: try to stay independent at the interface level, but base the implementation on the DOM.
The PIA is currently implemented in Java. It was originally implemented using Java 1.0.1 but rather quickly made the transition to 1.1. It is now implemented almost entirely in the intersection of 1.1 and 1.2, and everything but the cryptographic handlers compiles and runs in either.
As of Release 2.1 the PIA's package namespace follows current conventions, being entirely under org.risource.*. Do not be deceived by the existence of a src/java directory; this does not correspond to the java.* package namespace, but rather to the implementation language. Similarly, PERL modules will go into src/perl when we get around to writing them.
The naming of classes and methods is slightly less conventional, and reflects both history (the original implementation was in PERL) and architectural considerations. Specifically:
org.risource.ds.List are at(int index) and at(int index, Object value) rather than the more conventional getItem(int index) and setItem(int index, Object value). Similarly, the access methods for the ``name'' attribute of an Agent are name() and name(String newName).
org.risource.content are derived from MIME types, and the names of tag handlers in org.risource.dps.handle are derived from tag names.
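The accessor convention described above, in which reader and writer share one name and differ only in their arguments, looks roughly like this. DemoList and DemoAgent are hypothetical illustrations, not the real org.risource.ds.List or Agent classes; the push method and the void return types of the writers are assumptions made for the example.

```java
import java.util.ArrayList;

/** Hypothetical illustration of the at(...) convention; not org.risource.ds.List. */
class DemoList {
    private final ArrayList<Object> items = new ArrayList<>();

    /** Assumed helper so that the example has something to read back. */
    void push(Object value) { items.add(value); }

    /** Reader: takes the place of a conventional getItem(int index). */
    Object at(int index) { return items.get(index); }

    /** Writer: same name plus the new value, in place of setItem(int index, Object value). */
    void at(int index, Object value) { items.set(index, value); }
}

/** Hypothetical illustration of the name() convention; not the PIA's Agent. */
class DemoAgent {
    private String name;

    DemoAgent(String name) { this.name = name; }

    /** Reader for the "name" attribute. */
    String name() { return name; }

    /** Writer for the "name" attribute. */
    void name(String newName) { name = newName; }
}
```

Java resolves the overloads by argument count, so `list.at(0)` reads and `list.at(0, "x")` writes, much as a single Perl accessor would behave when called with or without a new value.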