PIA and Standards

This note discusses the status of the PIA with respect to the various standards it uses, follows, and in some cases fails to follow. In particular, it discusses the relationship between the PIA and SGML document types, XML and XHTML, SAX parsers, the DOM, and Java.

SGML and Document Types

Active documents in the PIA have two document types: the document type of the input, before processing, and the document type of the output, after processing.

Output Documents

It is our intention to ensure that output documents, i.e. the documents seen by a browser or other client, validate against whatever DTD their document type specifies. At the moment, we are using:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

Unfortunately, we have not actually validated any of our documents as of Release 2.0.2.

Input Documents

Active documents, e.g. .xh files, should really have their output document type generated automatically, with their input document type designating the tags actually contained in the document, i.e. the extended document type defined by the tagset.

As of Release 2.0.2, unfortunately, we don't do this. The document type, like other declarations, is simply passed through. It isn't even parsed, which is a (different) problem. We do plan to fix this.

Tagsets and DTDs

Our intention is to generate SGML DTDs automatically from tagsets (using a tagset, naturally). We will have to add a few new constructs in order to make this happen.

We also intend to provide a utility to generate "skeleton" tagsets from DTDs. Naturally, all of the elements so defined would be passive, but it would be a good starting point for anyone developing a custom tagset.


XML and XHTML

XHTML

First and foremost, what we call XHTML is not what the W3C calls XHTML. Theirs is a totally reworked DTD, forcing all of the original HTML tags into XML syntax. Ours is the original HTML DTD, with our own active-document tags (the Basic tagset basic.ts plus a few extensions) added in a way that conforms to HTML syntax, not XML.

It is our expectation that any parser that can handle HTML and recognize the XML "/>" convention for empty tags will be able to handle our extended HTML. After we devise a DTD for it, any SGML parser should be able to handle it.

Our XHTML is eXtended HTML, not XML! We should probably rename it. Like HTML and SGML, and unlike XML, our syntax is designed for human creation and editing. The differences from XML are:

  1. Our parser permits optional end tags. We find that this greatly improves readability and makes documents easier to create with HTML-aware editors.
  2. Our parser does not require a /> in elements that the tagset declares as being empty.
  3. Attribute values that are names need not be quoted.
  4. Boolean attributes can be minimized.
  5. Entities need not be defined. We use entities as variables.
  6. Namespaces need not be declared. We use namespaces as scopes for variables and tags.
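
To make these differences concrete, the following sketch feeds one small fragment per item to the standard XML parser that ships with the JDK (JAXP) and reports which ones it accepts. It does not use the PIA's parser at all. The class name, the sample fragments, and the entity name userName are our own illustrations, not taken from any PIA tagset.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustration only: asks a strict XML parser whether it accepts a fragment. */
final class StrictXmlCheck {

    /** Returns true if the JDK's XML parser accepts the text as well-formed. */
    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // undeclared prefixes become errors
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the fragment
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p>one<br/>two</p>",       // XML style, for comparison
            "<ul><li>one<li>two</ul>",  // 1. omitted end tags
            "<p>one<br>two</p>",        // 2. empty element without />
            "<td align=left>x</td>",    // 3. unquoted attribute value
            "<input checked/>",         // 4. minimized boolean attribute
            "<p>&userName;</p>",        // 5. undefined entity (hypothetical name)
            "<x:item/>"                 // 6. undeclared namespace prefix
        };
        for (String sample : samples) {
            System.out.println(isWellFormedXml(sample) + "  " + sample);
        }
    }
}
```

Only the first sample is expected to be accepted, which is why a document written in our extended HTML cannot simply be handed to an XML tool.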

It is important to note, however, that it is trivial to make the output of the PIA's document processor conform to the W3C's XHTML DTD if that is considered desirable, and it probably will be at some point. With an appropriate command-line option the stand-alone process command would make an excellent HTML clean-up tool.

XML

As noted above, our extended HTML documents are not XML. Our extensions are somewhat in the XML style, but they are not (yet) XML.

Nevertheless, it would be desirable in the future to move toward converting all of our sample code to XML, so as to make it easier to process using (other) standard tools. It is our intention to do so.

It is not our intention to abandon support for general SGML, however. We intend always to

XSL

XSL (the Extensible Stylesheet Language) is very similar in intent to the PIA's document-processing language. Unfortunately, XSL drops back into JavaScript for every operation except the few for which it defines elements. We don't need to do that.

We should be able to define a tagset that uses the XSL tags "on the outside" but uses the PIA's tags for processing in situations where XSL drops into JavaScript. It would be significantly harder to convert between these two formats. This might make a good project for someone in the developer community.

Note that the PIA makes an excellent alternative to XSL in situations where a JavaScript interpreter is not present.


SAX Parsers

A useful near-term step would be to use the SAX interface to allow drop-in replacement of our (ad-hoc and rather buggy) parser with any standard SGML or XML parser.

Unfortunately, our current interface to the parser is totally different from SAX. SAX is "event-driven" -- it reads an input stream and calls a callback function whenever a new token is encountered. Our interface treats the parser as a "virtual tree iterator" -- the processor calls on the parser to traverse the input document's parse tree depth first.

There are three ways of interfacing the PIA's document processor with a SAX parser:

  1. Build a complete parse tree and traverse it. This is trivial, so we will probably do it first. The disadvantage is that the entire tree has to live in memory, so it's unsuitable for large documents (fortunately we have very few of these) or limited-memory (e.g. embedded) applications.
  2. Run the parser and the processor in separate threads, with a queue of partial trees passing between them. This is not particularly hard, but it's not necessary, either.
  3. Extend the document processor to operate as a SAX back-end, i.e. in "push mode." In fact, an earlier version of the processor did operate in this mode, so we know that it's not particularly difficult. It will require adding an additional method to the Action interface and its implementations, and some changes to Context and Processor (all in the org.risource.dps package).
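
The first approach can be sketched with nothing more than the SAX parser bundled with the JDK. The handler below collects SAX events into a complete in-memory tree, which is then walked depth first in the way a tree-iterating processor would walk it. SimpleNode, TreeBuildingHandler, and SaxTreeDemo are hypothetical names invented for this sketch; they are not classes in org.risource.dps, and the real parse-tree classes carry far more information.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical tree node: either an element (name set) or text (text set). */
final class SimpleNode {
    final String name;
    final String text;
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

/** Collects SAX events into a complete in-memory tree. */
final class TreeBuildingHandler extends DefaultHandler {
    private final Deque<SimpleNode> open = new ArrayDeque<>(); // currently open elements
    private SimpleNode root;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        SimpleNode node = new SimpleNode(qName, null);
        if (open.isEmpty()) {
            root = node;
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several calls.
        if (!open.isEmpty()) {
            open.peek().children.add(new SimpleNode(null, new String(ch, start, length)));
        }
    }

    SimpleNode root() {
        return root;
    }
}

final class SaxTreeDemo {
    /** Runs the SAX parser to completion and returns the finished tree. */
    static SimpleNode parse(String xml) {
        try {
            TreeBuildingHandler handler = new TreeBuildingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.root();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Depth-first, pre-order walk standing in for the processor's traversal. */
    static void visit(SimpleNode node, List<String> out) {
        out.add(node.name != null ? "<" + node.name + ">" : node.text);
        for (SimpleNode child : node.children) {
            visit(child, out);
        }
    }

    public static void main(String[] args) {
        List<String> order = new ArrayList<>();
        visit(parse("<doc><p>one</p><p>two</p></doc>"), order);
        System.out.println(order);
    }
}
```

Note that the traversal cannot begin until parse has returned, which is exactly the memory cost described in item 1.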

Our intention is to make the SAX-style parsers an alternative to our own technique, not to abandon tree iterators entirely: they're far too attractive when you have enough memory to cache whole parse trees. Things will get a little more complicated, but not excessively so.

It's worth noting that our Output interface is very similar to SAX. Making the PIA's document processor look like a SAX parser will be trivial.


DOM

Also see Design/dom.html.

As of Release 2.0.2 the PIA's Document Object Model (as defined in the org.risource.dom package) conforms to an earlier draft of the W3C's Document Object Model specification, not the recommendation as finally approved. Unfortunately, some classes on which we relied heavily (e.g. TreeIterator) disappeared in the interim, and a number of behavioral specifications (e.g. live nodelists) appeared that make an efficient implementation difficult.
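
For readers unfamiliar with live nodelists, the following sketch shows the behavior in question using the DOM implementation bundled with the JDK, not org.risource.dom. A NodeList obtained once must reflect later changes to the tree, so an implementation cannot simply hand back a fixed array. The class and element names are our own.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class LiveNodeListDemo {

    /** Returns the length of one NodeList before and after a later insertion. */
    static int[] lengthsBeforeAndAfterInsert() {
        Document doc = newDocument();
        Element root = doc.createElement("list");
        doc.appendChild(root);
        root.appendChild(doc.createElement("item"));

        NodeList items = root.getElementsByTagName("item"); // obtained once
        int before = items.getLength();

        root.appendChild(doc.createElement("item"));        // tree changes afterwards
        int after = items.getLength();                       // same NodeList object

        return new int[] { before, after };
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] lengths = lengthsBeforeAndAfterInsert();
        System.out.println("before: " + lengths[0] + ", after: " + lengths[1]);
    }
}
```

The second length differs from the first even though the list was never fetched again, so every list handed out has to track changes to the tree or be recomputed on demand.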

Moreover, it was never the DOM's intention to be a representation for SGML documents, only HTML and XML documents and only in the context of a client-side scripting language like JavaScript. The DOM is totally unable to deal with documents that do not fit in memory all at once, or with applications that want to move nodes between documents.
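
The restriction on moving nodes can be seen with any standard DOM implementation. The sketch below uses the one bundled with the JDK rather than org.risource.dom: appending a node created by one document to another document is refused with WRONG_DOCUMENT_ERR. The importNode call shown afterwards was added in the later DOM Level 2 specification, and it makes a copy rather than moving the original. The class and element names are our own.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class CrossDocumentDemo {

    /** Returns true if appending a foreign node fails with WRONG_DOCUMENT_ERR. */
    static boolean directMoveIsRejected() {
        Document source = newDocument();
        Document target = newDocument();
        Element note = source.createElement("note");
        source.appendChild(note);
        target.appendChild(target.createElement("root"));
        try {
            target.getDocumentElement().appendChild(note); // node belongs to source
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.WRONG_DOCUMENT_ERR;
        }
    }

    /** Returns true if importNode leaves the original in place and adds a copy. */
    static boolean importCopiesInsteadOfMoving() {
        Document source = newDocument();
        Document target = newDocument();
        Element note = source.createElement("note");
        source.appendChild(note);
        target.appendChild(target.createElement("root"));

        Node copy = target.importNode(note, true);         // deep copy owned by target
        target.getDocumentElement().appendChild(copy);

        return copy != note
            && note.getOwnerDocument() == source
            && copy.getOwnerDocument() == target;
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("direct move rejected: " + directMoveIsRejected());
        System.out.println("import makes a copy:  " + importCopiesInsteadOfMoving());
    }
}
```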

The following steps are being taken:

  1. Our code has been updated to extend the DOM Level 1 core interfaces defined in www.w3.org/TR/1998/REC-DOM-Level-1-19981001/. In theory this would permit the PIA to use (by extension) any DOM implementation as a drop-in replacement. In practice it can't, because we still rely on some behavior that contradicts the spec.
  2. Revise the Cursor and Output interfaces so that they do not expose the current Node -- all components of the underlying Node would then be accessible exclusively through the Cursor, with appropriate Input and Output objects returned where necessary instead of NodeLists and so on.
  3. Add implementations of Input and Output that traverse and build DOM trees explicitly.
  4. Replace the current parse tree implementation (the active and tree packages under org.risource.dps) with a vastly simplified scheme that does not extend the DOM, making those packages completely self-contained.
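
To illustrate the idea behind steps 2 and 3, here is a minimal sketch of a cursor that traverses a DOM tree without ever handing the underlying Node to its caller. DomCursor and its method names are hypothetical, invented for this example; they are not the Cursor or Input interfaces in org.risource.dps, whose actual signatures are not shown in this note. The tree comes from the JDK's own DOM parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical cursor: callers see names and movement results, never the Node. */
final class DomCursor {
    private Node current; // deliberately not exposed

    DomCursor(Node start) {
        current = start;
    }

    String name() {
        return current.getNodeName();
    }

    boolean toFirstChild() {
        return moveTo(current.getFirstChild());
    }

    boolean toNextSibling() {
        return moveTo(current.getNextSibling());
    }

    boolean toParent() {
        return moveTo(current.getParentNode());
    }

    private boolean moveTo(Node next) {
        if (next == null) {
            return false; // stay where we are
        }
        current = next;
        return true;
    }
}

final class CursorDemo {
    /** Depth-first walk of the subtree under the cursor, using cursor moves only. */
    static List<String> depthFirstNames(DomCursor cursor) {
        List<String> names = new ArrayList<>();
        int depth = 0; // how far below the starting node we are
        while (true) {
            names.add(cursor.name());
            if (cursor.toFirstChild()) {
                depth++;
                continue;
            }
            // No children: climb until a sibling exists or we are back at the start.
            while (true) {
                if (depth == 0) {
                    return names;
                }
                if (cursor.toNextSibling()) {
                    break;
                }
                cursor.toParent();
                depth--;
            }
        }
    }

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        DomCursor cursor = new DomCursor(parseRoot("<doc><a><b/></a><c/></doc>"));
        System.out.println(depthFirstNames(cursor));
    }
}
```

Because the caller depends only on the cursor's methods, the same traversal code would work over a tree that is not a DOM at all, which is the independence that step 4 is after.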

That said, I expect that the near-term plan should be a mixed strategy: try to stay independent at the interface level, but base the implementation on the DOM.


Java

The PIA is currently implemented in Java. It was originally implemented using Java 1.0.1 but rather quickly made the transition to 1.1. It is now implemented almost entirely in the intersection of 1.1 and 1.2, and everything but the cryptographic handlers compiles and runs in either.

As of Release 2.1 the PIA's package namespace follows current conventions, being entirely under org.risource.*. Do not be deceived by the existence of a src/java directory; this does not correspond to the java.* package namespace, but rather to the implementation language. Similarly, PERL modules will go into src/perl when we get around to writing them.

The naming of classes and methods is slightly less conventional, and reflects both history (the original implementation was in PERL) and architectural considerations. Specifically:


Copyright © 1999 Ricoh Innovations, Inc.
$Id: standards.html,v 1.4 2001-01-11 23:36:51 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>