The XML World

External XML Representations for PIA Objects

Architectural Notes: Tags ``All the Way Down''

``The world is supported on the backs of four elephants standing on the shell of a giant turtle.''

``And what is the turtle standing on?''

``You're very clever, young man, but it's turtles all the way down.''

-- told of William James, Bertrand Russell, and others.

This section was originally in ../roadmap.html.

The goal here is to represent as many PIA internal objects as possible in XML. This is comparatively easy: we have an interface, ActiveElement, that satisfies both DOM and DPS requirements, and an implementation for it (currently called TreeGeneric) that is easy to subclass. (In a previous implementation phase, GenericAgent actually did descend from the class that implemented Elements.)

The parser is already capable of building elements with classes other than the default TreeElement, but this capability has not been tested and more work needs to be done. It would be particularly tricky to take something that looks like an element (e.g. <ENTITY>) and use it to construct something else, such as a TreeEntity. Again, this machinery is in the parser, but hasn't been tested.

The current technique is slightly different: use a handler to construct the object. Entities are represented using the <bind> tag. This has the disadvantage that entities cannot presently contain anything that requires a parse-time constructor (e.g. a Namespace), because <bind> quotes its contents.

There are two main ways to represent PIA objects as XML: make them subclasses of an existing ActiveElement implementation, or make them independent implementations. It is significantly easier to put them into existing structures if they inherit from the same base class as our existing parse trees, and it simplifies the implementation as well.

A third technique is to wrap the PIA objects. This was done with Agent (using NamespaceWrap), but although it makes it possible to put the wrapped objects into an XML document, it means that you have to build a separate handler to do the construction. Agents have now been re-implemented as subclasses of BasicNamespace, and the AgentBuilder handler constructs them. They do, however, require the use of a Processor.

Benefits

Some of the benefits that would result from an XML representation are:

Targets

Some of the objects that could be represented in XML (in approximate order of increasing complexity of implementation):

Tagsets and Namespaces (done 1999-05-06)
Actually, BasicTagset and BasicNamespace already inherit from TreeGeneric. The only difficulty is that there is no set of handlers to construct them directly from XML documents, and writing them out doesn't produce something that could be read back in because the implementation is incomplete. The <define> tag imposes an extra layer of indirection.
Agents (done 1999-04-30)
Presently these are simply wrapped as namespaces; it might work better if some components (name, path, and the directories, for example) were attributes. Not essential, though. There is a complication for agents: if you don't know both the home and user directories at initialization time, you don't know where to look for the XML metadata.
Match Criteria
Presently an agent's criteria are initialized by means of a rather simpleminded conversion from strings. It would be easier if they were real XML.
Transactions
There's a lot of complexity in Transaction and its subclasses that might go away if they were XML. There would be the further advantage that many of the computations that need to be done on them could be done via XXML rather than Java. It's not entirely clear that this would be a good idea, however.
Features
The little objects that compute transaction features could also be XML code, but this is probably not a good idea, since it would be very inefficient. It's not unreasonable to surface the features themselves as attributes, however.

Note that this list is not in order of implementation priority; it's not clear whether to do tagsets or agents first.

Advanced: XML serialization of arbitrary objects

There's actually a straightforward way to represent any Java object as an XML element:

It wouldn't surprise me much if there were already code available to do this. It has some problems: since it doesn't map easily into the DOM, one can't directly manipulate such an object from XXML code. Also, without some additional metadata, multiply-linked structures get very ugly and don't map into the most obvious XML representation. Probably what's needed is an id attribute on every node.
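As an illustration of the id-attribute idea, here is a minimal sketch in Java. It is only one possible scheme, not necessarily the one originally intended: the class names (ReflectiveXmlWriter, LinkedNode) and the element names (object, field, value, ref) are all invented for this example. It assumes that fields hold primitives, strings, or other application classes; arrays, collections, and JDK-internal classes are not handled properly.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical sketch: write an object graph as XML using reflection. */
final class ReflectiveXmlWriter {
    // Objects already written, keyed by identity, so that shared or cyclic
    // references become <ref idref="..."/> instead of being expanded again.
    private final Map<Object, Integer> ids = new IdentityHashMap<>();
    private final StringBuilder out = new StringBuilder();

    static String toXml(Object root) {
        ReflectiveXmlWriter w = new ReflectiveXmlWriter();
        w.write(root);
        return w.out.toString();
    }

    private void write(Object o) {
        if (o == null) { out.append("<null/>"); return; }
        if (o instanceof String || o instanceof Number
                || o instanceof Boolean || o instanceof Character) {
            out.append("<value class=\"").append(o.getClass().getName()).append("\">")
               .append(escape(o.toString())).append("</value>");
            return;
        }
        Integer seen = ids.get(o);
        if (seen != null) {                      // second visit: emit a reference only
            out.append("<ref idref=\"o").append(seen).append("\"/>");
            return;
        }
        int id = ids.size() + 1;                 // every object node gets an id
        ids.put(o, id);
        out.append("<object id=\"o").append(id)
           .append("\" class=\"").append(o.getClass().getName()).append("\">");
        // Walk the class hierarchy so inherited instance fields are included.
        for (Class<?> c = o.getClass(); c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                try {
                    f.setAccessible(true);
                    out.append("<field name=\"").append(f.getName()).append("\">");
                    write(f.get(o));
                    out.append("</field>");
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        out.append("</object>");
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

/** Hypothetical example class with a link that can form a cycle. */
final class LinkedNode {
    String label;
    LinkedNode next;
    LinkedNode(String label) { this.label = label; }
}
```

With two LinkedNode objects pointing at each other, the second visit to the first node is written as a ref element rather than recursing forever, which is the point of putting an id on every node.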


Design Notes

This section was originally in ../projects.html.

Agents really want to be elements, with sub-elements for hooks and criteria. (They should really be a subclass of Tagset, so that entities could be used for ``options'' and element handlers for actions.) Similarly, tagsets want to be purely declarative, probably using the XML schema tags.

Aside: Tagset should be a subclass of Namespace. There's actually an ambiguity here: building a Tagset should normally just be a matter of loading it. This implies a shift from the imperative (<define>, <set>) to the declarative (<element>, <entity>).

In order to pull this off, handlers need the ability to specify the class of the element being constructed. But they already have this! The relevant method is Syntax.createElement.

Elements are created by the parser, so that one can just suck an object in and use it. The handler already knows, of course, whether to wait until the whole element is present, but we do need to add the parser hooks required to actually build the element even if we're streaming. The class name wants to be in a field in the handler.

There are some subtleties involved: even if one is processing on the fly, the object gets built ``offline'' as a parse tree, then passed as a whole. That is, of course, more efficient than executing definitions every time a document is processed. It requires a distinction between declarations and actions, though. All the construction is done at parse time using the addChild operation in the node under construction. All the parser needs is a flag to say whether to build or pass. It also needs to call the new object's initializer when the end tag is seen.
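The build-or-pass flag and the end-tag initializer can be sketched as follows. This is a toy model under stated assumptions, not the PIA parser: SketchNode, SketchHandler and BuildOrPassParser are hypothetical names, the parser is driven by explicit startTag/text/endTag calls, and end tags are not checked against start tags.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical parse-tree node; stands in for the real node classes. */
class SketchNode {
    final String tag;
    final List<Object> children = new ArrayList<>();  // SketchNode or String
    boolean initialized = false;
    SketchNode(String tag) { this.tag = tag; }
    void addChild(Object child) { children.add(child); }
    /** Called by the parser once the end tag has been seen. */
    void initialize() { initialized = true; }
}

/** Hypothetical handler: the build flag plus the class to construct. */
class SketchHandler {
    final boolean build;                              // build offline, or pass through
    final Function<String, SketchNode> constructor;   // plays the role of the class-name field
    SketchHandler(boolean build, Function<String, SketchNode> constructor) {
        this.build = build;
        this.constructor = constructor;
    }
}

class BuildOrPassParser {
    private final Map<String, SketchHandler> handlers = new HashMap<>();
    private final Deque<SketchNode> underConstruction = new ArrayDeque<>();
    /** Passed-through events (as strings) and completed objects, in document order. */
    final List<Object> output = new ArrayList<>();

    void register(String tag, SketchHandler h) { handlers.put(tag, h); }

    void startTag(String tag) {
        SketchHandler h = handlers.get(tag);
        boolean building = !underConstruction.isEmpty() || (h != null && h.build);
        if (building) {
            // Inside a built element everything is collected, whatever its own flag says.
            underConstruction.push(h != null ? h.constructor.apply(tag) : new SketchNode(tag));
        } else {
            output.add("<" + tag + ">");
        }
    }

    void text(String s) {
        if (underConstruction.isEmpty()) output.add(s);
        else underConstruction.peek().addChild(s);
    }

    void endTag(String tag) {
        if (underConstruction.isEmpty()) { output.add("</" + tag + ">"); return; }
        SketchNode done = underConstruction.pop();
        done.initialize();                                   // initializer runs at the end tag
        if (underConstruction.isEmpty()) output.add(done);   // passed on as a whole
        else underConstruction.peek().addChild(done);        // attached at parse time
    }
}
```

Registering a handler with build set to true for, say, a namespace tag makes the parser hand the finished node to the output as a single object, while surrounding markup still streams through event by event.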

Clearly, if we want to use conditionals, repeats, or other computation to construct the object at run-time, we need to use actions instead. The pure declarative mode only works with quoted elements. We can of course mix them, and build a tree on the output after processing. There is a further point: entities in the attributes have to be expanded by the handler. (They should be anyway -- it's more efficient. A proper DOM implementation with getValue would do this automagically.)

There are advantages to the declarative approach, though. Apart from speed, there's reflection as well: an XML schema document would be trivial to analyze and transform. Clearly it's trivial for a tagset (in our current scheme) to output a purely declarative schema, so a transition would be simple. But this isn't really necessary: just construct the object as usual, and convert it to a string!

Once we get to this point, a lot of things can be simplified:

Similarly, transactions and their content should be represented as specialized DOM nodes. It's almost orthogonal, except that it so nicely unifies actions (with their various ways of handling content) and transactions, entities with machines, and so on.

We may continue to need ways of initializing agents (especially) using initialize.xh, at least as a bootstrapping mechanism. So the installer needs to check.


Implementation Notes

Entity Output

There's a problem: how do things like Entity nodes print?

The answer is that it really depends on their parent. If it's a DOCTYPE (declaration) node, then the Entity should get output as a declaration. If it's an Element (e.g. <NAMESPACE>), then the Entity should get output as an <ENTITY> element.

At the moment it's all kind of moot: we want XML representations, and we don't parse doctypes properly.

The current implementation uses <bind name="n">v</bind>.
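The three output forms discussed in this subsection can be put side by side in a small sketch. EntityPrintSketch and its OutputForm enum are hypothetical names, the name attribute on the <ENTITY> element is an assumption, and no escaping is attempted, so the value is assumed to contain no markup-significant characters.

```java
/** Hypothetical sketch: an entity whose printed form depends on where it sits. */
final class EntityPrintSketch {
    enum OutputForm { DECLARATION, ENTITY_ELEMENT, BIND_ELEMENT }

    final String name;
    final String value;

    EntityPrintSketch(String name, String value) {
        this.name = name;
        this.value = value;
    }

    /** Parent-driven choice: a DOCTYPE parent wants a declaration, an element parent wants an element. */
    static OutputForm formForParent(boolean parentIsDoctype) {
        return parentIsDoctype ? OutputForm.DECLARATION : OutputForm.ENTITY_ELEMENT;
    }

    String print(OutputForm form) {
        switch (form) {
            case DECLARATION:
                return "<!ENTITY " + name + " \"" + value + "\">";
            case ENTITY_ELEMENT:
                // The attribute name here is an assumption made for the example.
                return "<ENTITY name=\"" + name + "\">" + value + "</ENTITY>";
            default:
                // The form the current implementation is described as using.
                return "<bind name=\"" + name + "\">" + value + "</bind>";
        }
    }
}
```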

Agent and Entity Input

When an Entity is read in, it's really the parent (e.g. the Agent or Tagset) that should determine what to do with it, in the append method. There's a major complication in the fact that the parser will append the entity before its contents are seen, if a tree is being constructed. One would be tempted to do the construction with an Output, except that has the same problem!

The right thing is to initialize the new object after reading it. This could actually be done by putting a processing instruction at the end of the content! It's not a problem for Tagset or Namespace because the bindings just go right into the table, and nothing has to be done to their content after that. Agents have to be initialized.

This is currently sidestepped by making <AGENT> active, so that the action routine gets called before handling the content. This allows the handler to perform the necessary initialization. It is actually done in two places, because you need to load the new agent's tagset in order to parse the content correctly:

There's another complication: the Agent's tagset must be available while the Agent is being read in. Passing it as an attribute or the first entity wouldn't work, because the node under construction doesn't know about the parser! Another alternative is to update all the handlers when the Agent is initialized. That would be slow.

One correct solution is to put the tagset's filename into the DOCTYPE (where it belongs anyway) or into a processing instruction (targeted at the parser). Another is to separate construction from parsing, which is what we do.

 

Ideally one would like to be able to read an <ENTITY> element, for example, and construct a TreeEntity node. The problem is that the parser has called the Handler's createElement method and is really expecting an Element back. It would be better if it set the tagname variable first (so that the end tag has something to match), but then left the node alone to be whatever type got returned. The only alternative would be to make almost everything a TreeGeneric, which is ugly.

It seems as though createElement and createActiveElement are only called in a very small number of places, so it's pretty painless to have them return an ActiveNode (which in most cases is all the caller needs, anyway) and check the return type as needed. Document.createElement is the obvious exception; it may be necessary to resort to wrappers here.
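A sketch of that change, with invented stand-ins for the real interfaces: every type below carries a Sketch suffix to make clear it is not the PIA API, and the rule that an ENTITY tag yields a non-element node is simply the example from the text.

```java
/** Hypothetical stand-in for the general node interface. */
interface ActiveNodeSketch {
    String getNodeName();
}

/** Hypothetical stand-in for the element interface. */
interface ActiveElementSketch extends ActiveNodeSketch {
    String getTagName();
}

class ElementSketch implements ActiveElementSketch {
    private final String tagName;
    ElementSketch(String tagName) { this.tagName = tagName; }
    public String getNodeName() { return tagName; }
    public String getTagName() { return tagName; }
}

/** A node that is read from something shaped like an element but is not one. */
class EntityNodeSketch implements ActiveNodeSketch {
    private final String name;
    EntityNodeSketch(String name) { this.name = name; }
    public String getNodeName() { return name; }
}

class SyntaxSketch {
    /** Returns the general node type, so a handler may hand back a non-element. */
    ActiveNodeSketch createElement(String tagName) {
        if (tagName.equalsIgnoreCase("ENTITY")) return new EntityNodeSketch(tagName);
        return new ElementSketch(tagName);
    }
}

class ParserSketch {
    /** Recorded before construction, so the end tag always has something to match. */
    String currentTagName;

    ActiveNodeSketch open(SyntaxSketch syntax, String tagName) {
        currentTagName = tagName;
        return syntax.createElement(tagName);   // left alone, whatever type comes back
    }

    /** Callers that really need an element check the return type. */
    static String describe(ActiveNodeSketch n) {
        if (n instanceof ActiveElementSketch) {
            return "element " + ((ActiveElementSketch) n).getTagName();
        }
        return "non-element node " + n.getNodeName();
    }
}
```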

 

Of course, the whole gamut of problems simply goes away if we use a tagset and handlers to construct the objects, or (perhaps even better) if we move the construction into an Output that can do the necessary translation. In the latter case the parse tree could be totally passive, and the constructed objects need not be even remotely related to DOM objects. Using an Output has the additional advantage that it can be driven by a SAX parser.

There's an intermediate position, though, that takes good advantage of the way the system works: Construct objects of appropriate classes in the Parser, and assemble them into structures in an Output. This requires that the intermediate objects, at least, be implementations of ActiveNode, but that's not really a problem in the PIA where even an Agent is basically just a specialized Namespace. This is similar to what TagsetProcessor is doing; the difference is that no processing needs to be done, just a copy.

This, in fact, is the way we do it. We use ToNamespace and ToAgent Outputs along with the <bind> element, which is processed normally.
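To make the shape of such an Output concrete, here is a rough sketch driven by the standard Java SAX API rather than by the PIA's own Output interface, which is not reproduced here. ToNamespaceSketch is a hypothetical name, the namespace is reduced to a map of strings, and markup nested inside a <bind> is flattened to its text, which the real implementation presumably does not do.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: assemble name/value bindings from bind elements. */
class ToNamespaceSketch extends DefaultHandler {
    private final Map<String, String> bindings = new LinkedHashMap<>();
    private String pendingName;          // name attribute of the bind being copied
    private StringBuilder pendingValue;  // non-null only while inside a bind

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (pendingValue == null && "bind".equals(qName)) {
            pendingName = attrs.getValue("name");
            pendingValue = new StringBuilder();
        }
        // Tags nested inside a bind are dropped in this sketch; only text is kept.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (pendingValue != null) pendingValue.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (pendingValue != null && "bind".equals(qName)) {
            if (pendingName != null) bindings.put(pendingName, pendingValue.toString());
            pendingName = null;
            pendingValue = null;
        }
    }

    /** Parses a document and returns the bindings in document order. */
    static Map<String, String> read(String xml) {
        ToNamespaceSketch out = new ToNamespaceSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), out);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse namespace document", e);
        }
        return out.bindings;
    }
}
```

Nothing is processed here: the handler only copies what it is given into a structure, so the parse tree can stay passive.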

There's also a more extreme version that requires even less intervention: make an I/O pair called ToTabular and FromTabular! That's disgusting: I like it. Not really suitable for things, like agents, that might contain real XML, but it's great for headers and Property objects.
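What such a pair might look like for headers, as a guess: the "name: value" line format below is an assumption, as is the class name TabularSketch. Names are assumed to contain no colon and values no newline, which is why this would not suit anything containing real XML.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a ToTabular / FromTabular pair for flat properties. */
final class TabularSketch {
    /** Writes one "name: value" line per property, in map order. */
    static String toTabular(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    /** Reads the lines back; the first colon on each line separates name from value. */
    static Map<String, String> fromTabular(String text) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) continue;            // skip blank or malformed lines
            props.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return props;
    }
}
```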

Agent Representation

An Agent does not have to be an extension of TreeGeneric -- TreeElement would do. Note that its name as a namespace wants to be AGENT anyway. On the other hand, BasicNamespace is an extension of TreeGeneric, so it's probably moot. On the other other hand, BasicNamespace probably should not descend from TreeGeneric. The only things that do are AbstractHandler, BasicTagset, BasicNamespace, and NamespaceWrap. TreeDocument perhaps should. It seems no longer necessary for TreeGeneric to have a value component.

TreeGeneric should go back to the function of providing separate names for a node when used as an element or something else. In other words, when it's an element the nodeName should be the tagName, and the ``other'' name should be the name attribute.

The correct way to implement Agent is to make an implementation of Namespace called NamespaceExtend. The intent would be similar to NamespaceWrap except that instead of always going to get and set on the wrapped object, these would instead always go through the bindings. Just the opposite of NamespaceWrap, in fact. The bindings would then include specialized implementations of ActiveEntity that go directly to access methods on the extended object.
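A minimal sketch of the idea, with hypothetical names throughout (BindingSketch, StoredBinding, AccessorBinding, NamespaceExtendSketch, AgentSketch) and values reduced to strings. Every get and set goes through the bindings table; a specialized binding forwards to the accessor methods of the extended object.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical stand-in for an entity binding. */
interface BindingSketch {
    String getValue();
    void setValue(String value);
}

/** Ordinary binding: the value lives in the binding itself. */
final class StoredBinding implements BindingSketch {
    private String value;
    StoredBinding(String value) { this.value = value; }
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}

/** Specialized binding: forwards to access methods on the extended object. */
final class AccessorBinding implements BindingSketch {
    private final Supplier<String> getter;
    private final Consumer<String> setter;
    AccessorBinding(Supplier<String> getter, Consumer<String> setter) {
        this.getter = getter;
        this.setter = setter;
    }
    public String getValue() { return getter.get(); }
    public void setValue(String value) { setter.accept(value); }
}

/** Lookups always go through the bindings, never straight to the object. */
class NamespaceExtendSketch {
    private final Map<String, BindingSketch> bindings = new LinkedHashMap<>();

    void bind(String name, BindingSketch binding) { bindings.put(name, binding); }

    String get(String name) {
        BindingSketch b = bindings.get(name);
        return b == null ? null : b.getValue();
    }

    void set(String name, String value) {
        BindingSketch b = bindings.get(name);
        if (b == null) bindings.put(name, new StoredBinding(value));
        else b.setValue(value);
    }
}

/** Hypothetical agent: one field is exposed through an accessor binding. */
class AgentSketch extends NamespaceExtendSketch {
    private String name;

    AgentSketch(String name) {
        this.name = name;
        bind("name", new AccessorBinding(this::getName, this::setName));
    }

    String getName() { return name; }
    void setName(String name) { this.name = name; }
}
```

Setting "name" through the namespace updates the agent's own field, while setting an unknown name simply creates an ordinary stored binding, so the object behaves as a namespace without a wrapper in between.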


Copyright © 1997-1999 Ricoh Innovations, Inc.
$Id: xml-world.html,v 1.6 2001-01-11 23:36:55 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>