``The world is supported on the backs of four elephants standing on the shell of a giant turtle.''
``And what is the turtle standing on?''
``You're very clever, young man, but it's turtles all the way down.''
-- told of William James, Bertrand Russell, and others.
This section was originally in ../roadmap.html.
The goal here is to represent as many PIA internal objects as possible in
XML. This is comparatively easy: we have an interface, ActiveElement,
that satisfies both DOM and DPS requirements, and an implementation for it
(currently called TreeGeneric) that is easy to
subclass. (In a previous implementation phase, GenericAgent actually
did descend from the class that implemented Elements.)
The parser is already capable of building elements with classes other than the default TreeElement, but this capability has not been tested and more work needs to be done. It would be particularly tricky to take something that looks like an element (e.g. <ENTITY>) and use it to construct something else, like a TreeEntity. Again, this machinery is in the parser, but hasn't been tested.
The current technique is slightly different: use a handler to construct
the object. Entities are represented using the <bind>
tag. This has the disadvantage that entities cannot presently contain
anything that requires a parse-time constructor (e.g. a Namespace),
because <bind>
quotes its contents.
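For example, a simple entity binding written this way might look like the following (the name and value are invented for illustration):

    <bind name="greeting">Hello, world</bind>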
There are two main ways to represent PIA objects as XML: make them subclasses of an existing ActiveElement implementation, or make them independent implementations. It is significantly easier to put them into existing structures if they inherit from the same base class as our existing parse trees, and it simplifies the implementation as well.
A third technique exists: to wrap the PIA objects. This was done with Agent (using NamespaceWrap), but although it makes it possible to put the wrapped objects into an XML document, it means that you have to build a separate handler to do the construction. Agents have now been re-implemented as subclasses of BasicNamespace, and the AgentBuilder handler constructs them. They do, however, require the use of a Processor.
Some of the benefits that would result from an XML representation come from eliminating steps in the current scheme, in which one must construct a parse tree and then rely on defineHandler to interpret the definition tags it contains. An XML representation would eliminate all but the first step: constructing the parse tree would in itself construct the object. Even the current scheme, using <bind> and a handler, is more efficient, because bindHandler is significantly simpler than defineHandler.
Some of the objects that could be represented in XML (in approximate order of increasing complexity of implementation) include tagsets and agents. It would be nice if some of their properties (the name, path, and the directories, for example) were attributes. Not essential, though.
There is a complexity for agents: if you don't know both the
home and user directories at initialization time, you don't know where
to look for the XML metadata.
Note that this list is not in order of implementation priority; it's not clear whether to do tagsets or agents first.
There's actually a straightforward way to represent any Java object as an XML element:
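Presumably something along the lines of the following reflection-based sketch (a hypothetical helper, not part of the PIA code base): the class name becomes the tag, and each field becomes a child element.

    import java.lang.reflect.Field;

    // Hypothetical sketch: render any Java object as an XML element via reflection.
    public class XmlReflector {
      public static String toXml(Object obj) throws IllegalAccessException {
        Class<?> cls = obj.getClass();
        StringBuilder out = new StringBuilder();
        out.append('<').append(cls.getSimpleName()).append('>');
        for (Field f : cls.getDeclaredFields()) {
          f.setAccessible(true);
          // Recursing on field values would loop on multiply-linked structures;
          // see the note below about needing an id attribute on every node.
          out.append('<').append(f.getName()).append('>')
             .append(f.get(obj))
             .append("</").append(f.getName()).append('>');
        }
        out.append("</").append(cls.getSimpleName()).append('>');
        return out.toString();
      }
    }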
It wouldn't surprise me much if there were already code available to do this. It has some problems: since it doesn't map easily into the DOM, one can't directly manipulate such an object from XML code. Also, without some additional metadata, multiply-linked structures get very ugly and don't map into the most obvious XML representation. Probably what's needed is an id attribute on every node.
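For instance, a multiply-linked node could be written out once with an id attribute and referred to elsewhere by that id (the attribute names are illustrative only):

    <node id="n42">...</node>
    <other-parent><node ref="n42"/></other-parent>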
This section was originally in ../projects.html.
Agents really want to be elements, with sub-elements for hooks and criteria. (They should really be a subclass of Tagset, so that entities could be used for ``options'' and element handlers for actions.) Similarly, tagsets want to be purely declarative, probably using the XML schema tags.
Aside: Tagset should be a subclass of Namespace. There's actually an ambiguity here: building a Tagset should normally just be a matter of loading it. This implies a shift from the imperative (<define>, <set>) to the declarative (<element>, <entity>).
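Roughly, the shift would look like this (the attribute names are guesses, not the actual tagset syntax):

    imperative:    <define element="foo"> ... </define>
    declarative:   <element name="foo"> ... </element>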
In order to pull this off, handlers need the ability to specify the
class of the element being constructed. But they already have
this! The relevant method is Syntax.createElement.
Elements are created by the parser, so that one can just suck an object in and use it. The handler already knows, of course, whether to wait until the whole element is present, but we do need to add the parser hooks required to actually build the element even if we're streaming. The class name wants to be in a field in the handler.
There are some subtleties involved: even if one is processing on the fly, the object gets built ``offline'' as a parse tree, then passed as a whole. That is, of course, more efficient than executing definitions every time a document is processed. It requires a distinction between declarations and actions, though. All the construction is done at parse time using the addChild operation in the node under construction. All the parser needs is a flag to say whether to build or pass. It also needs to call the new object's initializer when the end tag is seen.
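A minimal sketch of the idea, using hypothetical names rather than the real Syntax and Handler API: the handler carries the class of node to construct and a flag saying whether to build or pass, and the parser calls the new node's initializer when the end tag is seen.

    // A minimal, hypothetical sketch -- not the actual Syntax/Handler API.
    interface Node {
      void addChild(Node child);   // construction happens at parse time
      void initialize();           // called once the end tag has been seen
    }

    abstract class Handler {
      // The class of node this handler constructs (the "field in the handler").
      Class<? extends Node> nodeClass;
      // Whether to build a subtree (declarative) or pass events through (active).
      boolean buildsSubtree = true;

      Node createNode() throws ReflectiveOperationException {
        return nodeClass.getDeclaredConstructor().newInstance();
      }
    }

    class ParserSketch {
      Node startElement(Handler h) throws ReflectiveOperationException {
        return h.createNode();                  // object of the handler's class
      }
      void childComplete(Node parent, Node child) {
        parent.addChild(child);                 // assemble the parse tree offline
      }
      void endElement(Node node, Handler h) {
        if (h.buildsSubtree) node.initialize(); // initializer runs at the end tag
      }
    }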
Clearly, if we want to use conditionals, repeats, or other computation to
construct the object at run-time, we need to use actions instead. The
pure declarative mode only works with quoted elements. We can of course
mix them, and build a tree on the output after processing. There is a
further point: entities in the attributes have to be expanded by
the handler. (They should be anyway -- it's more efficient. A proper DOM
implementation with getValue
would do this automagically.)
There are advantages to the declarative approach, though. Apart from speed, there's reflection as well: an XML schema document would be trivial to analyze and transform. Clearly it's trivial for a tagset (in our current scheme) to output a purely declarative schema, so a transition would be simple. But this isn't really necessary: just construct the object as usual, and convert it to a string!
Once we get to this point, a lot of things can be simplified:
An agent can be loaded simply by parsing its agent.xml file with the agent tagset. initialize.xh goes away except for agents that really need to execute code on startup. Even then, an initialization entity should do most of the work.
Similarly, transactions and their content should be represented as specialized DOM nodes. It's almost orthogonal, except that it so nicely unifies actions (with their various ways of handling content) and transactions, entities with machines, and so on.
We may continue to need ways of initializing agents (especially) using initialize.xh, at least as a bootstrapping mechanism. So the installer needs to check.
There's a problem: how do things like Entity nodes print?
The answer is that it really depends on their parent. If it's a DOCTYPE (declaration) node, then the Entity should get output as a declaration. If it's an Element (e.g. <NAMESPACE>), then the Entity should get output as an <ENTITY> element.
At the moment it's all kind of moot: we want XML representations, and we don't parse doctypes properly.
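As an illustration, the same entity binding might print in the two contexts as follows (the declaration form is standard XML; the element form is a guess at what the PIA would emit):

    as a declaration (under a DOCTYPE):         <!ENTITY n "v">
    as an element (under, e.g., <NAMESPACE>):   <ENTITY name="n">v</ENTITY>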
The current implementation uses <bind name="n">v</bind>.
When an Entity is read in, it's really the parent (e.g. the Agent or
Tagset) that should determine what to do with it, in the
append
method. There's a major complication in the fact that
the parser will append the entity before its contents are seen, if a tree
is being constructed. One would be tempted to do the construction with an
Output, except that has the same problem!
The right thing is to initialize the new object after reading it. This could actually be done by putting a processing instruction at the end of the content! It's not a problem for Tagset or Namespace because the bindings just go right into the table, and nothing has to be done to their content after that. Agents have to be initialized.
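Such a processing instruction has not been implemented; the target and syntax below are invented, purely to show the shape of the idea:

    <AGENT name="History">
      ... bindings and other content ...
      <?pia initialize?>
    </AGENT>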
This is currently sidestepped by making <AGENT>
active, so that the action
routine gets called
before handling the content. This allows the handler to perform the
necessary initialization. It is actually done in two places, because you
need to load the new agent's tagset in order to parse the content
correctly.
There's another complication: the Agent's tagset must be available while the Agent is being read in. Passing it as an attribute or the first entity wouldn't work, because the node under construction doesn't know about the parser! Another alternative is to update all the handlers when the Agent is initialized. That would be slow.
One correct solution is to put the tagset's filename into the DOCTYPE (where it belongs anyway) or into a processing instruction (targeted at the parser). Another is to separate construction from parsing, which is what we do.
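For example (the file name and PI target below are invented for illustration):

    <!DOCTYPE AGENT SYSTEM "pia-agent.ts">
or
    <?pia-parser tagset="pia-agent"?>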
Ideally one would like to be able to read an <ENTITY> element, for
example, and construct a TreeEntity
node. The problem is
that the parser has called the Handler's createElement
method
and is really expecting an Element back. It would be better if it set the
tagname variable first (so that the end tag has something to match), but
then left the node alone to be whatever type got returned. The only
alternative would be to make almost everything a TreeGeneric, which is ugly.
It seems as though createElement
and
createActiveElement
are only called in a very small number of
places, so it's pretty painless to have them return an
ActiveNode
(which in most cases is all the caller needs,
anyway) and check the return type as needed.
Document.createElement
is the obvious exception; it may be
necessary to resort to wrappers here.
Of course, the whole gamut of problems simply goes away if we use a tagset and handlers to construct the objects, or (perhaps even better) if we move the construction into an Output that can do the necessary translation. In the latter case the parse tree could be totally passive, and the constructed objects need not be even remotely related to DOM objects. Using an Output has the additional advantage that it can be driven by a SAX parser.
There's an intermediate position, though, that takes good advantage of the way the system works: Construct objects of appropriate classes in the Parser, and assemble them into structures in an Output. This requires that the intermediate objects, at least, be implementations of ActiveNode, but that's not really a problem in the PIA where even an Agent is basically just a specialized Namespace. This is similar to what TagsetProcessor is doing; the difference is that no processing needs to be done, just a copy.
This, in fact, is the way we do it. We use ToNamespace and ToAgent Outputs along with the <bind> element, which is processed normally.
There's also a more extreme version that requires even less
intervention: make an I/O pair called ToTabular
and
FromTabular! That's disgusting: I like it. Not really
suitable for things, like agents, that might contain real XML, but it's
great for headers and Property objects.
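For instance, a set of headers or a Property object rendered tabularly might look something like this (element and attribute names invented):

    <tabular>
      <row name="Content-Type"   value="text/html"/>
      <row name="Content-Length" value="1234"/>
    </tabular>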
An Agent does not have to be an extension of TreeGeneric -- TreeElement would do. Note that its name as a namespace wants to be AGENT anyway. On the other hand, BasicNamespace is an extension of TreeGeneric, so it's probably moot. On the other other hand, BasicNamespace probably should not descend from TreeGeneric. The only things that do are AbstractHandler, BasicTagset, BasicNamespace, and NamespaceWrap. TreeDocument perhaps should. It seems no longer necessary for TreeGeneric to have a value component.
TreeGeneric should go back to the function of providing separate names for a node when used as an element or something else. In other words, when it's an element the nodeName should be the tagName, and the ``other'' name should be the name attribute.
The correct way to implement Agent is to make an implementation of Namespace called NamespaceExtend. The intent would be similar to NamespaceWrap except that instead of always going to get and set on the wrapped object, these would instead always go through the bindings. Just the opposite of NamespaceWrap, in fact. The bindings would then include specialized implementations of ActiveEntity that go directly to access methods on the extended object.
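A rough sketch of that intent, using hypothetical stand-ins for Namespace and ActiveEntity (none of these are the real PIA classes): each specialized binding delegates straight to an accessor pair on the extended object, while lookups always go through the bindings table.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Consumer;
    import java.util.function.Supplier;

    // Hypothetical stand-in for ActiveEntity.
    interface EntityBinding {
      String getValue();
      void setValue(String v);
    }

    // A binding that goes directly to access methods on the extended object.
    class AccessorEntity implements EntityBinding {
      private final Supplier<String> getter;
      private final Consumer<String> setter;
      AccessorEntity(Supplier<String> getter, Consumer<String> setter) {
        this.getter = getter;
        this.setter = setter;
      }
      public String getValue()       { return getter.get(); }
      public void setValue(String v) { setter.accept(v); }
    }

    // NamespaceExtend idea: everything goes through the bindings table.
    class NamespaceExtendSketch {
      private final Map<String, EntityBinding> bindings = new HashMap<>();
      void bind(String name, EntityBinding e) { bindings.put(name, e); }
      EntityBinding getBinding(String name)   { return bindings.get(name); }
    }

An agent's name, say, would then be bound as an AccessorEntity wrapping the agent's own accessor methods.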