About Handlers

This document contains notes about Handlers in the PIA's Document Processing System (DPS), as implemented in the Java module org.risource.dps.handle.

What a Handler Does

The org.risource.dps.Handler interface is actually a composite of two other interfaces: org.risource.dps.Syntax and org.risource.dps.Action.

Every ActiveNode actually points to two (potentially different) handlers, accessible through the getSyntax() and getAction() methods.

Syntax

The Syntax interface of a Handler is invoked at parse time. In the typical case, for an Element, it tells the parser whether the element is empty, whether its contents are parsed or unparsed, and so on.

When the Parser constructs a new Node, for example an Element, it passes the tagname and attribute list to the Tagset using the createActiveElement method. The tagset obtains the appropriate handler using its own getHandlerForTag method, and calls on the handler to construct the node, using its createElement method.

handler.createElement normally constructs a default ParseTreeElement object, but may be overridden to construct a subclass. See tagsetHandler for an example of this.

The syntax Handler then sets the new element's action handler, using e.setAction(getActionForNode(e)). The default getActionForNode simply returns this, but the handler has a chance to check for the presence (though not the value) of attributes at parse time and get some dispatching out of the way. See testHandler for a good example of this technique.

At this point, the syntax interface is out of the picture.
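
The parse-time sequence described above can be sketched as follows. This is a minimal illustration, not the actual DPS code: Element, Handler, BasicHandler and Tagset here are simplified stand-ins that borrow the names used in these notes, and the define method and the use of a plain Map for attributes are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins; NOT the real org.risource.dps interfaces.
interface Action { }

interface Syntax {
    Element createElement(String tagname, Map<String, String> attrs);
}

interface Handler extends Syntax, Action { }

class Element {
    final String tagname;
    final Map<String, String> attrs;
    private Action action;

    Element(String tagname, Map<String, String> attrs) {
        this.tagname = tagname;
        this.attrs = attrs;
    }
    void setAction(Action a) { action = a; }
    Action getAction() { return action; }
}

class BasicHandler implements Handler {
    public Element createElement(String tagname, Map<String, String> attrs) {
        Element e = new Element(tagname, attrs); // a subclass could be constructed here
        e.setAction(getActionForNode(e));        // action handler is fixed at parse time
        return e;
    }
    /** Default: the syntax handler also serves as the action handler. */
    protected Action getActionForNode(Element e) { return this; }
}

class Tagset {
    private final Map<String, Handler> handlers = new HashMap<>();
    private final Handler defaultHandler = new BasicHandler();

    /** Hypothetical registration method, for the sketch only. */
    void define(String tagname, Handler h) { handlers.put(tagname, h); }

    Handler getHandlerForTag(String tagname) {
        return handlers.getOrDefault(tagname, defaultHandler);
    }
    /** Called by the parser: look up the handler and let it build the node. */
    Element createActiveElement(String tagname, Map<String, String> attrs) {
        return getHandlerForTag(tagname).createElement(tagname, attrs);
    }
}

class ParseTimeSketch {
    public static void main(String[] args) {
        Tagset ts = new Tagset();
        Handler ifHandler = new BasicHandler();
        ts.define("if", ifHandler);
        Element e = ts.createActiveElement("if", new HashMap<>());
        System.out.println(e.getAction() == ifHandler); // prints true
    }
}
```

Once createActiveElement returns, the element carries its action handler with it, which is why the syntax side is not consulted again during processing.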

Semantics

An ActiveNode's associated Action handler is called from a Processor (or from processing utilities in org.risource.dps.aux.Expand, although in practice these almost invariably construct a sub-processor).

The relevant code in BasicProcessor is:

  public boolean run() {
    running = true;
    processNode();
    while (running && input.toNext()) processNode();
    return running;
  }

  /** Process the current Node */
  public final void processNode() {
    Action handler = input.getAction();
    if (handler != null) {
      doAction(handler.getActionCode(), handler);
      // MUST BE equivalent to: handler.action(input, this, output);
    } else {
      expandCurrentNode();
    }
  }

  /** Perform any additional action requested by the action routine. */
  protected final void doAction(int flag, Action handler) {
    switch (flag) {
    case Action.ACTIVE_NODE: handler.action(input, this, output); return;
    case Action.COPY_NODE: copyCurrentNode(); return;
    case Action.EXPAND_NODE: expandCurrentNode(); return;
    case Action.EXPAND_ATTS: expandCurrentAttrs(); return;
    case Action.PUT_NODE: putCurrentNode(); return;
    }
  }

Inside an Action

Eventually we get down to calling the ``three-argument'' action method, which in GenericHandler (which is the parent of the handlers for all active elements) looks like this:

    public void action(Input in, Context aContext, Output out) {
        defaultAction(in, aContext, out);
    }

All this is doing is passing the real operation off to defaultAction, in case you want to

Handler Classes

There are four kinds of handler classes:

Handlers for generic nodes of a given type. These have capitalized names ending with Handler: for example, EntityHandler, which handles entity references.
Handlers for active Elements. These have names that match the node's tagname (typically), with Handler as a suffix to keep them from being confused with Java keywords. For example, ifHandler, which handles the <if> element. In general these are public classes.
Handler subclasses for handling elements with a particular attribute. These have names of the form tagname_attribute. For example, numeric_sort, which handles the <numeric> element when it carries the sort attribute (<numeric sort>). These are almost invariably package-local classes, defined in the same file as their parent element handler.
Handler classes for sub-elements of specific elements. These follow the same naming convention as handlers for other active elements, but are often defined in the same file as their parent element. For example, fromHandler, which handles the <from> sub-element of <select>.

Note that several tags can share a handler class by specifying the classname explicitly. For example, <else-if> and <elsif> share the same handler. It is also possible to construct variant tagsets in which every element has a different name than the ``standard'' one. Because of this, when a parent handler wants to identify specific sub-elements, it will usually compare the class names of their handlers instead of their tagnames. See ifHandler for a good example.
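
A minimal sketch of identifying a sub-element by its handler rather than by its tagname is shown below. The Node and Handler types are stand-ins, and the handler class names elseHandler and elsifHandler are hypothetical; the sketch compares Class objects, which in this simple case gives the same answer as comparing class names.

```java
// Simplified stand-ins; NOT the real org.risource.dps classes.
class Handler { }
class elseHandler extends Handler { }   // hypothetical handler class
class elsifHandler extends Handler { }  // hypothetical handler class

class Node {
    final String tagname;
    final Handler handler;

    Node(String tagname, Handler handler) {
        this.tagname = tagname;
        this.handler = handler;
    }
}

class SubElementCheck {
    /** True if the child is handled by elsifHandler, whatever its tag is called. */
    static boolean isElsif(Node child) {
        return child.handler.getClass() == elsifHandler.class;
    }

    public static void main(String[] args) {
        Node a = new Node("elsif", new elsifHandler());
        Node b = new Node("else-if", new elsifHandler());  // same handler, different tag
        Node c = new Node("else", new elseHandler());
        System.out.println(isElsif(a)); // prints true
        System.out.println(isElsif(b)); // prints true
        System.out.println(isElsif(c)); // prints false
    }
}
```

Because the test never looks at the tagname, it keeps working in a variant tagset that renames the element.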

Writing a Handler

Writing an Element handler

When writing a new handler from scratch, say for the ``<foo>'' element, the best way to start is with the command:

  make class tag=foo

This copies a skeleton called TypicalHandler, replacing all occurrences of ``typical'' with ``foo'' and so giving you a good place to start.

The new class will have a getActionForNode method to dispatch on attributes at parse time, and a sample attribute-handler subclass. Either edit the names, or delete them if you don't want them.
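
As a rough illustration of what such a parse-time dispatch might look like, the sketch below uses stand-in types rather than the generated skeleton itself. The attribute name bar, the describe method, and the shared INSTANCE field are assumptions made for the example; only the tagname_attribute naming pattern comes from these notes.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in; NOT the real ActiveElement.
class Element {
    final String tagname;
    final Map<String, String> attrs;

    Element(String tagname, Map<String, String> attrs) {
        this.tagname = tagname;
        this.attrs = attrs;
    }
    boolean hasAttribute(String name) { return attrs.containsKey(name); }
}

class fooHandler {
    /** Dispatch on the presence (not the value) of an attribute at parse time. */
    fooHandler getActionForNode(Element e) {
        if (e.hasAttribute("bar")) return foo_bar.INSTANCE;
        return this;
    }
    /** Hypothetical method standing in for the real action. */
    String describe() { return "plain foo"; }
}

// Attribute-handler subclass, named tagname_attribute.
class foo_bar extends fooHandler {
    static final foo_bar INSTANCE = new foo_bar();
    @Override String describe() { return "foo with bar"; }
}

class AttributeDispatchSketch {
    public static void main(String[] args) {
        fooHandler syntax = new fooHandler();
        Map<String, String> attrs = new HashMap<>();
        attrs.put("bar", "");  // value is irrelevant; only presence is checked
        Element withBar = new Element("foo", attrs);
        Element plain = new Element("foo", new HashMap<>());
        System.out.println(syntax.getActionForNode(withBar).describe()); // prints foo with bar
        System.out.println(syntax.getActionForNode(plain).describe());   // prints plain foo
    }
}
```

Doing the check once at parse time means the action routine does not have to re-examine the attribute list every time the node is processed.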

The skeleton gives you a ``five-argument'' action method to customize. This is almost always the right place to start; you can do anything with it, but it may be less efficient than a customized ``three-argument'' action. In particular, if you need the contents of the element as a string it is significantly more efficient to make a three-argument action routine; see testHandler and its subclasses for some typical examples.

If you need to do something involving control structure, take a look at repeatHandler and ifHandler. If you need to pass data between an element and its sub-elements, or from one sub-element to another, look at selectHandler.

Writing an Attribute or sub-element handler

When writing a handler for an attribute or sub-element, the best thing to do is to clone an existing one with the same parent.

If you need to add sub-elements to a new parent element, take a look at selectHandler.

Getting Information

If you need information about the current node (for example, its node type or tagname), use the current Input (usually passed as an argument called in). The input is also the right place to go for conversions, e.g.
```
    ActiveElement e = in.getActive().asElement();
```
If you need the value of an entity, or need to set an entity, use the current Context (usually called either cxt or aContext).
If you need something from the current Tagset, use cxt.getTopContext().getTagset().

Debugging

You can always get debugging output using the debug or message methods on Context. Note, however, that any computation needed to build the message is performed whether debugging is turned on or not. It is usual, therefore, to comment out debugging statements after you're done with them.
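
The cost described above comes from Java evaluating method arguments before the call is made. The sketch below demonstrates the effect and one possible way around it. DebugContext is a hypothetical stand-in, not the real Context; the debugLazy method is an assumption made for the example and relies on java.util.function.Supplier, so it needs Java 8 or later.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a context with a debug switch; NOT the real Context API.
class DebugContext {
    private final boolean debugging;
    final StringBuilder log = new StringBuilder();

    DebugContext(boolean debugging) { this.debugging = debugging; }

    /** The message string has already been built by the time this is called. */
    void debug(String message) {
        if (debugging) log.append(message).append('\n');
    }
    /** The supplier is only invoked when debugging is on. */
    void debugLazy(Supplier<String> message) {
        if (debugging) debug(message.get());
    }
}

class DebugSketch {
    static int calls = 0;

    /** Stands in for an expensive computation, such as dumping a parse tree. */
    static String expensiveDump() {
        calls++;
        return "dump of a large parse tree";
    }

    public static void main(String[] args) {
        DebugContext cxt = new DebugContext(false);
        cxt.debug("state: " + expensiveDump());           // computed although debugging is off
        cxt.debugLazy(() -> "state: " + expensiveDump()); // never computed
        System.out.println(calls); // prints 1
    }
}
```

With code that cannot use lambdas, wrapping the call in an explicit test of a debugging flag would have the same effect.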