Adventures in Active Markup

Foreword

This white paper is an attempt at a roadmap for the evolution of the various instances of what might be called ``active markup languages'', particularly SSML, PIA, and cPIA. All three are systems for writing web applications as collections of ``active web pages'' that, rather than embedding constructs from another programming language, use a markup language (HTML or XML) extended to the point where complete server-side programs can be written directly in the markup language.

These systems can be viewed as ``macro languages'' for markup; they extend markup languages by adding control structures and data-processing functions with the same syntax as the underlying markup language.

Active markup systems stand in contrast to what might be called ``embedded-code'' systems like ASP, JSP, and PHP (which embed fragments of various programming languages in otherwise-ordinary web pages) and ``server-side programming'' systems like server APIs and CGI programs. They are also distinct from style-sheet-based systems like Cocoon and XSLT: although active markup systems allow a complete separation between content and processing, they do not require it, and they allow processing code to be embedded directly in pages.

Active markup is unique in its ability to support a mixture of separate (stylesheet-like) and embedded (ASP-like) styles in different parts of a web application. The tag languages discussed here, moreover, are unusual in not being confined to strict XML syntax, making them useful for processing ordinary HTML and making them human-writable as well as machine-readable.

Table of Contents:


    1: Roadmap: the Future of Active Markup
    2: Comparisons
        2.1: System-Level Comparison
        2.2: Implementation Comparison
        2.3: Language Comparison

1: Roadmap: the Future of Active Markup

(I'm going to present the roadmap first. The reasoning behind my suggestions may be a trifle sketchy; it may be necessary to refer to the detailed comparison of the various systems, below.)

I think that the main long-term goals include the following:

Long-term research and development goals include:


2: Comparisons

2.1: System-Level Comparison

This section compares the four active markup systems SSML, PIA, cPIA, and AMP.

                         PIA                         cPIA                   SSML           AMP
Implementation Language  Java                        C                      C++            Perl
Tree Representation      DOM                         DOM                    DOM-like       Node=hash
Binding Time             late                        late                   early          late
Execution Environment    stand-alone server/servlet  Apache module          Apache module  Perl module
Command-line operation?  filter                      filter                 --             filter
Parser                   virtual-tree iterator       virtual-tree iterator  tree-builder   tree-builder
Compiler?                no                          no                     yes            no
Entities                 strict                      strict                 extended       strict

2.2: Implementation Comparison

This section compares the implementations of SSML and PIA/cPIA (which share their basic structure in spite of having been written in different implementation languages). AMP is mentioned briefly.

The Architecture Space

There are three ways to structure the internal workings of an active-markup processing system:

Currently, SSML can do either of the first two. PIA essentially does the first, except that the nodes the parser constructs are passed directly to the output and never actually linked into a real tree unless the content is needed for an active tag. The PIA parser, in other words, has the interface of a tree traverser. (The PIA also contains an ``event-driven'' parser API intended for use with SAX parsers, but it hasn't been tested.)

Ultimately one would like to move away from the first method and toward the second and third. The second (compilation) is more efficient for pages that rarely change; the third (event-driven parsing) is more efficient for on-the-fly expansion and allows large pages to be processed in a system with limited memory.

How They Work

SSML has a fairly conventional architecture: it generates a parse tree which is passed along to a LISP-like interpreter, which in turn invokes handlers for the active tags and sends anything it doesn't recognize along to the output. This makes it possible to save a compact binary representation of pages which can be interpreted very efficiently.
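The parse-tree-plus-interpreter arrangement can be sketched in a few lines of Python. This is only an illustration of the general technique, not SSML's actual code: the tuple-based node layout, the `render` function, and the `name` attribute on `<set>` and `<get>` are all assumptions made for the example.

```python
# Hypothetical sketch: a tree-walking interpreter driven by a table of
# active-tag handlers. Node layout and attribute names are assumptions.

def render(node, handlers, env):
    """Return the output text for one node of a parse tree."""
    if isinstance(node, str):                # text node: copy to the output
        return node
    tag, attrs, children = node
    handler = handlers.get(tag)
    if handler is not None:                  # active tag: its handler decides
        return handler(attrs, children, handlers, env)
    # Unrecognized (passive) tag: pass it through, expanding its content.
    inner = "".join(render(c, handlers, env) for c in children)
    attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
    return f"<{tag}{attr_text}>{inner}</{tag}>"

def handle_set(attrs, children, handlers, env):
    """Bind a variable to the expanded content; produce no output."""
    env[attrs["name"]] = "".join(render(c, handlers, env) for c in children)
    return ""

def handle_get(attrs, children, handlers, env):
    """Produce the current value of a variable (empty if unbound)."""
    return env.get(attrs["name"], "")

HANDLERS = {"set": handle_set, "get": handle_get}

page = ("p", {"class": "greeting"}, [
    ("set", {"name": "who"}, ["world"]),
    "Hello, ",
    ("get", {"name": "who"}, []),
    "!",
])

output = render(page, HANDLERS, {})
print(output)   # <p class="greeting">Hello, world!</p>
```

Because the whole tree exists before interpretation starts, a structure like `page` could also be serialized and reloaded later, which is the property that makes a saved binary representation possible.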

PIA, on the other hand, has a somewhat unusual architecture: instead of building and then traversing a parse tree, the parser has the interface of a tree walker. This makes it possible to process large documents without ever having to construct a complete tree. Similarly, output is done through an object with the interface of a tree constructor.
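A minimal Python sketch of this style of processing follows. It is not the PIA's API: the token tuples, the `process` and `collect_subtree` functions, and the `<shout>` tag are hypothetical, and a plain list stands in for the tree-constructor output object. The point it illustrates is that passive markup streams straight through, and a real subtree is built only when an active tag needs its content.

```python
# Hypothetical sketch: stream tokens to the output, building a subtree
# only for active tags. All names here are assumptions for illustration.

def collect_subtree(tag, attrs, tokens):
    """Consume tokens up to the matching end tag and return a real tree."""
    children = []
    for kind, *rest in tokens:
        if kind == "text":
            children.append(rest[0])
        elif kind == "start":
            children.append(collect_subtree(rest[0], rest[1], tokens))
        elif kind == "end":
            break
    return (tag, attrs, children)

def text_of(node):
    """Concatenate all text below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node[2])

def process(tokens, active, out):
    """Copy passive markup to `out`; expand active tags from their subtrees."""
    tokens = iter(tokens)        # one shared iterator, also used by collect_subtree
    for kind, *rest in tokens:
        if kind == "text":
            out.append(rest[0])
        elif kind == "start":
            tag, attrs = rest
            if tag in active:
                out.append(active[tag](collect_subtree(tag, attrs, tokens)))
            else:
                out.append(f"<{tag}>")
        elif kind == "end":
            out.append(f"</{rest[0]}>")

ACTIVE = {"shout": lambda subtree: text_of(subtree).upper()}

stream = [
    ("start", "p", {}), ("text", "plain "),
    ("start", "shout", {}), ("text", "loud "),
    ("start", "b", {}), ("text", "text"), ("end", "b"),
    ("end", "shout"),
    ("text", " plain"), ("end", "p"),
]

out = []
process(stream, ACTIVE, out)
result = "".join(out)
print(result)   # <p>plain LOUD TEXT plain</p>
```

Only the content of `<shout>` is ever held in memory as a tree; the enclosing `<p>` is never materialized, which is what lets this approach handle documents larger than available memory.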

PIA's parse tree representation was inspired by the W3C's Document Object Model (DOM), and in fact the PIA includes a reasonably complete DOM implementation. Unfortunately, it turns out that the DOM is really unsuitable for server-side use; among other things, it includes bidirectional links that allow the tree to be traversed in any order but make reference counting much more complicated. LISP-like trees with unidirectional links are all that the PIA really needs.

AMP's parser generates parse trees, but is simple enough to be easily modified into a PIA-like architecture. Parse tree nodes, in keeping with Perl practice, are blessed hashtables that map attribute names to values. The tag and content are represented by specially-named attributes, and non-element nodes (e.g., declaration and comment) simply have specially-named tags (e.g., !doctype and !--).

Toward a Common Code Base

A common code base for parsers, parse trees, and output modules would allow parsers and tag handlers to be shared among systems. There will, of necessity, be different implementations for these in different language families (e.g. C, Java, and Perl), but it should at least (eventually) be possible to share modules freely between C and C++.

All active markup systems need parse trees at one stage or another, so this might be a good place to standardize an API. At one point it looked as though the DOM would provide a standardized API that we could simply drop into the PIA, but the DOM turns out to be more applicable to browsers (it's essentially the document model for ECMAScript) than to server-side programs. Many of its constructs are difficult, if not impossible, to implement efficiently, and its doubly-linked structure allows arbitrary navigation at the expense of efficiency in the sort of top-down traversal used for active markup.

My current leaning is toward a more generalized tree structure with a limited number of node types, mapping names to values that may be either strings, lists, or subtrees. (This is, of course, exactly the sort of structure commonly seen in Perl and used for parse trees in AMP. I've implemented parse trees in everything from assembly language to Smalltalk; generic nodes simply work better. In particular, the Java PIA has had two different parse tree implementations over its history; the first was more Perl-like and was much less clumsy to work with than the second, DOM-based one.)

Unlike a DOM-like system in which each type of node (Text, Element, Attribute, Comment, ...) is represented by a different class (descending from Node, of course) with its own unique set of operations, a generic-node system makes it particularly easy to add new node types. This makes it potentially applicable to markup languages other than the SGML family (for example LaTeX or WikiText), and to objects other than documents (for example, directories). It also makes it possible to change the type of a node, simplifying <make> and similar operations.

It's worth noting that SSML and AMP already use trees with a single node type (wpt/ssmlparser/XMLNode.h in SSML, lib/XML/Node.pm in AMP).
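The generic-node idea can be sketched in Python with ordinary dictionaries. This is an illustration under stated assumptions, not the layout used by XMLNode.h or Node.pm: the reserved key names `_tag` and `_content` and the helper functions are invented for the example. Only the `!--` and `!doctype` tag names follow the convention described above for AMP.

```python
# Hypothetical sketch: one generic node type (a dict) for every kind of node.
# The reserved key names below are assumptions made for illustration.
TAG, CONTENT = "_tag", "_content"

def make_node(tag, content=None, **attrs):
    """Build a node that maps attribute names to values, plus tag and content."""
    node = dict(attrs)
    node[TAG] = tag
    node[CONTENT] = list(content or [])
    return node

def serialize(node):
    """Turn a node (or a plain string) back into markup text."""
    if isinstance(node, str):
        return node
    tag = node[TAG]
    inner = "".join(serialize(child) for child in node[CONTENT])
    if tag == "!--":                         # comment: just a specially-named tag
        return f"<!--{inner}-->"
    if tag == "!doctype":                    # declaration: likewise
        return f"<!DOCTYPE {inner}>"
    attrs = "".join(
        f' {key}="{value}"'
        for key, value in node.items()
        if key not in (TAG, CONTENT)
    )
    return f"<{tag}{attrs}>{inner}</{tag}>"

doc = make_node(
    "p",
    ["See ", make_node("em", ["this"]), make_node("!--", [" note "])],
    id="intro",
)
before = serialize(doc)
print(before)   # <p id="intro">See <em>this</em><!-- note --></p>

# Changing a node's type is a single assignment; no new object is needed.
doc[CONTENT][1][TAG] = "strong"
after = serialize(doc)
print(after)    # <p id="intro">See <strong>this</strong><!-- note --></p>
```

Adding a new kind of node here means choosing a new tag name and, at most, one more branch in `serialize`; nothing comparable to adding a class to a DOM-style hierarchy is required.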

2.3: Language Comparison

This section compares the active markup languages of SSML and the PIA family. Linguistically, AMP is a member of the PIA family.

At a language level, the two systems have a great deal in common.

There are some significant differences, too, but these are mainly in the choice of tags available, and are strongly influenced by SSML's expression syntax and PIA's lack of it.

It should be mentioned that AMP allows tags to be defined as template files -- essentially, the files in a template directory are automatically defined as tags. (It's straightforward, though tedious, to write a tagset-generator with this behavior in a PIA system.)
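The directory-of-templates idea can be sketched in Python as follows. This is not AMP's implementation or template syntax: the `load_tags` and `expand` functions are hypothetical, and `$name` placeholders from Python's `string.Template` merely stand in for whatever substitution mechanism a real system would use.

```python
# Hypothetical sketch: every file in a template directory becomes a tag
# named after the file. Function names and $-placeholders are assumptions.
import tempfile
from pathlib import Path
from string import Template

def load_tags(template_dir):
    """Define one tag per file in template_dir, keyed by the file's stem."""
    tags = {}
    for path in sorted(Path(template_dir).iterdir()):
        if path.is_file():
            tags[path.stem] = Template(path.read_text(encoding="utf-8"))
    return tags

def expand(tags, tag, **values):
    """Expand a template-defined tag with the given attribute values."""
    return tags[tag].substitute(values)

# Build a throwaway template directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "header.html").write_text("<h1>$title</h1>", encoding="utf-8")
    Path(tmp, "footer.html").write_text(
        "<hr><address>$owner</address>", encoding="utf-8"
    )
    tags = load_tags(tmp)

names = sorted(tags)
heading = expand(tags, "header", title="Adventures")
print(names)     # ['footer', 'header']
print(heading)   # <h1>Adventures</h1>
```

The appeal of this design is that defining a new tag requires no code at all: dropping a file into the directory is enough.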

Toward a Common Tagset

Only a very small number of choices distinguish among the existing active-markup systems at the lexical level, for example:

(It is theoretically possible to translate among these automatically: for example, an expression in an extended entity can be eliminated by using <do> to construct the element. Whether this would be useful enough to be worth doing is, of course, a separate question.)

Beyond that, it's a matter of selecting a set of allowable operations: a tagset. It ought to be possible to develop a set of primitive tags that everyone can agree on (for example, <if>, <get>, <set>, <define> and maybe a few others are already common to both PIA and SSML).

The main reason for doing this is not so much to make applications more portable (although the ability to pass active-markup pages around will certainly be a good thing in the long run) as to make the necessary knowledge more portable: just as Java and C++ benefit greatly from their common syntactic and semantic legacy from C, active-markup languages will benefit from a shared core of common syntax that developers and power-users can become familiar and comfortable with.


Copyright © 2002 Ricoh Innovations, Inc.
$Id: wp-markup.html,v 1.3 2003/10/06 17:18:41 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>