Adventures in Active Markup

Foreword

This white paper is an attempt at a roadmap for the evolution of the various instances of what might be called ``active markup languages'', particularly SSML, PIA, and cPIA. All three are systems for writing web applications as collections of ``active web pages'' that, rather than embedding constructs from another programming language, use a markup language (HTML or XML) extended to the point where complete server-side programs can be written directly in the markup language.

These systems can be viewed as ``macro languages'' for markup; they extend markup languages by adding control structures and data-processing functions with the same syntax as the underlying markup language.

Active markup systems stand in contrast to what might be called ``embedded-code'' systems like ASP, JSP, and PHP (which embed fragments of various programming languages in otherwise-ordinary web pages) and ``server-side programming'' systems like server APIs and CGI programs. They are also distinct from style-sheet-based systems like Cocoon and XSLT: although active markup systems allow a complete separation between content and processing, they do not require it, and they allow processing code to be embedded directly in pages.

Active markup is unique in its ability to support a mixture of separate (stylesheet-like) and embedded (ASP-like) styles in different parts of a web application. The tag languages discussed here, moreover, are unusual in not being confined to strict XML syntax, making them useful for processing ordinary HTML and making them human-writable as well as machine-readable.

1: Roadmap: the Future of Active Markup

(I'm going to present the roadmap first. The reasoning behind my suggestions may be a trifle sketchy; it may be necessary to refer to the detailed comparison of the various systems, below.)

I think that the main long-term goals include the following:

Creating a sufficiently large body of sample (application) code to be interesting, useful, and inspiring to the web development community.
Building a community of active markup developers and users.

Long-term research and development goals include:

Making active markup languages sufficiently expressive to be, at least in part, self-implementing.
Converging the implementation code base, eventually using an active-markup version of Literate Programming (the WEB/TANGLE/WEAVE approach used by Knuth for TeX): mixing raw code with markup to produce well-documented programs.
Building a suite of tools that make web application development with active markup easier.

2: Comparisons

2.1: System-Level Comparison

This section compares four active markup systems: SSML, PIA, cPIA, and AMP.

	PIA	cPIA	SSML	AMP
Implementation Language	Java	C	C++	Perl
Tree Representation	DOM	DOM	DOM-like	Node=hash
Binding Time	late	late	early	late
Execution Environment	stand-alone server/servlet	Apache module	Apache module	Perl module
Command-Line Operation?	filter	filter	--	filter
Parser	virtual-tree iterator	virtual-tree iterator	tree-builder	tree-builder
Compiler?	no	no	yes	no
Entities	strict	strict	extended	strict

2.2: Implementation Comparison

This section compares the implementations of SSML and PIA/cPIA (which share their basic structure in spite of having been written in different implementation languages). AMP is mentioned briefly.

The Architecture Space

There are three ways to structure the internal workings of an active-markup processing system:

1. Parse the active document and build a parse tree. Then traverse the parse tree, calling handlers for the active tags and the output routine for the rest.
2. Parse the active document and compile it into a program (possibly in some intermediate language such as byte codes) which can then be executed or interpreted.
3. Parse the active document and call ``event handlers'', SAX-style, on the fly. The handlers for <define>, <set>, and so on have to build pieces of parse tree, which are stored and later traversed.

Currently, SSML can do either of the first two. PIA essentially does the first, except that the nodes the parser constructs are passed directly to the output and never actually linked into a real tree unless the content is needed for an active tag. The PIA parser, in other words, has the interface of a tree traverser. (The PIA also contains an ``event-driven'' parser API intended for use with SAX parsers, but it hasn't been tested.)

Ultimately one would like to move away from the first method and toward the second and third. The second (compilation) is more efficient for pages that rarely change; the third (event-driven parsing) is more efficient for on-the-fly expansion and allows large pages to be processed in a system with limited memory.
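To make the compilation approach concrete, here is a minimal sketch in Python (not the implementation language of any of these systems). It assumes a hypothetical node shape of (tag, attributes, children) with plain strings as text nodes, and an illustrative <get name=...> tag that looks a variable up at run time. None of these names come from SSML; they only show how a page can be turned into a program once and then run many times.

```python
def compile_node(node):
    """Compile a node into a function env -> str.

    All structural decisions (which tag, which attributes) are made once,
    at compile time; only variable lookups happen when the page is run.
    """
    if isinstance(node, str):
        return lambda env: node          # text compiles to a constant
    tag, attrs, children = node
    if tag == "get":                     # hypothetical active tag
        name = attrs["name"]
        return lambda env: str(env.get(name, ""))
    # Passive tag: precompute the literal pieces, compile the children.
    attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
    open_tag, close_tag = f"<{tag}{attr_text}>", f"</{tag}>"
    parts = [compile_node(child) for child in children]
    return lambda env: open_tag + "".join(p(env) for p in parts) + close_tag
```

Compiling ("p", {}, ["Hello, ", ("get", {"name": "who"}, [])]) yields a function that can be called repeatedly with different environments without re-parsing or re-walking the page, which is where the gain for rarely-changing pages would come from.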

How They Work

SSML has a fairly conventional architecture: it generates a parse tree which is passed along to a LISP-like interpreter, which in turn invokes handlers for the active tags and sends anything it doesn't recognize along to the output. This architecture makes it possible to save a compact binary representation of pages which can be interpreted very efficiently.
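The build-then-interpret arrangement can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not SSML's code: the (tag, attributes, children) node shape, the handler signature, and the simplified meanings given to <set> and <get> are all hypothetical.

```python
def expand(node, handlers, env, out):
    """Walk a parse tree: call a handler for active tags, copy the rest."""
    if isinstance(node, str):
        out.append(node)
        return
    tag, attrs, children = node
    handler = handlers.get(tag)
    if handler is not None:
        handler(attrs, children, env, out)
        return
    # Unrecognized tag: send it along to the output unchanged.
    attr_text = "".join(f' {k}="{v}"' for k, v in attrs.items())
    out.append(f"<{tag}{attr_text}>")
    for child in children:
        expand(child, handlers, env, out)
    out.append(f"</{tag}>")

def handle_set(attrs, children, env, out):
    # Assumed semantics: bind the expanded content to a name, emit nothing.
    buf = []
    for child in children:
        expand(child, HANDLERS, env, buf)
    env[attrs["name"]] = "".join(buf)

def handle_get(attrs, children, env, out):
    # Assumed semantics: emit the current value of a name.
    out.append(env.get(attrs["name"], ""))

HANDLERS = {"set": handle_set, "get": handle_get}

def render(tree, env=None):
    out = []
    expand(tree, HANDLERS, {} if env is None else env, out)
    return "".join(out)
```

With this sketch, a <p> element containing a <set>, some text, and a <get> renders as an ordinary <p> with the variable's value substituted; the <p> itself passes through because no handler is registered for it.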

PIA, on the other hand, has a somewhat unusual architecture: instead of building and then traversing a parse tree, the parser has the interface of a tree walker. This makes it possible to process large documents without ever having to construct a complete tree. Similarly, output is done through an object with the interface of a tree constructor.
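A rough Python sketch of this idea follows. It is not PIA's API: the walk, collect, and process names are hypothetical, and the tokenizer is deliberately simplified (well-formed nesting only; no comments, declarations, or empty elements). The point it illustrates is that nodes flow straight from the parser to the output, and a real subtree is built only when an active tag needs its content.

```python
import re

# Simplified tokenizer: a tag (with any attributes) or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w-]*)[^>]*>|([^<]+)")

def walk(source):
    """Yield (kind, name, raw) lazily, in document order."""
    for m in TOKEN.finditer(source):
        closing, name, text = m.groups()
        if text is not None:
            yield ("text", None, text)
        elif closing:
            yield ("end", name, m.group(0))
        else:
            yield ("start", name, m.group(0))

def collect(events, name):
    """Link nodes into a real subtree, but only for one element."""
    children = []
    for kind, child_name, raw in events:
        if kind == "text":
            children.append(raw)
        elif kind == "start":
            children.append(collect(events, child_name))
        else:                      # the matching end tag
            break
    return (name, children)

def process(source, active):
    """Copy passive nodes to the output; build subtrees only for active tags."""
    out = []
    events = walk(source)
    for kind, name, raw in events:
        if kind == "start" and name in active:
            out.append(active[name](collect(events, name)))
        else:
            out.append(raw)
    return "".join(out)
```

Here the output list stands in for PIA's tree-constructor-like output object. Only the content of an active element is ever held in memory as a tree; everything else is emitted as soon as it is parsed.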

PIA's parse tree representation was inspired by the W3C's Document Object Model (DOM), and in fact the PIA includes a reasonably complete DOM implementation. Unfortunately, it turns out that the DOM is really unsuitable for server-side use; among other things, it includes bidirectional links that allow the tree to be traversed in any order but make reference counting much more complicated. LISP-like trees with unidirectional links are all that the PIA really needs.

AMP's parser generates parse trees, but is simple enough to be easily modified into a PIA-like architecture. Parse tree nodes, in keeping with Perl practice, are blessed hashtables that map attribute names to values. The tag and content are represented by specially-named attributes, and non-element nodes (e.g., declaration and comment) simply have specially-named tags (e.g., !doctype and !--).
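The node representation described above translates naturally into a Python dictionary. The sketch below is only an analogy to AMP's Perl hashes: the reserved key names _tag and _content are hypothetical, since the actual attribute names AMP uses are not given here.

```python
TAG, CONTENT = "_tag", "_content"    # hypothetical reserved attribute names

def element(tag, content=(), **attrs):
    """A node is just a mapping from attribute names to values."""
    node = dict(attrs)
    node[TAG] = tag
    node[CONTENT] = list(content)
    return node

def comment(text):
    # Non-element nodes simply carry a specially-named tag.
    return element("!--", [text])

def serialize(node):
    if isinstance(node, str):
        return node
    tag = node[TAG]
    body = "".join(serialize(child) for child in node[CONTENT])
    if tag == "!--":
        return f"<!--{body}-->"
    attrs = "".join(f' {k}="{v}"' for k, v in node.items()
                    if k not in (TAG, CONTENT))
    return f"<{tag}{attrs}>{body}</{tag}>"
```

Attributes are then read with ordinary dictionary access (node["id"]), and a comment is distinguished from an element only by its tag value.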

Toward a Common Code Base

A common code base for parsers, parse trees, and output modules would allow parsers and tag handlers to be shared among systems. There will, of necessity, be different implementations for these in different language families (e.g. C, Java, and Perl), but it should at least (eventually) be possible to share modules freely between C and C++.

All active markup systems need parse trees at one stage or another, so this might be a good place to standardize an API. At one point it looked as though the DOM would provide a standardized API that we could simply drop into the PIA, but the DOM turns out to be more applicable to browsers (it's essentially the document model for ECMAScript) than to server-side programs. Many of its constructs are difficult, if not impossible, to implement efficiently, and its doubly-linked structure allows arbitrary navigation at the expense of efficiency in the sort of top-down traversal used for active markup.

My current leaning is toward a more generalized tree structure with a limited number of node types, mapping names to values that may be strings, lists, or subtrees. (This is, of course, exactly the sort of structure commonly seen in Perl and used for parse trees in AMP. I've implemented parse trees in everything from assembly language to Smalltalk; generic nodes simply work better. In particular, the Java PIA has had two different parse tree implementations over its history; the first was more Perl-like and was much less clumsy to work with than the second, DOM-based one.)

Unlike a DOM-like system in which each type of node (Text, Element, Attribute, Comment, ...) is represented by a different class (descending from Node, of course) with its own unique set of operations, a generic-node system makes it particularly easy to add new node types. This makes it potentially applicable to markup languages other than the SGML family (for example LaTeX or WikiText), and to objects other than documents (for example, directories). It also makes it possible to change the type of a node, simplifying <make> and similar operations.
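As a small illustration of why a generic node makes these things easy, consider the following Python sketch. The Node class, its field names, and the retype helper are hypothetical; they are not taken from SSML, AMP, or the PIA.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # "element", "text", or any new kind
    name: str = ""
    props: dict = field(default_factory=dict)   # names -> strings, lists, or subtrees
    children: list = field(default_factory=list)

def retype(node, kind, name=""):
    """Change a node's type in place; no new class and no copying needed."""
    node.kind, node.name = kind, name
    return node
```

Adding a new node type is just a matter of using a new kind string, for example Node("directory", "docs", children=[Node("file", "a.txt")]), and turning a text node into an element is a single call to retype. In a class-per-type design, the first would need a new subclass and the second would need a new object to be built and spliced into the tree.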

It's worth noting that SSML and AMP already use trees with a single node type (wpt/ssmlparser/XMLNode.h in SSML, lib/XML/Node.pm in AMP).

2.3: Language Comparison

This section compares the active markup languages of SSML and the PIA family. Linguistically, AMP is a member of the PIA family.

At a language level, the two systems have a great deal in common.

Both are fundamentally HTML-based, rather than XML-based -- they both allow unquoted attributes and attributes without values. SSML explicitly allows, and indeed encourages, HTML-like structures in order to improve human read/writeability; PIA tries hard to be XML-compatible but allows HTML-like shortcuts.
Both abuse the XML entity-reference syntax for expansion-time variables. PIA uses strictly-compatible XML entity-reference syntax (&name;); SSML uses an extended syntax &(variable); and &{expression};. SSML's extensions are required for compilation, and useful in avoiding name collisions with existing entities; PIA's form works satisfactorily with late binding and can be processed with an unmodified (non-validating!) XML parser.
Both ignore the XML namespace architecture, although the PIA abuses the namespace syntax to specify context in entity names. Both can process XML with namespace references.
Both have a rich set of control-structure and data-processing tags, with some tags (e.g., <if>, <set>, and <get>) in common.
Both have a <define> tag for defining new tags, essentially as macros. And, of course, both have a way of implementing native-language (primitive) tag handlers.
Both include a way of executing Unix commands.
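The two entity-reference styles mentioned in the list above can be illustrated with a small late-binding expander. This is a Python sketch under assumptions, not either system's parser: the regular expression is a simplification, the expand_entities name is hypothetical, and SSML's &{expression}; form is left out because its expression grammar is not described here.

```python
import re

# &name;      strict, XML-compatible form (PIA style)
# &(name);    extended form (SSML style)
ENTITY = re.compile(r"&(?:\(([A-Za-z_]\w*)\)|([A-Za-z_][\w.:-]*));")

def expand_entities(text, env):
    """Replace bound names with their values at expansion time."""
    def lookup(match):
        name = match.group(1) or match.group(2)
        # Unbound names are left alone, so &amp; and friends survive.
        return str(env[name]) if name in env else match.group(0)
    return ENTITY.sub(lookup, text)
```

With env = {"user": "Ada", "count": 3}, the text "Hi &user;, &(count); new &amp; unread" expands to "Hi Ada, 3 new &amp; unread". The name-collision issue is also visible: if a page happened to bind a variable called amp, the strict form &amp; would be replaced by it, whereas writing variables as &(amp); keeps them visibly separate from the standard entities.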

There are some significant differences, too, but these are mainly in the choice of tags available, and are strongly influenced by SSML's expression syntax and PIA's lack of it.

SSML allows scripting languages (Perl and Python handlers are provided) to be embedded in pages (using a <SCRIPT> tag), providing a convenient escape from the tag language.
PIA has a rich set of tags for arithmetic and logical operations. These are, of course, made necessary by its lack of a syntax for embedding expressions in entity references and its lack of an escape to a scripting language.
SSML's control structures are more C-like; PIA's are more verbose and somewhat LISP-like.
PIA includes operators for constructing and manipulating parse trees at run-time. In particular, it includes the <make> and <do> tags, which allow XML elements to be constructed and (with <do>) executed at run time, and the <extract> tag for extracting information from parse trees. These are of no use in SSML, which does not have a parser available at run-time.
Because it is basically a tree-manipulation language, PIA can be used to parse and manipulate arbitrary web pages. This includes repurposing existing tags (as can be seen in cPIA, where the <p> tag is redefined as a table with a white background).

It should be mentioned that AMP allows tags to be defined as template files -- essentially, the files in a template directory are automatically defined as tags. (It's straightforward, though tedious, to write a tagset-generator with this behavior in a PIA system.)
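The template-directory idea is simple enough to sketch in a few lines of Python. The rule used here, that the tag name is the file name with its extension removed, is my assumption for illustration; AMP's actual naming convention is not specified above, and load_template_tags is a hypothetical name.

```python
from pathlib import Path

def load_template_tags(directory):
    """Define one tag per file: tag name = file stem, body = file contents."""
    return {path.stem: path.read_text(encoding="utf-8")
            for path in sorted(Path(directory).iterdir())
            if path.is_file()}
```

The resulting dictionary maps tag names to template text, which a tag-expansion step could then consult when it meets a tag it does not otherwise recognize.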

Toward a Common Tagset

There are a very small number of choices that distinguish among the existing active-markup systems at the lexical level, for example:

Strict (PIA) entity syntax / extended (SSML) syntax
Strict XML parser / relaxed parser

It is theoretically possible to translate among these automatically: for example, an expression in an extended entity can be eliminated by using <do> to construct the element. Whether this would be useful enough to be worth doing is, of course, a separate question.

Beyond that, it's a matter of selecting a set of allowable operations: a tagset. It ought to be possible to develop a set of primitive tags that everyone can agree on (for example, <if>, <get>, <set>, <define> and maybe a few others are already common to both PIA and SSML).

The main reason for doing this is not so much to make applications more portable (although the ability to pass active-markup pages around will certainly be a good thing in the long run) as to make the necessary knowledge more portable: just as Java and C++ benefit greatly from their common syntactic and semantic legacy from C, active-markup languages will benefit from a shared core of common syntax that developers and power-users can become familiar and comfortable with.

Stephen R. Savitzky <steve@rii.ricoh.com>
