Basic Tagset Manual

Quick Reference Table

Note: The links in the "Element Type" column refer to sections of this manual. Links in the "Element" and "Attributes" columns take you to the definitive, automatically generated documentation for that item, in tsdoc/basic.html.

Element Type	Element	Attributes	Subelements
Construct Specification	`<define>`	`element` `attribute` `entity` `notation`	`<value>` `<action>`
Tagset	`<tagset>`	`name` `include` `tagset` `recursive`	None
Namespace	`<namespace>`	`name` `hide` `pass`	`<bind>`
	`<bind>`	`name`	None
	`<get>`	`name`	None
	`<let>`	`name`	None
	`<set>`	`name`	None
Documentation	`<doc>`	None	None
Documentation	`<note>`	`author`	None
Control Structure	`<if>`	None	None
Control Structure	`<repeat>`	`list` `start` `stop` `step` `entity`	`<foreach>` `<for>` `<while>` `<until>` `<first>` `<finally>`
Logical	`<logical>`	`op` `and` `or`	None
Logical	`<test>`	`not` `text`	None
Document Structure	`<extract>`	`sep` `all`	None
Expansion Control	`<expand>`	`hide`	None
	`<protect>`	`result` `markup`	None
	`<hide>`	`markup` `text`	None
	`<debug>`	None	None
	`<show-errors>`	None	None
	`<pretty>`	`hide-above-depth` `hide-below-depth` `hide-below-tag` `white-tag` `yellow-tag`	None
Data Manipulation	`<numeric>`	`op` `sum` `difference` `product` `quotient` `remainder` `power` `sort` `reverse` `pairs` `sep` `digits` `integer` `extended` `modulus`	None
	`<text>`	`pad` `trim` `width` `align` `sort` `reverse` `pairs` `sep` `split` `join` `encode` `decode`	None
	`<subst>`	`match` `result`	None
	`<parse>`	`usenet` `query` `element` `elements` `attr` `pairs`	None
External Resources	`<include>`	`src` `tagset` `entity`	None
	`<output>`	`dst` `append`	None
	`<connect>`	`method` `src` `mode` `tagset` `entity` `result`	None
	`<status>`	`src` `entity` `item`	None
Data Structure	`<DOCUMENT>`	`protocol` `version` `code` `message`	None
	`<HEADERS>`	`element`	None
	`<header>`	`name`	None
	`<QUERY>`	`element`	None
	`<URL>`	`protocol` `host` `port` `path` `reference` `query`	None

This document introduces the basic tagset, the most fundamental of several predefined tagsets available to users and developers working with the PIA. The basic tagset defines all of the elements that make documents in the PIA "active."

The other commonly-used tagsets are:

xxml
the basic tagset extended with some XML convenience functions
xhtml
the basic tagset extended with some HTML convenience functions
pia-xhtml
the default XHTML tagset used by the PIA

Some additional tagsets are defined for special-purpose applications. For example, the tsdoc tagset is used to automatically generate documentation files from tagset files. The basic constructs are all that the average author will need in most cases.

Currently the best source of examples of tagset use is the demo agent. If you wish to view these examples without using the PIA, look at the listing of demo files.

The quick reference table in this document links to the automatically generated documentation in tsdoc/basic.html.

If you are unfamiliar with the documentation, you should first look at How to Read Tagset Descriptions.

The basic tagset consists of primitive tags that are predefined. These tags provide generally useful functions and can be used as is. Primitive tags can also be used as components of new tags that you create yourself.

Extending the primitive tagset requires the use of a special tag, <define>, whose purpose is to construct other tags. The <define> tag specifies the tag name and all other necessary information about the tag, including its attributes.

Tag is used here in much the same way that element might be used in an XML or SGML DTD. Defining a tag requires the same information one might supply for a DTD element. This includes the following:

the element name
any attributes that element might have
modifiers that specify whether the attributes are required or optional and what value types they may take on
what subelements can occur as children of this element
what parent element this element must have

XHTML tags are active tags: they constitute not merely markup but a directive to take some action, and so can require information not required of XML/SGML element definitions. This information reflects the special uses to which these tags are put. It includes the tag's handler, a specification of the Java class associated with the actions to be carried out when this element is present. The keyword handler can be followed by the name of the class to be used. If no class is named, the handler is assumed to have the same name as the tag.

When viewed as active tags, XHTML tags are much like Java or C++ methods. They have a name, a documentation field, optional arguments, and they perform some sort of action. A combination of HTML, XML, or other XHTML tags can be used in the action clause. The last item evaluated in the action clause is returned, so long as it has a return value.

Viewed in this light, a tag's attributes are analogous to the arguments that may be passed into a function.

A simple tag definition for the user-defined tag my_tag is presented here:

<define element=my_tag>
  <doc>Given a string attribute, prints that string in a bold font.</doc>
  <define attribute=my_attr required></define>
  <action>
    <b>&attributes:my_attr;</b>
  </action>
</define>

The <doc> element serves to document the tag's actions. The <action> element specifies the action taken when this tag is evaluated. In this case it prints the tag's attribute using a bold font. The following example shows how, once defined, this tag might be used in an active document.

<my_tag my_attr="Hello World"></my_tag>

Noteworthy

A tagset file can be converted into HTML using the tsdoc tagset. Eventually it will be possible to process tagset files into DTDs as well. It is conceivable that we could even process them into Java or C, perhaps using embedded <code> tags with a language attribute.
SGML requires that the element named in the <!doctype...> declaration be the highest-level (outermost) element in the document.
This tagset includes documentation in HTML, in spite of the fact that HTML is not a superset of the tagset being defined. The syntax of the document that defines a tagset and the syntax defined by the tagset are, potentially, completely disjoint. Because of the included HTML, this document does not qualify as XML. It could, however, be described in SGML or converted to XML by outputting empty HTML tags with the XML empty-tag delimiter.

Tagset Overview

The tagset categories and their elements described in this document are listed in the quick reference table at the top of this document. Each element and attribute name in the table is linked to a tagset definition file for that element. If you are unfamiliar with these files, read the following section before following a link to a definition.

How to Read Tagset Descriptions

A tagset definition file consists of a sequence of <define> elements. Some of these elements are nested inside others. The outermost <define> elements define SGML elements and entities. Nested inside each element definition are the definitions of its attributes. Any definition can also contain documentation.

When converted to HTML for documentation purposes, the nested attribute definitions are indented. Documentation elements are nested one more level and typeset in italics. The </define>, <doc>, and </doc> tags are omitted.

Certain elements are meaningful only inside other elements. For example, <then> elements occur only inside <if> elements. By convention, the definitions of these elements follow that of their parent element.

Construct Specification Elements

The construct specification elements are used to create a tagset. They include the <define> element and its subelements.

Subelements of <define>
`<value>`
`<action>`

The <define> element can be used to specify any of the following tag types:

element
attribute
entity
word

The <define> element must be predefined for bootstrapping, but it is not in the tagset unless placed there.

The tagset is not recursive. For that reason, tags cannot be used as actions.

The <define> element can occur outside of a <namespace> or <tagset> element because there is always a "current" namespace and tagset in effect.

A <define> element that contains neither a <value> nor an <action> subelement defines only syntax. The defined construct is simply passed through to the output by the processor, with its contents and attributes, if any, processed in turn.

A <define> element can contain anything at all. All content, with the exception of the <value>, <action>, and possibly <doc> elements, is discarded. This means that a definition can contain arbitrary decorative markup, and that arbitrary computation can be done in the course of processing a definition.

A construct can be "defined" more than once. In such cases, the attributes are effectively merged. The associated value and/or action are replaced. This technique is used to associate a new value with a construct, and to associate an action with a construct that has already been defined.
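The merge-and-replace behaviour of repeated definitions can be sketched as a small Python model. This is only an illustration under stated assumptions, not the PIA's implementation: the Construct record and the define function are hypothetical names, and the sketch assumes that a redefinition which supplies no value or action leaves the earlier one in place.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Construct:
    """Hypothetical record for one defined construct (not a PIA class)."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    value: Optional[str] = None
    action: Optional[str] = None


def define(table, name, attributes=None, value=None, action=None):
    """Define or redefine `name` in `table`, a dict of Construct records."""
    construct = table.setdefault(name, Construct(name))
    # Attributes from every definition are merged.
    construct.attributes.update(attributes or {})
    # A supplied value or action replaces the previous one.
    # Assumption: an omitted value or action leaves the old one in place.
    if value is not None:
        construct.value = value
    if action is not None:
        construct.action = action
    return construct


constructs = {}
# First definition: syntax only.
define(constructs, "my_tag", attributes={"my_attr": "required"})
# Second definition: adds an attribute and associates an action.
define(constructs, "my_tag", attributes={"other": "optional"},
       action="<b>&attributes:my_attr;</b>")
```

After the second call, my_tag carries both attributes and the newly associated action, which mirrors the use described above of attaching an action to a construct that has already been defined.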

Construct Specification Attributes

The following attributes of the <define> element specify the type of construct being defined; modifiers specify whether it is required or optional. The name of the construct is given as the attribute's value: for example, the name of an element is the value of the element attribute.

The attributes for <define element> are of the form

<define attribute='construct_type' optional>

They are summarized here and described in more detail below:

Construct Type
element
attribute
entity
notation

Modifiers for Structure Constructors

General Modifiers

The following sections describe the available modifiers for the structure constructor elements.

In working with these modifiers consider the following:

Modifiers for <define element> are meaningful only when defining an element. It is impossible to represent this constraint in SGML.
The use of parent= in subelements of <define> specifies that these elements occur only inside the given parent element; in this case, <define>. The value of the parent attribute is a list that is appended to with each use, allowing the DTD to be incrementally extended.
The use of parent greatly simplifies the construction of content models and the parser. An element with a parent implicitly terminates any unclosed elements between it and its innermost parent.
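The implicit termination rule can be illustrated with a stack of open elements. The following Python sketch is a model only, not the PIA parser: the function name open_element and its arguments are hypothetical, and the behaviour when no permitted parent is open (the new element is simply opened) is an assumption.

```python
def open_element(open_stack, tag, parents):
    """Open `tag`, first closing anything left open above its innermost parent.

    open_stack -- list of currently open tag names, outermost first
    parents    -- set of tag names allowed as the parent of `tag`
    Returns the tags that were implicitly closed, innermost first.
    """
    # Search from the innermost open element outward for a permitted parent.
    for depth in range(len(open_stack) - 1, -1, -1):
        if open_stack[depth] in parents:
            closed = list(reversed(open_stack[depth + 1:]))
            del open_stack[depth + 1:]
            open_stack.append(tag)
            return closed
    # Assumption: with no permitted parent open, nothing is closed.
    open_stack.append(tag)
    return []


stack = ["if", "then", "b"]
implicitly_closed = open_element(stack, "else", {"if"})
```

Here an <else> whose parent is <if> closes the unclosed <b> and <then> before it opens, so the parser needs no explicit end tags for them.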

Modifier Type	Modifier
General modifiers for `<define>`	`handler`
For `<define element>`	`quoted`
	`literal`
	`text`
	`no-text`
	`empty`
For `<define attribute>`	`optional`
	`required`
	`fixed`
For `<define entity>`	`system`
	`public`
	`mode`
	`method`
	`NDATA`
	`tagset`
	`retain`
	`parameter`

Tagset and Namespace

The <tagset> and <namespace> elements provide the context in which <define> operates, i.e., in which elements, entities, and so on are defined.

It is, however, meaningful for <define> to occur outside of a <namespace> or <tagset> element because there is always a "current" namespace and tagset in effect.

Name Definition Elements

The <namespace> element provides the context in which <define entity>, <get>, <set>, <let> and <bind> operate, i.e., in which names are associated with values. It's best to think of a namespace as a collection of what most programming languages call "variables".

It is, however, meaningful for <define>, etc. to occur outside of a <namespace> element because there is always a "current" namespace in effect. The outermost namespace in a document is called (has the prefix) "VAR:", because it contains the document processor's variables.

Namespaces are "nested": if a name is not defined in the current namespace, <get> will look "up the stack" to find a namespace that does contain it. Inner namespaces always have the name of the element (tag) that defines them; the following elements define namespaces:

<namespace>
<extract>
<repeat>
any tag defined by <define>

The difference between <let> and <set> is that if a variable already has a value, <set> will simply change its value no matter which namespace it's defined in. If no such variable already exists, <set> will create one in the outermost namespace, VAR:. On the other hand, <let> will always set a variable in the innermost namespace, and will create a new one there if necessary.
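The lookup and assignment rules for <get>, <set>, and <let> can be modelled with a stack of dictionaries. This Python sketch is an illustration under stated assumptions rather than the PIA's implementation: the NamespaceStack class is hypothetical, and it assumes that when several namespaces define the same name, <set> changes the innermost one that <get> would find.

```python
class NamespaceStack:
    """Hypothetical model of nested namespaces; frames[0] plays the role of VAR:."""

    def __init__(self):
        self.frames = [("VAR", {})]

    def push(self, name):
        """Enter an inner namespace named after the element that defines it."""
        self.frames.append((name, {}))

    def pop(self):
        """Leave the innermost namespace; the outermost one is never removed."""
        if len(self.frames) == 1:
            raise RuntimeError("cannot leave the outermost namespace")
        return self.frames.pop()

    def get(self, name):
        """Look up the stack, innermost namespace first."""
        for _, bindings in reversed(self.frames):
            if name in bindings:
                return bindings[name]
        return None

    def set(self, name, value):
        """Change an existing variable wherever it lives, else create it in VAR."""
        for _, bindings in reversed(self.frames):
            if name in bindings:
                bindings[name] = value
                return
        self.frames[0][1][name] = value

    def let(self, name, value):
        """Always bind in the innermost namespace."""
        self.frames[-1][1][name] = value


ns = NamespaceStack()
ns.set("x", 1)        # no x anywhere yet: created in VAR
ns.push("repeat")
ns.let("x", 2)        # shadows VAR's x inside the repeat namespace
ns.set("y", 9)        # no y anywhere yet: created in VAR, not in repeat
inner_x = ns.get("x")
ns.pop()
outer_x = ns.get("x")
```

In this model the value bound by let disappears when the inner namespace is left, while the variable created by set survives in the outermost namespace.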

The <bind> element is used almost exclusively for initialization: its contents are not expanded and the name cannot contain a namespace prefix. It always defines its variable in the innermost namespace. Because it makes no attempt to expand its contents, <bind> is significantly more efficient when it can be used. You will usually see it in XML code resulting from the output of a namespace; this allows namespaces and things that resemble namespaces (e.g., Agents) to be read in efficiently.

Documentation Elements

The elements <doc> and <note> are subelements of <tagset> and <namespace>. They are processed by the tsdoc tagset to automatically construct the text portion of tagset documentation files.

Control Structure Elements

Control structure elements modify the control flow of an expansion, by selectively including, skipping, or repeating some content. The control structure elements are <if> and <repeat>.

The control structure elements are summarized here:

`<if>` and its Components

Subelement	Parent	Handler
`<then>`	`<if>` `<else-if>` `<elif>` `<elsf>`	`quoted`
`<else>`	`<if>`	`quoted`
`<else-if>`	`<if>`	`elsf`
`<elsf>`	`<if>`
`<elif>`	`<if>`	`elsf`

`<repeat>` and its Components

The contents of a <repeat> are repeatedly expanded. All of the following subelements are effectively iterating in parallel, which makes it easy to go through multiple lists and number the corresponding elements.

Subelement	Attribute
`<foreach>`	`entity`
`<for>`	`entity`
	`start`
	`stop`
	`step`
`<start>`	None
`<stop>`	None
`<step>`	None
`<while>`	None
`<until>`	None
`<first>`	None
`<finally>`	None
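The parallel iteration described above, in which a list and a counter advance together, is comparable to zipping two iterators. The Python sketch below is only an analogy: the repeat function is hypothetical, and it assumes that iteration ends as soon as the list is exhausted, which the text above does not state.

```python
from itertools import count


def repeat(items, start=1, step=1):
    """Pair each item with a counter value, advancing both on every pass."""
    # zip stops when the list runs out; itertools.count never does.
    return [f"{number}. {item}"
            for number, item in zip(count(start, step), items)]


numbered = repeat(["apples", "pears", "plums"])
# numbered == ['1. apples', '2. pears', '3. plums']
```

The list plays the part of a <foreach> and the counter the part of a <for>, so numbering the entries of a list needs no separate bookkeeping.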

Logical Elements

The logical elements are <logical> and <test>.

Element	Attribute
`<logical>`	`op`
	`and`
	`or`
`<test>`	`text`
	`not`
	`zero`
	`positive`
	`negative`
	`numeric`
	`match`
	`exact`
	`case`
	`null`

Document Structure Elements

Document structure elements extract nodes or sets of nodes from a parse tree, and perform structural modifications on trees. The tree being operated on need not be part of the document being processed. It might be a namespace or the value of an entity.

These elements consist of the <extract> element and its subelements.

Element	Attribute
`<extract>`	`sep`
`<extract>`	`all`

Subelement Type	Subelement	Attributes
Starting Point	`<from>`	None
	`<in>`	None
	`<id>`	`case`
		`recursive`
		`all`
Extraction	`<name>`	`case`
		`recursive`
		`all`
	`<key>`	`sep`
		`recursive`
		`all`
Replacement	`<replace>`	`name`
	`<replace>`	`case`
	`<append>`	None
	`<remove>`	None
	`<unique>`	None

Extract and its Components

The subelements of <extract> fall into three groups:

Starting Points
Extraction
Replacement

Text can also occur inside an <extract> element. Text is split on whitespace, and each item is interpreted as follows:
- If the text is a number N, it extracts the Nth node in the current set. Nodes are numbered from zero, and negative numbers count back from the last node.
- If the text starts with a pound sign (#), it is matched as a node type. The node types are those defined in the XPointer specification, plus locally-defined types. In addition, #all is defined, matching any node. Type matching is case-insensitive.
- Otherwise, it is matched as a "name." The name of an entity or attribute is the name it is defined to have; the name of an element is its tag name.
Text items are applied sequentially, so that, for example, ... li -1 extracts the last <li> element in the current set.
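The interpretation of whitespace-separated text items inside <extract> can be sketched in Python. This is a simplified model, not the PIA's code: the Node record and apply_selectors function are hypothetical, names are compared exactly (case handling is an assumption), and an out-of-range index is assumed to yield an empty set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a parse-tree node."""
    node_type: str   # e.g. "element" or "text"
    name: str        # tag, entity, or attribute name; "" if unnamed
    content: str = ""


def apply_selectors(nodes, text):
    """Apply each whitespace-separated item of `text` to the current set in turn."""
    current = list(nodes)
    for item in text.split():
        try:
            index = int(item)
        except ValueError:
            index = None
        if index is not None:
            # A number picks one node; negative numbers count from the end.
            in_range = -len(current) <= index < len(current)
            current = [current[index]] if in_range else []
        elif item.startswith("#"):
            # A leading '#' matches a node type, case-insensitively.
            wanted = item[1:].lower()
            current = [n for n in current
                       if wanted == "all" or n.node_type.lower() == wanted]
        else:
            # Anything else is matched as a name.
            current = [n for n in current if n.name == item]
    return current


nodes = [
    Node("element", "li", "first"),
    Node("text", "", "stray text"),
    Node("element", "p", "paragraph"),
    Node("element", "li", "last"),
]
last_item = apply_selectors(nodes, "li -1")
```

Because items are applied one after another, "li -1" first narrows the set to the <li> elements and then takes the last of them.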

Expansion Control Elements

Expansion control elements modify the processing of their contents, but are not conditional in the same way that control-structure operations are. No tests are performed.

Element	Attribute
`<expand>`	`hide`
`<protect>`	`result`
`<protect>`	`markup`
`<hide>`	`text`
`<debug>`	None
`<show-errors>`	None
`<pretty>`	`hide-above-depth`
	`hide-below-depth`
	`hide-below-tag`
	`white-tag`
	`yellow-tag`

Data Manipulation Elements

Data manipulation elements perform operations on data, typically text, that depend on some non-structural features of its content (e.g. its value as a number).

Element	Attribute
`<numeric>`	`sum`
	`difference`
	`product`
	`quotient`
	`remainder`
	`power`
	`sort`
	`reverse`
	`pairs`
	`sep`
	`digits`
	`integer`
	`extended`
	`modulus`
`<text>`	`pad`
	`trim`
	`width`
	`align`
	`sort`
	`reverse`
	`pairs`
	`sep`
	`split`
	`join`
	`encode`
	`decode`
`<subst>`	`match`
`<subst>`	`result`

External Resources

External "resources" include both documents local to the system on which the document processor resides (i.e. files), and remote resources (specified with complete URLs).

Element	Attribute
`<include>`	`src`
	`tagset`
	`entity`
	`quoted`
`<output>`	`dst`
	`append`
	`directory`
`<connect>`	`method`
	`src`
	`mode`
	`tagset`
	`entity`
	`result`
`<status>`	`src`
	`entity`
	`item`

Data Structure Elements

Data structure elements perform no operations. They represent common forms of complex structured data. Strictly speaking, <tagset> and <namespace> are data structure elements. Often a data structure element has a representation that is a subclass of the representation of an ordinary element (currently org.risource.dps.active.ParseTreeElement).

Element	Subelements	Attribute
`<DOCUMENT>`	`<protocol>`	None
	`<version>`
	`<code>`
	`<message>`
`<HEADERS>`	None	`element`
`<header>`	None	`name`
`<QUERY>`	None	`element`
`<URL>`	None	`protocol`
		`host`
		`port`
		`path`
		`reference`
		`query`