Basic Tagset Manual

Quick Reference Table

Note:
The links in the "Element Type" column refer to sections of this manual. Links in the "Element" and "Attribute" columns take you to the definitive, automatically generated documentation for that item, in tsdoc/basic.html.
Element Type             Element        Attributes                                  Subelements
Construct Specification  <define>       element, attribute, entity, notation        <value>, <action>
Tagset                   <tagset>       name, include, tagset, recursive            None
Namespace                <namespace>    name, hide, pass                            <bind>
                         <bind>         name                                        None
                         <get>          name                                        None
                         <let>          name                                        None
                         <set>          name                                        None
Documentation            <doc>          None                                        None
                         <note>         author                                      None
Control Structure        <if>           None                                        None
                         <repeat>       list, start, stop, step, entity             <foreach>, <for>, <while>,
                                                                                    <until>, <first>, <finally>
Logical                  <logical>      op (and, or)                                None
                         <test>         not, text                                   None
Document Structure       <extract>      sep, all                                    None
Expansion Control        <expand>       hide                                        None
                         <protect>      result, markup                              None
                         <hide>         markup, text                                None
                         <debug>        None                                        None
                         <show-errors>  None                                        None
                         <pretty>       hide-above-depth, hide-below-depth,         None
                                        hide-below-tag, white-tag, yellow-tag
Data Manipulation        <numeric>      op (sum, difference, product, quotient,     None
                                        remainder, power, sort (reverse, pairs,
                                        sep)), digits, integer, extended, modulus
                         <text>         pad, trim, width, align, sort, reverse,     None
                                        pairs, sep, split, join, encode, decode
                         <subst>        match, result                               None
                         <parse>        usenet, query, element, elements, attr,     None
                                        pairs
External Resources       <include>      src, tagset, entity                         None
                         <output>       dst, append                                 None
                         <connect>      method, src, mode, tagset, entity, result   None
                         <status>       src, entity, item                           None
Data Structure           <DOCUMENT>     protocol, version, code, message            None
                         <HEADERS>      element                                     None
                         <header>       name                                        None
                         <QUERY>        element                                     None
                         <URL>          protocol, host, port, path, reference,      None
                                        query

Introducing the Basic Tagset

This document introduces the basic tagset, the most fundamental of several predefined tagsets available to users and developers working with the PIA. The basic tagset defines all of the elements that make documents in the PIA "active."

The other commonly-used tagsets are:

Some additional tagsets are defined for special-purpose applications. For example, the tsdoc tagset is used to automatically generate documentation files from tagset files. The basic constructs are all that the average author will need in most cases.

Currently the best source of examples of tagset use is the demo agent. If you wish to view these examples without using the PIA, look at the listing of demo files.

This document includes a quick reference table with links to the automatically generated documentation in tsdoc/basic.html.

If you are unfamiliar with the documentation, you should first look at How to Read Tagset Descriptions.

The basic tagset consists of primitive tags that are predefined. These tags provide generally useful functions and can be used as is. Primitive tags can also be used as components of new tags that you create yourself.

Extending the primitive tagset requires a special tag, <define>, whose purpose is to construct other tags. The <define> tag specifies the tag name and all other necessary information about the tag, including its attributes.

Tag is used here in much the same way that element might be used in an XML or SGML DTD. Defining a tag requires the same information one might supply for a DTD element. This includes the following:

XHTML tags are active tags in that they constitute not merely markup but a directive to take some action, and they can require information beyond what XML/SGML element definitions need. This information reflects the special uses to which these tags are put. It includes the tag's handler, a specification of the Java class associated with the actions to be carried out when the element is present. The keyword handler can be followed by the name of the class to be used. If no class is named, the handler is assumed to have the same name as the tag.
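
The default-naming rule for handlers can be stated compactly. The Python sketch below only illustrates that rule; the function name resolve_handler is hypothetical, is not part of the PIA, and does not model how the processor maps a handler name to a Java class.

```python
from typing import Optional


def resolve_handler(tag_name: str, named_class: Optional[str] = None) -> str:
    """Return the handler name for a tag defined with the handler keyword.

    named_class is the class name written after the handler keyword,
    or None when the keyword appears without a class name.
    (Hypothetical helper, for illustration only.)
    """
    # Rule from the text: with no class named, the handler takes the tag's name.
    return named_class if named_class else tag_name
```

Under this rule, a tag named elsf with no class named resolves to elsf, while a tag named elif whose definition names the class elsf also resolves to elsf.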

When viewed as active tags, XHTML tags are much like Java or C++ methods. They have a name, a documentation field, optional arguments, and they perform some sort of action. A combination of HTML, XML, or other XHTML tags can be used in the action clause. The last item evaluated in the action clause is returned, so long as it has a return value.

Viewed in this light, a tag's attributes are analogous to the arguments that may be passed into a function.

A simple tag definition for the user-defined tag my_tag is presented here:

<define element=my_tag>
<doc> Given a string attribute, prints that string in a bold font.</doc>
<define attribute=my_attr required></define>
<action>
<b>&attributes:my_attr;</b>
</action>
</define>

The <doc> element serves to document the tag's actions. The <action> element specifies the action taken when this tag is evaluated. In this case it prints the tag's attribute using a bold font. The following example shows how, once defined, this tag might be used in an active document.

<my_tag my_attr="Hello World"></my_tag>
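
To make the method analogy concrete, the tag can be compared with an ordinary function. The Python function below is only a stand-in written for this comparison, not how the PIA evaluates the tag: the required attribute becomes a required parameter, and the result of the action becomes the return value.

```python
def my_tag(my_attr: str) -> str:
    """Python stand-in for the my_tag element (illustration only)."""
    # The action wraps the attribute value in <b>...</b>; as the last item
    # evaluated in the action, that markup is what the tag returns.
    return f"<b>{my_attr}</b>"


print(my_tag("Hello World"))
```

Calling the function without an argument raises an error, which loosely mirrors declaring my_attr as required.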

Noteworthy

Tagset Overview

The tagset categories and their elements described in this document are listed in the Quick Reference Table. Each element and attribute name in the table is linked to a tagset definition file for that element. If you are unfamiliar with these files, read the following section before linking to a definition.

How to Read Tagset Descriptions

A tagset definition file consists of a sequence of <define> elements. Some of these elements are nested inside others. The outermost <define> elements define SGML elements and entities. Nested inside each element definition are the definitions of its attributes. Any definition can also contain documentation.

When converted to HTML for documentation purposes, the nested attribute definitions are indented. Documentation elements are nested one more level and typeset in italics. The </define>, <doc>, and </doc> tags are omitted.

Certain elements are only meaningful inside of other elements. For example, <then> elements only occur inside <if> elements. By convention, the definitions of these elements follow that of their parent element.

Construct Specification Elements

The construct specification elements are used to create a tagset. They include the <define> element and its subelements.

Subelements of <define>
<value>
<action>

The <define> element can be used to specify any of the following tag types:

The <define> element must be predefined for bootstrapping, but it is not in the tagset unless placed there.

The tagset is not recursive. For that reason, tags cannot be used as actions.

The <define> element can occur outside of a <namespace> or <tagset> element because there is always a "current" namespace and tagset in effect.

A <define> element that contains neither a <value> nor an <action> subelement defines only syntax. The defined construct is simply passed through to the output by the processor, with its contents and attributes, if any, processed in turn.

A <define> element can contain anything at all. All content, with the exception of the <value>, <action>, and possibly <doc> elements, is discarded. This means that a definition can contain arbitrary decorative markup, and that arbitrary computation can be done in the course of processing a definition.

A construct can be "defined" more than once. In such cases, the attributes are effectively merged. The associated value and/or action are replaced. This technique is used to associate a new value with a construct, and to associate an action with a construct that has already been defined.
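
The merge-and-replace behavior of repeated definitions can be modeled in a few lines. The Python sketch below is a toy model, not the processor's implementation: the Construct class and the define function are hypothetical names, and it assumes that a redefinition which supplies no value or action leaves the existing one in place.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class Construct:
    """Toy record of what a <define> has established for one name."""
    attributes: Set[str] = field(default_factory=set)
    value: Optional[str] = None
    action: Optional[str] = None


def define(table: Dict[str, Construct], name: str,
           attributes: Iterable[str] = (),
           value: Optional[str] = None,
           action: Optional[str] = None) -> Construct:
    """Define or redefine a construct in the table (illustrative model)."""
    construct = table.setdefault(name, Construct())
    # Attributes from every definition are merged.
    construct.attributes |= set(attributes)
    # A newly supplied value or action replaces the previous one.
    # Assumption: omitting one keeps whatever was defined before.
    if value is not None:
        construct.value = value
    if action is not None:
        construct.action = action
    return construct


table: Dict[str, Construct] = {}
define(table, "my_tag", attributes=["my_attr"])
define(table, "my_tag", attributes=["other"], action="<b>text</b>")
```

After the second call, my_tag carries both attributes and has gained an action, which is the pattern the text describes for attaching an action to a construct that is already defined.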

Construct Specification Attributes

The following attribute of the <define> element is used to specify the type of construct being defined and whether it is required or optional. The name of the element defined is expressed as the value of the element attribute.

The attributes for <define element> are of the form

<define attribute='construct_type' optional>

They are summarized here and described in more detail below:

Construct Type
element
attribute
entity
notation

Modifiers for Structure Constructors

General Modifiers

The following sections describe the available modifiers for the structure constructor elements.

In working with these modifiers consider the following:

Modifier Type                   Modifier
General modifiers for <define>  handler
For <define element>            quoted, literal, text, no-text, empty
For <define attribute>          optional, required, fixed
For <define entity>             system, public, mode, method, NDATA, tagset, retain, parameter

Tagset and Namespace

The <tagset> and <namespace> elements provide the context in which <define> operates, i.e., in which elements, entities, and so on are defined.

It is, however, meaningful for <define> to occur outside of a <namespace> or <tagset> element because there is always a "current" namespace and tagset in effect.

Name Definition Elements

The <namespace> element provides the context in which <define entity>, <get>, <set>, <let>, and <bind> operate, i.e., in which names are associated with values. It's best to think of a namespace as a collection of what most programming languages call "variables".

It is, however, meaningful for <define>, etc. to occur outside of a <namespace> element because there is always a "current" namespace in effect. The outermost namespace in a document is called (has the prefix) "VAR:", because it contains the document processor's variables.

Namespaces are "nested": if a name is not defined in the current namespace, <get> will look "up the stack" to find a namespace that does contain it. Inner namespaces always have the name of the element (tag) that defines them; the following elements define namespaces:

The difference between <let> and <set> is that if a variable already has a value, <set> will simply change its value no matter which namespace it's defined in. If no such variable already exists, <set> will create one in the outermost namespace, VAR:. On the other hand, <let> will always set a variable in the innermost namespace, and will create a new one there if necessary.
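
The lookup and assignment rules for <get>, <set>, and <let> can be modeled with a stack of dictionaries. The Python sketch below is a toy model under stated assumptions, not the PIA's implementation: the NamespaceStack class and its method names are hypothetical, <set> is assumed to update the nearest namespace that already defines the name, and an undefined name is assumed to yield None.

```python
from typing import Dict, List, Optional


class NamespaceStack:
    """Toy model of nested namespaces; frames[0] plays the role of VAR:."""

    def __init__(self) -> None:
        self.frames: List[Dict[str, str]] = [{}]

    def push(self) -> None:
        """Enter an inner namespace."""
        self.frames.append({})

    def pop(self) -> None:
        """Leave the innermost namespace (the outermost one is kept)."""
        if len(self.frames) > 1:
            self.frames.pop()

    def get(self, name: str) -> Optional[str]:
        """Look "up the stack" from the innermost namespace outward."""
        for frame in reversed(self.frames):
            if name in frame:
                return frame[name]
        return None  # assumption: undefined names have no value

    def set(self, name: str, value: str) -> None:
        """Change an existing variable wherever it lives, else create it in VAR:."""
        for frame in reversed(self.frames):
            if name in frame:
                frame[name] = value
                return
        self.frames[0][name] = value

    def let(self, name: str, value: str) -> None:
        """Always assign in the innermost namespace."""
        self.frames[-1][name] = value


ns = NamespaceStack()
ns.set("x", "outer")   # no x exists yet, so it is created in VAR:
ns.push()
ns.let("x", "inner")   # shadows the outer x in the inner namespace
ns.set("y", "new")     # y is undefined everywhere, so it lands in VAR:
```

In this model the inner x hides the outer one until the inner namespace is popped, while y remains visible afterwards because it was created in the outermost namespace.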

The <bind> element is used almost exclusively for initialization: its contents are not expanded and the name cannot contain a namespace prefix. It always defines its variable in the innermost namespace. Because it makes no attempt to expand its contents, <bind> is significantly more efficient when it can be used. You will usually see it in XML code resulting from the output of a namespace; this allows namespaces and things that resemble namespaces (e.g., Agents) to be read in efficiently.

Documentation Elements

The elements <doc> and <note> are subelements of <tagset> and <namespace>. They are processed by the tsdoc tagset to automatically construct the text portion of tagset documentation files.

Control Structure Elements

Control structure elements modify the control flow of an expansion, by selectively including, skipping, or repeating some content. The control structure elements are <if> and <repeat>.

The control structure elements are summarized here:

<if> and its Components

Subelement  Parent                           Handler
<then>      <if>, <else-if>, <elif>, <elsf>  quoted
<else>      <if>                             quoted
<else-if>   <if>                             elsf
<elsf>      <if>
<elif>      <if>                             elsf

<repeat> and its Components

The contents of a <repeat> are repeatedly expanded. All of the following subelements are effectively iterating in parallel, which makes it easy to go through multiple lists and number the corresponding elements.

Subelement  Attribute
<foreach>   entity
<for>       entity, start, stop, step
<start>     None
<stop>      None
<step>      None
<while>     None
<until>     None
<first>     None
<finally>   None
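
The idea of subelements iterating in parallel, such as walking a list while numbering its items, can be illustrated outside the tagset. The Python sketch below is only an analogy: the function repeat_parallel is a hypothetical name, the list stands in for what a <foreach> might supply and the counter for what a <for> might supply, and stopping when the list runs out is an assumption rather than documented <repeat> behavior.

```python
from itertools import count
from typing import Iterable, List


def repeat_parallel(items: Iterable[str], start: int = 1, step: int = 1) -> List[str]:
    """Advance a counter and a list together, one step per repetition."""
    # zip pairs the two iterators and stops when the list is exhausted
    # (assumed stopping rule; the counter alone would never end).
    return [f"{n}. {item}" for n, item in zip(count(start, step), items)]


for line in repeat_parallel(["alpha", "beta", "gamma"]):
    print(line)
```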

Logical Elements

The logical elements are <logical> and <test>.

Element    Attribute
<logical>  op (and, or)
<test>     text, not, zero, positive, negative, numeric, match, exact, case, null

Document Structure Elements

Document structure elements extract nodes or sets of nodes from a parse tree, and perform structural modifications on trees. The tree being operated on need not be part of the document being processed. It might be a namespace or the value of an entity.

These elements consist of the <extract> element and its subelements.

Element    Attribute
<extract>  sep, all

Subelement Type  Subelement  Attributes
Starting Point   <from>      None
                 <in>        None
                 <id>        case, recursive, all
Extraction       <name>      case, recursive, all
                 <key>       sep, recursive, all
Replacement      <replace>   name, case
                 <append>    None
                 <remove>    None
                 <unique>    None

Extract and its Components

The subelements of extract fall into three groups:

Expansion Control Elements

Expansion control elements modify the processing of their contents, but are not conditional in the same way that control-structure operations are. No tests are performed.

Element        Attribute
<expand>       hide
<protect>      result, markup
<hide>         text
<debug>        None
<show-errors>  None
<pretty>       hide-above-depth, hide-below-depth, hide-below-tag, white-tag, yellow-tag

Data Manipulation Elements

Data manipulation elements perform operations on data, typically text, that depend on some non-structural features of its content (e.g. its value as a number).

Element    Attribute
<numeric>  sum, difference, product, quotient, remainder, power, sort, reverse, pairs, sep, digits, integer, extended, modulus
<text>     pad, trim, width, align, sort, reverse, pairs, sep, split, join, encode, decode
<subst>    match, result

External Resources

External "resources" include both documents local to the system on which the document processor resides (i.e. files), and remote resources (specified with complete URLs).

Element    Attribute
<include>  src, tagset, entity, quoted
<output>   dst, append, directory
<connect>  method, src, mode, tagset, entity, result
<status>   src, entity, item

Data Structure Elements

Data structure elements perform no operations. They represent common forms of complex structured data. Strictly speaking, <tagset> and <namespace> are data structure elements. Often a data structure element has a representation that is a subclass of the representation of an ordinary element (currently org.risource.dps.active.ParseTreeElement).

Element     Subelements                               Attribute
<DOCUMENT>  <protocol>, <version>, <code>, <message>  None
<HEADERS>   None                                      element, name
<Query>     <Query>                                   element
<URL>       None                                      protocol, host, port, path, reference, query

Copyright © 1999 by Ricoh Innovations, Inc.