PIA Tagsets

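This document describes PIA tagsets: how they relate to DTDs, how they document themselves, how tags are named, how actions are attached to elements, which tagsets are predefined, and how tagsets and other resources are located and loaded.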

Tagsets and DTDs

A DTD (Document Type Definition) is an XML or SGML document containing the rules that specify the allowable content for a particular class of documents. Any DTD includes specifications for all of the elements needed by the document class. To learn more about DTDs, XML, and SGML, visit the W3C website.

A tagset definition file is essentially a mapping into XHTML of an SGML DTD. Tags are similar, in many respects, to the elements that make up a DTD. Much like a DTD, a tagset contains the definitions of the entities, elements and attributes that can be used in a document, along with their values and details of their syntax.

Tagsets differ from DTDs, however, in that they associate actions with the elements that make up the set. A single DTD can map to multiple tagsets, because tagsets are concerned not simply with document structure but also with the actions associated with active tags. For example, the PIA has two tagsets that have the same elements and DTD but different uses. The tagset tagset is used to define other tagsets. The tsdoc tagset is used to document the elements that make up a tagset.

A tagset specifies a superset of the information in a document's DTD (Document Type Definition), and is contained in a file with a .ts extension.

Tagsets are Self-Documenting

A tagset definition file can also be seen as an HTML file that documents a markup language. The standard HTML tags are augmented with a handful of XML constructs that actually define the language. Because both the formal and informal descriptions are mixed in the same file, it is easy to derive some of the documentation directly from the formal components, thus ensuring its correctness. This allows the developer to confine the informal parts of the documentation to descriptive material, without having to worry about keeping two equivalent versions of every definition in sync.

Tagset Names

Element tag names and attribute names start with a letter, followed by any sequence of letters, digits, period (.), hyphen (-), underscore (_), and colon (:) characters. Case is ignored in HTML, but significant in XML. XXML tag names are not case sensitive. Attribute names are always case sensitive.

Within active element tag names, the hyphen (-) is usually used to separate words. The colon (:) character serves a special function. That portion of the name before the colon designates a "namespace," and the portion after it a name within that space. Each user-defined active element, for example, defines a namespace with the element's tag name as its namespace name, and the names attributes and content defined within it.
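The naming rule and the namespace split can be sketched in Java as follows. This is a minimal illustration, not PIA code: the class TagsetNames and its methods are hypothetical, letters are assumed to be ASCII, and the split is assumed to happen at the first colon, none of which the text above specifies.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the PIA API. */
final class TagsetNames {
    // A letter followed by any run of letters, digits, '.', '-', '_' or ':'.
    // Assumption: "letter" means an ASCII letter.
    private static final Pattern NAME =
            Pattern.compile("[A-Za-z][A-Za-z0-9._:-]*");

    /** Returns true if the whole string is a legal tag or attribute name. */
    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    /** Returns the part before the first colon, or "" when there is no colon. */
    static String namespaceOf(String name) {
        int colon = name.indexOf(':');
        return colon < 0 ? "" : name.substring(0, colon);
    }

    /** Returns the part after the first colon, or the whole name without one. */
    static String localNameOf(String name) {
        int colon = name.indexOf(':');
        return colon < 0 ? name : name.substring(colon + 1);
    }

    public static void main(String[] args) {
        String name = "my-element:content"; // made-up example name
        System.out.println(isValidName(name));
        System.out.println(namespaceOf(name));
        System.out.println(localNameOf(name));
    }
}
```

Comparing tag names case-insensitively where the markup language calls for it is left out of the sketch.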

Actions and Tagsets

Actions can be associated with "active" elements. Whenever such an element appears in a document, a corresponding piece of code termed its "action" is "expanded." Locally-defined entities are bound to the element's attributes and content.

A collection of entities and elements designed to work together in an active document is called a "tagset," largely for historical reasons. (MetaHTML also uses "tagset" in this sense.) A tagset is essentially an extension of an SGML or XML Document Type Definition (DTD), although tagsets are usually defined using an XML-based notation which is easier to parse and to extend.

Predefined Tagsets

The PIA defines a standard tagset, called pia-xhtml, for use in creating active documents. Some agents may define their own tagsets as extensions to the standard one. Distinct tagsets can be defined for different kinds of document processing. Possible uses include parsing HTML, formatting, and translating SGML documents into HTML.

The following tagsets are currently predefined:

Any agent can have a private tagset. Its name is type-xhtml, where type is the agent's type. For most agents this is the same as the agent's name. DOFS and Toolbar are examples of type names.

How Resources are Located and Loaded

Tagsets make use of Java's resource mechanism, which allows arbitrary data files to exist in the same namespace as classes. Resources can be shipped around in JAR files, downloaded from the Internet, and, in most cases, obtained from directories in the CLASSPATH. This means that common tagsets can be defined in the same directory as the classes of the org.risource.dps.tagset package.
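Java's standard ClassLoader.getResourceAsStream call is one way to read such a resource. The sketch below is an illustration rather than the PIA's own loader: TagsetResources is a hypothetical class, tagset.ts is an assumed file name, and the file is assumed to be UTF-8 text. It needs Java 9 or later for InputStream.readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the PIA API. */
final class TagsetResources {
    /** Turns a package name and a file name into a classpath resource path. */
    static String resourcePath(String packageName, String fileName) {
        return packageName.replace('.', '/') + "/" + fileName;
    }

    /** Returns the resource's text, or null when it is not on the classpath. */
    static String readResource(String path) {
        ClassLoader loader = TagsetResources.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(path)) {
            if (in == null) {
                return null; // not found in any JAR or CLASSPATH directory
            }
            // Assumption: the resource is UTF-8 text.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "tagset.ts" is an assumed file name used only for this example.
        String path = resourcePath("org.risource.dps.tagset", "tagset.ts");
        String text = readResource(path);
        System.out.println(text == null ? "not found: " + path : "loaded " + path);
    }
}
```

Because the lookup goes through the class loader, the same call works whether the file sits in a directory on the CLASSPATH or inside a JAR.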

Tagsets and other resources (including external entities) can exist either as Java resources or as files located relative to the document being processed. They can also be located in the DPS, in an internal namespace.

When a DTD is specified in a document using a system identifier, the PIA searches for the corresponding .ts file. A tagset specified on a command line takes precedence over one specified in the document's doctype declaration. This allows documents to be processed with tagsets other than the one they were originally written for.

The search order for tagsets is as follows:

  1. First, the namespace is identified. An identifier followed by a colon indicates an explicit namespace. Otherwise, if the name contains slashes it is considered to be a file located relative to the current document; otherwise if it contains periods it is considered to be a resource. Tagsets with neither are sought first in the current directory (the explicit namespace file:), then in the package org.risource.dps.tagset in the explicit namespace DPS:.
  2. If the namespace is specified explicitly, that namespace performs the lookup.
  3. The tagset's name is first sought along the internal processing context chain defined by the topProcessor links in Context objects.
  4. The class tsname.class is sought provided that the name could be in the package/resource namespace. If it exists and implements the Tagset interface, it is loaded and instantiated.
  5. tsname.ts is sought, first in the current directory, then as a resource, to identify the correct namespace.
  6. tsname.obj is loaded as a serialized object, if it exists in the same location and is newer than tsname.ts.
  7. tsname.tss is processed using the minimal bootstrap tagset containing <tagset>, <define>, and their sub-elements, if it exists in the same location and is newer than tsname.ts. This is assumed to be a tagset "stripped" of its documentation.
  8. tsname.ts is processed using the tagset tagset.
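Step 1 of the search order above can be sketched as a small classifier. This is an illustration under stated assumptions, not the PIA's implementation: the class TagsetLocator and the enum Space are hypothetical names, and an "identifier" is assumed to be a letter followed by letters or digits.

```java
/** Hypothetical helper for illustration; not part of the PIA API. */
final class TagsetLocator {
    /** Where a tagset name should be looked up. */
    enum Space { EXPLICIT, FILE, RESOURCE, DEFAULT }

    /** Applies the tests of step 1 in the order the text gives them. */
    static Space classify(String name) {
        int colon = name.indexOf(':');
        if (colon > 0 && isIdentifier(name.substring(0, colon))) {
            return Space.EXPLICIT;  // for example "file:" or "DPS:"
        }
        if (name.indexOf('/') >= 0) {
            return Space.FILE;      // relative to the current document
        }
        if (name.indexOf('.') >= 0) {
            return Space.RESOURCE;  // a Java resource
        }
        // Neither: try the current directory (file:) first, then the
        // org.risource.dps.tagset package (DPS:).
        return Space.DEFAULT;
    }

    // Assumption: an identifier is a letter followed by letters or digits.
    private static boolean isIdentifier(String s) {
        if (s.isEmpty() || !Character.isLetter(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The file and resource names below are made up for the example.
        String[] names = { "DPS:tagset", "../lib/my-tags", "org.example.tags", "tsdoc" };
        for (String name : names) {
            System.out.println(name + " -> " + classify(name));
        }
    }
}
```

The slash test runs before the period test, so a relative path such as ../lib/my-tags is treated as a file even though it contains periods.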

Once a tagset is loaded, its name is added to the appropriate namespace.
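Steps 6 through 8 of the search order come down to a freshness comparison among three sibling files. A minimal sketch follows, assuming file modification times are the measure of "newer"; the class TagsetSourceChooser and its methods are hypothetical and not taken from the PIA sources.

```java
import java.io.File;

/** Hypothetical helper for illustration; not part of the PIA API. */
final class TagsetSourceChooser {
    /**
     * Picks the extension to load, given last-modified times in milliseconds.
     * A time of 0 stands for a missing file, which is what
     * File.lastModified() returns when the file does not exist.
     */
    static String chooseExtension(long tsModified, long objModified, long tssModified) {
        if (objModified > tsModified) {
            return ".obj"; // serialized tagset, newer than the source
        }
        if (tssModified > tsModified) {
            return ".tss"; // stripped tagset, newer than the source
        }
        return ".ts";      // fall back to the documented source
    }

    /** Applies the same rule to the files that sit next to the given .ts file. */
    static File choose(File ts) {
        String path = ts.getPath();
        String base = path.endsWith(".ts") ? path.substring(0, path.length() - 3) : path;
        File obj = new File(base + ".obj");
        File tss = new File(base + ".tss");
        String ext = chooseExtension(ts.lastModified(), obj.lastModified(), tss.lastModified());
        if (ext.equals(".obj")) {
            return obj;
        }
        if (ext.equals(".tss")) {
            return tss;
        }
        return ts;
    }

    public static void main(String[] args) {
        // "example.ts" is a made-up file name.
        System.out.println(choose(new File("example.ts")));
    }
}
```

Keeping the comparison in a pure function of three timestamps makes the rule easy to check without touching the file system.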

The search order used for entities is similar, except that entities cannot be represented as classes, and external entity names are expected to have an extension.

While processing a tagset's start tag, the tagset mentioned in its context is also loaded, followed by any tagsets mentioned in its include attribute. Tagsets and entities mentioned in resources are not looked for in locations relative to the current document, but only in those relative to the current resource.


Copyright © 1999 Ricoh Innovations, Inc. Open Source at <RiSource.org/PIA>.
$Id: tagsets.html,v 1.5 2001-01-11 23:36:47 steve Exp $