A DTD, or Document Type Definition, is an XML or SGML document that contains the rules specifying the allowable content for a particular class of documents. Every DTD includes a list of specifications for all of the elements needed by the document class. To learn more about DTDs, XML, and SGML, visit the W3C website.
A tagset definition file is essentially a mapping into XHTML of an SGML DTD. Tags are similar, in many respects, to the elements that make up a DTD. Much like a DTD, a tagset contains the definitions of the entities, elements and attributes that can be used in a document, along with their values and details of their syntax.
Tagsets differ from DTDs, however, in that they associate actions with the elements that make up the set. A single DTD can map to multiple tagsets, because tagsets are concerned not simply with document structure but also with the actions associated with active tags. For example, the PIA has two tagsets that have the same elements and DTD but different uses. The tagset tagset is used to define other tagsets. The tsdoc tagset is used to document the elements that make up a tagset.
A tagset specifies a superset of the information in a document's DTD (Document Type Definition), and is contained in a file with a .ts extension.
A tagset definition file can also be seen as an HTML file that documents a markup language. The standard HTML tags are augmented with a handful of XML constructs that actually define the language. Because both the formal and informal descriptions are mixed in the same file, it is easy to derive some of the documentation directly from the formal components, thus ensuring its correctness. This allows the developer to confine the informal parts of the documentation to descriptive material, without having to worry about keeping two equivalent versions of every definition in sync.
Element tags and attributes start with a letter and contain any sequence of letters, digits, period (.), hyphen (-), underscore (_), and colon (:) characters. Case is ignored in HTML, but significant in XML. XXML tag names are not case sensitive. Attribute names are always case sensitive.
Within active element tag names, the hyphen (-) is usually used to separate words. The colon (:) character serves a special function. The portion of the name before the colon designates a "namespace," and the portion after it a name within that space. Each user-defined active element, for example, defines a namespace with the element's tag name as its namespace name, containing the names of the attributes and content defined within it.
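The naming rules above can be illustrated with a minimal Java sketch. The class TagNames and its methods are hypothetical and are not part of the PIA. The sketch assumes that "letters" means ASCII letters and that a name is split at its first colon; the text does not settle either point.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the PIA API. */
final class TagNames {
    // A letter followed by any sequence of letters, digits, '.', '-', '_' and ':'.
    // Assumption: only ASCII letters are accepted.
    private static final Pattern NAME =
        Pattern.compile("[A-Za-z][A-Za-z0-9._:-]*");

    /** Returns true if the name follows the syntax described in the text. */
    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    /** Returns the part before the first colon, or "" if there is no colon. */
    static String namespaceOf(String name) {
        int colon = name.indexOf(':');
        return colon < 0 ? "" : name.substring(0, colon);
    }

    /** Returns the part after the first colon, or the whole name if there is none. */
    static String localNameOf(String name) {
        int colon = name.indexOf(':');
        return colon < 0 ? name : name.substring(colon + 1);
    }

    public static void main(String[] args) {
        String name = "my-element:content";            // hypothetical name
        System.out.println(isValidName(name));         // true
        System.out.println(namespaceOf(name));         // my-element
        System.out.println(localNameOf(name));         // content
        System.out.println(isValidName("1st-item"));   // false: starts with a digit
    }
}
```

Case handling is deliberately left out, since the text gives different case rules for HTML, XML, and XXML names.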
Actions can be associated with "active" elements. Whenever such an element appears in a document, a corresponding piece of code termed its "action" is "expanded." Locally-defined entities are bound to the element's attributes and content.
A collection of entities and elements designed to work together in an active document is called a "tagset," largely for historical reasons. (MetaHTML also uses "tagset" in this sense.) A tagset is essentially an extension of an SGML or XML Document Type Definition (DTD), although tagsets are usually defined using an XML-based notation which is easier to parse and to extend.
The PIA defines a standard tagset, called pia-xhtm, for use in creating active documents. Some agents may define their own tagsets as extensions to the standard one. Distinct tagsets can be defined for different kinds of document processing. Possible uses include parsing HTML, formatting, and translating SGML documents into HTML.
The following tagsets are currently predefined:
Any agent can have a private tagset. Its name is type-xhtml, where type is the agent's type. For most agents this is the same as the agent's name. DOFS and Toolbar are examples of type names.
Tagsets make use of Java's resource mechanism, which allows arbitrary data files to exist in the same namespace as classes. Resources can be shipped around in JAR files, downloaded from the Internet, and, in most cases, obtained from directories in the CLASSPATH. This means that common tagsets can be defined in the same directory as the classes of the org.risource.dps.tagset package.
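As a rough illustration of the resource mechanism, the following Java sketch looks up a tagset file on the classpath. The class TagsetResources is hypothetical and does not reproduce the PIA's own loader. It assumes that a tagset named name is stored as name.ts in the directory corresponding to the org.risource.dps.tagset package.

```java
import java.io.InputStream;

/** Hypothetical helper for illustration only; not the PIA's actual loader. */
final class TagsetResources {
    // Package named in the text as the home of common tagsets.
    static final String TAGSET_PACKAGE = "org.risource.dps.tagset";

    /** Builds the classpath resource path for a tagset name (assumed layout). */
    static String resourcePath(String tagsetName) {
        return TAGSET_PACKAGE.replace('.', '/') + "/" + tagsetName + ".ts";
    }

    /** Returns a stream for the tagset resource, or null if it is not on the classpath. */
    static InputStream open(String tagsetName) {
        ClassLoader loader = TagsetResources.class.getClassLoader();
        return loader.getResourceAsStream(resourcePath(tagsetName));
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources skips close() when the resource is null.
        try (InputStream in = open("tsdoc")) {
            System.out.println(in == null ? "tsdoc.ts not found" : "tsdoc.ts found");
        }
    }
}
```

Because the lookup goes through the class loader, the same call works whether the file sits in a directory on the CLASSPATH or inside a JAR file.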
Tagsets and other resources (including external entities) can exist either as Java resources or as files located relative to the document being processed. They can also be located in the DPS, in an internal namespace.
When a DTD is specified in a document using a system identifier, the PIA searches for the corresponding .ts file. A tagset specified on a command line takes precedence over one specified in the document's doctype declaration. This allows documents to be processed with tagsets other than the one they were originally written for.
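The precedence rule above amounts to a simple fallback, sketched here in Java. The class and method names are hypothetical, and the sketch assumes that each source yields either a tagset name or nothing (null or an empty string).

```java
/** Hypothetical helper for illustration only; not part of the PIA API. */
final class TagsetSelection {
    /**
     * Returns the tagset name to use. Either argument may be null when the
     * corresponding source did not name a tagset.
     */
    static String select(String commandLineTagset, String doctypeTagset) {
        if (commandLineTagset != null && !commandLineTagset.isEmpty()) {
            return commandLineTagset;   // the command line takes precedence
        }
        return doctypeTagset;           // otherwise use the doctype declaration
    }

    public static void main(String[] args) {
        System.out.println(select("tsdoc", "tagset"));  // tsdoc
        System.out.println(select(null, "tagset"));     // tagset
    }
}
```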
The search order for tagsets is as follows:
as a file (file:), then in the package org.risource.dps.tagset
in the explicit namespace DPS:
via topProcessor links in Context objects
For a tagset named tsname, the following forms are tried in order:
tsname.class is sought, provided that the name could be in the package/resource namespace. If it exists and implements the Tagset interface, it is loaded and instantiated.
tsname.ts is sought, first in the current directory, then as a resource, to identify the correct namespace.
tsname.obj is loaded as a serialized object, if it exists in the same location and is newer than tsname.ts.
tsname.tss is processed using the minimal bootstrap tagset containing <tagset>, <define>, and their sub-elements, if it exists in the same location and is newer than tsname.ts. This is assumed to be a tagset "stripped" of its documentation.
tsname.ts is processed using the tagset tagset.
Once a tagset is loaded, its name is added to the appropriate namespace.
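The choice among the .obj, .tss, and .ts forms described above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the PIA's implementation: the class TagsetFormChooser is hypothetical, the tsname.class step is omitted, and only ordinary files in a single directory are considered. A timestamp of 0 stands for a missing file, which matches what java.io.File.lastModified() returns in that case.

```java
import java.io.File;

/** Hypothetical helper for illustration only; not the PIA's actual loader. */
final class TagsetFormChooser {
    /**
     * Picks the extension to load from three modification times.
     * A time of 0 means the file does not exist.
     * Returns null when tsname.ts itself is missing from this location.
     */
    static String chooseExtension(long tsTime, long objTime, long tssTime) {
        if (tsTime == 0L) {
            return null;              // no tsname.ts here, so nothing to load
        }
        if (objTime > tsTime) {
            return "obj";             // serialized object newer than the source
        }
        if (tssTime > tsTime) {
            return "tss";             // stripped tagset newer than the source
        }
        return "ts";                  // fall back to the full definition
    }

    /** Applies chooseExtension to the files for tsname in the given directory. */
    static File choose(File dir, String tsname) {
        String ext = chooseExtension(
            new File(dir, tsname + ".ts").lastModified(),
            new File(dir, tsname + ".obj").lastModified(),
            new File(dir, tsname + ".tss").lastModified());
        return ext == null ? null : new File(dir, tsname + "." + ext);
    }

    public static void main(String[] args) {
        // Looks for a tagset named "tsdoc" in the current directory.
        System.out.println(choose(new File("."), "tsdoc"));
    }
}
```

Keeping the timestamp comparison separate from the file access makes the preference order easy to check without touching the file system.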
The search order used for entities is similar, except that entities cannot be represented as classes, and external entity names are expected to have an extension.
While processing a tagset's start tag, the tagset mentioned in its context is also loaded, followed by any tagsets mentioned in its include attribute.
Tagsets and entities mentioned in resources are not looked for in locations relative to the current document, but only in those relative to the current resource.
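Resolving a reference relative to the current resource, rather than the current document, can be illustrated with java.net.URI. The sketch below is only an analogy: the class name and the file locations are hypothetical, and the PIA's own lookup code may work differently.

```java
import java.net.URI;

/** Hypothetical helper for illustration only; not part of the PIA API. */
final class ResourceRelative {
    /** Resolves a reference against the location of the resource that mentions it. */
    static URI resolve(URI currentResource, String reference) {
        return currentResource.resolve(reference);
    }

    public static void main(String[] args) {
        // Hypothetical locations, used only to show the difference.
        URI resource = URI.create("file:/opt/pia/tagsets/xhtml.ts");
        URI document = URI.create("file:/home/user/docs/index.xh");

        // A reference found inside the resource resolves next to the resource...
        System.out.println(resolve(resource, "basic.ts"));
        // ...and not next to the document being processed.
        System.out.println(resolve(document, "basic.ts"));
    }
}
```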