WOAD's API

Extending and Customizing WOAD

Abstract:

This document describes the structures and interfaces of the WOAD system at the level required for extending and customizing it. The intended audience is web-page and software developers -- its purpose is both to help you integrate WOAD into an existing suite of applications, and to help you make existing applications and documentation (such as cross-indexers, pretty-printers, and web-accessible reference material) useful in the WOAD environment.

The structures and interfaces described here correspond loosely to the data structures and application programming interfaces of a conventional but customizable software application (EMACS is perhaps the classic example). However, WOAD is a web application, not a word processor or software development kit -- the structures and interfaces described here are for the most part such things as directory structures, file formats, XML schemata, and naming conventions.

Contents


    1: The WOAD Tree
        1.1: The .notes subtree
        1.2: The .source subtree
        1.3: The .word subtree
        1.4: The .Woad subtree
    2: WOAD Web Pages and their Tagsets
    3: Index Files and Databases
    4: Word Contexts and Indices
        4.1: Specifics
        4.2: Occurrences
    5: Annotation and Documentation Files
    6: Listers, Parsers and Viewers

1: The WOAD Tree

Note: Watch out for empty files in Woad's annotation directories.
If somebody has done, e.g., touch ~/.pia/.words/foo and you later ask for /.words/foo you get a ``document contains no data'' error, which is perfectly correct but potentially confusing, especially in .words where you're expecting a directory listing. This can easily happen if someone is ``playing around'' in the annotation directories, creating files just to see what happens.

1.1: The .notes subtree

The ``subtree'' that contains Woad's page (URL) annotations has by far the simplest structure: every URL on the server corresponds to a directory in the .notes subtree that contains the annotations for that URL. Note that this is true even if the URL corresponds to an ordinary file with an extension: the annotation for /.../foo.html will be contained in a directory called /.notes/.../foo.html/.

There's only one additional ``catch'' -- most Woad servers don't use the prefix /.notes: the root of the annotation tree is the root of the Woad web server. This makes for a particularly trivial mapping between target URL's and Woad annotation URL's: just change the server name. If the Woad and target servers are on the same machine, it's even easier: just change the port number.
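
The mapping from a target URL to its annotation URL can be sketched as follows. This is only an illustration, not Woad's actual code: the function name annotation_url, the example host names, and the optional prefix argument (for servers that do keep the /.notes prefix) are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def annotation_url(target_url, woad_netloc, prefix=""):
    """Map a target URL to the URL of its annotation directory.

    woad_netloc is the host (and optional port) of the Woad server.
    prefix is "" for the usual case, or "/.notes" when annotations
    live on the same server as the target (an assumed convention).
    """
    parts = urlsplit(target_url)
    path = prefix + parts.path
    # Annotations for a URL live in a directory, even for ordinary files,
    # so the annotation path always ends with a slash.
    if not path.endswith("/"):
        path += "/"
    return urlunsplit((parts.scheme, woad_netloc, path, "", ""))
```

For instance, annotation_url("http://www.example.com/docs/foo.html", "www.example.com:8001") only changes the port, while passing prefix="/.notes" together with the target's own host keeps everything on one server.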

There are two exceptions to this. The first is for applications where the Woad annotations are on the same server as the target. (This is true of the PIA, for example; it might also be done on an intranet.) The second is for Woad trees that don't correspond to a web application at all -- for example, you might be using Woad as a directory-tree browser with annotations. This second technique allows Woad to be added to any development environment.

1.2: The .source subtree

Things are slightly more complex in the .source subtree, because it effectively ``overlays'' the actual source directory tree -- a URL of .source/.../foo.html is supposed to retrieve the source listing of foo.html, and this is done by actually retrieving foo.html and marking it up on the fly.

Underneath it all there are two parallel directory trees: one under the Woad server's document root (typically ~/.woad for a personal Woad server running on Unix), making the source annotation directory ~/.woad/.source, and one under the source root, passed to the Woad server on the command line with the source=path option. Whenever a URL starting with /.source is requested, the Woad server looks in both places: first under its own root, then under the source root.
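
The two-place lookup described above might look roughly like this. It is a minimal sketch under stated assumptions: the function name resolve_source is hypothetical, the real server does this inside the PIA's site-structure package, and the sketch makes no attempt to reject ``..'' path components.

```python
from pathlib import Path


def resolve_source(url_path, woad_root, source_root):
    """Return the file or directory that a /.source URL refers to.

    Looks first under the Woad server's own root (woad_root/.source),
    then under the source root. Returns None if neither has it, or if
    the URL is not in the .source subtree at all.
    """
    prefix = "/.source"
    if url_path != prefix and not url_path.startswith(prefix + "/"):
        return None
    relative = url_path[len(prefix):].lstrip("/")
    candidates = [
        Path(woad_root) / ".source" / relative,  # annotation-side overlay
        Path(source_root) / relative,            # the real source tree
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```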

Hence, annotations for foo.html have to be contained in a special .notes subdirectory of the annotation directory that corresponds to foo.html's parent directory. This could also have been done for directories (putting the annotations for the source directory .../bar/ under the annotation directory .../bar/.notes/), but it wasn't.

One advantage of this seemingly clumsy arrangement is that the necessary index files, and even Woad annotations, can be prepackaged and shipped with the source files of a Woad-aware web application. The biggest advantage, though, is that it lets the Woad server format HTML and XML source files by simply redefining the tags, in effect applying a (rather drastic) stylesheet-like transformation to them.

In fact, Woad goes farther still: it handles other languages, for example Perl and C, by parsing them into the same internal representation used by the XML and HTML parsers and then applying essentially the same set of style transformations.

Finally, some Woad administrators may choose to eliminate the distinction between source tree and source annotation tree, and keep the annotations directly in the sources all the time. This is not recommended when there are multiple developers and each developer has their own complete working copy of the source code (under a version control system like CVS) -- in this case it's better to share the annotations but keep the code separate. It does work when everyone is working in the same source tree, and a version control system like RCS is used to lock the files being edited.

Eventually we hope to support multiple developers by allowing both private (individual) and public (shared) notes, but implementing this will require significant extensions to the PIA's site-structure package.

1.3: The .word subtree

The characters permitted in WOAD words are the same as those permitted in XML tagnames and identifiers. A WOAD word may contain the characters ``_'' (underscore), ``-'' (hyphen), and ``.'' (period) in addition to letters and digits. Hence identifiers in most programming languages can be treated as words. A phrase can be turned into a word by, for example, replacing spaces with underscores and punctuation characters with hyphens. A word is not allowed to start or end with a period.

Not allowing words to end with period prevents ambiguities at the end of sentences and avoids trouble with operating systems that append a gratuitous period to filenames. Not allowing words to begin with a period follows the Unix convention of hiding filenames that start with a period -- some web servers will refuse to serve them, and most will leave them out of directory listings.
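
The word rules above are easy to check mechanically. The sketch below is an approximation: it accepts only ASCII letters and digits (XML names allow more, and have extra restrictions on the first character), and the function names is_valid_word and phrase_to_word are invented for this example.

```python
import re

# Letters, digits, underscore, hyphen and period (ASCII only: an assumption).
_WORD_RE = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_word(word):
    """True if word uses only permitted characters and neither starts
    nor ends with a period."""
    if not _WORD_RE.fullmatch(word):
        return False
    return not (word.startswith(".") or word.endswith("."))


def phrase_to_word(phrase):
    """Turn a phrase into a word: spaces become underscores, other
    punctuation becomes hyphens, leading/trailing periods are dropped."""
    out = []
    for ch in phrase.strip():
        if ch.isspace():
            out.append("_")
        elif (ch.isascii() and ch.isalnum()) or ch in "_-.":
            out.append(ch)
        else:
            out.append("-")
    return "".join(out).strip(".")
```

With these definitions, phrase_to_word("Polish my shoes, please.") gives "Polish_my_shoes-_please".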

Note, too, that case is significant in word directory names (WOAD is designed to run on Unix), but lookup is normally case-folded and will find all words that differ only in case. In some contexts (e.g. English text), words may be automatically lowercased at the beginnings of sentences, but this is not always the correct thing to do: I am grateful to Isaac Asimov for pointing out the fact that the pronunciation of POLISH is ambiguous, depending on whether you're talking about shoe polish or Polish shoes. Hence it is correct to lowercase all the words in ``Polish my shoes, please,'' but not in ``Polish shoes please me.''
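
A case-folded lookup over case-sensitive directory names can be sketched as below. The function works on a plain list of names (for example the result of listing a context directory), so it does not depend on the file system; the name case_folded_matches is an assumption of this example.

```python
def case_folded_matches(names, word):
    """Return every name that differs from word only in case,
    preserving the order in which the names were given."""
    key = word.casefold()
    return [name for name in names if name.casefold() == key]
```

The caller still sees the distinct directories ``Polish'' and ``polish'', and can present both to the user rather than silently merging them.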

The .word subtree is comparatively ``flat,'' initially having only two directory levels underneath it. (More levels can be added, forming a hierarchy of topics like Yahoo or DMOZ.)

The first level contains directories corresponding to ``contexts;'' the second level contains directories corresponding to words and phrases in those contexts. ``Miscellaneous'' words having no specific context have directories at the same level as the context directories: in this way every word potentially names a context, and every context's name is necessarily a word. This saves us the trouble of figuring out ahead of time whether an identifier is a context name or not.

The first and second levels can also contain index files that map words into URL's. All Woad-maintained annotations for words are contained in directories -- the index files are only used to refer to documents (including both documents on other servers, and source documents in the application).

It would be possible to allow files with names like word.ww, but this would lead to a conflict: is foo/definition.ww the definition of the word ``foo'', or is it the word ``definition'' in the context ``foo''? If words are always directories we can guarantee the first interpretation.

The complete set of Woad annotations for word can thus be found by looking for ``.word/word/'' and ``.word/*/word/''. The remaining (non-Woad) information could theoretically be found by looking for elements with the attribute name="word" in the index files ``.word/*/*.wi''.
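
Finding the Woad annotation directories for a word amounts to checking ``.word/word/'' and then each ``.word/*/word/''. A minimal sketch, assuming the default two-level layout and a hypothetical function name annotation_dirs:

```python
from pathlib import Path


def annotation_dirs(word_root, word):
    """Return the annotation directories for word under the .word
    subtree: the context-free one first, then one per context."""
    root = Path(word_root)
    found = []
    direct = root / word
    if direct.is_dir():
        found.append(direct)
    # Every first-level directory is potentially a context.
    for context in sorted(p for p in root.iterdir() if p.is_dir()):
        candidate = context / word
        if candidate.is_dir():
            found.append(candidate)
    return found
```

Plain files with the word's name are ignored, which matches the rule that words are always directories.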

This, however, would be unbearably tedious, so instead we use one of the following alternatives (exactly which alternative is used is up to the Woad server administrator -- in other words, you):

In the short term, the combination of file buffering and the caching of directories in the PIA's site package makes inverted index files quite efficient; in the long term the technique will probably break down somewhere in the low thousands of words.

At some point (possibly quite soon, since it's simple) we will probably extend the scheme to allow a hierarchy of context directories.

It's not entirely clear how to handle cross-references, i.e. finding all the places where a given identifier is used. Conceptually, at least, this is just an index, but maintaining it could be a problem. Probably it should have <xref> entries instead of <Word> entries.

1.4: The .Woad subtree

The .Woad subtree of the Woad tree is mapped in from its ``home'' location in the PIA install tree, pia:/Apps/Woad. The top level contains the following sorts of files:

It also contains the following subdirectories:

Note that, as usual, putting a file of your own somewhere under real-root/.Woad will override the original in PIA/Apps/Woad -- this means that you can easily customize your copy of the forms, tagsets, and so on that Woad is using.


2: WOAD Web Pages and their Tagsets

Outside of the .Woad subtree, WOAD only has five kinds of files that it treats specially: index files (.wi), annotation ``web'' pages (.ww), preprocessed source listings (.wl), active documents (.wh), and directories.

In the .source subtree, all other files are ``listed'' by processing them with an appropriate tagset. Elsewhere, they are simply ignored.

WOAD file types

    ext   tagset           description
    .wh   woad-xhtml       active web pages (equivalent to .xh pages in the PIA, but renamed to prevent conflicts).
    .ww   woad-web         ``woad web'' pages: annotations.
    .wi   woad-index       ``woad index'' files. Index files have no enclosing element, and hence require a document-wrapper element in the tagset.
    .wl   Tools/src-file   ``woad listing'' pages derived from source files.

 

Source-listing tagsets

    tagset              description
    Tools/src-wrapper   document wrapper element, shared by all source listing tagsets.
    Tools/src-file      Tagset for generic (pre-formatted) listing files.
    Tools/src-html      Tagset for HTML (and variants with the same tags, e.g. PHP and shtml).
    Tools/src-xhtml     Tagset for the PIA's extended HTML.
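
The choice of tagset by file type can be pictured as a pair of lookup tables. This is a sketch, not the server's real dispatch code: the source-file extensions listed (.html, .htm, .php, .shtml, .xh) and the fallback to Tools/src-file for everything else in the .source subtree are assumptions.

```python
import os

# Files that WOAD treats specially everywhere (from the file-type table).
WOAD_TAGSETS = {
    ".wh": "woad-xhtml",
    ".ww": "woad-web",
    ".wi": "woad-index",
    ".wl": "Tools/src-file",
}

# Source listings; the extension list is an assumption for illustration.
SOURCE_TAGSETS = {
    ".html": "Tools/src-html",
    ".htm": "Tools/src-html",
    ".php": "Tools/src-html",
    ".shtml": "Tools/src-html",
    ".xh": "Tools/src-xhtml",
}


def tagset_for(filename, in_source_subtree=False):
    """Return the tagset name for filename, or None if the file is ignored."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in WOAD_TAGSETS:
        return WOAD_TAGSETS[ext]
    if in_source_subtree:
        # Anything else under .source is listed; generic listing by default.
        return SOURCE_TAGSETS.get(ext, "Tools/src-file")
    return None
```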

 

Directory-listing files

    subtree   file                    description
    .source   Tools/src-listing.xh    Directory-listing document for source directories.
    .notes    Tools/woad-listing.xh   Directory-listing document for page annotation directories.
    .word     Tools/word-listing.xh   Directory-listing document for word index directories.

3: Index Files and Databases

An index file, with an extension of ``.wi'', is a simple list of XML elements. An index file has no enclosing element, no XML declaration, and no document type, making it a well-formed external parsed entity rather than a well-formed XML document. This convention greatly simplifies on-the-fly construction of index files, since they can be appended to without having to worry about removing and replacing the final end tag.
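
The append-only convention can be illustrated with a small external script. The sketch below is hedged: the element name Word and its name attribute follow the usage mentioned elsewhere in this document, but the href attribute, the function names, and the wrapper element name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def append_entry(index_path, name, href):
    """Append one element to an index file. No end tag has to be
    removed or replaced, because the file has no enclosing element."""
    entry = ET.Element("Word", {"name": name, "href": href})
    with open(index_path, "a", encoding="utf-8") as f:
        f.write(ET.tostring(entry, encoding="unicode") + "\n")


def read_entries(index_path):
    """Read an index file by wrapping its content in a synthetic root
    element, which turns the entity into a parseable document."""
    with open(index_path, encoding="utf-8") as f:
        body = f.read()
    root = ET.fromstring("<index>" + body + "</index>")
    return list(root)
```

The wrapping step in read_entries plays the same role as the document-wrapper element that the woad-index tagset has to supply.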

The present implementation of WOAD makes extensive use of the file system, including directories and XML index files, where another application might be expected to use a database. The PIA, the framework on which WOAD is built, makes this kind of thing easy and efficient, and using files makes it possible to use ordinary Unix tools (including shell and Perl scripts) to construct indices.

External Indexers

What's emerging in my investigation of LXR and other indexing tools is a realization that we need better ways of integrating WOAD with other tools.

One observation about indexers like LXR is that each one essentially defines a context within which certain classes of names are recognized. For example, LXR recognizes identifiers in C and C++ code, but ignores comments. Full-text indexers like FreeWAIS-sf avoid that problem, but are unable to identify declarations. So both have their place.

Indexing programs generally also come with a compatible search engine (typically a CGI, though in the case of the WAIS family there might also be a server version), and often a cross-reference listing program (for example LXR's) that makes recognized identifiers into links.

Integrating indexing programs into WOAD is complicated by the fact that they all use different database files, including application-specific binary files (WAIS), dbx and other database formats, and files tied to Perl hashes (LXR).


4: Word Contexts and Indices

4.1: Specifics

This section lists the default contexts and gives some information about how the indexing information is obtained:

files
    All filenames and directory names need to be indexed, both with and without extensions.
javadoc
    A comprehensive list of methods can be found in the top-level file index-all.html. Unfortunately it isn't always present; in particular, it's missing in Sun's online Java documentation. The list can be reliably computed only by recursively expanding the documentation and selecting name anchors that contain types in parentheses.
HTML and its derivatives
    The simplest way to index general HTML is to index only phrases enclosed in name anchors. In addition, ``active'' pages may contain declarations; these also have to be indexed.
Programming languages
    In object-oriented languages all package, class and function (method) declarations need to be indexed. Instance variables and global variables may need to be indexed in languages that have them (e.g. C and C++).
English text
    Normal text does not contain definitions. Places where words are used are treated in the next section. Files that do contain definitions (e.g. glossaries and manuals) may need special treatment, although in most cases they will actually be in HTML and have the necessary name anchors already in place in one form or another.
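
Both the general HTML case and the javadoc case come down to collecting name anchors. The following sketch uses Python's standard html.parser module; the class and function names are invented for the example, and the test for ``types in parentheses'' is deliberately crude (it only checks for a parenthesized suffix in the anchor name).

```python
from html.parser import HTMLParser


class NameAnchorCollector(HTMLParser):
    """Collect (name, enclosed text) pairs for every <a name="..."> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._name = None   # name of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            name = dict(attrs).get("name")
            if name:
                self._name = name
                self._text = []

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._name is not None:
            self.anchors.append((self._name, "".join(self._text).strip()))
            self._name = None


def name_anchors(html_text):
    """Return all (name, text) pairs found in html_text."""
    parser = NameAnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.anchors


def javadoc_methods(html_text):
    """Return anchor names that look like method signatures."""
    return [name for name, _ in name_anchors(html_text)
            if "(" in name and name.endswith(")")]
```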

4.2: Occurrences

Occurrences, i.e., places where a word is used (as opposed to defined), are best turned into links (to the word's page) ``on-the-fly'' while a page is being viewed. Locating and listing all of the places where an identifier occurs is an expensive process.

However, it's not impossible -- it ``merely'' requires a complete pass through all the files in the system. Carrying this off, though, requires either (1) having a good way to determine which words are going to need indexing, or (2) indexing everything (full text indexing).

So the simplest method is almost certainly going to be full-text indexing, which allows us to use a totally separate index for it. This implies that only words with indexed definitions will get links; others (e.g. ``this'') will have to be looked up via the form. There's a big advantage to this: it shows the user which words are actually worth looking up (in the sense that they have local definitions).
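
Turning occurrences into links on the fly, for defined words only, could be sketched like this. The /.word/word/ link target, the tokenizing pattern, and the function name link_occurrences are assumptions of the example; a real lister would work on the parsed document rather than on raw text.

```python
import html
import re

# A word: runs of letters, digits and underscores, optionally joined by
# single periods or hyphens (so a sentence-final period is not included).
_TOKEN = re.compile(r"[A-Za-z0-9_]+(?:[.-][A-Za-z0-9_]+)*")


def link_occurrences(text, defined_words):
    """Return text as HTML, with each word that has an indexed
    definition wrapped in a link to its word page."""
    out = []
    pos = 0
    for match in _TOKEN.finditer(text):
        # Escape whatever lies between words (spaces, punctuation).
        out.append(html.escape(text[pos:match.start()]))
        word = match.group(0)
        if word in defined_words:
            out.append('<a href="/.word/%s/">%s</a>' % (word, word))
        else:
            out.append(word)
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Words without a definition are left as plain text, which is what signals to the reader that there is nothing local to look up.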

There are two ways to anchor occurrences:


5: Annotation and Documentation Files


6: Listers, Parsers and Viewers

External Viewers

One of the things that emerged in my investigation of LXR and other indexing tools is a realization that we need better ways of integrating WOAD with other tools. This section explains how to do that.

Just as we will often want to use external tools for indexing, so we will often want to integrate external viewing and formatting tools. Examples include Javadoc and tsdoc, which generate documentation from source code, and Bonsai, which adds CVS annotation to source files.

There are three cases that need to be considered:

The main thing that needs to happen in order to integrate an external viewer is defining the mapping between the .source tree and the URL's provided by the viewer. There are several plausible methods:


Copyright © 2000 Ricoh Innovations, Inc.
$Id: api.html,v 1.6 2001-01-11 23:36:36 steve Exp $