This document describes the structures and interfaces of the WOAD system at the level required for extending and customizing it. The intended audience is web-page and software developers -- its purpose is both to help you integrate WOAD into an existing suite of applications, and to help you make existing applications and documentation (such as cross-indexers, pretty-printers, and web-accessible reference material) useful in the WOAD environment.
The structures and interfaces described here correspond loosely to the data structures and application programming interfaces of a conventional but customizable software application (EMACS is perhaps the classic example). However, WOAD is a web application, not a word processor or software development kit -- the structures and interfaces described here are for the most part such things as directory structures, file formats, XML schemata, and naming conventions.
.notes subtree
.source subtree
.word subtree
.Woad subtree
If you run touch ~/.pia/.words/foo and later ask for /.words/foo, you get a ``document contains no data'' error, which is perfectly correct but potentially confusing, especially in .words where you're expecting a directory listing. This can easily happen if someone is ``playing around'' in the annotation directories, creating files just to see what happens.
.notes subtree
The ``subtree'' that contains Woad's page (URL) annotations has by far the simplest structure: every URL on the server corresponds to a directory in the .notes subtree that contains the annotations for that URL. Note that this is true even if the URL corresponds to an ordinary file with an extension: the annotation for /.../foo.html will be contained in a directory called /.notes/.../foo.html/.
There's only one additional ``catch'' -- most Woad servers don't use the prefix /.notes: the root of the annotation tree is the root of the Woad web server. This makes for a particularly trivial mapping between target URL's and Woad annotation URL's: just change the server name. If the Woad and target servers are on the same machine, it's even easier: just change the port number.
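
The mapping just described can be sketched in a few lines. This is only an illustration, assuming the common case where the annotation tree is rooted at the Woad server's root; the function name, host name, and port number are hypothetical and not part of Woad.

```python
from urllib.parse import urlsplit, urlunsplit


def annotation_url(target_url, woad_host, woad_port=None):
    """Map a target URL to its annotation URL on a separate Woad server.

    Assumes the annotation tree is rooted at the Woad server's root,
    so only the server name (or the port number) changes.
    """
    parts = urlsplit(target_url)
    netloc = woad_host if woad_port is None else f"{woad_host}:{woad_port}"
    # Annotations live in a directory named after the target URL,
    # so make sure the path ends with a slash.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, netloc, path, "", ""))
```

For a target and a Woad server on the same machine, only the port argument differs from the target's own address.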
There are two exceptions to this. The first is for applications where the Woad annotations are on the same server as the target. (This is true of the PIA, for example; it might also be done on an intranet.) The second is for Woad trees that don't correspond to a web application at all -- for example, you might be using Woad as a directory-tree browser with annotations. This second technique allows Woad to be added to any development environment.
.source subtree
Things are slightly more complex in the .source subtree, because it effectively ``overlays'' the actual source directory tree -- a URL of .source/.../foo.html is supposed to retrieve the source listing of foo.html, and this is done by actually retrieving foo.html and marking it up on the fly.
Underneath it all there are two parallel directory trees: one under the Woad server's document root (typically ~/.woad for a personal Woad server running on Unix), making the source annotation directory ~/.woad/.source, and one under the source root, passed to the Woad server on the command line with the source=path option. Whenever a URL starting with /.source is requested, the Woad server looks in both places: first under its own root, then under the source root.
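
The two-step lookup can be sketched as follows. This is a minimal illustration of the search order only, not the Woad server's actual code; the function name and arguments are hypothetical, and it does no checking for paths that escape the roots.

```python
from pathlib import Path


def resolve_source(url_path, woad_root, source_root):
    """Return the first existing path for a /.source URL, or None.

    Looks under the Woad server's own root first, then under the
    source root, mirroring the lookup order described above.
    """
    prefix = "/.source"
    if url_path != prefix and not url_path.startswith(prefix + "/"):
        return None  # not a source-listing URL
    relative = url_path[len(prefix):].lstrip("/")
    candidates = [
        Path(woad_root) / ".source" / relative,  # source annotation tree
        Path(source_root) / relative,            # actual source tree
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```

Because the Woad root is checked first, a file placed there shadows the file of the same name in the source tree.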
Hence, annotations for foo.html have to be contained in a special .notes subdirectory of the annotation directory that corresponds to foo.html's parent directory. This could also have been done for directories (putting the annotations for the source directory .../bar/ under the annotation directory .../bar/.notes/), but it wasn't.
One advantage of this seemingly clumsy arrangement is that the necessary index files, and even Woad annotations, can be prepackaged and shipped with the source files of a Woad-aware web application. The biggest advantage, though, is that it lets the Woad server format HTML and XML source files by simply redefining the tags, in effect applying a (rather drastic) stylesheet-like transformation to them.
In fact, Woad goes farther still: it handles other languages, for example Perl and C, by parsing them into the same internal representation used by the XML and HTML parsers and then applying essentially the same set of style transformations.
Finally, some Woad administrators may choose to eliminate the distinction between source tree and source annotation tree, and keep the annotations directly in the sources all the time. This is not recommended when there are multiple developers and each developer has their own complete working copy of the source code (under a version control system like CVS) -- in this case it's better to share the annotations but keep the code separate. It does work when everyone is working in the same source tree, and a version control system like RCS is used to lock the files being edited.
Eventually we hope to support multiple developers by allowing both private (individual) and public (shared) notes, but implementing this will require significant extensions to the PIA's site-structure package.
.word subtree
The characters permitted in WOAD words are the same as those permitted in XML tagnames and identifiers. A WOAD word may contain the characters ``_'' (underscore), ``-'' (hyphen), and ``.'' (period) in addition to letters and digits. Hence identifiers in most programming languages can be treated as words. A phrase can be turned into a word by, for example, replacing spaces with underscores and punctuation characters with hyphens. A word is not allowed to start or end with a period.
Not allowing words to end with period prevents ambiguities at the end of sentences and avoids trouble with operating systems that append a gratuitous period to filenames. Not allowing words to begin with a period follows the Unix convention of hiding filenames that start with a period -- some web servers will refuse to serve them, and most will leave them out of directory listings.
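
The word rules above can be checked mechanically. The sketch below is an illustration under stated assumptions: it restricts ``letters'' to ASCII, and the function names are hypothetical rather than part of WOAD.

```python
import re
import string

# Letters, digits, "_", "-", and "."; no leading or trailing period.
# Restricting "letters" to ASCII is an assumption of this sketch.
WORD_RE = re.compile(r"(?!\.)[A-Za-z0-9_.-]+(?<!\.)")


def is_word(text):
    """Return True if `text` satisfies the word rules described above."""
    return WORD_RE.fullmatch(text) is not None


def phrase_to_word(phrase):
    """Turn a phrase into a word: spaces -> "_", punctuation -> "-"."""
    chars = []
    for ch in phrase.strip():
        if ch.isspace():
            chars.append("_")
        elif ch in string.punctuation and ch not in "_-.":
            chars.append("-")
        else:
            chars.append(ch)
    # A word may not start or end with a period.
    return "".join(chars).strip(".")
```

A phrase containing non-ASCII letters passes through phrase_to_word unchanged but is rejected by is_word; a fuller version would widen the character class.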
Note, too, that case is significant in word directory names (WOAD is designed to run on Unix), but lookup is normally case-folded and will find all words that differ only in case. In some contexts (e.g. English text), words may be automatically lowercased when they appear at the beginnings of sentences, but this is not always the correct thing to do: I am grateful to Isaac Asimov for pointing out the fact that the pronunciation of POLISH is ambiguous, depending on whether you're talking about shoe polish or Polish shoes. Hence it is correct to lowercase all the words in ``Polish my shoes, please,'' but not in ``Polish shoes please me.''
The .word subtree is comparatively ``flat,'' initially having only two directory levels underneath it. (More levels can be added, forming a hierarchy of topics like Yahoo or DMOZ.)
The first level contains directories corresponding to ``contexts;'' the second level contains directories corresponding to words and phrases in those contexts. ``Miscellaneous'' words having no specific context have directories at the same level as the context directories: in this way every word potentially names a context, and every context's name is necessarily a word. This saves us the trouble of figuring out ahead of time whether an identifier is a context name or not.
The first and second levels can also contain index files that map words into URL's. All Woad-maintained annotations for words are contained in directories -- the index files are only used to refer to documents (including both documents on other servers, and source documents in the application).
It would be possible to allow files with names like word.ww, but this would lead to a conflict: is foo/definition.ww the definition of the word ``foo'', or is it the word ``definition'' in the context ``foo''? If words are always directories we can guarantee the first interpretation.
The complete set of Woad annotations for word can thus be found by looking for ``.word/word/'' and ``.word/*/word/''. The remaining (non-Woad) information could theoretically be found by looking for elements with the attribute name="word" in the index files ``.word/*/*.wi''.
This, however, would be unbearably tedious, so instead we use one of the following alternatives (exactly which alternative is used is up to the Woad server administrator -- in other words, you):
``.word/.inv/word.wi''. This is by far the simplest to implement. The .inv subdirectory keeps the inverted indices out of the top level so that they don't show up on listings.
In the short term, the combination of file buffering and the caching of directories in the PIA's site package make inverted index files quite efficient; in the long term the technique will probably break down somewhere in the low thousands of words.
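
The directory lookup described earlier (``.word/word/'' and ``.word/*/word/'') can be sketched as a short function. This is an illustration only: the function name is hypothetical, and skipping context directories whose names start with a period (such as .inv) is an assumption of the sketch.

```python
from pathlib import Path


def word_annotation_dirs(word_root, word):
    """Return the annotation directories for `word`.

    Checks word_root/word/ (no specific context) and then
    word_root/<context>/word/ for every context directory.
    """
    root = Path(word_root)
    found = []
    if (root / word).is_dir():
        found.append(root / word)
    for context in sorted(p for p in root.iterdir() if p.is_dir()):
        if context.name.startswith("."):
            continue  # e.g. .inv, which holds inverted indices
        if (context / word).is_dir():
            found.append(context / word)
    return found
```

This covers only the Woad-maintained annotations; entries in index files still need one of the alternatives listed above.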
At some point (possibly quite soon, since it's simple) we will probably extend the scheme to allow a hierarchy of context directories.
It's not entirely clear how to handle cross-references, i.e. finding all the places where a given identifier is used. Conceptually, at least, this is just an index, but maintaining it could be a problem. Probably it should have <xref> entries instead of <Word> entries.
.Woad subtree
The .Woad subtree of the Woad tree is mapped in from its ``home'' location in the PIA install tree, pia:/Apps/Woad.
The top level contains the following sorts of files:
.ts Tagsets -- These files are part of the PIA's executable web-page framework: they specify the actions (expansions) associated with tags in active pages.
.xh Active Web Pages -- These are the pages that provide the help and ``home'' pages for Woad. These pages are in ``XHTML'' and contain Woad- and PIA-specific tags; they can only be viewed correctly inside a running Woad server.
.html HTML Documents -- for example, this file. The main difference between the HTML pages and the active pages is that the HTML pages are pure HTML, and can be read ``stand-alone'' by pointing a web browser directly at the corresponding file. This makes them usable before installing Woad or in situations where Woad is, for one reason or another, not working and you need to figure out why.
.xcf XML Configuration Files --
.inc, .xci XML Include Files -- These are meant to be included (automatically inserted using the <include> tag) in other XML files, and do not stand alone. They are useful when a block of code needs to be used on several different web pages. For example, app-info.inc implements the table of current command-line parameters and Woad-tree subtrees. Another common use for include files is to isolate a block of code or text that's more likely to be customized than its enclosing document.
It also contains the following subdirectories:
Tools -- This subdirectory contains the active web pages that do things, including the form for creating new notes and the active pages that list directories.
Words -- This subdirectory is re-mapped as the virtual root of the .word subtree. This allows word index files, for example, to be preconstructed and shipped as part of the Woad distribution.
Note that, as usual, putting a file of your own somewhere under real-root/.Woad will override the original in PIA/Apps/Woad -- this means that you can easily customize your copy of the forms, tagsets, and so on.
Outside of the .Woad subtree, WOAD only has five kinds of files that it treats specially: index files (.wi), annotation ``web'' pages (.ww), preprocessed source listings (.wl), active documents (.wh), and directories.
In the .source subtree, all other files are ``listed'' by processing them with an appropriate tagset. Elsewhere, they are simply ignored.
ext | tagset | description |
---|---|---|
.wh | woad-xhtml | active web pages (equivalent to .xh pages in the PIA, but renamed to prevent conflicts). |
.ww | woad-web | ``woad web'' pages: annotations. |
.wi | woad-index | ``woad index'' files. Index files have no enclosing element, and hence require a document-wrapper element in the tagset. |
.wl | Tools/src-file | ``woad listing'' pages derived from source files. |
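
As a small illustration, the extension table above amounts to a lookup from file extension to tagset name. The sketch below is not WOAD code (WOAD is built on the PIA, not Python), and the function name is hypothetical; it only restates the table in executable form.

```python
from pathlib import PurePosixPath

# Extension-to-tagset mapping taken from the table above.
TAGSET_BY_EXTENSION = {
    ".wh": "woad-xhtml",
    ".ww": "woad-web",
    ".wi": "woad-index",
    ".wl": "Tools/src-file",
}


def tagset_for(filename):
    """Return the tagset for a specially treated file, or None."""
    return TAGSET_BY_EXTENSION.get(PurePosixPath(filename).suffix)
```

Files with any other extension return None here; in the .source subtree they would instead be listed with one of the source-listing tagsets.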
tagset | description |
---|---|
Tools/src-wrapper | document wrapper element, shared by all source listing tagsets. |
Tools/src-file | Tagset for generic (pre-formatted) listing files. |
Tools/src-html | Tagset for HTML (and variants with the same tags, e.g. PHP and shtml). |
Tools/src-xhtml | Tagset for the PIA's extended HTML. |
subtree | file | description |
---|---|---|
.source | Tools/src-listing.xh | Directory-listing document for source directories. |
.notes | Tools/woad-listing.xh | Directory-listing document for page annotation directories. |
.word | Tools/word-listing.xh | Directory-listing document for word index directories. |
An index file, with an extension of ``.wi'', is a simple list of XML elements. An index file has no enclosing element, no XML declaration, and no document type, making it a well-formed external parsed entity rather than a well-formed XML document. This convention greatly simplifies on-the-fly construction of index files, since they can be appended to without having to worry about removing and replacing the final end tag.
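
The benefit of this convention shows up in a short sketch: entries are appended with a plain file append, and a reader supplies a temporary enclosing element at parse time. The <Word> element with name and href attributes is an assumption made for illustration (the actual index schema may differ), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def append_entry(index_path, name, href):
    """Append one entry to an index file without rewriting it."""
    # Assumed entry format: <Word name="..." href="..."/>
    with open(index_path, "a", encoding="utf-8") as f:
        f.write(f"<Word name={quoteattr(name)} href={quoteattr(href)}/>\n")


def read_entries(index_path):
    """Parse an index file by wrapping it in a temporary root element."""
    with open(index_path, encoding="utf-8") as f:
        body = f.read()
    # The file has no enclosing element, so supply one for the parser.
    root = ET.fromstring(f"<index>{body}</index>")
    return [(e.get("name"), e.get("href")) for e in root]
```

Shell or Perl scripts can build the same files with nothing more than an append redirect, which is the point of leaving out the enclosing element.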
The present implementation of WOAD makes extensive use of the file system, including directories and XML index files, where another application might be expected to use a database. The PIA, the framework on which WOAD is built, makes this kind of thing easy and efficient, and using files makes it possible to use ordinary Unix tools (including shell and PERL scripts) to construct indices.
What's emerging in my investigation of LXR and other indexing tools is a realization that we need better ways of integrating WOAD with other tools.
One observation about indexers like LXR is that each one essentially defines a context within which certain classes of names are recognized. For example, LXR recognizes identifiers in C and C++ code, but ignores comments. Full-text indexers like FreeWAIS-sf avoid that problem, but are unable to identify declarations. So both have their place.
Indexing programs generally also come with a compatible search engine (typically a CGI, though in the case of the WAIS family there might also be a server version), and often a cross-reference listing program (for example LXR's) that makes recognized identifiers into links.
Integrating indexing programs into WOAD is complicated by the fact that they all use different database files, including application-specific binary files (WAIS), dbx and other database formats, and files tied to Perl hashes (LXR).
This section lists the default contexts and gives some information about how the indexing information is obtained:
index-all.html. Unfortunately it isn't always present; in particular, it's missing in Sun's online Java documentation. The list can be reliably computed only by recursively expanding the documentation and selecting name anchors that contain types in parentheses.
Occurrences, i.e., places where a word is used (as opposed to defined), are best turned into links (to the word's page) ``on-the-fly'' while a page is being viewed. Locating and listing all of the places where an identifier occurs is an expensive process.
It's not impossible, though -- it ``merely'' requires a complete pass through all the files in the system. Carrying this off, however, requires either (1) having a good way to determine which words are going to need indexing, or (2) indexing everything (full-text indexing).
So the simplest method is almost certainly going to be full-text indexing, which allows us to use a totally separate index for it. This implies that only words with indexed definitions will get links; others (e.g. ``this'') will have to be looked up via the form. There's a big advantage to this: it shows the user which words are actually worth looking up (in the sense that they have local definitions).
There are two ways to anchor occurrences:
<a name="foo.50" href="/.word/foo">foo</a>
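
Turning occurrences into links on the fly can be sketched as below. This is an illustration under stated assumptions, not WOAD's implementation: the function name is hypothetical, candidate words are limited to simple ASCII identifiers, and the numeric suffix in the anchor name is taken to be a line number, which the example above does not confirm.

```python
import re
from html import escape

# Candidate words: simple ASCII identifiers (an assumption of this sketch).
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def link_occurrences(line, line_number, defined_words):
    """Return `line` as HTML, linking words that have definitions."""
    out = []
    pos = 0
    for m in IDENT_RE.finditer(line):
        out.append(escape(line[pos:m.start()]))  # text between words
        word = m.group()
        if word in defined_words:
            # Assumed anchor form: name="<word>.<line number>"
            out.append(
                f'<a name="{word}.{line_number}" '
                f'href="/.word/{word}">{word}</a>'
            )
        else:
            out.append(word)
        pos = m.end()
    out.append(escape(line[pos:]))
    return "".join(out)
```

Only words in defined_words become links, which matches the observation above that links should mark words with local definitions. A word that occurs twice on one line would get duplicate anchor names in this sketch.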
One of the things that emerged in my investigation of LXR and other indexing tools is a realization that we need better ways of integrating WOAD with other tools. This section explains how to do that.
Just as we will often want to use external tools for indexing, so we will often want to integrate external viewing and formatting tools. Examples include Javadoc and tsdoc, which generate documentation from source code, and Bonsai, which adds CVS annotation to source files.
There are three cases that need to be considered:
<include> (which has higher overhead, but allows WOAD to add its own annotation and identifier links).
The main thing that needs to happen in order to integrate an external viewer is defining the mapping between the .source tree and the URL's provided by the viewer. There are several plausible methods:
map.wi, which can be included by the listing page.
woad.xcf configuration files.
src-listing and so on. This is by far the most versatile technique, so that's what we do.