DOFS: Representing Documents Efficiently

Architectural Notes

One of the dreams of people in the PIA project has always been an efficient representation for collections of documents. Such a ``Document Oriented Filing System'' (DOFS) would allow SGML documents, and portions of documents, to be accessed by programs such as the PIA without the need for parsing.

External Representation

Possible solutions to this problem break down into the following general categories:

OODB
Probably the first thing that occurs to anyone thinking about the problem is an object-oriented database. This gives a very close coupling between the documents and their internal representation as parse trees. It may be difficult, though, to find a good, open source OODB to use for this purpose.

An additional problem with this approach is that OODB's suffer from the ``schema migration'' problem. If a change is made in one of the classes used to represent documents, the entire database has to be modified.

Also, the obvious object representation to use is the DOM, but the DOM has serious problems when ripped out of a single process and put into a client-server system like a database.

Serialized Objects
Serialized Java objects suffer from the same problems as an OODB. Worse, they are known to be less efficient than simply parsing XML files! Why bother?
Compiled Code
Going even farther down the garden path to specialized representations, we come to compiled code. It is relatively easy to map an XML document into a C program that, when compiled and run, supports any desired document-access API or protocol very efficiently. Effectively, the program might be a specialized server, or even a library. This approach may have merit for embedded systems or high-performance applications, but of course is not a general-purpose solution.
Indexed Text Files
Something that goes ``all the way'' in the opposite direction from an OODB is some way of keeping the object metadata in files that are separate from the actual documents. Such an index file would list the elements and their attributes in some convenient format, and their offsets in the actual documents.

This technique has the obvious problem of keeping the document and its index synchronized. Also, if a document has a high ratio of tags to text, the index file will actually take longer to read than the document itself.

Binary File Formats
A third direction to go is toward an efficient binary representation for document files. Such a binary representation should include an element index toward the front, element sizes (rather than end tags), and should allow multiple documents to be aggregated into a single file. It's worth noting that a binary interchange format for SGML files exists, and this needs to be investigated.

Unlike an OODB, the data in a binary file does not need to be the same as the in-memory representation (which is presently derived from the DOM). This allows the two representations to be optimized independently.

My present leaning is toward a binary file format. However, I think that compiled code also bears investigation; this is described in a separate document.

Access Interface

There are also several possible choices for the API used to access documents.

CORBA
CORBA obviously needs to be part of any solution. This will give a uniform, network-oriented API for the system.
Cursors
The DPS's Cursor interface also needs to be part of any solution (and this interface needs to be improved to make it complete).
Streams (i.e. HTTP)
Another interface that needs to be part of any solution is an efficient interface to character streams. This will allow a DOFS to be accessed as a web server. Document streams also need to be positionable, for efficient access to any portion of a document.
File System
Taking the stream interface to yet another extreme is the idea of making a document collection into a virtual filesystem. This, of course, is not particularly portable across operating systems.

Design Notes

This section will concentrate on design considerations for a binary file format, since that seems to be the best direction in which to proceed at this point.

Note:
An interchange format for SGML exists, called SDIF; this should be investigated.

Design Considerations

Time-space tradeoffs
The classical tradeoff in computing is time vs. space. By throwing additional space at a problem (for example, by padding the file so that each element starts on a block boundary) one can make it more efficient. Similarly, by throwing computation at it (say, by compressing the text) one can make it more space-efficient. I suspect that the proper balance is probably somewhere close to a packed linear file, with embedded navigational aids such as indices and element lengths but without compression.
Read-write tradeoffs
Another tradeoff is reading vs. writing. It is possible to design a format that is easy to read sequentially, and even easy to search, but is not efficient to write. Similarly, one might devise s format that is easy to write sequentially, but not to modify. Appending to a file presents its own set of problems.

My current thinking is that we should have a parametrized representation so that different files can occupy different points along these two tradeoff continua. Each file in the system can be optimized for its particular task.

Here are a couple of more-or-less random things to consider:

Structure

The typical node will consist of:

  1. A fixed-size header, followed by:
  2. A variable-size attribute list (for Element nodes)
  3. A variable-size contents list or string.
The header, in turn, will contain:

The tagname could be either a string, or a reference to a descriptor node; the latter is probably more useful as well as having a fixed length.

Note that without loss of generality, indices as well as content may (should) be represented as nodes.

It is straightforward to either normalize files by combining adjacent text nodes, or to split text into words. In that case it would be useful to have preceeding and following spaces indicated by flag bits to save space.

Reference content could be used to compress text by making every word a reference.

The node type, header size, content type, and flag bits should fit into 32 bits, and might even fit into 16 bits, which would allow 16 bits for the nodename. That might not be enough for some files; there should be a global option.

Note that total size = header size + content size + attribute list size, but it may be more efficient to make this explicit.

It's an open question whether it would help to have an explicit "content-end" pseudo-node at the end of elements and text nodes. It would make it possible to seek to the end of a file and locate a node being appended to, but after appending we would then have to rewrite the end nodes.

Perhaps the deciding factor is that end nodes would also make it possible to perform a stackless tree traversal.


Copyright © 1997-1999 Ricoh Innovations, Inc.
$Id: dofs.html,v 1.2 2001-01-11 23:36:54 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>