DOFS: Representing Documents Efficiently
Architectural Notes
One of the dreams of people in the PIA project has always been an
efficient representation for collections of documents. Such a ``Document
Oriented Filing System'' (DOFS) would allow SGML documents, and
portions of documents, to be accessed by programs such as the PIA
without the need for parsing.
External Representation
Possible solutions to this problem break down into the following general
categories:
- OODB
- Probably the first thing that occurs to anyone thinking about the
problem is an object-oriented database. This gives a very close
coupling between the documents and their internal representation as
parse trees. It may be difficult, though, to find a good, open source
OODB to use for this purpose.
An additional problem with this approach is that OODBs suffer from
the ``schema migration'' problem. If a change is made in one of the
classes used to represent documents, the entire database has to be
modified.
Also, the obvious object representation to use is the DOM, but the
DOM has serious problems when ripped out of a single process and put
into a client-server system like a database.
- Serialized Objects
- Serialized Java objects suffer from the same problems as an OODB.
Worse, they are known to be less efficient than simply parsing XML
files! Why bother?
- Compiled Code
- Going even farther down the garden path to specialized representations,
we come to compiled code. It is relatively easy to map an XML document
into a C program that, when compiled and run, supports any desired
document-access API or protocol very efficiently. Effectively, the
program might be a specialized server, or even a library. This
approach may have merit for embedded systems or high-performance
applications, but of course is not a general-purpose solution.
- Indexed Text Files
- Something that goes ``all the way'' in the opposite direction from an
OODB is some way of keeping the object metadata in files that are
separate from the actual documents. Such an index file would list the
elements and their attributes in some convenient format, and their
offsets in the actual documents (a rough sketch of this kind of lookup
follows this list).
This technique has the obvious problem of keeping the document and
its index synchronized. Also, if a document has a high ratio of tags
to text, the index file will actually take longer to read than the
document itself.
- Binary File Formats
- A third direction to go is toward an efficient binary representation
for document files. Such a binary representation should include an
element index toward the front, element sizes (rather than end tags),
and should allow multiple documents to be aggregated into a single
file. It's worth noting that a binary interchange format for SGML
files exists, and this needs to be investigated.
Unlike an OODB, the data in a binary file does not need to be the
same as the in-memory representation (which is presently derived from
the DOM). This allows the two representations to be optimized
independently.
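As a very rough sketch of the indexed approach (the one-line-per-element
``offset length tagname'' index format assumed here is invented purely for
this example), a reader could pull a single element out of a document
without parsing anything else:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    // Hypothetical index entry: one line per element, e.g. "1024 210 p",
    // giving the element's byte offset and length in the document file.
    public class IndexedLookup {
        // Seek directly to an element recorded in the index and return its text.
        public static String readElement(String docPath, long offset, int length)
                throws IOException {
            try (RandomAccessFile doc = new RandomAccessFile(docPath, "r")) {
                byte[] buf = new byte[length];
                doc.seek(offset);       // jump straight to the element's start tag
                doc.readFully(buf);     // read only the element's bytes; no parsing
                return new String(buf, StandardCharsets.UTF_8);
            }
        }
    }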
My present leaning is toward a binary file format. However, I think that
compiled code also bears investigation; this is described in a separate document.
Access Interface
There are also several possible choices for the API used to
access documents.
- CORBA
- CORBA obviously needs to be part of any solution. This will
give a uniform, network-oriented API for the system.
- Cursors
- The DPS's Cursor interface also needs to be part of any solution
(and this interface needs to be improved to make it complete).
- Streams (i.e. HTTP)
- Another interface that needs to be part of any solution is an efficient
interface to character streams. This will allow a DOFS to be accessed
as a web server. Document streams also need to be positionable, for
efficient access to any portion of a document (a minimal sketch of such
an interface follows this list).
- File System
- Taking the stream interface to yet another extreme is the idea of
making a document collection into a virtual filesystem. This, of
course, is not particularly portable across operating systems.
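As a minimal sketch of what ``positionable'' might mean here (the name and
methods are only illustrative, assuming positions are byte offsets):

    import java.io.IOException;
    import java.io.Reader;

    // Illustrative only: a character stream over a document that can be
    // repositioned, so a client (or the HTTP front end) can jump to a
    // recorded offset and stream a fragment from there.
    public interface PositionableDocumentStream {
        long position() throws IOException;         // current offset in the document
        void seek(long offset) throws IOException;  // reposition to an arbitrary offset
        Reader reader() throws IOException;         // character stream from the current position
    }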
Design Notes
This section will concentrate on design considerations for a binary file
format, since that seems to be the best direction in which to proceed at
this point.
- Note:
- An interchange format for SGML exists, called SDIF; this should be
investigated.
Design Considerations
- Time-space tradeoffs
- The classical tradeoff in computing is time vs. space. By throwing
additional space at a problem (for example, by padding the file so that
each element starts on a block boundary) one can make it more
time-efficient. Similarly, by throwing computation at it (say, by
compressing the text) one can make it more space-efficient. I suspect
that the proper balance is probably somewhere close to a packed linear
file, with embedded navigational aids such as indices and element
lengths but without compression.
- Read-write tradeoffs
- Another tradeoff is reading vs. writing. It is possible to design a
format that is easy to read sequentially, and even easy to search, but
is not efficient to write. Similarly, one might devise a format that
is easy to write sequentially, but not to modify. Appending
to a file presents its own set of problems.
My current thinking is that we should have a parametrized
representation so that different files can occupy different points along
these two tradeoff continua. Each file in the system can be optimized for
its particular task.
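As a sketch of what that parametrization might amount to (the particular
knobs and defaults here are only guesses at this stage), each file could
carry a small set of layout options:

    // Illustrative parameter set; the real knobs and their defaults are open.
    public class DofsFileOptions {
        public boolean padToBlocks  = false; // start elements on block boundaries (space for time)
        public boolean compressText = false; // compress text content (time for space)
        public boolean journalled   = false; // append-only writes, with the index rebuilt later
        public int     blockSize    = 4096;  // only meaningful when padding is on
    }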
Here are a few more-or-less random things to consider:
- If each node is stored in the document as a header (containing a
length) followed by its contents, any node on the right-hand edge of
the parse tree can be appended to without having to do anything more
than seek back and fix up the length. In particular, the <body>
node can be appended to without having to rewrite any end ``tags''.
(A rough sketch of this appears right after the list.)
- Nodes may need to contain (at least) four lengths:
- size
- the number of bytes in the file occupied by the node and its
contents
- length
- the total width of the node when ``flattened'' into the minimum
number of lines of text.
- width
- the minimum number of columns it takes to represent the node.
- height
- the minimum number of lines it takes to represent the node.
- Indices should either be at the front of the file (with some extra
padding if the file can be appended to), or possibly in a separate
file.
- It would be possible to have, as an option, a journalled form of a file
in which indexing information is distributed, (and possibly in which
nodes are stored non-sequentially) so that all writes can be done by
appending. Such a journalled file that has only been appended to can
be re-arranged into its natural form (indexes at the front) in at most
two passes; more efficient formats exist that require
one-and-a-fraction passes. Randomized insertions may take longer or
use more space.
- A good binary representation for files may (but need not) be closely
related to a good representation for compile-time-initialized data
structures.
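To make the first point above concrete, here is a rough sketch of
appending to the node on the right-hand edge; the header layout (a 4-byte
size field at a fixed offset from the start of the node) is invented for
the example:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Assumes, purely for illustration, that a node header stores its total
    // size as a 4-byte int at a known offset from the start of the node.
    public class AppendToTail {
        static final int SIZE_FIELD_OFFSET = 4;  // hypothetical position of the size field

        // Append new content bytes at the end of the file, then patch the
        // size field of the enclosing node (e.g. <body>) so it covers them.
        public static void append(RandomAccessFile file, long nodeStart, byte[] content)
                throws IOException {
            file.seek(file.length());
            file.write(content);                       // grow the file; no end tags to rewrite

            file.seek(nodeStart + SIZE_FIELD_OFFSET);
            int oldSize = file.readInt();
            file.seek(nodeStart + SIZE_FIELD_OFFSET);
            file.writeInt(oldSize + content.length);   // fix up the stored length in place
        }
    }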
Structure
The typical node will consist of:
- A fixed-size header, followed by:
- A variable-size attribute list (for Element nodes)
- A variable-size contents list or string.
The header, in turn, will contain:
- node type
- header size
- content type (nodes, characters, references)
- flag bits
- node name/tagname
- total size (in bytes) (?)
- content size (in bytes)
- number of content nodes
- attribute list size
- number of attributes
- additional fields that depend on the node type
The tagname could be either a string, or a reference to a descriptor node;
the latter is probably more useful as well as having a fixed length.
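As a sketch only (the field widths, order, and names are guesses, with the
tagname stored as a fixed-length descriptor reference as suggested above),
the fixed-size header might be read in as something like:

    import java.io.DataInput;
    import java.io.IOException;

    // Illustrative layout; nothing here is settled.
    public class NodeHeader {
        public int nodeType;      // element, text, comment, ...
        public int headerSize;    // size of this header in bytes
        public int contentType;   // nodes, characters, or references
        public int flags;
        public int tagName;       // fixed-length reference to a descriptor node
        public int totalSize;     // bytes occupied by the node and all its contents
        public int contentSize;   // bytes occupied by the contents alone
        public int contentCount;  // number of content (child) nodes
        public int attrListSize;  // bytes occupied by the attribute list
        public int attrCount;     // number of attributes

        public static NodeHeader read(DataInput in) throws IOException {
            NodeHeader h = new NodeHeader();
            h.nodeType     = in.readUnsignedByte();  // the first four fields fit in 32 bits
            h.headerSize   = in.readUnsignedByte();
            h.contentType  = in.readUnsignedByte();
            h.flags        = in.readUnsignedByte();
            h.tagName      = in.readInt();
            h.totalSize    = in.readInt();
            h.contentSize  = in.readInt();
            h.contentCount = in.readInt();
            h.attrListSize = in.readInt();
            h.attrCount    = in.readInt();
            return h;
        }
    }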
Note that without loss of generality, indices as well as content may
(should) be represented as nodes.
It is straightforward to either normalize files by combining adjacent text
nodes, or to split text into words. In that case it would be useful to
have preceding and following spaces indicated by flag bits to save
space.
Reference content could be used to compress text by making every word a
reference.
The node type, header size, content type, and flag bits should fit into 32
bits, and might even fit into 16 bits, which would allow 16 bits for the
nodename. That might not be enough for some files; there should be a
global option.
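A sketch of that 32-bit packing, with the field widths picked arbitrarily
just to show the idea:

    // One possible packing: 8 bits each for node type, header size,
    // content type, and flags. The widths here are illustrative only.
    public final class HeaderWord {
        public static int pack(int nodeType, int headerSize, int contentType, int flags) {
            return (nodeType & 0xFF) << 24
                 | (headerSize & 0xFF) << 16
                 | (contentType & 0xFF) << 8
                 | (flags & 0xFF);
        }

        public static int nodeType(int word)    { return (word >>> 24) & 0xFF; }
        public static int headerSize(int word)  { return (word >>> 16) & 0xFF; }
        public static int contentType(int word) { return (word >>> 8) & 0xFF; }
        public static int flags(int word)       { return word & 0xFF; }
    }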
Note that total size = header size + content size + attribute list
size, but it may be more efficient to make this explicit.
It's an open question whether it would help to have an explicit
"content-end" pseudo-node at the end of elements and text nodes. It would
make it possible to seek to the end of a file and locate a node being
appended to, but after appending we would then have to rewrite the end
nodes.
Perhaps the deciding factor is that end nodes would also make it possible
to perform a stackless tree traversal.
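To illustrate what the stackless traversal buys us (the record codes and
layout here are invented for the example), a reader could walk the whole
file sequentially with nothing more than a depth counter:

    import java.io.DataInput;
    import java.io.EOFException;
    import java.io.IOException;

    // Illustrative only: a flat stream of records, each starting with a
    // type byte, where every element is eventually closed by an explicit
    // END record. No stack of parent sizes is needed; a depth counter is
    // enough to know each node's nesting level.
    public class StacklessWalk {
        static final int END = 0;       // hypothetical record-type codes
        static final int ELEMENT = 1;
        static final int TEXT = 2;

        public static void walk(DataInput in) throws IOException {
            int depth = 0;
            try {
                while (true) {
                    int type = in.readUnsignedByte();
                    if (type == END) {
                        depth--;                    // leave the current element
                    } else if (type == ELEMENT) {
                        int nameRef = in.readInt(); // descriptor reference
                        System.out.println("element #" + nameRef + " at depth " + depth);
                        depth++;                    // children follow until the matching END
                    } else if (type == TEXT) {
                        int len = in.readInt();
                        in.skipBytes(len);          // skip over the text payload
                    }
                }
            } catch (EOFException done) {
                // end of stream: traversal complete
            }
        }
    }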
Copyright © 1997-1999 Ricoh Innovations, Inc.
$Id: dofs.html,v 1.2 2001-01-11 23:36:54 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>