Design notes for a C/C++ port

(for embedded systems)

See also the Notes on Porting the PIA.

Introduction

This document describes a design for a C or C++ port of the PIA. The design goals are:

  1. Integrate the PIA with existing open-source applications.
  2. Use as much existing open source code as possible.
  3. Port as much as possible of the DPS (document-processing system).
  4. Support a large fraction of the PIA server's functionality, but not necessarily all of it.

Available Resources

Design Sketch

There are two plausible approaches:

  1. Implement the DPS using Jade. This involves writing the bulk of the DPS in DSSSL, which is basically an extension of Scheme. This is likely to be unfamiliar territory for C programmers, and may result in an inefficient implementation. Also, Jade is rather bulky, weighing in at over 6MB of source code. About half of that is SP.
  2. Implement the DPS directly in C or C++ using Expat as the parser. This can be done quite efficiently if we give up the sub-elements of <extract> that modify structure, somewhat less efficiently if we don't. If an HTML parser is needed, it can be implemented using Expat's tokenizer as a base.

In view of Jade's size and complexity, it seems best to implement the DPS in C starting with Expat. The design sketch follows:

Storage Format

What I have in mind is an implementation that stores trees, not as individual nodes connected by pointers, but as flattened arrays of characters. Each node would have a binary header and trailer that includes the length of its contents. The only place pointers would be necessary would be in namespaces.

The advantages of a flattened tree are that, although navigation is nearly as fast as a linked tree, depth-first traversal is actually faster, and storage-management is trivial. In addition, by padding strings to some reasonable length (8 bytes comes to mind), one could ensure that numeric fields always start on word boundaries; this would also allow the occasional pointer for things like handlers.

Note that storage management for a flat, compact array of what amount to strings is particularly simple, although keeping the array compact does involve copying. In the most common case, where input is from events and output is to a stream, holes will only occur when a variable is set inside of something being expanded.

Some kind of interleaving scheme (with "skip" spans as well as "content" spans) would allow variables to be written into at the same time as intermediate (processed content) parse trees. (Variables would end up as something akin to comment nodes.) This is only an issue, of course, when something like a repeat or extract is embedded in something else, e.g. a set, that constructs a nodelist.

It may be best, though, to do the initial implementation with ordinary nodes to avoid unnecessary complexity.

Note also that in embedded applications it will often be possible to pre-parse files, and even embed them in the server in the form of initialized C data structures. Constructing the C can be done using a tagset. In particular, we can reasonably expect that tagsets will be preloaded in this way.

Apache Module Interface

Apache includes a ``transaction-oriented'' storage-allocation model: all storage associated with a transaction is kept together, and returned when the transaction is completed. This makes things very simple.

One problem with Apache is that documents are served out of multiple, single-threaded processes. This makes it difficult to share data at the ``Agent'' or ``PIA'' (global) level. If this is necessary we could make yet another process, running as a kind of ``nano-PIA'' server, to manage the shared entities. (This is similar to what mod_java does, for example.) Alternatively we could use the file system for shared data, but there are locking issues involved in that approach.

Note that the ``flattened tree'' binary format is almost ideal for communicating with the ``nano-PIA'' entity server.

Implementation Plan

  1. Incorporate expat into the tree (as src/c/com/jclark/expat) and build it.
  2. Design data structures and interfaces for a minimal set of DPS components: Node, NodeList, Input, Output, Processor, Context, Namespace, and a few others. Use expat as a style guide.
  3. Write a pair of drivers: a C main program equivalent to process, and another that works as a CGI. Implement a simple mod_pia if it's easy.
  4. Implement a skeleton version of Processor that simply copies input to output.
  5. Implement those handlers that can operate in pass-through mode without constructing an intermediate parse tree. These include if, hide, and protect.
  6. Implement the linked-list parse-tree representation.
  7. Implement Processor, building parse trees for active nodes that can't be run in pass-through mode.
  8. Implement the rest of the handlers. There are about 40 top-level handler classes in the DPS, and about twice that many sub-handlers. A few will be unnecessary; others can easily be postponed.
  9. Implement the flat-file parse-tree representation. This may fall out of what Mark is doing on the storage system.
  10. Implement a simple shared-variable server for use with mod_pia in Apache.

Copyright © 1999-2000 Ricoh Innovations, Inc.
$Id: c-port.html,v 1.5 2001-01-11 23:36:53 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>