Notes on Porting the PIA
(to embedded systems)
Alternatives
OS Alternatives:
- eCOS -- attractive, but requires C++ for some components; it hasn't been
ported to many processors yet.
- Linux -- becoming the OS of choice for high-end embedded applications.
Several different embedded versions exist.
Language Alternatives:
- C: Hard to port to, but size/performance likely to be acceptable.
- Scheme+DSSSL.
DSSSL is the standard style-sheet language for SGML; it's based on the
Scheme (a Lisp variant) programming language. Most implementations, I
believe, are based on either the Guile or SIOD implementation of
Scheme, both of which are in C and fairly lightweight. Parsers exist;
data structures and memory management are free.
- Java: trivial to port, but probably too big/slow for most embedded
applications. Cygnus is working on it, but it is too preliminary at this
point. VxWorks JVMs exist, but probably not for Hitachi CPUs.
- C++: Easier to port to than C; possibly worse size/speed.
- Perl/Python/Smalltalk: Easier to port to than C/C++, but unfamiliar to
many programmers.
Data structures and memory management come free; XML and HTML parsers
exist. (There are web browsers and servers in Smalltalk and PERL
already). Smalltalk has a history of being used in embedded
applications, but is likely to be slow and bulky. The Squeak version
of Smalltalk, however, is extremely easy to port.
Note that the EGCS compilers now support the SH4 processor; this support
comes from Cygnus's GNUPro. EGCS may not support the SH3.
C Porting Considerations
Memory Management
- This is by far the hardest problem in a C port. You will have to use
one of the following techniques (or possibly some mixture of them):
- Zoned allocation. While you're processing a document, allocate
all of its memory from one or more contiguous blocks. When
you're finished with a document, throw away all of the memory
you allocated for it. Keep a special block for global
structures, which never get de-allocated. This is probably
the best technique.
- Reference counts. Straightforward but somewhat error-prone.
- Being careful. If you can guarantee that every data
structure is a tree, and that no objects are shared among trees,
then you know that when you are finished with a document you can
recursively de-allocate its entire tree.
- Garbage collection. Hard unless you can find a pre-packaged
version.
- In some cases, it can be guaranteed that certain objects are
only referenced in one place: for example, the variable-sized portion
of growable buffers or arrays. These are easily managed with a
compacting garbage-collector.
- Fixed-size nodes can be allocated off a free list.
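The zoned allocation technique described above can be sketched in a few lines of C. This is only a minimal illustration, not code from the PIA: the names Zone, zone_init, zone_alloc, zone_reset and zone_destroy are hypothetical, a single fixed-size block stands in for "one or more contiguous blocks", and 16-byte alignment is assumed to be sufficient for the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: 16-byte alignment is enough for any object on the target. */
#define ZONE_ALIGN 16u

/* One zone: a single contiguous block handed out front to back. */
typedef struct Zone {
    unsigned char *base;   /* start of the block */
    size_t         size;   /* total bytes in the block */
    size_t         used;   /* bytes handed out so far */
} Zone;

/* Create a zone of `size` bytes; returns 0 on success, -1 on failure. */
int zone_init(Zone *z, size_t size)
{
    assert(z != NULL);
    z->base = malloc(size);
    z->size = (z->base != NULL) ? size : 0;
    z->used = 0;
    return (z->base != NULL) ? 0 : -1;
}

/* Allocate `n` bytes from the zone, or return NULL if it does not fit. */
void *zone_alloc(Zone *z, size_t n)
{
    size_t rounded = (n + ZONE_ALIGN - 1) & ~(size_t)(ZONE_ALIGN - 1);
    void *p;

    if (rounded < n || rounded > z->size - z->used)
        return NULL;               /* overflow, or zone exhausted */
    p = z->base + z->used;
    z->used += rounded;
    return p;
}

/* Finished with a document: throw away everything allocated for it. */
void zone_reset(Zone *z)
{
    z->used = 0;
}

/* Give the block back to the system. */
void zone_destroy(Zone *z)
{
    free(z->base);
    z->base = NULL;
    z->size = 0;
    z->used = 0;
}
```

A server would probably keep one zone per document being processed and call zone_reset when the response has been sent; a separate zone that is never reset would hold the global structures.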
Threads
- Multi-threading interacts (badly, in most cases) with memory
management.
- Probably the techniques easiest to make thread-safe are zoned
allocation and being careful.
- Even if a thread-safe allocator is in use, sharing data structures
(and hence documents) among threads is likely to be difficult.
- Currently, the only place where documents are shared among threads is
when proxying is being done; it may be possible to avoid this.
Processor and Cursor Interfaces
- With a suitably node-free implementation for Cursor and a SAX-like
event-driven interface to Processor, it may be possible to re-implement
most (not all) primitives in a fully-cursorial, event-driven form.
- The main exceptions, which will require nodes, are extract and repeat.
- Text operations will require nodes as well, unless we extend Cursor to
handle strings specially, as virtual Text nodes (maybe not a
good idea, but possible).
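To make the event-driven idea concrete, here is a rough sketch of what a SAX-like handler interface might look like in C. None of these names come from the PIA: EventHandler, OutBuf, the echo_* callbacks and emit_sample are all hypothetical, and the handler shown is the simplest possible primitive, one that copies its input to an output buffer without ever building a node.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* SAX-like callbacks: the processor calls these instead of building nodes. */
typedef struct EventHandler {
    void *state;                                        /* handler's private data */
    void (*start_element)(void *state, const char *tag);
    void (*text)(void *state, const char *chars);
    void (*end_element)(void *state, const char *tag);
} EventHandler;

/* Private state for the example handler: a small fixed-size buffer. */
typedef struct OutBuf {
    char   data[128];
    size_t len;
} OutBuf;

/* Append a string, silently truncating if the buffer is full. */
void outbuf_append(OutBuf *o, const char *s)
{
    size_t room = sizeof o->data - 1 - o->len;
    size_t n = strlen(s);

    if (n > room)
        n = room;
    memcpy(o->data + o->len, s, n);
    o->len += n;
    o->data[o->len] = '\0';
}

/* A pass-through handler: echoes each event in external form. */
void echo_start(void *state, const char *tag)
{
    outbuf_append(state, "<");
    outbuf_append(state, tag);
    outbuf_append(state, ">");
}

void echo_text(void *state, const char *chars)
{
    outbuf_append(state, chars);   /* no character escaping in this sketch */
}

void echo_end(void *state, const char *tag)
{
    outbuf_append(state, "</");
    outbuf_append(state, tag);
    outbuf_append(state, ">");
}

/* Stand-in for a parser or Cursor: emits the events for <p>hi</p>. */
void emit_sample(const EventHandler *h)
{
    assert(h != NULL);
    h->start_element(h->state, "p");
    h->text(h->state, "hi");
    h->end_element(h->state, "p");
}
```

Primitives such as extract and repeat would not fit this shape directly, since they need to see a subtree more than once; they would have to buffer events or fall back to nodes.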
Available Resources
- Mozilla has a DOM in C++ (unfortunately), along with XML and HTML parsers.
- A C DOM interface exists in the form of a header file -- see
http://www.sinica.edu.tw/~ricko/src/dom_interface.h. It doesn't
include an implementation, however.
- James Clark's SP parser or one of
its derivatives is probably the one to use. It's written in C++ and
has an event-oriented interface. Clark also has an XML parser called
Expat, and a Java XML parser called XP.
- It is tempting to use Jade, a C++ implementation of DSSSL also by James
Clark. Unfortunately, it apparently doesn't implement the (tree)
``transformation language'' of DSSSL, which is presumably what the PIA
would want to use, but it does have an XML flow object tree
back-end which might be sufficient. In any case, Jade almost certainly
includes enough pieces to implement the DPS without dragging in all of
DSSSL.
- Other XML parsers exist. There are some additional requirements on a
parser if the application needs to process HTML documents obtained from
other sites:
- An XML parser would have to be extended slightly to handle tags
that the tagset flags as "empty", i.e. not requiring either a
"/" delimiter or an end tag. These can occur in HTML, but not
in XML.
- An HTML parser would also want to handle missing end tags (for
example, on list items). This is not essential in an embedded
application, because you can ensure that all end tags are
supplied where needed. If necessary, you could use a working
Java PIA as a pre-processor.
- PERL has data structures very similar to the ones in org.risource.ds; in
fact the ds module was originally designed to make a port from PERL to Java
as simple as possible.
- Apache is a well-known web server written in C, and has some marketing
advantages as well. It would replace the basic web server level
(Acceptor, Transaction, Resolver) with its own machinery; Agents would
probably be mapped into modules.
Translating Java to C
- Translating method calls.
These translate into a function call with the object being operated on
as the first argument.
- Translating an Interface (to a struct of pointers to functions).
- Translating a Class (to a struct plus interface functions). Every instance
of a class has a pointer to the class-definition structure, which in turn
has pointers to the method functions.
- Dynamic Method Dispatching.
This is done by going indirect through an object's class-definition
structure to find the appropriate function. A good example of this can
be found in the X Toolkit.
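The translation rules above might look something like this in practice. The sketch is illustrative only: Node, Text, NodeClass and the function names are hypothetical and are not taken from the PIA sources. It shows a class-definition structure holding function pointers, instances that point to it, a subclass that embeds its superclass as its first member, and a method call that dispatches through the class.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Node Node;

/* Class-definition structure: one per class, shared by all its instances.
   The function pointers play the role of a Java interface. */
typedef struct NodeClass {
    const char *name;
    size_t (*length)(const Node *self);
} NodeClass;

/* Every instance starts with a pointer to its class definition. */
struct Node {
    const NodeClass *cls;
};

/* A "subclass": the superclass part comes first, so a Text* is also a Node*. */
typedef struct Text {
    Node        base;
    const char *chars;
} Text;

/* Method implementations; the object is always the first argument. */
size_t node_length_impl(const Node *self)
{
    (void)self;
    return 0;
}

size_t text_length_impl(const Node *self)
{
    return strlen(((const Text *)self)->chars);
}

const NodeClass Node_class = { "Node", node_length_impl };
const NodeClass Text_class = { "Text", text_length_impl };

/* Dynamic dispatch: go indirect through the object's class definition. */
size_t node_length(const Node *self)
{
    assert(self != NULL && self->cls != NULL);
    return self->cls->length(self);
}

const char *node_class_name(const Node *self)
{
    return self->cls->name;
}
```

A call such as node.length() in Java becomes node_length(&node) in C; the caller never needs to know which implementation is behind the pointer.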
Translating XML to C
- There are basically two ways to do this:
- Translate XML into initialized C data structures. This
eliminates the parsing step, and makes your system much more
efficient. It requires creating a lot of initialized data
structures with funny names.
- Translate (active) XML into C procedures. A ``document'' would
simply be a procedure that outputs (probably to an Output object
rather than to a stream) the result of expanding the document
with the DPS. This is essentially the same process that a
parser-generator goes through. Provides very high performance
but very little flexibility -- the document cannot be changed at
run-time.
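Both approaches can be illustrated with a toy document, <p>Hello, <b>world</b></p>. Everything here is a hypothetical sketch: PNode, Output, out_put, write_tree and doc1_expand are invented names, and the `who` parameter merely stands in for whatever the DPS would compute at run time. The first half shows the initialized data structures (with the funny names), the second half the same document compiled into a procedure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Approach 1: the parse tree as initialized, read-only C data. */
typedef struct PNode {
    const char         *tag;    /* element name, or NULL for a text node */
    const char         *text;   /* characters, when tag is NULL */
    const struct PNode *child;  /* first child */
    const struct PNode *next;   /* next sibling */
} PNode;

static const PNode doc1_n3 = { NULL, "world",   NULL,     NULL     };
static const PNode doc1_n2 = { "b",  NULL,      &doc1_n3, NULL     };
static const PNode doc1_n1 = { NULL, "Hello, ", NULL,     &doc1_n2 };
static const PNode doc1    = { "p",  NULL,      &doc1_n1, NULL     };

/* A stand-in for an Output object: a small fixed-size buffer. */
typedef struct Output {
    char   buf[128];
    size_t len;
} Output;

/* Append a string, silently truncating if the buffer is full. */
void out_put(Output *o, const char *s)
{
    size_t room = sizeof o->buf - 1 - o->len;
    size_t n = strlen(s);

    if (n > room)
        n = room;
    memcpy(o->buf + o->len, s, n);
    o->len += n;
    o->buf[o->len] = '\0';
}

/* Walk a node and its following siblings, writing them in external form. */
void write_tree(const PNode *n, Output *o)
{
    for (; n != NULL; n = n->next) {
        if (n->tag == NULL) {
            out_put(o, n->text);
            continue;
        }
        out_put(o, "<");  out_put(o, n->tag); out_put(o, ">");
        write_tree(n->child, o);
        out_put(o, "</"); out_put(o, n->tag); out_put(o, ">");
    }
}

/* Approach 2: the document compiled into a procedure. */
void doc1_expand(Output *o, const char *who)
{
    assert(o != NULL && who != NULL);
    out_put(o, "<p>Hello, <b>");
    out_put(o, who);
    out_put(o, "</b></p>");
}
```

The data-structure form can live in ROM and still be walked by a general processor; the procedure form skips even the tree walk, at the cost of having to recompile to change the document.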
What is Essential?
This is roughly equivalent to ``where to start.''
- Data structures defined in org.risource.ds. These are basically a
portable data structure library, and everything in the PIA can be built
out of them. In particular, List and Table are almost essential.
- You will probably have to add a string datatype.
- The basic server classes: org.risource.pia.Acceptor,
org.risource.pia.Transaction, the org.risource.pia.Content
interface and at least some of its implementations. These can probably
be omitted if Apache is used as the server base.
- The org.risource.pia.Agent interface and its implementations
org.risource.pia.GenericAgent, org.risource.pia.agent.Admin, and
org.risource.pia.Root. Possibly org.risource.pia.agent.Dofs.
These are essential for actually serving pages, no matter how they are
eventually implemented.
- The DOM objects. In embedded applications, it should be possible to
pre-parse all the documents and convert them to C data structures,
though this is not necessarily a good idea.
- The Document Processing System, in the org.risource.dps package. The main
loop, in org.risource.dps.process.BasicProcessor, is
essentially trivial. You will also need the handlers, tagsets, and
some of the Input and Output implementations.
What can be Left Out?
- Certainly, any agents (in pia/Agents and org.risource.pia.agent) that you
don't need. Probably the only ones you'll need are Admin, Root and DOFS.
Admin should be severely restricted, and Root will probably be totally
changed.
- The Resolver (in org.risource.pia.Resolver) can be left out if you
don't want to use your device as a proxy. A lot of other machinery
could go away if you got rid of this -- essentially you'd be left with
an ordinary web server that replaces CGI scripts with the DPS and its
active documents.
- Everything having to do with Java serialized objects (implemented using
org.risource.util.Utilities.readObjectFrom and writeObjectTo).
Unfortunately there are other things in Utilities that you may need.
- You could leave out the parser (org.risource.dps.parse) if you used
pre-built data structures for the parse trees of all your documents.
This would make your system significantly faster and reduce the amount
of RAM needed, but would leave it impossible to customize.
- Logging and debugging code should be implemented using macros so that
it can easily be removed from the production system.
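One common way to do this in C is a logging macro that compiles to nothing unless a debug symbol is defined. The names PIA_DEBUG, PIA_LOG and clamp_to_buffer below are made up for this sketch, and the variadic macro form assumes a C99 compiler.

```c
#include <assert.h>
#include <stdio.h>

/* Build with -DPIA_DEBUG to keep the logging; leave it out for production. */
#ifdef PIA_DEBUG
#define PIA_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define PIA_LOG(...) ((void)0)   /* arguments are discarded, not evaluated */
#endif

/* Hypothetical example function showing the macro in use. */
int clamp_to_buffer(int requested, int capacity)
{
    assert(capacity >= 0);
    if (requested > capacity) {
        PIA_LOG("clamping %d to %d\n", requested, capacity);
        return capacity;
    }
    return requested;
}
```

Because the arguments vanish completely in the production build, they must not have side effects that the rest of the program depends on.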
What's Disorganized?
Several parts of the system are ``disorganized'' -- left in an incomplete or
confusing state because we simply haven't had time to give them the
attention they deserve. If you're doing a port, these parts should be done
right instead of simply being copied.
- org.risource.dom and org.risource.dps.active implement an
obsolete version of the W3C's Document Object Model. The most
recent version of the interfaces is in org.w3c.dom; the
entire DPS needs to be rewritten to use it. This will probably be
done sometime in late March, 1999. The names of several classes and
methods will have to be changed. The basic algorithms will stay
intact, however.
- org.risource.pia needs to be re-organized to make better use of
interfaces. org.risource.pia.Agent is far too complicated;
org.risource.pia.Transaction is also too complex and should be
redone as an interface.
- org.risource.pia.GenericAgent is far too complex; it needs to be
split up and simplified:
- The methods in org.risource.pia.GenericAgent that put
together various kinds of response transaction could also go
into a utility class.
- dirAttribute and fileAttribute need to
be worked on; they're left over from an older version.
- Quite a lot of test scaffolding has been left in some classes.
Recommended Sequence
- org.risource.ds.{Table, List}
- org.risource.dps.active.{ParseTreeNode, ParseTreeText,
ParseTreeElement, ...}
- org.risource.dps.output.ToParseTree
At this point you can build parse trees in memory. Any parser can
construct them.
- org.risource.dps.input.FromParseTree
- org.risource.dps.output.ToExternalForm
At this point you can traverse a parse tree and output it as a
character stream.
- org.risource.dps.handle.{BasicHandler, GenericHandler}
- org.risource.dps.util.{BasicNamespace, BasicEntityTable}; other classes
in util as needed. Most of them are simple. Many are just class wrappers for
a lot of global functions.
- org.risource.dps.process.BasicProcessor
At this point you can actually process documents. You can
parse them offline and create initialized C data structures if you
like.
- org.risource.dps.parse.BasicParser
This is optional -- you only need it if you are going to process
documents that are represented as character strings. If you can stick
with parse trees, you can leave it out.
- Handler classes for the tags you need.
In parallel with this you should be prototyping the user
interface using a working PIA -- this will give you something you can
interact with and test, and will tell you exactly which parts of
the PIA you need to implement.
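As an illustration of how small the first step in the sequence can be, here is a minimal sketch of a growable list of pointers in C. It is not a translation of org.risource.ds.List, whose actual methods are not described in these notes; the struct layout and the list_* function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A growable array of object pointers. */
typedef struct List {
    void  **items;      /* variable-sized portion, referenced only from here */
    size_t  count;      /* number of items in use */
    size_t  capacity;   /* number of slots allocated */
} List;

void list_init(List *l)
{
    assert(l != NULL);
    l->items = NULL;
    l->count = 0;
    l->capacity = 0;
}

/* Append an item; returns 0 on success, -1 if memory runs out. */
int list_push(List *l, void *item)
{
    if (l->count == l->capacity) {
        size_t  ncap = l->capacity ? l->capacity * 2 : 8;
        void  **nitems = realloc(l->items, ncap * sizeof *nitems);

        if (nitems == NULL)
            return -1;          /* the old array is still valid */
        l->items = nitems;
        l->capacity = ncap;
    }
    l->items[l->count++] = item;
    return 0;
}

/* Return the item at index i, or NULL if i is out of range. */
void *list_at(const List *l, size_t i)
{
    return (i < l->count) ? l->items[i] : NULL;
}

/* Release the array; the items themselves belong to the caller. */
void list_free(List *l)
{
    free(l->items);
    list_init(l);
}
```

This version uses malloc and realloc for simplicity; in a real port the variable-sized portion would have to come from whichever memory-management scheme is chosen for the rest of the system.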
Scheme/DSSSL Porting Considerations
- I would like to see this done on general principles. My guess is that
an average Lisp programmer could put a first cut together in a couple
of days.
- A DSSSL port of the DPS, in particular, should be almost completely
trivial. Probably just a matter of translating each active tag's
handler into a Scheme function.
- A tagset-to-DTD translator would have to be written, since the existing
parsers all run off a standard DTD.
- There are certainly some web servers written in Common Lisp; it
wouldn't be surprising if there's at least one in Scheme as well. This
would give us the front end.
Copyright © 1999 Ricoh Innovations, Inc.
$Id: porting.html,v 1.9 2001-01-11 23:36:50 steve Exp $
Stephen R. Savitzky <steve@rii.ricoh.com>