RiSource.org / White Papers / Web Applications and the PIA

Foreword

This White Paper is intended for web developers: it describes a design and development style well-suited to creating customizable Web applications. While this paper discusses development in the context of the PIA technology, the fundamental philosophy, which can be roughly characterized as "Web application maintenance is mostly a task of document management and should not require specialized programming skills or tools," applies equally well to other platforms. Several other technologies, such as Meta-HTML, PHP, and the proposed Apache XML initiative, support some of the "active XML" features of the PIA. We encourage the evolution and convergence of these technologies to a standard, platform-independent design for Web applications that facilitates shared development and ongoing maintenance.

Table of Contents:


    1: Introduction: customizable Web applications
        1.1: What are Web applications?
        1.2: Types of Web applications
        1.3: Technologies for Web Applications
        1.4: The life cycle of Web applications
        1.5: Tools and Methodologies
        1.6: Customization
    2: The PIA Approach
        2.1: The PIA's Document-Processing System
        2.2: Flexible site configuration
        2.3: Agents and event based processes
    3: Comparison with Other Systems
        3.1: Conventional platforms
        3.2: Server-side extension languages
        3.3: XML-based approaches
    4: Technical details
        4.1: Performance and portability
        4.2: Server integration
        4.3: Integration with other applications
        4.4: Embedded applications
    5: Conclusions
        5.1: Would the PIA be useful for your web application?

Abstract

The World Wide Web has given rise to a new category of applications: programs whose primary user interface is a web browser, and which consist of a large amount of information or ``documents'' mixed in with a small amount of behavior expressed as ``code.'' In effect, applications have become specialized web servers.

Given the ever-changing nature of information and the Web environment, easy customization and long-term maintenance should be considered key factors for Web application development. This paper describes an XML based design approach and development methodology that supports ongoing maintenance while minimizing the need for special technical or programming skills. Most appropriate for applications deployed by smaller organizations without significant IT support, these design principles are embodied by the Platform for Information Applications (PIA).

PIA based Web applications consist primarily of XML documents written using a set of domain specific tags. (<h1> and <a> are examples of HTML tags. XML, which stands for eXtensible Markup Language, is similar to HTML but allows tags to be defined and used as needed.) Maintenance means editing these documents using any standard XML editor. The flexible PIA engine serves pages in response to client requests by dynamically processing these documents in accordance with developer defined semantics. This processing may include simple tag substitution, page transformation, database lookup and insertions, or any other functions appropriate to the application domain. In essence, the tags provide a specialized vocabulary available to use in customizing the application.

This approach promises platform independent, easily customizable Web applications. XML support in the form of editors and other tools already exists on essentially all platforms and continues to grow. Specifying the processing in XML not only makes the logic more accessible but restricts the dependence on a particular computing environment to the implementation of a few tags. In the PIA, semantics for the small set of primitive or "basic" tags are defined in Java, while the majority of application-specific (defined by developers) tags are specified in terms of these primitives.

The PIA software, written in pure Java, is freely available as open source from RiSource.org and includes interfaces that conform to all relevant standards, including the W3C's Document Object Model (DOM) and the Simple API for XML (SAX). RiSource.org also distributes and helps coordinate the development of open source Web applications, including a workflow system, a shared calendar, a Web site management tool, and a personal ``browsing assistant.'' As the technology for developing and deploying XML based Web applications becomes standardized, RiSource.org will help facilitate the collaboration of the expanding pool of developers who create and maintain Web applications.

1: Introduction: customizable Web applications

1.1: What are Web applications?

The World Wide Web as we know it today represents not one but two revolutions in the world of computing. The first was to transform the entire internet into one huge collection of documents, each potentially no more than a mouse-click away from any other. The second, less obvious revolution was to transform services into collections of active documents.

A web application is a computing service accessed through the web. It consists of a web server -- the ``engine'' that makes documents available to browsers -- and a collection of documents, some of which are ``active,'' i.e. they make the engine ``do things'' like order a book or buy stocks for the client.

This highly efficient method of delivering services represents a potentially huge productivity gain for both providers and customers. Amazon's book recommendations and 1-Click ordering can literally pay for the cost of a book in the amount of time and effort it saves.

Unfortunately the overhead costs of developing and maintaining Web applications limit them to large organizations well supported by an IT infrastructure. Even though most Web applications consist primarily of (unstructured) information with small amounts of embedded software (what Tim O'Reilly has called Infoware), they tend to require the same development tools and skills as ``traditional'' software applications. Development and modification of an application requires a programmer, someone whose primary job is translating the desired behavior into a machine language. This is not unlike the state of numeric computing before the advent of spreadsheets, when financial analysts required assistance to translate their formulas into Fortran programs. This paper describes an approach to building Web applications that reduces the overhead costs, making it possible for smaller organizations and individuals to realize the efficiency gains of providing their services via the Web.

1.2: Types of Web applications

Consider five different types of web applications:

  1. Public applications
    Sites like the ``killer apps'' mentioned above that provide services for thousands or even millions of clients around the world.

  2. Enterprise applications
    ``Intranet'' sites, operating inside an organization at the enterprise level. Similar in many ways to public applications, these might include HR systems, procurement systems, and other companywide functions.

  3. Group applications
    Applications that serve a workgroup or office. These include a number of common functions such as scheduling or document management systems, and ad-hoc functions specific to a particular group, e.g. a Web site for an architectural design project, job control for a print shop, or access to a CVS repository for a software development group. Usually accessed over a local network by a small number of users, these applications are generally maintained by the members of the group they serve, with little or no IT support.

  4. Personal applications
    Applications used by a single user, including calendaring and other personal productivity functions. They usually run on a desktop PC, but the web interface makes them easily accessible from anywhere on the local network, and potentially from anywhere in the world. (The distinction between group and personal applications is a bit blurry; personal applications generally have a single owner of the data. Services ranging from "MyYahoo" to AvantGo to StarOffice might also be counted here, with the last pointing to another area for consideration, "application service providers." For the purposes of this paper, however, hosted services such as MyYahoo are better classed as public sites, and the personal category is reserved for local applications, for example browsing-history management.)

  5. Embedded applications
    Embedded applications run on an appliance-like device that is not a general-purpose computer; examples include printers, scanners, facsimile machines, and networked storage devices. Increasingly, these devices are being built with network access and a web-based user interface.

Each type of application has different requirements, support, and performance characteristics. Public and enterprise applications must handle large numbers of simultaneous users with 24/7 accessibility. These applications generally have full-time staff devoted to their maintenance. Group applications generally have a relatively small number of users but must constantly evolve as the information and group needs change. Embedded applications must be very robust and may be deployed in harsh environments with no local support.

Existing Web application designs and development tools are most appropriate for public and enterprise applications. These applications generally have well-defined software components that are developed and maintained by a support group. In contrast, our design methodology seems more appropriate for group applications intended to be customized and maintained by the people who use them. Furthermore, this XML based approach provides a safe mechanism for extending and customizing applications in the field, a useful feature for embedded applications.

Before describing our approach in detail, we first review the standard technologies and life-cycle for Web applications.

1.3: Technologies for Web Applications

There are three main ways of making documents on a web site ``active:''

  1. CGI scripts.
    This is the ``traditional'' method of building a web application: all the actions are performed by programs, ``scripts,'' that are started by the server in response to particular URL's. The script then performs any necessary form processing, computes a response, and sends that response back to the browser. The interface between the server and the script is the ``Common Gateway Interface'' -- CGI.

  2. Servlets and other server extensions.
    Servlets are little pieces of Java code that the server transfers control to in response to particular URL's. Apart from the fact that they run ``inside'' the server rather than as separate processes (and hence take much less time to start up), there is little to distinguish servlets from CGI scripts. Similar techniques are used with other programming languages, for example C and Perl.

  3. Server Pages.
    This technique involves embedding little bits of code in the web pages, to be interpreted by the server. Typical embedded extension languages include Visual Basic (``active server pages''), Perl, PHP, and Javascript. It is also possible to embed scripts in compiled languages such as Java (Sun's JSP) -- the pages are compiled offline into servlets. Many servers also provide their own ``mini-languages'' (Apache's ``server-side includes,'' for example).
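
To make the first technique concrete, here is a minimal sketch of a CGI-style script. Python is used purely for illustration (the PIA itself is written in Java). The function name `handle_request` and the `name` form field are hypothetical; the only parts taken from the CGI convention are that the server passes the query string in the `QUERY_STRING` environment variable and reads a header block, a blank line, and the page body from the script's standard output.

```python
import html
import os
import sys
from urllib.parse import parse_qs


def handle_request(environ):
    """Build a complete CGI response: headers, a blank line, then the body."""
    # The server passes the form data of a GET request in QUERY_STRING.
    form = parse_qs(environ.get("QUERY_STRING", ""))
    # "name" is a hypothetical form field used only for illustration.
    name = form.get("name", ["world"])[0]
    # Escape user input before placing it in the page.
    body = "<html><body><p>Hello, {}!</p></body></html>".format(html.escape(name))
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # When started by a web server, the response goes to standard output.
    sys.stdout.write(handle_request(os.environ))
```

Note that the whole page lives inside the program as a string constant, so even a cosmetic change to the HTML means editing code.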

1.4: The life cycle of Web applications

In this section we will examine the development ``life cycle'' of a web application, and touch briefly on how it differs from that of a software application.

  1. Initial Development

    Web applications require a different kind of development than traditional software applications. Software applications have always (of necessity) been developed by programmers, sometimes with the assistance of a few user-interface designers (all too often called in at the last minute to figure out why nobody wants to use the new software). Most interactive applications end up being about 70% user interface.

    Web applications, on the other hand, typically exceed 95% ``content'' -- as seen by the user, a web application consists entirely of documents: HTML ``pages'' viewed in a browser. Scattered through these documents are the buttons and text boxes of a more traditional user interface.

    The developer's perspective depends in part on which technology they choose. In the early days of the Web, applications consisted primarily of standard HTML documents with a few CGI scripts to handle <form> submissions. Initial development consisted of creating HTML pages and then creating the CGI scripts to handle specific forms.

    It soon became obvious that this approach was too restrictive since the HTML documents were static and could not reflect updated information. Moreover, developers grew tired of maintaining the correspondence between the HTML forms and CGI processing. Servlets and similar technologies solved these problems by generating every page programmatically. In essence, a method or set of methods is written in your favorite programming language to generate all of the pages that constitute the application. Applications could be arbitrarily powerful and all the information was contained in a single set of (source code) files, making it easier to maintain the correspondence between the (generated) HTML and processing. Development fit very well with traditional software development -- generate some specifications and then write software to meet the specifications.

    Eventually developers realized that modifying source code to fix a missing </table> tag or to modify the navigation bar was a tedious job. Besides that, programming tools do not match the linking structure of Web applications very well. Server pages provide a compromise solution with regular HTML pages containing bits and pieces of embedded code that the server interprets dynamically. Developers who are comfortable with the syntax for both HTML and the embedded language can freely intermix the two. Page layouts and prototype applications can be first developed with static HTML and then augmented with embedded code to provide the desired functions.

    Server pages work reasonably well assuming that all of the developers have similar skill sets, which includes the ability to write in the pages' embedded programming language, and a standard development environment. Oftentimes though, the embedded programming language makes the documents incompatible with standard structured editors (or the editors produce HTML/XML which breaks the embedded language) and inaccessible to non-programmers.

    As described in detail below, the PIA approach uses a type of extensible server page. In contrast with current systems, the embedded ``language'' is pure XML which has several key advantages, such as widespread support from designers and development tools.

  2. Deployment

    Once a web application has been developed, it has to be deployed: uploaded to a public server and integrated with the company's existing web site. Workgroup applications are deployed on an intranet server, but the principle is essentially the same.

    Deployment of a web application is a much more traumatic event than the deployment of a piece of software. Software's availability can be controlled by controlling its distribution. It goes first to a select group of beta testers, and only then (after a few rounds of bug fixing and refinement) to wide distribution. Even after the software ``hits the shelves'' it will take a long time to ``ramp up'' to full deployment.

    A web application is different. After it is installed on a public server it may be only a matter of minutes -- days at the most -- before it is found by the search engines, and shortly thereafter by a horde of eager users. If a public announcement is made, the ``Slashdot effect'' may inundate the server with a ``flash crowd'' -- Britannica's web site was all but inaccessible for a week after its introduction.

    This means that the maintenance cycle is much shorter for a web application than for a software application. The application's designers may have to respond to problems within minutes, rather than months. Support for rapid customization is essential. In effect, what software developers call ``rapid prototyping'' goes on even after a web application is released.

  3. Maintenance

    Once a web application has been deployed, the maintenance begins. It has been said that ``software is the only field in which adding a new wing to a building is considered maintenance.'' This is even more true of web applications: a web site may be completely redesigned and rebuilt several times over its lifetime.

    There are three aspects to maintaining a web application: monitoring the server to ensure that it's operating properly, editing and updating the documents, and modifying the software to keep up with the changes in the documents.

    One of the unpleasant facts of maintenance is that at some point the original developers usually move on to other projects. This is not a significant problem for the ``content'' portion of a web site: professional writers and designers are good at maintaining a consistent style at an organization. It is a problem for the application's software -- this is often obscure and poorly documented, and the programmers are usually the first people to move on.

    The software portion of a web application, because it is usually written in ``scripting languages'' like Perl, or is broken up into tiny fragments embedded in the site's documents, is almost always harder to maintain than a traditional all-software application. Sometimes it's easier to scrap it and have a new programming team rewrite large portions, than to figure out some earlier programmer's tricks.

    The PIA's document-processing framework, which allows the designer to define special-purpose tags that are shared among many documents, simplifies maintenance in several ways.

    The line between ``application maintenance'' and ``customization'' is extremely fuzzy for web applications, especially for group or personal applications. The PIA helps make many maintenance tasks more like the kind of customization that users and web developers are familiar with: modifying documents rather than software.

  4. The Next Project

    One of the major differences between software and infoware shows up when a design and development team moves on to its next project. A software team's next project is almost certain to be a variation on its previous one -- another compiler, say, or another printer controller. The team accumulates knowledge, a suite of tools, and a library of re-useable code that grows with each new project.

    A web design team, on the other hand, is more likely to move on to something very different. A few basic tools and scripts may be carried over from one project to the next, but most of the content -- the information -- will be new. Many server-side scripting languages, in fact, were designed by web design consulting companies in order to provide a framework for code re-use.

    The PIA's tagsets and configuration files give web designers the equivalent of the software team's code library. Entire sub-applications (for example, a calendar) are also easily portable to new projects, and very easily customized.

1.5: Tools and Methodologies

Software development is situated somewhere between a craft and an engineering discipline, and many design and development tools and methodologies exist to assist the process in all of its stages. Software designers may use CASE (Computer Aided Software Engineering) tools; programmers can count on syntax-checking compilers, interactive debuggers, and even automatic documentation extractors (Javadoc being a recent example). For maintenance there are tools like profilers for improving the software's performance, version control systems such as CVS for archiving changes, and bug-tracking systems for managing requests.

Even though the linear ``waterfall model'' of the software development lifecycle has been partially abandoned, software development is still fairly straightforward. There may be a flurry of ``rapid prototyping'' while designing the user interface, and major additions may be made during maintenance, but on the whole the picture is one of steady progress and gradual evolution.

Even in the design phase software is relatively straightforward. Whatever specific methodology is being followed, the application at the end is usually fairly close to what the original requirements specified.

Web application design and development are significantly more chaotic. Whereas it's almost inconceivable for a software application to start out as a compiler and end up as a web browser (Emacs may be one of the few exceptions), it's not unusual to find a simple search engine that has transformed itself overnight into a ``portal site.'' Unlike software development, few (if any) methodologies exist to guide this process.

Then again, the tools available for building web applications are still very primitive. WYSIWYG editors are good for documents whose ultimate destination is ink on paper, but they fail miserably when applied to a web page that may be viewed on anything from a Palm Pilot to a 21-inch monitor. There is little if any support for complex operations like changing the header and footer of every document on a web site. There's no support at all for editing documents that may contain bits of scripting mixed in (especially if some of it is meant for the browser and some for the server, in two different languages).

Some proprietary Web development platforms, such as Cold Fusion, do support some aspects of application development and maintenance. Unfortunately these tend to be very expensive, have their own nonstandard programming component, and of course lock the designers in to a single vendor.

1.6: Customization

The largest problem facing a workgroup or small business is customizing their web applications. An enormous amount of effort can go into designing and maintaining a public or enterprise-level web site, but the project has the full support of the company's IT department, a large budget for designers and consultants, and so on. Since the revenues from a public web application are likely to be large (they may even be the company's only revenue stream, as in the case of an Internet-based business), it's easy to justify a large expenditure. Similarly, an enterprise-wide intranet site is going to be run by the IT and HR departments, which can easily justify its cost.

The situation is far different in a workgroup or small office. The software complexity of the application is likely to be almost as great as that of a large public web site, but the amount of information associated with it is far less, and there may not be even a single full-time person dedicated to maintaining the entire network, let alone the web applications that make it useful.

As a result, small-scale web applications are usually ``home-grown'' in somebody's spare time. (They might be bought ``off the shelf,'' but shrink-wrapped web applications are exceedingly rare at this point.) They will tend to grow haphazardly, as the result of a series of customizations to meet the group's changing requirements. Of course, it's almost trivial to customize the information part of a small web site. The software is another matter, and will often consist of small CGI scripts downloaded from the Net and changed as little as possible.

Web applications that are built using extensible server pages (for example, ASP, JSP, PHP3, and Meta-HTML) tend to be easier to customize than those based on standard programming techniques such as CGI scripts or servlets. The PIA's approach to extensible server pages is particularly simple because, being XML-based, it is well adapted to existing authoring tools and techniques.

2: The PIA Approach

The PIA is a highly versatile platform for web applications: it is able to function as a ``traditional'' web server, a client, or a proxy.

Furthermore, the PIA can combine these aspects, enabling totally new kinds of web applications. The PIA approach to web applications is based on three main ideas, which we will examine in more detail below:

  1. Server-side document processing
  2. Flexible server configuration
  3. Agents and event-based processing

2.1: The PIA's Document-Processing System

The PIA's server-side document processing system (DPS) is essentially a form of extensible server pages, with three main differences from other systems.

  1. The syntax of the extensions is pure XML or HTML -- there are no constructs like assignment statements or arithmetic expressions. This means that documents can be created, edited, and processed using existing tools. It also means that the PIA's processing feels more like document processing and formatting than like programming.
  2. None of the tags in a PIA document has a fixed meaning. This allows the PIA to, for example, take ordinary HTML documents and apply ``styles'' or other processing to them. The processing instructions -- the semantics -- of the tags are usually specified in a separate document which we call a tagset (itself an XML document). The configuration files associate particular classes of documents with particular tagsets (this association can also be made dynamically). Of course, tag definitions and processing instructions can also be embedded directly in a document.
  3. The PIA's document processing deals with parse trees, not strings (text). Every syntactically correct document can be represented by a ``tree structure'' in which every start tag has a matching end tag, all start and end tags are properly nested, and so on. Getting this structure correct is easy to do in a WYSIWYG HTML or XML editor, but not in most programming languages -- in particular it's perfectly feasible for any of the other systems to send an incorrect page to the browser, which may then do unpredictable things with it. This can't happen in the PIA.
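
The idea of a tagset applied to a parse tree can be sketched in a few lines. This is an illustration only, not the PIA's implementation or its API: it is written in Python rather than Java, the tagset is an ordinary dictionary rather than an XML document, and the names `expand`, `TAGSET`, `today`, and `restyle_h1` are all hypothetical. The `<today>` tag and the fixed date are likewise invented for the example.

```python
import xml.etree.ElementTree as ET


def expand(node, tagset):
    """Return a processed copy of `node`, applying handlers from `tagset`."""
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text, copy.tail = node.text, node.tail
    # Children are processed first, so a handler sees expanded content.
    for child in node:
        copy.append(expand(child, tagset))
    handler = tagset.get(node.tag)
    if handler is None:
        return copy              # tags without a definition pass through
    result = handler(copy)
    result.tail = node.tail      # keep the text that followed the original tag
    return result


def today(node):
    # Hypothetical application-specific tag. A fixed placeholder date
    # keeps the example deterministic.
    out = ET.Element("span", {"class": "date"})
    out.text = "1999-12-31"
    return out


def restyle_h1(node):
    # An ordinary HTML tag given an additional meaning by the tagset.
    node.set("class", "site-title")
    return node


# The "tagset": a mapping from tag names to their processing.
TAGSET = {"today": today, "h1": restyle_h1}

page = ET.fromstring("<body><h1>Calendar</h1><p>Today is <today/>.</p></body>")
html_out = ET.tostring(expand(page, TAGSET), encoding="unicode")
print(html_out)
```

Because handlers receive and return tree nodes rather than text, the page author only ever writes tags, and swapping in a different dictionary changes what those tags mean without touching the page.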

Let's examine some of the consequences of these features in more detail:

The operation of the PIA's document-processing system is described further in a companion White Paper, Document Processing in the PIA. The PIA's documentation can be found online at www.RiSource.org/PIA/Doc. There are several features of the PIA, including its flow-through architecture and its use of open API's, that are of interest for other areas besides customizable Web applications.

2.2: Flexible site configuration

Like most web servers, the PIA has a configuration file in which all of its many options and parameters can be specified. As with most servers, additional configuration files can be placed in any directory to supply local options.

Unlike other servers, however, the PIA's configuration files are pure XML (which should come as no surprise), and any of the standard document-processing tags can be used in them. In particular, the <include> tag can be used to include other files (for example, the standard mappings for filename extensions), and the <if> tag can be used to make parts of the configuration optional (for example, setting up authorization only if a password file can be found).

Three aspects of the site configuration mechanism are particularly interesting:

  1. Shadow Directories

    The PIA has a mechanism that allows two directories to be ``overlaid'' on top of one another. The one ``on top'' is the real directory, and any files the PIA writes go into it. The one ``underneath'' is called the virtual directory, and the PIA looks there for any file it can't find in the real one.

    Although this seems confusing at first, it makes local customization enormously easier. For one thing, it gives greatly increased protection to the PIA's own files, and to an application's documents. Usually an application is shipped with a configuration file that puts all of its documents in a virtual directory. Any local customizations then go into the real directory. The virtual directory might even be on a CD-ROM, or in some location shared by many users, each with their own real directory full of personal data and customizations.

  2. Virtual Documents

    The same configuration mechanism that allows for shadow directories can also be used to make ``virtual documents'' appear in a directory. The main use of this is to create ``aliases'' or ``symbolic links'' in an OS-independent way: the virtual document or directory can be brought in from anyplace on the system.

  3. Extension Mappings

    The configuration file also defines the mapping between filename extensions and both MIME types and tagsets. It is also possible to hide files from the client; hidden files can still be accessed from inside the application.

    The extension mapping also defines a ``search order'' -- a URL without an extension causes the PIA to try each of the listed extensions in order until a document is found. Among other advantages, this means that document names in URL's don't need extensions, making them shorter and easier to type and remember.

    A further advantage of omitting extensions is that a document's extension, and hence the tagset that processes it, can be changed at any time by the application designer without invalidating a user's bookmarks.
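
The lookup rules for shadow directories and extension search orders can be sketched together. The following Python fragment is an illustration under stated assumptions, not the PIA's code: the function names `resolve` and `write`, the contents of `SEARCH_ORDER`, and the file names are hypothetical. It also assumes that the real directory is searched completely (all extensions) before the virtual one, which the description above does not settle.

```python
import tempfile
from pathlib import Path

# Hypothetical search order; a real site would take it from its configuration.
SEARCH_ORDER = [".xml", ".html", ".txt"]


def resolve(name, real_dir, virtual_dir, search_order=SEARCH_ORDER):
    """Find the file for `name`, preferring the real directory."""
    # A name that already has an extension is tried as given; otherwise
    # each listed extension is tried in order.
    if Path(name).suffix:
        candidates = [name]
    else:
        candidates = [name + ext for ext in search_order]
    for directory in (real_dir, virtual_dir):
        for candidate in candidates:
            path = Path(directory) / candidate
            if path.is_file():
                return path
    return None


def write(name, text, real_dir):
    """Writes always go to the real directory; the virtual one is untouched."""
    path = Path(real_dir) / name
    path.write_text(text)
    return path


with tempfile.TemporaryDirectory() as tmp:
    real, virtual = Path(tmp, "real"), Path(tmp, "virtual")
    real.mkdir()
    virtual.mkdir()
    # The application ships its documents in the virtual directory.
    (virtual / "home.html").write_text("shipped home page")
    (virtual / "help.html").write_text("shipped help page")

    shipped = resolve("home", real, virtual).read_text()
    write("home.html", "customized home page", real)
    customized = resolve("home", real, virtual).read_text()
    fallback = resolve("help", real, virtual).read_text()
    untouched = (virtual / "home.html").read_text()
    missing = resolve("missing", real, virtual)
```

In this sketch a local customization shadows the shipped page as soon as it is written, pages that were never customized are still served from the shipped copy, and the shipped files themselves are never modified.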

2.3: Agents and event based processes

The PIA also serves as a platform for running software agents. These are small XML documents that specify actions to be performed in response to events rather than specific client requests. In this role the PIA is sometimes referred to in its documentation as an ``agency.''

There are several uses for agents, corresponding to different kinds of events that activate them.

3: Comparison with Other Systems

In this section we will compare the PIA's approach to constructing web applications to that of other systems currently in use.

3.1: Conventional platforms

The conventional platforms for web applications break down into two broad categories: separate code, and embedded code:

  1. Separate code
    Systems that separate code from documents include CGI scripts, Apache modules, Java servlets, and so on. All require considerable programming skills in order to change or customize the behavior associated with an application's document. All have the additional disadvantage that, because code and data are in separate files, it is easy for changes in a document to ``break'' the associated code. Modules and servlets also pose the risk of a bug in the code causing a server crash.

  2. Embedded code
    Systems that embed code in documents include ASP (Visual Basic) and JSP (Java); similar systems exist for embedding other languages, including shell scripts (Apache server-side includes), Perl and Python. All still require a certain level of programming skill, with at least passing familiarity with a programming or scripting language and its syntax. Editing tools designed for HTML and XML give little or no assistance when editing the embedded scripts.

It is difficult, using embedded languages, to perform processing on arbitrary documents. It is almost impossible to use them to define new tags, or new meanings for old tags.

3.2: Server-side extension languages

Server-side extension languages are complete programming languages that are designed to generate web pages; in general ordinary text and HTML tags are treated as ``constants'' and passed through to the client. The PIA falls squarely in this category, but with some significant differences which we will examine in more detail below. For the moment, let's compare the PIA to two well-known server-side languages, Meta-HTML and PHP.

Feature                    Meta-HTML    PHP3       PIA
Syntax                     HTML-like    C-like     pure XML
Embedded processing        yes          yes        yes
Tagsets                    yes          no         yes
Redefine document tags?    yes          no         yes
Treats documents as        strings      strings    parse trees

Neither Meta-HTML nor PHP, nor any other server-side scripting system that we know of, shows any awareness of the structure of the underlying document. The document is simply treated as a ``character string'': it is possible to generate syntactically incorrect documents, and it is not particularly easy to manipulate the document's parse tree. This makes it difficult to take advantage of the document's structure. (Of course, XSL does explicitly manipulate the document's structure, but it does not provide the traditional scripting capabilities.)
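
The difference between the two models can be illustrated with a small sketch. It is written in Python with the standard xml.etree.ElementTree module rather than in any of the languages compared here, and the function names are invented for the illustration. The string-based builder contains a deliberate bug (an unclosed tag) that nothing detects until the page is parsed; the tree-based builder has no way to emit mismatched tags.

```python
import xml.etree.ElementTree as ET


def page_from_strings(items):
    """String model: markup is just text, so nothing checks it.

    The missing </li> is a deliberate bug for the illustration.
    """
    body = "".join("<li>" + item for item in items)
    return "<ul>" + body + "</ul>"


def page_from_tree(items):
    """Tree model: the page is built as nodes and serialized at the end."""
    ul = ET.Element("ul")
    for item in items:
        li = ET.SubElement(ul, "li")
        li.text = item  # special characters are escaped on output
    return ET.tostring(ul, encoding="unicode")


def is_well_formed(markup):
    """Return True if the markup parses as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(page_from_strings(["a", "b"])))  # the bug surfaces here
    print(page_from_tree(["a & b", "c"]))
```

In the tree version the closing tags and the escaping of the ampersand are produced by the serializer, so they cannot be forgotten by the page author.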

Note that the latest version of PHP does include an XML parser, which can be used for processing documents, but it cannot handle ordinary HTML. Unlike Meta-HTML, PHP lacks the notion of tagsets.
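
To make the notion of a tagset concrete, here is a minimal sketch, again in Python and not in the PIA's own tagset syntax or API. It assumes a tagset can be modeled as a table that maps tag names to handler functions; the names TAGSET, shout, bold_as_strong, and process are hypothetical. One handler defines a new tag, the other gives an existing tag a new meaning.

```python
import xml.etree.ElementTree as ET


def shout(elem):
    """New tag: <shout> becomes <em> with upper-cased text (children are dropped)."""
    out = ET.Element("em")
    out.text = (elem.text or "").upper()
    return out


def bold_as_strong(elem):
    """Old tag, new meaning: <b> is rewritten as <strong>."""
    elem.tag = "strong"
    return elem


# Assumption: a tagset is modeled here as a plain dict of tag name -> handler.
TAGSET = {"shout": shout, "b": bold_as_strong}


def process(elem, tagset):
    """Return a processed copy of elem: children first, then elem's own handler."""
    copy = ET.Element(elem.tag, dict(elem.attrib))
    copy.text, copy.tail = elem.text, elem.tail
    for child in elem:
        copy.append(process(child, tagset))
    handler = tagset.get(elem.tag)
    if handler is None:
        return copy  # ordinary tags pass through unchanged
    result = handler(copy)
    result.tail = copy.tail  # keep the text that followed the original tag
    return result


if __name__ == "__main__":
    doc = ET.fromstring("<p>Say <shout>hello</shout> to <b>Ann</b>.</p>")
    print(ET.tostring(process(doc, TAGSET), encoding="unicode"))
```

Because the table is data, the same document can be processed with a different tagset simply by passing a different table.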

3.3: XML-based approaches

The available XML-based approaches to server-side document processing fall into two broad categories: embedded code inside of XML constructs (like PHP), and ``style-sheet'' languages. As we will see, the PIA, which is also XML-based, has aspects of both.

The main disadvantage of the style-sheet languages is that they cannot be embedded in a document. In addition, they tend to be difficult to learn, and cannot easily be manipulated as data. XSL, and the closely related Cascading Style Sheets of HTML, are not complete programming languages. We feel that both expressive power and embeddability are important. It's worth noting, though, that style-sheet processing could easily be added to the PIA's document processor.

Feature                  XSL                                 PIA
local transformations    no: assumes a complete parse tree   yes: documents can be streamed
arithmetic operations    counters only                       yes
text manipulation        sorting and concatenation           sort, split, join, trim, subst
iteration                only over node sets                 general
tests                    tree matching                       numeric, string
embedded expressions     {xpointer}                          &entity;
definitions              only in stylesheet                  tagset or document
processing               only in stylesheet                  tagset or document
native code extensions   no                                  yes, through tag handlers
interface to files       read-only                           read/write
interface to web docs    read-only                           read/write/query
interface to server      no: documents only                  yes: can operate on transactions
interface to database    no                                  yes

Learnability             complex                             simple
Security                 N/A                                 flexible
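
The first row of the table, local transformations on a streamed document, can be sketched as follows. This is a Python analogy built on the standard xml.sax module, not the PIA's document processor; the class and function names are invented for the example. Each parser event is written to the output as it arrives, so a complete parse tree is never held in memory.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class RenamingFilter(XMLGenerator):
    """Pass SAX events straight through to `out`, renaming selected tags."""

    def __init__(self, out, renames):
        super().__init__(out, encoding="utf-8")
        self.renames = renames  # hypothetical mapping: old tag -> new tag

    def startElement(self, name, attrs):
        super().startElement(self.renames.get(name, name), attrs)

    def endElement(self, name):
        super().endElement(self.renames.get(name, name))


def stream_transform(chunks, out, renames):
    """Feed the document piece by piece; no complete tree is ever built."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(RenamingFilter(out, renames))
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()


if __name__ == "__main__":
    out = io.StringIO()
    # The input arrives in two pieces, split in the middle of a word.
    stream_transform([b"<p>Some <b>bo", b"ld</b> text</p>"], out, {"b": "strong"})
    print(out.getvalue())
```

A transformation of this kind only ever looks at the current tag, which is what makes it ``local''; anything that needs to see the whole document, as tree matching does, cannot be done this way.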

4: Technical details

A full technical analysis of the PIA is beyond the scope of this paper. Here we mention a few of the technical aspects of the PIA's implementation that have direct consequences for the designer of web applications.

4.1: Performance and portability

Currently the PIA is highly portable, but fairly large and not particularly fast; this makes it most suitable for personal and group applications. We are actively working on the size and speed problems, with the ultimate goal of developing a much faster and smaller version that can easily be integrated with Apache and other high-end web servers.

The PIA is currently written in Java, which makes it highly portable; it is known to run under Sun's JDK 1.1 and 1.2 on Linux, Solaris, and Windows 98 and NT, as well as under Kaffe on Linux. Adding to the PIA's portability is the fact that the user interface is completely web-based, avoiding Java's user interface classes.

Unfortunately, Java is an interpreted language, which imposes a performance penalty. There are several factors in the PIA's design which partially compensate for this:

In its present Java implementation, the PIA is not yet well suited for large public websites or other applications where extremely high performance is needed. There are three possible ways of improving the PIA's performance dramatically:

  1. Full compilation
    Some Java compilers exist that can produce machine-language applications instead of interpreted ``byte codes.'' These have their limitations, but nothing would prevent using one to give the PIA an immediate performance boost.

  2. Conversion to C
    Because the PIA's document processing engine is based on a small number of primitive tags, it would be quite simple to rewrite it in another programming language. C is the obvious choice: it is universally available, easily interfaced with other web servers, and delivers very high performance. (See RiSource.org for status of the conversion effort.)

  3. Document format conversion
    Although XML is easy to parse, parsing is not completely trivial. The PIA imposes an additional step, namely looking up the action associated with each tag. A ``pre-parsed'' binary document format would eliminate this step, and possibly speed up text searches as well. The logical extension of this idea, which has the potential to deliver the highest performance, would be to compile pages into C programs. Running such a program would perform any necessary processing and send the resulting page directly to an output stream. (Note that this is similar to the page compilation of PHP 4.0. Further investigation is needed to determine whether the significant features of the PIA can be integrated with the PHP module.)
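
The idea behind the third option can be sketched in a few lines. The example below is Python, not C, and it does not reflect any existing PIA document format; compile_page, run_page, and HANDLERS are hypothetical names. The page is parsed once and every tag's action is looked up once, leaving a flat list of literal strings and callables that can be replayed to an output stream for each request.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def compile_page(source, handlers):
    """Parse once, resolving each tag's action ahead of time.

    Returns a flat list whose items are literal strings or callables.
    Attributes are ignored to keep the sketch short.
    """
    ops = []

    def walk(elem):
        action = handlers.get(elem.tag)
        if action is not None:
            ops.append(action)  # looked up now, not on every request
        else:
            ops.append("<%s>" % elem.tag)
            if elem.text:
                ops.append(escape(elem.text))
            for child in elem:
                walk(child)
            ops.append("</%s>" % elem.tag)
        if elem.tail:
            ops.append(escape(elem.tail))

    walk(ET.fromstring(source))
    return ops


def run_page(ops, context, out):
    """Send the page to `out` with no parsing and no tag lookup."""
    for op in ops:
        out.write(op if isinstance(op, str) else op(context))


# Assumption: an active tag is modeled as a function of the request context.
HANDLERS = {"user-name": lambda context: escape(context["user"])}

if __name__ == "__main__":
    ops = compile_page("<p>Hello, <user-name/>!</p>", HANDLERS)
    out = io.StringIO()
    run_page(ops, {"user": "Ann"}, out)
    print(out.getvalue())
```

The compiled list plays the role of the pre-parsed format; generating source code from it instead of interpreting it would correspond to the page-compilation step described above.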

4.2: Server integration

In order to integrate PIA-based applications into an existing web site, it is usually necessary to integrate the PIA with the server that is already present. There are four ways of doing this:

  1. Proxying
    Apache has a technique for seamlessly integrating another web server into a website; this is done by adding a single line to Apache's configuration file. Apache will invisibly proxy requests to selected sub-directories to the other server. The PIA server can then be configured to run on a different port from Apache's, but to report Apache's port to the browser when constructing a URL. This technique allows portions of a website to be served by the PIA, and the rest by Apache.

  2. Servlets
    Java servlets are a common, standardized technique for including active documents in a web server. Since the PIA is written in pure Java, it's easy to wrap a servlet interface around the PIA's web-server engine.

  3. Modules
    Apache has an extremely powerful extension technique in which ``modules'' -- essentially shared libraries with a standard interface -- are linked in with the main server. Many of Apache's standard functions are already implemented in this form, as are several existing extension languages such as PHP. By rewriting the PIA's document processing and site-structuring subsystems in C, it will be possible to integrate the PIA into Apache as a module. When complete, this effort will give the tightest integration and the highest performance.

  4. Offline Processing
    As we mentioned above under ``performance,'' it is possible to process documents outside of the server environment. If a web site consists entirely of static pages, it is possible to use the PIA's document processing purely as a formatter and have the existing server deliver the resulting static pages.
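
The fourth option amounts to a batch job. The sketch below shows its general shape in Python; it is not a PIA tool, and the function name, the file extensions, and the `process` callable (which stands in for whatever document processor is used) are all assumptions made for the example.

```python
from pathlib import Path


def build_static_site(src_dir, out_dir, process):
    """Run `process` over every *.xml file under src_dir.

    Each result is written as a *.html file at the same relative
    path under out_dir. Returns the list of files written.
    """
    src_root, out_root = Path(src_dir), Path(out_dir)
    written = []
    for src in sorted(src_root.rglob("*.xml")):
        dest = (out_root / src.relative_to(src_root)).with_suffix(".html")
        dest.parent.mkdir(parents=True, exist_ok=True)
        text = src.read_text(encoding="utf-8")
        dest.write_text(process(text), encoding="utf-8")
        written.append(dest)
    return written
```

The existing server is then simply pointed at the output directory; the job is rerun whenever a source document or tagset changes.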

4.3: Integration with other applications

It is just as easy to integrate the PIA with existing applications, web-based and otherwise, as it is to integrate it with a web server. As usual, there are several ways of doing this:

4.4: Embedded applications

We expect one of the major uses of the PIA to be ``embedded'' applications, with the PIA providing the primary user interface for some piece of equipment that is not a general-purpose computer. There are several features of the PIA that make it suitable for embedded applications.

5: Conclusions

The PIA approach offers a way to build Web applications that can be easily customized and maintained. It leverages existing and future XML tools and allows developers to create application-specific vocabularies, so that the documents that make up a Web application can be created and modified without specialized programming skills or tools.

Server-side Document Processing

  XML-compliant
    Standard tools (editors, parsers, etc.) can be used for development. Processing can be separate or embedded in documents.

  Embeddable
    Documents can contain their own processing, embedded in the portion of the document that has to be processed.

  Separable
    Separate tagsets can be shared among documents.

  User-extensible
    User-defined tags have the same syntax as built-in operations.

  Operates on parse trees
    Impossible to generate a syntactically incorrect output document.

  Small set of primitives
    Easily ported to other programming languages. Easy to learn.

  Turing complete
    It is possible to write arbitrary programs in this XML language (not that you would want to, but it is nice not to have to worry about running into brick walls).

  Efficient
    Documents can be streamed through, meaning that the browser can get the first part of a page while the rest is still being generated.

Web-based application platform

  Flexible configuration
    Active documents and data are easily separated, but can be shown in the same URL tree.

  XML configuration files
    Full power of the document-processing language is available at configuration time.

  Java-based
    Platform-agnostic. Runs on Linux, Unix, Windows.

  Agent-based event handling
    Processing can occur based on transaction features (request and response headers) or time. Agents can modify transactions.

  Client / Server / Proxy
    Web engine operates in multiple modes for maximum flexibility.

5.1: Would the PIA be useful for your web application?

Copyright © 1999 Ricoh Innovations, Inc.
$Id: wp-webapp.html,v 1.12 2001/01/12 01:45:43 steve Exp $

Stephen R. Savitzky <steve@rsv.ricoh.com>