Implementing The Online Edition Of The Chicago Hittite Dictionary

The eCHD is a Web-based, XML-encoded resource that provides an indexed, searchable version of the printed dictionary. Transforming the printed dictionary into an electronic resource, starting with the 'P' volume, is a multi-step process, summarized below.

A. The word-processing document (Word .doc file) is the primary input file.

B. OpenOffice is used to convert the document format into low-level XML.

C. The low-level XML is changed into meaningful XML based on the eCHD schema.

D. Some manual editing is needed to correct and fine-tune the automated output.

E. The XML is loaded into a Tamino database for storage, indexing, and querying.

F. A Java-based interface is being developed to provide access to the eCHD online.

Supporting Notes:

A. The word-processing document currently uses ASCII substitution characters to represent the special symbols used in the CHD. We may soon move to full Unicode representation, as that standard is maturing and is increasingly accepted by the scholarly community.
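
As a rough illustration of what a move from ASCII substitution characters to Unicode could involve, the Python sketch below replaces placeholder sequences with Unicode characters. The placeholder codes ("s^", "h_", "a=" and so on) are invented for this example and are not the CHD's actual conventions; only the target characters are ones commonly seen in Hittite transliteration.

```python
# Hypothetical ASCII placeholders mapped to Unicode characters.
# The placeholder codes are assumptions made for illustration only.
SUBSTITUTIONS = {
    "s^": "\u0161",  # s with caron
    "h_": "\u1e2b",  # h with breve below
    "a=": "\u0101",  # a with macron
    "e=": "\u0113",  # e with macron
    "i=": "\u012b",  # i with macron
    "u=": "\u016b",  # u with macron
}


def to_unicode(text: str) -> str:
    """Replace each placeholder with its Unicode character, longest code first."""
    for code in sorted(SUBSTITUTIONS, key=len, reverse=True):
        text = text.replace(code, SUBSTITUTIONS[code])
    return text


print(to_unicode("pah_s^-"))
```

Keeping the mapping in a single table means the same word-processing files could be converted again if the placeholder conventions or the target characters change.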

B. OpenOffice.org is an open-source, community-developed office productivity suite available via Sun Microsystems (http://wwws.sun.com/software/star/openoffice/). OpenOffice matters to this process because it uses XML as its underlying document format. The CHD document is read into OpenOffice, then simply saved in OpenOffice format. This turns it into an XML file in which each formatted word or character is converted into tagged data. For example, a 12 pt, all-uppercase, superscripted 'URU' is tagged as exactly that. The result is very detailed, low-level XML, but the transformation has begun.

C. A complex, custom-written XSL Transformation is applied to the low-level XML. The result is semantically meaningful XML based on the eCHD Schema. For example, the 'URU' above is transformed into a determinative with an appropriate tag.
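
The idea behind this step can be sketched in a few lines. The example below uses Python's standard ElementTree module as a stand-in for the project's XSLT, and every element and attribute name in it (`span`, `style`, `entry`, `determinative`, `hittite`) is hypothetical; it reproduces neither the OpenOffice file format nor the eCHD schema. It only shows the general move from formatting-based tags to meaning-based tags.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the low-level, formatting-based XML.
# Element and attribute names are hypothetical.
LOW_LEVEL = (
    "<p>"
    '<span style="superscript-caps">URU</span>'
    '<span style="italic">Hattusa</span>'
    "</p>"
)

# Hypothetical mapping from formatting styles to semantic tags.
STYLE_TO_TAG = {
    "superscript-caps": "determinative",
    "italic": "hittite",
}


def to_semantic(low_level_xml: str) -> str:
    """Rename formatting-based spans to meaning-based elements."""
    root = ET.fromstring(low_level_xml)
    entry = ET.Element("entry")
    for span in root.iter("span"):
        # Styles without a known meaning fall back to a generic element.
        tag = STYLE_TO_TAG.get(span.get("style"), "text")
        child = ET.SubElement(entry, tag)
        child.text = span.text
    return ET.tostring(entry, encoding="unicode")


print(to_semantic(LOW_LEVEL))
```

A real transformation has to handle far more cases (nested formatting, punctuation, cross-references), which is presumably why the actual stylesheet is described as complex.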

D. The automated process handles the required transformation to a surprising extent, but some manual editing is still necessary to fine-tune the output at this stage.

E. Tamino is a native XML database, which stores and manipulates data as XML rather than as relational (table-based) data. Indexes can be created on the CHD data and queries performed against it.
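
To make the difference from table-based storage concrete, the Python sketch below runs a path-style query and builds a simple index over a small XML fragment held in memory. It is not Tamino's API: ElementTree stands in for the database, and the element names (`dictionary`, `entry`, `headword`, `pos`) as well as the sample entries, written in simplified ASCII, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary fragment; names and entries are illustrative only.
DATA = """
<dictionary>
  <entry><headword>pai-</headword><pos>verb</pos></entry>
  <entry><headword>palsa-</headword><pos>noun</pos></entry>
  <entry><headword>pahs-</headword><pos>verb</pos></entry>
</dictionary>
"""

root = ET.fromstring(DATA)

# Query: a path expression selects entries by the content of a child element.
verbs = [e.findtext("headword") for e in root.findall("./entry[pos='verb']")]

# Index: a lookup table from headword to entry avoids scanning every entry.
index = {e.findtext("headword"): e for e in root.iter("entry")}

print(verbs)
print(index["palsa-"].findtext("pos"))
```

The query addresses the data by its XML structure directly, with no mapping of entries onto rows and columns.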

F. A front-end application is being developed in Java to allow scholars to access this data in a variety of ways, using the indexes and queries provided by the database format. XSLT is also being used to reformat the underlying XML back into presentation-quality documents.
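
The reverse direction, from semantic XML back to a formatted document, can be sketched in the same spirit. The example below is written in Python rather than XSLT, and it assumes hypothetical `determinative` and `hittite` elements and a simple HTML target; the project's actual stylesheets and output formats may differ.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rendering rules: semantic tag -> (opening, closing) HTML markup.
RENDER = {
    "determinative": ("<sup>", "</sup>"),
    "hittite": ("<i>", "</i>"),
}


def render_entry(semantic_xml: str) -> str:
    """Turn a semantic entry back into formatted HTML."""
    entry = ET.fromstring(semantic_xml)
    parts = []
    for child in entry:
        # Elements without a rendering rule are emitted as plain text.
        open_tag, close_tag = RENDER.get(child.tag, ("", ""))
        parts.append(open_tag + escape(child.text or "") + close_tag)
    return "<p>" + "".join(parts) + "</p>"


print(render_entry(
    "<entry><determinative>URU</determinative><hittite>Hattusa</hittite></entry>"
))
```

Because the formatting rules live in one table, the same semantic data could be rendered differently for screen and for print without touching the stored XML.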

Actual Progress to Date (September 2002):

A. Word-processing documents are available for P, for S (as it is being written), and, to a lesser extent, for some of the earlier volumes. A subset of P (the first 67 articles) is being used as the test case for the project.

B. All of the P volume has been transformed into low-level XML via OpenOffice. A coherent set of low-level formatting-based tags has been established.

C. The subset of P has been transformed into meaningful XML, the schema for which has been defined.

D. Manual editing is essentially complete for the subset of P.

E. Work in Tamino is underway. The eCHD Schema has been loaded into the database, as well as some test data.

F. Specifications for the front-end tool, as well as some of the requisite XSL transformations, have been developed. Java programming of the front-end interface is ongoing.

Note: The Electronic CHD is part of the XSTAR (XML System for Textual and Archaeological Research) initiative at the Oriental Institute of the University of Chicago.