NETWORKING THE PAST
ARCHAEOLOGICAL DATA MODELS AND ELECTRONIC PUBLICATION ON THE WORLD-WIDE WEB
By David Schloen, Assistant Professor of Syro-Palestinian Archaeology
The Oriental Institute and the Department of Near Eastern Languages and Civilizations
The University of Chicago
(This article originally appeared in The Oriental Institute News and Notes, No. 160, Winter 1999, and is made available electronically with the permission of the editor.)
Now that "Internet" is a household word and people routinely "get on the Web" to find information of all kinds, it is time for archaeologists to make effective use of computer network technology in order to publish their discoveries and interpretations. Indeed, many archaeologists, including Oriental Institute archaeologists, have been actively experimenting with electronic publication in recent years. But there is still a long way to go. To make good use of the Internet, archaeologists need to sit down and do the hard work of devising a common approach to electronic publication, rather than just adopting generic off-the-shelf techniques that were invented for other purposes. True, it is very easy nowadays to produce what is essentially a computerized imitation of the traditional printed page - a relatively static, unchanging collection of text and pictures intended to be read and referenced by human readers - and to disseminate this world-wide over the Internet. This is certainly a big step forward, because when the familiar "page" metaphor is enhanced by "hyperlinking" (the ability to click on a word or picture and jump immediately to another related page), as it is in the World-Wide Web, a wide variety of published information can be referenced far more quickly and efficiently than with printed books. In many cases, also, information can be made available that would never see print because of the prohibitive expense of traditional publication, e.g., large numbers of color photographs. The advantages for publishing complicated archaeological reports, in particular, are obvious, because a reader can move easily from the general interpretation of an excavated site to detailed maps, photographs, and other data, and back again, without a lot of time-consuming paper shuffling.
The Web as it exists today is a valuable tool and many archaeologists are making use of it. But at present the Web falls far short of its potential as a means of sophisticated archaeological research. This is because it is not "searchable," except in a very limited way. The Web is great for "human navigation" in which one sits down and clicks manually to jump from page to page, perusing what is of interest, but there is no way to use the power of the computer to trawl through huge quantities of data in order to retrieve information according to specified search criteria. In a way this defeats the purpose of electronic publication, because information that is published on computers should be "machine readable" as well as "human readable." Archaeologists, in particular, would benefit from computerized searches and analyses of the large amounts of data that they record on a routine basis and that they and their colleagues could easily place on the Web. Archaeological research would be greatly enhanced by the ability to get quick answers to specific questions about exactly what has been discovered and where it was found, searching either within individual archaeological sites or across many sites. Such answers could be obtained with a rapidity, level of detail, and comprehensiveness that are not currently feasible, and the automatically generated results could be presented in a variety of forms, e.g., lists, charts, statistical graphs, or color-coded distribution maps. Time and energy would then be freed to ask further questions and to detect patterns that might otherwise be missed. In this way a real qualitative improvement in research could be achieved simply by increasing the speed and scope of information retrieval, in comparison to the laborious and much less comprehensive kind of manual search-and-retrieval that is done with printed reports.
Internet experts have long been aware of the Web's deficiencies in terms of "searchability" and they have recently addressed this problem. The standard page-oriented method of putting information on the Web using the "Hypertext Markup Language" (abbreviated as "HTML") does not provide the computer with any information concerning the meaning of the text or pictures that it is handling. HTML merely specifies the way in which text and pictures are to be presented on the computer screen. The computer has no knowledge of what a Web page it displays is actually about - it might be an archaeological report, a stock market quotation, or a cooking recipe, for all it knows. To correct this deficiency of HTML, therefore, a new "Extensible Markup Language" (abbreviated as "XML") has been invented. Information stored in XML format can be "marked up" or "tagged" to whatever degree is necessary in order to convey not just the data itself but also detailed "metadata" which tells a computer what the information is about. For example, a description of an ancient bronze tool unearthed in an excavation could be tagged in such a way that a computer would know not just how to display the description as human readable text in a certain font, color, and point size (which is all that HTML can do now), but would also know that the description referred to a metal artifact with certain characteristics that was found in a specific location. Computers could then retrieve this information in the course of automated searches based on the characteristics and locations of individual artifacts. Similarly, a photograph of the same bronze tool could be tagged so that the computer would know not just where to display the photograph on the screen, but would also know what it was a photograph of, so that photographs of all bronze artifacts, for example, could be retrieved automatically.
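As a minimal sketch of the kind of tagging described above, the following Python fragment parses a small XML description of two finds and retrieves items and photographs by material. The element and attribute names (`item`, `material`, `location`, `photo`, and so on) and the sample finds are hypothetical, invented purely for illustration; they are not part of any agreed archaeological tagging scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagging scheme: element and attribute names, and the finds
# themselves, are invented for illustration only.
SAMPLE = """
<finds>
  <item id="A-101" kind="artifact">
    <material>bronze</material>
    <type>chisel</type>
    <location area="Area A" locus="Room 4"/>
    <description>Bronze chisel with a flared cutting edge.</description>
    <photo href="a101.jpg"/>
  </item>
  <item id="A-102" kind="artifact">
    <material>ceramic</material>
    <type>bowl</type>
    <location area="Area A" locus="Room 4"/>
    <description>Carinated bowl, red slip.</description>
  </item>
</finds>
"""

def find_by_material(root, material):
    """Return the ids of all items whose <material> matches."""
    return [
        item.get("id")
        for item in root.iter("item")
        if item.findtext("material") == material
    ]

def photos_of(root, material):
    """Return the photo file names attached to items of a given material."""
    return [
        photo.get("href")
        for item in root.iter("item")
        if item.findtext("material") == material
        for photo in item.findall("photo")
    ]

root = ET.fromstring(SAMPLE)
bronze_ids = find_by_material(root, "bronze")
bronze_photos = photos_of(root, "bronze")
```

The point of the sketch is only that the search works on what the data mean (material, location), not on how they are displayed.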
Archaeologists must do some work, however, before they can make use of the "searchability" features of the new XML format. XML itself does not specify how archaeological data should be organized and tagged for delivery over the Web. It simply provides a way for Web publishers to define their own tagging schemes suitable to the type of information that they are publishing (hence the name "Extensible Markup Language"). In order to take advantage of the opportunity to create fully searchable Web publications, archaeologists must collaborate to create a standard tagging scheme for archaeological data - an archaeological "markup language" which captures all of the information that they want to publish, and also captures all of the interrelationships among different pieces of information that they wish to record.
With the advent of XML the need to establish common standards for representing archaeological data is greater than ever. At the same time, however, XML creates an environment in which standards-setting efforts can be highly focused and efficient. Over the past few years the tremendous success of the World-Wide Web and the rapid development of related software and techniques have created optimal conditions for the development of data representation standards in many different information domains. It is timely, therefore, that the Institute of Archaeology of the University of California at Los Angeles has just announced plans to convene a "working group" to devise standards for the electronic publication of primary archaeological data. A variety of archaeological professionals and institutions, including the Oriental Institute, will be represented in this working group.
The task is daunting, however. Although archaeologists have used computers for years, there has been a decided lack of standardization to date. Archaeologists are notoriously individualistic, and every archaeologist understandably wishes to customize his or her database system for the project at hand. Unfortunately, the resulting chaos of incompatible file formats prevents easy electronic merging of detailed information from different projects, thus hindering computer-based archaeological research conducted on broad spatial and temporal scales. The situation is not improved simply by publishing idiosyncratically structured databases piecemeal on the Web without any modification, because there is no consistent framework within which they can be searched and analyzed. If the new XML format were used by archaeologists simply to encode their own favorite terminologies and recording systems, thus duplicating their existing database structures, the potential of Web publication would not be realized. Some kind of common data representation standard is needed.
On the other hand, almost all archaeologists would agree that no detailed prescriptive scheme adopted at the outset for recording the data from diverse excavation and survey projects can or should be enforced, even within a single geographical region. Every site is different and each investigator should be free to employ the terminology and the recording system that is best suited to his or her project. Furthermore, as archaeological methods develop, there is a danger that any standard which is adopted now could become obsolete and counterproductive in the future. The development of archaeological data standards has been hindered because it is not immediately obvious how any standardized format could be sufficiently flexible and open-ended and still provide the benefits of standardization, in terms of a consistent framework that would permit automatic searching and retrieval. Thus the legitimate requirements of individual projects appear to conflict with the widely shared ambition to combine many different archaeological databases for detailed multi-site comparison and analysis.
There is a way of resolving this conflict, however. Despite the inevitable variety of archaeological recording systems and terminology, there are basic features common to all archaeological data which permit a standardized format or "data model" and a correspondingly uniform and intuitive user interface - although at a more abstract level than has usually been considered. Moreover, the standardized data model which I am advocating does not prescribe the use of any particular terminology or recording system. The requisite level of standardization can be achieved by using a flexible "item-based" data model instead of the more rigidly structured "class-based" data models that have been common in archaeology. Class-based databases typically provide one data structure (usually represented as a two-dimensional table) for each class or subclass of archaeological observations - ceramic, lithic, faunal, botanical, architectural, stratigraphic, etc. In each class's table there is a fixed number of columns which predetermine the variables (e.g., "length," "weight," "color") that are available to describe the items belonging to that class. Each item (i.e., unit of observation) is therefore represented as a row in a table with a predefined structure (fig. 1). Each cell of the table, at the intersection of a row and a column, contains the value of a given variable for a given item. Rigidly structured databases of this sort employ what has been called a "strictly typed" data model. As applied to archaeology, this means that decisions about the typology of archaeological observations - how many classes of observations will be considered and how many and what kind of attributes each class will possess - are all "hard-coded" into the structure of the database from the beginning and cannot be changed very easily afterwards.
Conventional class-based systems are not well suited to representing the variety and complexity of archaeological data, as a number of researchers have pointed out. Unfortunately, strictly typed class-based data formats are nearly universal in archaeology today, not for any archaeological reason but largely because readily available database software tends to encourage this approach. To be sure, commercial software does not actually require a class-based data model, but in most business applications and in the standard working practice that has been derived from them, database tables tend to be equated with particular classes of data in the manner described above. Yet it is the rigidity of the prevailing class-based data model, in which a predetermined set of attributes is prescribed for each of a limited set of predetermined classes, which prevents the automated searching of information derived from multiple projects, each of which might employ a different typology for recording archaeological observations. This is because combining such databases within a uniform, searchable framework requires much more than just specifying equivalences or translations between the different terms used by different archaeologists. It requires mapping the entire class structure of each database onto the possibly quite different class structures of every other database that is being considered.
An item-based data model, by contrast, makes the automated combination of multiple databases much easier. In an item-based database the fundamental structural component is not the predefined class of items but rather the abstract archaeological "item" itself as a unit of observation with which any number of descriptive variables may be associated. A "class" is thus not a fixed structural component of the database but merely an ad hoc grouping of items based on a particular set of search criteria. The building blocks of the database are the individual items themselves, whose specific attributes the archaeologist defines by associating each item with a potentially unique set of descriptive variables and with any relevant images, documents, or other data (fig. 2). For this reason an item-based database can be easily adapted and extended as needed by the individual archaeologist, without special programming, by permitting him or her to add new variables and values to the pool of available attributes and to rename or delete these as necessary. Similarly, the description of any item can be changed by associating different variables with it without affecting any other items. Indeed, in an item-based database system, it is easy to assign multiple sets of variables to any item, representing distinct and even conflicting observations made by different persons or at different times. A full description of any item, no matter how atypical, can be achieved simply by creating appropriate variables and attaching them to it, thereby minimizing the need for ad hoc prose descriptions that are stored as unstandardized free-form notes (although such notes may also be associated with each item, of course).
In an item-based database, therefore, the information produced by an excavation or survey project is not represented as a collection of rigid tables corresponding to a limited set of classes but is represented instead as a large collection of independent data elements which correspond to the individual archaeological items to be described. Every separately registered find becomes an independent item in the database, whether it be a layer, feature, or smaller find. Each such item may then be described by a unique set of any number of variables together with their associated values for that item. New variables may be defined individually by the investigator at any time as new subtypes or unusual finds appear, without the need to restructure tables or create new tables.
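The item-based idea can be illustrated with a short Python sketch. This is not INFRA's internal design, which is not described here; the `Item` class, the variable names, and the sample finds are assumptions made for illustration. Each item carries its own open-ended set of variables, and a "class" is nothing more than the result of a search.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """One unit of observation with its own open-ended set of variables."""
    label: str
    variables: dict = field(default_factory=dict)   # variable name -> value
    notes: list = field(default_factory=list)       # optional free-form notes

# Invented sample items: a layer, a sherd, and a flint blade, each described
# by a different set of variables.
items = [
    Item("Layer 12", {"soil color": "dark brown", "compaction": "loose"}),
    Item("Sherd 7", {"ware": "red slip", "rim diameter (cm)": 14}),
    Item("Blade 3", {"material": "flint", "length (cm)": 6.2}),
]

# A new variable can be attached to a single item at any time; no table
# has to be restructured and no other item is affected.
items[2].variables["use-wear"] = "sickle gloss"

def select(items, predicate):
    """A 'class' is just the set of items matching some search criteria."""
    return [it.label for it in items if predicate(it)]

# Ad hoc class: every item that has at least one measurement in centimetres.
measured = select(items, lambda it: any("(cm)" in name for name in it.variables))
```

In a class-based design the same three finds would need three tables with fixed columns, and adding "use-wear" would mean altering the structure of one of them.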
An important advantage of the item-based data model is that a clear separation is maintained between the relatively primary attribution of descriptive variables to potentially unique individual items, on the one hand, versus the multitude of possible secondary and overlapping classifications of those items, defined according to investigators' changing interests and assumptions, on the other. This approach therefore respects the tremendous variability of archaeological data, because researchers might want to create hundreds if not thousands of overlapping classifications of the many items observed in any large excavation. In this way the item-based model takes into account the special characteristics of archaeological data. But most importantly from the point of view of electronic publication, databases from different projects can be combined quite easily in order to do a comprehensive search on the Web. All that is needed is to establish synonyms among the individual terms used in each project. There is no need to map the entire class structure of each database onto that of every other database.
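A rough Python sketch of combining two projects by term synonyms alone follows. The project data and the synonym table are invented for illustration, and a real system would need a richer notion of equivalence (for values as well as variable names) than this simple lookup.

```python
# Hypothetical term equivalences between two projects' vocabularies.
# Each project-specific variable name maps to one shared name.
SYNONYMS = {
    "fabric": "ware",
    "ware": "ware",
    "len": "length",
    "length": "length",
}

def normalize(variables):
    """Rewrite variable names to shared terms; unknown names pass through."""
    return {SYNONYMS.get(name, name): value for name, value in variables.items()}

# Invented sample data: two projects that record the same attribute
# under different names.
project_a = {"Sherd 7": {"ware": "red slip"}}
project_b = {"P-22": {"fabric": "red slip"}, "P-23": {"fabric": "coarse"}}

# Combine the two item collections; no table structures need to be mapped.
combined = {}
for project_name, project in (("A", project_a), ("B", project_b)):
    for label, variables in project.items():
        combined[(project_name, label)] = normalize(variables)

# One search now spans both projects.
red_slip = sorted(key for key, v in combined.items() if v.get("ware") == "red slip")
```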
In the absence of predefined classes or database tables, however, it is necessary to find some way to group and organize the many different items in an item-based archaeological database. This is done by organizing the individual items into a hierarchical "tree" that represents the spatial containment relationships among the various units of observation (fig. 3). Of course, spatial hierarchy is only one possible view of the relationships among archaeological items, but it is the most comprehensive and inclusive view, in the sense that every archaeological observation may be located at some place in a spatial hierarchy. In addition, because a tree structure is self-similar, having the same properties recursively at all levels, the spatial hierarchy of archaeological items is indefinitely extensible in both directions, macroscopically and microscopically. This means that a hierarchical item-based data model can easily accommodate data from multiple artifacts, sites, and regions on all spatial scales, and from both excavation and survey projects, using the same simple design of independently linked items-with-their-attributes.
Note that the concept of an archaeological "item" is completely generalizable here. A unit of observation encompassing an archaeological site as a whole would be represented as a single item with its own description in the form of associated variables, images, and documents. Indeed, an entire geographical region could be represented as a single item that has its own position in the hierarchy and spatially contains a number of archaeological sites, represented as lower-level "child" items. Each site in turn contains sub-sites (e.g., excavation areas), stratigraphic units (features and layers), small finds (artifacts and ecofacts), and artifact features or components, in a descending spatial hierarchy. Thus the basic spatial relationship - "this is found within that" - is represented simply and consistently at every level of detail, from the broadest region of interest to the smallest aspect of an individual find. Furthermore, the comprehensive spatial tree of independent archaeological items itself serves as the primary interface for entering and displaying information and is the principal means of navigating among the large number of items in a typical archaeological database. This kind of interface also has the advantage of being familiar to many computer users because of the use of hierarchical trees in Microsoft Windows and other operating systems for organizing computer files into folders and subfolders.
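The recursive "this is found within that" relationship might be sketched in Python as follows. The `Node` class and the sample region, site, and find names are hypothetical and serve only to show that the same structure applies at every level of the hierarchy.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An item in the spatial containment tree (hypothetical structure)."""
    label: str
    children: list = field(default_factory=list)

    def add(self, child):
        """Place `child` inside this item and return it for chaining."""
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, label) for this item and everything it contains."""
        yield depth, self.label
        for child in self.children:
            yield from child.walk(depth + 1)

    def path_to(self, label, trail=()):
        """Return the chain of containers leading to the named item, or None."""
        trail = trail + (self.label,)
        if self.label == label:
            return trail
        for child in self.children:
            found = child.path_to(label, trail)
            if found:
                return found
        return None

# Invented sample hierarchy: region > site > area > room > small find.
region = Node("Region")
site = region.add(Node("Site 1"))
area = site.add(Node("Area A"))
room = area.add(Node("Room 4"))
room.add(Node("Bronze chisel"))

# An indented outline, much like a folder tree in a file manager.
outline = ["  " * depth + label for depth, label in region.walk()]
path = region.path_to("Bronze chisel")
```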
An important point to remember about this hierarchical item-based data model is that because of its abstract structure it provides robust standardization in terms of a basic framework consisting of a tree of items with their attributes, but it does not force standardization in terms of specific content. It is left to the creator of each database to determine the arrangement and labeling of the items in the tree, the names of variables and values, and the association of items with variables, images, documents, and other information. What is more, the hierarchical item-based data model is especially suitable for archaeological purposes. First, the hierarchical tree structure has an obvious archaeological interpretation in terms of spatial containment and thus is intuitively grasped by any archaeologist. Secondly, the flexible item-with-its-attributes data structure can capture the idiosyncrasies of highly variable archaeological data in a way that class-based data models cannot. Finally, the open-ended extensibility of this data model facilitates the electronic publication of archaeological data by making it easy to combine data from multiple sources for search and retrieval purposes.
Electronic publication on the Web or elsewhere will be of limited value in archaeology unless its intended audience can easily view and analyze published data in full detail using visual interfaces and complex queries, with the goal of testing investigators' interpretations and combining data from disparate sources to permit more broadly based retrieval and analysis. There is a long history in archaeology of creating localized special-purpose databases in order to test specific hypotheses or construct particular models, but what is needed to enhance future research is a tool that will permit rapid, efficient, and open-ended "exploratory data analyses" on broader spatial and temporal scales. In this way patterns in the data may be detected that currently go unnoticed, and patterns that are found may be explored further with a speed and rigor hitherto impossible. The achievement of such benefits is what makes the adoption of a flexible yet standardized archaeological data model so desirable.
I contend that archaeological publication on the Web would be greatly enhanced by the adoption of a hierarchical item-based data model, but this data model must first be expressed in a Web-oriented format; i.e., in terms of a specific XML tagging scheme. I am currently refining the technical details of such a scheme as part of my proposal to a newly formed working group on archaeological data standards sponsored by UCLA. This XML scheme is closely based on existing Microsoft Windows software for archaeological data management that I have developed over the past several years using the hierarchical item-based data model. This Windows software is named "INFRA," which is an acronym for "Integrated Facility for Research in Archaeology." In addition to a primary tree diagram showing the spatial containment relationships among archaeological items (fig. 4), INFRA uses other diagrammatic interfaces to represent various kinds of inter-item relationship. These are shown in separate window "panes" or "frames" beside the tree pane.
Like spatial containment, the temporal sequence of archaeological items is readily represented in an item-based data model (fig. 5), although instead of linking items together into a hierarchical tree this involves linking items to form a stratigraphic sequence diagram of the kind developed in the 1970s by Edward Harris (the "Harris Matrix") and now used by many archaeologists.
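A stratigraphic sequence of this kind is, in effect, a directed graph of "earlier than" links. The Python sketch below assumes invented unit names and uses the standard library's `graphlib` module (Python 3.9 or later) to derive one ordering consistent with the recorded links; it illustrates the data structure only and does not reproduce how INFRA draws or stores its diagrams.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps to the set of units that must be EARLIER than it.
# Unit names and relationships are invented for illustration.
earlier_than = {
    "Floor 5": {"Fill 6"},
    "Wall 3": {"Floor 5"},
    "Collapse 2": {"Wall 3", "Floor 5"},
    "Topsoil 1": {"Collapse 2"},
}

# One valid earliest-to-latest ordering consistent with the recorded links.
sequence = list(TopologicalSorter(earlier_than).static_order())

def later_than(unit, links):
    """Return every unit that is stratigraphically later than `unit`."""
    result = set()
    frontier = [unit]
    while frontier:
        current = frontier.pop()
        for candidate, predecessors in links.items():
            if current in predecessors and candidate not in result:
                result.add(candidate)
                frontier.append(candidate)
    return result

after_floor = later_than("Floor 5", earlier_than)
```

If contradictory links were entered (a cycle), `TopologicalSorter` would raise `graphlib.CycleError`, which is a convenient check on the internal consistency of a recorded sequence.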
Still another kind of inter-item relationship is spatial adjacency or conjoinment, which is represented using an undirected "network" diagram that shows which items directly touch others (fig. 6). The containment tree, Harris Matrix, and network diagram are all, in mathematical terms, node-link "graphs" which represent various complementary views of the database, neatly encapsulating the extrinsic relationships among units of archaeological observation that are difficult to represent in conventional class-based data management systems.
INFRA can use its various graphical views of inter-item relationships in queries that determine which items belong to a given class; thus computer-aided selection of archaeological items may readily be extended to consider their extrinsic as well as intrinsic attributes. In this way, a given class of items can be retrieved according to easily defined but quite complex criteria because any combination of the variables used to represent an item's intrinsic attributes, together with the extrinsic spatial and temporal relationships represented in schematic diagrams, may be used in data retrieval. For example, one might wish to select all artifacts of a certain type which are found within a particular kind of architectural feature that also contains artifacts of a second type and that occurs stratigraphically after a certain kind of deposit. Similarly, "phase plans" of contemporaneous items can be generated automatically, reflecting current stratigraphic interpretations as these have been entered via a Harris-style sequence diagram. INFRA uses the sequence diagram to select automatically the items that are to be drawn together on a single plan or in the same color.
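The example query just described can be sketched in Python by combining an intrinsic attribute (an item's kind) with two extrinsic relationships (containment and stratigraphic order). All item names, kinds, and links below are invented, and the functions are a simplified illustration under those assumptions rather than INFRA's actual query mechanism.

```python
# Invented sample data: each item has an intrinsic "kind" and a container.
items = {
    "Room 4": {"kind": "room", "inside": None},
    "Room 9": {"kind": "room", "inside": None},
    "Ash 7":  {"kind": "ash layer", "inside": None},
    "Lamp 1": {"kind": "lamp", "inside": "Room 4"},
    "Coin 2": {"kind": "coin", "inside": "Room 4"},
    "Lamp 3": {"kind": "lamp", "inside": "Room 9"},
}

# Direct stratigraphic links: unit -> units immediately earlier than it.
earlier = {"Room 4": {"Ash 7"}, "Ash 7": {"Room 9"}}

def is_after(unit, other):
    """True if `unit` is later than `other`, following links transitively."""
    stack = list(earlier.get(unit, ()))
    seen = set()
    while stack:
        current = stack.pop()
        if current == other:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(earlier.get(current, ()))
    return False

def contents(container):
    """Names of the items found directly within `container`."""
    return [name for name, it in items.items() if it["inside"] == container]

def query(target_kind, feature_kind, companion_kind, deposit_kind):
    """Items of target_kind inside a feature_kind feature that also holds a
    companion_kind item and is later than some deposit_kind unit."""
    deposits = [n for n, it in items.items() if it["kind"] == deposit_kind]
    hits = []
    for feature, it in items.items():
        if it["kind"] != feature_kind:
            continue
        inside = contents(feature)
        has_companion = any(items[n]["kind"] == companion_kind for n in inside)
        after_deposit = any(is_after(feature, d) for d in deposits)
        if has_companion and after_deposit:
            hits += [n for n in inside if items[n]["kind"] == target_kind]
    return hits

# Lamps found in rooms that also contain a coin and postdate an ash layer.
lamps = query("lamp", "room", "coin", "ash layer")
```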
INFRA's query facility is a central component of the software because it is the vehicle by which the potentially huge assortment of information in an archaeological database may be filtered into named classes of items for display and analysis in a variety of forms. Each set of user-defined class criteria is given a meaningful name by the user and may be saved for repeated use, together with the list of items retrieved the last time a query was executed in order to find the items that match those criteria. The resulting set of items may then be used in the creation of customized reports, statistical graphs, tables, and composite plans which contain data pertaining only to that class of items. This approach provides maximum flexibility because classes are defined not as fixed tabular templates into which items must be inserted irrevocably in the course of primary data entry, but simply as dynamic groupings of items matching user-defined query criteria which may be created, named, saved, and then used at any time to retrieve groups of related items based on their intrinsic and extrinsic attributes.
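A named, saved query of this sort might look like the following Python sketch. The `SavedQuery` class and the sample items are hypothetical; the sketch shows only that class membership is computed from criteria when the query runs, rather than fixed at data entry.

```python
# Invented sample items with differing sets of variables.
items = {
    "Sherd 7": {"material": "ceramic", "ware": "red slip"},
    "Blade 3": {"material": "flint"},
    "Bowl 9":  {"material": "ceramic", "ware": "coarse"},
}

class SavedQuery:
    """A named set of criteria plus the result of its last execution."""

    def __init__(self, name, criteria):
        self.name = name
        self.criteria = criteria      # variable name -> required value
        self.last_result = []         # items retrieved by the last run

    def run(self, items):
        """Re-evaluate the criteria and remember the matching items."""
        self.last_result = [
            label for label, variables in items.items()
            if all(variables.get(k) == v for k, v in self.criteria.items())
        ]
        return self.last_result

ceramics = SavedQuery("All ceramics", {"material": "ceramic"})
first = list(ceramics.run(items))

# Re-running after new data arrives updates the class membership.
items["Jar 12"] = {"material": "ceramic", "ware": "red slip"}
second = ceramics.run(items)
```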
It is worth emphasizing again that INFRA and the item-based data model that it implements prescribe no rigid format or specific terminology for recording archaeological information. The archaeologist can create and label individual items, variables, and values as needed, in a manner that is appropriate for the project at hand, and can associate items with one another in a variety of spatial and temporal (or "stratigraphic") configurations. Initially, this approach demands a higher degree of conceptual abstraction, yet it actually corresponds better to observed archaeological entities in the real world, which do not manifest themselves in the form of tidy classes of material but as idiosyncratic individual items. Moreover, the abstraction entailed in working with a few generic concepts such as "item" and "variable," and with a few graphically represented spatial and temporal relationships, permits both flexible customization from the archaeologist's perspective and rigorous standardization in terms of the underlying data structure. Most importantly, because of this standardization the task of combining databases from different projects is quite easily accomplished by grafting in the spatial tree of one database as a new branch of the spatial tree of a second database, and then defining equivalences between the terms used in the two original databases. The archaeologist is not forced to turn to a programmer to map one rigid and idiosyncratic table structure onto another because the comparison between different databases is done at a more basic level, between individual items and their attributes. The end result of such a combination is a comprehensive view of the constituent databases which preserves the standard underlying structure of a simple item-based hierarchy but which also reflects the naming conventions and recording systems of the individual projects whose data are incorporated within it.
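The grafting operation can be illustrated with a brief Python sketch in which each database is a nested tree and term equivalences are declared separately. The tree layout, the labels, and the `equivalences` table are assumptions made for illustration. Note that each project's original variable names are left untouched in the combined tree; translation happens only at search time.

```python
# Trees as nested dicts: label -> {"vars": {...}, "children": {...}}.
# All names and terms are invented for illustration.
def node(variables=None, children=None):
    return {"vars": variables or {}, "children": children or {}}

db_one = {"Region": node(children={
    "Site 1": node(children={"Sherd 7": node({"ware": "red slip"})}),
})}

db_two = {"Site 2": node(children={
    "P-22": node({"fabric": "red slip"}),
})}

# Term equivalences declared once, mapping project terms to shared ones.
equivalences = {"fabric": "ware"}

def graft(target_children, branch):
    """Attach every top-level item of `branch` under the target node."""
    for label, subtree in branch.items():
        if label in target_children:
            raise ValueError(f"label already present: {label}")
        target_children[label] = subtree

def search(tree, variable, value, trail=()):
    """Yield the path of every item whose (translated) variable equals value."""
    for label, n in tree.items():
        here = trail + (label,)
        translated = {equivalences.get(k, k): v for k, v in n["vars"].items()}
        if translated.get(variable) == value:
            yield here
        yield from search(n["children"], variable, value, here)

# Graft the second database in as a new branch of the first, then search both.
graft(db_one["Region"]["children"], db_two)
matches = list(search(db_one, "ware", "red slip"))
```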
INFRA is only one implementation of the item-based data model advocated here, although it demonstrates what I believe is the best approach to representing both the intrinsic attributes and the extrinsic interrelationships of archaeological items in a straightforward and standardized fashion. In its specific features it also demonstrates the benefits of an item-based design for integrating very tightly, within a single software application, an array of powerful yet easy-to-use functions that have been tailored for archaeological use. In addition to the diagrammatic interfaces mentioned above, these features include a "map view" for drawing archaeological plans and sections, an "image view" for displaying scanned photographs and other images and linking them to database items, a "document view" for composing written summaries and interpretations of primary data with hyperlinks to the actual data, a "table view" for generating charts and tables of data, and a "statistics" feature to facilitate quantitative analysis of archaeological data.
Again, however, the most important practical benefits of the software design underlying both INFRA and the related XML tagging scheme now being developed will be fully realized only when a hierarchical item-based data model becomes the basis of a mode of electronic publication in which data from disparate projects may be easily combined. I have argued that electronic publication of archaeological data, and Web publication in particular, ought to facilitate comprehensive retrieval and analysis together with universal access, and this can be accomplished by using an item-based structure which will permit archaeological databases to be viewed or queried in any combination, simultaneously drawing on a variety of different Web sites, while maintaining a consistent user interface. This is possible because each published database is delivered as a subtree that can be dynamically integrated into an overall spatial hierarchy which may then be viewed as a seamless whole by the archaeologist. With the advent of XML such forms of electronic publication are now feasible, so the time is ripe to begin exploiting this new tool for the benefit of archaeological research.
David Schloen earned a bachelor's degree in computer science before going into archaeology. He is currently the Associate Director of the Leon Levy Expedition to Ashkelon, Israel. He is using this large-scale project to test his ideas about electronic publication of archaeological data.
Revised: April 28, 2011