By Gene Gragg, Professor of Near Eastern Languages
The Oriental Institute
The University of Chicago

(This article originally appeared in The Oriental Institute News and Notes, No. 149, Spring 1996, and is made available electronically with the permission of the editor.)

The word "etymology" comes wrapped in musty, bookish connotations. It brings up memories of the initial section of lexical entries, often in smaller print, encountered while browsing around in older, more compendious dictionaries. In these somewhat detached sections one can pick up unexpected, and sometimes delightful, but somehow not very practical bits of anecdotal information: for example, that the (native) English word "dough" and the (ultimately Latin) loan-word "fiction" are historically related through regular developments, which took place independently in Germanic and Italic, from the same ancestral Indo-European root reconstructed as *dheigh- "knead, fashion".

Approaching etymology from this angle, it is easy to lose sight of, or never even be aware of, how much depends on it: a) the establishment of a historical (linguists often use the biology-influenced term "genetic") relationship among a group of languages, that is, the fact that they are descendants of the language of an earlier single parent speech community; b) the reconstruction of this parent ("proto-") language; and c) the working out of the historical ("evolutionary") steps whereby the parent language became differentiated into the various daughter languages. All of this depends crucially and centrally on the ability of the historical linguist to establish sets of etymologically related words ("cognate sets") within the language family, and to work out regular phonological and morphological correspondences within and between these sets. It is this, and only this, process that entitles a linguist to assert that the languages in question are indeed genetically related, and that the resemblances are not simply the result of contact or convergence between independent speech communities. Thus the first step towards being able to draw that historically and socio-culturally important conclusion is the establishment of a sufficiently large set of related lexical items, in other words an etymological dictionary or database.

To continue the illustration with "dough": we can (1) establish large numbers of equations such as English dough = German Teig, English deed = German Tat, English deep = German tief, heap = Haufe, hip = Hüfte (adding of course cognate items in Dutch, Scandinavian, Gothic, and older periods of English and German); and (2) observe regular phoneme correspondences such as English d = German t (in the first three items) and English p = German f (in the last three). All this, in conjunction with many other observations both linguistic and historical-archaeological, enables historical linguists to state with a certain amount of confidence:

-that there was a (more or less) unitary proto-Germanic speech community somewhere in north-central Europe, probably sometime late in the first millennium B.C.

-that all attested Germanic languages are developments of this proto-speech community

-that a fair amount of information can be recovered about what this language was like (for example that the five partial cognate sets cited above are reflexes of proto-Germanic lexical items which, according to one reconstruction, may have been something like: *daigo-z "dough," *daedi-z "deed," *deupo-z "deep," *haupo-z "heap," *hupi-z "hip").
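The bookkeeping behind step (2) above can be sketched in a few lines of Python. This is only an illustration, not the procedure of any actual project: the five English/German pairs are the ones quoted in the text, but the hand-annotated segment pairs, the data layout, and the function name `tally_correspondences` are assumptions made for the sketch. Real comparative work aligns phonemes rather than spelling and draws on far more data.

```python
from collections import Counter

# Cognate pairs quoted in the text. Each is annotated by hand with the
# English/German segment pairs that the text singles out; this annotation
# format is an assumption of the sketch, not a standard.
COGNATES = [
    ("dough", "Teig",  [("d", "t")]),
    ("deed",  "Tat",   [("d", "t")]),
    ("deep",  "tief",  [("d", "t"), ("p", "f")]),
    ("heap",  "Haufe", [("p", "f")]),
    ("hip",   "Hüfte", [("p", "f")]),
]


def tally_correspondences(cognates):
    """Count each segment correspondence and record which pairs support it."""
    counts = Counter()
    support = {}
    for english, german, segments in cognates:
        for pair in segments:
            counts[pair] += 1
            support.setdefault(pair, []).append((english, german))
    return counts, support


counts, support = tally_correspondences(COGNATES)
for (eng, ger), n in counts.most_common():
    words = ", ".join(f"{e} = {g}" for e, g in support[(eng, ger)])
    print(f"English {eng} = German {ger}: {n} cognate sets ({words})")
```

A correspondence that recurs across many independent cognate sets is what counts as "regular"; one that turns up only once carries little weight.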

The reconstruction of proto-Germanic, and the relating of Germanic along with Celtic, Italic, Greek, Albanian, Armenian, Slavic, Iranian, and Indic to a superfamily called Indo-European, was one of the great intellectual achievements of the nineteenth century, and one that attracted some of its greatest minds. Building on this magnificent foundation, work in this area still goes on, with newly discovered languages being added (for example, Hittite), and new discoveries being made concerning the process of differentiation and diffusion of Indo-European, and the date and location of the ancestral speech community.

Around the same time that they were discovering Indo-European, scholars were becoming aware of the existence of other major families like Semitic (uniting, among others, Akkadian, Aramaic, Hebrew, Ugaritic, Arabic, South Arabian, and Ethiopic). Progress here however has been much less dramatic. In part because new languages are continually being discovered and added to the list (e.g., Eblaite), and because fundamental research tools in the individual branches (e.g., the Chicago Assyrian Dictionary) are still being compiled, a real etymological dictionary of Semitic still does not exist. To make matters worse, evidence has been accumulating that Semitic is not an isolated family, but is itself part of a superfamily, probably older than Indo-European, which stretched over large parts of Northern and Eastern Africa and Western Asia. This family, sometimes still called "Hamito-Semitic," but now more often "Afroasiatic" or "Afrasian," includes, besides Semitic: Egyptian, Berber, Cushitic (a heterogeneous group of dozens of languages, including Somali, centered around the Horn of Africa), Omotic (a large group of languages in Southwest Ethiopia), and Chadic (more than a hundred languages, including Hausa, spoken over a large sub-Saharan area centered around Lake Chad). Relationships are still being established within the last four groups, many individual languages are very poorly known, and new information is coming in on an almost daily basis. Clearly we are on the verge (or over the edge) of information overload. There are more pieces of information around and more heterogeneous and even contradictory hypotheses about their relationships than anyone can easily keep track of.

Thus it is becoming harder and harder to draw together the material for potential cognate sets and sound correspondences, as well as relevant textual, historical, and archaeological detail, which will make possible, first, the firm establishment of Afroasiatic as a language family, and then the drawing of some reasonable hypotheses about its nature, its place and time of origin, and its differentiation and diffusion.

To help in the process of systematization of what is becoming an increasingly amorphous heap of unassimilated information, the Oriental Institute is sponsoring a project that will draw on the two closely related and developing, not to say exploding, technologies which are being harnessed in many different contexts to stay on top of a rising flood of information: electronic data processing and, courtesy of the Internet and the World-Wide Web, data communication. We are currently setting up the Afroasiatic Index, a major source of historical linguistic information. It should permit access to the most reliable current information (including alternate and mutually incompatible hypotheses) about family-level and super-family-level cognate sets, correspondence sets, sound changes, morphological correspondences, and relevant bibliography. Of its major subparts, the Semitic Index, the Egyptian Index, the Cushitic Index, and the Omotic Index can be handled within the Oriental Institute or through contacts with whom we already work. For the Berber Index and the Chadic Index, we are currently working on contacting extramural collaborators or outsourcing the work.

A precursor of the Cushitic Index, and something of a pilot for the whole project, has been the Cushlex project, initiated in 1987 with the help of a National Science Foundation Grant. The object of that project was to explore the possibility of using standard relational database file formats and off-the-shelf database managing software to create and maintain an etymological database (cognate sets, correspondence sets, sound changes, bibliography) for Cushitic and Omotic. Inevitably cognates were noted between these languages and the other major branches of Afroasiatic, so that the project early on acquired a certain Afroasiatic dimension. Indeed, as has been noted by other investigators, Cushitic, with its major subfamilies of Bedja, Agaw, East Cushitic, and South Cushitic, is such a heterogeneous group that the question seriously arises whether it is really a separate "family" at all, or just a collection of Afroasiatic language families which through geographic proximity on and around the Horn of Africa stayed linguistically closer to one another than more widely distributed sister families (perhaps thereby making this area a good candidate for the "home" of Afroasiatic?). The database, implemented in one of the commercial DBMS (database managing software) packages, has been available from the Oriental Institute in a preliminary form since 1994. It is designed to run on a single PC, and data and programs have been distributed to interested users in diskette format (sent by U.S. mail or by electronic ftp [file transfer protocol] on the Internet).

Figure 1: Information modules in database.

Figure 2: Links between principal modules.

The database makes available a complex network of information involving rudimentary dictionaries of the languages covered, the organization of these lexical items into cognate sets, the analysis of the cognate sets into sets of corresponding phonemes, and the formulation of regular sound changes on the basis of these correspondence phoneme sets. Figure 1 gives a sample of the kind of information contained in the database, using data reported by Christopher Ehret from three Southern Cushitic languages (Iraqw, Alagwa, and Burungi, all spoken at the extreme southern limit of Cushitic expansion, in Northern Tanzania near the foot of Mt. Kilimanjaro). The hierarchy of relations among the information modules is shown in figure 2. Figure 3 shows how a set of display windows in the Cushlex application has been set up on a screen to display these complex interrelationships. The "Sourcelist" window displays a list of cognate sets chosen on some basis or other; the "Cognate Sets" window shows all the members of the currently highlighted cognate set. Clicking on the "Cognate Sets" window shows which correspondence sets, if any, have been related to that cognate set. Through the "Corrsets" window, one can display directly any correspondence set in the system. Clicking on an individual correspondence set shows through "CorrCog" which cognate sets support that set, and in "RuleCorr" which rules are implied by it. Clicking on a rule shows in "RuleCorr" which correspondence sets are related to it.
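The linked modules just described map naturally onto relational tables. The following sketch uses Python's standard sqlite3 module and is a guess at one possible layout, not the actual Cushlex schema. The table names, column names, and helper functions are all hypothetical, although `corr_cog` and `rule_corr` are named after the "CorrCog" and "RuleCorr" windows. The rows reuse the English/German examples from earlier in this article rather than any Cushitic data, and the two rule statements are illustrative only.

```python
import sqlite3

# Hypothetical schema: one table per information module, plus two link
# tables named after the "CorrCog" and "RuleCorr" windows.
SCHEMA = """
CREATE TABLE cognate_set (id INTEGER PRIMARY KEY, gloss TEXT);
CREATE TABLE lexeme (id INTEGER PRIMARY KEY, language TEXT, form TEXT,
                     cognate_set_id INTEGER REFERENCES cognate_set(id));
CREATE TABLE corr_set (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE sound_rule (id INTEGER PRIMARY KEY, statement TEXT);
CREATE TABLE corr_cog (corr_set_id INTEGER REFERENCES corr_set(id),
                       cognate_set_id INTEGER REFERENCES cognate_set(id));
CREATE TABLE rule_corr (rule_id INTEGER REFERENCES sound_rule(id),
                        corr_set_id INTEGER REFERENCES corr_set(id));
"""


def build_demo_db():
    """Create an in-memory database filled with the Germanic examples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    glosses = ["dough", "deed", "deep", "heap", "hip"]
    german = ["Teig", "Tat", "tief", "Haufe", "Hüfte"]
    conn.executemany("INSERT INTO cognate_set VALUES (?, ?)",
                     list(enumerate(glosses, start=1)))
    for set_id, (eng, ger) in enumerate(zip(glosses, german), start=1):
        conn.executemany(
            "INSERT INTO lexeme (language, form, cognate_set_id) VALUES (?, ?, ?)",
            [("English", eng, set_id), ("German", ger, set_id)])
    conn.executemany("INSERT INTO corr_set VALUES (?, ?)",
                     [(1, "English d = German t"), (2, "English p = German f")])
    conn.executemany("INSERT INTO corr_cog VALUES (?, ?)",
                     [(1, 1), (1, 2), (1, 3), (2, 3), (2, 4), (2, 5)])
    # Illustrative rule statements only.
    conn.executemany("INSERT INTO sound_rule VALUES (?, ?)",
                     [(1, "proto-Germanic *d > German t"),
                      (2, "proto-Germanic *p > German f")])
    conn.executemany("INSERT INTO rule_corr VALUES (?, ?)", [(1, 1), (2, 2)])
    return conn


def members_of(conn, cognate_set_id):
    """'Cognate Sets' view: every lexeme in one cognate set."""
    return conn.execute(
        "SELECT language, form FROM lexeme WHERE cognate_set_id = ? ORDER BY id",
        (cognate_set_id,)).fetchall()


def cognate_sets_supporting(conn, corr_set_id):
    """'CorrCog' view: cognate sets that support a correspondence set."""
    rows = conn.execute(
        "SELECT c.gloss FROM corr_cog AS x "
        "JOIN cognate_set AS c ON c.id = x.cognate_set_id "
        "WHERE x.corr_set_id = ? ORDER BY c.id", (corr_set_id,))
    return [gloss for (gloss,) in rows]


def rules_implied_by(conn, corr_set_id):
    """'RuleCorr' view: rules linked to a correspondence set."""
    rows = conn.execute(
        "SELECT r.statement FROM rule_corr AS x "
        "JOIN sound_rule AS r ON r.id = x.rule_id "
        "WHERE x.corr_set_id = ? ORDER BY r.id", (corr_set_id,))
    return [statement for (statement,) in rows]


conn = build_demo_db()
print(members_of(conn, 3))
print(cognate_sets_supporting(conn, 2))
print(rules_implied_by(conn, 1))
```

Keeping the links in separate tables lets one cognate set support several correspondence sets, and one correspondence set bear on several rules, which is the kind of many-to-many navigation the windows provide.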

Figure 3: Screen display of file relations.

Useful as it is, major problems have become apparent with the Cushlex approach. These have included difficulties involved in keeping the distributed data up to date, and adapting the interface to a wide variety of incompatible platforms (not just Macintosh versus PC, but even problematic adaptations of the interface to different specifications of monitors within the PC domain). None of these problems are insuperable, but they definitely do demand much more low-level computer involvement than is feasible for a project that intends to be more an information provider than a software provider. The Cushlex approach also involves too much investment in installation time and valuable hard disk space to be practical for all but a few dedicated users. At the very instant that these problems began to endanger the success of the project, though, a remarkable new tool came into the picture: the World-Wide Web. The World-Wide Web transports, almost instantaneously, our information to the nether reaches of the globe, and relieves us of the burden of creating our own visual interface. It is true that the current limitations of even the most advanced web browsers impose some limits on format, and force some compromise in character representation. We cannot yet reproduce, for example, the exact screen of interrelated windows indicated in figure 3. But all substantive information and links between modules of information can be represented, and as the web itself evolves, it will be possible to upgrade formats and character-set inventories.

At present the Afroasiatic Index web page is under construction, but open, and accessible from the Oriental Institute Home Page. A prototype of the current interface with a complete set of data can be seen in the "Semitic Index" section, which now also integrates a module on morphological information. The Cushlex material will be transferred to Web format in the course of the Spring Quarter, even as work progresses on other fronts.

Pardon our dust, but please do drop in on us and look around; we would appreciate reactions, comments, and suggestions.

In addition to research and teaching in the peripheral languages of the Ancient Near East, Gene Gragg has long been occupied with the Semitic and Cushitic languages of Ethiopia. He did lexical research in Ethiopia and has published a dictionary of the Cushitic language Oromo. Computational (and Northwest Semitic) expertise for the Afroasiatic Index is being provided by Richard Goerwitz, Research Associate and Lecturer in the Department of Near Eastern Languages and Civilizations, and recent Ph.D. in that department.

