“Query potential of the digital surrogate” is quite a mouthful, but the phrase usefully draws attention to two aspects of digital objects that scholars encounter in their work. A digital object will either be ‘born digital’, like this blog entry, or it will have originated in some other medium, in which case its digital version is a surrogate. For scholars who work with primary documents from before the late twentieth century, the central question about digital versions is how to create surrogates that maximize the distinct affordances or query potential of the digital medium while minimizing the sacrifices that come with any translation from one medium to another.
Concatenability is the most salient advantage of the digital surrogate. Consider the extraordinary power of the ‘cat’ command in the Unix operating system. The command
cat text1 text2 … textn
will “catenate” the content of n texts and turn them into a single “text.” Now you can treat this collection of texts as if they were one.
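What "treating many texts as one" buys you can be shown in a few lines of Python. This is only a minimal sketch: the three miniature "texts" and the helper names `catenate` and `find_all` are invented for illustration. The point is that a single search runs across the joined text, while a small table of offsets remembers which original text each hit came from.

```python
# Hypothetical miniature "texts"; in practice these would be read from files.
texts = {
    "text1": "Of arms and the man I sing.",
    "text2": "Sing, goddess, the wrath of Achilles.",
    "text3": "Of man's first disobedience I sing.",
}

def catenate(texts):
    """Join all texts into one string and record where each one starts."""
    pieces, offsets, position = [], [], 0
    for name, body in texts.items():
        offsets.append((position, name))
        pieces.append(body)
        position += len(body) + 1  # +1 for the newline separator
    return "\n".join(pieces), offsets

def find_all(big_book, offsets, word):
    """Return the source text of every occurrence of `word` in the joined text."""
    hits, start = [], big_book.find(word)
    while start != -1:
        # The source is the last text whose starting offset is <= the hit.
        source = [name for begin, name in offsets if begin <= start][-1]
        hits.append(source)
        start = big_book.find(word, start + 1)
    return hits

big_book, offsets = catenate(texts)
print(find_all(big_book.lower(), offsets, "sing"))
```

One query over the joined text reports hits in all three sources, which is the digital counterpart of looking something up in a library rather than in a single book.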
If you think of ‘cat’ as an abbreviation of ‘catalogue’, the command simply describes what libraries have done for centuries. When you catalogue books and turn them into a “library” you create a single entity in which the different books are the ordered pages of one Big Book.
The Big Book created by “catenating” digital files supports forms of “reading” that are not constrained by what human eyes and hands can see or hold. They are constrained by the power of the algorithms that machines can perform. As an IBM executive once remarked, machines are fast but dumb, while people are smart but slow. How do you put the dumb but fast machine in the service of the smart but slow human reader?
From a scholar’s perspective this remains an uninteresting question as long as you think of the machine as just another device for delivering individual “books” that are then read in the conventional way. Scholars certainly have a keen interest in having more books delivered faster, cheaper, and without hassle, but they don’t care how it is done as long as it is done.
Things become more interesting if you ask what you can do with texts in a digital Big Book that you could not do with them in the Big Book of a traditional library. That question in turn is complicated by the question of what you must “do to” those texts before they release their query potential as a single and digital Big Book.
Which brings us to the Text Creation Partnership. Project Bamboo will work with some 2,000 18th-century texts (ECCO-TCP) that have recently been released into the public domain, but in the long run the results of that work may matter even more to the much larger EEBO-TCP archive to which they can be applied. The EEBO-TCP archive currently consists of some 30,000 texts published before 1700 and is likely to double in size by 2015. It will by then include at least one version of just about every book published before 1700. Starting in 2015, all these texts will move into the public domain, amounting to virtually complete coverage of the first two centuries of print culture in English.
The most significant added value of the TCP archive will be the ability to treat 70,000 texts as if they were one Book of Early Modern English, but the realization of that value will depend critically on providing the digital equivalents of the complex indexes that have for centuries enhanced the usability of scholarly books. At the moment, the delivery of TCP texts to readers depends on some very generic forms of indexing in which the machine keeps track of the occurrences of the various spellings or character strings. Such indexing works well enough where readers can assume that the same word is always spelled in the same way, a highly problematic assumption for texts before 1700. But even where it works, it works in very rough ways. A human reader looking at a printed page can tell right away whether a word occurs in prose or verse; whether it sits in a footnote, heading, or other form of paratext; whether it belongs to another language; whether it needs to be tacitly corrected because it is misspelled or represents an orthographic variant; and so forth.
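The spelling problem is easy to demonstrate. The sketch below builds a plain character-string index and then a second index that first maps variant spellings to a modern headword. Everything in it is an assumption made for illustration: the sample lines are invented in period-style spelling, and the tiny `VARIANTS` table is hypothetical. It does not describe how TCP texts are actually indexed.

```python
from collections import defaultdict

# Hypothetical variant table: early modern spelling -> modern headword.
VARIANTS = {"loue": "love", "vertue": "virtue", "vertu": "virtue"}

# Invented sample lines, not taken from any TCP transcription.
lines = [
    "Loue is not loue which alters",
    "vertue is her own reward",
    "love and virtue",
]

def build_index(lines, normalize=False):
    """Map each word (or its normalized headword) to the line numbers it occurs in."""
    index = defaultdict(list)
    for number, line in enumerate(lines):
        for token in line.lower().split():
            key = VARIANTS.get(token, token) if normalize else token
            index[key].append(number)
    return index

plain = build_index(lines)
regular = build_index(lines, normalize=True)
print(plain["love"])    # [2]
print(regular["love"])  # [0, 0, 2]
```

A search for "love" in the plain index finds only the line with the modern spelling. The normalized index also finds both occurrences of "loue" in the first line.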
Sophisticated indexing of a Book of Early Modern English will consist of techniques that let machines identify and record explicitly the many differences that human readers tacitly recognize. Such indexing is well within the range of current technology, and many of the conditions for such indexes are already realized in the transcriptions of the TCP texts.
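To make "record explicitly" concrete, here is a minimal sketch of an index that stores, for each word, the kind of context it occurs in. It assumes a drastically simplified, hypothetical markup with element names loosely in the style of TEI (`head`, `p`, `l`, `note`, `foreign`). The actual TCP encoding is much richer and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Invented sample in a simplified, hypothetical markup.
SAMPLE = """<text>
  <head>Of Vertue</head>
  <p>Vertue is praised by all <foreign>sed</foreign> followed by few.</p>
  <l>Sweet vertue never dies</l>
  <note>See the former chapter.</note>
</text>"""

# Hypothetical mapping from element names to the contexts a reader would recognize.
LABELS = {"head": "heading", "p": "prose", "l": "verse",
          "note": "note", "foreign": "foreign"}

def index_with_context(xml_source):
    """Return (word, context) pairs, with context taken from the enclosing element."""
    records = []

    def add(text, context):
        for token in (text or "").split():
            records.append((token.strip(".,").lower(), context))

    def walk(element, context):
        context = LABELS.get(element.tag, context)
        add(element.text, context)
        for child in element:
            walk(child, context)
            add(child.tail, context)  # text after a child belongs to the parent

    walk(ET.fromstring(xml_source), "unknown")
    return records

records = index_with_context(SAMPLE)
print([context for word, context in records if word == "vertue"])
```

The same character string, "vertue", is recorded once as a heading, once as prose, and once as verse, and the Latin word is marked as foreign. These are the distinctions a human reader makes tacitly when looking at the page.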
Much of Project Bamboo can be described as an effort to enhance the query potential of primary documents through complex indexes and user-friendly search tools. Indexes and search tools of this type are not tied to particular data sets. What is special about the EEBO-TCP texts is their claim to coverage. When the project is completed, it will be unrivaled in terms of the size of the data for which it will offer virtually complete coverage. There will be many archives that are larger, and there are archives that are more complete (e.g. Old English). But where else will we find virtually complete coverage of the printed record of a culture over more than two centuries?
Fast forward five years to a world in which the EEBO-TCP texts are in the public domain and Project Bamboo has contributed to an environment in which that archive is surrounded by layers of complex but easily searched indexes. Those indexes will allow scholarly minds from all walks of life to traverse genres and generations of Early Modern writing and get query results (not quite the same as answers to questions) within seconds or minutes (and occasionally hours). It will take considerable labor and ingenuity to build the indexes and tools, but if that work is done well it will be a worthwhile achievement: it will put within reach of ambitious minds what was previously impracticable, and it will observe the fourth of Ranganathan’s lapidary and charming laws of library science, “save the time of the reader.”
Martin Mueller is a professor of English at Northwestern University.