Spring 2012 trainings begin January 26

January 25th, 2012 by Research Hub News » Announcements

Research Hub presents two series of drop-in trainings this semester, Getting Started and Advanced Topics. Each series is offered once a month; the first Getting Started session meets this Thursday, January 26. For details, visit our Training page.

Research Hub tops 1000

January 22nd, 2012 by Research Hub News » Announcements

As of this past week, over 1000 different people have used the Research Hub. Amazing!

We know from talking with a number of you that the reason is more than just curiosity. Research Hub answers a real and long-unmet need. Please help us continue to improve the service: let us know how you use it, what works well for you, and in what ways you wish it would be better. Take our survey, post your feedback to our discussion forums, or drop us a line. Most of all, please accept our thanks!

Project Bamboo members present at Annual MLA Convention

January 5th, 2012 by Emma Millon

Several Project Bamboo members are currently in Seattle, WA at the 2012 Modern Language Association (MLA) Convention. Those attending include Neil Fraistat (University of Maryland), who is participating in a panel discussion on “#alt-ac: The Future of ‘Alternative Academic’ Careers”; Harriett Green (University of Illinois, Urbana-Champaign), whose paper “Collaborative Economies: Tools and Strategies for Scholars and Libraries” highlights Project Bamboo and the study she is conducting on scholars and digital collections; and Quinn Dombrowski (University of Chicago), who is representing Bamboo DiRT. Neil and Quinn also participated in a day-long DHCommons pre-conference workshop. Say hello to Neil (@fraistat), Harriett (@greenharr), and Quinn (@quinnanya) at MLA, and follow their conversations on Twitter.

Holiday support schedule

December 21st, 2011 by Research Hub News » Announcements

A good but long semester, and now a welcome break.

Research Hub will be available throughout curtailment, but support will be limited. Our support team will be monitoring the service and will respond if anything occurs to prevent people from using Research Hub. However, we will not be looking at or answering trouble tickets submitted to hub@berkeley.edu until the new year. Please visit our information site for helpful documents to answer your questions.

Enjoy your break! See you on January 2.

Announcing the Release of CollectionSpace v2.0

December 19th, 2011 by Angela Spinazze

The CollectionSpace team is pleased to announce the release of CollectionSpace version 2.0. This release is the culmination of a year-long team effort to improve functionality, lessen the complexity of creating new functionality, and provide greater support for implementers of stand-alone installations and their in-house development staff.

New user-facing functionality includes roles and permissions enforcement, media handling, hierarchical vocabularies, import/export, reporting, advanced search, and structured dates. A few of the highlights are below.

Please take our survey

December 8th, 2011 by Research Hub News » Announcements

First impressions are lasting? Please complete the Research Hub user survey. Let us know where we stand and help steer our priorities for further development. It’ll only take a few minutes. We’ll be grateful to you.

(When you’re done with the survey, you can use the back button on your browser to return here.)

Scholarly Services on the Bamboo Services Platform

December 7th, 2011 by Steve Masover

Project Bamboo is working to model existing digital humanities tools as web services that can be accessed from a server that adds value to the tools by supporting basic functionality required across many tools, such as authentication, authorization, and the ability to store the results of long-running processes for later retrieval. We’re calling them Scholarly Services to differentiate them from basic functionality that supports tools of more direct interest to humanist scholars.

Providing ‘single-source’ support for a wide range of tools eliminates the need for each toolmaker to ‘reinvent’ these basic functions over and over again. Our goal is to enable scholars to access a broad range of tools without having to install and run them on their own machines or in their own universities’ data centers, while empowering developers to contribute to a shared body of stable curatorial, analytic, semantic, collections interoperability, and trust services that are readily accessible to humanists.

We’re calling the server that hosts these services the “Bamboo Services Platform” (BSP). It serves as an integration broker – the technology that mediates interaction between independent tools, environments, and content repositories that participate in the Project Bamboo Ecosystem. These include:

  • Services developed (and in some cases run) by other, often long-standing humanities projects; the Morphology and Syntactic Annotation services described below fall into this category.
  • General-purpose applications, such as the research environments that we’ve been calling “Work Spaces,” being built as adaptations of best-of-breed Open Source platforms for managing and analyzing digital content.
  • Collections of content owned or hosted by a diverse range of repositories, such as HathiTrust and the Perseus Digital Library.

Two Scholarly Services are currently running on the BSP in an “alpha” stage of development. These are the Morphology and Syntactic Annotation services developed by Tufts University under the leadership of Greg Crane.

The Syntactic Annotation service provides a ‘single-source’ API from which scholars can retrieve syntactic annotations from a range of annotation repositories in multiple formats. A scholar may obtain a template for annotating a document or text fragment that she provides.

The Morphology service is a ‘single-source’ API to multiple morphological analysis engines. Individual words, blocks of XML-encoded text, or documents from a variety of repositories can be analyzed.

Detailed examples of how each of these services works in its current “alpha” state can be found in a late-September blog post on Project Bamboo’s Tech Wiki, Alpha Releases of Morphology and Syntactic Annotation Services.

Additional Scholarly Services on Project Bamboo’s roadmap for the current grant period include concordance, collocation table, and frequency table services that draw on Philologic indexing and analysis; as well as a Places-Texts service that will identify place names in texts and provide geolocation metadata about them, using a variety of geoparsers and gazetteers (see Eric Kansa’s June 1 blog on this site, Places in Texts: Illustrating a Prospective Project Bamboo Service).
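To make the terms concrete, here is a minimal sketch of what a frequency table and a concordance compute. It is not the Philologic API or a Project Bamboo service; the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_table(text):
    """Count how often each token occurs."""
    return Counter(tokenize(text))


def concordance(text, keyword, width=2):
    """Return (left context, keyword, right context) for every hit of keyword."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, used only to exercise the two functions.
sample = "The quality of mercy is not strained. Mercy blesseth him that gives."
freq = frequency_table(sample)
kwic = concordance(sample, "mercy")
```

Here freq["mercy"] is 2, and kwic holds one keyword-in-context entry per occurrence. A production service would add real tokenization, lemmatization, and a persistent index rather than rescanning the text on each query.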

Over the course of the summer and early fall, an initial instance of the BSP was made available to developers, and architectural design for ecosystem-wide identity management, group management, access policy management, and user profiles was completed.

By the end of the year, a Proof of Concept (PoC) integration of ecosystem elements will be demonstrated. From a Project Bamboo Work Space, a scholar will be able to gather a collection of texts via the Bamboo Collections Interoperability (CI) Hub and perform operations on those texts using tools and services provided by the BSP, by the Work Space, and by servers outside the Project Bamboo ecosystem. For example, a scholar will be able to retrieve a TEI document from the Perseus Digital Library or Eighteenth Century Collections Online (ECCO) through the CI Hub; use the BSP-hosted Syntactic Annotation service to prepare the document; then manually parse the sentences (or correct automatically generated parses) in the prepared document, using the externally-hosted Alpheios Treebank Editor. Syntactic annotations created by the scholar can then be stored in a Bamboo research environment (Work Space), or used as input to additional analytic or visualization tools.

Follow Scholarly Services and BSP work on the Project Bamboo wiki, and on Twitter @projectbamboo.

Steve Masover is IT Architect at University of California, Berkeley.

Drop-in Training – Tuesday, December 6

December 5th, 2011 by Research Hub News » Announcements

Our next training session will be this Tuesday, December 6, 1-2pm in 1535-A Tolman (map; see #13). This will be a good session for people just beginning to use Research Hub, but we’ll also address more advanced issues as they arise.

Bring your questions. We’ll provide the computers. (You can use your laptop if you wish.)
All are invited.

Research Hub and email

December 5th, 2011 by Research Hub News » Announcements

When it comes to sharing files, Research Hub can be a valuable substitute for email, especially as CalMail access is being carefully metered. Connecting via mobile apps and uploading, downloading, and syncing folders and files (see “Accessing your Files through Other Means” on our Tips & Tricks page) make Research Hub extremely efficient and practical.

Research Hub runs on a completely different server than CalMail, so it doesn’t share or contribute to the load on the CalMail environment. Nevertheless, because it is a web service, there will be times when you can’t get to your stuff on Research Hub, as happened last Wednesday, when CalNet had its troubles and no one could log in. We apologize for any inconvenience that caused, and remind our users (and ourselves!) to keep a local copy of any crucial documents you’ll need to have available.

The TCP texts and the query potential of the digital surrogate

November 28th, 2011 by Martin Mueller

“Query potential of the digital surrogate” is quite a mouthful, but the phrase usefully draws attention to two aspects of digital objects that scholars encounter in their work. A digital object will either be ‘born digital’, like this blog entry, or it will have originated in some other medium, in which case its digital version is a surrogate. For scholars who work with primary documents from before the late twentieth century, the central question about digital versions is how to create surrogates that maximize the distinct affordances or query potential of the digital medium while minimizing the sacrifices that come with any translation from one medium to another.

Concatenability is the most salient advantage of the digital surrogate. Consider the extraordinary power of the ‘cat’ command in the Unix operating system. The command

cat text1 text2  …. textn

will “catenate” the content of n texts and turn them into a single “text.” Now you can treat this collection of texts as if they were one.
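The same idea can be sketched in a few lines of Python for readers who do not work at a Unix prompt. This is only an illustration: the helper name catenate, the file names, and the file contents are made up for the example, and the files are created in a temporary directory so that the sketch runs on its own.

```python
import tempfile
from pathlib import Path


def catenate(paths):
    """Return the contents of the given files joined end to end, as `cat` does."""
    return "".join(Path(p).read_text(encoding="utf-8") for p in paths)


# Two throwaway files with hypothetical names and contents.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "text1"
    second = Path(tmp) / "text2"
    first.write_text("In the beginning\n", encoding="utf-8")
    second.write_text("was the word\n", encoding="utf-8")
    big_book = catenate([first, second])

# The combined text can now be queried as if it were a single text.
word_count = len(big_book.split())
```

Once the texts are joined, any operation that works on one text (counting, searching, sorting) works on the whole collection without further ceremony.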

If you think of ‘cat’ as an abbreviation of ‘catalogue’, the command simply describes what libraries have done for centuries. When you catalogue books and turn them into a “library” you create a single entity in which the different books are the ordered pages of one Big Book.

The Big Book created by “catenating” digital files supports forms of “reading” that are not constrained by what human eyes and hands can see or hold. Instead, they are constrained by the power of the algorithms that machines can perform. As an IBM executive once remarked, machines are fast but dumb, while people are smart but slow. How do you put the dumb but fast machine in the service of the smart but slow human reader?

From a scholar’s perspective this remains an uninteresting question as long as you think of the machine as just another device for delivering individual “books” that are then read in the conventional way. Scholars certainly have a keen interest in having more books delivered faster, cheaper, and without hassle, but they don’t care how it is done as long as it is done.

Things become more interesting if you ask what you can do with texts in a digital Big Book that you could not do with them in the Big Book of a traditional library. That question in turn is complicated by the question of what you must “do to” those texts before they release their query potential as a single and digital Big Book.

Which brings us to the Text Creation Partnership. Project Bamboo will work with some 2,000 18th-century texts (ECCO-TCP) that have recently been released into the public domain, but in the long run the results of that work may matter even more to the much larger EEBO-TCP archive to which they can be applied. The EEBO-TCP archive currently consists of some 30,000 texts published before 1700 and is likely to double in size by 2015. It will by then include at least one version of just about every book published before 1700. Starting in 2015, all these texts will move into the public domain, amounting to a virtually complete coverage of the first two centuries of print culture in English.

The most significant added value of the TCP archive will be the ability to treat 70,000 texts as if they were one Book of Early Modern English, but the realization of that value will depend critically on providing the digital equivalents of the complex indexes that have for centuries enhanced the usability of scholarly books. At the moment, the delivery of TCP texts to readers depends on some very generic forms of indexing in which the machine keeps track of the occurrence of the various spellings or character strings. Such indexing works well enough where readers can assume that the same word is spelled in the same way – a highly problematical assumption for texts before 1700. But even where it works, it works in very rough ways. A human reader looking at a printed page can tell right away whether a word occurs in prose or verse, in a footnote, heading, or other form of paratext, belongs to another language, needs to be tacitly corrected because it is misspelled or represents an orthographic variant, and so forth.
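The difference between indexing raw character strings and indexing what a reader actually recognizes can be sketched as follows. This is a toy illustration under stated assumptions, not a description of how TCP texts are indexed: the variant table, the 'verse' and 'prose' labels, and the sample segments are all hypothetical.

```python
from collections import defaultdict

# Hypothetical table mapping an observed spelling to a modern headword.
VARIANTS = {"loue": "love", "vertue": "virtue"}


def build_index(segments):
    """Index tokens under their headword, keeping spelling and context.

    segments is a list of (context, text) pairs, where context is a label
    such as 'verse' or 'prose' that a human reader would see at a glance.
    """
    index = defaultdict(list)
    for seg_no, (context, text) in enumerate(segments):
        for pos, token in enumerate(text.lower().split()):
            spelling = token.strip(".,;:!?")
            if not spelling:
                continue
            headword = VARIANTS.get(spelling, spelling)
            index[headword].append((seg_no, pos, spelling, context))
    return index


# Made-up sample segments for the demonstration.
segments = [
    ("verse", "Loue is not loue which alters"),
    ("prose", "Of vertue and of love."),
]
index = build_index(segments)
```

A lookup of index["love"] now returns all three occurrences, whatever their spelling, and each entry records whether it came from verse or prose. A string-based index would have filed "loue" and "love" as unrelated words.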

Sophisticated indexing of a Book of Early Modern English will consist of techniques that let machines identify and record explicitly the many differences that human readers tacitly recognize. Such indexing is well within the range of current technology, and many of the conditions for such indexes are already realized in the transcriptions of the TCP texts.

Much of Project Bamboo can be described as an effort to enhance the query potential of primary documents through complex indexes and user-friendly search tools. Indexes and search tools of this type are not tied to particular data sets. What is special about the EEBO-TCP texts is their claim to coverage. When the project is completed, it will be unrivaled in terms of the size of the data for which it will offer virtually complete coverage. There will be many archives that are larger, and there are archives that are more complete (e.g. Old English). But where else will we find virtually complete coverage of the printed record of a culture over more than two centuries?

Fast forward five years to a world in which the EEBO-TCP texts are in the public domain and Project Bamboo has contributed to an environment in which that archive is surrounded by layers of complex but easily searched indexes allowing scholarly minds from all walks of life to traverse genres and generations of Early Modern writing and get query results (not quite the same as answers to questions) within seconds or minutes (and occasionally hours). It will take considerable labor and ingenuity to build the indexes and tools, but if that work is done well it will be a worthwhile achievement, putting within reach of ambitious minds what was previously impracticable and observing the fourth of Ranganathan’s lapidary and charming laws of library science: “save the time of the reader.”

Martin Mueller is a professor of English at Northwestern University.

