Big Data issue of Nature: uneven, but worth reading

September 11th, 2008 by Patrick Schmitz

Big Data and the associated research trends are part of our future here at DS. The recent issue of Nature looks at issues and trends around the topic and, while uneven, has some good material in it that folks should check out. Here’s my blow-by-blow on the sections:

The opening editorial calls for a push to make annotating data a major component of research and of grants. Sound familiar? Let’s hope funders listen.

The section on the next Google trots out a lot of familiar and frankly pretty dull options. Skip it.

Big data: Data wrangling poses an important question about data collection. We might have the sense that there is so much data that it is just a matter of managing it. However, David Goldston notes that there are also huge holes in the dataverse, and these are a result of political policy. Further, if a political entity controls the data, politics can (and will) shape and filter the data in far-reaching ways.

Cory Doctorow’s Gee whiz piece is irritating (unless you’re into technoporn), and is easy to skip.

A piece on wikiomics is an excellent description of how a community can make a difference, and of the social dynamics of a collaboratory.

Cliff Lynch has a good piece on what data production projects must do to rationalize their data management, and what services groups like IST/DS must provide to support these projects.

Frankel & Reid present an interesting discussion of mining and visualization, and include a compelling, cautionary note:

“The ingrained habits of highly trained scientists make them rarely as adventurous as these young minds. We think we are on the path to insight when shading reveals contours in 3D renderings, or when bursts of red appear on heat maps, for example. But the algorithms used to produce the graphics may create illusions or embed assumptions. The human visual system creates in the brain an apparent understanding of what a picture represents, not necessarily a picture of the underlying science. Unless we know all the steps from hypothesis to understanding — by conversing with theorists, experimentalists, instrument and software developers, visualization scientists, graphic artists and cognitive psychologists — we cannot be sure whether a display is accurate or misleading.”

The closing essay is human interest and could be skipped in the interest of time. However, it is short, and like the best human interest stories, is surprising and inspiring.

Indianapolis Museum of Art Dashboard

September 4th, 2008 by Chris Hoffman

I was talking this morning with Peter Cava (Data Warehouse Services Manager here at UC Berkeley) about the (potential) intersection between the business intelligence systems used for administrative systems here and the kinds of data aggregation and analysis performed by research scientists and faculty working with museum collections. Afterward, I was looking at the preliminary program for the Museum Computer Network conference this fall and saw that Rob Stein at the Indianapolis Museum of Art is giving a talk about a dashboard that they have developed to help measure various aspects of the museum’s performance. I can’t resist when these kinds of connections happen; they always lead in interesting directions. The IMA Dashboard is up on the web, and they should be applauded for their emphasis on transparency. I also enjoyed reading the blog post at the Powerhouse Museum in Sydney featuring an interview with Rob about the project, which pointed me to a report written by Maxwell L. Anderson for the Getty, titled “Metrics of Success in Art Museums.”

Now it’s time to get in touch with Rob!

Campus Collaborative Tools Strategy Draft: Please Comment

August 26th, 2008 by Ian Crew

The Office of the CIO and Information Services and Technology are working with the UC Berkeley community to develop a campus strategy on how to use technology to best support collaborative work on campus, now and in the future.

You have an opportunity to shape this strategy and hence the future of collaborative technology used on campus.

As we did with our findings, today we are releasing a draft of the collaborative tools strategy for review and comment.

We are distributing this strategy widely in draft form to elicit comments, feedback, and guidance from the campus and higher education communities. We believe that only in this way can we develop a strategy that will support the creation of an environment where collaboration is easy and natural. We would greatly appreciate any guidance, perspectives, corrections, or suggestions. The development of the strategy to this point has depended heavily on the insight, comments, and feedback we have received from well over 200 people throughout the process. Our thanks to everyone who has taken the time to participate.

It is only through further input that we will be able to refine this into a strategy that is truly useful across campus.

Review the Strategy

Strategy, Draft 2 (PDF, 11 pages)

Several of the goals in this strategy are discussed in more detail in individual “Spotlight” documents. If you have particular interest or expertise in those areas, we would also appreciate feedback on those documents.

Providing Feedback

After reviewing the draft strategy, you may comment directly below, in the comments section of this blog post.  We may include quotes from any comment posted below in the report. If you’d like to be acknowledged for any quotes from your comments, please include your name, title, and organization in your comments. If you wish to submit a comment that will not be shared beyond the project team, send it to Ian Crew at icrew@berkeley.edu.

Research directions using aggregated museum data sets

August 10th, 2008 by Chris Hoffman

For quite a while now, I’ve been thinking about the value of aggregating content and information in museum collections. I think it is generally accepted now that museums and collections of many kinds need to make larger portions of their collections available online to the public, and efforts to digitize collections and webify collections data are producing wonderful results. At the same time, aside from good public relations, what’s in it for the museums and for scholarship in general? What new information or new research directions might emerge from aggregations of museum content? Not surprisingly, in natural history and biodiversity research, the power of numbers, of volume, has been recognized for a long time. Single specimens are nice as types, but in order to learn something about ecological systems and evolution, you need statistically valid numbers. In cultural heritage collections, the possibilities are less clear. Some recent work in England has been interesting, though perhaps more from the perspective of studying the history of museums and even of colonialism. Museum studies are still especially interested in the individual object or the subcollection. Rather than focusing on the unique individual object or specimen, what can we learn by unlocking and aggregating content in collections? What research questions emerge? What are the limitations and the opportunities?

Here’s one idea I’ve been thinking about that would pertain to Anthropology and Archaeology collections. We could look at the combinations of material and technique across culture, time and space. We’d expect certain combinations to be visible, but I suspect we would be surprised on numerous occasions. The semantic index that supports the Phoebe A. Hearst Museum of Anthropology’s Delphi system could be an excellent source for this project. I might even revisit some of my dissertation materials. Yikes…
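A minimal sketch of what such a tally could look like, in Python. Everything here is an assumption made for illustration: the records, the field names (`culture`, `material`, `technique`), and the helpers `combination_counts` and `rare_combinations` are hypothetical and are not drawn from Delphi or any real collection.

```python
from collections import Counter

# Hypothetical catalog records; real museum data would be far messier.
records = [
    {"culture": "Culture A", "material": "clay", "technique": "coiled"},
    {"culture": "Culture A", "material": "clay", "technique": "coiled"},
    {"culture": "Culture A", "material": "clay", "technique": "molded"},
    {"culture": "Culture B", "material": "fiber", "technique": "twined"},
    {"culture": "Culture B", "material": "fiber", "technique": "twined"},
    {"culture": "Culture B", "material": "clay", "technique": "coiled"},
]


def combination_counts(records):
    """Count how often each (culture, material, technique) combination occurs."""
    return Counter(
        (r["culture"], r["material"], r["technique"]) for r in records
    )


def rare_combinations(counts, max_count=1):
    """Return combinations seen at most max_count times, sorted for stable output."""
    return sorted(combo for combo, n in counts.items() if n <= max_count)


counts = combination_counts(records)

# Common combinations first; the rare ones are candidates for a closer look.
for combo, n in counts.most_common():
    print(combo, n)

print("Rare:", rare_combinations(counts))
```

The rare combinations are where the surprises might be, but they are also where data-entry errors and uneven documentation would show up first, so a low count is a prompt to look closer rather than a finding.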

What would be problematic about such a study? Data quality within and across collections would be an important consideration. Would we know which objects were documented at a sufficient level of detail? Would we know which parts of the collection were studied more closely? Would we know which museum specialists were “good” at their jobs? Would we know which objects or collections had been reviewed by multiple museum specialists? The number of biases would be large and problematic. But hey, I’m an archaeologist by training. I’m used to studying a messy data set and making a large number of assumptions.

What kinds of things can be done to address these biases? We could select only sets of data that had been carefully studied, but that in and of itself will create bias. We could try to enrich the data ourselves, but the sheer scope of that effort is terrifying. That’s where crowdsourcing, tagging and annotation could come in. By getting our collections online and allowing other experts (including the public) to enrich our content, we can incrementally improve the quality of our information. Other projects are showing how this can be productively done. However, how much work has been done on assessing the quality of tagging and annotation in a setting such as this? Interestingly, the CalPhotos system has been allowing reviewers to annotate and re-identify species for many years. CalPhotos, then, might provide a good context in which to study the results of annotation and review.

BECHAMEL project at NCSA combines preservation and semantic services

August 8th, 2008 by Patrick Schmitz

U. of Illinois is getting a chunk of NDIIPP money to develop their BECHAMEL framework, which identifies semantic vulnerabilities in metadata as a means of supporting digital preservation services. What does this mean? Here’s a good quote:

“For example, the meta-data for a digital file—a photo or map or document—might include a field called “creator.” Putting a name like “John Smith” in this field might seem sufficient, but does that really identify the creator of the information? In 50 years will a future researcher be able to pinpoint which of the world’s many “John Smiths” created the information?

BECHAMEL flags risks like that one, or such as numerical values that aren’t accompanied by error ranges.”

There’s only a little more info in the article, but there are some papers on a research page at the UIUC site. David Dubin’s recent paper provides some better details. He describes their earlier BECHAMEL work as “a research environment for proposing and testing theories of the meaning of markup.” It is a Prolog app connected to an RDF store (Kowari, losing favor to Mulgara).

It sounds like some of what they’re doing is to recognize that lots of so-called structured markup (including, in my opinion, lots of RDF) is actually semantic-free and amounts to free text annotations with some weak hints (e.g., “dc:creator”). The question is whether the project will yield useful tools or more guidelines that are unrealistic in deployment. Their near-term goal seems to be the conversion of entity references in free text (e.g., in a dc:creator element) to RDF references to vocabularies. Is a reference to the concept of “San Francisco, CA” in a gazetteer more useful than the same free text? Probably. But will an RDF pointer to a FOAF description of “John Smith” be much more useful than the free text? I doubt it. Nevertheless, a project worth watching.
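To make the free-text versus reference distinction concrete, here is a minimal Python sketch. It is not how BECHAMEL works (that is described above as a Prolog app over an RDF store). The helper names, the choice of fields to check, and the `example.org` gazetteer URI are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def is_uri_reference(value):
    """Treat a value as a reference only if it parses as an http(s) URI."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def flag_free_text_fields(record, fields=("dc:creator", "dc:coverage")):
    """Return the checked fields whose values are free text, not URI references."""
    return [
        field
        for field in fields
        if field in record and not is_uri_reference(record[field])
    ]


# Hypothetical metadata record: the creator is free text, while the
# coverage field points at a (made-up) gazetteer entry.
record = {
    "dc:title": "Map of the Bay Area",
    "dc:creator": "John Smith",
    "dc:coverage": "http://example.org/gazetteer/san-francisco-ca",
}

flagged = flag_free_text_fields(record)
print(flagged)
```

A check like this only detects that a value is free text. It says nothing about whether the thing a URI points to is any more informative than the string it replaced, which is exactly the doubt raised above about “John Smith.”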

