
Articles by Patrick Schmitz


Delphi 1.2 at PAHMA is out

Friday, July 17th, 2009 by Patrick Schmitz

Here’s what Michael Black, research and IT director of the Hearst Museum, said about it:

Hi everyone,

This is just a quick announcement, as fuller information should be upcoming in a campus press release.

Delphi 1.2 (the updated version of the Museum’s collections exploration and discovery tool) is now live and online.

In addition to the features released in version 1.1 two weeks ago — the ability to share sets with other people (whether or not they’re Delphi users), greatly improved ontologies (‘concept trees’) for automatic object classification, vastly enlarged object data (thanks to the efforts of dedicated volunteers doing data entry on more than 140,000 objects), and the ability to view scans of catalog cards for the objects you find — Delphi 1.2 presents a couple of new user-oriented features.

Delphi 1.2 now fully supports user tagging of objects, including being able to search on either your own tags or across all tags submitted by the entire user community.  Starting with this release, the blue “tongue” will change the content it displays according to the experience level of the user.  For new users, a basic introductory text is displayed, while for more experienced users (here defined as those who have at least played around with the sets and/or tagging features), the displayed text is more of a “what’s new in Delphi” news item.

I invite you to try it out, to share it with family, friends, colleagues, etc.

http://pahma.berkeley.edu/delphi
Michael

Big Data issue of Nature: uneven, but worth reading

Thursday, September 11th, 2008 by Patrick Schmitz

Big Data and the associated research trends are part of our future here at DS. The recent issue of Nature looks at issues and trends around the topic and, while uneven, has some good material that folks should check out. Here’s my blow-by-blow on the sections:

The opening editorial calls for a push to make data annotation a major component of research and of grants. Sound familiar? Let’s hope funders listen.

The section on the next Google trots out a lot of familiar and frankly pretty dull options. Skip it.

Big data: Data wrangling poses an important question about data collection. We might have the sense that there is so much data that it is just a matter of managing it. However, David Goldston notes that there are also huge holes in the dataverse, and that these are a result of political policy. Further, if a political entity controls the data, politics can (and will) shape and filter the data in far-reaching ways.

Cory Doctorow’s Gee whiz piece is irritating (unless you’re into technoporn), and is easy to skip.

A piece on wikiomics is an excellent description of how community can make a difference, and the social dynamics of a collaboratory.

Cliff Lynch has a good piece on what data production projects must do to rationalize their data management, and what services groups like IST/DS must provide to support these projects.

Frankel & Reid present an interesting discussion of mining and visualization, and include a compelling, cautionary note:

“The ingrained habits of highly trained scientists make them rarely as adventurous as these young minds. We think we are on the path to insight when shading reveals contours in 3D renderings, or when bursts of red appear on heat maps, for example. But the algorithms used to produce the graphics may create illusions or embed assumptions. The human visual system creates in the brain an apparent understanding of what a picture represents, not necessarily a picture of the underlying science. Unless we know all the steps from hypothesis to understanding — by conversing with theorists, experimentalists, instrument and software developers, visualization scientists, graphic artists and cognitive psychologists — we cannot be sure whether a display is accurate or misleading.”

The closing essay is human interest and could be skipped in the interest of time. However, it is short, and like the best human interest stories, is surprising and inspiring.

BECHAMEL project at NCSA combines preservation and semantic services

Friday, August 8th, 2008 by Patrick Schmitz

U. of Illinois is getting a chunk of NDIIPP money to develop their BECHAMEL framework, which identifies semantic vulnerabilities in metadata as a means of supporting digital preservation services. What does this mean? Here’s a good quote:

“For example, the meta-data for a digital file—a photo or map or document—might include a field called “creator.” Putting a name like “John Smith” in this field might seem sufficient, but does that really identify the creator of the information? In 50 years will a future researcher be able to pinpoint which of the world’s many “John Smiths” created the information?

BECHAMEL flags risks like that one, or such as numerical values that aren’t accompanied by error ranges.”

There’s only a little more info in the article, but there are some papers on a research page at the uiuc site. David Dubin’s recent paper provides some better details. He describes their earlier BECHAMEL work as “a research environment for proposing and testing theories of the meaning of markup.” It is a Prolog app connected to an RDF store (Kowari, losing favor to Mulgara).

It sounds like part of what they’re doing is recognizing that lots of so-called structured markup (including, in my opinion, lots of RDF) is actually semantic-free and amounts to free text annotations with some weak hints (e.g., “dc:creator”). The question is whether the project will yield useful tools, or just more guidelines that are unrealistic in deployment. Their near-term goal seems to be the conversion of entity references in free text (e.g., in a dc:creator element) to RDF references to vocabularies. Is a reference to the concept of “San Francisco, CA” in a gazetteer more useful than the same free text? Probably. But will an RDF pointer to a FOAF description of “John Smith” be much more useful than the free text? I doubt it. Nevertheless, a project worth watching.
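To make the contrast concrete, here is a minimal Python sketch of that kind of conversion. It is not BECHAMEL's code (Dubin describes a Prolog app over an RDF store); the gazetteer, its example.org URIs, and the function names are all made up for illustration, and triples are plain tuples rather than objects from a real RDF library.

```python
# Hypothetical gazetteer: normalized place names mapped to made-up identifiers.
GAZETTEER = {
    "san francisco, ca": "http://example.org/gazetteer/san-francisco-ca",
}


def resolve_entity(text, vocabulary):
    """Return ("uri", identifier) if the free text matches a vocabulary
    entry, otherwise ("literal", text) with the original text unchanged."""
    key = " ".join(text.lower().split())  # normalize case and whitespace
    uri = vocabulary.get(key)
    if uri is not None:
        return ("uri", uri)
    return ("literal", text)


def to_triples(record_id, metadata, vocabulary):
    """Turn a flat metadata dict into (subject, predicate, object) triples,
    replacing free-text values with vocabulary references where possible."""
    return [
        (record_id, field, resolve_entity(value, vocabulary))
        for field, value in metadata.items()
    ]


record = {"dc:coverage": "San Francisco,  CA", "dc:creator": "John Smith"}
triples = to_triples("http://example.org/photo/1", record, GAZETTEER)
for triple in triples:
    print(triple)
```

The place name resolves to a single identifier that other records can share, while “John Smith” stays a literal because there is nothing in the vocabulary to disambiguate it. That is the asymmetry I am skeptical about above.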

Nina Simon on IMLS Meeting on Museums and Libraries in the 21st Century

Monday, July 21st, 2008 by Patrick Schmitz

Nina Simon (who writes the Museum 2.0 blog) recently wrote about her impressions of the IMLS Meeting on Museums and Libraries in the 21st Century that took place last week. The meeting was preliminary to a large report that NAS is commissioning on the subject.  It is an interesting survey of the state of attitudes in the industry, from the perspective of someone who wants to see things move forward.

She includes notes on the six topics that the workshop discussed:

  1. How do you plan for the future?
  2. What are the essential differences and similarities between libraries and museums?
  3. How do you measure and articulate the value of museums and libraries?
  4. How can our expertise and assets be applied towards new ends?
  5. Who owns the stuff? Who controls the experience?
  6. How do we reimagine physical space and assets?

Her general observations:

  1. Some leaders are more radical than I hoped, and these people have a hard time advocating for change when their accountability is to those who have not changed.
  2. Some leaders are more conservative than I feared, and these people are alternately smug and desperate about maintaining their power.
  3. Meetings about the future end up being about the present. We were much less creative and forward-thinking than we could have been. Dream big, share it in the comments, and help this be a more productive study.

Read the post – it is interesting, and a good introduction to that blog, if you do not know it already.

Scholarz.net – an interesting collection of tools for scholarship

Wednesday, July 9th, 2008 by Patrick Schmitz

I just came across the http://scholarz.net/ project, which wants to be a destination for scholars doing research, collecting notes on projects, sharing information about sources, projects, etc. At this point, it seems to combine basic Wiki functionality (including tagging) with a basic social network app (a la Facebook). Some pieces are kind of nice, but it lacks the workflow integration of Zotero, of which I am very fond. And Zotero is soon to release a more community-oriented version of their tools, which will make it much more powerful. Still, Scholarz.net looks like a project to watch.

Xerox white paper on Semantic Tech

Wednesday, July 9th, 2008 by Patrick Schmitz

The folks at Xerox put out a nice little summary of Semantic Tech and what it offers to the enterprise. It reads a lot like some other recent posts on the bottom-up vs. top-down approaches (a characterization that I am not really comfortable with). It is also slightly behind the times, given the MSFT acquisition of Powerset. Nevertheless, worth checking out to get a sense of what a big Document Engineering/Processing company sees in this field.

New BC&C articles from IST Data Services on iNews

Friday, February 22nd, 2008 by Patrick Schmitz

Here are some of the recent iNews articles I know of, from IST/DS:
Enterprise Data Warehouse
Media Vault
Mellon OpenCollection
OKAPI
Semantic Services

Others can add the ones I missed.

Media Vault in the news

Monday, February 4th, 2008 by Patrick Schmitz

Since Michael is too modest to blow his own horn, I’ll note this nice article based upon interviews with him and others in our very own IST staff news.

Cliff Lynch: Humanities workflow as a sensor network

Monday, February 4th, 2008 by Patrick Schmitz

I sat in on Cliff’s recent Friday seminar, where he presented a few intriguing ideas (as he is wont to do). In particular, he was exploring the idea of humanities workflows as related to sensor network management. He (like us) has been thinking about what it means to describe a workflow for the humanities, and is comparing it to the kind of systems used in geoscience and oceanography. A recent model (which may or may not scale, but that is beside the point) that is in favor in these sciences specifies a low level of sensing until something “interesting” happens, and then an increased rate of observations (to capture lots of interesting details). Cliff proposed that humanities research might be seen to follow a similar model of scanning various sources for potential utility; when something related or interesting is found, the academic dives in and looks deeper and more carefully.
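As a rough illustration of that sensing model, here is a small Python sketch. It is my own toy version, not anything Cliff showed: the function name, the threshold test for “interesting”, and the step sizes are all assumptions.

```python
def adaptive_sample(readings, threshold, slow_step=5, fast_step=1, burst=3):
    """Return the indices of the readings that would actually be observed.

    Observe every `slow_step`-th reading until one exceeds `threshold`,
    then observe every `fast_step`-th reading for the next `burst`
    observations before dropping back to the slow rate.
    """
    observed = []
    i = 0
    remaining_burst = 0
    while i < len(readings):
        observed.append(i)
        if readings[i] > threshold:
            remaining_burst = burst  # something interesting: (re)start a burst
        elif remaining_burst > 0:
            remaining_burst -= 1
        i += fast_step if remaining_burst > 0 else slow_step
    return observed


# Twenty quiet readings with one spike at index 10.
readings = [0] * 10 + [9] + [0] * 9
print(adaptive_sample(readings, threshold=5))  # [0, 5, 10, 11, 12, 13, 18]
```

In the humanities analogy, the slow steps are skimming sources and the burst is the close reading that follows a promising find. Note that a spike falling between two slow samples is simply missed, which is the usual cost of sampling sparsely.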

This fired a tangent in my thinking: perhaps it is also related to materials processing (e.g., in manufacturing). A given factory has various sources of inputs and must evaluate these to maximize its own output quality at a reasonable cost. Seems to me that much of research (of most sorts, but especially information processing such as in the humanities) sounds rather like this: evaluating quality of input from sources, considering the cost (usually time, but it may be effort), and switching sources from time to time (e.g., when one finds a new journal or research group with promising content). All this with an eye to maximizing the output quality (an academic’s own research).

So what? Perhaps there are lessons to be learned from the modeling (optimization, etc.) that has gone into these respective disciplines. This assumes that in aggregate people act somewhat like enterprises, the evaluation of which is left as an exercise for the reader.

Cliff also talked about documenting workflow as digital provenance, and the difference between workflow languages that seek to abstract the work (so that the workflow can be shared and reused) and documentation systems that serve to capture experimental or processing details (including data sources, software versions, etc.). We discussed the coming need to understand what constitutes a significant alteration of a processing flow (e.g., does a minor rev of software in a lab workbench change the experiment in a substantive way?). Appears to be a promising area of research.

Cliff also mentioned the MyExperiment project which lets contributors post scientific workflows and share them with a community. Interesting idea, and underscores the importance of formalized workflow in the scientific disciplines (especially those using the lab workbench tools). Looks to me like yet another discipline is beginning to look more like software (following the path of hardware design using CAD systems with elaborate libraries that are linked, not to mention FPGA devices).

Cliff mentioned the issues of repositories and versioning, noting that archives tend to want an object only when it is “dead” (no longer changing). He mentioned the tension between saving a few versions of interest, and the cost of preserving an auto-save version generated every few minutes. I suggested that having even a nominal charge will take care of much of this, as people will then balance cost and benefit to moderate submissions. A related issue came up about authorial control over the submissions: on the one hand it would be nice to automate dissemination of materials (e.g., to journals, peer review mechanisms, etc.), but on the other hand an author may want to control this so that peers do not see “premature” work. This reminds me of similar challenges faced by software developers who want to check in interim (or branch) versions that are not yet ready for integration into the main trunk of development.

I guess when you have a hammer, lots of things start to look like nails…

