Taking the CB away from Humans and Giving It to Computers

By Sylwester Ratowt

What makes today’s internet so different from the original web of the 90s is that, rather than writing webpages by hand, developers often write applications that create the webpages automatically based on data found at other internet locations. Consider the Digital Public Library of America, for example, which features browse-by-place and browse-by-date interfaces.

This transition to a more algorithmically driven web was made possible because more and more resources on the web are coded in a format that is understandable to computer programs, rather than simply a format guided by the needs of the human eye.

This change in how data is represented is also happening at the Isis Bibliography. For the past one hundred years the bibliography was produced to be read by humans as a printed list of citations, guided by long-held conventions of humanities scholarship. Even at the CB office, up to this point, we have been thinking of the printed annual volume as the definitive form of the data.

But we’ve re-imagined the fundamentals of the bibliography. From now on, the definitive version of any bibliographical citation will be one that is primarily written for a computer to read.

Data from the Isis CB formatted for human consumption

But what does that mean? Let’s take a look at a human-focused bibliographic entry that has a familiar format. Here is a citation published a few years ago.

Mulsow, Martin, “Ambiguities of the Prisca Sapientia in Late Renaissance Humanism.” Translated by Janita Hämäläinen. Journal of the History of Ideas 65 (2004): 1-13.

Any scholar will easily decode this bibliographical convention to mean that this is a journal article written by Martin Mulsow, translated by Janita Hämäläinen, and titled “Ambiguities of the Prisca Sapientia in Late Renaissance Humanism,” and that it was published in 2004 in the 65th volume of the Journal of the History of Ideas, starting on page 1 and ending on page 13. A human reader also has no problem ascertaining that two people were responsible for the article: the author and the translator.

However, a computer program can’t decipher any of this without detailed instructions about what each position and punctuation mark in the citation signifies.

It turns out that the first step toward computerizing the bibliography wasn’t much of a help in this regard. When John Neu, the longtime Isis bibliographer, created the first Isis database and placed the data in a few discrete fields, he was mostly interested in the printed output, how text was to be formatted on a page. That was all that mattered in the 1970s.

By 2002, when Stephen Weldon and I used FileMaker Pro to build a database to manage the Isis CB data, we took a step toward making the information computer friendly. We created fields that corresponded better to the functional units of the citation, separating, for example, publication information that had been in a single text field.

Data in a custom database used by CB staff.

Yet there were several problems with our solution. First of all, we were still fundamentally interested in the printed form. Second, we still had to create text fields (like the “Edition Details” field in the figure here) that bundled a lot of human-readable information into a single line that couldn’t be easily understood by a computer script.

Third, and most seriously, the Isis CB FileMaker Pro database is a one-off system. Our conventions are ours alone. There are only a few computers in the world that currently have the ability to read and interpret this data in a meaningful way, and they are all in Norman, Oklahoma!

So the next and most important step is to publish the bibliography in such a way that a lot of computers can read this stuff. That’s why we have adopted a very different format for our data. From now on, the definitive form of all Isis CB citations will be XML that follows the MODS 3.5 schema. Writing the bibliography in this way will make it accessible to computers around the world that understand these broadly accepted conventions. Since the Library of Congress is standing behind MODS 3.5, the Isis data will become widely accessible to computers in ways that it never was before.

Data in machine readable XML.

In this new format, we are not primarily worried about how things will be displayed for the human to read, but about how the computer will understand what a bibliographic citation is composed of. Using this schema the computer program knows that “Janita Hämäläinen” is a person associated with this text in the role of “translator.”
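To make that concrete, here is a minimal Python sketch of how a program might read the people and their roles out of a MODS record. The record below is a hand-written illustration of the Mulsow citation, not the actual Isis CB record, and the exact elements the CB uses may differ. The function name people_by_role is hypothetical and exists only for this example.

```python
import xml.etree.ElementTree as ET

# MODS elements live in this XML namespace.
NS = {"m": "http://www.loc.gov/mods/v3"}

# Illustrative record only: an assumption about how the citation
# might be encoded, not a copy of the real Isis CB data.
RECORD = """
<mods xmlns="http://www.loc.gov/mods/v3" version="3.5">
  <titleInfo>
    <title>Ambiguities of the Prisca Sapientia in Late Renaissance Humanism</title>
  </titleInfo>
  <name type="personal">
    <namePart>Mulsow, Martin</namePart>
    <role><roleTerm type="text">author</roleTerm></role>
  </name>
  <name type="personal">
    <namePart>Hämäläinen, Janita</namePart>
    <role><roleTerm type="text">translator</roleTerm></role>
  </name>
  <relatedItem type="host">
    <titleInfo><title>Journal of the History of Ideas</title></titleInfo>
    <part>
      <detail type="volume"><number>65</number></detail>
      <extent unit="pages"><start>1</start><end>13</end></extent>
      <date>2004</date>
    </part>
  </relatedItem>
</mods>
"""


def people_by_role(mods_xml):
    """Return a dict mapping each role term to the list of names holding it."""
    root = ET.fromstring(mods_xml.strip())
    people = {}
    # Only top-level <name> elements describe the article itself.
    for name in root.findall("m:name", NS):
        who = name.findtext("m:namePart", default="", namespaces=NS)
        role = name.findtext("m:role/m:roleTerm", default="unknown", namespaces=NS)
        people.setdefault(role, []).append(who)
    return people


if __name__ == "__main__":
    for role, names in people_by_role(RECORD).items():
        print(role, "->", ", ".join(names))
```

Nothing in this script has to guess from punctuation or word order who the translator is. The role is stated explicitly in the data, so any program that knows the MODS conventions can find it.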

If you look at the MODS example here, you’ll see right away that this is not written for the human eye. And that’s as it should be. It is time to really get serious about giving our CB to the computer.

Of course, we aren’t intending to leave it there. This is not some kind of bizarre altruism toward sentient robots. We do intend for the computers to give the CB back to us. But exactly how that will happen and what sort of benefits we expect to see from it is a topic for a later post.

