Digitizing the Cumulative Bibliography
By Stephen P. Weldon
Over the next couple of years, one of the goals of this blog will be to document the efforts that I and others are making to build the Isis Platform I discussed in my last post. I do this in part to show how a project of this sort gets built and to highlight the work of the many people involved in it. I also do it because I’m hoping that readers of the blog might be able to offer ideas as we proceed that will help us avoid pitfalls and take paths we might not have thought about.
I will begin by taking you to the OU Libraries’ new Digitization Lab, where the staff are photographing and digitizing the printed Isis Bibliographies from 1913 to 1975. The DigiLab was installed last year by the library dean Rick Luce in order to ensure that the University would have the tools to support its faculty in the 21st-century information-rich environment. The staff there, under the direction of Brian Shults and Barbara Laufersweiler, are working on projects from around the University.
My project for the DigiLab is to have them extract readable text from the print bibliographies so that I can then incorporate those entries into the Isis dataset. The data splits sharply between pre-digital and “born digital.” The year 1974 is the dividing line. Not until that year did John Neu begin entering citations into a database using home-grown software; and this was done initially simply to facilitate the creation of a printed book. Of course, as soon as it became possible to distribute the data in electronic form, all of the post-1974 data could be used. Now it is time to go back to the pre-digital bibliography and turn sixty years of legacy data into a machine-readable digital form and add it to the Isis dataset.
We are lucky, however, that we don’t have to go back to each volume of Isis and do each bibliography separately. In the late 1960s, the History of Science Society hired the British librarian Magda Whitrow to compile and publish a printed cumulation of the first half-century of the CBs. It took Whitrow fifteen years to accomplish this feat, but it was so successful that John Neu continued producing these cumulations at ten-year intervals after that.
This means that I need only go to eight volumes of cumulated bibliographies (6 for the 1913-1965 cumulation, and 2 for the 1966-1975 cumulation) in order to extract the citations that I am looking for. Nonetheless, the task is still daunting. Those eight volumes contain nearly 4,350 pages and over 140,000 citations by my estimation. It’s a lot of work. So I am grateful for the new state-of-the-art digitization facility on the third floor of Bizzell Library.
In the first image above, you see Rick Schultz operating the Kabis book scanner that can photograph up to 2,000 pages per hour. The book rests in a cradle that holds it open at about 45 degrees. The machine has two cameras that photograph opposite pages simultaneously, and a mechanical arm that turns the pages one by one by reaching across the carriage and grabbing a single sheet with its vacuum head, as you see here in the photo. Rick is a photography student in the OU fine arts department who works part time at the Lab.
In the next photograph, Emily Grimes (an English B.A. now studying Computer Science) is working with the OCR software. The library has purchased ABBYY to transform the PDF pages into computer-readable text. There are a number of challenges here, but Emily is studying the results quite closely to maximize the accuracy.
OCR is especially tricky for the bibliography because a vast number of languages appear in the cited works—Sarton and his successors have drawn citations from all over the world. This means that we are awash in diacritics! It is raining macrons, breve marks, ogoneks, and cedillas—not always the friendliest weather for OCR work (see figure). But under Emily’s guidance, ABBYY is coping pretty well, though she continues to tinker with it in order to drop the error rate as much as possible.
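To give a sense of what handling these marks involves once the text leaves the OCR software, here is a minimal Python sketch. It is an illustration under my own assumptions, not a description of how ABBYY works or of the Lab's workflow: the function names `diacritics_in` and `normalize_entry` are hypothetical. It uses the standard `unicodedata` module to list the combining marks found in a line and to store each entry in a single canonical form (NFC), so that the same accented letter always compares equal no matter how it was encoded.

```python
import unicodedata


def diacritics_in(text):
    """Return the sorted Unicode names of the combining marks found in text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return sorted(
        {unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)}
    )


def normalize_entry(text):
    """Recompose letters and marks so equivalent strings are stored identically."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    sample = "ā ă ą ç"
    for name in diacritics_in(sample):
        print(name)
```

A report like this could help flag which pages carry unusual marks and therefore deserve a closer look after conversion.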
Along the way, we are looking for easy-to-catch misreadings that can be detected and corrected algorithmically once the converted text is exported into a database.
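As a rough illustration of the kind of cleanup pass this might involve, here is a short Python sketch. Everything in it is an assumption made for the example: the rules in `RULES`, the function name `clean_line`, and the sample citation are hypothetical and are not drawn from our actual OCR output. The idea is simply that some confusions are predictable (a letter "l" or "O" sitting between digits is almost certainly a "1" or a "0"), so they can be corrected by rule while keeping a record of what was changed.

```python
import re

# Hypothetical rules: each pairs a pattern for a likely misreading with its fix.
RULES = [
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),  # letter l or I between digits -> 1
    (re.compile(r"(?<=\d)O(?=\d)"), "0"),     # capital O between digits -> 0
    (re.compile(r"[ \t]{2,}"), " "),          # runs of spaces or tabs -> one space
]


def clean_line(text):
    """Apply every rule in order; return the cleaned text and a change log."""
    changes = []
    for pattern, replacement in RULES:
        text, count = pattern.subn(replacement, text)
        if count:
            # Record which rule fired and how often, so a person can audit it.
            changes.append((pattern.pattern, count))
    return text.strip(), changes


if __name__ == "__main__":
    cleaned, log = clean_line("Isis,  19l3, 1: 1O5-1O9")
    print(cleaned)
    print(log)
```

Keeping the change log matters as much as the fix itself, since any rule that is too eager can then be spotted and rolled back before the corrected text is merged into the dataset.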
Of course, this is only the first step. As we move forward, we’ll face the challenge of parsing the data. But that’s a topic for another post.