A New Paradigm

Recent processed data from transcription service.

Since my post at the end of July, a great deal has happened. Developments are taking place along two parallel lines that will eventually be integrated.

First of all, the Isis Cumulative Bibliography volumes that we photographed last year (5000 pages of text in eight volumes, covering the years from 1913 to 1975) are now being transcribed into TEI, a machine-readable XML format. I have contracted with the firm Apex CoVantage to do the initial transcription. They guarantee their transcriptions to be 99.95% accurate, and the results we have seen from the first volume that has come back to us show that they are not exaggerating.

The need for a transcription service became apparent after spending a lot of time last year analyzing the OCR that had been done on the page images. Even though the OCR seemed fairly accurate at first, it turned out to miss too many characters. Diacritical marks were especially problematic for the OCR software. (You have to realize that the early cumulative volumes done by Magda Whitrow were created by photographing index cards arranged on a page. These cards had simply been typed on a typewriter. In the process of typing and editing, in an age without word processors, some of the diacritics had to be added by hand. As good as the quality was, and Whitrow’s standards were extremely high, the OCR software just could not reproduce the text with sufficient accuracy.) Even more of a problem than the occasional misread letter was the lack of reliable paragraphing, which made it almost impossible to parse the pages into discrete citations.

The human transcribing process is done in India using a double-key entry method that is quite common for these services. Each page is typed twice and the two pages are compared. All discrepancies are checked and fixed. In addition, we have been able to specify precisely the kind of markup that we want, meaning that the transcribers know what kinds of oddities to look for and how to deal with them.
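To make the double-key idea concrete, here is a minimal Python sketch of the comparison step: two independent keyings of the same page are compared line by line, and every mismatch is flagged for a human to check. This only illustrates the general technique and is not the vendor's actual tooling; the function name `find_discrepancies` and the sample lines are invented for the example.

```python
import difflib


def find_discrepancies(first_keying, second_keying):
    """Compare two independent keyings of one page, line by line.

    Returns a list of (line_number, lines_from_first, lines_from_second)
    tuples, one for each stretch where the two keyings disagree.
    """
    first_lines = first_keying.splitlines()
    second_lines = second_keying.splitlines()
    matcher = difflib.SequenceMatcher(
        a=first_lines, b=second_lines, autojunk=False
    )
    discrepancies = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Line numbers are 1-based and refer to the first keying.
            discrepancies.append(
                (i1 + 1, first_lines[i1:i2], second_lines[j1:j2])
            )
    return discrepancies


# Invented placeholder text, not taken from the bibliography itself.
FIRST_KEYING = "Müller, Example\nA placeholder title\n1913"
SECOND_KEYING = "Muller, Example\nA placeholder title\n1913"

for line_number, first, second in find_discrepancies(FIRST_KEYING, SECOND_KEYING):
    print(f"Line {line_number}: {first} vs. {second}")
```

The point of keying twice is that two typists are unlikely to make the same mistake in the same place, so a dropped diacritic like the one in the sample shows up as a disagreement rather than slipping through.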

As soon as the transcription is done, Conal Tuohy, who works out of Brisbane and specializes in text encoding, content management systems, and similar areas, will write the code to add extensive bibliographical markup to the pages. The result will be TEI-compliant files from which it will be easy to extract the bibliographical metadata. Tuohy is especially knowledgeable about the open access environment, and I feel very lucky to have found him. All this means that we will be able to quickly ingest the 1913 data into the IsisCB database, and it also means that historiographical research going back to the early twentieth century will become much easier.
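As a rough idea of what "easy to extract" could mean in practice, the following Python sketch pulls author, title, and date out of a small TEI fragment using only the standard library. The exact markup for this project has not been described here, so the structure of the sample (a `listBibl` containing `bibl` entries with `author`, `title`, and `date` children) is an assumption based on standard TEI elements, and the entry itself is an invented placeholder.

```python
import xml.etree.ElementTree as ET

# The standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; the real files may be structured quite differently.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <listBibl>
        <bibl>
          <author>Example, Author</author>
          <title>An Example Title</title>
          <date when="1913">1913</date>
        </bibl>
      </listBibl>
    </body>
  </text>
</TEI>"""


def extract_citations(tei_xml):
    """Return one dict per <bibl> entry with its author, title, and date."""
    root = ET.fromstring(tei_xml)
    citations = []
    for bibl in root.iterfind(".//tei:bibl", TEI_NS):
        record = {}
        for field in ("author", "title", "date"):
            element = bibl.find(f"tei:{field}", TEI_NS)
            # A missing or empty element is recorded as None.
            has_text = element is not None and element.text
            record[field] = element.text.strip() if has_text else None
        citations.append(record)
    return citations


for citation in extract_citations(SAMPLE_TEI):
    print(citation)
```

Once each citation is marked up consistently, turning a whole volume into database records is mostly a matter of looping over entries like this.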

The new platform in its early Beta form.

The new platform in beta testing.

The second front on which we are working is building a stable open access search platform. I built the file structure that I described in my July post in FileMaker Pro, and it was an extremely important step in helping me refactor the database. Since I knew and understood how to use FileMaker, it was possible for me to think through the process of the data transformation by simply building the new database and trying it out. Although I toyed with the idea of developing the new search platform directly in FileMaker, I quickly realized that this would not work because I would never be able to create stable records that anyone could easily see or link to.

To solve this problem, I have contracted with A Place Called Up, headed by Erick Peirson and Julia Damerow, two amazingly able programmers and smart digital humanists. (Incidentally, they will both be participating in the THATCampHSS 2015 that we are holding at this year’s November HSS meeting—so please sign up.) Peirson and Damerow studied my database and have created a web application that is extremely powerful in searching for specific works and navigating through the complex web of interconnected authors, citations, and subjects.

Initial beta testers have given the platform quite good reviews. One colleague of mine claimed that it was simply fun to do research on the system because it was so easy to use and so uncluttered. Another colleague who has not used the IsisCB in the past told me that he might very well start using the new system since it is so easy to navigate.

Not only that, the system is built to be interactive. Users can already add comments to all citations and authorities. The commenting feature will be developed further so that users can add links to relevant resources and, eventually, add new resources themselves. Another feature we plan to add before too long is the ability for users to manage their own account pages and to add information to their records, such as links to outside webpages and further biographical and professional information.

I believe that users will quickly see the potential of this system. It clearly turns a corner in our ability to do research in history of science. I think calling it a new paradigm is not an exaggeration, so I hope that readers of this blog will quickly try out the system when we open it to the public in early November. Also, if you would like to try out this system early, please let us know by sending an email to me or Sylwester Ratowt, the IsisCB project manager.
