Researchers create tool to automatically search handwritten historical documents

By Patrick J. Callahan

Historians and researchers searching through handwritten documents, such as the 140,000 pages that make up George Washington’s personal papers in the Library of Congress, now have a powerful new tool to aid their work: a first-of-its-kind manuscript retrieval system developed at the Center for Intelligent Information Retrieval in the Computer Science Department.

R. Manmatha, research assistant professor of Computer Science, along with graduate students Toni Rath and Victor Lavrenko, has created a demonstration of the search tool using 1,000 scanned pages of Washington’s manuscripts. Manmatha says the interface is similar to that of the popular search engine Google. The demonstration system is available at http://ciir.cs.umass.edu/research/wordspotting

The scanned pages of Washington’s papers can be searched by typing in a word such as “Washington” or “Virginia,” and the program produces a ranked list of the pages on which that word appears.

“Right now, searching a scanned handwritten document is very hard to do,” Manmatha says. “Scanned historical documents are basically images, or pictures, and currently can only be searched if someone manually transcribes the documents or creates an index of their contents. This is time-consuming and expensive to do. Given the cost, most handwritten documents are never transcribed or indexed. But there is an enormous amount of handwritten, historical material.”

According to Toni Rath, “The basic idea is analogous to searching text documents in one language, say French, using queries in another language, say English. This is usually done by learning models from documents written in both languages. By analogy, our system learns from a parallel body of transcribed scanned images. That is, the word images form a ‘visual language’ and the transcriptions are in English.” Once the model is learned, it can be used to search scanned pages for which no transcriptions are available.
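The analogy above can be illustrated with a toy sketch in Python. It is not the researchers' implementation, and the article does not describe the image features or the statistical model they use. The sketch assumes that each word image has already been reduced to a discrete "visual token" (the labels v1, v2, v3 are hypothetical), learns how often each token co-occurs with each transcribed word, and then ranks untranscribed pages by how likely they are to contain the query word.

```python
from collections import Counter, defaultdict


def learn_model(pairs):
    """Estimate P(word | visual_token) from (visual_token, transcription) pairs.

    The pairs play the role of the parallel corpus: word images on one
    side, their English transcriptions on the other.
    """
    counts = defaultdict(Counter)
    for token, word in pairs:
        counts[token][word.lower()] += 1
    model = {}
    for token, word_counts in counts.items():
        total = sum(word_counts.values())
        model[token] = {w: c / total for w, c in word_counts.items()}
    return model


def rank_pages(model, pages, query):
    """Rank untranscribed pages by the expected number of query-word occurrences."""
    query = query.lower()
    scores = {
        page_id: sum(model.get(t, {}).get(query, 0.0) for t in tokens)
        for page_id, tokens in pages.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Drop pages with no evidence for the query word at all.
    return [(page_id, score) for page_id, score in ranked if score > 0]


# Hypothetical training data: visual tokens paired with transcriptions.
training_pairs = [
    ("v1", "Washington"), ("v1", "Washington"), ("v1", "Wilmington"),
    ("v2", "Virginia"), ("v2", "Virginia"),
    ("v3", "the"),
]
model = learn_model(training_pairs)

# Hypothetical untranscribed pages, each a sequence of visual tokens.
pages = {
    "page_a": ["v3", "v1", "v1"],
    "page_b": ["v2", "v3"],
    "page_c": ["v1", "v2"],
}

results = rank_pages(model, pages, "Washington")
for page_id, score in results:
    print(f"{page_id}: {score:.2f}")
```

In this toy data, page_a ranks above page_c for the query "Washington" because it contains two word images whose token was usually transcribed as "Washington" in the training pairs. A real system would work with continuous image features and a far larger transcribed set, so the counts here stand in only for the general idea of learning from parallel data.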

A research paper describing the work was presented this summer at the leading information retrieval conference, the 27th Annual International ACM SIGIR conference in Sheffield, England. The work is partly funded by a grant from the National Science Foundation and the National Endowment for the Humanities.