Manmatha receives $205,000 Mellon grant for digitization research
Computer Science research associate professor R. Manmatha has been awarded a one-year, $205,000 grant from the Andrew W. Mellon Foundation to support the development of software and techniques for scholars in the humanities to use in processing large corpora of digitized books.
The grant supports “Proteus Infrastructure: Work Aggregation and Entity Extraction,” a pilot project to build and evaluate research infrastructure for scanned books. The work is a collaboration with David Smith of Northeastern University, a former research faculty member in Computer Science, and with Computer Science professor James Allan.
While there are several large scanned book collections (for example, the Internet Archive), much of their content is unstructured and not easily used by scholars in the humanities. “The grant will support building the Proteus infrastructure which will help scholars navigate and use such collections more easily,” said Manmatha. “Components of the infrastructure include automatically identifying a book’s language, linking multiple editions of canonical works, finding quotations in canonical works, and entity detection. One of the key aims of the project is to do all these tasks efficiently at large scale.”
Manmatha is also associated with a Mellon Foundation grant awarded in 2012 to Texas A&M. As part of this grant, “OCRing Early Modern Text,” he will take the output of optical character recognition (OCR) systems on 18th-century English books and use his technology to automatically estimate OCR errors and correct the output of multiple OCR engines.