Blacklight Used to Extract Meaning from Cursive Script, Allowing Scanned Documents to be Searched
In 1973, a fire broke out at the St. Louis National Personnel Records Center, destroying 16 to 18 million military service records from 1912 to 1964. If these records had been digitized they’d have been safe, but not necessarily any more accessible.
Scanned PDF images, the low-cost, high-speed way to digitize paper records, can be duplicated and stored in many places. But nothing in them can be found unless a human being reads through the handwritten text by eye. The 1940 U.S. Census alone consists of 3.6 million PDF images.
Commercial services like Ancestry.com employ thousands of human workers who manually extract the meaning of a small, profitable subset of these images so they can be searched by computer, says Kenton McHenry of the National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign. “But government agencies just don’t have the resources” necessary to make most of them accessible in this way. The danger is that scanned documents may become a digital graveyard for many historically, culturally and scientifically valuable old documents: easy to find, impossible to resuscitate.
Liana Diesendruck, an NCSA research programmer, McHenry and colleagues in his lab have used PSC’s Blacklight supercomputer to begin cracking this formidable problem. Using the 1940 Census as a test case, they have drawn on Blacklight’s architecture to create a framework for automatically extracting the meaning from these images, in essence teaching machines to read cursive script.
TEACHING MACHINES TO READ
“Before we could even think about extracting information, we had to do a lot of image processing,” says Diesendruck. Misalignments, smudges and tears in the paper records had to be cleaned up first. But the difficulty of that process paled compared with the task of getting the computer to understand the handwritten text.
It’s relatively simple to get a computer to understand text that is typed in electronically. It knows that an “a” is an “a,” and that the word “address” means what it does. But early Census entries were made by many human workers, with different handwriting and styles of cursive script. These entries can be difficult for humans to read, let alone machines.
Having the computer deconstruct each handwritten word, letter by letter, is impossible with today’s technology. Instead, the investigators made the computer analyze the words statistically rather than trying to read them. Factors like the height of a capital “I,” the width of the loop in a cursive “d” and the number of degrees by which the letters slant from the vertical all go into a 30-dimensional vector, a mathematical description consisting of 30 measurements. This description helps the machine match words it knows with ones it doesn’t.
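As an illustration only, the matching idea can be sketched as a nearest-neighbor comparison of feature vectors. The article does not say which distance measure or matching rule the NCSA system uses; the Euclidean distance, the function names and the made-up vectors below are assumptions chosen to keep the sketch short.

```python
import math

DIM = 30  # the article describes vectors of 30 measurements


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_label(query, labeled):
    """Return the label of the known vector that lies nearest to the query."""
    return min(labeled, key=lambda item: distance(query, item[1]))[0]


# Hypothetical vectors for two words the machine already "knows".
smith = [i / DIM for i in range(DIM)]
jones = [1.0 - i / DIM for i in range(DIM)]
labeled = [("Smith", smith), ("Jones", jones)]

# An unknown word whose measurements differ only slightly from "Smith".
unknown = [v + 0.01 for v in smith]

print(closest_label(unknown, labeled))
```

The unknown word is never read letter by letter; it is simply assigned to whichever known word has the most similar measurements.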
PSC’s Blacklight proved ideal for the task, McHenry says. Part of the computational problem consists of crunching data from different, largely independent entries as quickly as possible. Blacklight, while not as massively parallel as some supercomputers, has thousands of processors to do that job. More importantly, Blacklight’s best-in-class shared memory let them easily store the relatively massive data their system had extracted from the Census collection: a 30-dimensional vector for each word in each entry. This allowed the calculations to proceed without many return trips to the disk. Eliminating the lag of retrieving data from disk made the calculations run far faster than would be possible on other supercomputers.
Step 1: Correct rotations and reduce smudges and bleeds to produce a spreadsheet-like document with identifiable cells.
Step 2: Create a statistical picture of the contents of each cell.
Step 3: Classify the contents’ likely meaning based on these statistical pictures—“index” them.
Conclusion: Users can then search for specific entries, picking the “right” answers and helping the system correct itself.
“GOOD ENOUGH” ACCURACY
On average, the system accurately identified words despite the idiosyncrasies of the handwriting. Of course, “on average” means just that: sometimes the machine is correct, sometimes it isn’t. The idea is to quickly produce a “good enough” list of 10 or 20 entries that may match a person’s query rather than taking far longer to try to make it exact.
“We get some results that aren’t very good,” Diesendruck says. “But the user clicks on the ones he or she is looking for. It isn’t perfect, but instead of looking through thousands of entries you’re looking at 10 or 20 results.”
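A minimal sketch of such a “good enough” search, assuming the index is a list of (cell id, feature vector) pairs ranked by Euclidean distance to the query. The cell ids, the three-measurement vectors (shortened from 30 for readability) and the function names are hypothetical, not details of the NCSA system.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_matches(query, index, k=10):
    """Return the k index entries whose vectors lie closest to the query."""
    return heapq.nsmallest(k, index, key=lambda entry: distance(query, entry[1]))


# Hypothetical index of 1,000 handwritten cells with made-up measurements.
index = [
    (f"cell-{i}", [float(i), float(i % 7), float(i % 3)])
    for i in range(1000)
]

query = [500.0, 3.0, 2.0]
results = top_matches(query, index, k=10)

# The user scans 10 candidates instead of 1,000 entries.
for cell_id, _vector in results:
    print(cell_id)
```

Returning a short ranked list rather than a single answer means an imperfect match can still put the right entry in front of the user.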
Search engines like Google have made Web users very demanding about how long a search takes. But while they expect speed, they don’t expect extreme precision. They don’t tend to mind scanning short lists of possible answers to their query. So the script search technology is similar to what people are used to seeing on the Web, making it more likely to be accepted by end users.
There’s another virtue to how the system works, McHenry points out. “We store what they said was correct,” using the human searcher’s choice to identify the right answers and further improve the system. Such “crowd sourcing” allows the investigators to combine the best features of machine and human intelligence to improve the output of the system. “It’s a hybrid approach that tries to keep the human in the loop as much as possible.”
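One way to picture this feedback loop is a store that remembers which results users confirmed and moves them to the front of later result lists. This is a sketch under assumptions: the article does not describe how the stored choices are fed back into the system, and the class name, method names and cell ids here are hypothetical.

```python
from collections import defaultdict


class FeedbackStore:
    """Remembers which cells users confirmed as correct for a given query."""

    def __init__(self):
        # Maps a lowercased query string to the set of confirmed cell ids.
        self._confirmed = defaultdict(set)

    def record(self, query, cell_id):
        """Store the user's choice for this query."""
        self._confirmed[query.lower()].add(cell_id)

    def rerank(self, query, candidates):
        """Move previously confirmed cells to the front, keeping the rest in order."""
        confirmed = self._confirmed.get(query.lower(), set())
        # sorted() is stable, and False sorts before True.
        return sorted(candidates, key=lambda cell_id: cell_id not in confirmed)


store = FeedbackStore()
candidates = ["cell-12", "cell-40", "cell-7"]

# A user searching for "Smith" clicks the third result.
store.record("Smith", "cell-7")

# The next search for the same name lists the confirmed cell first.
reranked = store.rerank("smith", candidates)
print(reranked)
```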
Today the group is using Blacklight to carry out test searches of the 1940 Census, refining the system and preparing it for searching all sorts of handwritten records. Their work will help to keep those records alive and relevant. It will also give scholars studying those records, not just in the “hard” and social sciences but also in the humanities, the ability to use and analyze thousands of documents rather than just a few.