
In the old days, anybody interested in seeing a Mets game during a trip to New York would have to call the team, or write away, or wait to get to the city and visit the box office. Now, all it takes is to find an online ticket distributor. Sign in, click “Mets,” pick the date and pay.

But before taking the money, the Web site might first present the reader with two sets of wavy, distorted letters and ask for a transcription. These things are called Captchas, and only humans can read them. Captchas ensure that robots do not hack secure Web sites.

What Web readers do not know, however, is that they have also been enlisted in a project to transform an old book, magazine, newspaper or pamphlet into an accurate, searchable and easily sortable computer text file. One of the wavy words quite likely came from a digitized image from an old, musty text, and while the original page has already been scanned into an online database, the scanning programs made a lot of mistakes. Mets fans and other Web site users are correcting them. Buy a ticket to the ballgame, help preserve history.

The set of software tools that accomplishes this feat is called reCaptcha and was developed by a team of researchers led by Luis von Ahn, a computer scientist at Carnegie Mellon University. Its pilot project was to clean up the digitized archive of The New York Times. Today it has become the principal method used by Google to authenticate text in Google Books, its vast project to digitize and disseminate rare and out-of-print texts on the Internet.

Digitization is normally a three-stage process: create a photographic image of the text, also known as a bitmap; encode the text in a compact, easily handled and searchable form using optical character recognition software, commonly called O.C.R. Today’s technology makes the first two steps relatively straightforward. The third, however, can be extremely difficult. For vintage 19th-century texts in English, O.C.R.
