Google has plans to help the world convert its mountains of paper records into a Borg-like hive mind, using an optical character recognition program called Tesseract. Tesseract was originally developed by HP between 1985 and 1995, but was shoved to the back of the shelf when HP pulled out of the OCR business. After that, it was released to the Information Science Research Institute at UNLV so that it could be developed under an open source license.
Google sees OCR as a key technology in its aim to make information available online. When information is in a paper document, OCR is ideal for converting it into a form that is ready for indexing by Google technology.
"In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing," Google uber techie Luc Vincent said on the firm's code blog today.
Tesseract does suffer from some deficiencies that will have to be resolved. By today's standards, it does not perform very well. Also, it reads only English, does not like multiple columns or fancy layouts, and does not appear to like colour documents very much. It is, however, probably about the best open source OCR software around.
"Google currently 'reads' almost every web page in the world. Come help us read all the printed material as well!" the firm said in an advertisement for OCR engineers.