Google sees OCR as a key technology in its aim to make information available online. When information is locked in a paper document, OCR is the way to convert it into text that Google can index.
"In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing," Google uber techie Luc Vincent said on the firm's code blog today.
Tesseract has some deficiencies that will have to be resolved. By today's standards, its performance is not very good. It also reads only English, struggles with multiple columns or fancy layouts, and does not appear to handle colour documents well. It is, however, probably about the best open source OCR software around.
"Google currently 'reads' almost every web page in the world. Come help us read all the printed material as well!" the firm said in an advertisement for OCR engineers.