Using Google's Optical Character Recognition to extract text from images

Google's Optical Character Recognition (OCR) software works for 248+ languages

Image by Kate Ter Haar. Modified by Opensource.com. CC BY-SA 2.0.

Google's Optical Character Recognition (OCR) software now works for more than 248 languages, including all the major South Asian languages. It is easy to use and can detect most languages with over 90% accuracy.
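The article does not say how the accuracy figures were measured. One common way to put a number on OCR quality is character-level accuracy: compare the OCR output with a hand-typed reference and count the edits needed to turn one into the other. The sketch below assumes that definition; the function names are illustrative and are not part of any Google tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as the row
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, in [0.0, 1.0].

    Assumption: accuracy = 1 - (edit distance / reference length).
    Python compares code points, so an Indic vowel sign counts as its
    own character rather than as part of a syllable.
    """
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Two wrong characters in a 29-character reference.
score = character_accuracy("optical character recognition",
                           "optica1 charactor recognition")
```

Here `score` is 27/29, or roughly 93%. Word-level accuracy is usually lower than character-level accuracy for the same page, so the two should not be compared directly.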

The technology extracts text from images, scans of printed text, and even handwriting, which means text can be extracted from pretty much any old books, manuscripts, or images.

Google's OCR is probably using dependencies of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and optical character recognition (OCR) system that is primarily used in Google Books. Tesseract was originally developed at Hewlett-Packard between 1985 and 1994, was released as open source in 2005, and has had its development sponsored by Google from 2006. It is considered one of the most accurate OCR engines and works for over 60 languages. The source code is available on GitHub.
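Because Tesseract is free software, it can also be run locally. The following is a minimal sketch, assuming the `tesseract` command-line tool and the relevant language data (for example `ori` for Odia) are installed. It calls Tesseract directly and says nothing about how Google's hosted OCR works internally. The helper names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for the tesseract CLI.

    languages is a list of Tesseract language codes; several codes are
    joined with '+', e.g. ["ori", "eng"] becomes "ori+eng".
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", str(image_path), str(output_base),
            "-l", "+".join(languages)]


def ocr_image(image_path, output_base, languages):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("the tesseract executable was not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, languages),
        check=True,  # raise CalledProcessError if tesseract fails
    )
    # Tesseract writes plain-text output to <output_base>.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical file names, for illustration only.
command = build_tesseract_command("scan.png", "scan", ["ori", "eng"])
```

Calling `ocr_image("scan.png", "scan", ["ori"])` would then return the text recognized in the image, provided the Odia language data is installed.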

The OCR project support page describes which character formatting, such as bold and italic text, is preserved in the output text after OCR:

When processing your document, we attempt to preserve basic text formatting such as bold and italic text, font size and type, and line breaks. However, detecting these elements is difficult and we may not always succeed. Other text formatting and structuring elements such as bulleted and numbered lists, tables, text columns, and footnotes or endnotes are likely to get lost.

Ravishankar Ayyakkannu, a Tamil-language Wikimedian and Wikimedia India's program director, wrote on Facebook after testing: "For some of the languages like Malayalam and Tamil, the OCR works with almost 100% accuracy, along with support in formatting like auto cropping, separating text by discarding images, and ignoring colored backgrounds." Native speakers of several Indian languages (Bangla, Malayalam, Kannada, Odia, Tamil, and Telugu) also shared feedback on a Facebook post after testing the OCR.

However, for a few scripts like Gurmukhi (used to write Punjabi), the output after OCR is quite poor and results in gibberish text in different scripts.
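Output that mixes several scripts can be flagged automatically before anyone proofreads it. The sketch below is one possible sanity check, not something the OCR tool provides: it uses the first word of each character's Unicode name (for example `GURMUKHI` in `GURMUKHI LETTER PA`) as a rough script label, which is a heuristic that holds for the scripts discussed here but not for every Unicode block. The function names are illustrative.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters and combining marks per script.

    Heuristic: the script label is the first word of the Unicode
    character name. Digits, spaces, and punctuation are ignored.
    """
    counts = Counter()
    for ch in text:
        # Keep letters (L*) and marks (M*), which covers vowel signs.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def unexpected_script_ratio(text: str, expected: str) -> float:
    """Fraction of letters and marks that are not in the expected script."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - counts[expected] / total


# Two Gurmukhi letters, one Tamil letter, two Latin letters.
sample = "\u0a2a\u0a1c\u0b95ab"
ratio = unexpected_script_ratio(sample, "GURMUKHI")
```

For `sample`, three of the five letters fall outside Gurmukhi. A page expected to be Punjabi that scores well above zero is a candidate for re-scanning or manual correction. Note that Unicode names the Odia script `ORIYA` and the Bangla script `BENGALI`.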

A tutorial to convert text in Odia (Indian language) from a scanned image using Google's OCR. Designed by Subhashish Panigrahi. CC BY-SA 4.0

Overall, this is a big leap for languages whose older texts have not yet been digitized. Old and valuable texts in many languages can now be digitized and shared online through platforms like Wikisource.

Editor's note: Article has been updated based on community feedback. We changed "Google's OCR partly uses Tesseract, an OCR engine released as free software" to "Google's OCR is probably using dependencies of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and optical character recognition (OCR) system that is primarily used in Google Books." If you have additional feedback on the article or technology, please let us know in the comments. -Rikki Endsley

About the author

Subhashish Panigrahi (@subhapa) is the founder of OpenSpeaks, an award-winning project that helps grow open resources to digitally document marginalized languages. He co-founded O Foundation (OFDN), a nonprofit that works towards addressing issues that lie in the cusp of people, culture, and technology with Openness in its core. He is the...