You can extract text from images on the Linux command line using the Tesseract OCR engine. It is fast, accurate, and works in about 100 languages. Here is how to use it.
Optical character recognition
Optical Character Recognition (OCR) is the ability to look at an image, find the words in it, and extract them as editable text. This task is simple for humans but very difficult for computers. Early efforts were clumsy, to say the least. Computers were often confused if the typeface or size was not one the OCR software expected.
Nevertheless, the pioneers in this field were still highly esteemed. If you lost the electronic copy of a document, but still had a printed version, OCR could create an electronic, editable version. Although the results were not 100 percent accurate, this was still a fantastic time saver.
With a little manual tidying up, you would have your document back. People forgave its mistakes because they understood the complexity of the task facing an OCR package. Besides, it was better than retyping the entire document.
Things have improved significantly since then. The Tesseract OCR application, written by Hewlett-Packard, started in the 1980s as a commercial application. It was open-sourced in 2005 and is now supported by Google. It supports multiple languages, is considered one of the most accurate OCR systems available, and you can use it for free.
Install Tesseract OCR
To install Tesseract OCR on Ubuntu, use this command:
sudo apt-get install tesseract-ocr
On Fedora, the command is:
sudo dnf install tesseract
On Manjaro, you need to type:
sudo pacman -Syu tesseract
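Whichever distribution you use, a quick check confirms that the tesseract binary ended up on your path. The following is a minimal sketch rather than an official procedure; have_command is a helper name made up for this example, and --version is the only tesseract option it relies on.

```shell
# Report whether a command is available on the PATH.
# "have_command" is a helper name chosen for this example.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command tesseract; then
    # Print the first line of the version banner.
    # 2>&1 because some versions write the banner to standard error.
    tesseract --version 2>&1 | head -n 1
else
    echo "tesseract is not installed"
fi
```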
Using Tesseract OCR
We will present a set of challenges for Tesseract OCR. Our first image, which contains text, is an excerpt from recital 63 of the General Data Protection Regulation. Let’s see if OCR can read this (and stay awake).
It’s a tricky image because each sentence begins with a faint superscript number, which is typical of legislative documents.
We need to give the tesseract command some information, including:
- The name of the image file we want it to process.
- The name of the text file it will create to hold the extracted text. We do not need to provide the file extension (it will always be .txt). If a file with the same name already exists, it will be overwritten.
- We can use the --dpi option to tell tesseract what the dots per inch (dpi) resolution of the image is. If we do not specify a dpi value, tesseract will try to figure it out.
Our image file is called “recital-63.png,” and its resolution is 150 dpi. We will create a text file from it called “recital.txt.”
Our command looks like this:
tesseract recital-63.png recital --dpi 150
The results are very good. The only issue is the superscripts: they were too faint to be read correctly. A good-quality image is crucial to getting good results.
tesseract has interpreted the superscript numbers as quotation marks (“) and degree symbols (°), but the actual text has been extracted perfectly (the right side of the image had to be trimmed to fit here).
The last character is a byte with the hexadecimal value 0x0C, which is a form feed. tesseract uses it as a page separator.
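To inspect the final byte of the output file yourself, you can print it in hexadecimal with od. This is a minimal sketch; last_byte_hex is a helper name made up for this example, and “recital.txt” is the file created above.

```shell
# Print the final byte of a file as two hexadecimal digits.
# "last_byte_hex" is a name chosen for this example.
last_byte_hex() {
    # tail keeps only the last byte, od prints it in hex,
    # and tr removes the padding that od adds around the value.
    tail -c 1 "$1" | od -An -tx1 | tr -d ' \n'
}

# Usage: last_byte_hex recital.txt
```

For a file that ends with the 0x0C byte described above, the function prints 0c.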
Below is another image with text in different sizes, both bold and italic.
The name of this file is “bold-italic.png.” We want to create a text file called “bold.txt”, so our command is:
tesseract bold-italic.png bold --dpi 150
This did not create any problems and the text was extracted perfectly.
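If you only want to glance at the result, tesseract can write to standard output instead of a file when the output name is given as stdout. The sketch below assumes the 150 dpi images used in this article as its default; ocr_preview is a helper name invented for this example.

```shell
# Print the recognized text to the terminal instead of creating a file.
# Arguments: image file, optional DPI (defaults to 150, as in this article).
# "ocr_preview" is a name chosen for this example.
ocr_preview() {
    image=$1
    dpi=${2:-150}
    # The special output name "stdout" sends the text to standard output.
    tesseract "$image" stdout --dpi "$dpi"
}

# Usage: ocr_preview bold-italic.png | less
```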
Use different languages
Tesseract OCR supports about 100 languages. To use a language, you must first install it. When you find the language you want in the list of supported languages, note its abbreviation. We will install support for Welsh. Its abbreviation is “cym,” which is short for “Cymraeg,” the Welsh word for the Welsh language.
The installation package is called “tesseract-ocr-” with the language abbreviation appended to the end. To install the Welsh language file in Ubuntu we use:
sudo apt-get install tesseract-ocr-cym
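To confirm that a language pack is in place, you can ask tesseract for its installed languages with --list-langs. The following is a sketch under the assumption that the list is printed one language code per line; lang_installed is a helper name made up for this example.

```shell
# Succeed if the given language code is among the installed languages.
# "lang_installed" is a name chosen for this example.
lang_installed() {
    # 2>&1 because some tesseract versions print the list on standard error.
    # grep -F -x matches the code only as a complete, literal line.
    tesseract --list-langs 2>&1 | grep -Fqx -- "$1"
}

# Usage: lang_installed cym && echo "Welsh is available"
```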
The picture with the text is below. It is the first verse of the Welsh national anthem.
Let’s see if Tesseract OCR is up to the challenge. We will use the -l (language) option to let tesseract know the language we want to work in:
tesseract hen-wlad-fy-nhadau.png anthem -l cym --dpi 150
tesseract copes perfectly, as shown in the extracted text below. Very good, Tesseract OCR.
If your document contains two or more languages (such as a Welsh-to-English dictionary, for example), you can use a plus sign (+) to tell tesseract to add another language, like so:
tesseract image.png textfile -l eng+cym+fra
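If the set of languages varies from document to document, the plus-separated value can be built from a list instead of typed by hand. This is a minimal sketch; join_langs and ocr_multi are names invented for this example, and the file names are the placeholders from the command above.

```shell
# Join language codes with "+" to form the value for the -l option.
# "join_langs" is a name chosen for this example.
join_langs() {
    # Inside the subshell, "$*" joins the arguments with the first
    # character of IFS, which is set to "+" here.
    (IFS=+; printf '%s\n' "$*")
}

# Run OCR with any number of languages.
# Arguments: image file, output base name, then one or more language codes.
ocr_multi() {
    image=$1
    output=$2
    shift 2
    tesseract "$image" "$output" -l "$(join_langs "$@")"
}

# Usage: ocr_multi image.png textfile eng cym fra
```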
Using Tesseract OCR with PDF files
The tesseract command is designed to work with image files and cannot read PDF files. If you need to extract text from a PDF, you can first use another tool to generate a set of images, with each image representing a single page of the PDF file.
The pdftoppm tool you need should already be installed on your Linux computer. The PDF file we will use for our example is a copy of Alan Turing’s seminal paper on artificial intelligence, “Computing Machinery and Intelligence.”
We use the -png option to specify that we want to create PNG files. The filename of our PDF file is “turing.pdf.” We will call our image files “turing-01.png,” “turing-02.png,” and so on:
pdftoppm -png turing.pdf turing
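pdftoppm renders pages at 150 dpi unless told otherwise, and its -r option sets a different resolution. A higher value may help with small print, at the cost of larger image files. The wrapper below is only a sketch: pdf_to_png is a name made up for this example, and its default of 300 dpi is an assumption rather than a recommendation from this article.

```shell
# Render every page of a PDF to PNG files at a chosen resolution.
# Arguments: PDF file, image name prefix, optional DPI.
# "pdf_to_png" and its 300 dpi default are choices made for this example.
pdf_to_png() {
    pdf=$1
    prefix=$2
    dpi=${3:-300}
    pdftoppm -png -r "$dpi" "$pdf" "$prefix"
}

# Usage: pdf_to_png turing.pdf turing
```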
To run tesseract on each image file with a single command, we need to use a for loop. For each of our “turing-nn.png” files, we run tesseract and create a text file named “text-” plus the image file name, which contains “turing-nn”:
for i in turing-??.png; do tesseract "$i" "text-$i" -l eng; done;
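Because $i includes the “.png” extension, the loop above produces files named “text-turing-01.png.txt” and so on. If you would rather have “text-turing-01.txt”, shell parameter expansion can strip the extension first. This variation is a sketch, and text_base is a helper name made up for this example.

```shell
# Build the output base name for an image: drop ".png" and prepend "text-".
# "text_base" is a name chosen for this example.
text_base() {
    printf 'text-%s\n' "${1%.png}"
}

for i in turing-??.png; do
    # Skip the unexpanded pattern when no matching files exist.
    [ -e "$i" ] || continue
    tesseract "$i" "$(text_base "$i")" -l eng
done
```

The later cat text-turing* command still matches these shorter names.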
To combine all the text files into one, we can use cat:
cat text-turing* > complete.txt
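The shell expands text-turing* in sorted order, so the zero-padded page numbers keep the pages in sequence. Each per-page file ends with a 0x0C byte, and those bytes are carried into the combined file. If you do not want them, a variation such as the following removes them while concatenating. It is a sketch, and combine_pages is a name invented for this example.

```shell
# Concatenate the given text files in order, deleting every
# form feed (0x0C) byte along the way.
# "combine_pages" is a name chosen for this example.
combine_pages() {
    cat "$@" | tr -d '\f'
}

# Usage: combine_pages text-turing* > complete.txt
```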
So how did it do? Very good, as you can see below. However, the first page looks quite challenging. It has different text styles and sizes and decoration. There is also a vertical “watermark” on the right side of the page.
But the output is close to the original. Obviously, the formatting was lost, but the text is correct.
The vertical watermark was transcribed as a row of gibberish at the bottom of the page. The text was too small for tesseract to read accurately, but it would be easy enough to find and delete. A worse result would have been stray characters at the end of each line.
Oddly enough, the individual letters at the beginning of the list of questions and answers on page two have been ignored. The section from the PDF file is shown below.
As you can see below, the questions remain, but “Q” and “A” at the beginning of each line were lost.
Charts will not be transcribed correctly. Let’s look at what happens when we try to extract the one shown below from the Turing PDF.
As you can see in our result below, the characters were read, but the chart format was lost.
tesseract struggled with the small size of the subscripts, and they were rendered incorrectly.
But in fairness, it was still a good result. We did not get a clean, straightforward text extraction, but this example was deliberately chosen because it presented a challenge.
A good solution when you need it
OCR is not something you need to use daily. But when the need arises, it is good to know that you have one of the best OCR engines at your disposal.