
How to Do OCR from the Linux Command Line Using Tesseract



A terminal window on a Linux laptop.
Fatmawati Achmad Zaenuri / Shutterstock

You can extract text from images on the Linux command line using the Tesseract OCR engine. It is fast, accurate, and works with about 100 languages. Here's how to use it.

Optical character recognition

Optical Character Recognition (OCR) is the ability to look at an image, find the words in it, and extract them as editable text. This task is simple for humans but very difficult for computers. Early efforts were clumsy, to say the least. Computers were often confused if the typeface or size was not one the OCR software expected.

Nevertheless, the pioneers in this field were still highly esteemed. If you lost the electronic copy of a document, but still had a printed version, OCR could create an electronic, editable version. Although the results were not 100 percent accurate, this was still a fantastic time saver.

With a little manual cleanup, you would get your document back. People forgave its mistakes because they understood the complexity of the task facing an OCR package. Also, it was better than retyping the entire document.

Things have improved significantly since then. The Tesseract OCR application, written by Hewlett Packard, started in the 1980s as a commercial application. It was open-sourced in 2005 and is now supported by Google. It supports multiple languages, is considered one of the most accurate OCR systems available, and is free to use.

Install Tesseract OCR

To install Tesseract OCR on Ubuntu, use this command:

sudo apt-get install tesseract-ocr


On Fedora, the command is:

sudo dnf install tesseract


On Manjaro, you need to type:

sudo pacman -Syu tesseract
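Whichever distribution you use, it can be worth confirming that the install worked before moving on. The following is a minimal sketch, assuming a POSIX shell; the variable name tesseract_status is only illustrative.

```shell
# Check whether the tesseract binary is on the PATH before going further.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version        # prints the version and the libraries it was built with
    tesseract --list-langs     # prints the language packs that are installed
    tesseract_status="installed"
else
    echo "tesseract not found; install it with your package manager" >&2
    tesseract_status="missing"
fi
```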


Using Tesseract OCR

We will present a set of challenges for Tesseract OCR. Our first image, which contains text, is an excerpt from recital 63 of the General Data Protection Regulation. Let’s see if OCR can read this (and stay awake).

Extract from recital 63 of the GDPR.

It’s a tricky image because each sentence begins with a faint superscript number, which is typical of legislative documents.

We need to give the tesseract command some information, including:

  • The name of the image file we want it to process.
  • The name of the text file it will create to hold the extracted text. We do not need to provide the file extension (it will always be .txt). If a file with the same name already exists, it will be overwritten.
  • The --dpi option, which tells tesseract the resolution of the image in dots per inch (dpi). If we do not provide a dpi value, tesseract will try to work it out.

Our image file is called “recital-63.png,” and its resolution is 150 dpi. We will create a text file from it called “recital.txt.”

Our command looks like this:

tesseract recital-63.png recital --dpi 150
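Because tesseract overwrites an existing output file without asking, a small guard can save a hand-corrected transcript from being lost. The sketch below is one possible approach; ocr_no_clobber is a hypothetical helper name, not part of Tesseract, and the 150 dpi default is simply the value used in this article.

```shell
# Hypothetical helper: run tesseract only if the output file does not exist yet.
# Usage: ocr_no_clobber IMAGE OUTPUT_BASE [DPI]
ocr_no_clobber() {
    image="$1"
    base="$2"
    dpi="${3:-150}"            # assumed default, matching the article's images

    # tesseract appends .txt to the output base, so that is the file to check.
    if [ -e "$base.txt" ]; then
        echo "refusing to overwrite $base.txt" >&2
        return 1
    fi
    tesseract "$image" "$base" --dpi "$dpi"
}

# Example call (left commented out so nothing runs by accident):
# ocr_no_clobber recital-63.png recital 150
```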


The results are very good. The only issue is the superscripts: they were too faint to be read correctly. A good quality image is crucial for good results.

Extracted text from recital 63.

tesseract has interpreted the superscript numbers as quotation marks (“) and degree symbols (°), but the actual text has been extracted perfectly (the right side of the image had to be trimmed to fit here).

The last character is a byte with the hexadecimal value 0x0C, which is a form feed (page break) character.
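If that trailing 0x0C byte (the ASCII form feed) gets in the way, it can be inspected and stripped with standard tools. This is a minimal sketch that uses a stand-in file created with printf rather than real tesseract output, so the file contents are an assumption.

```shell
# Stand-in for tesseract output: a line of text followed by a form feed (0x0C).
sample=$(mktemp)
printf 'Sample text\n\f' > "$sample"

# Show the final byte of the file in hexadecimal.
tail -c 1 "$sample" | od -An -tx1

# Write a copy with every form feed removed, then check its final byte.
clean=$(mktemp)
tr -d '\f' < "$sample" > "$clean"
tail -c 1 "$clean" | od -An -tx1
```

The same tr command can be pointed at a real output file such as recital.txt.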

Below is another image with text in different sizes, both bold and italic.

Image with different text sizes in bold and italics.

The name of this file is “bold-italic.png.” We want to create a text file called “bold.txt”, so our command is:

tesseract bold-italic.png bold --dpi 150


This did not create any problems and the text was extracted perfectly.

Extracted text from bold-italic.png.

Use different languages

Tesseract OCR supports about 100 languages. To use a language, you must first install it. Find the language you want in the list of supported languages and note its abbreviation. We will install support for Welsh. The abbreviation is “cym,” from “Cymraeg,” the Welsh name for the language.

The installation package is called “tesseract-ocr-” with the language abbreviation appended to the end. To install the Welsh language file in Ubuntu we use:

sudo apt-get install tesseract-ocr-cym


The picture with the text is below. It is the first verse of the Welsh national anthem.

image containing lyrics from the first verse of the Welsh national anthem.

Let’s see if Tesseract OCR is up to the challenge. We will use the -l (language) option to tell tesseract which language we want to work in:

tesseract hen-wlad-fy-nhadau.png anthem -l cym --dpi 150


tesseract handles it perfectly, as shown in the extracted text below. Well done, Tesseract OCR.

Extracted Welsh text.

If your document contains two or more languages (a Welsh-to-English dictionary, for example), you can use a plus sign (+) to tell tesseract to add another language, like so:

tesseract image.png textfile -l eng+cym+fra
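Before running a multi-language command, it may help to check that every language in the specification is actually installed. The sketch below is one way to do that; missing_langs is a hypothetical helper, and installed_sample is made-up data standing in for the output of tesseract --list-langs.

```shell
# Hypothetical helper: print every language in a spec such as "eng+cym+fra"
# that is absent from a newline-separated list of installed language codes.
missing_langs() {
    spec="$1"
    installed="$2"
    for lang in $(printf '%s' "$spec" | tr '+' ' '); do
        # -x matches whole lines only, -F treats the code as a fixed string.
        printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
    done
}

# Sample data; on a real system the list could come from: tesseract --list-langs
# (its first line is a header, which never matches a language code).
installed_sample="eng
osd
cym"

missing_langs "eng+cym+fra" "$installed_sample"   # prints: fra
```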

Using Tesseract OCR with PDF files

The tesseract command is designed to work with image files and cannot read PDF files. If you need to extract text from a PDF, you can first use another tool to generate a set of images. Each image represents a single page of the PDF file.

The pdftoppm tool you need should already be installed on your Linux computer. The PDF file we will use for our example is a copy of Alan Turing’s seminal paper on artificial intelligence, “Computing Machinery and Intelligence.”

Title page of the Turing PDF.

We use the -png option to specify that we want to create PNG files. The filename of our PDF file is “turing.pdf.” We will call our image files “turing-01.png,” “turing-02.png,” and so on:

pdftoppm -png turing.pdf turing
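pdftoppm renders pages at 150 dpi unless told otherwise, and its -r option sets a different resolution. Rendering at a higher resolution may help with small print, although that is not guaranteed. The sketch below assumes 300 dpi as an arbitrary example value; render_status is only an illustrative variable.

```shell
pdf="turing.pdf"      # file name from the article
dpi=300               # assumed value for illustration; pdftoppm defaults to 150

# Only render if the tool and the PDF are both available.
if command -v pdftoppm >/dev/null 2>&1 && [ -f "$pdf" ]; then
    pdftoppm -r "$dpi" -png "$pdf" turing
    render_status="rendered"
else
    echo "pdftoppm or $pdf not available; skipping" >&2
    render_status="skipped"
fi
```

If you change the rendering resolution, pass the same value to tesseract with --dpi so the two agree.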


To run tesseract on each image file with a single command, we need to use a for loop. For each of our “turing-nn.png” files, we run tesseract and create a text file whose name is “text-” followed by the image filename:

for i in turing-??.png; do tesseract "$i" "text-$i" -l eng; done;
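Because "$i" still carries its .png extension, this loop produces names such as text-turing-01.png.txt. If you would rather drop the extension, shell parameter expansion can do it. This is a sketch only; out_base and ocr_pages are hypothetical helper names.

```shell
# Hypothetical helper: derive the output base name for one page image,
# for example turing-01.png -> text-turing-01
out_base() {
    printf 'text-%s\n' "${1%.png}"     # ${1%.png} strips a trailing .png
}

# Hypothetical helper: OCR every image passed as an argument.
ocr_pages() {
    for page in "$@"; do
        tesseract "$page" "$(out_base "$page")" -l eng
    done
}

# Example call (left commented out so nothing runs by accident):
# ocr_pages turing-??.png
```

Either naming scheme still matches the text-turing* pattern used to combine the files.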


To combine all text files into one we can use cat:

cat text-turing* > complete.txt
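The shell expands text-turing* in sorted order, so the zero-padded page numbers keep the pages in sequence. The sketch below demonstrates this with small stand-in files in a temporary directory rather than real OCR output.

```shell
# Create stand-ins for per-page OCR output in a scratch directory.
workdir=$(mktemp -d)
for n in 10 02 01; do
    printf 'page %s\n' "$n" > "$workdir/text-turing-$n.png.txt"
done

# The glob is expanded in sorted order, so 01 comes before 02 and 10.
cat "$workdir"/text-turing* > "$workdir/complete.txt"
cat "$workdir/complete.txt"
```

Without zero padding, a file numbered 10 would sort before one numbered 2, and the pages would be combined out of order.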


So how did it do? Very well, as you can see below. The first page looks quite challenging, though. It has different text styles, sizes, and decoration. There is also a vertical “watermark” on the right side of the page.

But the output is close to the original. The formatting is lost, of course, but the text is correct.

First page with extracted text from Turing PDF.

The vertical watermark was transcribed as a row of gibberish at the bottom of the page. The text was too small for tesseract to read accurately, but it would be easy enough to find and delete. The worst result would have been stray characters at the end of each line.

Oddly enough, the individual letters at the beginning of the list of questions and answers on page two have been ignored. The section from the PDF file is shown below.

A list of questions and answers from the Turing paper PDF document.

As you can see below, the questions remain, but “Q” and “A” at the beginning of each line were lost.

Extracted text from the question and answer page in Turing PDF.

Charts will not be transcribed correctly. Let’s look at what happens when we try to extract the one shown below from the Turing PDF.

A chart from the Turing PDF.

As you can see in our result below, the characters were read, but the chart format was lost.

Extracted text from a chart in Turing PDF.

Again, tesseract struggled with the small size of the subscripts, and they were rendered incorrectly.

But in fairness, it was still a good result. We could not extract clean, straightforward text, but this example was deliberately chosen because it presented a challenge.

A good solution when you need it

OCR is not something you need to use daily. But when the need arises, it is good to know that you have one of the best OCR engines at your disposal.



