Pre-Process for OCR with tesseract. Only Digits.

unimatrix27
Posts: 2
Joined: 2016-01-28T05:13:25-07:00
Authentication code: 1151

Post by unimatrix27 »

Hi,

I have to process screenshots of Excel-style tables that contain mostly numbers. Since the screenshot always has the same size and all columns are at fixed pixel positions, I have implemented a script that crops the individual cells into small images. They have a very low resolution, just 15-19 pixels in height. The digits have different foreground and background colors, and sometimes the background color is produced by dithering (PNG), which makes OCR much more difficult. I tried several techniques to improve the image quality and OCR already works quite well, but I still get some mistakes, so I would like to ask whether you have any other suggestions for improving the images. Below are some examples of originals and my improved results.

Image -> Image
options used: -scale 1000% -blur 0x02 -colorspace gray -threshold 60% -type bilevel
---
Image -> Image
options used: -scale 1000% -posterize 2 -blur 0x02 -colorspace gray -threshold 65% -type bilevel

Image
Image
options used: -scale 1000% -blur 0x02 -colorspace gray -threshold 17% -type bilevel
---

I detect the foreground and background colors and, based on the result, use different arguments for convert.

I am still a beginner with ImageMagick and image processing in general, so I am wondering whether there are other ways to improve the OCR readability of these images. I should add that I have not yet trained tesseract on these images. It seems to be quite a complicated process. I am willing to do it, but I would like to get the best out of image processing first. In particular, I would like to avoid the overlapping of adjacent characters caused by the blur I am applying.

Thanks for your feedback.

I am using Ubuntu and the shell version of convert, called from a Perl script, since I did not find all the options in PerlMagick.

Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
snibgo
Posts: 12159
Joined: 2010-01-23T23:01:33-07:00
Authentication code: 1151
Location: England, UK

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by snibgo »

Thresholding is a bad idea. It makes the output either "on" or "off", so will merge adjacent digits. One of the "-level" variants will work better.
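To illustrate the difference, here is a minimal Python sketch (not ImageMagick itself) that applies a hard threshold and a linear level stretch to one row of grey values. The sample row and the cut-off values are made up for illustration; the `level` function only approximates what a "-level"-style operation does.

```python
def threshold(pixels, cut):
    """Map each grey value (0-255) to pure black or pure white."""
    return [255 if p > cut else 0 for p in pixels]


def level(pixels, black, white):
    """Linearly stretch the range [black, white] to [0, 255], clamping the rest."""
    out = []
    for p in pixels:
        scaled = (p - black) * 255 / (white - black)
        out.append(int(round(min(255, max(0, scaled)))))
    return out


# Hypothetical row crossing two dark digits with a slightly lighter gap between them.
row = [40, 70, 110, 70, 40]

merged = threshold(row, 128)      # the gap falls below the cut and turns black too
stretched = level(row, 50, 150)   # the gap survives as a lighter grey step

print(merged)
print(stretched)
```

With the threshold, the lighter pixel between the two digits becomes as black as the digits themselves, so they merge. The level stretch increases contrast but keeps the gap distinguishable.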

Sadly, in my limited experience of Tesseract, it needs characters to be at least 20 pixels high to perform reliably. Yours are only 8 pixels high. Perhaps Tesseract can be told to only look for digits (and punctuation). If so, that might increase reliability.
snibgo's IM pages: im.snibgo.com
unimatrix27

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by unimatrix27 »

Thanks for your reply. I realize what threshold does; however, it seems the results are still better if tesseract gets a pure black-and-white image as input.

I actually had not thought about having tesseract detect only digits (plus a comma, a dot and a dash). I have now found out how to configure that, and it already works much better. Thank you very much!
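The post does not say how the restriction was configured. One common way, shown here as an assumption, is the `tessedit_char_whitelist` configuration variable passed with `-c`. The sketch below only builds the command line; the file names are hypothetical, and whether the whitelist is honoured can depend on the tesseract version and OCR engine in use.

```python
import shlex


def tesseract_digits_command(image_path, output_base, whitelist="0123456789.,-"):
    """Build a tesseract call that restricts recognition to the given characters.

    Assumption: the installed tesseract accepts `-c name=value` and the
    `tessedit_char_whitelist` variable.
    """
    return [
        "tesseract", image_path, output_base,
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


# Hypothetical file names for one cropped table cell.
cmd = tesseract_digits_command("cell.png", "cell")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be handed to `subprocess.run(cmd)` directly, which avoids shell quoting problems with the punctuation in the whitelist.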
joluinfante
Posts: 5
Joined: 2011-05-17T11:16:53-07:00
Authentication code: 8675308

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by joluinfante »

Hi,
I have a similar problem recognizing characters in images.
Unfortunately, I have thousands of images scanned at 200x200, and what I get are widely spaced dots that the human eye can clearly read as numbers. For tesseract, however, there are many errors, and in some cases it does not even "understand" what is written. I tried other tools (Gamera, OpenCV), but the problem is that I need some transformation that joins the closest dots, so that each symbol is interpreted as a glyph.

I have a sample image to attach, but I don't know how to do that in this post.

My approach was to extract from the original image the boxes that each hold a specific piece of data. Now I need to be able to read that data.
joluinfante

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by joluinfante »

Here is the sample image:
Image

You can see the digits are fine for a human, but for OCR they are just a lot of separate dots.

Can anyone help me with a transformation to join the dots?
snibgo

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by snibgo »

This joins the black pixels within each character, without joining characters:
convert izq_1_pun.png -morphology open disk:1 out.png
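ImageMagick's morphology treats white as the foreground, so "open" on a black-on-white image acts as a closing of the black ink. The following pure-Python sketch shows that closing (dilate, then erode) on a small binary grid where 1 means ink. The plus-shaped neighbourhood is an assumed stand-in for the `disk:1` kernel, and the sample grid is invented.

```python
# Plus-shaped neighbourhood of radius 1 (assumed approximation of disk:1).
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def dilate(grid):
    """A pixel becomes ink if any pixel in its neighbourhood is ink."""
    h, w = len(grid), len(grid[0])
    return [[int(any(0 <= y + dy < h and 0 <= x + dx < w and grid[y + dy][x + dx]
                     for dy, dx in OFFSETS))
             for x in range(w)] for y in range(h)]


def erode(grid):
    """A pixel stays ink only if its whole neighbourhood is ink.

    Pixels outside the image count as ink so the borders are not eaten away.
    """
    h, w = len(grid), len(grid[0])
    return [[int(all(not (0 <= y + dy < h and 0 <= x + dx < w) or grid[y + dy][x + dx]
                     for dy, dx in OFFSETS))
             for x in range(w)] for y in range(h)]


def close_ink(grid):
    """Morphological closing of the ink: dilate, then erode."""
    return erode(dilate(grid))


# Dotted strokes with 1-pixel gaps inside a character and a 3-pixel gap between characters.
dots = [[1, 0, 1, 0, 0, 0, 1, 0, 1] for _ in range(3)]
joined = close_ink(dots)
for row in joined:
    print(row)
```

The 1-pixel gaps inside each character are filled, while the 3-pixel gap between the characters stays open. A larger kernel would also bridge wider gaps, at the risk of joining neighbouring characters.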
joluinfante

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by joluinfante »

Okay. Thanks for answering.
I have another related problem.
I have this set of numbers in this form:

Image

When I attempt OCR on the whole image, I get a lot of errors.
Since the positions of the boxes are quite consistent, I put together a small program that cuts out each box, leaving numbers that are much easier to detect.
However, in some cases the lines of the boxes (the cut is not perfect) contaminate the OCR process.
Is there any standard method to remove these lines? The edges are incomplete because the cut is not that accurate, as can be seen in this picture:

Image

An alternative would be some kind of library that I could train to detect shapes. But what I have found are attempts to detect objects by analyzing the complete contour, not shapes (or parts of them), as in my case.
Can you think of any option?
I have no problem developing routines on top of some base library, but I'm not finding the right one. Obviously, I cannot pay for a solution; I must limit myself to free software.

TIA
snibgo

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by snibgo »

There are many ways to remove vertical and horizontal lines from that image. If the numbers never touch the lines, "floodfill" might be used:

1. Find a long black horizontal line.

2. Pick any pixel on that line.

3. Floodfill with a fuzz at that point.

4. Repeat from (1) until no more are found.

Do the same for vertical lines.
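The steps above can be sketched in plain Python on a binary grid (1 means black ink). This is only an illustration of the procedure, not ImageMagick code: the function names, the `min_len` parameter and the sample grid are made up, and because the grid is strictly black and white there is no fuzz. As in the post, it assumes the digits never touch the lines.

```python
def longest_run(row):
    """Return (start, length) of the longest run of ink in one row."""
    best_start, best_len, start = 0, 0, None
    for x, value in enumerate(list(row) + [0]):   # trailing 0 closes a final run
        if value and start is None:
            start = x
        elif not value and start is not None:
            if x - start > best_len:
                best_start, best_len = start, x - start
            start = None
    return best_start, best_len


def flood_clear(grid, y, x):
    """Erase the 4-connected ink region that contains (y, x)."""
    stack = [(y, x)]
    while stack:
        cy, cx = stack.pop()
        if 0 <= cy < len(grid) and 0 <= cx < len(grid[0]) and grid[cy][cx]:
            grid[cy][cx] = 0
            stack.extend([(cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)])


def remove_horizontal_lines(grid, min_len):
    """Steps 1-4: find a long run, floodfill from it, repeat until none is left."""
    while True:
        for y, row in enumerate(grid):
            start, length = longest_run(row)
            if length >= min_len:
                flood_clear(grid, y, start)
                break
        else:
            return grid


def remove_lines(grid, min_len):
    """Remove horizontal lines, then vertical ones by working on the transpose."""
    remove_horizontal_lines(grid, min_len)
    transposed = [list(col) for col in zip(*grid)]
    remove_horizontal_lines(transposed, min_len)
    return [list(row) for row in zip(*transposed)]


# Hypothetical cell: a partial top edge, a partial right edge, and a 2x2 "digit".
cell = [
    [1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
]
cleaned = remove_lines(cell, min_len=5)
for row in cleaned:
    print(row)
```

`min_len` has to be longer than the widest stroke of any digit, otherwise a digit could be mistaken for a line and erased.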
joluinfante

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by joluinfante »

Thanks for your response.
Do you recommend any library to do this?
snibgo

Re: Pre-Process for OCR with tesseract. Only Digits.

Post by snibgo »

I would use ImageMagick.