Pre-Process for OCR with tesseract. Only Digits.

Post by **unimatrix27** » 2016-01-28T05:47:33-07:00

Hi,

I have to process screenshots of Excel-style tables that contain mostly numbers. Since the screenshot always has the same size and all columns are at fixed pixel positions, I have implemented a script that crops the individual cells into small images. They have a very low resolution, just 15-19 pixels high. The digits have different foreground and background colors, and sometimes the background color is produced by dithering (PNG), which makes OCR much more difficult. I have tried several techniques to improve the image quality, and OCR already works quite well, but I still get some mistakes, so I would like to ask whether you have any other suggestions for improving the images. Below are some examples of originals and my improved results.

[image: original] -> [image: processed result]

options used: -scale 1000% -blur 0x02 -colorspace gray -threshold 60% -type bilevel
---

[image: original] -> [image: processed result]

options used: -scale 1000% -posterize 2 -blur 0x02 -colorspace gray -threshold 65% -type bilevel

options used: -scale 1000% -blur 0x02 -colorspace gray -threshold 17% -type bilevel
---

I am detecting foreground and background colors and based on the result use different arguments for convert.
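The post does not show how the colors are detected or how the arguments are chosen, so the following is only a rough sketch of one possible approach. The function name `pick_convert_args`, the "most common value is the background" heuristic and the midpoint threshold are all assumptions made for illustration; only the `convert` options themselves (`-scale`, `-colorspace`, `-threshold`, `-negate`) are real ImageMagick options.

```python
from collections import Counter

def pick_convert_args(gray_pixels):
    """Choose convert arguments for one cell (hypothetical helper).

    gray_pixels: flat list of 0-255 luminance values for the cropped cell.
    """
    counts = Counter(gray_pixels)
    # Assumption: the most frequent value is the background.
    background = counts.most_common(1)[0][0]
    # Assumption: the foreground is the value furthest from the background.
    foreground = max(counts, key=lambda v: abs(v - background))
    # Put the threshold halfway between the two, as a percentage.
    threshold_pct = round((background + foreground) / 2 / 255 * 100)
    args = ["-scale", "1000%", "-colorspace", "gray",
            "-threshold", f"{threshold_pct}%"]
    if foreground > background:
        # Light text on a dark background: invert to get black on white.
        args.append("-negate")
    return args

dark_on_light = [240] * 50 + [20] * 10
light_on_dark = [10] * 50 + [200] * 10
print(pick_convert_args(dark_on_light))
print(pick_convert_args(light_on_dark))
```

A dithered background has no single dominant value, so a real script would probably need to smooth or posterize the cell before counting colors.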

I am still a beginner with ImageMagick and image processing in general, so I am wondering whether there are other ways to improve the OCR readability of those images. I should add that I have not yet trained Tesseract on those images. It seems to be quite a complicated process; I am willing to do it, but I would like to get the best out of image processing first. In particular, I would like to avoid the overlapping of letters caused by the blur I am applying.

Thanks for your feedback.

I am using Ubuntu and the shell version of convert, called from a Perl script, since I did not find all the options in PerlMagick.

Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

Post by **snibgo** » 2016-01-28T22:44:27-07:00

Thresholding is a bad idea. It makes the output either "on" or "off", so will merge adjacent digits. One of the "-level" variants will work better.
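To illustrate the difference, here is a minimal sketch in plain Python (not ImageMagick itself) that applies a hard threshold and a linear level stretch to one row of gray values. The row is made up: two dark strokes separated by a single blurred gap pixel. The `level` function assumes the simple linear form of "-level black%,white%" with gamma 1.0.

```python
def threshold(value, cutoff_pct):
    """Hard threshold: values above the cutoff become white, the rest black."""
    return 255 if value > cutoff_pct / 100 * 255 else 0

def level(value, black_pct, white_pct):
    """Linear stretch: black point maps to 0, white point to 255, clipped."""
    lo = black_pct / 100 * 255
    hi = white_pct / 100 * 255
    scaled = (value - lo) / (hi - lo)
    return round(min(1.0, max(0.0, scaled)) * 255)

# Hypothetical row: dark stroke, dark stroke, blurred gap, dark stroke, dark stroke.
row = [30, 30, 130, 30, 30]

thresholded = [threshold(v, 60) for v in row]
leveled = [level(v, 20, 80) for v in row]

print(thresholded)  # the gap pixel turns black, so the two strokes merge
print(leveled)      # the gap pixel stays gray, so the strokes remain separable
```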

Sadly, in my limited experience of Tesseract, it needs characters to be at least 20 pixels high to perform reliably. Yours are only 8 pixels high. Perhaps Tesseract can be told to only look for digits (and punctuation). If so, that might increase reliability.

Post by **unimatrix27** » 2016-01-29T02:53:57-07:00

Thanks for your reply. I realize what threshold does; however, it seems the results are still better if Tesseract gets a pure black-and-white image as input.

I actually had not thought about having Tesseract detect only digits (plus a comma, a dot and a dash). I have now found out how to configure that, and it already works much better. Thank you very much!
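The post does not say which setting was used. One common way is Tesseract's `tessedit_char_whitelist` variable, passed with `-c` on the command line. The sketch below only builds and runs such a command; the function names and the file name `cell.png` are hypothetical, and writing to `stdout` as the output base assumes a Tesseract version that supports it. Whether the whitelist is honored can depend on the Tesseract version and engine in use.

```python
import subprocess

def tesseract_digits_command(image_path, allowed="0123456789.,-"):
    """Build a tesseract command restricted to the given characters."""
    return ["tesseract", image_path, "stdout",
            "-c", f"tessedit_char_whitelist={allowed}"]

def ocr_digits(image_path):
    """Run tesseract on one cell image and return the recognized text."""
    result = subprocess.run(tesseract_digits_command(image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Show the command without running it ("cell.png" is a placeholder name).
print(" ".join(tesseract_digits_command("cell.png")))
```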

Post by **joluinfante** » 2016-09-26T04:56:53-07:00

Hi
I have a similar problem recognizing characters in images.
Unfortunately, I have thousands of images scanned at 200x200, and the characters come out as widely spaced dots. The human eye can clearly read them as numbers, but Tesseract makes many errors, and in some cases it cannot make sense of them at all. I tried other tools (Gamera, OpenCV), but what I need is some transformation that joins the nearest dots, so that each symbol is interpreted as a glyph.

I have a sample image to attach, but I don't know how to do that in this post.

My approach was to extract from the original image the boxes that each hold a specific piece of data. Now I need to be able to read that data.

Post by **joluinfante** » 2016-09-26T05:03:46-07:00

Here is the sample image:

You can see the digits are fine for a human, but for OCR they are just a lot of separate dots.

Can anyone help me with a transformation that joins the dots?

Post by **snibgo** » 2016-09-26T06:19:04-07:00

This joins the black pixels within each character, without joining characters:

convert izq_1_pun.png -morphology open disk:1 out.png
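It may look odd that "open" joins black pixels. ImageMagick's morphology treats white as the foreground, so opening the white background closes small gaps in black-on-white text. Here is a minimal pure-Python sketch of that idea. It assumes a plus-shaped 3x3 kernel as a stand-in for disk:1 and ignores neighbors outside the image, which may differ from ImageMagick's edge handling; the test pattern is made up.

```python
# Plus-shaped kernel offsets (assumed approximation of disk:1).
PLUS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def _apply(grid, combine):
    """Combine each pixel with its in-bounds plus-shaped neighborhood."""
    h, w = len(grid), len(grid[0])
    return [[combine(grid[r + dr][c + dc] for dr, dc in PLUS
                     if 0 <= r + dr < h and 0 <= c + dc < w)
             for c in range(w)] for r in range(h)]

def open_white(rows):
    """Morphological open on the white pixels ('.'); black is '#'."""
    white = [[ch == "." for ch in row] for row in rows]
    eroded = _apply(white, all)    # erode white: white areas shrink
    opened = _apply(eroded, any)   # dilate white: surviving areas grow back
    return ["".join("." if px else "#" for px in row) for row in opened]

W = ".............."
S = "..##.##...##.."  # strokes with a 1-pixel gap, then a 3-pixel gap
rows = [W, W, S, S, S, W, W]

result = open_white(rows)
for line in result:
    print(line)
# The 1-pixel gap is bridged in the middle row; the 3-pixel gap is untouched.
```

Gaps narrower than the kernel get filled, while wider gaps between characters survive, which is why the characters are not joined to each other.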

Post by **joluinfante** » 2016-09-30T03:35:22-07:00

Okay. Thanks for answering.
I have another related problem.
I have this set of numbers in this form:

I tried to OCR it as a whole, and I got a lot of errors.
As the positions of the boxes are quite accurate, I put together a small program that crops each box, which leaves numbers that are much easier to detect.
Now, in some cases, the lines of the boxes (which are not so perfect) spoil the OCR process.
Is there any standard method to remove these lines? (The edges are not complete, because the crop is not very accurate, as can be seen in this picture):

An alternative would be some kind of library that I could train to detect shapes. But what I have found are attempts to detect objects by analyzing the complete contour, not shapes (or parts of them), as in my case.
Can you think of any option?
I have no problem developing routines on top of some kind of library, but I am not finding the right one. Obviously, I cannot pay for a solution; I must limit myself to free software.

TIA

Post by **snibgo** » 2016-09-30T06:26:47-07:00

There are many ways to remove vertical and horizontal lines from that image. If the numbers never touch the lines, "floodfill" might be used:

1. Find a long black horizontal line.

2. Pick any pixel on that line.

3. Floodfill with a fuzz at that point.

4. Repeat from (1) until no more are found.

Do the same for vertical lines.
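The steps above can be sketched in plain Python on a black-and-white grid. This is only an illustration of the idea, not ImageMagick code. It assumes a purely binary image, so the fuzz is left out, and it uses 4-connected filling; the function names, the minimum run length and the sample grid are all made up.

```python
from collections import deque

def find_long_run(grid, min_len):
    """Return (row, col) of a pixel on a horizontal black run, or None."""
    for r, row in enumerate(grid):
        run = 0
        for c, black in enumerate(row):
            run = run + 1 if black else 0
            if run >= min_len:
                return r, c
    return None

def floodfill_white(grid, start):
    """Turn the 4-connected black region containing start to white."""
    h, w = len(grid), len(grid[0])
    grid[start[0]][start[1]] = False
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and grid[nr][nc]:
                grid[nr][nc] = False
                queue.append((nr, nc))

def erase_long_runs(grid, min_len):
    """Repeat find + floodfill until no long horizontal run is left."""
    while True:
        hit = find_long_run(grid, min_len)
        if hit is None:
            return
        floodfill_white(grid, hit)

def remove_lines(rows, min_len):
    """Remove long horizontal, then vertical, lines ('#' is black)."""
    grid = [[ch == "#" for ch in row] for row in rows]
    erase_long_runs(grid, min_len)
    # Vertical lines: transpose, reuse the horizontal pass, transpose back.
    transposed = [list(col) for col in zip(*grid)]
    erase_long_runs(transposed, min_len)
    grid = [list(col) for col in zip(*transposed)]
    return ["".join("#" if px else "." for px in row) for row in grid]

box = ["#########",
       "#.......#",
       "#..##...#",
       "#..##...#",
       "#.......#",
       "#########"]
cleaned = remove_lines(box, min_len=5)
for line in cleaned:
    print(line)
```

As the post warns, this only works when the numbers never touch the lines: any digit connected to a line would be filled along with it.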

Post by **joluinfante** » 2016-09-30T06:53:14-07:00

Thanks for your response.
Do you recommend any library to do this?

Post by **snibgo** » 2016-09-30T07:32:58-07:00

I would use ImageMagick.
