convert -shave ruins image quality and strips OCR'd text

Post by bastardo »

When running the following command:

convert in.pdf -shave 20x20 out.pdf

on Ubuntu, where in.pdf is a monochrome image of scanned text at 300 dpi, there are two bugs:

1. The image in out.pdf is seriously degraded, to the point where the text goes from crisp to completely unreadable.

2. The OCR'd text is selectable in in.pdf, but appears to have been stripped out of out.pdf.

I can provide in.pdf and out.pdf to whoever sends me an email address.

Post by el_supremo »

"1. The image in out.pdf is seriously degraded, to the point where the text goes from crisp to completely unreadable."
You need to specify a density for the PDF; otherwise it is rasterized at the default 72 dpi, which makes the text look horrible.
Try this:

convert -density 300 in.pdf -shave 20x20 out.pdf
This produces a larger raster (300/72, or roughly 4.2 times as many pixels in each dimension), so you'll probably have to scale the values passed to -shave to match.
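As a rough sketch of that adjustment, assuming the 20x20 was chosen against the default 72 dpi rendering (the thread does not say so), the shave width can be scaled by the density ratio. The helper name scale_shave is made up for illustration:

```shell
# Scale a shave width that was chosen at 72 dpi to another density.
# scale_shave is a hypothetical helper, not part of ImageMagick.
scale_shave() {
    # $1 = pixels at 72 dpi, $2 = target density in dpi
    # Adding 36 (half of 72) rounds to the nearest whole pixel.
    echo $(( ($1 * $2 + 36) / 72 ))
}

px=$(scale_shave 20 300)
# Print the resulting command rather than running it.
echo "convert -density 300 in.pdf -shave ${px}x${px} out.pdf"
```

If the border was measured on the 300 dpi scan in the first place, no scaling is needed and -shave 20x20 can stay as it is.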
"2. The OCR'd text is selectable in in.pdf, but appears to have been stripped out of out.pdf."
I don't understand what you mean. Put your in.pdf file online somewhere so I can try it out.

Pete

Post by bastardo »

The -density 300 did work, but the file size increased from 100K to 400K. The image quality looks the same. The OCR'd text is still stripped out.

The image in the file had been OCR'd, and the OCR'd text is selectable in the usual way (on Ubuntu, click and drag the mouse across the text to select it, then copy it and paste it into a text window). I don't have anywhere to upload the file, but if you give me an email address I can send it.

Post by anthony »

Remember, ImageMagick is a raster image processor. It will convert small vector images into large raster (pixel array) images.

See A word about Vector Image formats.
http://www.imagemagick.org/Usage/formats/#vector


Basically, you scanned an image (raster),
used OCR to convert that to a text PDF (vector),
then used ImageMagick to convert it back to a raster!

I think you are not quite thinking things through!
Anthony Thyssen -- Webmaster for ImageMagick Example Pages
https://imagemagick.org/Usage/

Post by bastardo »

"used OCR to convert that to a text PDF (vector)"

That's what the OCR does? It sure looks like a bitmap. It zooms like a bitmap.

Post by anthony »

By vector I mean actual text that is drawn, not a bitmap (raster). IM does not deal with vector images, so it converts them back to a raster.

OCR means Optical Character Recognition, that is, converting an image into ordinary text.

Post by bastardo »

I don't believe that my scanner creates a vector image. It's a raster image backed by a text representation (the latter added by the OCR). I read that this is one of the PDF file layouts.
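One way to check this, sketched here on the assumption that pdftotext from poppler-utils is installed (it is not an ImageMagick tool and nobody in this thread used it), is to see whether each PDF yields any extractable text. The function name has_text_layer is made up for illustration:

```shell
# Succeeds if pdftotext extracts any non-whitespace text from the PDF.
# Assumes pdftotext (poppler-utils) is on the PATH.
has_text_layer() {
    # "-" as the output file sends the extracted text to stdout.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Compare the input and output files from the first post, if present.
for f in in.pdf out.pdf; do
    if [ -f "$f" ]; then
        if has_text_layer "$f"; then
            echo "$f: text layer found"
        else
            echo "$f: no extractable text"
        fi
    fi
done
```

A scanned page with an OCR layer should report text for in.pdf, while a file that has been through convert would be expected to report none.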

Post by anthony »

That is right. Scanners generate raster images. OCR converts them to vector (text) images (PDF).

Now a PDF is typically just the vector component. If it is more than that, fine. But if you feed that into IM you will lose the vector component (the text), and thus all the OCR work!

The point of all this is that IM is a raster image processor.

If you want to shave the image, do it before you use OCR. That is, run it on the TIFF, PNG, PBM, or JPEG image before you create the PDF.
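A minimal sketch of that order of operations, assuming the scan is saved as scan.tif (a made-up file name) and that the OCR step is then run on the shaved file. The shave_scan wrapper and the CONVERT variable are illustrative only:

```shell
# Shave the border off the raster scan before it goes to OCR.
# CONVERT may be overridden, e.g. CONVERT=magick on ImageMagick 7.
shave_scan() {
    # $1 = input raster, $2 = border in pixels, $3 = output raster
    ${CONVERT:-convert} "$1" -shave "${2}x${2}" "$3"
}

# Only run when the hypothetical scan.tif is actually present.
if [ -f scan.tif ]; then
    shave_scan scan.tif 20 shaved.tif
fi
```

Because the input is already a raster at its native resolution, no -density option is needed here, and the OCR tool then builds the PDF from shaved.tif.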