
convert -shave ruins image quality and strips OCR'd text

Posted: 2011-11-30T15:55:59-07:00
by bastardo
When running the following command:

convert in.pdf -shave 20x20 out.pdf

on Ubuntu, where in.pdf is a monochrome image of scanned text at 300 dpi, there are two bugs:

1. The image in out.pdf is seriously degraded, to the point where the text goes from crisp to completely unreadable.

2. The OCR'd text is selectable in in.pdf, but appears to have been stripped out of out.pdf.

I can provide in.pdf and out.pdf to whoever sends me an email address.

Re: convert -shave ruins image quality and strips OCR'd text

Posted: 2011-11-30T16:47:39-07:00
by el_supremo
1. The image in out.pdf is seriously degraded, to the point where the text goes from crisp to completely unreadable.
You need to specify a density for the PDF; otherwise it will default to 72 dpi, which will make the text look horrible.
Try this:

convert -density 300 in.pdf -shave 20x20 out.pdf
This will produce a larger output image, so you'll probably have to change the values used in -shave.
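As a rough sketch of how the -shave values might be adjusted: if the 20-pixel margin was chosen against the default 72 dpi rendering, the equivalent margin at 300 dpi is about 20 * 300 / 72, or roughly 83 pixels. The helper name scale_margin is hypothetical, and the script only prints the resulting command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: scale a pixel margin picked at 72 dpi
# to the equivalent margin at another density (rounded to nearest).
scale_margin() {
  px=$1
  dpi=$2
  echo $(( (px * dpi + 36) / 72 ))
}

margin=$(scale_margin 20 300)

# Print the adjusted command instead of executing it.
echo "convert -density 300 in.pdf -shave ${margin}x${margin} out.pdf"
# prints: convert -density 300 in.pdf -shave 83x83 out.pdf
```

The exact value matters less than the ratio; check the result by eye, since the right margin depends on how much border the scan actually has.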
2. The OCR'd text is selectable in in.pdf, but appears to have been stripped out of out.pdf.
I don't understand what you mean. Put your in.pdf file online somewhere so I can try it out.

Pete

Re: convert -shave ruins image quality and strips OCR'd text

Posted: 2011-11-30T16:58:39-07:00
by bastardo
The -density 300 did work, but the file size increased from 100K to 400K. The image quality looks the same. The OCR'd text is still stripped out.

The image in the file had been OCR'd, and the OCR'd text is selectable using the usual copy text selection method (on Ubuntu, click and drag the mouse across the text and you can select it, copy it, then paste it into a text window). I don't have a place to upload the file to, but if you provide me an email I can send it.
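One way to check for a text layer without relying on mouse selection is to extract the text and count the non-whitespace characters. This is only a sketch: it assumes pdftotext from poppler-utils is installed, and count_text_chars is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count non-whitespace characters read from stdin.
# A result of 0 suggests the PDF has no extractable text layer.
count_text_chars() {
  tr -d '[:space:]' | wc -c | tr -d ' '
}

# Intended use (assumes pdftotext from poppler-utils is available):
#   pdftotext in.pdf - | count_text_chars
#   pdftotext out.pdf - | count_text_chars
```

If in.pdf reports a large count and out.pdf reports 0, the text layer was dropped during conversion.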

Re: convert -shave ruins image quality and strips OCR'd text

Posted: 2011-12-01T23:32:12-07:00
by anthony
Remember, ImageMagick is a raster image processor. It will convert small vector images into large raster (pixel array) images.

See A word about Vector Image formats.
http://www.imagemagick.org/Usage/formats/#vector


Basically you scanned an image (raster),
used OCR to convert that to a text PDF (vector),
then used ImageMagick to convert it back to a raster!

I think you are not quite thinking things through!

Re: convert -shave ruins image quality and strips OCR'd text

Posted: 2011-12-02T00:50:58-07:00
by bastardo
used OCR to convert that to a text PDF (vector)

That's what the OCR does? It sure looks like a bitmap. It zooms like a bitmap.

Re: convert -shave ruins image quality and strips OCR'd text

Posted: 2011-12-04T22:08:09-07:00
by anthony
By vector I mean actual text that is drawn, not a bitmap (raster). IM does not deal with vector images, so it converts them back to a raster.

OCR means Optical Character Recognition; that is, converting an image into ordinary text.

Re: convert -shave ruins image quality and strips OCR'd text

Posted: 2011-12-04T22:24:47-07:00
by bastardo
I don't believe that my scanner creates a vector image. It's a raster image backed by a text representation (the latter added by the OCR). I read that this is one of the PDF file formats.

Re: convert -shave ruins image quality and strips OCR'd text

Posted: 2011-12-04T22:49:46-07:00
by anthony
That is right. Scanners generate raster images. OCR converts them to vector (text) images (PDF).

Now a PDF is typically just the vector component. If it is more than that, fine. But if you feed that into IM you will lose that vector component (the text), and thus all the OCR work!

The point of all this is that IM is a raster image processor.

If you want to shave the image, do it before you use OCR. That is, use it on the TIFF, PNG, PBM, or JPEG image, before you create the PDF.
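A minimal sketch of that order of operations, assuming the scanned pages are still available as raster files. The file name page1.tif and the helper shave_page_cmd are hypothetical, and the script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: build the -shave command for one scanned page.
# Input and output are both raster files, so no density setting is needed
# and no text layer is involved yet.
shave_page_cmd() {
  in=$1
  margin=$2
  out="${in%.*}-shaved.${in##*.}"
  printf 'convert %s -shave %sx%s %s\n' "$in" "$margin" "$margin" "$out"
}

shave_page_cmd page1.tif 20
# prints: convert page1.tif -shave 20x20 page1-shaved.tif
```

The shaved raster file is then what gets passed to the OCR step, so the text layer is created after cropping and never goes through ImageMagick.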