[reportlab-users] Writing smaller image-only PDFs
John J. Lee
jjlee at reportlab.com
Thu Feb 9 05:47:35 EST 2006
On Thu, 9 Feb 2006, Robin Becker wrote:
> Nicholas Watmough wrote:
[...]
>> But the image-only PDF produced by Omnipage was 0.4MB, and the one produced
>> through reportlab was 7.8MB.
>
> Could it be your docs are only black/white? A clever tool might recognize
> that and do the appropriate image manipulation. I'm fairly sure we try to
> respect the image properties, i.e. check for gray/rgb/cmyk, so we don't do
> that kind of conversion.
>
> Since jpeg is native for pdf we use only ascii85 encoding to make the
> contents more like ascii. I think we could save a bit by not doing that, but
> not a huge amount. Jpegs are already compressed and we have to specify
> dctdecode as well in the image filters.
>
> Perhaps they're tweaking the jpeg parameters to allow something smaller.
>
> Alternatively a smart scanner tool could actually do OCR, but I suspect they
> don't unless you ask for it.
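As a rough check on the ASCII85 point quoted above, here is a minimal sketch of how much that encoding adds to already-compressed data. It uses the standard library's base64.a85encode as a stand-in for ReportLab's own encoder, and random bytes as a stand-in for JPEG data; both are assumptions for illustration, not what ReportLab does internally.

```python
import base64
import os


def ascii85_overhead(payload: bytes) -> float:
    """Return encoded size divided by raw size for an ASCII85-encoded payload."""
    encoded = base64.a85encode(payload)
    return len(encoded) / len(payload)


# Hypothetical stand-in for JPEG bytes: random data is similarly incompressible.
fake_jpeg = os.urandom(400_000)

# ASCII85 maps every 4 input bytes to 5 output characters.
ratio = ascii85_overhead(fake_jpeg)
print(f"ASCII85 size ratio: {ratio:.3f}")
```

If the 7.8MB file were almost entirely ASCII85-encoded image data, dropping the encoding would still leave roughly 6.2MB, so it cannot account for the gap to 0.4MB.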
http://www.nuance.com/omnipage/
"""OmniPage 15, the entry-level version of the world's best selling OCR
software"""
So there's your explanation. I don't know exactly what you're doing, but
in addition to the OCR, I think these tools now have all kinds of
clever heuristics and algorithms to recognise image boundaries, simple
line graphics etc, too, and convert them to efficient text, image and
vector representations that would make for a small PDF. ReportLab's code
doesn't try to get into that big area -- if you want to do that kind of
thing using our tools, you need to use a tool like Omnipage as a
preprocessing step, and then use the OCRed textual data as input to your
ReportLab code. Perhaps Omnipage will let you get at descriptions of
images and vector graphics that it pulled out, too?
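To put rough numbers on why the choice of image representation matters so much, the sketch below compares the uncompressed size of one scanned page at three bit depths. The page dimensions (A4 at 300 dpi, about 2480 x 3508 pixels) are an assumption and do not come from the original scans; the figures are raw sizes before any compression filter is applied.

```python
def raw_image_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes, with each row padded to a whole byte."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed scan size: A4 at 300 dpi (hypothetical, not taken from the thread).
WIDTH, HEIGHT = 2480, 3508

sizes = {
    "1-bit bilevel": raw_image_bytes(WIDTH, HEIGHT, 1),
    "8-bit gray": raw_image_bytes(WIDTH, HEIGHT, 8),
    "24-bit RGB": raw_image_bytes(WIDTH, HEIGHT, 24),
}

for label, size in sizes.items():
    print(f"{label:>14}: {size / 1_000_000:.2f} MB")
```

A page stored as bilevel starts out 24 times smaller than the same page stored as RGB, before any compression, which is one plausible source of a large size gap even without OCR.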
John