[reportlab-users] Writing smaller image-only PDFs

Thu Feb 9 05:47:35 EST 2006

On Thu, 9 Feb 2006, Robin Becker wrote:
> Nicholas Watmough wrote:
[...]
>> But the image-only PDF produced by OmniPage was 0.4MB, and the one produced 
>> through reportlab was 7.8MB.
>
> Could it be your docs are only black/white? A clever tool might recognize 
> that and do the appropriate image manipulation. I'm fairly sure we try to 
> respect the image properties, i.e. check for gray/rgb/cmyk, so we don't do that.
>
> Since JPEG is native for PDF we use only ASCII85 encoding to make the 
> contents more like ASCII.  I think we could save a bit by not doing that, but 
> not a huge amount. JPEGs are already compressed and we have to specify 
> DCTDecode as well in the image filters.
>
> Perhaps they're tweaking the jpeg parameters to allow something smaller.
>
> Alternatively a smart scanner tool could actually do OCR, but I suspect they 
> don't unless you ask for it.

http://www.nuance.com/omnipage/

"""OmniPage 15, the entry-level version of the world's best selling OCR 
software"""

So there's your explanation.  I don't know exactly what you're doing, but 
in addition to the OCR, I think these tools now have all kinds of 
clever heuristics and algorithms to recognise image boundaries, simple 
line graphics etc. too, and convert them to efficient text, image and 
vector representations that would make for a small PDF.  ReportLab's code 
doesn't try to get into that big area.  If you want to do that kind of 
thing using our tools, you need to use a tool like OmniPage as a 
preprocessing step, and then use the OCRed textual data as input to your 
ReportLab code.  Perhaps OmniPage will let you get at descriptions of 
the images and vector graphics that it pulled out, too?
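
On Robin's ASCII85 point quoted above, here is a rough sketch of how much 
that encoding costs.  It uses only the Python standard library, not 
ReportLab, and random bytes stand in for already-compressed JPEG data (my 
assumption; no real scan is read).  The helper name ascii85_overhead is 
made up for illustration.

```python
import base64
import os


def ascii85_overhead(raw: bytes) -> float:
    """Return encoded_size / raw_size for ASCII85-encoded data."""
    # ASCII85 maps each group of 4 bytes to 5 printable characters.
    encoded = base64.a85encode(raw)
    return len(encoded) / len(raw)


# Stand-in for JPEG data: random bytes, which like a JPEG stream
# do not compress further.  This is an assumption for illustration.
fake_jpeg = os.urandom(100_000)

ratio = ascii85_overhead(fake_jpeg)
print(f"ASCII85 size ratio: {ratio:.3f}")
```

The ratio comes out at about 1.25, so dropping ASCII85 from a 7.8MB file 
would leave something over 6MB at best.  That fits Robin's "not a huge 
amount" and is nowhere near the 0.4MB that OmniPage produced.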

John