[reportlab-users] Building a PDF file with Images and "OCR" searchable text

Glenn Linderman v+python at g.nevcal.com
Sun Sep 23 12:39:20 EDT 2012


On 9/23/2012 6:45 AM, Andy Robinson wrote:

> On 23 September 2012 07:51, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> Is this possible in reportlab?
> I honestly don't know how Acrobat does it. There is a recent feature
> we have not implemented to add xml text versions of a document
> somewhere inside the PDF file for easy indexing.
>
> However, there's an easy enough trick. You just need to draw that
> text on the same page as the image with some combination of (a) no
> fill colour for the text ('white on white'), (b) in small text or
> even behind the image, or (c) off the edge of the readable page. Then
> all the normal text search tools should find it.
>
> Are you drawing in a flowing, Platypus mode, or using pdfgen to
> manually place images and control page breaks?


I'm doing nothing yet... I receive the images as TIFF files. They are
various sizes, and the goal is to make a single PDF with each TIFF file
as one page. That much I was able to achieve using IrfanView. However,
IrfanView can only convert images to PDF; it doesn't deal with text, and
certainly not with text tricks. (It can convert text to an image, I
believe, although I've not dealt with that.)

So to do more, I turned to reportlab, but realized I didn't know where
to start; to date I've only ever used reportlab to generate PDFs
containing text.

Perhaps you have a recommendation: which of these techniques would make
it easier to generate a PDF with such varying page sizes, starting from TIFF?
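
For the "one TIFF per page, each page its own size" part, a minimal
pdfgen sketch might look like the following. The function names and the
dpi default are assumptions made for illustration (the resolution is not
read from the TIFF here), and reading TIFF input relies on PIL being
installed alongside reportlab.

```python
def page_size_points(width_px, height_px, dpi):
    """Convert pixel dimensions to PDF points (1 point = 1/72 inch)."""
    return (width_px * 72.0 / dpi, height_px * 72.0 / dpi)


def images_to_pdf(image_paths, pdf_path, dpi=300):
    """Write one page per image, each page sized to match its image.

    dpi is an assumed scan resolution; it is not taken from the file.
    """
    # Imported here so page_size_points() can be used without reportlab.
    from reportlab.lib.utils import ImageReader
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(pdf_path)
    for path in image_paths:
        img = ImageReader(path)              # needs PIL for TIFF input
        width_px, height_px = img.getSize()  # size in pixels
        page_w, page_h = page_size_points(width_px, height_px, dpi)
        c.setPageSize((page_w, page_h))      # size may differ on every page
        c.drawImage(img, 0, 0, width=page_w, height=page_h)
        c.showPage()                         # finish this page
    c.save()
```

Calling images_to_pdf(["page1.tif", "page2.tif"], "out.pdf") with
hypothetical file names would give a two-page PDF. Multi-page TIFFs are
not handled in this sketch.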


> If the latter, then
> you could construct a paragraph object, put all the text in it, pick a
> pretty small font size so it's just about certain to be smaller than
> the scanned image, and position it on the page then draw your
> page-image over the top. (Call 'wrap' and 'draw' manually). If you
> are doing it in Platypus it's a bit fiddlier but we can probably show
> you a code snippet to do it.
>
> Please let us know if this works and especially how it shows up in
> Acrobat Reader ;-)

>
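
A rough sketch of the paragraph-under-the-image approach described
above, using wrapOn and drawOn to place the Paragraph manually before
drawing the image over it. The helper names, the 4 pt font size and the
Helvetica font are assumptions for illustration, and whether a given
viewer's search actually finds the covered text is untested here.

```python
from xml.sax.saxutils import escape


def paragraph_markup(text):
    """Escape raw OCR text so Paragraph does not parse it as markup."""
    return escape(text).replace("\n", "<br/>")


def draw_searchable_page(c, image_path, ocr_text, page_w, page_h,
                         font_size=4):
    """Draw small text first, then the page image on top of it.

    c is a reportlab.pdfgen.canvas.Canvas; sizes are in points.
    """
    # Imported here so paragraph_markup() can be used without reportlab.
    from reportlab.lib.styles import ParagraphStyle
    from reportlab.platypus import Paragraph

    c.setPageSize((page_w, page_h))
    style = ParagraphStyle("hidden_text", fontName="Helvetica",
                           fontSize=font_size, leading=font_size * 1.2)
    para = Paragraph(paragraph_markup(ocr_text), style)
    # 'wrap' works out the height the text needs at the page width.
    _, used_h = para.wrapOn(c, page_w, page_h)
    # 'draw' places it; y is the bottom edge, so this starts at the top.
    para.drawOn(c, 0, page_h - used_h)
    # The image is drawn last, so it covers the text if it is opaque.
    c.drawImage(image_path, 0, 0, width=page_w, height=page_h)
    c.showPage()
```

If the OCR text is long enough that used_h exceeds page_h, part of it
lands below the bottom edge of the page, which corresponds to option (c)
in the earlier list. Text outside the standard Helvetica encoding would
need a registered TrueType font instead.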

