[reportlab-users] counting pages in PDF

Jerome Alet reportlab-users@reportlab.com
Sat, 19 Jun 2004 00:34:01 +0200


Good morning !

On Fri, Jun 18, 2004 at 08:05:55PM +0200, Jerome Alet wrote:
> 
> some time ago, someone asked how to count pages in a PDF document.

This was Chris Withers it seems, sorry Chris.

> So for your pleasure, here's some code which seems to work with all 
> the PDF documents I've tested, and which should be completely cross 
> platform, provided you use Python 2.3 or newer : it uses the 
> Universal line end opening mode which appeared in 2.3 

and here's a heavily optimized rewrite that is portable across Python
versions: it doesn't need the "U" opening mode for files, and it
works perfectly well in 2.1

--- CUT ---
    def getPDFPageCount(infile) :    
        """Counts pages in a PDF document. 
        
           This is GPLed code written by J.Alet on 2004/06/19
        """
        import re
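        # A page object is marked by "/Type /Page"; the trailing character
        # class keeps "/Type /Pages" (the page tree node) from matching.
        regexp = re.compile(r"(/Type) ?(/Page)[/ \r\n]")
        pagecount = 0
        # xreadlines() reads lazily, so huge files are not loaded at once.
        for line in infile.xreadlines() : 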
            pagecount += len(regexp.findall(line))
        return pagecount    
--- CUT ---

I managed to parse a 15000-page PDF document (RL made, with forms) 
on my PII 350 MHz in less than 13 seconds. 

The PCL6 reference (a 265-page PDF) is parsed in about 0.7 seconds on 
the same machine. 

NB : using a more recent construct like "for line in infile :" makes it
even faster, but it is not portable to older Python versions
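
For readers on a current Python, here is a minimal sketch of the same
idea using the "for line in infile" construct. It assumes Python 3 and
a file opened in binary mode, so the pattern is compiled for bytes. The
names count_pdf_pages and PAGE_RE are hypothetical and not part of the
code above, and the sample data is a hand-made fragment rather than a
complete PDF.

```python
import io
import re

# Same pattern as in the post, without the capture groups, compiled for bytes.
PAGE_RE = re.compile(rb"/Type ?/Page[/ \r\n]")

def count_pdf_pages(infile):
    """Count "/Type /Page" markers in a PDF file object opened with "rb"."""
    pagecount = 0
    for line in infile:  # a binary file object yields one bytes line at a time
        pagecount += len(PAGE_RE.findall(line))
    return pagecount

# Hand-made fragment: one /Pages tree node and two /Page objects.
sample = io.BytesIO(
    b"1 0 obj\n<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page /Parent 1 0 R >>\nendobj\n"
    b"3 0 obj\n<</Type/Page/Parent 1 0 R>>\nendobj\n"
)
print(count_pdf_pages(sample))  # prints 2
```

With a real document this would be called as
count_pdf_pages(open("some.pdf", "rb")), where "some.pdf" is a
placeholder name. Like the original, it only sees markers that appear
as plain text in the file and sit on a single line.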

I'd be interested in having a PDF document for which the code above
fails to give the correct page count. If you find any, please send
it to me for debugging.

bye

Jerome Alet