[reportlab-users] Experimental early serializing pdfdoc.py for reportlab

Wed Apr 13 13:27:27 EDT 2005

Thomas Blatter wrote:
> Memory footprint problem of the current implementation
> ------------------------------------------------------
> 
> Currently reportlab's pdfdoc.py serializes (and builds the document) only at the very end, when the document is saved. This can be a problem for very large documents, which need a lot of memory. In my first evaluation tests, reportlab ate all the memory of our test server with 500 copies of the chartable demo in a single PDF document and did not succeed in producing the document.
> 
> For my purposes this behaviour made reportlab close to useless (I have to print documents with thousands of pages). So I went into the reportlab source and patched (in the end, rather rewrote) pdfdoc.py to do early serializing.
> 
> Early serializing pdfdoc.py
> ---------------------------
> 
> I patched the newest stable release of reportlab I could get: 1.20
> 
> The new pdfbase/pdfdoc.py does early serializing (basically at each page break). I made the changes so that it breaks as little as possible. Actually there is only one little additional patch needed to make it work with the rest of reportlab: the PDFDocument has to know the filename from the beginning (obviously).
> 
> 
> Results
> --------
> 
> Three of the runAll.py tests break: those which inspect PDF structures that have already been serialized and discarded. The others pass and the documents produced look the same.
> 
> Now for the memory footprint: I changed the gadflypaper demo to make 500 copies of that paper in a single document. On my test server this needs more than 32% of the memory with the original version and a little less than 4% with the early serializing version (it is a 23.5MB document with more than 19,000 pages). For some reason the resulting copies are not identical: there is more and more blank space.
> 
> Where to get
> ------------
> 
> The sources for the early serializing pdfdoc.py are here: http://bebabo.homelinux.org/test/reportlab/earlyserialize
> 
> Feedback welcome here or directly to me.
.....

This is now probably entirely sensible; at one time we had some rather hackish
code which attempted to optimize multiple copies of images etc. and which
looked backwards into the code array.

I think we got rid of that some time ago, but there are things called forms, and
since PDF is inherently non-linear in its use of indirection, there are certainly
going to be cases where a page uses a resource that would normally be defined
later on. However, I don't think we're using indirection except in pdfdoc itself.
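
To illustrate why indirection need not get in the way of early serializing,
here is a toy sketch in Python. It is not pdfdoc's API and not Thomas's patch:
the EarlyWriter class and its reserve/write_object/finish methods are made up
for this example, and the output is not a complete PDF (no catalog or trailer).
The idea is only that an object number can be handed out before the object
exists, so a page can be flushed while still pointing at a form that is
written later; the cross-reference table at the end records where everything
landed.

```python
import io


class EarlyWriter:
    """Hypothetical toy writer: flushes each object as soon as it is complete."""

    def __init__(self, stream):
        self.stream = stream
        self.offsets = {}      # object number -> byte offset in the stream
        self.next_number = 1
        self.position = 0      # bytes written so far
        self._emit(b"%PDF-1.3\n")

    def _emit(self, data):
        self.stream.write(data)
        self.position += len(data)

    def reserve(self):
        """Hand out an object number before the object's body exists."""
        number = self.next_number
        self.next_number += 1
        return number

    def write_object(self, number, body):
        """Serialize one object immediately and remember where it landed."""
        self.offsets[number] = self.position
        self._emit(b"%d 0 obj\n" % number + body + b"\nendobj\n")

    def finish(self):
        """Write the cross-reference table; fail if a reserved number was never written."""
        missing = [n for n in range(1, self.next_number) if n not in self.offsets]
        if missing:
            raise ValueError("objects reserved but never written: %r" % missing)
        xref_start = self.position
        self._emit(b"xref\n0 %d\n" % self.next_number)
        self._emit(b"0000000000 65535 f \n")          # entry for the free object 0
        for n in range(1, self.next_number):
            self._emit(b"%010d 00000 n \n" % self.offsets[n])
        return xref_start


out = io.BytesIO()
w = EarlyWriter(out)
form = w.reserve()   # number known now, body comes later
page = w.reserve()
# The page is flushed first and refers to the form only by "N 0 R".
w.write_object(page, b"<< /Type /Page /Resources << /XObject << /F1 %d 0 R >> >> >>" % form)
w.write_object(form, b"<< /Type /XObject /Subtype /Form >>")
xref_start = w.finish()
```

Only the offsets dictionary has to stay in memory until the end; the object
bodies can be discarded as soon as they are written.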

Should be a win-win. I'll take a look real soon.
-- 
Robin Becker