[reportlab-users] Experimental early serializing pdfdoc.py for reportlab

Wed Apr 13 11:43:47 EDT 2005

Memory footprint problem of the current implementation
------------------------------------------------------

Currently reportlab's pdfdoc.py serializes (and builds) the document only at the very end, when the document is saved. This can be a problem for very large documents, which need a lot of memory. In my first evaluation tests, reportlab ate all the memory of our test server with 500 copies of the chartable demo in a single PDF document and failed to produce the document.

For my purposes this behaviour made reportlab nearly useless (I would have to print documents with thousands of pages). So I went into the reportlab source and patched (in the end, more or less rewrote) pdfdoc.py to do early serializing.

Early serializing pdfdoc.py
---------------------------

I patched the newest stable release of reportlab I could get: 1.20

The new pdfbase/pdfdoc.py does early serializing (basically at each page break). I made the changes so that as little as possible breaks. In fact only one small additional patch is needed to make it work with the rest of reportlab: the PDFDocument has to know the filename from the beginning (obviously).
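
To illustrate the idea only (this is not the actual patch and not reportlab code), here is a minimal sketch in modern Python of a writer that serializes at each page break. The class EarlySerializingWriter and its methods add_page and close are hypothetical names made up for this example. Each finished page goes to disk immediately, and only the byte offsets and page object numbers stay in memory until the cross-reference table is written at the end. It also shows why the file name has to be known from the beginning.

```python
import os
import tempfile


class EarlySerializingWriter:
    """Toy PDF writer (hypothetical, not reportlab): pages are written at each page break."""

    def __init__(self, filename):
        # The file name is needed up front because output starts immediately.
        self._out = open(filename, "wb")
        self._offsets = {}    # object number -> byte offset in the file
        self._page_ids = []   # only the small page references stay in memory
        self._next_id = 3     # 1 = catalog, 2 = page tree, both written in close()
        self._out.write(b"%PDF-1.3\n")

    def _write_object(self, obj_id, body):
        # Remember where the object starts, then serialize it right away.
        self._offsets[obj_id] = self._out.tell()
        self._out.write(b"%d 0 obj\n" % obj_id + body + b"\nendobj\n")

    def add_page(self, content):
        """Serialize one finished page; its content is not kept afterwards."""
        stream_id, page_id = self._next_id, self._next_id + 1
        self._next_id += 2
        self._write_object(
            stream_id,
            b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        )
        self._write_object(
            page_id,
            b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
            b"/Resources << >> /Contents %d 0 R >>" % stream_id,
        )
        self._page_ids.append(page_id)

    def close(self):
        """Write the objects that depend on the full page list, then the xref table."""
        kids = b" ".join(b"%d 0 R" % i for i in self._page_ids)
        self._write_object(1, b"<< /Type /Catalog /Pages 2 0 R >>")
        self._write_object(
            2, b"<< /Type /Pages /Kids [%s] /Count %d >>" % (kids, len(self._page_ids))
        )
        xref_pos = self._out.tell()
        size = self._next_id
        self._out.write(b"xref\n0 %d\n0000000000 65535 f \n" % size)
        for obj_id in range(1, size):
            self._out.write(b"%010d 00000 n \n" % self._offsets[obj_id])
        self._out.write(
            b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (size, xref_pos)
        )
        self._out.close()


# Usage: 500 one-line pages, written to a temporary file as they are produced.
path = os.path.join(tempfile.mkdtemp(), "early.pdf")
writer = EarlySerializingWriter(path)
for n in range(500):
    writer.add_page(b"0 0 m 100 %d l S" % n)
writer.close()
```

In this sketch the memory that grows with the document is one offset and one object number per page, which is the point of serializing early. The price is the one described in the results below: anything that wants to inspect a page after it has been written finds it already gone.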

Results
-------

Three of the runAll.py tests break: the ones that inspect PDF structures which have already been serialized and discarded. The others pass, and the documents produced look the same.

Now for the memory footprint: I changed the gadflypaper demo to put 500 copies of that paper into a single document. On my test server the original version needs more than 32% of the memory, the early serializing version a little less than 4% (the result is a 23.5 MB document with more than 19,000 pages). For some reason the resulting copies are not identical: there is more and more blank space in them.

Where to get
------------

The sources for the early serializing pdfdoc.py are available here: http://bebabo.homelinux.org/test/reportlab/earlyserialize

Feedback welcome here or directly to me.