[reportlab-users] PDF Generation performance

Karl Putland reportlab-users@reportlab.com
Wed, 24 Dec 2003 15:42:56 -0700



On Wed, 2003-12-24 at 15:30, Karl Putland wrote:
> A couple of years ago I sent a case study to Andy that contained
> performance numbers for a HUGE report I used to run.  I'll see if I can
> find it.  Maybe Andy has it lying around somewhere.  It's been a while
> and I don't know if I've still got it.
> 

WOW!  It's been over three years since I did that.

Attached is the case study that I wrote up.  I have no way to test the
program on faster equipment any more, but the results as they were are
documented.

A lot has changed in ReportLab since then, and I think I still have the
table code lying around if it turns out that tables are an issue.  It
may or may not work any more.

--Karl


> --Karl
> 
> On Wed, 2003-12-24 at 04:31, Shayan Raghavjee wrote:
> > Hi guys,
> > 
> > I'm having a major performance issue, and it's nothing that a pill off 
> > the internet could solve, I was hoping you could give me some input. 
> > I've been using Reportlab for a few months, but never really needed any 
> > huge reports. Some reports have taken ages to build, but that was more 
> > due to the SQL statement complexity than anything else.
> > 
> > I'm currently working on something that doesn't require too much heavy 
> > SQL stuff, but does generate PDFs with sometimes well over a hundred 
> > pages. I've noticed Reportlab takes an age to build the file: a 100-page 
> > document typically takes 15 minutes, which is far outside the required 
> > speed. I'm not entirely sure where the problem lies. The form is 
> > basically a table with some required information on top, and another at 
> > the bottom, which is a table of tables. Maybe it's too complex?
> > 
> > How long should it take for 100 pages? Is there any way to make it 
> > faster, using templates or something, because most of the info is 
> > duplicated.
> > 
> > Any help, or ideas would be welcomed with open arms.
> > 
> > Thanks,
> > Shayan Raghavjee
> > St. James Software
> > 
> > _______________________________________________
> > reportlab-users mailing list
> > reportlab-users@reportlab.com
> > http://two.pairlist.net/mailman/listinfo/reportlab-users
-- 
Karl Putland <karl@putland.linux-site.net>

[Attachment: case_study_v3.txt]

About ServiceMagic.com:
    ServiceMagic.com is the leading online connection for today's
    consumers and qualified local Service Professionals ("SPs").
    ServiceMagic.com, launched in October of 1999, utilizes a
    proprietary technology to match a consumer's request for local
    home services to a network of qualified, licensed and
    interested local service professionals. ServiceMagic.com
    currently addresses more than 485 different common home
    service needs, from simple home repairs and maintenance to
    complete home remodeling projects. Through ServiceMagic.com,
    consumers can easily submit a request and quickly be matched
    with up to three qualified service professionals, often with
    online quotes for the requested job.

Problem:
    Create an activity report for each SP detailing their monthly
    activity.  The report needed to be in a static document format
    such as PS or PDF so that it could be printed in-house or
    outsourced.  The report also had some very complex formatting
    issues.  I've used shrink-wrapped tools in the past and
    not had much luck with complex formatting, complex data
    structures, or nested data.  Those tools would also require
    yet another program, such as Adobe Distiller, to create the
    final electronic document.  We needed a more flexible
    approach.


Solution:
    Using ReportLab, we generate PDFs programmatically, controlling
    all of the formatting, controlling all of the data, controlling
    all of the output.

    I've used and followed Python for about three years and was
    aware of ReportLab's predecessor, PDFgen.  When ReportLab formed
    in January, I picked up a copy, reviewed the demos in the
    release, and stashed that knowledge away for future reference.

    When the requirements for the monthly activity report were
    submitted, it was apparent that we needed a document that
    could be
        - dynamically generated
        - formatted for windowed envelopes
        - printed in-house
        - printed offsite
        - archived
        - emailed
        - difficult for the end viewer to edit.
    These requirements led to the choice of PDF for the format.

    In a couple of days I had a proof of concept that PDF files
    could indeed be generated dynamically and formatted using
    ReportLab.  Over the next two weeks, formatting issues were
    ironed out with the designers and a custom table model
    was implemented.  On its first run it generated 841 pages
    of PDF.

    After the second run of the report, I reviewed the various
    PDF offerings available for Java and found ReportLab to be
    more mature and full-featured.

    As the number of SPs with activity continued to grow, the
    report generation time grew from 4 hours to 7 hours, then to
    an estimated 13 hours for the month of June.  I had to do
    something.  After profiling the application, it turned out
    that the way I was accessing the data was the bottleneck.  A
    couple of days optimizing the queries helped a little, but
    not enough.  The solution here was to cache the data locally.
    Python's shelve module suited this need perfectly.  With data
    cached and keyed on the SP's user_id, the report generation for
    1 BIG file was reduced to 20 minutes!

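A minimal sketch of the caching idea described above, not the original
program.  It assumes one slow query per SP; fetch_activity, build_cache,
load_activity and the sample ids are hypothetical names made up for
illustration.  Only the use of shelve keyed on user_id comes from the
case study.

```python
import os
import shelve
import tempfile


def fetch_activity(user_id):
    """Hypothetical stand-in for the slow remote query for one SP."""
    return {"user_id": user_id, "leads": [f"lead-{user_id}-{n}" for n in range(3)]}


def build_cache(path, user_ids, fetch=fetch_activity):
    """Query each SP once and store the result under its user_id."""
    with shelve.open(path) as cache:
        for user_id in user_ids:
            cache[str(user_id)] = fetch(user_id)  # shelve keys must be strings


def load_activity(path, user_id):
    """Read one SP's cached data back during report generation."""
    with shelve.open(path, flag="r") as cache:
        return cache[str(user_id)]


# Build the cache once, then render from local disk instead of the database.
cache_path = os.path.join(tempfile.mkdtemp(), "sp_activity")
build_cache(cache_path, [101, 102, 103])
record = load_activity(cache_path, 102)
print(record["leads"])
```

Separating the query pass from the rendering pass means a failed or
repeated report run can reuse the shelf without touching the database.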
    UPDATE 2000-08-02:
    Yet more maintenance and improvements. - 4 days

    A fundamental redesign to simplify the activity report and
    to allow for the rendering of invoices from the accounting
    system.  The report was also altered to work with v1.0
    of ReportLab.

    The report now incorporates data from Oracle and M$SQL and
    renders both the invoices and the activity reports into
    three files.  The three files are for different weight classes
    of postage: 1oz, 2oz, and 3oz.  The program estimates the
    number of pages a particular SP will be receiving and changes
    canvases accordingly.

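The routing by postage class can be sketched roughly as follows.  This
is an illustration only: the page limits per class are invented (the
case study does not give them), and weight_class and
group_by_weight_class are hypothetical helpers.  In the real program
each group would correspond to its own ReportLab canvas.

```python
# Hypothetical page limits per postage class; the real thresholds depend on
# paper stock and envelope weight and are not given in the case study.
MAX_PAGES_BY_CLASS = [("1oz", 4), ("2oz", 9), ("3oz", 14)]


def weight_class(estimated_pages):
    """Return the first postage class whose page limit fits the estimate."""
    for name, max_pages in MAX_PAGES_BY_CLASS:
        if estimated_pages <= max_pages:
            return name
    raise ValueError(f"{estimated_pages} pages exceeds the heaviest class")


def group_by_weight_class(page_estimates):
    """Map each class name to the list of SP ids routed to its output file."""
    groups = {name: [] for name, _ in MAX_PAGES_BY_CLASS}
    for user_id, pages in page_estimates.items():
        groups[weight_class(pages)].append(user_id)
    return groups


# Made-up page estimates keyed by SP user_id.
groups = group_by_weight_class({101: 3, 102: 7, 103: 12, 104: 4})
print(groups)
```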
    Combining the invoices and activity reports into weight-sorted
    files allowed us to significantly reduce the costs associated
    with the mailing of invoices and activity reports.
    Before this feature was implemented, the outsourcers were
    hand-collating invoices with activity reports.  Now, with the
    combined document, they are using an intelligent inserter
    to automate the folding and stuffing of the mailings, saving
    us money, increasing the volume they can process, and speeding
    delivery to the SP.

    All of these changes also GREATLY improved the performance.
    The update to v1.0 was a 25% performance boost, and the
    redesign contributed greater than 30% on top of that! So for
    the same volume the time went from 20 minutes to about 8 minutes!


Extending the system:
    While creating the SP's monthly activity report, a table was needed to
    hold a variable amount of different kinds of information.  So I laid
    out a table model and hacked together some working code in about three
    days.

    One of the things that needed to go in a cell was a set of multiple
    choice bubbles, similar to the old Scantron tests that you had in school.
    I created a simple class that accepts a list of strings and draws
    a series of bubbles containing the strings.

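The original bubble class is not included here.  The following is a
hypothetical reconstruction of the idea only: Bubbles, its radius and
gap parameters, and RecordingCanvas are invented for illustration.  The
drawing code relies on just two canvas methods, circle and
drawCentredString, both of which a ReportLab Canvas provides, so a
recording stand-in is used to let the sketch run without ReportLab.

```python
class Bubbles:
    """Draws one labelled circle per choice, left to right.

    Hypothetical sketch: `canvas` only needs `circle` and `drawCentredString`
    methods, which a ReportLab Canvas provides.
    """

    def __init__(self, choices, radius=6, gap=4):
        self.choices = list(choices)
        self.radius = radius
        self.gap = gap

    def centers(self, x, y):
        """Return the (x, y) center of each bubble, starting at left edge x."""
        step = 2 * self.radius + self.gap
        return [(x + self.radius + i * step, y) for i in range(len(self.choices))]

    def draw(self, canvas, x, y):
        for (cx, cy), label in zip(self.centers(x, y), self.choices):
            canvas.circle(cx, cy, self.radius)
            # Nudge the text down so it sits roughly mid-bubble.
            canvas.drawCentredString(cx, cy - self.radius / 2, label)


class RecordingCanvas:
    """Stand-in canvas that records calls, so the sketch runs without ReportLab."""

    def __init__(self):
        self.calls = []

    def circle(self, x, y, r):
        self.calls.append(("circle", x, y, r))

    def drawCentredString(self, x, y, text):
        self.calls.append(("text", x, y, text))


canvas = RecordingCanvas()
Bubbles(["1", "2", "3"]).draw(canvas, 0, 100)
print(canvas.calls[0])
```

To place such a widget in a table cell, one would typically wrap the
same drawing logic in a ReportLab Flowable subclass.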
    ReportLab's open architecture and open source allowed me to create the
    pieces I needed to get the job done faster and better than with
    comparable proprietary systems.

Case Facts:
    How long to production:
        v.1 - 2 weeks (with no prior ReportLab experience)
        v.2 - 4 days (profiling then incorporating the cache)
        v.3 - 4 days (reformatting, incorporating data from two
                      sources, and designing the invoice)
    System:
        Data:
            Oracle 8i - Remote over a T1
            M$SQL 7 - Locally 10/100
        DCOracle to access the Oracle data
        mxODBC to access the MSSQL data
        Python shelves for a local cache
        Workstation:
            Pentium III 500
            128MB RAM
            6GB IDE HDD
            Other processes running concurrently:
                Another 40MB Python process that kicks off every
                5 minutes and runs for about a minute

    The Oracle data for the report consists of 12 distinct data sets
    that represent data from 7 different tables in many different
    relationships.  There are also 9 counts and a sum in the queries
    that produce the data sets and 16 separate calculations that occur
    for each SP when the report is created.

    The M$SQL data for the report is queried out of the accounting
    system by a couple of whoppers.  But the resulting data is easy
    to work with and is stuffed into its own shelves.

    Speed:
        3500+ Records.
        ~ 2-7 pages each
            3 BIG files with pageCompression ON:
                ~= 7.58 minutes or .130sec/record
                13,065KB - 1oz group
                 2,224KB - 2oz group
                   136KB - 3oz group

            3 BIG files with pageCompression OFF:
                ~= 7.12 minutes or .122sec/record
                46,275KB - 1oz group
                13,466KB - 2oz group
                   710KB - 3oz group

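The per-record figures above can be checked with a few lines of
arithmetic.  This sketch assumes exactly 3500 records; the case study
says "3500+", so the real per-record times are slightly lower.  The
names runs and seconds_per_record are made up for illustration.

```python
RECORDS = 3500  # the case study says "3500+", so these are upper bounds

# Timings and file sizes copied from the Speed figures above.
runs = {
    "pageCompression ON": {"minutes": 7.58, "sizes_kb": [13065, 2224, 136]},
    "pageCompression OFF": {"minutes": 7.12, "sizes_kb": [46275, 13466, 710]},
}


def seconds_per_record(minutes, records=RECORDS):
    """Convert a total run time in minutes to seconds per record."""
    return minutes * 60 / records


for name, run in runs.items():
    total_kb = sum(run["sizes_kb"])
    rate = seconds_per_record(run["minutes"])
    print(f"{name}: {rate:.3f} s/record, {total_kb:,} KB total")

# How much larger the uncompressed output is than the compressed output.
size_ratio = sum(runs["pageCompression OFF"]["sizes_kb"]) / sum(
    runs["pageCompression ON"]["sizes_kb"]
)
print(f"uncompressed/compressed size ratio: {size_ratio:.2f}")
```

On these numbers the uncompressed files are about 3.9 times larger in
total, while compression adds about 0.46 minutes to the run.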
    As a side note here, it appears that as file size increases, the cost
    of file I/O increases at a faster rate than the cost of compression.
    Results have not been validated, but no compression, on average,
    appears to take longer.  That observation was made while other
    processes were active.  The results above were produced from a
    fresh boot, with no extraneous processes.