[reportlab-users] PDF Generation performance
Karl Putland
reportlab-users@reportlab.com
Wed, 24 Dec 2003 15:42:56 -0700
On Wed, 2003-12-24 at 15:30, Karl Putland wrote:
> A couple of years ago I sent a case study to Andy that contained
> performance numbers for a HUGE report I used to run. I'll see if I can
> find it. Maybe Andy has it laying around somewhere. It's been a while
> and I don't know if I've still got it.
>
WOW! It's been over three years since I did that.
Attached is the case study that I wrote up. I have no way to test the
program on faster equipment any more, but the results as they were are
documented.
A lot has changed in ReportLab since then, and I think I still have the
table code lying around if it turns out that tables are an issue. It
may or may not work any more.
--Karl
> --Karl
>
> On Wed, 2003-12-24 at 04:31, Shayan Raghavjee wrote:
> > Hi guys,
> >
> > I'm having a major performance issue, and it's nothing that a pill off
> > the internet could solve, I was hoping you could give me some input.
> > I've been using Reportlab for a few months, but never really needed any
> > huge reports. Some reports have taken ages to build, but that was more
> > due to the SQL statement complexity than anything else.
> >
> > I'm currently working on something that doesn't require too much heavy
> > SQL stuff, but does generate PDFs with sometimes well over a hundred
> > pages. I've noticed Reportlab takes an age to build the file, a 100 page
> > document typically took 15 minutes, which is far outside the required
> > speed. I'm not entirely sure where the problem lies. The form is
> > basically a table with some required information on top, and another at
> > the bottom, which is a table of tables. Maybe it's too complex?
> >
> > How long should it take for 100 pages? Is there any way to make it
> > faster, using templates or something, because most of the info is
> > duplicated.
> >
> > Any help, or ideas would be welcomed with open arms.
> >
> > Thanks,
> > Shayan Raghavjee
> > St. James Software
> >
> > _______________________________________________
> > reportlab-users mailing list
> > reportlab-users@reportlab.com
> > http://two.pairlist.net/mailman/listinfo/reportlab-users
--
Karl Putland <karl@putland.linux-site.net>
Attachment: case_study_v3.txt
About ServiceMagic.com:
ServiceMagic.com is the leading online connection for today's
consumers and qualified local Service Professionals ("SPs").
ServiceMagic.com, launched in October of 1999, utilizes a
proprietary technology to match a consumer's request for local
home services to a network of qualified, licensed and
interested local service professionals. ServiceMagic.com
currently addresses more than 485 different common home
service needs from simple home repairs and maintenance to
complete home remodeling projects. Through ServiceMagic.com,
consumers can easily submit a request and quickly receive up
to three qualified service professionals, often including online
quotes for the job requested.
Problem:
Create an activity report for each SP detailing their monthly
activity. The report needed to be in a static document format
such as PS or PDF so that it could be printed in-house or
outsourced. The report also had some very complex formatting
requirements. I've used shrink-wrapped tools in the past and
have not had much luck with complex formatting, complex data
structures, or nested data. Those tools would also require yet
another program, such as Adobe Distiller, to create the final
electronic document. We needed a more flexible approach.
Solution:
Using ReportLab, we generate PDFs programmatically, controlling
all of the formatting, controlling all of the data, controlling
all of the output.
I've used and followed Python for about three years and was
aware of ReportLab's predecessor PDFgen. When ReportLab formed
in January, I picked up a copy and reviewed the demos in the
release and stashed that knowledge away for future reference.
When the requirements for the monthly activity report were
submitted, it was apparent that we needed a document that
could be
- dynamically generated
- formatted for windowed envelopes
- printed in-house
- printed offsite
- archived
- emailed
- not easily edited by the end viewer.
These requirements led to the choice of PDF for the format.

In a couple of days I had a proof of concept that PDF files
could indeed be generated dynamically and formatted using
ReportLab. Over the next two weeks, formatting issues were
ironed out with the designers and a custom table model
was implemented. On its first run it generated 841 pages
of PDF.
After the second run of the report, I reviewed
the various PDF offerings available for Java, and found ReportLab
to be more mature and full-featured.
As the number of SPs with activity continued to grow, the
report generation grew from 4 hours, to 7 hours, then to an
estimated 13 hours for the month of June. I had to do
something. After profiling the application, it turned out
the way I was accessing the data was the bottleneck. A
couple of days optimizing the queries helped a little, but
not enough. The solution here was to cache the data locally.
Python's shelve module suited this need perfectly. With data
cached and keyed on the SP's user_id, the report generation for
1 BIG file was reduced to 20 minutes!
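The caching step described above can be sketched roughly as follows. This is a minimal illustration, not the original program: fetch_sp_activity, build_cache, and load_activity are hypothetical names, and the record layout is invented for the example. Only the use of shelve keyed on user_id comes from the case study.

```python
import os
import shelve
import tempfile


def fetch_sp_activity(user_id):
    """Hypothetical stand-in for the slow remote database queries."""
    return {"user_id": user_id, "leads": [], "totals": {}}


def build_cache(path, user_ids, fetch=fetch_sp_activity):
    """Query each SP once and store the result under its user_id."""
    with shelve.open(path) as cache:
        for user_id in user_ids:
            # shelve keys must be strings, so the numeric id is converted.
            cache[str(user_id)] = fetch(user_id)


def load_activity(path, user_id):
    """Read one SP's cached data while the report is being rendered."""
    with shelve.open(path) as cache:
        return cache[str(user_id)]


cache_path = os.path.join(tempfile.mkdtemp(), "sp_activity")
build_cache(cache_path, [101, 102, 103])
record = load_activity(cache_path, 102)
```

The point of the design is that the expensive queries run once per SP up front, and the rendering loop only touches the local shelf.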

UPDATE 2000-08-02:
Yet more maintenance and improvements - 4 days

A fundamental redesign to simplify the activity report and
to allow for the rendering of invoices from the accounting
system. The report was also altered to work with v1.0
of ReportLab.

The report now incorporates data from Oracle and M$SQL and
renders both the invoices and the activity reports into
three files. The three files are for different weight classes
of postage: 1oz, 2oz, and 3oz. The program estimates the
number of pages a particular SP will be receiving and changes
canvases accordingly.
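One way to picture the routing by postage class is the sketch below. The page limits are placeholders, since the case study does not state the real cut-offs, and weight_class and route_reports are hypothetical names. In the real program each class would correspond to its own canvas; plain lists stand in for them here.

```python
# Hypothetical page limits per postage class. The case study does not give
# the real cut-offs, so these numbers are placeholders only.
MAX_PAGES_1OZ = 4
MAX_PAGES_2OZ = 9


def weight_class(estimated_pages):
    """Pick the postage class for a mailing from its estimated page count."""
    if estimated_pages <= MAX_PAGES_1OZ:
        return "1oz"
    if estimated_pages <= MAX_PAGES_2OZ:
        return "2oz"
    return "3oz"


def route_reports(page_estimates):
    """Group SP user_ids by the output file their mailing belongs in."""
    batches = {"1oz": [], "2oz": [], "3oz": []}
    for user_id, pages in page_estimates.items():
        batches[weight_class(pages)].append(user_id)
    return batches


batches = route_reports({101: 2, 102: 6, 103: 12, 104: 4})
```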

This latest addition allowed us to significantly reduce the
costs associated with the mailing of invoices and activity reports.
Before this feature was implemented, the outsourcers were
hand-collating invoices to activity reports. Now, with the
combined document, they are using an intelligent inserter
to automate the folding and stuffing of the mailings, saving
us money, increasing the volume they can process, and speeding
delivery to the SP.

All of these changes also GREATLY improved the performance.
The update to v1.0 was a 25% performance boost, and the
redesign contributed greater than 30% on top of that! So for
the same volume the time went from 20 minutes to about 8 minutes!

Extending the system:
While creating the SP's monthly activity report, a table was needed to
hold a variable amount of different kinds of information. So I laid
out a table model and hacked together some working code in about three
days.

One of the things that needed to go in a cell was a row of multiple-choice
bubbles similar to the old Scantron tests that you had in school.
I created a simple class that accepts a list of strings and draws
a series of bubbles containing the strings.
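A bubble class along those lines might look something like the sketch below. It is not the original code: BubbleRow and its radius and gap parameters are hypothetical. The draw method only assumes a canvas object with circle(x, y, r) and drawCentredString(x, y, text) methods, which are names ReportLab's Canvas provides; a small recording stand-in is used here so the example runs without ReportLab installed.

```python
class BubbleRow:
    """Draws one labelled circle per choice, left to right."""

    def __init__(self, choices, radius=6, gap=4):
        self.choices = list(choices)
        self.radius = radius
        self.gap = gap

    def centres(self, x, y):
        """Return the centre of each bubble, starting at left edge x."""
        step = 2 * self.radius + self.gap
        return [(x + self.radius + i * step, y)
                for i in range(len(self.choices))]

    def draw(self, canvas, x, y):
        for (cx, cy), label in zip(self.centres(x, y), self.choices):
            canvas.circle(cx, cy, self.radius)
            # Nudge the baseline down so the text sits roughly mid-bubble.
            canvas.drawCentredString(cx, cy - self.radius / 3, label)


class RecordingCanvas:
    """Stand-in canvas that logs drawing calls instead of writing a PDF."""

    def __init__(self):
        self.calls = []

    def circle(self, x, y, r):
        self.calls.append(("circle", x, y, r))

    def drawCentredString(self, x, y, text):
        self.calls.append(("text", x, y, text))


canvas = RecordingCanvas()
BubbleRow(["A", "B", "C"]).draw(canvas, 0, 100)
```

Keeping the layout arithmetic in centres() separate from the drawing calls makes the spacing easy to check without rendering anything.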

ReportLab's open architecture and open source allowed me to create the
pieces I needed to get the job done faster and better than with
comparable proprietary systems.

Case Facts:
How long to production:
    v.1 - 2 weeks (with no prior ReportLab experience)
    v.2 - 4 days (profiling then incorporating the cache)
    v.3 - 4 days (reformatting, incorporating data from two
          sources, and designing the invoice)
System:
    Data:
        Oracle 8i - Remote over a T1
        M$SQL 7 - Locally 10/100
        DCOracle to access the Oracle data
        mxODBC to access the MSSQL data
        Python shelves for a local cache
    Workstation:
        Pentium III 500
        128MB RAM
        6GB IDE HDD
    Other processes running concurrently:
        Another 40MB Python process that kicks off every
        5 minutes and runs for about a minute
The Oracle data for the report consists of 12 distinct data sets
that represent data from 7 different tables in many different
relationships. There are also 9 counts and a sum in the queries
that produce the data sets and 16 separate calculations that occur
for each SP when the report is created.

The M$SQL data for the report is queried out of the accounting
system by a couple of whoppers. But the resulting data is easy
to work with and is stuffed into its own shelves.
Speed:
    3500+ Records.
    ~ 2-7 pages each
    3 BIG files with pageCompression ON:
        ~= 7.58 minutes or 0.130 sec/record
        13,065KB - 1oz group
        2,224KB - 2oz group
        136KB - 3oz group
    3 BIG files with pageCompression OFF:
        ~= 7.12 minutes or 0.122 sec/record
        46,275KB - 1oz group
        13,466KB - 2oz group
        710KB - 3oz group

As a side note, it appears that as file size increases, the cost
of file I/O increases at a faster rate than the cost of compression.
The results have not been validated, but with other systems active,
runs with no compression appear, on average, to take longer. The
results above were produced from a fresh boot with no extraneous
processes.
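The per-record times and total output sizes under Speed can be rechecked with a few lines of arithmetic. This sketch assumes exactly 3500 records, although the case study says "3500+", so the per-record figures are upper bounds. The variable names are only for illustration.

```python
RECORDS = 3500  # the case study says "3500+", so this is a lower bound

runs = {
    "pageCompression ON": {"minutes": 7.58, "sizes_kb": [13065, 2224, 136]},
    "pageCompression OFF": {"minutes": 7.12, "sizes_kb": [46275, 13466, 710]},
}


def seconds_per_record(minutes, records=RECORDS):
    """Convert a total run time in minutes to seconds per record."""
    return minutes * 60 / records


# For each run: (seconds per record rounded to 3 places, total size in KB)
summary = {
    name: (round(seconds_per_record(run["minutes"]), 3), sum(run["sizes_kb"]))
    for name, run in runs.items()
}

# How much larger the uncompressed output is than the compressed output.
size_ratio = summary["pageCompression OFF"][1] / summary["pageCompression ON"][1]
```

On these numbers the uncompressed files are a little under four times the size of the compressed ones, for a difference of under half a minute in run time.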