[reportlab-users] Large tables - exponentially increasing run times

Craig Ringer reportlab-users@reportlab.com
Sat, 15 May 2004 01:55:30 +0800


Hi folks

I'm pretty new to ReportLab, but I thought I'd post a few questions
here, having been unable to find an answer in the list archives or docs.

I've been looking seriously into using ReportLab (specifically,
platypus) as a print system for an application I'm working on. Part of
this will involve potentially extremely large table output. My tests
with "dummy" data in a very simply program are showing some performance
issues that will be a serious problem for this purpose. I thought I'd
post my observations so far, then a few questions. If I've missed
something obvious, then sorry for your time.

First: I'm using ReportLab 1.19 with Python 2.3.3 (vanilla from
python.org) on Linux (FC1 and RH8). Just in case it matters.

I searched the archives for info on performance problems, and while I
did find one post regarding the LongTables class, I'm at a loss to find
the 'longtables.py' module that was mentioned in that post. I can't see
it in sf.net CVS, 1.19, or the current snapshot. While there is a
LongTable class, it doesn't seem to be exported by the platypus module.
If I add the export and use the module, it doesn't help my run times at
all. I suspect I've missed something - a prod in the right direction
would be much appreciated.
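
For reference, I've been pulling LongTable in directly from the tables
module to work around the missing export, roughly like this (the data and
column widths here are just placeholders):

from reportlab.platypus.tables import LongTable

# LongTable seems to take the same arguments as Table; it just isn't
# re-exported by reportlab.platypus in 1.19.
data = [[str(r), str(r * 2)] for r in range(20000)]
table = LongTable(data, colWidths=[100, 100])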

Anyway, I noticed that someone else had run into the problem and
submitted a feature request at sf.net:
http://sourceforge.net/tracker/index.php?func=detail&aid=729034&group_id=682&atid=350682
. Since I wasn't the only one having trouble, I thought I'd have a dig
and see what I could figure out.

Like the submitter of that feature request, I was seeing O(n^2) run times
- even with rowHeights and colWidths specified.

I ended up doing some profiling which suggested that Table.__init__ was
the main CPU hog. Adding some print statements, etc., revealed what was
happening. Every time Table._splitRows is called (by platypus via
Table.split), it splits a small table off and returns both that table
and an _entirely_ _new_ table instance with the rest of the rows. Of
course, the new and nearly-as-large table's split method gets called,
and another tiny chunk is removed and _another_ huge table instance is
created. This gets slow, fast.

So, for every doubling of the row count, the number of rows that must be
processed doubles ... and so, roughly, does the number of times they must
be processed, quadrupling the total work. This appears to be the origin of
the O(n^2) run times.
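
To make the scaling concrete, here's a toy cost model (assuming a fixed
number of rows fits on each page, and that Table.__init__ does work
proportional to the number of rows it's handed):

def split_cost(total_rows, rows_per_page):
    # Each split peels off one page worth of rows but rebuilds the whole
    # remainder, so __init__ touches every remaining row every time.
    cost = 0
    remaining = total_rows
    while remaining > 0:
        cost += remaining           # new Table built from all remaining rows
        remaining -= rows_per_page  # only one page worth actually consumed
    return cost

split_cost(8000, 50) comes out at around 640,000 row initialisations -
roughly n**2 / (2 * rows_per_page) - versus 8,000 if each row only had to
be processed once.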

To confirm this, I rewrote (perhaps I should say "severely mangled")
Table._splitRows to do things another way. The rewritten test version is
BROKEN in many ways, so it's not a valid performance comparison - all I
wanted was a version that ran so I could test if _splitRows was really
the problem. The test version:
	(a) doesn't work if page/frame sizes vary (unsure how to fix)
	(b) doesn't support repeatRows (but could easily)
	(c) doesn't support styles (but again, could easily)
The test version checks if all rowHeights are equal, and if so does
things differently. It divides the number of rows in the data by the
number of rows that fit on a page, then splits the data into that many
chunks and creates a table instance for each of them. It then returns
that list of tables. In other words, it avoids the recursive behaviour
demonstrated by the current _splitRows, at the cost of correctness. It's
only a test!
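
Stripped of all the bookkeeping, the core of the test version is something
like this (it assumes a uniform row height and a fixed amount of space per
page, and ignores repeatRows and styles entirely):

from reportlab.platypus import Table

def split_all_at_once(data, rowHeight, availHeight, colWidths):
    # All rows are the same height, so work out how many fit on a page
    # once, then slice the data up in a single pass.
    rows_per_page = max(1, int(availHeight // rowHeight))
    tables = []
    for start in range(0, len(data), rows_per_page):
        chunk = data[start:start + rows_per_page]
        tables.append(Table(chunk, colWidths=colWidths,
                            rowHeights=[rowHeight] * len(chunk)))
    return tables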

The results of my testing with and without the mangled _splitRows are
remarkable, and strongly point to _splitRows as the problem with my
code. I'm _not_ saying "the problem with Table" or "the problem with
tables in platypus"... I don't know enough about it for that, and as I
said I may be missing something obvious. Nonetheless, with my code, the
mangled _splitRows made the run times linear.

Here are some very rough and ready benchmarks. These were created with
'time' on the command line, not the timeit module, so they include
interpreter startup and data set construction. I wrapped the actual
ReportLab calls with some print statements as a check, and the vast
majority of the time is being spent inside ReportLab. In other words,
these times are "good enough".
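
For completeness, the driver script is essentially this shape (the file
name, column layout and row height are arbitrary):

import sys
from reportlab.platypus import SimpleDocTemplate, Table

def build_doc(nrows, fname='table_test.pdf'):
    # One big table of dummy data; run under 'time' with the row count
    # as a command line argument.
    data = [[str(r), 'some dummy text', 'more dummy text']
            for r in range(nrows)]
    table = Table(data, colWidths=[80, 200, 200], rowHeights=[18] * nrows)
    SimpleDocTemplate(fname).build([table])

if __name__ == '__main__':
    build_doc(int(sys.argv[1]))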

Some of these appear in my comments on the feature request, created as I
was working on this. The results:

Without the change (stock _splitRows):

rows	time(s)	rows/s
1000	1.9	526
2000	5.3	337
4000	17.8	224
8000	75	106

I've estimated 6 to 10 minutes for 20,000 rows. I'm running the test...
let's see how close I was:

20000	519	38
8 mins 38 seconds. Not a bad estimate :-) That makes this one:
100000	??	??
impractical to run, however.

With the mangled _splitRows:

rows	time(s)	rows/s
1000	1.4	714
2000	2.5	800
4000	4.5	888
8000	9.3	860
20000	24	833
100000	131	763

So ... as you can see, at least with the way I'm using the Table class,
the _splitRows method does appear to be the cause of the long run times.

_Assuming_ I'm not missing something obvious, and there is a real
problem here, then I'm not too sure how to approach fixing it. Hence my
post here - I'm hoping to find out (a) if I'm wasting my time fixing a
non-problem, and (b) if there is a problem, how best to approach the
fix.

My test code is not a suitable solution, because it causes the table to
split itself incorrectly when the table on the first page doesn't have
the same amount of space available as subsequent pages do. Three
(two, really) approaches come to mind. All of them are really "thinking
out loud" - they might be totally wrong, or solving a non-problem. Anyway,
in case they're of any use:

	(a) use the same technique I used in my test, i.e. split it all in one
go, but do so more intelligently than I did in my testing. It could be
difficult/impossible to get enough information about things like the
sizes of frames later in the document, so I doubt this would work.

	(b) Retain the same recursive-style splitting (I know it's not truly
recursive, but it has some recursive-like behaviour) as the current
split uses. However, instead of creating an entire new Table instance
each time Table.split is called, find a way to efficiently pare some
rows off the current Table instance without re-creating it. Return
[new_small_split_table, shrunken_old_table_instance] as [R0, R1] instead
of the [new_small_split_table, new_table_containing_rest] that the Table
class currently returns.

	(c) is a variant on (b). A new Table instance _is_ created, but a
method is provided for it to be passed the pre-calculated metrics of the
remaining rows from the old table, so that the __init__ function is fast
and efficient.

With my limited understanding of ReportLab internals, it seems that (b)
would be strongly preferable. Both (b) and (c) could handle tables where
rowHeight is not fixed across rows, but (a) seems really ugly in how it
relates to the platypus flow model - I'd be a little surprised if it
could even work.

(c) seems a "lesser" alternative to (b), in that it seems best to avoid
creating a whole new (potentially huge) copy of most of the
Table._cellvalues list if it's possible to simply extract a small chunk
with a slice, then del() the slice (retaining the first repeatRows
entries) from the original list. I guess one could modify the "parent"
table's _cellvalues this way, then pass that modified list to the new
table, avoiding the copy. That'd involve modifying the parent instance's
data in place, though, and would seem to invalidate the parent instance -
it'd need to be discarded. If you're going to do that, why not have the
parent instance modify itself properly and return itself, as per (b)?

If I have this right, Table._splitRows is currently essentially doing
this (massively oversimplified, pseudocode really):

smalltable = Table(data[:n])
# The current Table instance will have all references deleted and be
# garbage collected shortly after we return. Adding a __del__ method
# seems to confirm this.
return [ smalltable, Table(data[n:]) ]

and approach (b) would change that to (again, massively oversimplified,
almost pseudocode):

smalltable = Table(data[:n])
del data[repeatRows:n]
# fix up other related info like self._rowHeights here
return [ smalltable, self ]
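
Fleshed out very slightly (the attribute names are from my reading of
tables.py, the fix-up list is almost certainly incomplete, and I'm
assuming rowHeights were given up front), the pared-down split could be a
helper that _splitRows calls once it has found the split point n:

def _pareRows(self, n):
    # Split the first n rows off into a new small table, then shrink this
    # instance in place instead of rebuilding it from scratch.
    R0 = Table(self._cellvalues[:n], repeatRows=self.repeatRows)
    del self._cellvalues[self.repeatRows:n]
    del self._rowHeights[self.repeatRows:n]
    return [R0, self]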

I hope I'm making some sense here, and haven't wasted my time tracking
down a non-problem!

Craig Ringer