[reportlab-users] Unicode handling bugs - 2.0

Mon Jun 5 10:52:31 EDT 2006

Greg Phillips wrote:
> First of all, thanks much for going Unicode in 2.0, and especially for 
> fixing the KeepTogether bug. Both of those simplify my life enormously.
........

> In paragraph.py, there are two places (lines 279 and 298) where tests like:
> 
>     if type(bulletText) is StringType:
> 
> are made to determine whether the bullets are text or lists of 
> fragments. This breaks for the obvious reason if the bullet text is 
> unicode. I suggest changing these lines to:
> 
>     if isinstance(bulletText, basestring):

I'm fairly sure we're allowed to have both here now; the decision to accept 
either utf8 or unicode was made fairly late and probably without enough checking.
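
A minimal sketch of the suggested check, for illustration only. The helper 
name bullet_is_text and the string_types shim are assumptions made for this 
example, not code from paragraph.py; the shim is only there so the snippet 
also runs on interpreters that have no basestring.

```python
# Hypothetical helper illustrating the isinstance() test proposed above.
try:
    string_types = basestring  # Python 2: matches both str and unicode
except NameError:
    string_types = str  # Python 3: str is already unicode text


def bullet_is_text(bulletText):
    """Return True if bulletText is plain text rather than a fragment list."""
    return isinstance(bulletText, string_types)
```

With this, bullet_is_text(u"\u2022") and bullet_is_text("1.") are both true, 
while a list of fragments is not, which is the distinction the 
"type(bulletText) is StringType" test misses for unicode input.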

> There's a similar error at line 1186 of pdfdoc.py. A quick grep shows 
> other instances of "is StringType" in the library, but I haven't 
> investigated whether these are bugs or not.

I think that one is supposed to be a string.

> 
> Also, in paraparser.py, line 710, there's a conversion to cp1252 
> encoding to make sgmlop happy; this was causing errors when my input 
> included characters that weren't recognized in that encoding. Changing 
> the encoding to utf-8 seemed to solve the problem, but I don't know 
> enough about what's really going on there to know if that's the Right 
> Thing To Do.
.....

Seems right to me, but I don't really know what happens if we get a '<' as part 
of a multi-byte character in utf8.
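
On the '<' question: as far as the encoding itself goes, UTF-8 is defined so 
that every byte of a multi-byte sequence has the high bit set, so the byte 
0x3C ('<') should only ever appear as a real '<'. The sketch below checks 
that on a sample string and also shows the cp1252 failure described above. 
The sample text and the helper names are made up for this example, and it 
says nothing about how sgmlop itself treats utf-8 input.

```python
# Sample markup with characters inside and outside cp1252 (hypothetical input).
text = u"<para>caf\u00e9 \u4e2d\u6587 \u20ac</para>"


def can_encode(s, encoding):
    """Return True if s can be encoded with the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def markup_offsets(data):
    """Return the byte offsets of 0x3C ('<') in encoded data."""
    return [i for i, b in enumerate(bytearray(data)) if b == 0x3C]


def multibyte_bytes_are_high(s):
    """Check that no multi-byte UTF-8 sequence contains an ASCII-range byte."""
    for ch in s:
        enc = bytearray(ch.encode("utf-8"))
        if len(enc) > 1 and any(b < 0x80 for b in enc):
            return False
    return True


utf8 = text.encode("utf-8")
```

Here can_encode(text, "cp1252") is false because of the CJK characters, 
can_encode(text, "utf-8") is true, and the only 0x3C bytes in utf8 are the 
two that open the tags.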
-- 
Robin Becker