[reportlab-users] encoding errors

Dirk Holtwick holtwick at spirito.de
Tue Jan 23 04:52:49 EST 2007


Hi Henning,

your donation of the hyphenation library sounds great! Hope it will be
integrated.

After a second look I found out that you are right: ReportLab seems
to convert string data here and there. Here is an example from
pdfbase/pdfdoc.py:

def format(self, document):
    s = self.s
    enc = getattr(self,'enc','auto')
    if type(s) is str:
        if enc is 'auto':
            try:
                u = s.decode('utf8')
            except:
                print s
                raise
            if _checkPdfdoc(u):
                s = u.encode('pdfdoc')
            else:
                s = codecs.BOM_UTF16_BE+u.encode('utf_16_be')
    elif type(s) is unicode:
        if enc is 'auto':
            if _checkPdfdoc(s):
                s = s.encode('pdfdoc')
            else:
                s = codecs.BOM_UTF16_BE+s.encode('utf_16_be')
        else:
            s = codecs.BOM_UTF16_BE+s.encode('utf_16_be')
    else:
        raise ValueError('PDFString argument must be str/unicode not %s' % type(s))

First, I ask myself why there is no central routine for converting any
string from Unicode or UTF-8 to the format needed in ReportLab.

Second, I find a "print" statement, which should not be used at all in a
library, where it is not clear whether there is any STDOUT for the output
to go to, e.g. when used within a server.

Third: I would prefer to pass the "ignore" error handler to the "decode" calls ;-)

Fourth: "type(s) is str" should be "type(s) is types.StringType" or,
better, "type(s) is not types.UnicodeType", and then handle the rest.
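The four points above could be folded into one helper. What follows is only a sketch under assumptions, written for Python 3 (where bytes/str take the place of the str/unicode pair in the excerpt). The names to_text, fits_8bit and encode_pdf_string are hypothetical and not part of ReportLab, and "latin-1" merely stands in for ReportLab's "pdfdoc" codec so that the example runs without ReportLab installed.

```python
import codecs
import logging

# Library code reports through logging; the application decides where it goes.
log = logging.getLogger(__name__)


def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as text; bytes are decoded, text passes through.

    Hypothetical helper, not part of ReportLab.
    """
    if isinstance(value, str):  # isinstance also accepts str subclasses
        return value
    if isinstance(value, (bytes, bytearray)):
        try:
            return bytes(value).decode(encoding, errors)
        except UnicodeDecodeError:
            # Log instead of printing: there may be no usable stdout.
            log.error("cannot decode %r as %s", value, encoding)
            raise
    raise TypeError("expected bytes or str, not %s" % type(value).__name__)


def fits_8bit(text, codec="latin-1"):
    """True if text can be encoded with the given 8-bit codec.

    'latin-1' is a stand-in for ReportLab's 'pdfdoc' codec (assumption).
    """
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def encode_pdf_string(value, encoding="utf-8", errors="strict", codec="latin-1"):
    """Single entry point: 8-bit encoding if possible, else BOM + UTF-16-BE."""
    text = to_text(value, encoding, errors)
    if fits_8bit(text, codec):
        return text.encode(codec)
    return codecs.BOM_UTF16_BE + text.encode("utf_16_be")
```

With errors="strict" a bad byte still raises, but the caller can opt into errors="ignore" (or "replace") in one place instead of patching every decode call. Note that "ignore" silently drops the offending bytes, so it trades an exception for possible data loss.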

In a real scenario I had an error in this place:

error in line 0: Traceback (innermost last):
  File "sx\pisapro\pml.pyo", line 359, in __init__
  File "reportlab\platypus\doctemplate.pyo", line 749, in build
  File "reportlab\platypus\doctemplate.pyo", line 698, in _endBuild
  File "reportlab\pdfgen\canvas.pyo", line 870, in save
  File "reportlab\pdfbase\pdfdoc.pyo", line 215, in SaveToFile
  File "reportlab\pdfbase\pdfdoc.pyo", line 237, in GetPDFData
  File "reportlab\pdfbase\pdfdoc.pyo", line 380, in format
  File "reportlab\pdfbase\pdfdoc.pyo", line 765, in format
  File "reportlab\pdfbase\pdfdoc.pyo", line 90, in format
  File "reportlab\pdfbase\pdfdoc.pyo", line 739, in format
  File "reportlab\pdfbase\pdfdoc.pyo", line 90, in format
  File "reportlab\pdfbase\pdfdoc.pyo", line 550, in format
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 0: unexpected code byte
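
Byte 0xfe at position 0 happens to be the first byte of codecs.BOM_UTF16_BE ("\xfe\xff"), which is what the pdfdoc.py excerpt prepends. One possible reading, not confirmed anywhere in this thread, is that a string which had already been converted to BOM + UTF-16-BE was run through the "str" branch again and decoded as UTF-8. The following Python 3 sketch only reproduces that failure mode in isolation; it does not use ReportLab.

```python
import codecs

# Example text with characters outside any 8-bit encoding.
text = "\u4e2d\u6587"

# First pass: what the UTF-16 branches of the excerpt produce.
encoded = codecs.BOM_UTF16_BE + text.encode("utf_16_be")

# Hypothetical second pass: the byte string is treated as UTF-8 input again.
try:
    encoded.decode("utf8")
except UnicodeDecodeError as exc:
    failure = exc
else:
    failure = None

# The failing byte is the first BOM byte, at offset 0.
if failure is not None:
    print("byte 0x%02x at position %d" % (failure.object[failure.start], failure.start))
```

If this is indeed the cause, the "ignore" error handler would only hide it: the BOM bytes would be dropped and the rest of the UTF-16 data misread as UTF-8.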

This happens even with all data sent as Unicode, converted by this little helper:

import types
import reportlab

# True for ReportLab 2.x, which expects unicode or UTF-8 input
ReportlabVersion2 = (reportlab.Version >= "2.0")

def toString(s, enc="latin1"):
    if ReportlabVersion2:
        if type(s) is not types.UnicodeType:
            s = unicode(str(s), enc, "ignore")
        # Euro sign helper
        return s.replace(u"\x80", u"\u20ac")
    return s
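
The Euro sign helper reflects a difference between Latin-1 and Windows-1252: byte 0x80 decodes to the control character U+0080 under "latin1", while in "cp1252" the same byte is the Euro sign. A short Python 3 sketch of that difference, assuming the input really originates from a Windows-1252 source (an assumption; nothing here says where the data comes from):

```python
raw = b"10 \x80"

as_latin1 = raw.decode("latin1")  # 0x80 becomes U+0080, a control character
as_cp1252 = raw.decode("cp1252")  # 0x80 becomes U+20AC, the Euro sign

# The replace() in toString patches the Latin-1 result after the fact.
patched = as_latin1.replace("\x80", "\u20ac")
```

Decoding with "cp1252" directly would give the same result for this byte, but cp1252 leaves a few byte values undefined, so the choice of error handler still matters there.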

Bye,
Dirk


Henning von Bargen wrote:

> Last autumn, I began porting the ReportLab integration of my deco-cow
> hyphenation library (http://deco-cow.sourceforge.net) to ReportLab 2.0.
> Though the documentation on the web site still mentions RL 1.19,
> you can download the RL 2 port from
> http://sourceforge.net/project/showfiles.php?group_id=105867
>
> During the port, I was struggling with the RL 2 code mainly because of
> varying encoding issues.
> I think that the RL 2 code is a little bit "unclean" concerning unicode.
> There are various places in the code where either unicode or string
> variables can be used, and are encoded/decoded on-the-fly. It could
> probably be improved, but I don't fully understand it in-depth and I'm
> not aware of possible side-effects.
> Perhaps more "public" documentation (for the internal helper functions,
> too) and assert statements throughout the code could help.
> I'm not 100% sure about it, but from what I remember it seemed that in
> most places, the "encoding/decoding on the fly" routines are using utf8,
> but other encodings are used in some places.
>
> After developing the hyphenation library to a level that worked ok for
> me, I tried running the whole RL test suite against it, and I found some
> issues (in my modified RL code with hyphenation) that belong to the
> "encoding/decoding" kind of issues.
> One problem was in genreference.py, and another one in graphdocpy.py
> ('ascii' encoding is used here!)
> I remember also that I had trouble with rl_codecs.py, for example
> regarding the "shy" character (I am using it for hyphenation because
> that's what it is intended for, and it integrates nicely with the Adobe
> Reader text selection feature). And there was a problem with the bullets
> in the pythonpoint sample.
>
> I'd like to donate the hyphenation library to the ReportLab open-source
> project (and see it fully integrated), but these unicode issues prevent
> me from saying that the library is production quality code (the SiSiSi
> implementation for german hyphenation is so-called spaghetti code
> anyway, but it is working quite well).
> I'd be happy if someone from the RL development team could take a look
> (or two) at the library and perhaps fix these unicode bugs; I don't have
> the time now and in the near future to do it myself.
>
> Henning
