[reportlab-users] Encoding UTF-8 instead of PDFDoc
robin at reportlab.com
Wed Mar 1 07:23:50 EST 2017
On 01/03/2017 05:05, Koki Nomura wrote:
> pdfdocEnc() in pdfdoc.py raises a UnicodeEncodeError as below when I
> process a PDF file with Unicode characters. I'm running my script on Python
> UnicodeEncodeError: 'charmap' codec can't encode character '\x00' in
> position 11: character maps to <undefined>
> This error disappears when I change the encoding from extpdfdoc to utf-8 in
> this block of code.
> if isPy3:
>     def pdfdocEnc(x):
>         return x.encode('extpdfdoc') if isinstance(x,str) else x
> While I don't fully understand 'extpdfdoc' encoding, can we change this
> encoding to utf-8 as PDF specifications allow to use Unicode as well as
I'm not sure whether this is a good idea. The pdfdocEnc function is supposed to
accept either a bytestring or unicode. The output is 'supposed' to be acceptable
to PDF, and for that we would normally expect to use the pdfdoc standard
encoding. The extpdfdoc encoding just adds CR ('\r') and LF ('\n') identity
mapped.
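As a rough illustration of the behaviour described above (this is not the
reportlab source), here is a minimal sketch. The names pdfdoc_enc_sketch and
try_encode are hypothetical, and 'ascii' is only a stand-in for 'extpdfdoc',
which is available once reportlab has registered its codecs.

```python
def pdfdoc_enc_sketch(x, encoding="ascii"):
    """Encode str to bytes; pass bytes through unchanged.

    'ascii' is an assumed stand-in for reportlab's 'extpdfdoc' codec.
    """
    return x.encode(encoding) if isinstance(x, str) else x


def try_encode(x):
    """Return the encoded bytes, or a short message if encoding fails."""
    try:
        return pdfdoc_enc_sketch(x)
    except UnicodeEncodeError as exc:
        # A character outside the codec's table cannot be encoded.
        return "failed: %s" % exc.reason


if __name__ == "__main__":
    print(try_encode(b"already bytes"))  # bytes are returned untouched
    print(try_encode("plain text"))      # str inside the table is encoded
    print(try_encode("caf\u00e9"))       # str outside the table fails
```

The point is only that any character missing from the codec's table raises
UnicodeEncodeError, whichever single-byte codec is in use.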
Can you give an example of where this is going wrong, i.e. what you passed to a
reportlab function to cause the problem?
PDF does allow different encodings in various places, but usually we end up
using either pdfdoc or sometimes UTF-16. I don't think PDF allows UTF-8 in many
places; names are one case, and I believe that in some software URIs can be
directly encoded as UTF-8.
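To make the pdfdoc versus UTF-16 choice concrete, here is a small sketch of one
common approach for PDF text strings: try a single-byte encoding first, and
fall back to big-endian UTF-16 with a leading byte order mark. The function
name pdf_text_string_bytes is hypothetical, and 'ascii' is a conservative
stand-in for the pdfdoc codec (the two agree on printable ASCII). This is not
necessarily what reportlab does internally.

```python
import codecs


def pdf_text_string_bytes(text, single_byte_encoding="ascii"):
    """Return bytes for a PDF text string (illustrative sketch only).

    Tries the single-byte encoding first; on failure, falls back to
    UTF-16BE prefixed with the byte order mark FE FF.
    """
    try:
        return text.encode(single_byte_encoding)
    except UnicodeEncodeError:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


if __name__ == "__main__":
    print(pdf_text_string_bytes("Hello"))
    print(pdf_text_string_bytes("\u65e5\u672c"))
```

Text that fits the single-byte table stays compact, and everything else is
still representable, without putting raw UTF-8 into the string.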