[reportlab-users] Encoding UTF-8 instead of PDFDoc

Wed Mar 1 21:39:33 EST 2017

Hi Robin,

Yes, using UTF-8 was the wrong idea, since the specification explicitly
requires PDFDocEncoding or UTF-16BE for text strings. (I thought all Unicode
encodings were acceptable.) So my idea now is to encode using UTF-16BE.

I've attached a script that reproduces the problem, along with two simple
PDF files called ascii.pdf and cjk.pdf. Run the script as follows:

$ python test.py

Changing the filename 'cjk.pdf' in the script to 'ascii.pdf' makes the
error go away. The two PDF files are basically the same, except that
ascii.pdf has an optional content group called 'layer 1' and cjk.pdf has a
group whose name is in Japanese. These names are the default layer names in
Adobe Illustrator, so I hit the same problem whenever I edit PDF files made
with Illustrator.

I changed the code block that raises the error as shown below and
reinstalled reportlab; after that, my script no longer raised any errors.

# reportlab/pdfbase/pdfdoc.py
if isPy3:
    def pdfdocEnc(x):
        return x.encode('utf_16_be') if isinstance(x,str) else x

I didn't check the 'else' block for Python 2.x, but the same script also ran
without errors under Python 2.7.12.
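
One detail worth keeping in mind with this change: the PDF specification
requires a UTF-16BE text string to begin with the byte order mark FE FF,
and a bare x.encode('utf_16_be') does not emit it. Below is a minimal
sketch of that rule, not reportlab's actual code. The name
encode_text_string is hypothetical, and printable ASCII is used as a
conservative stand-in for the part of PDFDocEncoding that needs no codec
table (so accented Latin characters also fall through to UTF-16BE here).

```python
import codecs

# Printable ASCII (0x20-0x7E) is encoded identically in PDFDocEncoding.
# Assumption for this sketch: only these characters take the one-byte path.
_PRINTABLE_ASCII = frozenset(chr(c) for c in range(0x20, 0x7F))


def encode_text_string(x):
    """Hypothetical helper: encode a PDF text string, passing bytes through."""
    if not isinstance(x, str):
        return x  # already bytes, same pass-through as pdfdocEnc
    if all(ch in _PRINTABLE_ASCII for ch in x):
        return x.encode("ascii")
    # Everything else: UTF-16BE preceded by the byte order mark FE FF.
    return codecs.BOM_UTF16_BE + x.encode("utf_16_be")
```

With this sketch, 'layer 1' stays a plain one-byte string, while a
Japanese layer name comes out as FE FF followed by its UTF-16BE bytes.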

I'm using the pdfrw library (https://github.com/pmaupin/pdfrw) to read PDF
files. Here is my environment:

- macOS 10.12.3
- Python 3.6.0
- pdfrw 0.2
- reportlab 3.3.32

Thanks!
Koki

On Wed, Mar 1, 2017 at 21:23, Robin Becker <robin at reportlab.com> wrote:

On 01/03/2017 05:05, Koki Nomura wrote:
> Hi,
>
> pdfdocEnc() in pdfdoc.py raises a UnicodeEncodeError as below when I
> process a PDF file with Unicode characters. I'm running my script on
> Python 3.6.0.
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\x00' in
> position 11: character maps to <undefined>
>
> This error disappears when I change the encoding from extpdfdoc to utf-8
> in this block of code.
>
> if isPy3:
>     def pdfdocEnc(x):
>         return x.encode('extpdfdoc') if isinstance(x,str) else x
>
> While I don't fully understand the 'extpdfdoc' encoding, can we change
> this encoding to utf-8, since the PDF specification allows Unicode as
> well as PDFDocEncoding?
>
> Thanks,
> Koki
........
Hi Koki,

Not sure whether this is a good idea. The pdfdocEnc function is supposed to
accept either a bytestring or unicode. The output is 'supposed' to be
acceptable to PDF, and for that we would normally expect to use the pdfdoc
standard encoding. The extpdfdoc encoding just adds CR ('\r') and LF ('\n')
identity mapped.

Can you give an example of where this is going wrong, i.e. what you passed
to a reportlab function to cause the problem?

PDF does allow different encodings in various places, but we usually end up
using either pdfdoc or sometimes UTF16. I don't think PDF allows utf8 in
many places; names are one case, and I believe some software URIs can be
directly encoded as utf8.
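
The choice between the two text string encodings can be sketched from the
reader's side: a text string that starts with the bytes FE FF is UTF-16BE,
and anything else is taken as PDFDocEncoding. This is only an illustration
for Python 3. The name decode_text_string is hypothetical, and the
PDFDocEncoding branch handles only printable ASCII rather than the full
table.

```python
import codecs


def decode_text_string(raw):
    """Hypothetical helper: decode the bytes of a PDF text string."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # A leading FE FF marks UTF-16BE; strip the BOM before decoding.
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf_16_be")
    if all(0x20 <= b <= 0x7E for b in raw):
        # Printable ASCII is encoded identically in PDFDocEncoding.
        return raw.decode("ascii")
    # The rest of PDFDocEncoding needs a real mapping table, which this
    # sketch deliberately leaves out.
    raise ValueError("non-ASCII PDFDocEncoding bytes are out of scope")
```

A string written without the FE FF prefix would be read down the
PDFDocEncoding path, which is why the prefix matters for UTF-16BE output.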
--
Robin Becker
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ascii.pdf
Type: application/pdf
Size: 909 bytes
Desc: not available
URL: <https://pairlist2.pair.net/pipermail/reportlab-users/attachments/20170302/e0cce11c/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cjk.pdf
Type: application/pdf
Size: 916 bytes
Desc: not available
URL: <https://pairlist2.pair.net/pipermail/reportlab-users/attachments/20170302/e0cce11c/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.py
Type: text/x-python-script
Size: 531 bytes
Desc: not available
URL: <https://pairlist2.pair.net/pipermail/reportlab-users/attachments/20170302/e0cce11c/attachment.bin>