[reportlab-users] Surprising clash of encoders with pikepdf (with patch)

Mon Jan 10 11:16:46 EST 2022

Well, first I should partly apologize: it seems pikepdf has made further
changes since I last checked, and they do indeed raise UnicodeEncodeError
now. When I first encountered this they raised ValueError, and I made a
patch to move that to UnicodeError; I didn't notice that they had improved
it further.

On the other hand, you have now uncovered incompatibilities between the
codecs, so I'm still glad I raised the issue, even if it was for the wrong
reason. :-D

Your interpretation of the spec seems reasonable to me. It seems clear that
the byte 0x18 should be decoded as U+02D8 BREVE, for example, and not as
U+0018, which is what you get now.
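
To make that concrete, here is a minimal sketch of the mapping I have in
mind for the 0x18-0x1F range. The table is my reading of the PDF 1.7
reference, the names PDFDOC_ACCENTS and decode_accent_range are made up for
illustration, and passing every other byte through as Latin-1 is a
simplification, not what either codec does:

```python
import unicodedata

# Hypothetical partial table: PDFDocEncoding bytes 0x18-0x1F and the
# spacing accents the PDF 1.7 reference appears to assign to them.
PDFDOC_ACCENTS = {
    0x18: "\u02d8",  # BREVE
    0x19: "\u02c7",  # CARON
    0x1A: "\u02c6",  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x1B: "\u02d9",  # DOT ABOVE
    0x1C: "\u02dd",  # DOUBLE ACUTE ACCENT
    0x1D: "\u02db",  # OGONEK
    0x1E: "\u02da",  # RING ABOVE
    0x1F: "\u02dc",  # SMALL TILDE
}


def decode_accent_range(data: bytes) -> str:
    """Remap 0x18-0x1F; all other bytes pass through as Latin-1 (assumption)."""
    return "".join(PDFDOC_ACCENTS.get(b, chr(b)) for b in data)


print(unicodedata.name(decode_accent_range(b"\x18")))  # BREVE
```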

Do you want me to start a discussion on QPDF's github about that?

On Mon, Jan 10, 2022 at 3:23 PM Robin Becker <robin at reportlab.com> wrote:

> Hi Lennart,
>
> I have qpdf 10.5.0-1 installed and am using pikepdf 4.3.1; when I do tests
> on the pikepdf 'pdfdoc' codec I do see that
> the codec appears to raise the standard error at least for encoding.
>
> So for example the unmapped '\u02d8' (should map to b'\x18') gives this
>
>
> > (.py310) robin at minikat:~/devel/reportlab/REPOS/reportlab
> > $ python -c'import pikepdf;print(ascii("\u02d8".encode("pdfdoc")))'
> > Traceback (most recent call last):
> >   File "<string>", line 1, in <module>
> >   File "/home/robin/devel/reportlab/.py310/lib/python3.10/site-packages/pikepdf/codec.py", line 124, in encode
> >     return pdfdoc_encode(input, errors)
> >   File "/home/robin/devel/reportlab/.py310/lib/python3.10/site-packages/pikepdf/codec.py", line 99, in pdfdoc_encode
> >     raise UnicodeEncodeError(
> > UnicodeEncodeError: 'pdfdoc' codec can't encode character '\u02d8' in position 0: character cannot be represented in pdfdoc encoding
> > (.py310) robin at minikat:~/devel/reportlab/REPOS/reportlab
>
> i.e. it raises UnicodeEncodeError.
>
> On the other hand, the decoding seems completely broken as far as the
> undefined mappings are concerned. The following raises no errors:
>
> > python -c'import pikepdf;B=[bytes((i,)).decode("pdfdoc") for i in range(256)]'
>
> In other words, the pikepdf 'pdfdoc' codec appears never to raise any
> error when decoding.
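
The same probe can be written as a small reusable function, which makes it
easier to compare codecs side by side. This is only a sketch: the name
undecodable_bytes is made up, and it is run here against a stdlib codec so
that it works without pikepdf installed.

```python
def undecodable_bytes(encoding: str) -> list[int]:
    """Return the single-byte values that the named codec refuses to decode."""
    rejected = []
    for i in range(256):
        try:
            bytes((i,)).decode(encoding)
        except UnicodeDecodeError:
            rejected.append(i)
    return rejected


# Stdlib codec used purely as a stand-in for 'pdfdoc'.
print([hex(b) for b in undecodable_bytes("cp1252")])
```

After `import pikepdf`, calling undecodable_bytes("pdfdoc") should return
an empty list if the behaviour described above still holds.
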
>
> My version of the PDF 1.7 reference says that these should be undefined in
> PDFDocEncoding:
>
>
> > ^@ 0 0x00 0000 U+0000 (NULL) U
> > ^A 1 0x01 0001 U+0001 (START OF HEADING) U
> > ^B 2 0x02 0002 U+0002 (START OF TEXT) U
> > ^C 3 0x03 0003 U+0003 (END OF TEXT) U
> > ^D 4 0x04 0004 U+0004 (END OF TEXT) U
> > ^E 5 0x05 0005 U+0005 (END OF TRANSMISSION) U
> > ^F 6 0x06 0006 U+0006 (ACKNOWLEDGE) U
> > ^G 7 0x07 0007 U+0007 (BELL) U
> > ^H 8 0x08 0010 U+0008 (BACKSPACE) U
> > ^K 11 0x0b 0013 U+000B (LINE TABULATION) U
> > ^L 12 0x0c 0014 U+000C (FORM FEED) U
> > ^N 14 0x0e 0016 U+000E (SHIFT OUT) U
> > ^O 15 0x0f 0017 U+000F (SHIFT IN) U
> > ^P 16 0x10 0020 U+0010 (DATA LINK ESCAPE) U
> > ^Q 17 0x11 0021 U+0011 (DEVICE CONTROL ONE) U
> > ^R 18 0x12 0022 U+0012 (DEVICE CONTROL TWO) U
> > ^S 19 0x13 0023 U+0013 (DEVICE CONTROL THREE) U
> > ^T 20 0x14 0024 U+0014 (DEVICE CONTROL FOUR) U
> > ^U 21 0x15 0025 U+0015 (NEGATIVE ACKNOWLEDGE) U
> > ^V 22 0x16 0026 U+0017 (SYNCRONOUS IDLE) U
> > ^W 23 0x17 0027 U+0017 (END OF TRANSMISSION BLOCK) U
> >  127 0x7f 0177 Undefined U
> > Ÿ 159 0x9f 0237 Undefined U0x9f
> > ¬ 173 0xad 0255 Undefined U0xad
>
>
> That said, since ^@ .. ^W are given Unicode values they can be mapped
> reasonably, except that U+0017 appears twice and so cannot be mapped
> uniquely. I believe that is a typo, as SYNCHRONOUS IDLE is actually
> U+0016.
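
The U+0016 claim can be checked against Python's own Unicode database. A
small sketch, assuming Python 3.3 or later, where unicodedata.lookup also
accepts the formal name aliases that the control characters carry:

```python
import unicodedata

# Control characters have no primary Unicode name, only aliases;
# lookup() resolves those aliases to the code point.
codepoints = {
    name: ord(unicodedata.lookup(name))
    for name in ("SYNCHRONOUS IDLE", "END OF TRANSMISSION BLOCK")
}

for name, cp in codepoints.items():
    print(f"{name}: U+{cp:04X}")
```
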
>
> However, the bytes 0x7f, 0x9f & 0xad should be unmapped and raise a decode
> error. It seems that U+007F is DEL, U+009F is APPLICATION PROGRAM COMMAND
> and U+00AD is SOFT HYPHEN, none of which is presumably of any use in
> pdfdoc.
>
> It's never clear what a pdf viewer will actually do when given bytes like
> these, but if the spec says to ignore them I
> think it's probably the best thing to do.
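
For discussion, the two behaviours in play (raise on an undefined byte, or
drop it) could be sketched as below. The names UNDEFINED_PDFDOC_BYTES and
filter_undefined are hypothetical, and the set of undefined bytes is simply
the three listed above, not something taken from pikepdf or QPDF:

```python
# Assumption: only these three bytes are treated as undefined.
UNDEFINED_PDFDOC_BYTES = frozenset({0x7F, 0x9F, 0xAD})


def filter_undefined(data: bytes, errors: str = "strict") -> bytes:
    """Raise on undefined bytes ('strict') or silently drop them ('ignore')."""
    if errors not in ("strict", "ignore"):
        raise ValueError(f"unsupported error handler: {errors!r}")
    kept = bytearray()
    for pos, byte in enumerate(data):
        if byte not in UNDEFINED_PDFDOC_BYTES:
            kept.append(byte)
        elif errors == "strict":
            # Same exception type the standard codecs use for bad input.
            raise UnicodeDecodeError(
                "pdfdoc", data, pos, pos + 1, "byte is undefined in PDFDocEncoding"
            )
    return bytes(kept)
```

A real decoder would run this check as part of the byte-to-character
mapping rather than as a separate pass; it is split out here only to keep
the error handling visible.
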
>
> If we are to agree on the coding of the 'pdfdoc' codec we must at least
> agree on the definition. If the coding in
> pikepdf is adapted from QPDF then perhaps someone there can justify their
> choices.
> --
> Robin Becker