[reportlab-users] Surprising clash of encoders with pikepdf (with patch)

Mon Jan 10 09:23:45 EST 2022

Hi Lennart,

I have qpdf 10.5.0-1 installed and am using pikepdf 4.3.1. When I test the pikepdf 'pdfdoc' codec, it does appear to 
raise the standard error, at least for encoding.

For example, '\u02d8', which pikepdf leaves unmapped (it should map to b'\x18'), gives this:

> (.py310) robin@minikat:~/devel/reportlab/REPOS/reportlab
> $ python -c'import pikepdf;print(ascii("\u02d8".encode("pdfdoc")))'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/home/robin/devel/reportlab/.py310/lib/python3.10/site-packages/pikepdf/codec.py", line 124, in encode
>     return pdfdoc_encode(input, errors)
>   File "/home/robin/devel/reportlab/.py310/lib/python3.10/site-packages/pikepdf/codec.py", line 99, in pdfdoc_encode
>     raise UnicodeEncodeError(
> UnicodeEncodeError: 'pdfdoc' codec can't encode character '\u02d8' in position 0: character cannot be represented in pdfdoc encoding
> (.py310) robin@minikat:~/devel/reportlab/REPOS/reportlab

i.e. it raises UnicodeEncodeError.

On the other hand, decoding seems completely broken as far as the undefined mappings are concerned. The following 
raises no errors:

> python -c'import pikepdf;B=[bytes((i,)).decode("pdfdoc") for i in range(256)]'

In other words, the pikepdf 'pdfdoc' codec never appears to raise an error when decoding.
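
The one-liner above only shows that nothing fails; to see exactly which bytes a codec rejects, something like the 
following rough sketch could be used. The helper name undecodable_bytes is my own invention, and 'ascii' is only a 
stand-in so that it runs without pikepdf; with pikepdf imported, one would presumably pass "pdfdoc" instead.

```python
import codecs


def undecodable_bytes(codec_name):
    """Return every single-byte value that the named codec refuses to decode."""
    decode = codecs.getdecoder(codec_name)
    failures = []
    for value in range(256):
        try:
            decode(bytes((value,)), "strict")
        except UnicodeDecodeError:
            failures.append(value)
    return failures


# 'ascii' is a stand-in codec (assumption: pikepdf may not be installed here).
ascii_failures = undecodable_bytes("ascii")
print(len(ascii_failures), hex(ascii_failures[0]))
```

An empty list from such a check for "pdfdoc" would match what I see, i.e. no byte is ever rejected.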

My copy of the PDF 1.7 reference says that these should be undefined in PDFDocEncoding:

> ^@ 0 0x00 0000 U+0000 (NULL) U
> ^A 1 0x01 0001 U+0001 (START OF HEADING) U
> ^B 2 0x02 0002 U+0002 (START OF TEXT) U
> ^C 3 0x03 0003 U+0003 (END OF TEXT) U
> ^D 4 0x04 0004 U+0004 (END OF TEXT) U
> ^E 5 0x05 0005 U+0005 (END OF TRANSMISSION) U
> ^F 6 0x06 0006 U+0006 (ACKNOWLEDGE) U
> ^G 7 0x07 0007 U+0007 (BELL) U
> ^H 8 0x08 0010 U+0008 (BACKSPACE) U
> ^K 11 0x0b 0013 U+000B (LINE TABULATION) U
> ^L 12 0x0c 0014 U+000C (FORM FEED) U
> ^N 14 0x0e 0016 U+000E (SHIFT OUT) U
> ^O 15 0x0f 0017 U+000F (SHIFT IN) U
> ^P 16 0x10 0020 U+0010 (DATA LINK ESCAPE) U
> ^Q 17 0x11 0021 U+0011 (DEVICE CONTROL ONE) U
> ^R 18 0x12 0022 U+0012 (DEVICE CONTROL TWO) U
> ^S 19 0x13 0023 U+0013 (DEVICE CONTROL THREE) U
> ^T 20 0x14 0024 U+0014 (DEVICE CONTROL FOUR) U
> ^U 21 0x15 0025 U+0015 (NEGATIVE ACKNOWLEDGE) U
> ^V 22 0x16 0026 U+0017 (SYNCRONOUS IDLE) U
> ^W 23 0x17 0027 U+0017 (END OF TRANSMISSION BLOCK) U
>  127 0x7f 0177 Undefined U
> Ÿ 159 0x9f 0237 Undefined U0x9f
> ¬ 173 0xad 0255 Undefined U0xad

Since ^@ .. ^W are given Unicode values they can be mapped reasonably, except that U+0017 appears twice and so cannot 
be assigned uniquely. I believe that is a typo, as SYNCHRONOUS IDLE is actually U+0016.
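
The clash can be checked mechanically. This is only a sketch: QUOTED_ROWS copies four rows (0x14 to 0x17) from the 
table as quoted above, and duplicated_code_points is a name I made up for the illustration.

```python
from collections import defaultdict

# (byte value, code point) pairs copied from the quoted rows for 0x14..0x17.
QUOTED_ROWS = [
    (0x14, 0x0014),
    (0x15, 0x0015),
    (0x16, 0x0017),  # the quoted table lists U+0017 here
    (0x17, 0x0017),
]


def duplicated_code_points(rows):
    """Map each code point claimed by more than one byte to those bytes."""
    by_code_point = defaultdict(list)
    for byte, code_point in rows:
        by_code_point[code_point].append(byte)
    return {cp: claimed for cp, claimed in by_code_point.items() if len(claimed) > 1}


duplicates = duplicated_code_points(QUOTED_ROWS)
print({hex(cp): [hex(b) for b in claimed] for cp, claimed in duplicates.items()})
# prints {'0x17': ['0x16', '0x17']}
```

Running the same check over a full decoding table would show whether any other code point is claimed twice.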

However, the bytes 0x7f, 0x9f & 0xad should be unmapped and raise a decode error. U+007F is DEL, U+009F is 
APPLICATION PROGRAM COMMAND and U+00AD is SOFT HYPHEN, which presumably cannot be of use in pdfdoc.
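
As a sketch of the strict behaviour I have in mind (not pikepdf's or reportlab's actual code), a decoder could 
reject those three bytes before doing any mapping. The names UNDEFINED_BYTES and reject_undefined are hypothetical.

```python
# Bytes marked "Undefined" in the quoted table (assumption: no others).
UNDEFINED_BYTES = frozenset({0x7F, 0x9F, 0xAD})


def reject_undefined(data):
    """Raise UnicodeDecodeError at the first undefined byte, else return data."""
    for position, value in enumerate(data):
        if value in UNDEFINED_BYTES:
            raise UnicodeDecodeError(
                "pdfdoc",
                data,
                position,
                position + 1,
                "byte is undefined in PDFDocEncoding",
            )
    return data


try:
    reject_undefined(b"ab\x7fcd")
except UnicodeDecodeError as exc:
    print(exc.start, exc.reason)
```

A real decoder would call such a check only when errors is "strict" and then apply its mapping table.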

It's never clear what a PDF viewer will actually do when given bytes like these, but if the spec says to ignore them, 
I think that is probably the best thing to do.

If we are to agree on the coding of the 'pdfdoc' codec we must at least agree on the definition. If the coding in 
pikepdf is adapted from QPDF then perhaps someone there can justify their choices.
-- 
Robin Becker