[reportlab-users] Surprising clash of encoders with pikepdf (with patch)

Mon Jan 10 07:16:04 EST 2022

Hi Lennart,

I guess you know that reportlab predates pikepdf by a number of years.

I see no reason why the pdfdoc encoding for pikepdf and for reportlab should not behave the same.

1) I did the following test: convert the pdf 1.7 appendix D table representing PDFDocEncoding into a test file and use 
the values there to see what the problems are. I ignored all the rows marked as 'U'.

in ReportLab the failing conversions are
b'\t' --> None
b'\n' --> None
b'\r' --> None

I can easily fix those to match what would be expected ie b'\t' <--> u'\t' etc

in pikepdf the errors are

b'\x18' .. b'\x1f' --> None. I think that is an error in QPDF (& by inheritance pikepdf). There's no real reason why 
these bytes are not translatable

> u 24 0x18 0030 U+02D8 BREVE
> v 25 0x19 0031 U+02C7 CARON
> ^ 26 0x1a 0032 U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
> · 27 0x1b 0033 U+02D9 DOT ABOVE
> ” 28 0x1c 0034 U+02DD DOUBLE ACUTE ACCENT
> , 29 0x1d 0035 U+02DB OGONEK
> ° 30 0x1e 0036 U+02DA RING ABOVE
> ~ 31 0x1f 0037 U+02DC SMALL TILDE

*NB* I did not convert/check the bytes which the 1.7 ref says should be undefined.
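
The check described in 1) can be sketched roughly as below. This is only an illustration, not the test file I actually used: the EXPECTED dict holds just the eight rows quoted above plus tab, LF and CR, and failing_conversions, demo_decode and demo_encode are made-up names. The stand-in codec is a stdlib charmap that deliberately lacks the three whitespace bytes so that the sketch runs without reportlab or pikepdf installed.

```python
import codecs

# Byte value -> Unicode code point, taken from the rows quoted above,
# plus the three whitespace controls discussed earlier.
EXPECTED = {
    0x09: 0x0009, 0x0A: 0x000A, 0x0D: 0x000D,
    0x18: 0x02D8, 0x19: 0x02C7, 0x1A: 0x02C6, 0x1B: 0x02D9,
    0x1C: 0x02DD, 0x1D: 0x02DB, 0x1E: 0x02DA, 0x1F: 0x02DC,
}

def failing_conversions(decode, encode, expected=EXPECTED):
    """Return the byte values that do not round-trip as the table says."""
    failures = []
    for byte, codepoint in sorted(expected.items()):
        raw, char = bytes([byte]), chr(codepoint)
        try:
            ok = decode(raw) == char and encode(char) == raw
        except ValueError:
            # UnicodeError and its subclasses derive from ValueError,
            # so this traps either style of codec failure.
            ok = False
        if not ok:
            failures.append(byte)
    return failures

# Stand-in codec (an assumption for the demo): no tab, LF or CR.
partial_map = {b: c for b, c in EXPECTED.items() if b >= 0x18}
partial_enc = codecs.make_encoding_map(partial_map)

def demo_decode(raw):
    return codecs.charmap_decode(raw, 'strict', partial_map)[0]

def demo_encode(text):
    return codecs.charmap_encode(text, 'strict', partial_enc)[0]

failures = failing_conversions(demo_decode, demo_encode)
print([hex(b) for b in failures])
```

To run it against a real codec, pass lambda b: b.decode('pdfdoc') and lambda s: s.encode('pdfdoc') after importing whichever package registers the name.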

2) I do not think there's any point in making the reportlab code use an explicit encoding instead of u.encode('pdfdoc') 
or b.decode('pdfdoc'). It might make the reportlab code slightly more robust against others overwriting that encoding, 
but that would not fix downstream software that just wants to use the 'pdfdoc' codec. I think the only way to fix that 
is to have an agreed codec and the PDF ref is surely the definition to follow.
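
To illustrate the point, here is a minimal sketch of two packages registering the same codec name. Everything in it is hypothetical: 'demo_pdfdoc' is a made-up name so the real 'pdfdoc' registration is left alone, and make_charmap_codec, with_tab and without_tab are illustrative names only. It relies on the registry calling search functions in registration order.

```python
import codecs

def make_charmap_codec(name, decoding_map):
    """Build a CodecInfo for a simple charmap codec (hypothetical helper)."""
    encoding_map = codecs.make_encoding_map(decoding_map)

    def encode(text, errors='strict'):
        return codecs.charmap_encode(text, errors, encoding_map)

    def decode(raw, errors='strict'):
        return codecs.charmap_decode(raw, errors, decoding_map)

    return codecs.CodecInfo(encode, decode, name=name)

# Two different ideas of what the same codec name should map.
with_tab = make_charmap_codec('demo_pdfdoc', {0x09: 0x09})
without_tab = make_charmap_codec('demo_pdfdoc', {})

def search_first(name):
    return with_tab if name == 'demo_pdfdoc' else None

def search_second(name):
    return without_tab if name == 'demo_pdfdoc' else None

codecs.register(search_first)    # registered first, so it answers the lookup
codecs.register(search_second)   # not reached for this name

by_name = '\t'.encode('demo_pdfdoc')   # goes through the registry
direct = with_tab.encode('\t')[0]      # explicit codec object, no lookup

try:
    without_tab.encode('\t')
    second_result = 'encoded'
except UnicodeEncodeError:
    second_result = 'undefined'

print(by_name, direct, second_result)
```

Calling the codec object directly protects only the caller that does so; any other code that encodes by name still gets whichever registration wins.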

3) The rl_codecs module doesn't raise any Unicode related errors itself. Those are handled by the python str/bytes 
objects, so we are following what python does, e.g.

>$ python
> Python 3.10.1 (main, Dec  7 2021, 09:01:12) [GCC 11.1.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>>>> from reportlab.pdfbase import pdfmetrics
>>>> '\x0a'.encode('pdfdoc')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/robin/devel/reportlab/reportlab/pdfbase/rl_codecs.py", line 1000, in encode
>     return charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character '\x0a' in position 0: character maps to <undefined>
>>>> b'\x0a'.decode('pdfdoc')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/robin/devel/reportlab/reportlab/pdfbase/rl_codecs.py", line 1003, in decode
>     return charmap_decode(input,errors,decoding_map)
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x0a in position 0: character maps to <undefined>
>>>> 

So failing encodes/decodes raise UnicodeEncodeError/UnicodeDecodeError, which is more specific than just UnicodeError.

However, because of the exception inheritance in python
>      +-- ValueError
>       |    +-- UnicodeError
>       |         +-- UnicodeDecodeError
>       |         +-- UnicodeEncodeError
>       |         +-- UnicodeTranslateError

it follows that UnicodeEncodeError & UnicodeDecodeError will already be trappable as UnicodeError if that's what code wants.
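
A small sketch of what that means for calling code. It uses the stdlib 'ascii' codec purely as a stand-in that is guaranteed to fail on a non-ASCII character; try_encode is a made-up helper name, not anything in reportlab.

```python
def try_encode(text, encoding):
    """Encode text, trapping failures through the base class UnicodeError."""
    try:
        return text.encode(encoding)
    except UnicodeError as exc:
        # UnicodeEncodeError carries the failing position; a plain
        # UnicodeError raised by some other codec may not, hence getattr.
        return (type(exc).__name__, getattr(exc, 'start', None))

# U+02D8 BREVE is not encodable as ASCII, so the handler is exercised.
result = try_encode('a\u02d8', 'ascii')
print(result)

# The inheritance chain that makes the handler above sufficient.
chain_ok = (issubclass(UnicodeEncodeError, UnicodeError)
            and issubclass(UnicodeDecodeError, UnicodeError)
            and issubclass(UnicodeError, ValueError))
print(chain_ok)
```

A handler written this way catches a codec that raises the specific subclasses as well as one that raises plain UnicodeError.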

If we can agree to match the encoding itself it will make no difference which one is used.
--
Robin Becker

On 07/01/2022 17:16, Lennart Regebro via reportlab-users wrote:
> Hi all!
> 
> Both Reportlab and PikePDF register "pdfdoc" encodings, which means that
> which encoding you actually end up using is arbitrary. I guess it depends
> on the import order, but I haven't checked.
> 
> That's all well and good in itself, and shouldn't be a problem, but alas,
> PikePDF's encoding is using the qpdf library, and that library will not
> tell you which character failed to encode. Therefore, it doesn't raise
> UnicodeEncodeError which requires that information, but ValueError. This is
> actually specified in the docs for encode() and decode():
> 
> "encoding errors raise ValueError
> <https://docs.python.org/3/library/exceptions.html#ValueError> (or a more
> codec specific subclass, such as UnicodeEncodeError
> <https://docs.python.org/3/library/exceptions.html#UnicodeEncodeError>)" -
> https://docs.python.org/3/library/codecs.html
> 
> In other places it says "Raise UnicodeError
> <https://docs.python.org/3/library/exceptions.html#UnicodeError> (or a
> subclass); this is the default. Implemented in strict_errors()
> <https://docs.python.org/3/library/codecs.html#codecs.strict_errors>." -
> https://docs.python.org/3/library/codecs.html#error-handlers
> 
> I made a PR to change pikepdf's error from ValueError to UnicodeError which
> has been implemented, but that only fixes half the problem. I believe
> Reportlab should make one or both of these minor changes:
> 
> 1. Catch UnicodeError instead of UnicodeEncodeError when "pdfdoc" encoding
> is used. This should have no drawbacks.
> 
> 2. When it uses the pdfdoc codec it should use it directly, and not via the
> "pdfdoc" name.
> 
> The first fix is trivial so I didn't do that, but I attach a patch for the
> second fix here.
> I hope attachments work for this.
