[reportlab-users] Surprising clash of encoders with pikepdf (with patch)

Lennart Regebro lregebro at shoobx.com
Mon Jan 10 07:45:04 EST 2022

On Mon, Jan 10, 2022 at 1:16 PM Robin Becker <robin at reportlab.com> wrote:

> Hi Lennart,
> I guess you know that reportlab predates pikepdf by a number of years.

Of course.

> I see no reason why the pdfdoc encoding for pikepdf and for reportlab
> should not behave the same.

Well, as mentioned, the reportlab pdfdoc codec raises UnicodeDecodeError and
UnicodeEncodeError, while the pikepdf one raises a plain UnicodeError. When
reportlab then tests whether a string is encodable, it traps only the
specific errors, so the UnicodeError propagates and you get a stack trace.
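
A minimal sketch of that failure mode, using only the standard library.
toy_encode and is_encodable_narrow are hypothetical stand-ins, not the
actual reportlab or pikepdf code; the point is only that a plain
UnicodeError is not caught by "except UnicodeEncodeError".

```python
def toy_encode(text):
    """Hypothetical codec that reports failures as a plain UnicodeError."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        # Re-raise as the generic base class, like the codec described above.
        raise UnicodeError("cannot encode") from None


def is_encodable_narrow(text, encode=toy_encode):
    """Encodability test that traps only the specific subclass."""
    try:
        encode(text)
        return True
    except UnicodeEncodeError:
        return False


# The generic UnicodeError is not a UnicodeEncodeError, so it escapes the
# narrow handler and reaches the caller.
try:
    is_encodable_narrow("\u20ac")
    escaped = False
except UnicodeError:
    escaped = True
```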

I agree that the simplest fix is for reportlab to trap UnicodeError instead
of the more specific ones. This does mean you aren't sure which codec is
used, but if that's acceptable, that's the best solution. I can make a patch
for that as well, if you like. I don't think there are many places to change.
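
Roughly what I have in mind, as a sketch rather than the actual patch.
is_encodable, ascii_encode and generic_encode are made-up names for
illustration. Since UnicodeEncodeError and UnicodeDecodeError are both
subclasses of UnicodeError, trapping the base class works whichever
codec raised the error.

```python
def ascii_encode(text):
    """Raises the specific UnicodeEncodeError on failure."""
    return text.encode("ascii")


def generic_encode(text):
    """Hypothetical codec that raises a plain UnicodeError on failure."""
    if not text.isascii():
        raise UnicodeError("cannot encode")
    return text.encode("ascii")


def is_encodable(text, encode):
    """Trap the base class, so either style of codec is handled."""
    try:
        encode(text)
    except UnicodeError:
        return False
    return True


# Both specific errors derive from UnicodeError.
both_are_subclasses = (
    issubclass(UnicodeEncodeError, UnicodeError)
    and issubclass(UnicodeDecodeError, UnicodeError)
)
```

The trade-off is the one mentioned above: the handler can no longer tell
from the exception type alone which codec was in use.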

That's the only issue I have encountered; I did not look closely at
whether there are any other differences.

> 1) I did the following test: convert the pdf 1.7 appendix D table
> representing PDFDocEncoding into a test file and use
> the values there to see what the problems are. I ignored all the rows
> marked as 'U'.
> in ReportLab the failing conversions are
> b'\t' --> None
> b'\n' --> None
> b'\r' --> None
> I can easily fix those to match what would be expected ie b'\t' <--> u'\t'
> etc


> in pikepdf the errors are
> b'\x18' .. b'\x1f' --> None. I think that is an error in QPDF (& by
> inheritance pikepdf). There's no real reason why
> these bytes are not translatable
> > ˘ 24 0x18 0030 U+02D8 BREVE
> > ˇ 25 0x19 0031 U+02C7 CARON
> > ˙ 27 0x1b 0033 U+02D9 DOT ABOVE
> > ˝ 28 0x1c 0034 U+02DD DOUBLE ACUTE ACCENT
> > ˛ 29 0x1d 0035 U+02DB OGONEK
> > ˚ 30 0x1e 0036 U+02DA RING ABOVE
> > ˜ 31 0x1f 0037 U+02DC SMALL TILDE

I guess that needs to be raised as an issue with QPDF. My C++ is way too
rusty to make a patch, though.
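
For anyone who wants to reproduce the check, the seven rows quoted above
can be turned into a small table-driven test. This is only a sketch:
failing_bytes, toy_decode and table_decode are hypothetical helpers, and
the expected values are just the code points listed in the quoted rows.

```python
# Expected mappings, taken from the rows quoted above (byte -> character).
EXPECTED = {
    0x18: "\u02d8",  # BREVE
    0x19: "\u02c7",  # CARON
    0x1B: "\u02d9",  # DOT ABOVE
    0x1C: "\u02dd",  # DOUBLE ACUTE ACCENT
    0x1D: "\u02db",  # OGONEK
    0x1E: "\u02da",  # RING ABOVE
    0x1F: "\u02dc",  # SMALL TILDE
}


def failing_bytes(decode):
    """Return the byte values that decode() rejects or maps differently."""
    failures = []
    for code, expected in sorted(EXPECTED.items()):
        try:
            got = decode(bytes([code]))
        except UnicodeError:
            got = None
        if got != expected:
            failures.append(code)
    return failures


def toy_decode(data):
    """Hypothetical decoder that refuses 0x18..0x1f, as reported above."""
    if any(0x18 <= b <= 0x1F for b in data):
        raise UnicodeError("byte not translatable")
    return data.decode("latin-1")


def table_decode(data):
    """Hypothetical decoder that follows the quoted table."""
    return "".join(EXPECTED[b] for b in data)
```

With the real libraries one would pass something like
lambda b: b.decode("pdfdoc") instead of the toy decoders, assuming the
codec is registered under that name after the library is imported.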