[reportlab-users] String shapes and encodings

Claude Paroz claude at 2xlibre.net
Tue Jun 7 12:27:52 EDT 2022


On 07.06.22 at 10:40, Robin Becker wrote:
> On 06/06/2022 15:28, Claude Paroz wrote:
>> Hi,
>>
>> In the spirit of Python3 strings being always Unicode, I think that 
>> ReportLab String shape should behave the same and accept only Python 
>> strings.
>> I admit this might be slightly backwards incompatible, but it could be 
>> a first step in string handling simplification in ReportLab. The next 
>> step could be a similar patch for platypus Paragraph.
>>
>> Claude
> 
> I don't think the fact that python regards a specific encoding of glyphs 
> to be strings has much relevance here. Most of the external data is in 
> byte form whether encoded as unicode utf8 etc etc.
> 
> When python started to provide a unicode encoding of glyphs reportlab 
> had to support them because people wanted to use them. Today people 
> still want to use bytes.

Of course, at some level, any digital content is a matter of bytes.
That's not what is being discussed here.
The approach Python chose is to push character conversion to the
process boundaries, that is, to input and output time. When you get
some string input, you have to know (or guess) its encoding, and the
idea is to convert it to Unicode immediately. Then, for the whole
lifetime of the string in your program, it is Unicode (the Python 3 str
type). Finally, at some point you have to produce some output, and
that's the time to convert back to bytes, using the encoding the output
consumer expects.
This simplifies things *a lot* compared to the Python 2 world, where you
never knew whether you were manipulating raw bytes or unicode and had to
constantly test content in many parts of your code, as you can see in
ReportLab with the many uses of isStr, isBytes, isUnicode, asNative,
etc. throughout the code base. I'm not criticizing that; it was a
"normal" consequence of the status of strings in Python 2.
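
To make the idea concrete, here is a minimal sketch of the
"decode at input, str inside, encode at output" pattern. The function
names (read_text, make_label, write_text) are hypothetical and only for
illustration; they are not ReportLab API, and the encodings are chosen
arbitrarily.

```python
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Input boundary: decode bytes to str exactly once."""
    return raw.decode(encoding)


def make_label(name: str) -> str:
    """Internal logic: works on str only, with no bytes/unicode type checks."""
    return "Hello, " + name.upper()


def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Output boundary: encode to the encoding the consumer expects."""
    return text.encode(encoding)


# Assumed scenario: the input arrives as Latin-1 bytes and the
# consumer of the output wants UTF-8.
raw_input = "café".encode("latin-1")
label = make_label(read_text(raw_input, "latin-1"))
output = write_text(label, "utf-8")
```

Only the two boundary functions ever see bytes; everything in between
can assume str and needs no type tests.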

> If python said it was abandoning byte strings then that would be a 
> reason to drop all support for them. That would really annoy the gene 
> analysts though :)

This won't happen. Bytes, whether they hold strings or any other kind
of content, have legitimate use cases, of course.

> I don't think I would like to apply this patch anytime soon. If others 
> have an opinion please speak up.

I totally respect your choice as maintainer. It was a (first-step)
proposal to simplify string handling and also to improve performance
through fewer function calls. I'm not angry if you refuse it; we can
agree to disagree :-)

Regards,

Claude
-- 
www.2xlibre.net

