[reportlab-users] speeding up parse_utf8?

Robin Becker reportlab-users@reportlab.com
Tue, 14 Oct 2003 19:53:37 +0100


In article <cQzmjIA2OEj$EwfP@jessikat.fsnet.co.uk>, Robin Becker
<robin@reportlab.com> writes
>In article <20031014171013.GA9044@gintaras>, Marius Gedminas
><mgedmin@centras.lt> writes
>>On Tue, Oct 14, 2003 at 04:50:08PM +0100, Robin Becker wrote:
>>> Do you know if the Python parse_utf8 in pdfmetrics is correct? I looked
>>> at the source code and the built-in utf8_decode seems to handle a lot
>>> more corner cases.
>>
>>There's no parse_utf8 in pdfmetrics.  Did you mean parse_utf8 in
>>ttfonts.py?  It is mostly correct, in the sense that it accepts valid
>>UTF-8 correctly.  It does not reject all cases of invalid UTF-8 (such
>>as overlong sequences, unassigned code points like U+FFFE, or
>>surrogates).  I would trust Python's built-in UTF-8 codec more.
>>
>>Marius Gedminas
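
Right -- for concreteness, I take it these are the kinds of sequences
you mean (illustrative byte strings in C; I haven't checked which of
them parse_utf8 actually lets through):

/* byte sequences a strict UTF-8 decoder should reject */
static const unsigned char overlong_nul[]   = {0xC0, 0x80};       /* overlong U+0000 */
static const unsigned char overlong_slash[] = {0xE0, 0x80, 0xAF}; /* overlong U+002F */
static const unsigned char surrogate[]      = {0xED, 0xA0, 0x80}; /* U+D800, a high surrogate */
static const unsigned char nonchar[]        = {0xEF, 0xBF, 0xBE}; /* U+FFFE */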
>Yes, I'm being particularly stupid today. I rewrote the ttfonts
>parse_utf8 as a C function and it's marginally faster than the built-in
>one. I actually need the UCS int values rather than the unicode string
>itself. Is there an easy way, when I have a PyUnicode object, to get at
>those? I assume that Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
>returns some kind of 32-bit things, but I probably need to handle the
>mapping to 32-bit unsigned myself in case the local machine byte order
>is wrong.
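FWIW, the decoding core of my C version is roughly this shape (a
simplified sketch, not the actual code, and with the strict checks
Marius mentions bolted on):

/* Decode one UTF-8 sequence from s (n bytes available).  Store the
 * code point in *out and return the number of bytes consumed, or 0 if
 * the input is invalid. */
static int utf8_decode_one(const unsigned char *s, unsigned n,
                           unsigned long *out)
{
    unsigned long cp;
    unsigned len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80)      { cp = s[0];        len = 1; }
    else if (s[0] < 0xC2) return 0;  /* lone continuation or overlong lead */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
    else if (s[0] < 0xF5) { cp = s[0] & 0x07; len = 4; }
    else return 0;

    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* bad continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates and out-of-range values */
    if (len == 3 && cp < 0x800) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    *out = cp;
    return (int)len;
}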
I'm still being stupid: Py_UNICODE is 16-bit unsigned (at least on this
build). So I guess I can do my thing pretty easily.
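i.e. something like this should do (a sketch; it assumes a narrow build
-- Py_UNICODE is 32 bits on a --enable-unicode=ucs4 build -- and makes
no attempt to combine surrogate pairs):

#include <Python.h>

/* Copy the code units of a unicode object into a caller-supplied array
 * of 32-bit unsigned values, widening each 16-bit Py_UNICODE.  Returns
 * the number of values written, or -1 on error. */
static int unicode_to_ucs(PyObject *o, unsigned long *buf, int buflen)
{
    Py_UNICODE *u;
    int i, n;

    if (!PyUnicode_Check(o)) return -1;
    u = PyUnicode_AS_UNICODE(o);
    n = PyUnicode_GET_SIZE(o);
    if (n > buflen) return -1;
    for (i = 0; i < n; i++)
        buf[i] = (unsigned long)u[i];  /* zero-extend to 32 bits */
    return n;
}

And byte order shouldn't actually be a problem: the Py_UNICODE values
are read as integers in the host's native order, so swapping only
matters if I write them out as raw bytes.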
-- 
Robin Becker