[reportlab-users] speeding up parse_utf8?

Robin Becker reportlab-users@reportlab.com
Tue, 14 Oct 2003 19:43:02 +0100


In article <20031014171013.GA9044@gintaras>, Marius Gedminas
<mgedmin@centras.lt> writes
>On Tue, Oct 14, 2003 at 04:50:08PM +0100, Robin Becker wrote:
>> Do you know if the Python parse_utf8 in pdfmetrics is correct? I looked
>> at the source code and see a lot more corner cases handled in the
>> built-in utf8_decode.
>
>There's no parse_utf8 in pdfmetrics.  Did you mean parse_utf8 in
>ttfonts.py?  It is mostly correct, in the sense that it decodes valid
>UTF-8 correctly.  It does not reject every kind of invalid UTF-8 (such as
>overlong sequences, noncharacters like U+FFFE, or surrogates).
>I would trust Python's built-in UTF-8 codec more.
>
>Marius Gedminas
Yes, I'm being particularly stupid today. I rewrote the ttfonts
parse_utf8 as a C function, and it's marginally faster than the built-in
one. What I actually need are the UCS integer values rather than the
unicode string itself. Is there an easy way to get at those when I have a
PyUnicode object? I assume that Py_UNICODE* PyUnicode_AS_UNICODE(PyObject
*o) returns some kind of 32-bit values, but I probably need to handle
the mapping to 32-bit unsigned myself in case the local machine byte
order is wrong.
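
To make the goal concrete, here is a minimal Python-level sketch of what I'm
after. It assumes a current Python 3 interpreter rather than the C API, and
the function names utf8_to_codepoints and codepoints_via_utf32 are invented
for illustration; neither exists in ttfonts.py. The first relies on the
built-in strict UTF-8 codec, which rejects overlong sequences and encoded
surrogates. The second shows one way to pin the byte order explicitly if
raw 32-bit values are wanted.

```python
import struct


def utf8_to_codepoints(data):
    """Decode UTF-8 bytes with the built-in codec; return integer code points.

    Hypothetical helper. In strict mode the codec raises UnicodeDecodeError
    on invalid input such as overlong sequences or encoded surrogates.
    """
    text = data.decode("utf-8")
    # ord() yields plain integers, so byte order does not come into it.
    return [ord(ch) for ch in text]


def codepoints_via_utf32(text):
    """Return the same values by way of an explicit little-endian UTF-32 buffer.

    Hypothetical helper. 'utf-32-le' fixes the byte order and emits no BOM,
    and '<' in the struct format reads the buffer back as little-endian.
    """
    raw = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))
```

Either function returns a list of unsigned integers, one per character, so
the result does not depend on the byte order of the local machine.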
-- 
Robin Becker