[reportlab-users] speeding up parse_utf8?
Marius Gedminas
reportlab-users@reportlab.com
Tue, 14 Oct 2003 00:29:40 +0300
On Mon, Oct 13, 2003 at 01:02:49PM +0100, Robin Becker wrote:
> Andy suggested speeding up ttfonts by using the built in codecs to
> improve parse_utf8
[...]
> but my tests with this code
[...]
> show Marius' code is faster.
>
> C:\Python\reportlab\test>\tmp\ttt.py
> <function parse_utf8 at 0x00843260> 22.7929999828
> <function <lambda> at 0x00911CA0> 26.0670000315
My tests with Python 2.3 show parse_utf8 to be about 5x slower, both for
short and for long strings:
mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
        "from reportlab.pdfbase.ttfonts import parse_utf8" \
        "parse_utf8('abcdefghi'*30)"
1000 loops, best of 3: 1.09e+03 usec per loop
mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
        "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
        "nparse_utf8('abcdefghi'*30)"
1000 loops, best of 3: 214 usec per loop
mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
        "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
        "uparse_utf8('abcdefghi'*30)"
1000 loops, best of 3: 219 usec per loop
mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
        "from reportlab.pdfbase.ttfonts import parse_utf8" \
        "parse_utf8('abcdefghi'*500)"
100 loops, best of 3: 1.77e+04 usec per loop
mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
        "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
        "nparse_utf8('abcdefghi'*500)"
100 loops, best of 3: 3.24e+03 usec per loop
mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
        "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
        "uparse_utf8('abcdefghi'*500)"
100 loops, best of 3: 3.03e+03 usec per loop
> I thought these decoders were supposed to be very fast.
Which Python did you use? What if you used timeit.py from Python 2.3?
(It does work with older Python versions.) Why don't I do that myself? ;)
I get even worse results with Python 2.2 (parse_utf8 is 8x slower).
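For reference, the codecs-based variant is small enough to drop straight
in; here is just a sketch of the idea, not the actual ttfonts.py code
(whatever extra error handling the real parse_utf8 does is not reproduced):

    import codecs

    _utf8_decode = codecs.lookup('utf8')[1]   # the decoder half of the codec

    def parse_utf8(text):
        # Codec decode functions return (unicode_string, bytes_consumed);
        # we want the list of code points of the decoded string.
        return map(ord, _utf8_decode(text)[0])
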
Actually, the only reason for parse_utf8 to exist was Python 1.5
compatibility. I wanted to deal with Unicode strings only. In that
case, loops like

    for code in parse_utf8(s):

become

    for char in s:
        code = ord(char)

or maybe (too lazy to benchmark it now)

    for code in map(ord, s):
and people do not have to do weird recodings like windows-1252 -> UTF-8
in order to use TTFs; they just have to make sure all their strings are
Unicode strings. With Python 2.3 you can do things like
    # -*- coding: iso-8859-1 -*-
    ...
    canvas.drawString(x, y, u"My text in ISO-8859-1")
and it will work. With older Pythons you'd have to do
    canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1"))
but it's still better than current
    canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1")
                                .encode("UTF-8"))
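(Until then, a small helper can at least hide that round-trip; the name
drawUTF8String is made up for illustration, it is not part of ReportLab:)

    def drawUTF8String(canvas, x, y, text, encoding="ISO-8859-1"):
        # Accept either a byte string in `encoding` or a Unicode string,
        # and hand drawString the UTF-8 bytes that TTFs currently expect.
        if not isinstance(text, unicode):
            text = unicode(text, encoding)
        canvas.drawString(x, y, text.encode("UTF-8"))
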
Ideally, Type1 fonts should also support Unicode string objects, so you
wouldn't have to worry about what fonts you use and what encoding they
expect.
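Roughly speaking, the drawing code would only have to do something like
this internally (a hypothetical sketch: the font object and its 'encoding'
attribute here are assumptions, not the real ReportLab API):

    def encodeForFont(font, text):
        # Turn a Unicode string into the byte string the font wants:
        # UTF-8 for TrueType fonts, the font's native encoding for Type1.
        if isinstance(text, unicode):
            return text.encode(getattr(font, 'encoding', 'UTF-8'))
        return text   # 8-bit strings pass through unchanged (legacy path)
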
When I did my initial TTF prototyping, I used Unicode string objects and
had no problems with that (except that the XML parser used by Platypus
did not support Unicode strings and required encoding to UTF-8 and then
decoding back to Unicode in a couple of places in the code). Python 1.5
compatibility was the only reason I redid it with 8-bit UTF-8 strings.
Marius Gedminas
-- 
Where do you think you're going today?