[reportlab-users] speeding up parse_utf8?

Marius Gedminas reportlab-users@reportlab.com
Tue, 14 Oct 2003 00:29:40 +0300


On Mon, Oct 13, 2003 at 01:02:49PM +0100, Robin Becker wrote:
> Andy suggested speeding up ttfonts by using the built in codecs to improve
> parse_utf8
[...]
> but my tests with this code
[...]
> show Marius' code is faster.
>
> C:\Python\reportlab\test>\tmp\ttt.py
> <function parse_utf8 at 0x00843260> 22.7929999828
> <function <lambda> at 0x00911CA0> 26.0670000315

My tests with Python 2.3 show parse_utf8 to be about 5x slower, both for
short and for long strings:

  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
         "from reportlab.pdfbase.ttfonts import parse_utf8" \
         "parse_utf8('abcdefghi'*30)"
  1000 loops, best of 3: 1.09e+03 usec per loop

  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
         "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
         "nparse_utf8('abcdefghi'*30)"
  1000 loops, best of 3: 214 usec per loop

  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
         "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
         "uparse_utf8('abcdefghi'*30)"
  1000 loops, best of 3: 219 usec per loop


  mg: ~$ python /usr/lib/python2.3/timeit.py -s \
         "from reportlab.pdfbase.ttfonts import parse_utf8" \
         "parse_utf8('abcdefghi'*500)"
  100 loops, best of 3: 1.77e+04 usec per loop

  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
         "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
         "nparse_utf8('abcdefghi'*500)"
  100 loops, best of 3: 3.24e+03 usec per loop

  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
         "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
         "uparse_utf8('abcdefghi'*500)"
  100 loops, best of 3: 3.03e+03 usec per loop
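
For anyone who wants to repeat the measurement from a script rather than
from the shell, here is a rough sketch using the timeit module directly.
It is written for Python 3, where bytes stand in for the 8-bit strings
used above and str stands in for unicode.  The helper best_usec is a name
I made up for illustration, and reportlab's own parse_utf8 is not
reproduced here, so this only compares the two codec-based variants.

```python
import codecs
import timeit

# Look up the UTF-8 decoder once, as the lambda default argument does above.
_decode = codecs.getdecoder("utf-8")


def nparse_utf8(data):
    """Decode UTF-8 bytes with the codec's decoder and return code points."""
    return list(map(ord, _decode(data)[0]))


def uparse_utf8(data):
    """Decode UTF-8 bytes with the str constructor and return code points."""
    return list(map(ord, str(data, "utf-8")))


def best_usec(func, data, number=1000, repeat=3):
    """Hypothetical helper: best-of-`repeat` time per call, in microseconds."""
    timings = timeit.repeat(lambda: func(data), number=number, repeat=repeat)
    return min(timings) / number * 1e6


if __name__ == "__main__":
    for count in (30, 500):
        data = b"abcdefghi" * count
        for func in (nparse_utf8, uparse_utf8):
            print("%s x%d: %.1f usec per call"
                  % (func.__name__, count, best_usec(func, data)))
```

The absolute numbers will of course differ from the Python 2.3 figures
above; only the relative ordering on one machine is meaningful.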

> I thought these decoders were supposed to be very fast.

Which Python did you use?  What if you used timeit.py from Python 2.3
(it does work with older Python versions)?  Why don't I do that? ;)

I get even worse results with Python 2.2 (parse_utf8 is 8x slower).


Actually the only reason for parse_utf8 to exist was Python 1.5
compatibility.  I wanted to deal with Unicode strings only.  In that
case, loops like

  for code in parse_utf8(s):

become

  for char in s:
    code = ord(char)

or maybe (too lazy to benchmark it now)

  for code in map(ord, s):

and people do not have to do weird recodings windows-1252 -> UTF-8 in
order to use TTFs, they just have to make sure all their strings are
Unicode strings.  With Python 2.3 you can do things like

  # -*- coding: iso-8859-1 -*-
  ...
  canvas.drawString(x, y, u"My text in ISO-8859-1")

and it will work.  With older Pythons you'd have to do

  canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1"))

but it's still better than the current

  canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1")
                            .encode("UTF-8"))
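
To make the recoding chain above concrete, here is a small sketch that
leaves reportlab out entirely.  It is written for Python 3 (bytes in place
of 8-bit strings, str in place of unicode), and the sample text is my own
choice.  It shows that taking ord() of each character of the Unicode
string gives the same code points as encoding to UTF-8 and parsing it
back, so the extra round trip adds nothing.

```python
# 8-bit ISO-8859-1 text, as it might arrive from a legacy source.
raw = "Gr\u00fc\u00dfe".encode("iso-8859-1")

# Python 2 spelling: unicode(raw, "ISO-8859-1")
text = raw.decode("iso-8859-1")

# The extra step needed when the API only accepts UTF-8 byte strings.
utf8 = text.encode("utf-8")

# Code points taken directly from the Unicode string...
codes_direct = [ord(char) for char in text]

# ...and code points recovered by decoding the UTF-8 bytes again.
codes_via_utf8 = list(map(ord, utf8.decode("utf-8")))

assert codes_direct == codes_via_utf8
print(codes_direct)
```

Note that the UTF-8 form is longer than the ISO-8859-1 form (7 bytes
against 5 here), since each non-ASCII character takes two bytes.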

Ideally, Type1 fonts should also support Unicode string objects, so you
wouldn't have to worry about what fonts you use and what encoding they
expect.

When I did my initial TTF prototyping, I used Unicode string objects and
had no problems with that (except that the XML parser used by Platypus
did not support unicode strings and required encoding to UTF-8 and then
decoding back to Unicode in a couple of places in the code).  Python 1.5
compatibility was the only reason I redid it with 8-bit UTF-8 strings.

Marius Gedminas
-- 
Where do you think you're going today?
