[reportlab-users] speeding up parse_utf8?

Robin Becker reportlab-users@reportlab.com
Tue, 14 Oct 2003 09:01:38 +0100


In article <20031013212940.GH13774@gintaras>, Marius Gedminas
<mgedmin@centras.lt> writes
>On Mon, Oct 13, 2003 at 01:02:49PM +0100, Robin Becker wrote:
>> Andy suggested speeding up ttfonts by using the built-in codecs to improve parse_utf8
>[...]
>> but my tests with this code
>[...]
>> show Marius' code is faster. 
>> 
>> C:\Python\reportlab\test>\tmp\ttt.py
>> <function parse_utf8 at 0x00843260> 22.7929999828
>> <function <lambda> at 0x00911CA0> 26.0670000315
>
>My tests with Python 2.3 show parse_utf8 to be about 5x slower, both for
>short and for long strings:
>
>  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
>         "from reportlab.pdfbase.ttfonts import parse_utf8" \
>         "parse_utf8('abcdefghi'*30)"
>  1000 loops, best of 3: 1.09e+03 usec per loop
>
>  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
>         "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
>         "nparse_utf8('abcdefghi'*30)"
>  1000 loops, best of 3: 214 usec per loop
>
>  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
>         "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
>         "uparse_utf8('abcdefghi'*30)"
>  1000 loops, best of 3: 219 usec per loop
>
>
>  mg: ~$ python /usr/lib/python2.3/timeit.py -s \
>         "from reportlab.pdfbase.ttfonts import parse_utf8" \
>         "parse_utf8('abcdefghi'*500)"
>  100 loops, best of 3: 1.77e+04 usec per loop
>
>  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
>         "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
>         "nparse_utf8('abcdefghi'*500)"
>  100 loops, best of 3: 3.24e+03 usec per loop
>
>  mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
>         "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
>         "uparse_utf8('abcdefghi'*500)"
>  100 loops, best of 3: 3.03e+03 usec per loop
>
>> I thought these decoders were supposed to be very fast. 
>
>Which Python did you use?  What if you used timeit.py from Python 2.3
>(it does work with older Python versions)?  Why don't I do that? ;)
>
>I get even worse results with Python 2.2 (parse_utf8 is 8x slower).
......
Weird; I tried the code at home with 2.3 and still see this:

C:\>\tmp\ttt.py
<function parse_utf8 at 0x007F6DF0> 7.49100005627
<function <lambda> at 0x007F6070> 10.3350000381

C:\>cat \tmp\ttt.py
from time import time
import codecs
from reportlab.pdfbase.ttfonts import parse_utf8
nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])
assert nparse_utf8('abcdefghi')==parse_utf8('abcdefghi')

for fn in (parse_utf8,nparse_utf8):
    t0 = time()
    for i in xrange(500):
        map(fn,i*'abcdefghi')
    print str(fn), time()-t0


Also

C:\>\python\lib\timeit.py -s "import codecs;nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]:map(ord,decode(x)[0])" map(nparse_utf8,5000*'abcdefghi')
10 loops, best of 3: 4.72e+005 usec per loop

C:\>\python\lib\timeit.py -s "from reportlab.pdfbase.ttfonts import parse_utf8" map(parse_utf8,5000*'abcdefghi')
10 loops, best of 3: 3.58e+005 usec per loop
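For anyone replaying these numbers today, the two decode-based variants being benchmarked in this thread can be written as a self-contained sketch. This is a Python 3 transliteration (the function names are mine), so the input is an explicit UTF-8 byte string rather than a 2.x 8-bit string; reportlab's own parse_utf8 is not reproduced here.

```python
import codecs

# CodecInfo from the codec registry; in Python 3 its decode function
# takes bytes and returns a (str, bytes_consumed) tuple.
_utf8_decode = codecs.lookup('utf8').decode

def nparse_utf8(data):
    # codecs-based variant from the thread: decode, then ord() each character
    return [ord(ch) for ch in _utf8_decode(data)[0]]

def uparse_utf8(data):
    # the unicode(x, 'UTF-8') variant; bytes.decode in Python 3
    return [ord(ch) for ch in data.decode('UTF-8')]

sample = ('abcdefghi' * 30).encode('UTF-8')
assert nparse_utf8(sample) == uparse_utf8(sample)
```

Both return the same list of code points, so the timing difference comes down to codec lookup and call overhead, not output.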


>
>Actually the only reason for parse_utf8 to exist was Python 1.5
>compatibility.  I wanted to deal with Unicode strings only.  In that
>case, for loops like
>
>  for code in parse_utf8(s):
>
>become
>
>  for char in s:
>    code = ord(char)
>
>or maybe (too lazy to benchmark it now)
>
>  for code in map(ord, s):
>
>and people do not have to do weird recodings windows-1252 -> UTF-8 in
>order to use TTFs, they just have to make sure all their strings are
>Unicode strings.  With Python 2.3 you can do things like
>
>  # -*- coding: iso-8859-1 -*-
>  ...
>  canvas.drawString(x, y, u"My text in ISO-8859-1")
>
>and it will work.  With older Pythons you'd have to do
>
>  canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1"))
>
>but it's still better than current
>
>  canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1")
>                            .encode("UTF-8"))
>
>Ideally, Type1 fonts should also support Unicode string objects, so you
>wouldn't have to worry about what fonts you use and what encoding they
>expect.
>
>When I did my initial TTF prototyping, I used Unicode string objects and
>had no problems with that (except that the XML parser used by Platypus
>did not support unicode strings and required encoding to UTF-8 and then
>decoding back to Unicode in a couple of places in the code).  Python 1.5
>compatibility was the only reason I redid it with 8-bit UTF-8 strings.
>
>Marius Gedminas
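
Marius's point, that with real Unicode string objects the parse_utf8 step disappears entirely, can be sketched as follows (Python 3 shown, where every str is already a Unicode string; in 2.3 the literal would need a u prefix):

```python
# With a Unicode string there is nothing to parse: iterating the string
# (or mapping ord over it) yields the code points directly.
text = "caf\u00e9"

# Explicit loop, as in the quoted example
codes_loop = []
for ch in text:
    codes_loop.append(ord(ch))

# map(ord, s) variant
codes_map = list(map(ord, text))

assert codes_loop == codes_map == [99, 97, 102, 233]
```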

-- 
Robin Becker