[reportlab-users] speeding up parse_utf8?
Robin Becker
reportlab-users@reportlab.com
Tue, 14 Oct 2003 09:01:38 +0100
In article <20031013212940.GH13774@gintaras>, Marius Gedminas
<mgedmin@centras.lt> writes
>On Mon, Oct 13, 2003 at 01:02:49PM +0100, Robin Becker wrote:
>> Andy suggested speeding up ttfonts by using the built in codecs to improve
>parse_utf8
>[...]
>> but my tests with this code
>[...]
>> show Marius' code is faster.
>>
>> C:\Python\reportlab\test>\tmp\ttt.py
>> <function parse_utf8 at 0x00843260> 22.7929999828
>> <function <lambda> at 0x00911CA0> 26.0670000315
>
>My tests with Python 2.3 show parse_utf8 to be about 5x slower, both for
>short and for long strings
>
> mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
> "from reportlab.pdfbase.ttfonts import parse_utf8" \
> "parse_utf8('abcdefghi'*30)"
> 1000 loops, best of 3: 1.09e+03 usec per loop
>
> mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
> "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
> "nparse_utf8('abcdefghi'*30)"
> 1000 loops, best of 3: 214 usec per loop
>
> mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
> "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
> "uparse_utf8('abcdefghi'*30)"
> 1000 loops, best of 3: 219 usec per loop
>
>
> mg: ~$ python /usr/lib/python2.3/timeit.py -s \
> "from reportlab.pdfbase.ttfonts import parse_utf8" \
> "parse_utf8('abcdefghi'*500)"
> 100 loops, best of 3: 1.77e+04 usec per loop
>
> mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
> "import codecs; nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]: map(ord,decode(x)[0])" \
> "nparse_utf8('abcdefghi'*500)"
> 100 loops, best of 3: 3.24e+03 usec per loop
>
> mg: ~$ python2.3 /usr/lib/python2.3/timeit.py -s \
> "uparse_utf8 = lambda x: map(ord, unicode(x, 'UTF-8'))" \
> "uparse_utf8('abcdefghi'*500)"
> 100 loops, best of 3: 3.03e+03 usec per loop
>
>> I thought these decoders were supposed to be very fast.
>
>Which Python did you use? What about using timeit.py from Python 2.3
>(it works with older Python versions too)? Why don't I do that myself? ;)
>
>I get even worse results with Python 2.2 (parse_utf8 is 8x slower).
......
Weird: I tried the code at home with 2.3 and still see this:
C:\>\tmp\ttt.py
<function parse_utf8 at 0x007F6DF0> 7.49100005627
<function <lambda> at 0x007F6070> 10.3350000381
C:\>cat \tmp\ttt.py
from time import time
import codecs
from reportlab.pdfbase.ttfonts import parse_utf8
nparse_utf8 = lambda x, decode=codecs.lookup('utf8')[1]: map(ord, decode(x)[0])
assert nparse_utf8('abcdefghi')==parse_utf8('abcdefghi')
for fn in (parse_utf8,nparse_utf8):
    t0 = time()
    for i in xrange(500):
        map(fn,i*'abcdefghi')
    print str(fn), time()-t0
Also
C:\>\python\lib\timeit.py -s "import codecs;nparse_utf8=lambda x, decode=codecs.lookup('utf8')[1]:map(ord,decode(x)[0])" map(nparse_utf8,5000*'abcdefghi')
10 loops, best of 3: 4.72e+005 usec per loop
C:\>\python\lib\timeit.py -s "from reportlab.pdfbase.ttfonts import parse_utf8" map(parse_utf8,5000*'abcdefghi')
10 loops, best of 3: 3.58e+005 usec per loop
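For anyone replaying this on a current interpreter, here is a minimal sketch of the two decoder variants under discussion, in Python 3 syntax (an assumption on my part: in Python 3 every str is already Unicode, so the input has to be UTF-8 encoded bytes rather than an 8-bit str):

```python
import codecs

# codecs-based variant: look up the UTF-8 codec once and reuse its
# bound decode function (the modern spelling of codecs.lookup('utf8')[1])
_decode = codecs.lookup('utf-8').decode

def nparse_utf8(data):
    # _decode returns a (unicode_string, bytes_consumed) pair
    return [ord(ch) for ch in _decode(data)[0]]

# unicode()-based variant: in Python 3 this is simply bytes.decode
def uparse_utf8(data):
    return [ord(ch) for ch in data.decode('utf-8')]

sample = ('abcdefghi' * 30).encode('utf-8')
assert nparse_utf8(sample) == uparse_utf8(sample)
```

Both return the same list of code points; the relative speed can be checked with timeit exactly as in the runs above.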
>
>Actually the only reason for parse_utf8 to exist was Python 1.5
>compatibility. I wanted to deal with Unicode strings only. In that
>case, loops like
>
> for code in parse_utf8(s):
>
>become
>
> for char in s:
> code = ord(char)
>
>or maybe (too lazy to benchmark it now)
>
> for code in map(ord, s):
>
>and people do not have to do weird recodings (windows-1252 -> UTF-8) in
>order to use TTFs, they just have to make sure all their strings are
>Unicode strings. With Python 2.3 you can do things like
>
> # -*- coding: iso-8859-1 -*-
> ...
> canvas.drawString(x, y, u"My text in ISO-8859-1")
>
>and it will work. With older Pythons you'd have to do
>
> canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1"))
>
>but it's still better than current
>
> canvas.drawString(x, y, unicode("My text in ISO-8859-1", "ISO-8859-1")
> .encode("UTF-8"))
>
>Ideally, Type1 fonts should also support Unicode string objects, so you
>wouldn't have to worry about what fonts you use and what encoding they
>expect.
>
>When I did my initial TTF prototyping, I used Unicode string objects and
>had no problems with that (except that the XML parser used by Platypus
>did not support unicode strings and required encoding to UTF-8 and then
>decoding back to Unicode in a couple of places in the code). Python 1.5
>compatibility was the only reason I redid it with 8-bit UTF-8 strings.
>
>Marius Gedminas
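The loop rewrite Marius describes is worth spelling out; a minimal sketch in modern Python syntax (an assumption here: iterating a str yields characters directly, so no parse step is needed at all):

```python
s = "caf\u00e9"

# the map-based form: one code point per character
codes_via_map = [ord(ch) for ch in s]

# the equivalent explicit loop
codes_via_loop = []
for ch in s:
    codes_via_loop.append(ord(ch))

assert codes_via_map == codes_via_loop == [99, 97, 102, 233]
```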
--
Robin Becker