[reportlab-users] CJK wrapping

Yoann Roman yroman-reportlab at altalang.com
Thu Jul 22 14:31:14 EDT 2010



> I have a bug report in rst2pdf about CJK word wrapping.
>
> http://code.google.com/p/rst2pdf/issues/detail?id=338
>
> It's not very verbose, but what I understand is that when mixing
> asian characters and english, the user is surprised that the english
> words split arbitrarily and expects they shouldn't.
>
> I have no idea whatsoever about CJK wrapping conventions, much less
> when intermingling non-CJK words in there, so I defer to the people
> here for an answer :-)


CJK wrapping is covered by the Unicode Line Breaking Algorithm:
http://unicode.org/reports/tr14/

Although there are a few CJK wrapping algorithms in ReportLab, I found
that they can't handle more complex cases like the one above (and
their docstrings say as much). Thai isn't handled properly either.
We tend to see every language under the sun, and so I ended up using
PyICU to get UAX #14-compliant line breaking instead. I was already
using a custom paragraph Flowable, so integrating PyICU wasn't hard. I
haven't looked at integrating it into ReportLab itself, but I can't
imagine it would be much more complex.

To get PyICU line-wrapping:

import PyICU  # recent PyICU releases expose this module as "icu"

icu_locale = PyICU.Locale.getDefault()  # or PyICU.Locale('th')
iterator = PyICU.BreakIterator.createLineInstance(icu_locale)

text = 'This is sample text to wrap'
iterator.setText(text)

# Each boundary the iterator yields ends one unbreakable segment.
words = []
last_position = 0
for position in iterator:
    words.append(text[last_position:position])
    last_position = position

Everything other than Thai uses UAX #14 rules, so the locale really
only matters when you're dealing with Thai. In all other cases, using
the default works (the default being en-US for me).
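
Once you have the segments, filling lines is a separate step. The
following is a minimal greedy sketch, not what my Flowable actually
does: fill_lines and its measure parameter are hypothetical names, and
the default measure is just the character count. In ReportLab you
would presumably pass something based on
reportlab.pdfbase.pdfmetrics.stringWidth(text, fontName, fontSize)
instead.

```python
def fill_lines(segments, max_width, measure=len):
    """Greedily pack break segments into lines no wider than max_width.

    measure(text) returns the width of text; len is a stand-in that
    counts characters. A single segment wider than max_width is placed
    on a line of its own rather than being split.
    """
    lines = []
    current = ''
    for segment in segments:
        candidate = current + segment
        # Trailing spaces should not count against the available width.
        if current and measure(candidate.rstrip()) > max_width:
            lines.append(current.rstrip())
            current = segment
        else:
            current = candidate
    if current:
        lines.append(current.rstrip())
    return lines


segments = ['This ', 'is ', 'sample ', 'text ', 'to ', 'wrap']
for line in fill_lines(segments, 10):
    print(line)
```

Because the segments only ever end at allowed break positions, the
packer never has to know anything about scripts or languages.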

Note that ICU recommends reusing iterators. In my library, I keep a
cache keyed by locale so that I only create one when necessary. See:
http://userguide.icu-project.org/boundaryanalysis#TOC-Reuse
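
A cache along those lines might look like the sketch below. This is an
illustration rather than the code from my library: BreakIteratorCache
and icu_line_iterator are made-up names, and the factory is passed in
so the cache itself does not depend on PyICU. It also assumes a single
thread, since an iterator holds the text it was last given.

```python
class BreakIteratorCache:
    """Create at most one break iterator per locale name and reuse it."""

    def __init__(self, factory):
        # factory(locale_name) builds a new iterator for that locale.
        self._factory = factory
        self._iterators = {}

    def get(self, locale_name):
        if locale_name not in self._iterators:
            self._iterators[locale_name] = self._factory(locale_name)
        return self._iterators[locale_name]


def icu_line_iterator(locale_name):
    """Build a PyICU line-break iterator (assumes PyICU is installed)."""
    # Imported lazily so the cache can be used and tested without PyICU.
    # Recent PyICU releases use "icu"; older ones used "PyICU" as above.
    import icu
    return icu.BreakIterator.createLineInstance(icu.Locale(locale_name))


# Typical use, assuming PyICU is available:
#     cache = BreakIteratorCache(icu_line_iterator)
#     iterator = cache.get('th')
#     iterator.setText(text)
```

Calling setText() on a cached iterator simply replaces the previous
text, so the same object can be reused for every paragraph in that
locale.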

Hope that helps,

--
Yoann Roman


