[reportlab-users] BUGFIX: Re: &nbsp; in paragraph

Robin Becker robin at reportlab.com
Wed Dec 3 11:35:00 EST 2008


Dirk Holtwick wrote:

> Hi,
>
> to fix the described error please modify the following function in
> "paragraph.py":
>
> -----------------8<---------------[cut here]
> #on UTF8 branch, split and strip must be unicode-safe!
> def split(text, delim=None):
>     if type(text) is str: text = text.decode('utf8')
>     if type(delim) is str: delim = delim.decode('utf8')
>     # This fixes the &nbsp; issue and multiple line breaks on a split page part
>     if delim is None and text == u'\xa0':
>         delim = ' '
>     return [uword.encode('utf8') for uword in text.split(delim)]
> -----------------8<---------------[cut here]

.......

I think this works in some special cases, particularly when the text consists
of just the &nbsp; form. However, it still splits in the wrong place when
u'\xa0' is embedded in a longer string in a more normal way.

E.g. even using the above:

>>> split(u'a\nb\xa0\tbbbb')
['a', 'b', 'bbbb']

whereas we presumably don't want \xa0 to be regarded as a split point, i.e.
'b' and the \xa0 after it should stay together in one word. The problem lies
with Python's unicode split, which treats the delim=None case as meaning all
whitespace code points. In the C code these seem to be used:


> u'\u0009', # HORIZONTAL TABULATION
> u'\u000A', # LINE FEED
> u'\u000B', # VERTICAL TABULATION
> u'\u000C', # FORM FEED
> u'\u000D', # CARRIAGE RETURN
> u'\u001C', # FILE SEPARATOR
> u'\u001D', # GROUP SEPARATOR
> u'\u001E', # RECORD SEPARATOR
> u'\u001F', # UNIT SEPARATOR
> u'\u0020', # SPACE
> u'\u0085', # NEXT LINE
> u'\u00A0', # NO-BREAK SPACE
> u'\u1680', # OGHAM SPACE MARK
> u'\u2000', # EN QUAD
> u'\u2001', # EM QUAD
> u'\u2002', # EN SPACE
> u'\u2003', # EM SPACE
> u'\u2004', # THREE-PER-EM SPACE
> u'\u2005', # FOUR-PER-EM SPACE
> u'\u2006', # SIX-PER-EM SPACE
> u'\u2007', # FIGURE SPACE
> u'\u2008', # PUNCTUATION SPACE
> u'\u2009', # THIN SPACE
> u'\u200A', # HAIR SPACE
> u'\u200B', # ZERO WIDTH SPACE
> u'\u2028', # LINE SEPARATOR
> u'\u2029', # PARAGRAPH SEPARATOR
> u'\u202F', # NARROW NO-BREAK SPACE
> u'\u205F', # MEDIUM MATHEMATICAL SPACE
> u'\u3000', # IDEOGRAPHIC SPACE


So I believe we can change split to a better scheme, using the same list with
NO-BREAK SPACE left out:


_WSC=u''.join((
u'\u0009', # HORIZONTAL TABULATION
u'\u000A', # LINE FEED
u'\u000B', # VERTICAL TABULATION
u'\u000C', # FORM FEED
u'\u000D', # CARRIAGE RETURN
u'\u001C', # FILE SEPARATOR
u'\u001D', # GROUP SEPARATOR
u'\u001E', # RECORD SEPARATOR
u'\u001F', # UNIT SEPARATOR
u'\u0020', # SPACE
u'\u0085', # NEXT LINE
#u'\u00A0', # NO-BREAK SPACE
u'\u1680', # OGHAM SPACE MARK
u'\u2000', # EN QUAD
u'\u2001', # EM QUAD
u'\u2002', # EN SPACE
u'\u2003', # EM SPACE
u'\u2004', # THREE-PER-EM SPACE
u'\u2005', # FOUR-PER-EM SPACE
u'\u2006', # SIX-PER-EM SPACE
u'\u2007', # FIGURE SPACE
u'\u2008', # PUNCTUATION SPACE
u'\u2009', # THIN SPACE
u'\u200A', # HAIR SPACE
u'\u200B', # ZERO WIDTH SPACE
u'\u2028', # LINE SEPARATOR
u'\u2029', # PARAGRAPH SEPARATOR
u'\u202F', # NARROW NO-BREAK SPACE
u'\u205F', # MEDIUM MATHEMATICAL SPACE
u'\u3000', # IDEOGRAPHIC SPACE
))

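import re
# text.split(_WSC) would treat _WSC as one long separator string, so use a
# character class to split on any run of the characters in _WSC
_wsc_re_split = re.compile(u'[%s]+' % re.escape(_WSC)).split

#on UTF8 branch, split and strip must be unicode-safe!
def split(text, delim=None):
    if type(text) is str: text = text.decode('utf8')
    if type(delim) is str: delim = delim.decode('utf8')
    if delim is None and u'\xa0' in text:
        # drop empty strings from leading/trailing whitespace, as split(None) does
        return [uword.encode('utf8') for uword in _wsc_re_split(text) if uword]
    return [uword.encode('utf8') for uword in text.split(delim)]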
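
For anyone who wants to try the behaviour outside of paragraph.py, here is a
minimal, self-contained sketch. It is written for Python 3, where str is
already unicode, so the utf8 decode/encode steps are left out. One thing it
illustrates: split(delim) with a multi-character delim looks for that whole
string as a single separator, so splitting on any one of the characters needs
something like a regex character class. The names WSC, WSC_SPLIT and
split_keep_nbsp are made up for illustration and are not part of ReportLab.
The whitespace set is copied from the list above and may differ from what a
given interpreter's Unicode database considers whitespace.

```python
import re

# Whitespace code points from the list above, with NO-BREAK SPACE (U+00A0)
# deliberately left out so that it is never treated as a split point.
WSC = (
    "\u0009\u000A\u000B\u000C\u000D\u001C\u001D\u001E\u001F\u0020\u0085"
    "\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009"
    "\u200A\u200B\u2028\u2029\u202F\u205F\u3000"
)

# Character class matching one or more of the characters in WSC.
# (Hypothetical helper name, assumed for this sketch.)
WSC_SPLIT = re.compile("[%s]+" % re.escape(WSC)).split


def split_keep_nbsp(text):
    """Split on runs of WSC characters, keeping U+00A0 inside words."""
    # Filter out the empty strings that re.split yields for leading or
    # trailing whitespace, to mimic what text.split() does.
    return [word for word in WSC_SPLIT(text) if word]


sample = "a\nb\xa0\tbbbb"

# Default split: U+00A0 counts as whitespace, so 'b' loses its nbsp.
print(sample.split())           # ['a', 'b', 'bbbb']

# Passing the whole set as delim looks for WSC as one separator string,
# which never occurs in the text, so nothing is split at all.
print(sample.split(WSC))        # ['a\nb\xa0\tbbbb']

# Character-class split: the nbsp stays attached to 'b'.
print(split_keep_nbsp(sample))  # ['a', 'b\xa0', 'bbbb']
```
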



can you check this against your problem cases?
--
Robin Becker

