[reportlab-users] BUGFIX: Re: in paragraph
Robin Becker
robin at reportlab.com
Wed Dec 3 11:35:00 EST 2008
Dirk Holtwick wrote:
> Hi,
>
> to fix the described error please modify the following function in
> "paragraph.py":
>
> -----------------8<---------------[cut here]
> #on UTF8 branch, split and strip must be unicode-safe!
> def split(text, delim=None):
>     if type(text) is str: text = text.decode('utf8')
>     if type(delim) is str: delim = delim.decode('utf8')
>     # This fixes the issue and multiple line breaks on the split page part
>     if delim is None and text == u'\xa0':
>         delim = ' '
>     return [uword.encode('utf8') for uword in text.split(delim)]
> -----------------8<---------------[cut here]
.......
I think this works in some special cases, particularly when the text is just
u'\xa0' on its own. However, it still splits wrongly when u'\xa0' is embedded
in the string in a more normal way,
e.g. even using the above
>>> split(u'a\nb\xa0\tbbbb')
['a', 'b', 'bbbb']
whereas we presumably don't want \xa0 to be regarded as a split point. The
problem lies with Python's unicode split, which in the delim=None case splits
on every whitespace code point. In the C code these seem to be the ones used:
> u'\u0009', # HORIZONTAL TABULATION
> u'\u000A', # LINE FEED
> u'\u000B', # VERTICAL TABULATION
> u'\u000C', # FORM FEED
> u'\u000D', # CARRIAGE RETURN
> u'\u001C', # FILE SEPARATOR
> u'\u001D', # GROUP SEPARATOR
> u'\u001E', # RECORD SEPARATOR
> u'\u001F', # UNIT SEPARATOR
> u'\u0020', # SPACE
> u'\u0085', # NEXT LINE
> u'\u00A0', # NO-BREAK SPACE
> u'\u1680', # OGHAM SPACE MARK
> u'\u2000', # EN QUAD
> u'\u2001', # EM QUAD
> u'\u2002', # EN SPACE
> u'\u2003', # EM SPACE
> u'\u2004', # THREE-PER-EM SPACE
> u'\u2005', # FOUR-PER-EM SPACE
> u'\u2006', # SIX-PER-EM SPACE
> u'\u2007', # FIGURE SPACE
> u'\u2008', # PUNCTUATION SPACE
> u'\u2009', # THIN SPACE
> u'\u200A', # HAIR SPACE
> u'\u200B', # ZERO WIDTH SPACE
> u'\u2028', # LINE SEPARATOR
> u'\u2029', # PARAGRAPH SEPARATOR
> u'\u202F', # NARROW NO-BREAK SPACE
> u'\u205F', # MEDIUM MATHEMATICAL SPACE
> u'\u3000', # IDEOGRAPHIC SPACE
so I believe we can change split to a better scheme using
_WSC=u''.join((
    u'\u0009', # HORIZONTAL TABULATION
    u'\u000A', # LINE FEED
    u'\u000B', # VERTICAL TABULATION
    u'\u000C', # FORM FEED
    u'\u000D', # CARRIAGE RETURN
    u'\u001C', # FILE SEPARATOR
    u'\u001D', # GROUP SEPARATOR
    u'\u001E', # RECORD SEPARATOR
    u'\u001F', # UNIT SEPARATOR
    u'\u0020', # SPACE
    u'\u0085', # NEXT LINE
    #u'\u00A0', # NO-BREAK SPACE
    u'\u1680', # OGHAM SPACE MARK
    u'\u2000', # EN QUAD
    u'\u2001', # EM QUAD
    u'\u2002', # EN SPACE
    u'\u2003', # EM SPACE
    u'\u2004', # THREE-PER-EM SPACE
    u'\u2005', # FOUR-PER-EM SPACE
    u'\u2006', # SIX-PER-EM SPACE
    u'\u2007', # FIGURE SPACE
    u'\u2008', # PUNCTUATION SPACE
    u'\u2009', # THIN SPACE
    u'\u200A', # HAIR SPACE
    u'\u200B', # ZERO WIDTH SPACE
    u'\u2028', # LINE SEPARATOR
    u'\u2029', # PARAGRAPH SEPARATOR
    u'\u202F', # NARROW NO-BREAK SPACE
    u'\u205F', # MEDIUM MATHEMATICAL SPACE
    u'\u3000', # IDEOGRAPHIC SPACE
    ))
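import re
# unicode.split(_WSC) would treat the whole of _WSC as one long separator,
# so split on any run of its characters with a character class instead
_WSC_SPLIT = re.compile(u'[%s]+' % re.escape(_WSC)).split

#on UTF8 branch, split and strip must be unicode-safe!
def split(text, delim=None):
    if type(text) is str: text = text.decode('utf8')
    if type(delim) is str: delim = delim.decode('utf8')
    if delim is None and u'\xa0' in text:
        # drop the empty strings re.split leaves at the ends
        return [uword.encode('utf8') for uword in _WSC_SPLIT(text) if uword]
    return [uword.encode('utf8') for uword in text.split(delim)]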
can you check this against your problem cases?
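
One way to check such cases outside ReportLab is the minimal sketch below. It
assumes Python 3, where str is already unicode, so the utf8 decode/encode steps
are dropped. The names WSC_NO_NBSP and split_keep_nbsp are hypothetical and not
part of ReportLab. It uses a regular-expression character class so that any one
of the listed characters acts as a split point, because str.split with an
explicit separator matches that separator as one whole string rather than as a
set of characters.

```python
import re

# Whitespace characters to split on. NO-BREAK SPACE (\u00a0) is deliberately
# left out. Hypothetical name, mirroring the _WSC list in this message.
WSC_NO_NBSP = (
    '\u0009\u000a\u000b\u000c\u000d\u001c\u001d\u001e\u001f\u0020\u0085'
    '\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009'
    '\u200a\u200b\u2028\u2029\u202f\u205f\u3000'
)

# A character class matches any single character of the set; "+" folds a run
# of adjacent whitespace characters into one split point.
_split_runs = re.compile('[%s]+' % re.escape(WSC_NO_NBSP)).split


def split_keep_nbsp(text, delim=None):
    """Split like str.split, but never break at NO-BREAK SPACE."""
    if delim is None and '\xa0' in text:
        # re.split leaves empty strings for leading/trailing whitespace;
        # drop them to match what str.split() with no separator returns.
        return [word for word in _split_runs(text) if word]
    return text.split(delim)


print(split_keep_nbsp('a\nb\xa0\tbbbb'))
```

With this sketch the earlier example gives ['a', 'b\xa0', 'bbbb'], so the
no-break space stays attached to 'b', whereas the plain
'a\nb\xa0\tbbbb'.split() gives ['a', 'b', 'bbbb'].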
--
Robin Becker