[reportlab-users] BUGFIX: Re: in paragraph
Robin Becker
robin at reportlab.com
Thu Dec 4 09:31:54 EST 2008
Dirk Holtwick wrote:
>> _WSC=u''.join((
>> u'\u0009', # HORIZONTAL TABULATION
>> u'\u000A', # LINE FEED
>> u'\u000B', # VERTICAL TABULATION
>> u'\u000C', # FORM FEED
>> u'\u000D', # CARRIAGE RETURN
>> u'\u001C', # FILE SEPARATOR
>> u'\u001D', # GROUP SEPARATOR
>> u'\u001E', # RECORD SEPARATOR
>> u'\u001F', # UNIT SEPARATOR
>> u'\u0020', # SPACE
>> u'\u0085', # NEXT LINE
>> #u'\u00A0', # NO-BREAK SPACE
>> u'\u1680', # OGHAM SPACE MARK
>> u'\u2000', # EN QUAD
>> u'\u2001', # EM QUAD
>> u'\u2002', # EN SPACE
>> u'\u2003', # EM SPACE
>> u'\u2004', # THREE-PER-EM SPACE
>> u'\u2005', # FOUR-PER-EM SPACE
>> u'\u2006', # SIX-PER-EM SPACE
>> u'\u2007', # FIGURE SPACE
>> u'\u2008', # PUNCTUATION SPACE
>> u'\u2009', # THIN SPACE
>> u'\u200A', # HAIR SPACE
>> u'\u200B', # ZERO WIDTH SPACE
>> u'\u2028', # LINE SEPARATOR
>> u'\u2029', # PARAGRAPH SEPARATOR
>> u'\u202F', # NARROW NO-BREAK SPACE
>> u'\u205F', # MEDIUM MATHEMATICAL SPACE
>> u'\u3000', # IDEOGRAPHIC SPACE
>> ))
>>
>> #on UTF8 branch, split and strip must be unicode-safe!
>> def split(text, delim=None):
>>     if type(text) is str: text = text.decode('utf8')
>>     if type(delim) is str: delim = delim.decode('utf8')
>>     if delim is None and u'\xa0' in text:
>>         delim = _WSC
>>     return [uword.encode('utf8') for uword in text.split(delim)]
>>
>>
>>
>> can you check this against your problem cases?
>
> I don't think the last line will work like this. I think it should be
> more like this:
>
> -----------------8<---------------[cut here]
> import re
> _WSC_RE = re.compile(u"[%s]" % re.escape(_WSC))
>
> def split(text, delim=None):
>     if type(text) is str: text = text.decode('utf8')
>     if type(delim) is str: delim = delim.decode('utf8')
>     if delim is None and u'\xa0' in text:
>         return [uword.encode('utf8') for uword in re.split(_WSC_RE, text)]
>     return [uword.encode('utf8') for uword in text.split(delim)]
> -----------------8<---------------[cut here]
>
> This one worked fine in my version.
>
> Dirk
You're absolutely right. I keep thinking delim is a set of chars, but it's a
single separator string. If the above works for you I guess it'll be fine.
Perhaps we could code it a bit more efficiently by using _WSC_RE.split(text)
instead of re.split(_WSC_RE, text) or, for the hyper speeders,

_WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split
.......
        return [uword.encode('utf8') for uword in _WSC_RE_split(text)]

In fact I notice that \s doesn't match \xa0, but I am uncertain if that is
intended or accidental.
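
The \s observation is probably down to flags: as far as I know, a Python 2 pattern compiled without re.UNICODE limits \s to the ASCII whitespace set, so \xa0 is not matched. The sketch below is a Python 3 approximation of that behaviour, not the Python 2 code from this thread; in Python 3 the default for str patterns is Unicode-aware, and re.ASCII brings back the narrower set.

```python
import re

nbsp = '\u00a0'  # NO-BREAK SPACE

# Python 3 str patterns are Unicode-aware by default, so \s matches NBSP.
unicode_match = re.search(r'\s', nbsp) is not None

# re.ASCII restricts \s to [ \t\n\r\f\v], so NBSP is no longer matched.
ascii_match = re.search(r'\s', nbsp, re.ASCII) is not None

# In Python 3, str.split() with no delimiter also treats NBSP as whitespace,
# which is exactly the behaviour the custom split() above is meant to avoid.
words = ('a' + nbsp + 'b').split()

print(unicode_match, ascii_match, words)
```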
--
Robin Becker