[reportlab-users] BUGFIX: Re: in paragraph
Dirk Holtwick
dirk.holtwick at gmail.com
Thu Dec 4 09:46:05 EST 2008
> you're absolutely right. I keep thinking delim is a set of chars, but
> it's a string. If the above works for you I guess it'll be fine. Perhaps
> we could code it a bit more efficiently by using _WSC_RE.split(text)
> instead of re.split(_WSC_RE, text) or for the hyper speeders
Of course :)
> _WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split
> .......
> return [uword.encode('utf8') for uword in _WSC_RE_split(text)]
>
>
> In fact I notice that \s doesn't match \xa0, but I am uncertain if that
> is intended or accidental.
It depends on the flags, see the Python manual:
-----------------8<---------------[cut here]
\s
When the LOCALE and UNICODE flags are not specified, matches any
whitespace character; this is equivalent to the set [ \t\n\r\f\v]. With
LOCALE, it will match this set plus whatever characters are defined as
space for the current locale. If UNICODE is set, this will match the
characters [ \t\n\r\f\v] plus whatever is classified as space in the
Unicode character properties database.
-----------------8<---------------[cut here]
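The flag dependence described in that excerpt can be checked with a small sketch. It assumes nothing beyond the standard `re` module; the names `NBSP`, `ascii_hit` and `unicode_hit` are made up for illustration. A byte pattern is used for the "no flags" case so that the same lines behave alike on Python 2.6+ and Python 3:

```python
import re

NBSP = u"\xa0"  # NO-BREAK SPACE

# Byte pattern without LOCALE/UNICODE: \s covers only [ \t\n\r\f\v],
# so the single byte 0xA0 is not treated as whitespace.
ascii_hit = re.search(br"\s", NBSP.encode("latin-1"))

# Unicode pattern with the UNICODE flag: \s follows the Unicode
# character database, where U+00A0 is classified as a space.
unicode_hit = re.search(u"\\s", NBSP, re.UNICODE)

print(ascii_hit is None)
print(unicode_hit is not None)
```

Both lines should print True, which is why relying on a bare \s gives different results depending on how the pattern is compiled.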
I think having an explicit rule set, as in our code, avoids a lot of
trouble, since in Unicode it is defined as a space, as you already mentioned:
-----------------8<---------------[cut here]
>>> u"\xa0".isspace()
True
-----------------8<---------------[cut here]
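A minimal sketch of the precompiled split with an explicit character set, along the lines quoted above. The contents of `_WSC` here are an assumption for illustration (ASCII whitespace only, with \xa0 deliberately left out so that it does not act as a break point); the actual set in the ReportLab source may differ, and `split_words` is a hypothetical wrapper name:

```python
import re

# Assumed stand-in for _WSC: the characters that should separate words.
# The no-break space (\xa0) is intentionally NOT part of this set.
_WSC = u" \t\n\r\f\v"

# Compile once and keep only the bound split method, as suggested in the thread.
_WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split


def split_words(text):
    """Split unicode text on the explicit set and return UTF-8 encoded words."""
    return [uword.encode('utf8') for uword in _WSC_RE_split(text)]


words = split_words(u"a\xa0b c")
print(len(words))
```

With this set, u"a\xa0b c" yields two words and the no-break space stays inside the first one. Because the character class has no `+`, consecutive separators produce empty strings in the result, exactly as in the quoted one-liner.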
Dirk