[reportlab-users] BUGFIX: Re:   in paragraph

Dirk Holtwick dirk.holtwick at gmail.com
Thu Dec 4 09:46:05 EST 2008



> you're absolutely right. I keep thinking delim is a set of chars, but
> it's a string. If the above works for you I guess it'll be fine. Perhaps
> we could code it a bit more efficiently by using _WSC_RE.split(text)
> instead of re.split(_WSC_RE, text) or for the hyper speeders


Of course :)


> _WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split
> .......
> return [uword.encode('utf8') for uword in _WSC_RE_split(text)]
>
> In fact I notice that \s doesn't match \xa0, but I am uncertain if that
> is intended or accidental.


It depends on the regex flags; see the Python manual:

-----------------8<---------------[cut here]
\s
When the LOCALE and UNICODE flags are not specified, matches any
whitespace character; this is equivalent to the set [ \t\n\r\f\v]. With
LOCALE, it will match this set plus whatever characters are defined as
space for the current locale. If UNICODE is set, this will match the
characters [ \t\n\r\f\v] plus whatever is classified as space in the
Unicode character properties database.
-----------------8<---------------[cut here]
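
As a rough illustration of the flag dependence described in that excerpt, here is a minimal sketch. It assumes Python 3, where str patterns follow Unicode rules by default and re.ASCII restricts \s to the plain set [ \t\n\r\f\v]; the excerpt itself describes the Python 2 flags, so this is only an approximation of the same idea. The names NBSP, unicode_match and ascii_match are made up for the example.

```python
import re

NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE (name chosen for this example)

# Assumption: Python 3. str patterns use Unicode whitespace rules by default,
# roughly the behaviour the excerpt describes for the UNICODE flag.
unicode_match = re.search(r"\s", NBSP) is not None

# re.ASCII limits \s to [ \t\n\r\f\v], roughly the "no flags" case
# described in the excerpt.
ascii_match = re.search(r"\s", NBSP, re.ASCII) is not None

print(unicode_match, ascii_match)
```

So whether \s treats \xa0 as whitespace depends on how the pattern is compiled, not on the character alone.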

I think having an explicit rule set, as in our code, avoids a lot of
trouble, since in Unicode \xa0 is defined as a space, as you already mentioned:

-----------------8<---------------[cut here]
>>> u"\xa0".isspace()
True
-----------------8<---------------[cut here]
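
To make the "explicit rule set" idea concrete, here is a minimal sketch of the precompiled split from the quoted message. It assumes Python 3 and a made-up delimiter string WSC_EXAMPLE; the real contents of _WSC in ReportLab are not shown in this thread and may differ. The helper name split_words is also hypothetical, and the utf8 encoding step from the quoted code is left out.

```python
import re

# Hypothetical delimiter set for illustration only; the real _WSC may differ.
# It deliberately leaves out "\xa0", so a no-break space never splits a word.
WSC_EXAMPLE = " \t\n\r\f\v"

# Compile once and keep the bound split method, as suggested in the quote.
# re.escape keeps every listed character literal inside the character class.
split_on_wsc = re.compile("[%s]" % re.escape(WSC_EXAMPLE)).split


def split_words(text):
    """Split text on the explicit delimiter set only (hypothetical helper)."""
    return split_on_wsc(text)


words = split_words("a\xa0b c")
print(words)
```

Because the character class is spelled out, the result does not change with regex flags or locale: the no-break space stays inside "a\xa0b" and only the ordinary space separates the words.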

Dirk


