[reportlab-users] BUGFIX: Re: &nbsp; in paragraph

Dirk Holtwick dirk.holtwick at gmail.com
Thu Dec 4 09:46:05 EST 2008

> you're absolutely right. I keep thinking delim is a set of chars, but
> it's a string. If the above works for you I guess it'll be fine. Perhaps
> we could code it a bit more efficiently by using _WSC_RE.split(text)
> instead of re.split(_WSC_RE, text) or for the hyper speeders

Of course :)

> _WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split
> .......
> return [uword.encode('utf8') for uword in _WSC_RE_split(text)]
>
> In fact I notice that \s doesn't match \xa0, but I am uncertain if that
> is intended or accidental.

It depends on the regex flags; see the Python manual:

-----------------8<---------------[cut here]
\s
When the LOCALE and UNICODE flags are not specified, matches any
whitespace character; this is equivalent to the set [ \t\n\r\f\v]. With
LOCALE, it will match this set plus whatever characters are defined as
space for the current locale. If UNICODE is set, this will match the
characters [ \t\n\r\f\v] plus whatever is classified as space in the
Unicode character properties database.
-----------------8<---------------[cut here]
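
A minimal sketch of that flag dependence, assuming Python 3, where str
patterns use the Unicode rules by default and re.ASCII restores the
"no flags" behaviour the manual excerpt describes (on Python 2 you would
simply leave the flag out instead of passing re.ASCII):

```python
import re

NBSP = u"\xa0"  # non-breaking space

# ASCII-only rules: \s is limited to [ \t\n\r\f\v], so \xa0 is not matched.
ascii_match = re.match(r"\s", NBSP, re.ASCII)

# Unicode rules: \s follows the Unicode character database, so \xa0 matches.
unicode_match = re.match(r"\s", NBSP, re.UNICODE)

print(ascii_match is None)        # True
print(unicode_match is not None)  # True
```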

I think having an explicit rule set, as in our code, avoids a lot of
trouble, since in Unicode \xa0 is classified as a space, as you already
mentioned:

-----------------8<---------------[cut here]
>>> u"\xa0".isspace()
True
-----------------8<---------------[cut here]
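
To illustrate why the explicit set helps, here is a rough sketch. The
_WSC value below is only a hypothetical stand-in (the real constant in
ReportLab may contain different characters), and split_words is a name
made up for this example. The point is just that \xa0 is left out of the
set, so it never acts as a break point:

```python
import re

# Hypothetical stand-in for the whitespace set; the real _WSC may differ.
# \xa0 is deliberately NOT in this set.
_WSC = u" \t\n\r\f\v"

# Precompile once and keep the bound split method, as suggested in the thread.
_WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split

def split_words(text):
    """Split on the explicit whitespace set only (illustrative helper)."""
    return _WSC_RE_split(text)

words = split_words(u"10\xa0km away")
print(words == [u"10\xa0km", u"away"])  # True: the non-breaking space stays inside the word
```

Under Python 3, a plain u"10\xa0km away".split() would break at the \xa0
as well, which is exactly what the explicit set avoids.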

Dirk
