[reportlab-users] BUGFIX: Re: in paragraph
Robin Becker
robin at reportlab.com
Thu Dec 4 12:54:25 EST 2008
Dirk Holtwick wrote:
>> you're absolutely right. I keep thinking delim is a set of chars, but
>> it's a string. If the above works for you I guess it'll be fine.
>> Perhaps we could code it a bit more efficiently by using
>> _WSC_RE.split(text) instead of re.split(_WSC_RE, text) or for the
>> hyper speeders
>
> Of course :)
>
>> _WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split
>> .......
>> return [uword.encode('utf8') for uword in _WSC_RE_split(text)]
>>
>>
>> In fact I notice that \s doesn't match \xa0, but I am uncertain if
>> that is intended or accidental.
>
........
yes thanks everyone has now told me :)
With your original version I found some slight issues: runs of multiple space
chars result in empty elements. Can you try this version for size? It
basically just adds a + after the charset in the re so that u'a\xa0b\n\n\nc'
splits into 2 elements, not 4.
import re

# Split on runs of these whitespace chars; NO-BREAK SPACE is deliberately left out.
_wsc_re_split = re.compile('[%s]+' % re.escape(''.join((
    u'\u0009', # HORIZONTAL TABULATION
    u'\u000A', # LINE FEED
    u'\u000B', # VERTICAL TABULATION
    u'\u000C', # FORM FEED
    u'\u000D', # CARRIAGE RETURN
    u'\u001C', # FILE SEPARATOR
    u'\u001D', # GROUP SEPARATOR
    u'\u001E', # RECORD SEPARATOR
    u'\u001F', # UNIT SEPARATOR
    u'\u0020', # SPACE
    u'\u0085', # NEXT LINE
    #u'\u00A0', # NO-BREAK SPACE
    u'\u1680', # OGHAM SPACE MARK
    u'\u2000', # EN QUAD
    u'\u2001', # EM QUAD
    u'\u2002', # EN SPACE
    u'\u2003', # EM SPACE
    u'\u2004', # THREE-PER-EM SPACE
    u'\u2005', # FOUR-PER-EM SPACE
    u'\u2006', # SIX-PER-EM SPACE
    u'\u2007', # FIGURE SPACE
    u'\u2008', # PUNCTUATION SPACE
    u'\u2009', # THIN SPACE
    u'\u200A', # HAIR SPACE
    u'\u200B', # ZERO WIDTH SPACE
    u'\u2028', # LINE SEPARATOR
    u'\u2029', # PARAGRAPH SEPARATOR
    u'\u202F', # NARROW NO-BREAK SPACE
    u'\u205F', # MEDIUM MATHEMATICAL SPACE
    u'\u3000', # IDEOGRAPHIC SPACE
    )))).split
def split(text, delim=None):
    if type(text) is str: text = text.decode('utf8')
    if type(delim) is str: delim = delim.decode('utf8')
    if delim is None and u'\xa0' in text:
        return [uword.encode('utf8') for uword in _wsc_re_split(text)]
    return [uword.encode('utf8') for uword in text.split(delim)]
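
For anyone who wants to see the effect of the + in isolation, here is a minimal
sketch. It assumes Python 3 (the code above is Python 2) and uses an abbreviated
whitespace set; the names WSC, split_each and split_runs are made up for the
illustration and are not in the patch.

```python
import re

# Abbreviated, illustrative whitespace set (hypothetical name, not from the patch).
# NO-BREAK SPACE (u'\xa0') is left out on purpose so it never separates words.
WSC = u'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\u2028\u2029'

# Without '+': every single whitespace char is its own delimiter.
split_each = re.compile(u'[%s]' % re.escape(WSC)).split
# With '+': a whole run of whitespace chars counts as one delimiter.
split_runs = re.compile(u'[%s]+' % re.escape(WSC)).split

text = u'a\xa0b\n\n\nc'

print(split_each(text))  # 4 elements, two of them empty strings
print(split_runs(text))  # 2 elements, the no-break space stays inside the first
print(text.split())      # plain split() also breaks at the no-break space
```

The last line shows why split() takes the regex route only when u'\xa0' is in
the text: a plain whitespace split on a unicode string treats the no-break space
as a separator. Note that the regex split can still give an empty first or last
element if the text starts or ends with whitespace, which plain split() does not.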
--
Robin Becker