[reportlab-users] utf-8 characters

Marius Gedminas reportlab-users@reportlab.com
Mon, 3 May 2004 12:34:18 +0300

On Mon, May 03, 2004 at 08:34:01AM +0100, Chris Withers wrote:
> David Bourillot wrote:
> >Exception type: exceptions.UnicodeDecodeError
> >Exception message: 'utf8' codec can't decode byte 0xc3 in position 9:
> >unexpected end of data
> >
> >After a little investigation, it seems to me that when the string is
> >split, it's cut between the two bytes of the encoded character 'à'
> That would seem unlikely, but maybe ask on the python list for confirmation.
> Could it be that you have non-UTF-8 data in your UTF-8 string?

I'm pretty sure the problem is in the line wrapping algorithm used by

There have been plans to ditch Python 1.5.2 support and switch to
unicode objects instead of str objects with UTF-8 data everywhere.
When this is done, this problem will disappear, as there's no way to
split a unicode string incorrectly [1][2].

  [1] AFAIU Python does not use UTF-16 surrogate pairs, right?  If you
      want to use characters outside the BMP, you're supposed to compile
      your Python interpreter with 32-bit Unicode support.

  [2] There are also combining characters that might pose problems with
      line wrapping.  And I'm not talking about BiDi or other exotic
      things that Reportlab does not support yet.
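
To illustrate the difference, here is a minimal sketch in present-day
Python 3 (an assumption; the interpreters discussed in this thread are
older, and none of this is Reportlab code). The sample string and the
try_decode helper are made up for the example. It cuts a UTF-8 byte
string in the middle of 'à', then cuts the same text by characters, and
finally shows the combining-character case from [2].

```python
import unicodedata

# Sample text chosen so that 'à' (U+00E0) sits at character index 9.
text = "Bonjour, \u00e0 vous"
data = text.encode("utf-8")          # 'à' becomes the two bytes 0xC3 0xA0


def try_decode(chunk):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


# Cutting the byte string after 10 bytes leaves a dangling 0xC3 at the end,
# so decoding the chunk fails.
byte_split = try_decode(data[:10])

# Cutting the text after 10 characters keeps 'à' whole.
text_split = text[:10]

# Footnote [2]: 'e' followed by COMBINING ACUTE ACCENT is two code points,
# so even a character-based cut can separate the accent from its base.
decomposed = "e\u0301"
lonely_base = decomposed[:1]
composed = unicodedata.normalize("NFC", decomposed)   # single code point U+00E9

if __name__ == "__main__":
    print(repr(byte_split))
    print(repr(text_split))
    print(repr(lonely_base), repr(composed))
```

Normalizing to NFC before wrapping only helps for sequences that have a
precomposed form, so it is a partial workaround rather than a fix.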

Marius Gedminas
Stupidity management for the superuser is a user space issue in Unix
		-- Alan Cox
